Extra characters extracted with XPath and Python (HTML)

Posted by Nacari on Stack Overflow
Published on 2010-05-25T22:47:14Z Indexed on 2010/05/25 22:51 UTC

I have been using XPath with Scrapy to extract text from HTML tags, but the results come back with extra characters attached. For example, extracting a number such as "204" from a <td> tag gives [u'204']. In some cases it's much worse: trying to extract "1 - Mathoverflow" gives [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']. Is there a way to prevent this, or to trim the strings so that the extra characters aren't part of them? (I am using items to store the data.) It looks like it has something to do with formatting, so how do I get XPath to not pick up the extra formatting?
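
For reference, here is a minimal sketch of the trimming being asked about, in plain Python with no Scrapy dependency. It assumes the extraction step returns a list of unicode strings, as the outputs above suggest: the [u'...'] wrapper is just how Python prints such a list, while the \r\n\t runs are whitespace from the page source. The helper name clean_first is hypothetical and not part of Scrapy.

```python
# Values copied from the question. In a real spider they would come from
# the selector's extraction step, which is assumed to return a list.
raw_number = [u'204']
raw_title = [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']


def clean_first(extracted, default=u''):
    """Hypothetical helper: return the first extracted string with
    surrounding whitespace removed and inner whitespace runs collapsed."""
    if not extracted:
        # Nothing matched the XPath expression.
        return default
    # split() with no arguments splits on any run of whitespace and drops
    # leading and trailing whitespace; joining with a single space
    # therefore normalizes the string.
    return u' '.join(extracted[0].split())


number = clean_first(raw_number)
title = clean_first(raw_title)

print(number)
print(title)
```

The cleaned values can then be assigned to item fields in place of the raw lists. A similar effect may be achievable inside the query itself with the XPath 1.0 function normalize-space(), though whether that fits depends on the expression being used.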

© Stack Overflow or respective owner