I have been using XPath with Scrapy to extract text from HTML tags online, but when I do, I get extra characters attached. An example is trying to extract a number, like “204”, from a <td> tag and getting [u'204']. In some cases it's much worse. For instance, trying to extract “1 – MathOverflow” and instead getting [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']. Is there a way to prevent this, or to trim the strings so that the extra characters aren't part of the string? (I am using items to store the data.) It looks like it has something to do with formatting, so how do I get XPath to not pick up that stuff?
What does the line of code look like that returns [u'204']? It looks like what is being returned is a Python list containing a unicode string with the value you want. Nothing wrong there; just subscript the list to get the string. As for the carriage returns, linefeeds and tabs, as Wai Yip Tung just answered, strip will take them out. Probably you want to take the first element of the list and call strip() on it.
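A minimal sketch of the single-match case. The list literals stand in for whatever your selector call returned, and the variable names are assumptions for illustration, not taken from your code:

```python
# Stand-ins for the lists shown in the question (assumed selector output).
number_result = [u'204']
title_result = [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']

# Subscript to pull the string out of the list.
number = number_result[0]

# Subscript, then strip leading and trailing whitespace
# (carriage returns, linefeeds, tabs, spaces).
title = title_result[0].strip()

print(number)
print(title)
```

Note that strip() only removes whitespace at the two ends of the string; the en dash (\u2013) in the middle is part of the page text and stays. Subscripting with [0] raises IndexError if the XPath matched nothing, so check that the list is non-empty when a match is not guaranteed.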
Or, if you are expecting several matches, strip each string in the list.
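For several matches, a list comprehension is one way to do it. Again, the input list is a hard-coded stand-in for the selector output, and the names are hypothetical:

```python
# Stand-in for a selector result with several matches (assumed data).
results = [
    u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ',
    u'\r\n\t\t 204 \r\n',
]

# Strip the whitespace off both ends of every matched string.
cleaned = [text.strip() for text in results]

print(cleaned)
```

The result is still a list, so it can be stored in an item field as-is or joined into a single string, depending on what the item is meant to hold.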