I have been using XPath with scrapy to extract text from html tags online,

Asked: May 15, 2026

I have been using XPath with scrapy to extract text from HTML tags online, but when I do, I get extra characters attached. An example is trying to extract a number, like “204”, from a <td> tag and getting [u'204']. In some cases it's much worse. For instance, trying to extract “1 – MathOverflow” and instead getting [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']. Is there a way to prevent this, or to trim the strings so that the extra characters aren't part of the string? (I am using items to store the data.) It looks like it has something to do with formatting, so how do I get XPath to not pick up that stuff?

1 Answer

Editorial Team · Answer 1 · 2026-05-15T06:16:56+00:00

What does the line of code look like that returns [u'204']? It looks like what is being returned is a Python list containing a unicode string with the value you want. Nothing wrong there; just subscript the list to get the string. As for the carriage returns, linefeeds and tabs, as Wai Yip Tung just answered, strip will take them out.

Probably:

my_answer = item1['Title'][0].strip()

Or, if you are expecting several matches:

for ans_i in item1['Title']:
    do_something_with( ans_i.strip() )
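To make the two cases above concrete, here is a minimal sketch in plain Python 3 (where the u'' prefix from the question is no longer needed). The helper name clean_values and the sample lists are hypothetical, standing in for whatever the XPath extraction returns; this is one way to tidy the values, not something scrapy requires.

```python
def clean_values(values):
    """Strip leading/trailing whitespace from each extracted string.

    Strings that are empty after stripping are dropped.
    """
    stripped = (v.strip() for v in values)
    return [v for v in stripped if v]


# Sample data copied from the question: extraction results are lists of strings.
raw_count = ['204']
raw_title = ['\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']

counts = clean_values(raw_count)
titles = clean_values(raw_title)

# Subscript to get a single value, guarding against an empty result.
first_title = titles[0] if titles else None
first_count = counts[0] if counts else None
```

Note that strip() only removes whitespace at the two ends of the string; the \u2013 in the title is a real en dash from the page and is left in place. If whitespace inside the text also needs collapsing, XPath's normalize-space() function is another option worth looking at.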
