I am using HtmlXPathSelector(response) object in Scrapy, I need to get two kinds of

Editorial Team

Asked: June 13, 2026

I am using an HtmlXPathSelector(response) object in Scrapy, and I need to extract text that comes in two formats:

The first text has the following format:

<p> Text, text, text, text, text, text, text, text, text </p>
<p>
<p> Text, text, text, text, text. </p>

The second text has the following format:

Text, text, text, text, text, text
<br>
<br>
Text, text, text..
<br>
<br>

When I use x.select('//div[@id="texto"]/text()').extract() I get the second format but not the first; for the first I get something like this:

'content': [u'\r\n          ',
                 u'\r\n',
                 ...
                 u'\r\n']

When I use x.select('//div[@id="texto"]/p/text()').extract() I get the first format but not the second.

How can I write a single rule that gets both formats?

Update:

I found a solution with the following code, but it feels like a dirty one:

content = x.select('//div[@id="nota_texto"]/p/text()').extract()
if content == []:
    data['content'] = x.select('//div[@id="nota_texto"]/text()').extract()
else:
    data['content'] = content

Update 2:

Using the double slash // works; however, now I am also getting the contents of a table, because the HTML has the following format:

<div id="texto">
      <table>
        Undesired content
      </table>
       Desired content.
</div>

How can I avoid getting the ‘Undesired content’?

Update 3:

I received an answer from Steven Almeroth in the Scrapy Users Google Group:

Use following-sibling:

x.select('id("texto")/table/following-sibling::node()').extract()

It works!
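
To illustrate what the following-sibling axis picks up, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree instead of Scrapy, so it runs on its own. It assumes well-formed, XML-style markup whose root element is the div, and it returns stripped text rather than the serialized nodes that extract() would give. The helper name text_after_table and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, modeled on the layout shown in Update 2.
PAGE = (
    '<div id="texto">'
    "<table>Undesired content</table>"
    "Desired content."
    "</div>"
)


def text_after_table(markup):
    """Return the text of everything that follows the first <table> child."""
    div = ET.fromstring(markup)
    children = list(div)
    for index, child in enumerate(children):
        if child.tag == "table":
            # The text right after </table> is stored as the table's tail.
            parts = [child.tail or ""]
            # Every later sibling contributes its own text and its tail.
            for sibling in children[index + 1:]:
                parts.extend(sibling.itertext())
                parts.append(sibling.tail or "")
            return [part.strip() for part in parts if part.strip()]
    return []  # no table, so nothing follows it


print(text_after_table(PAGE))
```

Anything before or inside the table is skipped, which is why the undesired content disappears. If the div has no table, the result is empty, just as the XPath expression would match nothing.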

1 Answer

Editorial Team · Answer 1 · 2026-06-13T22:42:18+00:00

So you want all text inside the div with id “texto” and all its children?
If that’s the case, this should work:

x.select('//div[@id="texto"]//text()').extract()
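
As a rough, standalone illustration of why the descendant step covers both formats, the sketch below uses the standard-library xml.etree.ElementTree rather than Scrapy. It assumes the sample pages are well-formed XML (note <br/> instead of <br>); the sample strings and the helper name all_text are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample pages, written as well-formed XML so the
# standard-library parser accepts them.
PAGE_WITH_P = '<div id="texto"><p>Text one.</p><p>Text two.</p></div>'
PAGE_WITH_BR = '<div id="texto">Text one.<br/><br/>Text two.<br/><br/></div>'


def all_text(markup, div_id="texto"):
    """Return every non-blank text node under the div, in document order."""
    root = ET.fromstring("<body>" + markup + "</body>")
    div = root.find(".//div[@id='%s']" % div_id)
    if div is None:
        return []
    # itertext() walks the element and all of its descendants, which is
    # roughly what the descendant step //text() selects.
    return [text.strip() for text in div.itertext() if text.strip()]


print(all_text(PAGE_WITH_P))
print(all_text(PAGE_WITH_BR))
```

Both calls return the same two strings, because the text is reached whether it sits directly in the div or inside a p element.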

If that’s too general for you, you can match multiple xpaths using the | operator.

'<xpath1>|<xpath2>'

EDIT:

If the ‘//text()’ xpath gets more than what you want, you should be more specific.
This is where the | comes in. Try something like:

x.select('//div[@id="texto"]/text() | //div[@id="texto"]/p/text()').extract()
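
The effect of that union can be sketched without Scrapy. The example below is an approximation using the standard-library xml.etree.ElementTree, assuming XML-style markup whose root element is the div; the helper name direct_and_p_text and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in XML-style markup, based on the layout in the question.
PAGE = (
    '<div id="texto">'
    "<table>Undesired content</table>"
    "Desired content."
    "<p>Paragraph text.</p>"
    "</div>"
)


def direct_and_p_text(markup):
    """Mimic div/text() | div/p/text() for markup whose root is the div."""
    div = ET.fromstring(markup)
    parts = [div.text]  # text before the first child element
    for child in div:
        if child.tag == "p":
            parts.append(child.text)  # text directly inside <p>
            # text directly inside <p> that follows a nested element
            parts.extend(inner.tail for inner in child)
        parts.append(child.tail)  # text directly inside the div, after child
    return [part.strip() for part in parts if part and part.strip()]


print(direct_and_p_text(PAGE))
```

Only text that is a direct child of the div or of a p is kept, so the table's content is left out, as is text nested deeper inside a p (for example inside a b element).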
