I am using an HtmlXPathSelector(response) object in Scrapy, and I need to extract text that comes in two different formats.
My first text has the following format:
<p> Text, text, text, text, text, text, text, text, text </p>
<p>
<p> Text, text, text, text, text. </p>
My second text has the following format:
Text, text, text, text, text, text
<br>
<br>
Text, text, text..
<br>
<br>
When I use x.select('//div[@id="texto"]/text()').extract() I get the first one but not the second. I get something like this:
'content': [u'\r\n ',
u'\r\n',
...
u'\r\n']
When I use x.select('//div[@id="texto"]/p/text()').extract() I get the second one but not the first.
How can I write a single rule that gets both formats?
Update:
I got it working with the following code, but it feels like a dirty solution:
    content = x.select('//div[@id="nota_texto"]/p/text()').extract()
    if not content:
        data['content'] = x.select('//div[@id="nota_texto"]/text()').extract()
    else:
        data['content'] = content
Update 2:
Using the double slash // works; however, now I am also getting the contents of a table, because the HTML has the following format:
<div id="texto">
<table>
Undesired content
</table>
Desired content.
</div>
How can I avoid getting the ‘Undesired content’?
Update 3:
I received an answer from Steven Almeroth in the Scrapy Users Google Group:
Use following-sibling:
x.select('id("texto")/table/following-sibling::node()').extract()
It works!
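For anyone who wants to try the following-sibling idea outside a spider, here is a minimal sketch. It is not the exact call above: it assumes lxml instead of HtmlXPathSelector, writes the div lookup as //div[@id="texto"] rather than id("texto"), and uses following-sibling::text() so that only text nodes come back. The sample page is made up.

```python
from lxml import html

# Hypothetical page: a table we do not want, followed by the text we do want.
PAGE = (
    '<html><body><div id="texto">'
    '<table><tr><td>Undesired content</td></tr></table>'
    'Desired content.'
    '</div></body></html>'
)

tree = html.document_fromstring(PAGE)

# following-sibling::node() would also return sibling elements;
# text() keeps only the text nodes that come after the table.
after_table = tree.xpath('//div[@id="texto"]/table/following-sibling::text()')

# Drop whitespace-only nodes and trim the rest.
desired = [t.strip() for t in after_table if t.strip()]
print(desired)
```

The same XPath string should be usable with x.select(...) in Scrapy, but I have only sketched it with lxml here.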
So you want all the text inside the div with id “texto” and all its children?
If that’s the case, this should work:
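A minimal sketch of that idea, assuming the expression meant here is the '//text()' one mentioned further down. It uses lxml rather than Scrapy so it runs on its own, and the two sample pages and the all_text helper are made up for illustration:

```python
from lxml import html

# Hypothetical pages, one per format from the question.
PAGE_WITH_P = (
    '<html><body><div id="texto">'
    '<p>First paragraph.</p><p>Second paragraph.</p>'
    '</div></body></html>'
)
PAGE_WITH_BR = (
    '<html><body><div id="texto">'
    'First line.<br><br>Second line.<br><br>'
    '</div></body></html>'
)

def all_text(source):
    """Return every non-blank text node under div#texto, at any depth."""
    tree = html.document_fromstring(source)
    # The double slash before text() matches text nodes at any nesting level.
    nodes = tree.xpath('//div[@id="texto"]//text()')
    return [n.strip() for n in nodes if n.strip()]

p_texts = all_text(PAGE_WITH_P)
br_texts = all_text(PAGE_WITH_BR)
print(p_texts)
print(br_texts)
```

In Scrapy the equivalent would presumably be x.select('//div[@id="texto"]//text()').extract(), followed by the same whitespace filtering.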
If that’s too general for you, you can match multiple XPaths using the | operator.
EDIT:
If the ‘//text()’ xpath gets more than what you want, you should be more specific.
This is where the | comes in. Try something like:
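A sketch of what such a union could look like, assuming the two paths from the question are the ones to combine. It uses lxml so it can be run standalone, and the sample page is hypothetical:

```python
from lxml import html

# Hypothetical page mixing both formats, plus a table we do not want.
PAGE = (
    '<html><body><div id="texto">'
    '<table><tr><td>Undesired content</td></tr></table>'
    'Loose text.<br><br>'
    '<p>Paragraph text.</p>'
    '</div></body></html>'
)

tree = html.document_fromstring(PAGE)

# The | operator returns the union of both node sets, in document order:
# text directly inside the div, and text directly inside its <p> children.
union_nodes = tree.xpath(
    '//div[@id="texto"]/text() | //div[@id="texto"]/p/text()'
)

# Drop whitespace-only nodes and trim the rest.
union_texts = [n.strip() for n in union_nodes if n.strip()]
print(union_texts)
```

Because neither path descends into the table, its text is left out. The same expression should work as the argument to x.select(...) in Scrapy.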