I previously asked how to use lxml to parse a URL and get the <p> elements back, and that question is resolved. To fully achieve my goal, however, I also need to handle other tags nested inside a <p>.
The accepted answer, provided by Acorn, for parsing a URL and getting the <p> elements back is:
import lxml.html
htmltree = lxml.html.parse('http://www.google.com/intl/en/about/corporate/index.html')
print(htmltree.xpath('//p/text()'))
However, if a <p> contains other tags, htmltree.xpath('//p/text()') returns the paragraph in separate pieces, and the text inside those nested tags is left out.
E.g. for <p>Text1... <a href="/link.../">hyperlinked text..</a> Text2...</p>
Currently, by using htmltree.xpath('//p/text()'), it is parsed into ['Text1...','Text2...'].
More intuitively, the expected result should be ['Text1... hyperlinked text.. Text2...'].
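The behavior can be reproduced without a network call. This is a minimal sketch that parses an inline copy of the example markup; the SAMPLE string and the variable names are made up for illustration and are not the real page.

```python
import lxml.html

# Hypothetical inline markup mirroring the example above (not the real page).
SAMPLE = (
    '<html><body>'
    '<p>Text1... <a href="/link.../">hyperlinked text..</a> Text2...</p>'
    '</body></html>'
)

root = lxml.html.fromstring(SAMPLE)

# text() only matches text nodes that are direct children of <p>,
# so the text inside <a> is not part of the result.
pieces = root.xpath('//p/text()')
print(pieces)
```

The two pieces also keep the whitespace that surrounded the <a> element, which is why the result is not a single clean string.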
Hence, I would like to know which other method I should use to get each paragraph back as a single string, so that nested tags such as <a> no longer interrupt it.
I have further looked into the lxml xpath documentation, and I suspect it is because of the /text() in //p/text(). But I am stuck here and have no clue what to change.
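Yes, /text() selects only the text nodes that are direct children of the tag. Instead, select all the <p> elements and call .text_content() on each one to get all the text inside it. From the lxml.html documentation, .text_content() returns the text content of the element, including the text content of its children, with no markup.
So you will have something like [p.text_content() for p in htmltree.xpath('//p')].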
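A self-contained sketch of this approach, assuming the example markup from the question is available as an inline string rather than fetched from a URL. SAMPLE and the helper paragraph_texts are hypothetical names introduced only for illustration; text_content() and xpath() are the lxml calls discussed above.

```python
import lxml.html

# Hypothetical inline markup based on the example in the question.
SAMPLE = (
    '<html><body>'
    '<p>Text1... <a href="/link.../">hyperlinked text..</a> Text2...</p>'
    '</body></html>'
)


def paragraph_texts(root):
    """Return the full text of every <p>, including text in nested tags."""
    # Select the <p> elements themselves (not their text nodes), then let
    # text_content() join the text of each element and its descendants.
    return [p.text_content() for p in root.xpath('//p')]


root = lxml.html.fromstring(SAMPLE)
paragraphs = paragraph_texts(root)
print(paragraphs)
```

For a real page, the same helper should work on the tree returned by lxml.html.parse(url), since xpath() can be called on that tree as in the question.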