I’ve asked a question on how to use lxml to parse a url and

Question

Asked: May 26, 2026

I asked a question on how to use lxml to parse a URL and get <p> elements back, and it has been resolved. However, to fully achieve my goal, I need to account for other tags nested inside a <p>.

The accepted answer provided by Acorn to parse a URL and get the <p> text back is:

import lxml.html

htmltree = lxml.html.parse('http://www.google.com/intl/en/about/corporate/index.html')

print(htmltree.xpath('//p/text()'))

However, if there are other tags inside the <p> paragraph, htmltree.xpath('//p/text()') returns the text in pieces and ignores the text inside those other tags.

E.g. for <p>Text1... <a href="/link.../">hyperlinked text..</a> Text2....</p>

Currently, by using htmltree.xpath('//p/text()'), it is parsed into ['Text1...','Text2...'].
More intuitively, the expected result should be ['Text1... hyperlinked text.. Text2...'].

Hence I would like to know what other method I should use to parse the paragraph as a whole, so that other tags such as <a> do not interrupt the text.

I have further looked into the lxml xpath documentation, and I suspect it is because of the /text() in //p/text(). But I am stuck here and have no clue what to change.

1 Answer

Editorial Team · Answer 1 · 2026-05-26T09:49:33+00:00

Yes, /text() selects only the text nodes that are direct children of the tag. Instead, select all the p elements and call .text_content() on each one to get all of the text inside it. From the lxml.html documentation:

.text_content():

Returns the text content of the element, including
the text content of its children, with no markup.

So you will have something like this:

import lxml.html

htmltree = lxml.html.parse('http://www.google.com/intl/en/about/corporate/index.html')

p_tags = htmltree.xpath('//p')
p_content = [p.text_content() for p in p_tags]

print(p_content)
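
To see the difference without fetching a page, here is a minimal sketch that parses an inline fragment instead. The HTML string is a made-up stand-in modelled on the example in the question, and the variable names (html, root, pieces, whole) are only illustrative:

```python
import lxml.html

# Hypothetical fragment modelled on the example in the question.
html = '<div><p>Text1... <a href="/link.../">hyperlinked text..</a> Text2....</p></div>'
root = lxml.html.fromstring(html)

# /text() selects only the text nodes that are direct children of <p>,
# so the text inside <a> is skipped and the rest comes back in pieces.
pieces = root.xpath('//p/text()')

# text_content() joins the text of <p> and all of its descendants.
whole = [p.text_content() for p in root.xpath('//p')]

print(pieces)
print(whole)
```

Here pieces should hold the two fragments around the link, while whole holds a single string per paragraph with the link text included. Whitespace is kept as it appears in the markup, so you may still want to strip or normalize it afterwards.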
