I am writing some HTML parsers using lxml's XPath support. They seem to be working fine, but I have one main problem.
When parsing the HTML <p> tags, some of the words are wrapped in inline tags such as <b>, <i>, and so on. I need to keep those tags.
Take this HTML, for example:
<div class="ArticleDetail">
<p>Hello world, this is a <b>simple</b> test, which contains words in <i>italic</i> and others.
I have a <strong>strong</strong> tag here. I guess this is a silly test.
<br/>
Ops, line breaks.
<br/></p>
If I run this Python code:
import lxml.html

x = lxml.html.fromstring("...html text...").xpath("//div[@class='ArticleDetail']/p")
for stuff in x:
    print(stuff.text_content())
This runs fine, but it strips every tag inside the <p>, not just the <p> tag itself.
Output:
Hello world, this is a simple test, which contains words in italic and others.
I have a strong tag here. I guess this is a silly test.
Ops, line breaks.
As you can see, it removed all the <b>, <i> and <strong> tags. Is there any way to keep them?
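You are currently retrieving only the text content, not the HTML content (which would include the tags): text_content() returns the text of the element and all of its descendants, with the markup dropped.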
You want to retrieve all child nodes of your XPath match instead:
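One way to do that, as a minimal sketch rather than the only approach: take the text that precedes the first child, then serialize each child element with lxml.html.tostring, which by default also includes the text that follows the child. The helper name inner_html is made up for this example (it is not part of lxml), the sample HTML is a shortened version of the one in the question, and the code assumes Python 3.

```python
import lxml.html

# Shortened version of the sample HTML from the question.
html = """<div class="ArticleDetail">
<p>Hello world, this is a <b>simple</b> test, with words in <i>italic</i>.
<br/>
Ops, line breaks.
<br/></p>
</div>"""


def inner_html(element):
    """Return the markup inside `element`, without the element's own tag.

    Hypothetical helper for illustration; not an lxml function.
    """
    # element.text holds the text before the first child element.
    parts = [element.text or ""]
    for child in element:
        # tostring() serializes the child and, by default (with_tail=True),
        # the text that follows it up to the next sibling.
        parts.append(lxml.html.tostring(child, encoding="unicode"))
    return "".join(parts)


paragraphs = lxml.html.fromstring(html).xpath("//div[@class='ArticleDetail']/p")
results = [inner_html(p) for p in paragraphs]

for result in results:
    print(result)
```

The printed string keeps the <b>, <i> and <br> markup but not the enclosing <p>. If the surrounding <p> tag is acceptable in the output, calling lxml.html.tostring(p) directly on each match is simpler.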