I’m trying to parse some html and I have some problem with this little


Editorial Team

Asked: May 26, 2026 (2026-05-26T10:29:49+00:00)

I’m trying to parse some HTML and I’m having a problem with this small snippet.

HTML:

<div>
    <p><span><a href="../url"></a></span></p>
    <h3 class="header"><a href="../url">Other</a></h3>
    <a href="../url">Other</a><br>
    <a class="aaaaa" href="../url">Indice</a>
    <p></p>               
</div>

code:

import urllib
from lxml import etree
import StringIO
resultado=urllib.urlopen('trozo.html')
html = resultado.read()
parser= etree.HTMLParser()
tree=etree.parse(StringIO.StringIO(html),parser)
xpath='/div/h3'
html_filtrado=tree.xpath(xpath)
print html_filtrado

When I print the result I get [], but I expected a list containing the element <h3 class="header"><a href="../url">Other</a></h3>.
If I had that list, I would call etree.tostring(html_filtrado) to see <h3 class="header"><a href="../url">Other</a></h3>.

So how can I get this code?

<h3 class="header"><a href="../url">Other</a></h3>

Or only ../url, which is the part I actually want?

Thank you


1 Answer

Editorial Team · Answer 1 · 2026-05-26T10:29:50+00:00

The issue is that when etree.HTMLParser() receives an HTML fragment, it builds a complete HTML document tree around it, adding the missing <html> and <body> elements.
So your <div> is not the root of the tree. If you call etree.tostring(tree), you get

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
<p><span><a href="../url"/></span></p>
<h3 class="header"><a href="../url">Other</a></h3>
<a href="../url">Other</a><br/><a class="aaaaa" href="../url">Indice</a>
<p/>

So the correct XPath would be '/html/body/div/h3'.
