I'm trying to parse some HTML and I'm having a problem with this small snippet.
HTML:
<div>
<p><span><a href="../url"></a></span></p>
<h3 class="header"><a href="../url">Other</a></h3>
<a href="../url">Other</a><br>
<a class="aaaaa" href="../url">Indice</a>
<p></p>
</div>
Code:
import urllib
from lxml import etree
import StringIO
resultado=urllib.urlopen('trozo.html')
html = resultado.read()
parser= etree.HTMLParser()
tree=etree.parse(StringIO.StringIO(html),parser)
xpath='/div/h3'
html_filtrado=tree.xpath(xpath)
print html_filtrado
When I run the code it prints [], but I expected a list containing <h3 class="header"><a href="../url">Other</a></h3>.
If I had that list, I would call etree.tostring(html_filtrado) to see <h3 class="header"><a href="../url">Other</a></h3>.
So how can I get this code?
<h3 class="header"><a href="../url">Other</a></h3>
Or only ../url, which is the part I actually want?
Thank you
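The thing is that when etree.HTMLParser() receives HTML, it builds a full HTML DOM tree, so your fragment ends up wrapped in <html> and <body> elements.
So, instead of what you intended, etree.tostring(tree) shows your <div> nested inside <html><body>...</body></html>, and /div/h3 matches nothing because <div> is no longer the root.
So, the correct xpath would be '/html/body/div/h3'.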
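To make that concrete, here is a minimal sketch, not necessarily how you would structure the real script. It assumes the fragment from the question is inlined as a string instead of being read from trozo.html, and it uses etree.HTML() as a shortcut for parsing a string with the HTML parser. The variable names root, h3_list, h3_markup and hrefs are only for illustration.

```python
from lxml import etree

# The fragment from the question, inlined so the sketch is self-contained
# (assumption: the real script reads the same markup from trozo.html).
html = '''<div>
<p><span><a href="../url"></a></span></p>
<h3 class="header"><a href="../url">Other</a></h3>
<a href="../url">Other</a><br>
<a class="aaaaa" href="../url">Indice</a>
<p></p>
</div>'''

# etree.HTML() parses the string with the HTML parser and returns the root
# element, which is <html> and not the original <div>.
root = etree.HTML(html)
print(root.tag)

# The absolute path has to start at the wrapper elements the parser added.
h3_list = root.xpath('/html/body/div/h3')

# tostring() takes a single element, not the list returned by xpath();
# with_tail=False drops the whitespace that follows the closing tag.
h3_markup = etree.tostring(h3_list[0], with_tail=False).decode()
print(h3_markup)

# Selecting the attribute directly returns the href values as strings.
hrefs = root.xpath('/html/body/div/h3/a/@href')
print(hrefs)
```

If you would rather not depend on the wrapper elements at all, a relative path such as '//h3/a/@href' finds the same attribute wherever the <h3> ends up in the tree.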