I’m trying to figure out the Python lxml API, but I’m running into a peculiar problem. I’ve installed the following library versions:
- libxml2 : 2.7.8
- libxslt : 1.1.26
When I run the following code:
html = open('file.html', 'r')
context = etree.iterparse(StringIO(html), events=("start", "end"), html='true')
for event, element in context:
    pass  # do stuff
EDIT:
It turns out that it is a parsing error. I moved the HTML to a file (shown below):
<html>
<head></head>
<body>
<table>
<tr>
<td>image</td>
<a href="relative.phtml?with=querystring&blah=blah">blah\n(blah)</a></td>
<td> 35 </td>
<td> 28 </td>
<td><b>-7</b></td>
<td>
23,000 </td>
<td> 373,000 </td>
<td> 644,000 </td>
<td>+72.65%</td>
</tr>
<tr>
<td>image</td>
<td><a href="relative.phtml?with=querystring&blah=blah">blah\n(blah)</a></td>
<td> 35 </td>
<td> 28 </td>
<td><b>-7</b></td>
<td>
23,000 </td>
<td> 373,000 </td>
<td> 644,000 </td>
<td>+72.65%</td>
</tr>
</table>
</body>
</html>
I’m now getting this error:
for event, element in context:
  File "iterparse.pxi", line 515, in lxml.etree.iterparse.next (src/lxml/lxml.etree.c:86484)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64084)
lxml.etree.XMLSyntaxError: error parsing attribute name, line 1, column 12
ORIGINAL ERROR:
for event, element in context:
  File "iterparse.pxi", line 515, in lxml.etree.iterparse.next (src/lxml/lxml.etree.c:86484)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64084)
lxml.etree.XMLSyntaxError: htmlParseEntityRef: expecting ';', line 7, column 71
I thought I followed the tutorial from lxml’s site pretty closely here so I’m very confused. Could it be an installation problem?
The problem is that the HTML is malformed. To solve this, you can use BeautifulSoup (it’s able to parse this HTML) or sanitize the HTML before trying to parse it.
The problems I’ve found are:
- & => &amp; (the bare ampersand in the href query string has to be escaped)
- The closing </td> tag after the first <a> tag has to be removed, since it doesn’t match any opening <td> tag.
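As a rough illustration of the first fix, here is a minimal sketch of sanitizing the HTML before parsing. It uses only the standard library. The helper name escape_bare_ampersands and the regular expression are my own assumptions, not part of lxml. The pattern is an approximation: it leaves anything that looks like an entity (such as &amp; or &#38;) untouched, even if the name is not a real HTML entity, and it does nothing about the stray </td> tag.

```python
import re

# Matches an "&" that is NOT the start of something shaped like an entity:
# a decimal reference (&#38;), a hex reference (&#x26;) or a named one (&amp;).
# This is an approximation; it does not check that a named entity really exists.
_BARE_AMP = re.compile(r"&(?!(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)")


def escape_bare_ampersands(html):
    """Replace bare ampersands with &amp; (hypothetical helper, not an lxml API)."""
    return _BARE_AMP.sub("&amp;", html)


sample = '<a href="relative.phtml?with=querystring&blah=blah">blah</a>'
print(escape_bare_ampersands(sample))
# <a href="relative.phtml?with=querystring&amp;blah=blah">blah</a>
```

The cleaned string could then be handed to the parser in place of the raw markup. Whether that is enough for a given document depends on what else is wrong with it, so treat this as a starting point rather than a general fix.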