I’m new to lxml and Python. I’m trying to parse an HTML document. When I parse it with the standard XML parser, the characters are written out correctly, but I think the parse fails, because I have trouble searching the result with XPath. With the HTML parser (code below), XPath works but the characters come out garbled.
Example file being parsed:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>title</title>
</head>
<body>
<span id="demo">Garbléd charactérs</span>
</body>
</html>
Parsing code:
from lxml import etree
fname = 'output/so-help.html'
# parse
hparser = etree.HTMLParser()
htree = etree.parse(fname, hparser)
# garbled
htree.write('so-dumpu.html', encoding='utf-8')
# targets
demo_name = htree.xpath("//span[@id='demo']")
# garbled
print 'name: "' + demo_name[0].text
Terminal output:
name: "Garbléd charactérs
htree.write output:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>title</title></head><body>
<span id="demo">Garbléd charactérs</span>
</body></html>
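The problem is that you are encoding data that is already encoded: the file is stored as UTF-8, but the parser is not told so, and the text is then encoded again on output. What you need is to let the parser decode the data as UTF-8.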
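To make the double-encoding idea concrete, here is a minimal sketch using only built-in string methods, no lxml. It assumes the wrong guess made for the undeclared file is Latin-1 (ISO-8859-1). That choice is an assumption for illustration, and the variable names are made up.

```python
# -*- coding: utf-8 -*-
# Illustration only: simulate UTF-8 bytes being read with the wrong charset.
original = u"Garbléd charactérs"

# The bytes as they are stored in a UTF-8 file.
utf8_bytes = original.encode("utf-8")

# Assumption: the bytes are decoded as Latin-1 instead of UTF-8.
misread = utf8_bytes.decode("latin-1")

# Writing the misread text back out as UTF-8 encodes it a second time.
double_encoded = misread.encode("utf-8")

# Undoing the wrong decode recovers the original text.
repaired = misread.encode("latin-1").decode("utf-8")

print(repr(misread))
print(repr(repaired))
```

Each accented character becomes two characters after the wrong decode, which is the typical look of this kind of garbling. Repairing the text afterwards works here, but decoding correctly in the first place is cleaner.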
* In your original code, try demo_name[0].text.decode('utf-8') and you will see.
The right way to do it:
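One way to let the parser do the decoding is to pass an explicit encoding to `etree.HTMLParser`. The sketch below is a hedged example, not necessarily the exact code the answer intended. It parses a shortened in-memory copy of the sample document (an assumption, so the snippet runs without the file `output/so-help.html`), and with the real file you would pass `fname` to `etree.parse` as in the question.

```python
# -*- coding: utf-8 -*-
from io import BytesIO

from lxml import etree

# Shortened stand-in for the sample file, stored as UTF-8 bytes.
html_bytes = (
    u'<html><head><title>title</title></head><body>'
    u'<span id="demo">Garbléd charactérs</span>'
    u'</body></html>'
).encode("utf-8")

# Tell the HTML parser which encoding the input bytes use.
hparser = etree.HTMLParser(encoding="utf-8")

# With the real file: htree = etree.parse(fname, hparser)
htree = etree.parse(BytesIO(html_bytes), hparser)

demo_name = htree.xpath("//span[@id='demo']")
text = demo_name[0].text
print(u'name: "' + text + u'"')
```

With the encoding given up front, the element text is already correctly decoded, so no `.decode()` call is needed afterwards, and `htree.write(..., encoding='utf-8')` encodes it only once.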