I have an HTML file with span elements:
<html>
<body>
<span class="one">Text</span>some text</br>
<span class="two">Привет</span>Текст на русском</br>
</body>
</html>
To get “some text”, I use:
# -*- coding:cp1251 -*-
import lxml
from lxml import html
filename = "t.html"
fread = open(filename, 'r')
source = fread.read()
tree = html.fromstring(source)
fread.close()
tags = tree.xpath('//span[@class="one" and text()="Text"]')  # This works
print "name: ",tags[0].text
print "value: ",tags[0].tail
tags = tree.xpath('//span[@class="two" and text()="Привет"]')  # This fails
print "name: ",tags[0].text
print "value: ",tags[0].tail
This prints:
name: Text
value: some text
Traceback: ... in line `tags = tree.xpath('//span[@class="two" and text()="Привет"]')`
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes
How can I solve this problem?
lxml
(As observed, this is a bit dodgy between system encodings and apparently doesn’t work properly in Windows XP, though it did in Linux.)
I got it to work by decoding the source string: `tree = html.fromstring(source.decode('utf-8'))`. This means that the actual tree is all `unicode` objects. If you just pass the XPath parameter as a `unicode`, it finds 0 matches.
BeautifulSoup
I prefer to use BeautifulSoup for any of this sort of stuff, anyway. Here is my interactive session; I saved the file in cp1251.
At the end of that, it’s possibly worthwhile to try `source.decode('cp1251')` instead of `source.decode('utf-8')` when you’re taking it from the filesystem. lxml may actually work then.
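The cp1251 suggestion can be sketched roughly as follows. This is a minimal illustration, not the original session: it assumes Python 3 (where string literals are already unicode) and an installed lxml, and the helper name `read_tail`, the temporary file, and the inline page (modeled on the question's, with `<br>` in place of `</br>`) are made up for the example.

```python
# -*- coding: utf-8 -*-
# Sketch under assumptions: Python 3, lxml installed, file saved as cp1251.
import os
import tempfile

from lxml import html

# A small page modeled on the one in the question (hypothetical test data).
PAGE = (
    '<html>\n<body>\n'
    '<span class="one">Text</span>some text<br>\n'
    '<span class="two">Привет</span>Текст на русском<br>\n'
    '</body>\n</html>\n'
)


def read_tail(path, css_class, label, encoding="cp1251"):
    """Return the text following the matching span, or None if no match."""
    # Read raw bytes and decode them explicitly, so lxml gets a unicode string.
    with open(path, "rb") as f:
        source = f.read().decode(encoding)
    tree = html.fromstring(source)
    # XPath variables ($cls, $label) are filled from the keyword arguments.
    tags = tree.xpath("//span[@class=$cls and text()=$label]",
                      cls=css_class, label=label)
    return tags[0].tail if tags else None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "t.html")
    # Save the page in cp1251, as in the question.
    with open(path, "wb") as f:
        f.write(PAGE.encode("cp1251"))
    english = read_tail(path, "one", "Text")
    russian = read_tail(path, "two", "Привет")

print("value:", english)
print("value:", russian)
```

The point is that both sides of the comparison are unicode: the tree is built from a decoded string, and the XPath argument is a unicode string as well. Under Python 2 the same idea would presumably need `u"Привет"` (or `.decode('cp1251')` on the literal) for the XPath argument.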