I’m struggling with encodings and lxml. I’m reading in some HTML from a website and would like to use lxml to search for a tag that includes a £ in its text. I can search for the tag (h3) and print its contents fine, but if I try to search for the £ sign within the text I get a UnicodeDecodeError. I need to do the latter because it’s the more general case.
import lxml.html
tree = lxml.html.fromstring(html)
# prints £13,999
print tree.cssselect('h3')[0].text_content().encode("utf-8")
# should also print £13,999, but instead generates "UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 0: ordinal not in range(128)"
print tree.cssselect('h3:contains(u"\xa3")')[0].text_content().encode('utf-8')
Any help you can provide would be much appreciated… I’ve tried several different things and this is driving me crazy!
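I’m not experienced with either Python or lxml, but the problem could be that the ‘h3’ selector string isn’t a unicode string and that the byte a3 isn’t a Unicode code point by itself. You could try making the whole selector a unicode string, i.e. moving the u prefix from the inner quoted text to the start of the selector literal.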
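A minimal sketch of that idea, using only the standard library so it runs without lxml. It has not been tested against the asker's page. The names `try_ascii_decode`, `pound_bytes`, `pound_text` and `selector` are made up for illustration, and the assumption is that the error comes from the byte 0xa3 being implicitly decoded as ASCII somewhere inside the selector handling.

```python
# Sketch under the assumption above; not the asker's actual code.

# In Python 2, "\xa3" inside a plain (byte) string literal is the single
# byte 0xA3. A bytes literal is used here so the snippet behaves the same
# on Python 2 and Python 3.
pound_bytes = b"\xa3"


def try_ascii_decode(raw):
    """Return the decoded text, or the error message if ASCII decoding fails."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)


# Decoding the lone byte as ASCII fails, much like the implicit decode
# that appears to happen in the question.
error_message = try_ascii_decode(pound_bytes)
print(error_message)

# In a unicode literal, "\xa3" is the code point U+00A3 (the pound sign),
# so no ASCII decoding is needed later.
pound_text = u"\xa3"

# The whole selector is built as a unicode string, with the u prefix on the
# outer literal rather than inside the quoted :contains() argument.
selector = u'h3:contains("%s")' % pound_text
print(repr(selector))
```

If the assumption holds, passing `selector` to `tree.cssselect(...)` in place of the byte-string selector from the question would be the thing to try.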