I tried to do that, and I got these errors:
>>> import re
>>> x = 'Ingl\xeas'
>>> x
'Ingl\xeas'
>>> print x
Ingl�s
>>> x.decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-5: unexpected end of data
>>> x.decode('utf8', 'ignore')
u'Ingl'
>>> x.decode('utf8', 'replace')
u'Ingl\ufffd'
>>> print x.decode('utf8', 'replace')
Ingl�
>>> print x.decode('utf8', 'xmlcharrefreplace')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
TypeError: don't know how to handle UnicodeDecodeError in error callback
When I use the print statement, I want this:
>>> print x
u'Inglês'
Any help is welcome.
You need to know how the input data is encoded before you decode it. In some of your attempts, you’re trying to decode it from UTF-8, but Python raises an exception because the input isn’t valid UTF-8. It looks like it might be latin-1. This works for me:
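A minimal sketch of that decode, assuming the byte string really is Latin-1 (ISO-8859-1), where the byte 0xEA is "ê". The variable names `text` and `utf8_bytes` are only illustrative:

```python
# Assumption: the bytes are Latin-1 (ISO-8859-1), where 0xEA means "ê".
x = b'Ingl\xeas'

# Decode with the encoding the data was actually written in.
text = x.decode('latin-1')

# text is now a unicode string containing U+00EA.
assert text == u'Ingl\xeas'

# If UTF-8 output is needed, encode the unicode string explicitly.
utf8_bytes = text.encode('utf-8')
```

Printing `text` on a terminal whose encoding can represent "ê" should then show the word as intended, rather than a replacement character.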
You mention “non-ASCII HTML”. If you’re writing a web server script and you’re getting data from an HTTP request, you should check the Content-Type header. In an ideal world, it will tell you which encoding the client is using for the data. Keep in mind that the client may be working incorrectly.
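As a rough sketch of the Content-Type check, the charset parameter can be pulled out with the standard library's `email.message.Message`. The helper names `charset_from_content_type` and `decode_body` are hypothetical, and falling back to latin-1 is an assumption made here because it never fails to decode, not something the client guarantees:

```python
from email.message import Message


def charset_from_content_type(header_value, default='latin-1'):
    """Return the lowercased charset from a Content-Type header value.

    Hypothetical helper; `default` is an assumed fallback used when the
    header carries no charset parameter.
    """
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset(default)


def decode_body(body, header_value):
    """Decode request bytes using the declared charset.

    Falls back to latin-1 (an assumption) if the declared charset is
    unknown or the client sent bytes that do not match it.
    """
    charset = charset_from_content_type(header_value)
    try:
        return body.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return body.decode('latin-1')
```

The fallback covers the case where the client is working incorrectly, for example declaring UTF-8 while sending Latin-1 bytes. The result may still be wrong text, but it will not raise.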
Hope that helps!