I am trying to work with several documents that use various encodings: some UTF-8, some ISO-8859-2, some ASCII, and so on. Is there a reliable way of decoding them to a standard encoding for processing?
I have tried the following:
import chardet
encoding = chardet.detect(text)
text = unicode(text,encoding['encoding']).decode(sys.getdefaultencoding(),'ignore')
With the above code I still get UnicodeEncodeError exceptions.
Use decode() to convert bytes to unicode, and encode() to convert unicode to bytes. That said, I would recommend doing your processing on the unicode objects themselves, or on UTF-8 encoded strings if you absolutely need to work with bytes.
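As a minimal sketch of that decode/encode round trip: the helper names to_unicode and to_utf8 and the sample bytes are made up for illustration, and the source encoding is assumed to be known already.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers for illustration; not part of any library.

def to_unicode(raw, encoding, errors="strict"):
    """Decode raw bytes into a unicode string using the given encoding."""
    return raw.decode(encoding, errors)


def to_utf8(text):
    """Encode a unicode string into UTF-8 bytes."""
    return text.encode("utf-8")


# Sample input (assumed): three Polish letters encoded as ISO-8859-2.
raw_latin2 = b"\xb1\xe6\xea"

text = to_unicode(raw_latin2, "iso-8859-2")  # bytes -> unicode
utf8_bytes = to_utf8(text)                   # unicode -> UTF-8 bytes
```

In the question's setup, the encoding argument would presumably come from chardet.detect(raw)['encoding']. Detection is a guess, so it may be worth handling the case where it is wrong or missing.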
sys.getdefaultencoding() is 'ascii', which provides a very limited character set. See also: http://wiki.python.org/moin/DefaultEncoding
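To illustrate why the 'ascii' codec leads to UnicodeEncodeError, here is a small sketch using made-up sample text. It shows only the encoding step that fails, not the exact code from the question.

```python
# Non-ASCII sample text (assumed for illustration).
text = u"\u0105\u0107\u0119"

# Encoding to ASCII fails because none of these characters exist in ASCII.
raised = False
try:
    text.encode("ascii")
except UnicodeEncodeError:
    raised = True

# errors="ignore" avoids the exception but silently drops the characters.
lossy = text.encode("ascii", "ignore")

# UTF-8 can represent every character, so nothing is lost.
utf8_bytes = text.encode("utf-8")
```

Calling .decode() on a unicode object in Python 2 implicitly encodes it with the default codec first, which is most likely where the error in the question comes from.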