I am trying to work with several documents that all have various encodings –

Question

Asked: May 17, 2026 (2026-05-17T22:23:12+00:00)

I am trying to work with several documents that all have different encodings: some UTF-8, some ISO-8859-2, some ASCII, and so on. Is there a reliable way to decode them all to one standard encoding for processing?

I have tried the following:

import chardet
encoding = chardet.detect(text)
text = unicode(text,encoding['encoding']).decode(sys.getdefaultencoding(),'ignore')

With the above code I still get UnicodeEncodeError exceptions.

1 Answer

Editorial Team · Answer 1 · 2026-05-17T22:23:13+00:00

Use decode to convert bytes to unicode, and encode to convert unicode to bytes:

text.decode(encoding['encoding'], 'ignore').encode(sys.getdefaultencoding(), 'ignore')

That said, I would recommend doing your processing on the unicode objects themselves, or on UTF-8 encoded strings if you absolutely need to work with bytes. In Python 2, sys.getdefaultencoding() returns 'ascii', which covers a very limited character set, so encoding to it fails or drops every non-ASCII character. See also: http://wiki.python.org/moin/DefaultEncoding
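To make the decode-then-encode direction concrete, here is a minimal sketch of one way to normalize mixed input. It assumes Python 3, where bytes and str (unicode) are separate types, rather than the Python 2 used in the question. The names to_text, to_utf8, detector and fallback are hypothetical, and trying strict UTF-8 before asking chardet is my own assumption, not something the answer prescribes.

```python
# Sketch for Python 3: decode bytes of unknown encoding to str, then
# re-encode as UTF-8. All function and parameter names are illustrative.

try:
    import chardet  # third-party and optional in this sketch
except ImportError:
    chardet = None


def to_text(raw, detector=None, fallback="iso-8859-2"):
    """Decode raw bytes to str (bytes -> unicode is always *decode*)."""
    # UTF-8 is strict, so a clean decode is a strong sign the guess is right.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Otherwise ask a detector; default to chardet.detect if it is installed.
    if detector is None and chardet is not None:
        detector = chardet.detect
    guess = detector(raw)["encoding"] if detector else None
    try:
        # 'replace' marks undecodable bytes instead of raising.
        return raw.decode(guess or fallback, "replace")
    except LookupError:
        # The detector named a codec Python does not know about.
        return raw.decode(fallback, "replace")


def to_utf8(raw, **kwargs):
    """Normalize bytes in any supported encoding to UTF-8 bytes."""
    # unicode -> bytes is always *encode*; UTF-8 can represent every character.
    return to_text(raw, **kwargs).encode("utf-8")
```

Calling to_utf8(data) on each document's raw bytes would give a uniform UTF-8 corpus, while to_text(data) keeps everything as unicode for processing. Detection is a guess, so short or ambiguous inputs can still be decoded with the wrong codec.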
