I’m looking for a simple way of converting a user-supplied string to UTF-8. It

Asked: June 8, 2026 (2026-06-08T03:00:32+00:00)

I’m looking for a simple way of converting a user-supplied string to UTF-8. It doesn’t have to be very smart; it should handle all ASCII byte strings and all Unicode strings (2.x unicode, 3.x str).

Since unicode is gone in 3.x and str changed meaning, I thought it might be a good idea to check for the presence of a decode method and call it without arguments, letting Python figure out what to do based on the locale, instead of doing isinstance checks. It turns out that’s not a good idea at all:

>>> u"één"
u'\xe9\xe9n'
>>> u"één".decode()
Traceback (most recent call last):
  File "<ipython-input-36-85c1b388bd1b>", line 1, in <module>
    u"één".decode()
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

My question is two-fold:

1. Why is there a unicode.decode method at all? I thought Unicode strings were considered “not encoded”. This looks like a sure way of getting doubly encoded strings.
2. How do I tackle this problem in a way that is forward-compatible with Python 3?

1 Answer

Editorial Team · Answer 1 · 2026-06-08T03:00:33+00:00

It’s not useful to speak of “decoding” a unicode string. You want to encode it to bytes. unicode.decode is solely there for historical reasons; its semantics are meaningless. Therefore, it has been removed in Python 3.
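The traceback in the question reports an encode error because Python 2 carries out unicode.decode() by first implicitly encoding the string with the default codec (ASCII) and only then decoding the result. A minimal sketch of that hidden step, written so that it also runs on Python 3; the variable names are illustrative only:

```python
text = u"\u00e9\u00e9n"  # the string "één" from the question

# Python 2 performs the equivalent of this encode step inside
# unicode.decode(), which is where the UnicodeEncodeError comes from.
try:
    text.encode("ascii")
    implicit_step_failed = False
except UnicodeEncodeError:
    implicit_step_failed = True

# What was actually wanted: an explicit encode to UTF-8 bytes.
utf8_bytes = text.encode("utf-8")
```

Calling encode("utf-8") explicitly avoids the default codec altogether, so the non-ASCII characters survive.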

However, the encode/decode semantics have historically been extended to include (character) string-to-string or bytes-to-bytes transforms such as rot13 or bzip2. These pseudo-encodings were dropped in the first Python 3 releases and restored in Python 3.2, where they are reached through the codecs module functions rather than through str.encode or bytes.decode.
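A short sketch of how such transforms can be called on Python 3.2 or later, using codecs.encode() and codecs.decode(). It assumes the interpreter was built with the bz2 module; the sample strings are made up for illustration:

```python
import codecs

# Text-to-text transform: rot_13 maps str to str.
scrambled = codecs.encode("Hello", "rot_13")
restored = codecs.decode(scrambled, "rot_13")

# Bytes-to-bytes transform: bz2_codec compresses and decompresses bytes.
payload = b"spam " * 100
compressed = codecs.encode(payload, "bz2_codec")
round_trip = codecs.decode(compressed, "bz2_codec")
```

Neither transform converts between text and bytes, which is why they sit apart from the ordinary text encodings such as UTF-8.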

In general, you should design your interfaces so that they accept either character strings or byte strings, not both. An interface that accepts both (for reasons other than backwards compatibility) is a code smell: it is hard to test, prone to bugs (what if someone passes UTF-16 bytes?), and has questionable semantics in the first place.

If you must have an interface that accepts both character and byte strings, you can check for the presence of the decode method in Python 3. If you want your code to work in 2.x as well, you’ll have to use isinstance.
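One possible shape for such an interface, as a sketch rather than a definitive implementation. The helper name to_utf8 and the alias text_type are hypothetical, and the choice to reject non-ASCII byte strings is an assumption based on the question's requirement to handle "all ASCII byte strings":

```python
import sys

# Hypothetical alias: the text type is str on 3.x and unicode on 2.x.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (name only exists on Python 2)


def to_utf8(value):
    """Return value as UTF-8 encoded bytes."""
    if isinstance(value, text_type):
        # Character string: encode explicitly, never rely on the default codec.
        return value.encode("utf-8")
    if isinstance(value, bytes):
        # Assumption: byte strings must be pure ASCII. Decoding validates
        # that and raises UnicodeDecodeError otherwise.
        value.decode("ascii")
        return value
    raise TypeError("expected a text or byte string, got %r" % type(value))
```

For example, to_utf8(u"\u00e9\u00e9n") gives the UTF-8 bytes for "één", while to_utf8(b"abc") returns the input unchanged. On Python 2.6 and later, bytes is an alias for str, so the same isinstance checks work on both major versions.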
