I’m looking for a simple way of converting a user-supplied string to UTF-8. It doesn’t have to be very smart; it should handle all ASCII byte strings and all Unicode strings (2.x unicode, 3.x str).
Since unicode is gone in 3.x and str changed meaning, I thought it might be a good idea to check for the presence of a decode method and call that without arguments to let Python figure out what to do based on the locale, instead of doing isinstance checks. Turns out that’s not a good idea at all:
>>> u"één"
u'\xe9\xe9n'
>>> u"één".decode()
Traceback (most recent call last):
  File "<ipython-input-36-85c1b388bd1b>", line 1, in <module>
    u"één".decode()
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
My question is two-fold:
- Why is there a unicode.decode method at all? I thought Unicode strings were considered “not encoded”. This looks like a sure way of getting doubly encoded strings.
- How do I tackle this problem in a way that is forward-compatible with Python 3?
It’s not useful to speak of “decoding” a unicode string. You want to encode it to bytes. unicode.decode is there solely for historical reasons; its semantics are meaningless. In Python 2, calling it first encodes the string implicitly with the default ASCII codec, which is why the example above fails with a UnicodeEncodeError rather than a decoding error. The method has therefore been removed in Python 3.
However, the encode/decode semantics have historically been extended to include (character) string-to-string and bytes-to-bytes encodings such as rot13 or bzip2. These pseudo-encodings were dropped in the early Python 3 releases and reintroduced in Python 3.2.
In general, you should design your interfaces so that they accept either character strings or byte strings, not both. An interface that accepts both (for reasons other than backwards compatibility) is a code smell: it is hard to test, prone to bugs (what if someone passes UTF-16 bytes?), and has questionable semantics in the first place.
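To illustrate the pseudo-encodings mentioned above: in Python 3.2 and later they can be reached through the codecs module functions rather than through str.encode or bytes.decode. The following is only a small sketch; the variable names and sample data are made up for illustration, and the canonical codec names rot_13 and bz2_codec are used because the shorter aliases are not available in every 3.x release.

```python
import codecs

# String-to-string transform: rot13 maps text to text.
rotated = codecs.encode("Hello", "rot_13")
restored = codecs.decode(rotated, "rot_13")

# Bytes-to-bytes transform: bz2 maps bytes to (compressed) bytes.
payload = b"spam" * 100
compressed = codecs.encode(payload, "bz2_codec")
decompressed = codecs.decode(compressed, "bz2_codec")

print(rotated)                  # Uryyb
print(restored)                 # Hello
print(decompressed == payload)  # True
```

Neither transform crosses the text/bytes boundary, which is why they do not fit the usual meaning of “encode” and “decode”.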
If you must have an interface that accepts both character and byte strings, you can check for the presence of the decode method in Python 3. If you want your code to work in 2.x as well, you’ll have to use isinstance.
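A minimal sketch of the isinstance approach could look like the following. The helper name to_utf8 and its byte_encoding parameter are hypothetical, and the sketch assumes Python 2.6 or later, where bytes exists as an alias for str, so the same check covers both 2.x and 3.x. Byte strings are decoded first (as ASCII by default) so that non-ASCII bytes fail loudly instead of being passed through unchecked.

```python
def to_utf8(value, byte_encoding="ascii"):
    """Return value as UTF-8 encoded bytes (hypothetical helper).

    Accepts a byte string (2.x str, 3.x bytes) or a character string
    (2.x unicode, 3.x str).
    """
    if isinstance(value, bytes):
        # Byte string: validate it by decoding with the assumed encoding.
        # Raises UnicodeDecodeError if the bytes do not fit that encoding.
        value = value.decode(byte_encoding)
    # At this point value is a character string in both 2.x and 3.x.
    return value.encode("utf-8")


ascii_result = to_utf8(b"abc")
text_result = to_utf8(u"\xe9\xe9n")  # the "één" example from the question
```

Because bytes input is restricted to one assumed encoding, the UTF-16 problem raised earlier shows up as an exception or as a visible assumption in the call, not as silently corrupted output.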