I start by creating a string variable with some non-ASCII, UTF-8 encoded data in it:
>>> text = 'á'
>>> text
'\xc3\xa1'
>>> text.decode('utf-8')
u'\xe1'
Using unicode() on it raises errors…
>>> unicode(text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
…but if I know the encoding I can use it as second parameter:
>>> unicode(text, 'utf-8')
u'\xe1'
>>> unicode(text, 'utf-8') == text.decode('utf-8')
True
Now if I have a class that returns this text in the __str__() method:
>>> class ReturnsEncoded(object):
...     def __str__(self):
...         return text
...
>>> r = ReturnsEncoded()
>>> str(r)
'\xc3\xa1'
unicode(r) seems to use str() on it, since it raises the same error as unicode(text) above:
>>> unicode(r)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Until now everything is as planned!
But as no one would ever expect, unicode(r, 'utf-8') won’t even try:
>>> unicode(r, 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: coercing to Unicode: need string or buffer, ReturnsEncoded found
Why? Why this inconsistent behavior? Is it a bug? Is it intended? Very awkward.
The behaviour does seem confusing, but it is intentional. I reproduce here the entirety of the unicode documentation from the Python Built-In Functions documentation (for version 2.5.2, as I write this):
So, when you call unicode(r, 'utf-8'), it requires an 8-bit string or a character buffer as the first argument. Your object is neither, and in this form unicode() does not fall back to the __str__() method, which is why you get the TypeError instead of a decoded string. Without the 'utf-8', the unicode() function looks for a __unicode__() method on your object and, not finding it, calls the __str__() method, as you suggested, attempting to use the default codec to convert to unicode.
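If what you want is the decoded text, one way around this is to do the two steps yourself: ask the object for its 8-bit string first, then decode that. Below is a minimal sketch of that idea, not anything from the documentation. The names text_type, bytes_type and to_text are hypothetical helpers of mine, and the version switch is only there so the same snippet also runs on Python 3 (where the analogous calls are str() and bytes(), and the object exposes its bytes through __bytes__ rather than __str__). It assumes Python 2.6 or later because of the b'' literal.

```python
import sys

PY2 = sys.version_info[0] == 2

# Hypothetical aliases so the sketch runs on both Python 2 and Python 3.
text_type = unicode if PY2 else str   # unicode() on 2.x, str() on 3.x
bytes_type = str if PY2 else bytes    # the 8-bit string type

ENCODED = b'\xc3\xa1'  # UTF-8 bytes for 'á', as in the question


class ReturnsEncoded(object):
    def _encoded(self):
        return ENCODED

    # Python 2 asks __str__ for the 8-bit string; Python 3 asks __bytes__.
    if PY2:
        __str__ = _encoded
    else:
        __bytes__ = _encoded


def to_text(obj, encoding='utf-8'):
    """Get the 8-bit string from obj first, then decode it explicitly."""
    raw = bytes_type(obj)            # str(obj) on Python 2
    return text_type(raw, encoding)  # unicode(raw, encoding) on Python 2


r = ReturnsEncoded()
decoded = to_text(r)

# Passing the object itself together with an encoding, as in the question,
# is rejected with a TypeError rather than being converted for you.
try:
    text_type(r, 'utf-8')
    direct_call_failed = False
except TypeError:
    direct_call_failed = True
```

On Python 2 the helper boils down to unicode(str(r), 'utf-8'), which is equivalent to str(r).decode('utf-8'). If you control the class, defining a __unicode__() method that does the decoding is another option, since unicode(r) without an encoding looks for that method first.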