I’m confused. Consider this code working the way I expect:
>>> foo = u'Émilie and Juañ are turncoats.'
>>> bar = "foo is %s" % foo
>>> bar
u'foo is \xc3\x89milie and Jua\xc3\xb1 are turncoats.'
And this code not at all working the way I expect:
>>> try:
...     raise Exception(foo)
... except Exception as e:
...     foo2 = e
...
>>> bar = "foo2 is %s" % foo2
------------------------------------------------------------
Traceback (most recent call last):
File "<ipython console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Can someone explain what’s going on here? Why does it matter whether the unicode data is in a plain unicode string or stored in an Exception object? And why does this fix it:
>>> bar = u"foo2 is %s" % foo2
>>> bar
u'foo2 is \xc3\x89milie and Jua\xc3\xb1 are turncoats.'
I am quite confused! Thanks for the help!
UPDATE: My coding buddy Randall has added to my confusion in an attempt to help me! Send in the reinforcements to explain how this is supposed to make sense:
>>> class A:
...     def __str__(self): return "string"
...     def __unicode__(self): return "unicode"
...
>>> "%s %s" % (u'niño', A())
u'ni\xc3\xb1o unicode'
>>> "%s %s" % (A(), u'niño')
u'string ni\xc3\xb1o'
Note that the order of the arguments here determines which method is called!
The Python Language Reference has the answer:
Your first example works because foo is a unicode object. This causes the above rule to take effect (a unicode argument to %s makes the whole result unicode) and results in a Unicode string.

In your second example, foo2 is an Exception object, which is obviously not a unicode object. So the interpreter tries to convert it to a normal str using your default encoding. This, apparently, is ascii, which cannot represent those characters, so the conversion bails out with an exception.

Your third example works again because the format string is a unicode object. So the interpreter tries to convert foo2 to a unicode object as well, which succeeds.

As to Randall’s question: this surprises me too. However, this is according to the standard (reformatted for readability):

How such a unicode object is created is left unclear. So both are legal:

- call __str__, decode back to a Unicode string, and insert it into the output string
- call __unicode__ and insert the result directly into the output string

The mixed behaviour of the Python interpreter is rather hideous indeed. I would consider this to be a bug in the standard.
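To make the order dependence concrete, here is a toy model of the conversion. It is not the interpreter's actual code. It assumes the arguments are converted left to right, with __str__ until the first unicode argument shows up and with __unicode__ from then on, which is consistent with the two outputs Randall posted. The helper name py2_style_convert is made up for illustration, and the sketch is written so that it also runs on Python 3.

```python
# -*- coding: utf-8 -*-
# Toy model only: mimics the observed Python 2 behaviour of "%s %s" % (a, b).
# py2_style_convert is a hypothetical helper, not part of any library.

text_type = type(u'')  # unicode on Python 2, str on Python 3


class A(object):
    def __str__(self):
        return "string"

    def __unicode__(self):
        return u"unicode"


def py2_style_convert(args):
    """Convert args left to right, switching to unicode at the first unicode arg."""
    parts = []
    seen_unicode = False
    for arg in args:
        if isinstance(arg, text_type):
            # A unicode argument: from here on the result is unicode.
            seen_unicode = True
            parts.append(arg)
        elif seen_unicode and hasattr(arg, '__unicode__'):
            # The result is already unicode, so __unicode__ is preferred.
            parts.append(arg.__unicode__())
        else:
            # No unicode seen yet (or no __unicode__ available): use __str__.
            parts.append(text_type(str(arg)))
    return u' '.join(parts)


print(repr(py2_style_convert((u'niño', A()))))
print(repr(py2_style_convert((A(), u'niño'))))
```

The first call picks up "unicode" from A() and the second picks up "string", so in this model the only thing that changes the chosen method is whether a unicode argument has already been seen.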
Edit: Quoting the Python 3.0 changelog, emphasis mine: