I would like some clarification on Unicode and str methods in Python. After reading some explanations of Unicode, I still have a couple of doubts that I hope folks can help me with:
- Am I right to say that when declaring a unicode string, e.g. word = u'foo', Python uses the encoding of the terminal to decode foo (e.g. from UTF-8) and assigns word the hex representation in Unicode?
- So, in general, does printing out the characters in a file always mean decoding the byte stream to its Unicode representation according to the encoding, before displaying the mapped characters?
- In my terminal, why do 'é'.lower() and str('é') display as the hex '\xc3\xa9', whereas 'a'.lower() does not?
First we should be clear we are talking about Python 2 only. Python 3 is different.
You need to decode it first, then encode it, and then print it. In Python 2, DON’T print unicode directly! Otherwise, if the system encodes it in an incompatible way (like “ascii”), an exception (a UnicodeEncodeError) will be raised.
You have to do all of this explicitly.
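As a minimal sketch of the explicit decode-then-encode steps (the variable names are made up for illustration, and the b'' prefix is used so the snippet also runs on Python 3, where Python 2's str roughly corresponds to bytes):

```python
# The two UTF-8 bytes for e-acute, as they would arrive from a file or terminal.
raw = b'\xc3\xa9'

# Step 1: decode the byte string into a unicode object (one character).
text = raw.decode('utf-8')

# Step 2: encode explicitly into the encoding the output expects.
encoded = text.encode('utf-8')

# Encoding with an incompatible codec such as ASCII fails,
# which is the exception mentioned above.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

Here raw holds two bytes while text holds a single character, and only the explicit UTF-8 encode round-trips back to the original bytes.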
The short answer is that “a” doesn’t have to be shown as “\x61”; “a” is simply more readable. A longer answer: in the interactive shell, if you type a value and press Enter, Python typically shows the repr() of that value. I think repr() tries to show everything using printable ASCII. “a” is already ASCII, so it is output directly. The str “é” is a UTF-8 encoded byte string, so Python escapes each byte and prints it as '\xc3\xa9'.
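The repr() behaviour can be checked with a small sketch like the one below. It is only an illustration: the names are hypothetical, and byte literals are used so that it behaves the same way on Python 3 (where the repr additionally carries a b prefix).

```python
ascii_bytes = b'a'          # a single ASCII byte
utf8_bytes = b'\xc3\xa9'    # the UTF-8 encoding of e-acute (two bytes)

ascii_repr = repr(ascii_bytes)   # printable ASCII is shown as-is
utf8_repr = repr(utf8_bytes)     # non-ASCII bytes are shown as \x escapes

# lower() on a byte string only touches ASCII letters (assuming the
# default "C" locale on Python 2), so these bytes come back unchanged.
lowered = utf8_bytes.lower()

print(ascii_repr)
print(utf8_repr)
```

So 'é'.lower() returns the same two bytes, and the interactive shell then shows their escaped repr.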