Ok so I have another Python Unicode problem. In IDLE on Windows 7, the following code:
uni = u"\u4E0D\u65E0"
binary = uni.encode("utf-8")
print binary
prints the two correct Chinese characters, 不无. However, if I replace the first line with
uni = u"\u65E0"
i.e. only the second character, it prints æ— instead. Although if I replace it with only the first character
u"\u4E0D"
it gives the correct output, 不.
Is this a bug, or what am I doing wrong?
COMPLETE CODE:
uni = u"\u4E0D\u65E0"
binary = uni.encode("utf-8")
print binary
uni = u"\u65E0"
binary = uni.encode("utf-8")
print binary
uni = u"\u4E0D"
binary = uni.encode("utf-8")
print binary
OUTPUT:
不无
æ—
不
The unicode string u"\u4E0D\u65E0" consists of the two text characters 不 and 无. When a unicode string is encoded, it is converted into a sequence of bytes (not binary). Depending on which encoding is used, there may not be a one-to-one mapping of text characters to bytes. The “utf8” encoding, for instance, can use from one to four bytes to represent a single character (three for each of the two characters here):
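As a rough illustration of that mapping, the sketch below lists the UTF-8 bytes for a few characters, including the two from the question. The helper name utf8_hex is made up for this example; it only wraps the standard encode call.

```python
def utf8_hex(char):
    """Return the UTF-8 bytes of a unicode string as a list of hex strings.

    utf8_hex is a hypothetical helper for illustration only.
    """
    return [hex(b) for b in bytearray(char.encode("utf-8"))]


# One ASCII letter, one accented Latin letter, and the two characters
# from the question.
for char in (u"A", u"\u00E9", u"\u4E0D", u"\u65E0"):
    print("%r -> %s" % (char, utf8_hex(char)))
```

The ASCII letter takes one byte, the accented letter two, and each Chinese character three.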
Now, before a sequence of bytes can be printed, Python (or IDLE) has to try to decode it. But since it has no way of knowing which encoding was used, it is forced to guess. For some reason, it appears that IDLE may have wrongly guessed “cp1252” for one of the examples:
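A small sketch of what such a wrong guess would look like for the second example. It decodes the bytes explicitly rather than relying on IDLE, so it shows the effect of the guess, not IDLE's actual code path.

```python
# The UTF-8 bytes of the second character: E6 97 A0.
data = u"\u65E0".encode("utf-8")

# Wrong guess: cp1252 is a single-byte encoding, so each byte becomes
# one character.
text = data.decode("cp1252")
print(len(text))
# The three characters are the ae ligature, an em dash and a
# no-break space.
print(text == u"\u00E6\u2014\u00A0")

# Right guess: the same three bytes form a single character.
print(data.decode("utf-8") == u"\u65E0")
```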
Note that there are three characters in text: the last one is a non-breaking space.
EDIT:
Strictly speaking, IDLE wrongly guesses “cp1252” for all three examples. The second one only “succeeds” because each byte coincidentally maps to a valid text character (“cp1252” is an 8-bit, single-byte encoding). The other two examples contain the byte \x8d, which is not defined in “cp1252”. For these cases, IDLE (eventually) falls back to “utf8”, which gives the correct output.
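The guess-then-fall-back behaviour described above can be modelled roughly as follows. This is only a sketch: guess_decode is a hypothetical function written for this answer, and the order of encodings is an assumption based on the observed output, not something taken from IDLE's source.

```python
def guess_decode(data, encodings=("cp1252", "utf-8")):
    """Try each encoding in order and return (text, encoding_used).

    Hypothetical model of the guessing described above, not IDLE's code.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            # e.g. the byte \x8d has no mapping in cp1252
            continue
    raise ValueError("none of the encodings could decode the data")


# The three strings from the question, in the same order.
for s in (u"\u4E0D\u65E0", u"\u65E0", u"\u4E0D"):
    text, enc = guess_decode(s.encode("utf-8"))
    print("%s -> %d character(s)" % (enc, len(text)))
```

Under this model the first and third strings fail the “cp1252” attempt because of \x8d and end up decoded as “utf8”, while the second decodes “successfully” as three “cp1252” characters.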