OK, so I have another Python Unicode problem. In IDLE on Windows 7, the following code:

Asked: May 26, 2026 (2026-05-26T16:48:10+00:00)

OK, so I have another Python Unicode problem. In IDLE on Windows 7, the following code:

uni = u"\u4E0D\u65E0"
binary = uni.encode("utf-8")
print binary

prints two Chinese characters, 不无, which are the correct ones. However, if I replace the first line with

uni = u"\u65E0"

i.e. only the second character, it prints æ— instead. However, if I replace it with only the first character

u"\u4E0D"

it gives the correct output, 不.

Is this a bug, or what am I doing wrong?

COMPLETE CODE:

uni = u"\u4E0D\u65E0"
binary = uni.encode("utf-8")
print binary

uni = u"\u65E0"
binary = uni.encode("utf-8")
print binary

uni = u"\u4E0D"
binary = uni.encode("utf-8")
print binary

OUTPUT:

不无

æ—

不

1 Answer

Editorial Team · Answer 1 · 2026-05-26T16:48:11+00:00

The unicode string u"\u4E0D\u65E0" consists of the two text characters 不 and 无.

When a unicode string is encoded, it is converted into a sequence of bytes (a byte string, not “binary”). Depending on which encoding is used, there may not be a one-to-one mapping between text characters and bytes. The “utf8” encoding, for instance, uses from one to four bytes to represent a single character; each of the characters in this question takes three:

>>> u'\u65E0'.encode('utf8')
'\xe6\x97\xa0'
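
As a rough illustration of the variable-width point, the sketch below counts the UTF-8 bytes for a few sample characters. It is written for Python 3 (the session above is Python 2), and the sample characters other than 无 are chosen here purely for illustration.

```python
# Python 3 sketch: how many bytes UTF-8 needs for different characters.
# The samples are illustrative; only U+65E0 comes from the question.
samples = ["A", "\u00e9", "\u65e0", "\U0001F600"]

# Map each character to the length of its UTF-8 encoding.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    # Print the code point rather than the character itself, so the
    # output does not depend on what the console can display.
    print("U+%04X -> %d byte(s)" % (ord(ch), n))
```

The ASCII letter takes one byte, the accented Latin letter two, the Chinese character three, and the character outside the Basic Multilingual Plane four.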

Now, before a sequence of bytes can be displayed as text, Python (or IDLE) has to decode it. But since it has no way of knowing which encoding was used, it is forced to guess. For some reason, it appears that IDLE may have wrongly guessed “cp1252” for one of the examples:

>>> text = u'\u65E0'.encode('utf8').decode('cp1252')
>>> text
u'\xe6\u2014\xa0'
>>> print text
æ—

Note that there are three characters in text – the last one is a non-breaking space.
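
To make the three characters visible, here is a small Python 3 sketch (an assumption on my part, since the session above is Python 2) that looks up the Unicode name of each one. It also shows that this particular mis-decoding is reversible: encoding the garbled text back to cp1252 recovers the original bytes, which can then be decoded as UTF-8.

```python
import unicodedata

# The UTF-8 bytes for U+65E0, wrongly decoded as cp1252.
raw = "\u65e0".encode("utf-8")
text = raw.decode("cp1252")

# Name each resulting character; the last one is invisible when printed.
names = [unicodedata.name(ch) for ch in text]
for ch, name in zip(text, names):
    print("U+%04X %s" % (ord(ch), name))

# Undo the wrong guess: back to the original bytes, then decode as UTF-8.
repaired = text.encode("cp1252").decode("utf-8")
print(ascii(repaired))
```

The round trip only works here because every byte happened to be defined in cp1252; it is not a general repair for wrongly decoded text.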

EDIT

Strictly speaking, IDLE wrongly guesses “cp1252” for all three examples. The second one only “succeeds” because each of its bytes coincidentally maps to a valid text character (“cp1252” is an 8-bit, single-byte encoding). The other two examples contain the byte \x8d, which is not defined in “cp1252”. For these cases, IDLE (eventually) falls back to “utf8”, which gives the correct output.
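
The behaviour described above can be imitated with a short Python 3 sketch. The `guess_decode` helper is hypothetical and is not IDLE's actual code; it only assumes the "try cp1252 first, then fall back to utf8" order suggested by the observed output.

```python
def guess_decode(data):
    """Hypothetical helper: try cp1252 first, fall back to UTF-8."""
    try:
        return data.decode("cp1252")
    except UnicodeDecodeError:
        # Reached when a byte such as 0x8D has no cp1252 mapping.
        return data.decode("utf-8")


# The three cases from the question, in the same order.
both = guess_decode("\u4e0d\u65e0".encode("utf-8"))
second = guess_decode("\u65e0".encode("utf-8"))
first = guess_decode("\u4e0d".encode("utf-8"))

# ascii() keeps the output independent of the console encoding.
for result in (both, second, first):
    print(ascii(result))
```

Only the middle case comes back garbled, because its bytes (\xe6, \x97, \xa0) are all defined in cp1252, so the first guess never fails and the fallback is never reached.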
