>>> a = "我" # a Chinese character
>>> b = unicode(a,"gb2312")
>>> a.__class__
<type 'str'>
>>> b.__class__
<type 'unicode'> # b is unicode
>>> a
'\xce\xd2'
>>> b
u'\u6211'
>>> c = u"我"
>>> c.__class__
<type 'unicode'> # c is unicode
>>> c
u'\xce\xd2'
b and c are both unicode, but >>> b outputs u'\u6211' while >>> c outputs u'\xce\xd2'. Why?
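The two results differ only in how the same two bytes were turned into text, and that can be checked directly. What follows is a minimal sketch, assuming Python 3 (where bytes and text are separate types) rather than the Python 2 session above. The names raw, right and wrong are purely illustrative, and Latin-1 is used here only because it maps each byte to the code point of the same value.

```python
# Python 3 sketch; in the Python 2 session above, str holds raw bytes.
raw = "\u6211".encode("gb2312")   # U+6211 is 我; these are the bytes a GB2312 terminal sends
print(raw)                        # b'\xce\xd2'

# Decoding with the matching codec recovers one character (like b above).
right = raw.decode("gb2312")

# Treating each byte as a code point of the same value yields two
# characters, U+00CE and U+00D2 (like c above).
wrong = raw.decode("latin-1")

print(ascii(right), len(right))   # '\u6211' 1
print(ascii(wrong), len(wrong))   # '\xce\xd2' 2
```

In other words, b was decoded with the terminal's real encoding, while c behaves as if each byte had been copied straight into a code point.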
When you enter "我", the Python interpreter gets from the terminal a representation of that character in your local character set, which it stores in a string byte-for-byte because of the "". On my UTF-8 system, that’s '\xe6\x88\x91'. On yours, it’s '\xce\xd2' because you use GB2312. That explains the value of your variable a.

When you enter u"我", the Python interpreter doesn’t know which encoding the 我 character is in. What it does is pretty much the same as for an ordinary string: it stores the bytes of the character in a Unicode string, interpreting each byte as a Unicode codepoint, hence the wrong result u'\xce\xd2' (or, on my box, u'\xe6\x88\x91').

This problem only exists in the interactive interpreter. When you write Python scripts or modules, you can specify the encoding near the top and Unicode strings will come out right. E.g., on my system, the following prints the word liberté twice: