>>> a = "我" # chinese >>> b = unicode(a,"gb2312") >>> a.__class__ <type 'str'>

Question

Asked: June 2, 2026 (2026-06-02T15:57:27+00:00)

>>> a = "我"  # chinese  
>>> b = unicode(a,"gb2312")  
>>> a.__class__   
<type 'str'>   
>>> b.__class__   
<type 'unicode'>  # b is unicode
>>> a
'\xce\xd2'
>>> b
u'\u6211' 

>>> c = u"我"
>>> c.__class__
<type 'unicode'>  # c is unicode
>>> c
u'\xce\xd2'

b and c are both unicode, but >>> b outputs u'\u6211' while >>> c outputs u'\xce\xd2'. Why?

1 Answer

Editorial Team · Answer 1 · 2026-06-02T15:57:29+00:00

When you enter "我", the Python interpreter receives from the terminal the bytes that represent that character in your local character set, and it stores them byte-for-byte because a plain "" literal is a byte string (str). On my UTF-8 system, that's '\xe6\x88\x91'. On yours, it's '\xce\xd2' because you use GB2312. That explains the value of your variable a.
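As a rough illustration of that point, the sketch below encodes the same character with both encodings. It is written to run on Python 2 and Python 3 alike; the variable names are illustrative and not from the question, and the character is given as an escape so the result does not depend on any terminal or source encoding.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: one character, two terminal encodings.
ch = u"\u6211"  # the character 我, written as an escape

utf8_bytes = ch.encode("utf-8")   # the bytes a UTF-8 terminal would send
gb_bytes = ch.encode("gb2312")    # the bytes a GB2312 terminal would send

# The two byte strings differ in both content and length.
print(repr(utf8_bytes))
print(repr(gb_bytes))
```

The repr formatting differs between Python 2 and Python 3 (the latter adds a b prefix), but the byte values are the same ones quoted above.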

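When you enter u"我", the Python interpreter doesn't know which encoding the 我 character is in. What it does is pretty much the same as for an ordinary string: it stores the bytes of the character in a Unicode string, interpreting each byte as a separate Unicode codepoint (in effect, decoding the bytes as Latin-1), hence the wrong result u'\xce\xd2' (or, on my box, u'\xe6\x88\x91'). Your variable b comes out right because unicode(a,"gb2312") names the encoding explicitly, so the two bytes are decoded into the single codepoint u'\u6211'.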
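The difference between the two decodings can be reproduced outside the interactive prompt. This is a minimal sketch that assumes the bytes are the GB2312 ones from the question; the names raw, correct, wrong and repaired are hypothetical, and the code runs on both Python 2 and Python 3.

```python
# Illustrative sketch: the same two bytes decoded two ways.
raw = b"\xce\xd2"  # GB2312 bytes for the character from the question

correct = raw.decode("gb2312")   # what unicode(a, "gb2312") does
wrong = raw.decode("latin-1")    # one codepoint per byte, as in variable c

print(len(correct))  # one character
print(len(wrong))    # two characters, one per byte

# A mis-decoded string like c can be repaired by reversing the
# byte-per-codepoint step and decoding with the real encoding.
repaired = wrong.encode("latin-1").decode("gb2312")
print(repaired == correct)
```

The repair step only works when every codepoint in the broken string is below U+0100, which is the case for a string produced by this kind of byte-for-byte misreading.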

This problem only exists in the interactive interpreter. When you write Python scripts or modules, you can specify the encoding near the top and Unicode strings will come out right. E.g., on my system, the following prints the word liberté twice:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

print(u"liberté")
print("liberté")
