The Archive Base


Editorial Team
Asked: June 13, 2026


I have some questions about encoding in python 2.7.

1.The python code is as below,

#s = u"严"
s = u'\u4e25'
print 's is:', s
print 'len of s is:', len(s)
s1 = "a" + s
print 's1 is:', s1
print 'len of s1 is:', len(s1)

the output is:

s is: 严
len of s is: 1
s1 is: a严
len of s1 is: 2

I am confused about why the length of s is 1: how could 4e25 be stored in 1 byte? I also noticed that UCS-2 is 2 bytes long and UCS-4 is 4 bytes long, so why is the length of the unicode string s equal to 1?

2.
(1) Create a file named a.py with Notepad++ (Windows 7) and set the file's encoding to ANSI. The code in a.py is as below:

# -*- encoding:utf-8 -*-
import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

ascii
s: 严
type of s: <type 'str'>

(2) Create a file named b.py with Notepad++ (Windows 7) and set the file's encoding to UTF-8. The code in b.py is as below:

# -*- encoding:gbk -*-
import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

  File "D:\pyws\code\\b.py", line 1
SyntaxError: encoding problem: utf-8

(3) Change file b.py as below (the file's encoding is still UTF-8):

import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

ascii
s: 涓
type of s: <type 'str'>

(4) Change file a.py as below (the file's encoding is still ANSI):

import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

  File "D:\pyws\code\a1.py", line 3
SyntaxError: Non-ASCII character '\xd1' in file D:\pyws\code\a1.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Why are the outputs of these 4 cases in question 2 different? Can anybody explain it in detail?


1 Answer

  1. Editorial Team
    Added an answer on June 13, 2026 at 7:10 am

    Answer to Question 1:

    In Python versions before 3.3, the length of a Unicode string is the number of UTF-16 or UTF-32 code units it uses (depending on build flags), not the number of bytes. \u4e25 is one code unit, but on a UTF-16 build (the default on Windows) characters outside the Basic Multilingual Plane take two code units.

    >>> len(u'\u4e25')
    1
    >>> len(u'\U00010123')
    2
    

    In Python 3.3 and later, both calls above return 1.
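
The difference between characters, code units, and bytes can be checked directly. The following is a minimal sketch, assuming Python 3.3 or later (the thread itself uses Python 2.7); the helper name `utf16_units` is hypothetical and only serves the illustration.

```python
# Python 3 sketch: compare code points, UTF-16 code units, and UTF-8 bytes.

def utf16_units(text):
    """Count UTF-16 code units (hypothetical helper for illustration)."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" does not prepend a BOM.
    return len(text.encode("utf-16-le")) // 2

for ch in ("\u4e25", "\U00010123"):
    # ascii() keeps the output printable on any terminal encoding.
    print(ascii(ch),
          "code points:", len(ch),
          "UTF-16 units:", utf16_units(ch),
          "UTF-8 bytes:", len(ch.encode("utf-8")))
```

On a UTF-16 build of Python 2, `len()` would report the "UTF-16 units" column rather than the "code points" column.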

    Also, a displayed character can be composed of several code points using combining characters, such as é. The normalize function can be used to generate the composed or decomposed form:

    >>> import unicodedata as ud
    >>> ud.name(u'\xe9')
    'LATIN SMALL LETTER E WITH ACUTE'
    >>> ud.normalize('NFD',u'\xe9')
    u'e\u0301'
    >>> ud.normalize('NFC',u'e\u0301')
    u'\xe9'
    

    So even in Python 3.3, a single displayed character can consist of one or more code points, and it is best to normalize to one form or the other for consistent answers.
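
As a small sketch of why normalization matters for lengths and comparisons (Python 3 assumed; `same_text` is a hypothetical helper name, not part of `unicodedata`):

```python
import unicodedata

composed = "\xe9"        # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT, two code points

def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(len(composed), len(decomposed))    # lengths differ before normalization
print(composed == decomposed)            # raw comparison treats them as different
print(same_text(composed, decomposed))   # equal once both are normalized
```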

    Answer to Question 2:

    The encoding declared at the top of the file must agree with the encoding in which the file is saved. The declaration lets Python know how to interpret the bytes in the file.

    For example, the character 严 is saved as 3 bytes in a file saved as UTF-8, but two bytes in a file saved as GBK:

    >>> u'严'.encode('utf8')
    '\xe4\xb8\xa5'
    >>> u'严'.encode('gbk')
    '\xd1\xcf'
    

    If you declare the wrong encoding, the bytes are interpreted incorrectly and Python either displays the wrong characters or throws an exception.
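
A mismatch between the saved encoding and the declared one can be simulated without files by decoding the same bytes with the right and the wrong codec. This is only a sketch in Python 3, and `try_decode` is a hypothetical helper:

```python
text = "\u4e25"                      # the character 严
utf8_bytes = text.encode("utf-8")    # what a UTF-8 file stores (3 bytes)
gbk_bytes = text.encode("gbk")       # what a GBK file stores (2 bytes)

def try_decode(raw, codec):
    """Decode raw bytes with codec, returning an error note on failure."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason

for raw in (utf8_bytes, gbk_bytes):
    for codec in ("utf-8", "gbk"):
        # ascii() keeps the result printable on any terminal encoding.
        print(raw, "as", codec, "->", ascii(try_decode(raw, codec)))
```

Only the pairs where the codec matches the bytes round-trip to the original character.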

    Edit per comment

    2(1) – This is system dependent due to ANSI being the system locale default encoding. On my system that is cp1252 and Notepad++ can’t display a Chinese character. If I set my system locale to Chinese(PRC) then I get your results on a console terminal. The reason it works correctly in that case is a byte string is used and the bytes are just sent to the terminal. Since the file was encoded in ANSI on a Chinese(PRC) locale, the bytes the byte string contains are correctly interpreted by the Chinese(PRC) locale terminal.

    2(2) – The file is encoded in UTF-8 but the encoding is declared as GBK. Saving as UTF-8 in Notepad++ also writes a UTF-8-encoded byte order mark (BOM) at the start of the file. Python sees the BOM and treats the file as UTF-8, and per PEP 263 the only declaration allowed alongside a UTF-8 BOM is utf-8, so the conflicting GBK declaration fails on line 1 with "encoding problem: utf-8".
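
The interaction between a BOM and an encoding declaration can be reproduced with the standard library's `tokenize.detect_encoding`, which mimics the interpreter's source-encoding detection. This sketch assumes Python 3 (where the default without any hint is UTF-8 rather than ASCII); `detect` is a hypothetical wrapper:

```python
import codecs
import io
import tokenize

def detect(source_bytes):
    """Return the detected source encoding, or the SyntaxError message."""
    try:
        encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
        return encoding
    except SyntaxError as exc:
        return "SyntaxError: {}".format(exc)

bom = codecs.BOM_UTF8  # the bytes EF BB BF that Notepad++ writes for "UTF-8"
with_conflict = bom + b"# -*- encoding:gbk -*-\nimport sys\n"
bom_only = bom + b"import sys\n"

print(detect(with_conflict))  # BOM plus a non-UTF-8 declaration is rejected
print(detect(bom_only))       # BOM alone is enough to select UTF-8
```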

    2(3) – The file is encoded in UTF-8 (with BOM), but is missing an encoding declaration. Python recognizes the UTF-8-encoded BOM and uses UTF-8 as the encoding, but the terminal is GBK. Since a byte string was used, the UTF-8-encoded bytes are sent to the GBK terminal and you get:

    >>> u'严'.encode('utf8')
    '\xe4\xb8\xa5'
    >>> '\xe4\xb8'.decode('gbk')
    u'\u6d93'
    >>> print '\xe4\xb8'.decode('gbk')
    涓
    

    In this case I am surprised, because the trailing byte \xa5 is silently dropped from the output (presumably by the terminal, since Python only writes the raw bytes), whereas when I explicitly decode incorrectly, Python throws an exception:

    >>> u'严'.encode('utf8').decode('gbk')
    Traceback (most recent call last):
      File "<interactive input>", line 1, in <module>
    UnicodeDecodeError: 'gbk' codec can't decode byte 0xa5 in position 2: incomplete multibyte sequence
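
The two behaviours can be put side by side in a short Python 3 sketch. Using `errors="ignore"` to stand in for what the terminal does with the stray byte is an assumption made for illustration, not something the console is documented to do:

```python
utf8_bytes = "\u4e25".encode("utf-8")   # b'\xe4\xb8\xa5'

# Strict decoding as GBK fails on the incomplete trailing byte 0xa5.
strict_failed = False
try:
    utf8_bytes.decode("gbk")
except UnicodeDecodeError:
    strict_failed = True

# A lenient decoder that skips undecodable bytes keeps only the first pair.
lenient = utf8_bytes.decode("gbk", errors="ignore")

print(strict_failed, ascii(lenient))
```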
    

    2(4) – In this case, the encoding is ANSI (GBK) but no encoding is declared, and there is no BOM like in UTF-8 to give Python a hint, so it assumes ASCII and can't handle the GBK-encoded character on line 3.
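
The '\xd1' in that error message is simply the first byte of the GBK encoding of 严 that falls outside the ASCII range. A rough Python 3 sketch of such a scan (the function `first_non_ascii` is hypothetical and is not how the interpreter is actually implemented):

```python
gbk_bytes = "\u4e25".encode("gbk")   # b'\xd1\xcf'

def first_non_ascii(raw):
    """Return (index, byte) of the first byte above 0x7F, or None."""
    for index, byte in enumerate(raw):
        if byte > 0x7F:
            return index, byte
    return None

hit = first_non_ascii(gbk_bytes)
print(hit[0], hex(hit[1]))
print(first_non_ascii(b"import sys"))
```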
