In [1]: str='美'
In [2]: str.encode('utf-8')
Out[2]: b'\xe7\xbe\x8e'
In [3]: str.encode('utf-16')
Out[3]: b'\xff\xfe\x8e\x7f'
In [4]: str.encode('ascii')
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
/Users/XXXuserXXXTemp/<ipython-input-4-c7b96e3e54a7> in <module>()
----> 1 str.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\u7f8e' in position 0: ordinal not in range(128)
The string str holds a single Chinese/Japanese character.
- Why does ascii not work?
- How should Out[2] and Out[3] be understood, i.e. what are they really?
'美' is not an ASCII character: its code point, U+7F8E, lies outside the ASCII range (0 to 127), so the ascii codec cannot encode it. The Unicode tutorial for Python covers this.
Out[2] and Out[3] are byte strings (not character strings).
Out[2] is the sequence of bytes that represents the 美 code point in UTF-8 code units. The notation \xe7 means a byte with the hexadecimal value e7.
Out[3] is the sequence of bytes that represents the 美 code point in UTF-16 code units. Its first two bytes, \xff\xfe, are the byte order mark (BOM) that the utf-16 codec prepends; the remaining \x8e\x7f is the single code unit 0x7f8e stored in little-endian order.
To understand the distinction between characters, bytes, and code units, read the Unicode tutorial for Python carefully and completely. For another fairly good treatment of the same material, read Joel Spolsky’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). You should know this much, no excuses!
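To make the character / code point / bytes distinction concrete, here is a minimal sketch you can run yourself. The variable name ch is my own choice (it avoids shadowing the built-in str as the session in the question does), and the utf-16-le / utf-16-be codecs and errors='replace' are extras not used in the question; they are shown only to illustrate the byte order and the ASCII limit.

```python
# Illustrative sketch; `ch` and the other names are not from the original session.
ch = '美'

code_point = ord(ch)             # the character's integer code point
in_ascii = code_point < 128      # ASCII only covers code points 0..127

utf8 = ch.encode('utf-8')            # three bytes for this character
utf16_le = ch.encode('utf-16-le')    # one 16-bit code unit, little-endian, no BOM
utf16_be = ch.encode('utf-16-be')    # same code unit, big-endian, no BOM

print(hex(code_point), in_ascii)     # code point in hex, and whether it fits in ASCII
print(list(utf8))                    # the UTF-8 bytes as integers
print(utf16_le.hex(), utf16_be.hex())

# Decoding the bytes with the same codec gives the original character back.
print(utf8.decode('utf-8') == ch)

# With errors='replace' the ascii codec substitutes '?' instead of raising.
print(ch.encode('ascii', errors='replace'))
```

Comparing the utf-16-le result with Out[3] shows that the plain utf-16 codec emitted the same two bytes, just preceded by the BOM.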