In Python 2.7’s documentation, three rules about Unicode are described as follows:
If the code point is < 128, it’s represented by the corresponding byte value.
If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
Code points > 0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
Then I ran some tests:
>>> unichr(40960)
u'\ua000'
>>> ord(u'\ua000')
40960
In my view, 40960 is a code point > 0x7ff, so it should be turned into a three- or four-byte sequence in which each byte is between 128 and 255. But it seems to be turned into only a two-byte sequence, and the value '00' in u'\ua000' is lower than 128, which doesn't match the rules above. Why?
What's more, I found other Unicode characters, such as u'\u1234'. The values in it ('12' and '34') are also lower than 128, but according to the rules quoted above they shouldn't be. Is there some other rule that I'm missing?
Thanks for all answers.
That is a description of the UTF-8 encoding.
\ua000 is an escape sequence representing a Unicode character. The a000 is a hexadecimal representation of the numerical code point value. It has nothing to do with UTF-8 encoding. You get UTF-8 encoding when you explicitly encode a unicode string using the UTF-8 encoding.
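To make the distinction concrete, here is a small sketch that compares the code point with the bytes produced by an explicit encode. The variable names and the sample characters are only illustrative, and the snippet is written so that it should run on both Python 2.7 and Python 3:

```python
# Code point vs. UTF-8 bytes (illustrative names, not from the docs).
ch = u'\ua000'

# The escape \ua000 names the code point directly, in hexadecimal.
code_point = ord(ch)                  # 0xa000 == 40960

# Bytes only exist after an explicit encode.
utf8 = ch.encode('utf-8')
utf8_values = list(bytearray(utf8))   # integer value of each byte

print(hex(code_point))                # 0xa000
print(utf8_values)                    # [234, 128, 128]

# One sample character from each range in the quoted rules.
samples = [u'A', u'\u00e9', u'\u1234', u'\U00010000']
lengths = [len(s.encode('utf-8')) for s in samples]
print(lengths)                        # [1, 2, 3, 4]
```

So u'\ua000' does become a three-byte sequence once encoded, and each of those bytes is between 128 and 255, as the rules say. The digits in the escape are just the code point written in hex.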