I’m handling an encoding problem.
My input is a unicode string, such as:
>>> s
u'\xa6\xe8\xac\xc9'
The string actually holds cp950-encoded bytes. Decoding works if I start from a byte string (note that there is no “u” prefix):
>>> print unicode('\xa6\xe8\xac\xc9', 'cp950')
西界
However, I don’t know how to get rid of that “u”.
Direct conversion is not working:
>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
The result of using encode() is not what I wanted:
>>> s.encode('utf8')
'\xc2\xa6\xc3\xa8\xc2\xac\xc3\x89'
What I want is '\xa6\xe8\xac\xc9'.
This is a bit of an abuse of the unicode type. Characters in a unicode string are expected to be Unicode codepoints (e.g. u'\u897f\u754c'), and thus are encoding-agnostic. They are not supposed to be bytes from a specific encoding (Python 3 makes this distinction very clear by separating Unicode strings, str, from byte strings, bytes).

Since you just want to interpret each codepoint as a byte, you can encode the string as ISO-8859-1 with s.encode('latin-1'). This works because the first 256 codepoints of Unicode are defined to be equal to the codepoints of ISO-8859-1. However, please try to fix the issue that gave you this incorrect Unicode string in the first place.
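As a rough illustration of the round trip described above, here is a minimal sketch. It assumes every character in the string is below U+0100, so that each codepoint maps to exactly one byte. The helper name codepoints_to_bytes is hypothetical and only there for readability. The u'' literals are used so the snippet should behave the same on Python 2 and Python 3.

```python
def codepoints_to_bytes(mis_decoded):
    """Map each codepoint U+0000..U+00FF to the byte with the same value.

    Hypothetical helper: ISO-8859-1 (latin-1) is the encoding whose byte
    values coincide with the first 256 Unicode codepoints.
    """
    # Raises UnicodeEncodeError if any codepoint is above U+00FF,
    # which would mean the string is not just "bytes in disguise".
    return mis_decoded.encode('latin-1')


s = u'\xa6\xe8\xac\xc9'        # the mis-typed unicode string from the question
raw = codepoints_to_bytes(s)   # byte string '\xa6\xe8\xac\xc9'
text = raw.decode('cp950')     # proper Unicode text, u'\u897f\u754c'
```

If the encode step raises UnicodeEncodeError, the string contains real Unicode characters beyond U+00FF and this trick does not apply; that is another reason to fix the source of the bad string rather than rely on the workaround.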