Consider the following example:
>>> s = u"баба"
>>> s
u'\xe1\xe0\xe1\xe0'
>>> print s
áàáà
I'm using the cp1251 encoding in IDLE, but it seems the interpreter actually uses latin1 to create the unicode string:
>>> print s.encode('latin1')
баба
Why is that? Is there a spec for this behavior?
CPython, 2.7.
Edit
The code I was actually looking for is
>>> u'\xe1\xe0\xe1\xe0' == u'\u00e1\u00e0\u00e1\u00e0'
True
It seems that when encoding unicode with the latin1 codec, all code points less than 256 are simply left as is, thus resulting in the same bytes I typed in before.
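A minimal sketch of that observation, assuming nothing beyond the standard codecs. The variable names are mine, and the explicit u'' and b'' prefixes are there only so that the snippet behaves the same on Python 2.7 and Python 3:

```python
# The unicode string the interpreter ended up with in the session above.
s = u'\xe1\xe0\xe1\xe0'

# \xNN and \u00NN are two spellings of the same code point in a unicode literal.
same_literal = (s == u'\u00e1\u00e0\u00e1\u00e0')

# latin1 maps every code point below 256 to the single byte with the same value,
# so the code points and the encoded byte values are the same numbers.
encoded = s.encode('latin1')
code_points = [ord(ch) for ch in s]
byte_values = list(bytearray(encoded))

# Reading those same bytes as cp1251 gives the Cyrillic text that was typed.
as_cp1251 = encoded.decode('cp1251')
```

Here code_points and byte_values hold the same four numbers, which is why the latin1 round trip hands back exactly the bytes that were typed.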
When you type a character such as б into the terminal, you see a б, but what is really input is a sequence of bytes.

Since your terminal encoding is cp1251, typing баба results in the sequence of bytes equal to the unicode баба encoded in cp1251, namely '\xe1\xe0\xe1\xe0'. (To reproduce those bytes I have to start from utf-8, because my terminal encoding is utf-8, not cp1251. For me, "баба".decode('utf-8') is just unicode for баба, and encoding that with cp1251 gives the bytes above.)

Since typing баба results in the sequence of bytes \xe1\xe0\xe1\xe0, when you type u"баба" into the terminal, Python receives u'\xe1\xe0\xe1\xe0' instead. This is why you are seeing u'\xe1\xe0\xe1\xe0' when you evaluate s. This unicode happens to represent áàáà.

And when you type print s.encode('latin1'), the latin1 encoding converts u'\xe1\xe0\xe1\xe0' to '\xe1\xe0\xe1\xe0'. The terminal receives the sequence of bytes '\xe1\xe0\xe1\xe0' and decodes them with cp1251, thus printing баба.

Try s = "баба" (without the u) instead. Or, decode those bytes with cp1251 to make s unicode. Or, use the verbose but very explicit (and terminal-encoding agnostic) named escapes, \N{CYRILLIC SMALL LETTER BE} and so on. Or the short but less-readily comprehensible \u escapes, u'\u0431\u0430\u0431\u0430'.
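The alternatives above can be sketched roughly as follows. This is an illustration, not a transcript of a session: typed_bytes is a hypothetical stand-in for what a cp1251 terminal sends, and the explicit u'' and b'' prefixes are used so the snippet runs the same way on Python 2.7 and Python 3.

```python
# Hypothetical stand-in for the bytes a cp1251 terminal sends for the typed word.
typed_bytes = b'\xe1\xe0\xe1\xe0'

# What went wrong: the bytes were treated as latin1, one code point per byte.
s_misread = typed_bytes.decode('latin1')

# Alternative 1: decode the bytes with the terminal's actual encoding.
s_decoded = typed_bytes.decode('cp1251')

# Alternative 2: verbose but explicit named escapes, independent of any terminal.
s_named = (u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}'
           u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}')

# Alternative 3: short \u escapes for the same four code points.
s_escaped = u'\u0431\u0430\u0431\u0430'

# A misread string can be repaired by undoing the latin1 step and redoing it as cp1251.
recovered = s_misread.encode('latin1').decode('cp1251')
```

All three alternatives produce the same unicode string, and recovered equals it too. The escape-based forms have the advantage of not depending on what encoding the terminal or editor happens to use.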