What is the best way to (losslessly) convert Unicode to a lower-order byte encoding (8 bits), in a language-independent way? I want a format that is standard, i.e. has widespread library support for conversion in both directions.
If I were using Python, I would use repr:
In [1]: x = u"Российская Федерация"
In [2]: repr(x)
Out[2]: "u'\\xd0\\xa0\\xd0\\xbe\\xd1\\x81\\xd1\\x81\\xd0\\xb8\\xd0\\xb9\\xd1\\x81\\xd0\\xba\\xd0\\xb0\\xd1\\x8f \\xd0\\xa4\\xd0\\xb5\\xd0\\xb4\\xd0\\xb5\\xd1\\x80\\xd0\\xb0\\xd1\\x86\\xd0\\xb8\\xd1\\x8f'"
However, I’m looking for a format that has good library support for converting the second string back to the first, in a variety of languages.
If that’s what you see, your terminal is set up wrong: it’s treating UTF-8 input as ISO-8859-1 (or cp1252 in the case of the Windows console, which can’t be set up correctly).
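The proper Python repr of Российская Федерация would be the Unicode literal:
u'\u0420\u043e\u0441\u0441\u0438\u0439\u0441\u043a\u0430\u044f \u0424\u0435\u0434\u0435\u0440\u0430\u0446\u0438\u044f'
Which as it happens is pretty close to the JavaScript/JSON string literal:
"\u0420\u043e\u0441\u0441\u0438\u0439\u0441\u043a\u0430\u044f \u0424\u0435\u0434\u0435\u0440\u0430\u0446\u0438\u044f"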
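To see how close the two literals are, here is a minimal sketch. It assumes Python 3, where ascii() plays roughly the role that repr() played for unicode strings in Python 2; the variable names are only illustrative.

```python
import json

text = "Российская Федерация"

# ascii() escapes every non-ASCII character as \uXXXX (for BMP characters),
# much like Python 2's repr() of a unicode string, minus the u prefix.
python_literal = ascii(text)

# json.dumps() also escapes non-ASCII characters by default (ensure_ascii=True),
# but wraps the result in double quotes.
json_literal = json.dumps(text)

print(python_literal)
print(json_literal)

# Only the JSON form is meant to be parsed back by other languages.
restored = json.loads(json_literal)
```

For this particular string the two outputs differ only in the quote characters, but that is not guaranteed in general, which is why the JSON form is the safer one to exchange.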
If you want a 7-bit-safe (ASCII) representation of a Unicode string, JSON is a reasonable choice of format. Get it by using json.dumps() rather than hacking the Python repr, since there are some subtle inconsistencies between the two formats.
Other well-understood ASCII representations you could try include URL-encoding (%D0%A0%D0%BE...) and XML character escapes (<value>&#1056;&#1086;&#1089;...</value>).
If you only need an arbitrary binary representation that doesn’t need to be 7-bit safe, as Max mentioned, just .encode('utf-8').
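The options above can be compared side by side. This is a rough sketch assuming Python 3 and only the standard library; the variable names are made up for illustration, and using html.unescape() to reverse the XML character references is one convenient choice, not the only one.

```python
import html
import json
from urllib.parse import quote, unquote

text = "Российская Федерация"

# JSON string literal: ASCII-only by default, double-quoted.
as_json = json.dumps(text)

# URL-encoding: percent-escapes of the UTF-8 bytes.
as_url = quote(text)

# XML numeric character references (decimal code points).
as_xml = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Plain UTF-8 bytes: 8-bit, not 7-bit safe.
as_utf8 = text.encode("utf-8")

# Each representation converts back to the original string.
round_trips = [
    json.loads(as_json),
    unquote(as_url),
    html.unescape(as_xml),
    as_utf8.decode("utf-8"),
]
print(all(value == text for value in round_trips))
```

All four are lossless; the first three stay within ASCII at the cost of size, while UTF-8 is the most compact but needs an 8-bit-clean channel.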