What is the best way to (losslessly) convert Unicode to a lower-order byte encoding

Question

Asked: May 26, 2026

What is the best way to (losslessly) convert Unicode to a lower-order byte encoding (8 bits), in a language-agnostic way? I want a format that is standard, i.e. has widespread library support for conversion in both directions.

If I were using Python, I would use repr:

In [1]: x = u"Российская Федерация"

In [2]: repr(x)
Out[2]: "u'\\xd0\\xa0\\xd0\\xbe\\xd1\\x81\\xd1\\x81\\xd0\\xb8\\xd0\\xb9\\xd1\\x81\\xd0\\xba\\xd0\\xb0\\xd1\\x8f \\xd0\\xa4\\xd0\\xb5\\xd0\\xb4\\xd0\\xb5\\xd1\\x80\\xd0\\xb0\\xd1\\x86\\xd0\\xb8\\xd1\\x8f'"

However, I’m looking for a format that has good library support for converting the second string back to the first, in a variety of languages.

1 Answer

Editorial Team · Answer 1 · 2026-05-26T16:40:02+00:00

Out[2]: "u'\xd0\xa0\xd0\xbe\xd1\x81\xd1\x81\xd0\xb8\xd0\xb9\xd1\x81\xd0\xba\xd0\xb0\xd1\x8f \xd0\xa4\xd0\xb5\xd0\xb4\xd0\xb5\xd1\x80\xd0\xb0\xd1\x86\xd0\xb8\xd1\x8f'"

If that’s what you see, your terminal is set up wrong: it’s treating UTF-8 input as ISO-8859-1 (or as cp1252 in the case of the Windows console, which can’t be set up correctly).
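To make the misreading concrete, here is a minimal sketch of how that output arises. It assumes Python 3 (where every str is Unicode) rather than the Python 2 session in the question, and the variable names are illustrative only.

```python
text = "Российская Федерация"

# What the terminal should have been given: the UTF-8 bytes of the string.
utf8_bytes = text.encode("utf-8")

# What a misconfigured terminal does: it reads each byte as one ISO-8859-1 character.
misread = utf8_bytes.decode("iso-8859-1")

# The first two "characters" are U+00D0 and U+00A0, matching \xd0\xa0 in the output above.
print([hex(ord(c)) for c in misread[:2]])  # ['0xd0', '0xa0']

# The damage is reversible as long as no bytes were lost along the way.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired == text)  # True
```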

The proper Python repr of Российская Федерация would be the Unicode literal:

u'\u0420\u043e\u0441\u0441\u0438\u0439\u0441\u043a\u0430\u044f \u0424\u0435\u0434\u0435\u0440\u0430\u0446\u0438\u044f'

As it happens, that is pretty close to the JavaScript/JSON string literal:

"\u0420\u043e\u0441\u0441\u0438\u0439\u0441\u043a\u0430\u044f \u0424\u0435\u0434\u0435\u0440\u0430\u0446\u0438\u044f"

If you want a 7-bit-safe (ASCII) representation of a Unicode string, JSON is a reasonable choice of format. Get it by using json.dumps(), though, rather than by hacking the Python repr, since there are some subtle inconsistencies between the two formats.
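As a rough illustration of the JSON route, the sketch below round-trips the string through the standard json module. It assumes Python 3 and relies on json.dumps() escaping non-ASCII characters by default (ensure_ascii=True).

```python
import json

text = "Российская Федерация"

# With the default ensure_ascii=True, every non-ASCII character becomes a \uXXXX escape.
encoded = json.dumps(text)
print(encoded)

# Any JSON parser, in any language, should be able to turn this back into the original string.
decoded = json.loads(encoded)
print(decoded == text)  # True
```

The printed value should be the same JSON literal shown above, surrounding double quotes included.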

Other well-understood ASCII representations you could try might include URL-encoding (%D0%A0%D0%BE...) and XML character escapes (<value>&#1056;&#1086;&#1089;...</value>).
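Both alternatives can be sketched with the Python 3 standard library. This is one possible approach, not the only one: urllib.parse.quote percent-encodes the UTF-8 bytes, and the "xmlcharrefreplace" error handler produces decimal character references. Escaping markup characters first with xml.sax.saxutils.escape is an added assumption, so that a literal & or < in the input also survives the round trip.

```python
from html import unescape
from urllib.parse import quote, unquote
from xml.sax.saxutils import escape

text = "Российская Федерация"

# URL-encoding: each UTF-8 byte outside the unreserved set becomes %XX.
url_form = quote(text)
print(url_form[:12])  # %D0%A0%D0%BE
url_back = unquote(url_form)

# XML character escapes: escape &, < and > first, then replace every
# non-ASCII character with a decimal reference such as &#1056;.
xml_form = escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")
print(xml_form[:21])  # &#1056;&#1086;&#1089;
xml_back = unescape(xml_form)

print(url_back == text, xml_back == text)  # True True
```

In a real XML document, the parser on the receiving side would resolve the character references itself; html.unescape stands in for it here.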

If you only need an arbitrary binary representation that doesn’t need to be 7-bit safe, as Max mentioned, just .encode('utf-8').
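For completeness, a minimal Python 3 sketch of the plain UTF-8 round trip. The result is 8-bit data rather than ASCII, so it only suits channels that pass arbitrary bytes through unchanged.

```python
text = "Российская Федерация"

# Each Cyrillic letter takes two bytes in UTF-8; the space takes one.
raw = text.encode("utf-8")
print(len(text), len(raw))  # 20 39

# Decoding with the same codec restores the original string exactly.
restored = raw.decode("utf-8")
print(restored == text)  # True
```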
