unichr(0x10000) fails with a ValueError when CPython is compiled without --enable-unicode=ucs4.
Is there a language builtin or core library function that converts an arbitrary unicode scalar value or code-point to a unicode string that works regardless of what kind of python interpreter the program is running on?
Yes, here you go:
The crucial point to understand is that unichr() converts an integer to a single code unit in the Python interpreter’s string encoding. The Python Standard Library documentation for 2.7.3, 2. Built-in Functions, says that unichr() returns the Unicode string of “one character”, by which they mean “one code unit” in Unicode terms.
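To see which kind of interpreter you are running on, you can inspect sys.maxunicode. This is only a small sketch; the helper name is_narrow_build is made up for illustration.

```python
import sys


def is_narrow_build():
    """Return True when unicode strings use 16-bit code units.

    is_narrow_build is a hypothetical helper name, not a standard function.
    """
    # Narrow (UCS-2 / UTF-16) Python 2 builds report 0xFFFF here.
    # Wide (UCS-4) builds and Python 3.3+ report 0x10FFFF.
    return sys.maxunicode == 0xFFFF


print(is_narrow_build())
```

On a narrow build, unichr() only accepts values up to 0xFFFF, which is why unichr(0x10000) raises ValueError there.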
I’m assuming that you are using Python 2.x. The Python 3.x interpreter has no built-in unichr() function. Instead, the Python Standard Library documentation for 3.3.0, 2. Built-in Functions, says that chr() returns the string representing a character whose Unicode code point is the given integer. Note that the return value is now a string of unspecified length, not a string with a single code unit. So in Python 3.x, chr(0x10000) would behave as you expected. It “converts an arbitrary unicode scalar value or code-point to a unicode string that works regardless of what kind of python interpreter the program is running on”.
But back to Python 2.x. If you use unichr() to create Python 2.x unicode objects, and you are using Unicode scalar values above 0xFFFF, then you are committing your code to being aware of the Python interpreter’s implementation of unicode objects.
You can isolate this awareness with a function which tries unichr() on a scalar value, catches ValueError, and tries again with the corresponding UTF-16 surrogate pair.
But you might find it easier to just convert your scalars to 4-byte UTF-32 values in a UTF-32 byte string, and decode this byte string into a unicode string.
I tested this on Python 2.6.7 with UTF-16 encoding for Unicode strings. I didn’t test it on a Python 2.x interpreter with UTF-32 encoding for Unicode strings. However, it should work unchanged on any Python 2.x interpreter with any Unicode string implementation.
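Here is a rough sketch of both approaches described above. It is an illustrative reconstruction rather than the exact code that was tested, and the function names scalar_to_unicode and scalar_to_unicode_utf32 are made up. The small shim at the top is an addition so that the same sketch also runs on Python 3, where unichr() does not exist.

```python
import struct

try:
    _unichr = unichr  # Python 2
except NameError:
    _unichr = chr  # Python 3: chr() already accepts any code point


def scalar_to_unicode(scalar):
    """Try unichr(), then fall back to a UTF-16 surrogate pair.

    Hypothetical helper name; assumes scalar is a Unicode scalar value.
    """
    if not 0 <= scalar <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (scalar,))
    try:
        return _unichr(scalar)
    except ValueError:
        # Narrow build and scalar > 0xFFFF: build the surrogate pair by hand.
        offset = scalar - 0x10000
        high = 0xD800 + (offset >> 10)    # top 10 bits
        low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
        return _unichr(high) + _unichr(low)


def scalar_to_unicode_utf32(scalar):
    """Pack the scalar as 4 little-endian bytes and decode them as UTF-32.

    Hypothetical helper name; the codec picks the right internal form.
    """
    if not 0 <= scalar <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (scalar,))
    return struct.pack("<I", scalar).decode("utf-32-le")


print(repr(scalar_to_unicode(0x10000)))
print(repr(scalar_to_unicode_utf32(0x10000)))
```

The second function leaves the choice between one code unit and a surrogate pair to the codec, so your own code never has to know how the interpreter stores unicode objects.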