I need to convert (in Python) a 4-byte char into some other character. This is to insert it into my utf-8 mysql database without getting an error such as: “Incorrect string value: ‘\xF0\x9F\x94\x8E’ for column ‘line’ at row 1”
The question "Warning raised by inserting 4-byte unicode to mysql" suggests doing it this way:
>>> import re
>>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '
However, I get the same error as the user in the comment, “…bad character range..” This is apparently because my Python is a UCS-2 (not UCS-4) build. But then I am not clear on what to do instead?
In a UCS-2 build, Python uses two code units internally for each Unicode character above the \U0000ffff code point. Regular expressions need to work with those code units, so to match such characters you need a pattern that matches a high surrogate (\uD800-\uDBFF) followed by a low surrogate (\uDC00-\uDFFF). This regular expression matches any code point encoded with a UTF-16 surrogate pair (see UTF-16, Code points U+10000 to U+10FFFF).
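A rough sketch of such a surrogate-pair pattern follows. The name surrogate_pairs and the sample string are illustrative assumptions, not taken from the answer. The sample spells the sleepy face U+1F62A as its two UTF-16 code units, which is how a UCS-2 build stores it internally:

```python
import re

# Hypothetical name: a high surrogate followed by a low surrogate,
# i.e. one UTF-16 surrogate pair (one code point above U+FFFF).
surrogate_pairs = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

# U+1F62A written as its surrogate pair, D83D DE2A.
sample = u'Some example text with a sleepy face: \ud83d\ude2a'

# Remove every surrogate pair from the text.
stripped = surrogate_pairs.sub(u'', sample)
```

On a UCS-4 build the same character is stored as a single code unit, so this pattern alone would not match u'\U0001f62a' there.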
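To make this compatible across Python UCS-2 and UCS-4 builds, you could use a try:/except to pick one or the other: compile the \U00010000-\U0010ffff range first and, if that raises re.error, fall back to the surrogate-pair pattern.
Demonstration on a UCS-2 Python build: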
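A minimal sketch of that fallback, assuming the "bad character range" failure surfaces as re.error (which is what the re module raises for invalid patterns). On a UCS-2 build the except branch is taken. On a UCS-4 build, or on Python 3.3 and later where the narrow/wide distinction no longer exists, the first pattern compiles as written. The variable cleaned is an illustrative name:

```python
import re

try:
    # UCS-4 (wide) build: the full supplementary range is a valid class.
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # UCS-2 (narrow) build: match UTF-16 surrogate pairs instead.
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

example = u'Some example text with a sleepy face: \U0001f62a'

# Strip every character outside the Basic Multilingual Plane.
cleaned = highpoints.sub(u'', example)
```

Either branch leaves cleaned holding the text without the 4-byte character, which can then be inserted into a MySQL column that only accepts 3-byte UTF-8.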