I have a regex to detect invalid XML 1.0 characters in a Unicode string:
bad_xml_chars = re.compile(u'[^\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]', re.U)
On Linux with Python 2.7, this works perfectly. On Windows, the following is raised:
  File "C:\Python27\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\Python27\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range
Any ideas why this isn’t compiling on Windows?
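You have a narrow Python build on Windows, so Unicode strings are stored as UTF-16 code units. This means that any character above \uFFFF becomes two separate elements (a surrogate pair) in the Python string. For example, len(u'\U00010000') is 2 on a narrow build, and its two elements are u'\ud800' and u'\udc00'.
Here is how the regex engine will attempt to interpret your character class on narrow builds:
[^\x09\x0A\x0D\u0020-\ud7ff\ue000-\ufffd\ud800\udc00-\udbff\udfff]
You can see here that \udc00-\udbff is where the "bad character range" error is coming from: the start of that range is greater than its end.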
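If you want to check the surrogate arithmetic yourself, here is a minimal sketch that should run on either kind of build. The helper name utf16_surrogates is made up for illustration and is not part of any library. It computes the UTF-16 pair for each endpoint of \U00010000-\U0010FFFF and then confirms that the leftover range \udc00-\udbff is rejected by re.

```python
import re
import sys


def utf16_surrogates(code_point):
    """Return the (high, low) UTF-16 surrogate pair for a code point above U+FFFF.

    Hypothetical helper, written only to illustrate the encoding.
    """
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# What a narrow build stores for the two endpoints of \U00010000-\U0010FFFF.
first_high, first_low = utf16_surrogates(0x10000)
last_high, last_low = utf16_surrogates(0x10FFFF)

# Inside the character class, the hyphen ends up between the low surrogate
# of the first endpoint and the high surrogate of the last endpoint.
range_is_reversed = first_low > last_high

# Reproduce the failure directly by compiling that leftover range.
try:
    re.compile(u'[\udc00-\udbff]')
    reversed_range_fails = False
except re.error:
    reversed_range_fails = True

print("narrow build: %s" % (sys.maxunicode == 0xFFFF))
print("range seen by the engine: %04x-%04x" % (first_low, last_high))
print("reversed: %s, compile fails: %s" % (range_is_reversed, reversed_range_fails))
```

The sys.maxunicode check is a quick way to tell which build you are on: it is 0xFFFF on a narrow build and 0x10FFFF on a wide one.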