I have a regex to detect invalid xml 1.0 characters in a unicode string:

Question

0

Asked: June 16, 20262026-06-16T01:47:22+00:00 2026-06-16T01:47:22+00:00

I have a regex to detect invalid xml 1.0 characters in a unicode string:

0

I have a regex to detect invalid xml 1.0 characters in a unicode string:

bad_xml_chars = re.compile(u'[^\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]', re.U)

On Linux/python2.7, this works perfectly. On windows the following is raised:

  File "C:\Python27\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\Python27\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
  sre_constants.error: bad character range

Any ideas why this isn’t compiling on Windows?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-16T01:47:23+00:00

You have a narrow Python build on Windows, so Unicode uses UTF-16. This means that Unicode characters higher than \uFFFF will be two separate characters in the Python string. You should see something like this:

>>> len(u'\U00010000')
2
>>> u'\U00010000'[0]
u'\ud800'
>>> u'\U00010000'[1]
u'\udc00'

Here is how the regex engine will attempt to interpret your string on narrow builds:

[^\x09\x0A\x0D\u0020-\ud7ff\ue000-\ufffd\ud800\udc00-\udbff\udfff]

You can see here that \udc00-\udbff is where the invalid range message is coming from.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have a regex to detect invalid xml 1.0 characters in a unicode string:

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply