XML is supposed to be strict, and so there are some Unicode characters which

Question

0

Asked: May 20, 20262026-05-20T18:25:15+00:00 2026-05-20T18:25:15+00:00

XML is supposed to be strict, and so there are some Unicode characters which

0

XML is supposed to be strict, and so there are some Unicode characters which aren’t allowed in XML. However I’m trying to work with RSS feeds which often contain these characters anyway, and I’d like to either avoid parse errors from invalid characters or recover gracefully from them and present the document anyway.

See an example here (on March 21 anyway): http://feeds.feedburner.com/chrisblattman

What’s the recommended way to handle unicode in the XML feed? Detect the characters and substitute in null bytes, edit the parser, or some other method?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-20T18:25:16+00:00

Looks like that RSS feed contained a vertical tab character \x0c which is illegal per the XML 1.0 spec.

My advice is to filter out the illegal characters before passing the data to expat, rather than attempting to catch errors and recover. Here is a routine to filter out the Unicode characters which are illegal. I tested it on your chrisblattman.xml RSS feed:

import re
from xml.parsers import expat

# illegal XML 1.0 character ranges
# See http://www.w3.org/TR/REC-xml/#charsets
XML_ILLEGALS = u'|'.join(u'[%s-%s]' % (s, e) for s, e in [
    (u'\u0000', u'\u0008'),             # null and C0 controls
    (u'\u000B', u'\u000C'),             # vertical tab and form feed
    (u'\u000E', u'\u001F'),             # shift out / shift in
    (u'\u007F', u'\u009F'),             # C1 controls
    (u'\uD800', u'\uDFFF'),             # High and Low surrogate areas
    (u'\uFDD0', u'\uFDDF'),             # not permitted for interchange
    (u'\uFFFE', u'\uFFFF'),             # byte order marks
    ])

RE_SANITIZE_XML = re.compile(XML_ILLEGALS, re.M | re.U)

# decode, filter illegals out, then encode back to utf-8
data = open('chrisblattman.xml', 'rb').read().decode('utf-8')
data = RE_SANITIZE_XML.sub('', data).encode('utf-8')

pr = expat.ParserCreate('utf-8')
pr.Parse(data)

Update: Here is a Wikipedia page about XML character validity. My regexp above filters out the C1 control range, but you may want to allow those characters depending on your application.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

XML is supposed to be strict, and so there are some Unicode characters which

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply