XML is supposed to be strict, so there are some Unicode characters which aren’t allowed in XML. However, I’m trying to work with RSS feeds which often contain these characters anyway, and I’d like to either avoid parse errors from invalid characters or recover gracefully from them and present the document anyway.
See an example here (on March 21 anyway): http://feeds.feedburner.com/chrisblattman
What’s the recommended way to handle invalid Unicode characters in an XML feed? Detect the characters and substitute in null bytes, edit the parser, or some other method?
Looks like that RSS feed contained a form feed character (\x0c), which is illegal per the XML 1.0 spec.
My advice is to filter out the illegal characters before passing the data to expat, rather than attempting to catch errors and recover. I wrote a routine to filter out the illegal Unicode characters and tested it on your chrisblattman.xml RSS feed.
Update: Wikipedia has a page about XML character validity. The regexp I used also filters out the C1 control range, but you may want to allow those characters depending on your application.
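For illustration, here is a minimal sketch of that kind of filter in Python 3. It is not necessarily the exact routine described in this answer; the names `strip_illegal_xml_chars` and `strip_c1` are made up for this example, and the sample feed bytes are invented. It assumes the feed has already been decoded to a `str`, and it uses the standard library's `xml.etree.ElementTree` (which sits on top of expat) to show the cleaned text parsing.

```python
import re
import xml.etree.ElementTree as ET

# Characters outside the XML 1.0 "Char" production: C0 controls other than
# tab, LF and CR, the surrogate block, and U+FFFE / U+FFFF.
_FORBIDDEN = r"\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff"

# DEL and the C1 controls (except NEL, U+0085). XML 1.0 allows these but
# discourages them, so stripping them is optional.
_C1 = r"\x7f-\x84\x86-\x9f"

_FORBIDDEN_RE = re.compile("[" + _FORBIDDEN + "]")
_STRICT_RE = re.compile("[" + _FORBIDDEN + _C1 + "]")


def strip_illegal_xml_chars(text, strip_c1=True):
    """Return text with characters that XML 1.0 forbids removed.

    With strip_c1=True the discouraged DEL/C1 control range is removed too.
    """
    pattern = _STRICT_RE if strip_c1 else _FORBIDDEN_RE
    return pattern.sub("", text)


# Invented sample input: a tiny feed containing a form feed (\x0c).
raw = b"<rss><title>bad\x0cchar</title></rss>"

# Decode first so the filter works on code points rather than raw bytes.
text = raw.decode("utf-8", errors="replace")

# Filter before parsing instead of catching the parser's error afterwards.
root = ET.fromstring(strip_illegal_xml_chars(text))
print(root.find("title").text)
```

Pass `strip_c1=False` if your application needs to keep the C1 control characters and only wants to drop what the spec strictly forbids.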