My question is similar to Why are "control" characters illegal in XML 1.0? – however I’m looking for a solution to the problem below, rather than why the XML spec disallows control characters in XML.
I have a servlet which prints a String containing XML upon user request. One particular element contains a CDATA section which is required to contain some user input text.
Now it so happens that in one particular case, our user input contains the character U+0001 (a control character). And even though I specify the charset as UTF-8, the servlet throws an error:
Error: not well-formed
Location:
<![CDATA[
Is there a way I can process the Java String to make it “XML safe”? In particular, to make it safe when put in the CDATA section?
I hope my question is clear!
Thanks in advance,
Raj
The only conforming way to make this XML-safe is to add your own encoding.
You can do one of these two (for example):
Escape the offending characters with a scheme of your own, e.g. \u0001 for U+0001, or
encode the text as binary data, e.g. Base64.
Both of those approaches need explicit support in both the consumer and the producer. The second approach has the advantage of using well-defined data types with wide support, but if your content is actually text, you need to specify (or communicate) the encoding used in the byte stream (a necessity that XML itself would otherwise take care of).
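To make the two options concrete, here is a rough Java sketch. The class name XmlSafeCodec and its method names are made up for illustration, and the escape format (a backslash, the letter u and four hex digits, with literal backslashes doubled) is only one possible convention. The Base64 variant assumes both sides agree on UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of any library.
class XmlSafeCodec {

    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Option 1: double every backslash and replace each disallowed
    // character with backslash + 'u' + four hex digits.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp == '\\') {
                out.append("\\\\");
            } else if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Disallowed code points are always below 0x10000 here.
                out.append(String.format("\\u%04X", cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Inverse of escape(); assumes well-formed input and does no validation.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.append(c);
            } else if (s.charAt(i + 1) == '\\') {
                out.append('\\');
                i += 1;
            } else {
                // Skip the backslash and the 'u', then read four hex digits.
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 5;
            }
        }
        return out.toString();
    }

    // Option 2: ship the text as Base64 of its UTF-8 bytes.
    static String toBase64(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    static String fromBase64(String b64) {
        return new String(Base64.getDecoder().decode(b64), StandardCharsets.UTF_8);
    }
}
```

Whichever option you pick, the consumer has to call the matching decode step (unescape or fromBase64 in this sketch) after reading the element content.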
If removing all non-transferable characters would be appropriate, then this regex should do the trick:
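One possible pattern, sketched in Java and derived from the Char production of the XML 1.0 spec (so not necessarily the exact regex meant above), could look like this. The class name XmlCharStripper is just a placeholder.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class XmlCharStripper {

    // Matches every code point that is NOT allowed by the XML 1.0 Char
    // production: tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD and
    // U+10000 to U+10FFFF are kept, everything else is matched.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Removes all characters that cannot appear in an XML 1.0 document.
    static String strip(String s) {
        return INVALID_XML_CHARS.matcher(s).replaceAll("");
    }
}
```

You would call XmlCharStripper.strip(userInput) before writing the text into the CDATA section. Keep in mind that this silently drops data, so it only fits if losing those characters is acceptable.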
Note that the spec, in a note, suggests that document authors be even more restrictive with the set of characters they use. A regex for that would be a bit longer.