My question is similar to Why are "control" characters illegal in XML 1.0? –

Question

Asked: May 23, 2026 (2026-05-23T10:51:32+00:00)

My question is similar to Why are "control" characters illegal in XML 1.0? – however I’m looking for a solution to the problem below, rather than why the XML spec disallows control characters in XML.

I have a servlet that prints a String containing XML upon user request. One particular element contains a CDATA section, which is required to contain some user input text.

Now it so happens that in one particular case, our user input contains the character U+0001 (a control character). And even though I specify the charset as UTF-8, the servlet throws an error:

Error: not well-formed
Location: 

<![CDATA[

Is there a way I can process the Java String to make it “XML safe”? In particular, to make it safe when placed in the CDATA section?

I hope my question is clear!

Thanks in advance,
Raj

1 Answer

Editorial Team · Answer 1 · 2026-05-23T10:51:33+00:00

The only conforming way to make this XML-safe is to add your own encoding.

You can do one of these two things, for example:

1. Store all data as textual data and replace every forbidden character using some Unicode-escape mechanism (other than the one defined by XML itself, because character references such as &#x1; are subject to the same restriction!). For example, you could use Java's notation: \u0001 for U+0001.
2. Store the data as binary data and use base64Binary or hexBinary to represent it in the XML.

Both of those approaches need explicit support in both the consumer and the producer. The second approach has the advantage of using well-defined data types with wide support, but if your content is actually text, you need to specify (or otherwise communicate) the encoding used in the byte stream, which is something XML would normally take care of for you.
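As a rough illustration of the two approaches, here is a minimal sketch. The class and method names (XmlTextCodec, escape, unescape, toBase64, fromBase64) are made up for this example, and it assumes UTF-8 as the agreed byte encoding for the Base64 variant. The backslash itself is escaped as well, which is an extra assumption so that the consumer can reverse the mapping unambiguously.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper; all names here are illustrative only.
final class XmlTextCodec {

    // True if the code point matches the Char production of XML 1.0.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Approach 1: replace forbidden characters with a Java-style escape.
    // A backslash is doubled so that the result can be decoded again.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '\\') {
                out.append("\\\\");
            } else if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Every forbidden code point fits into four hex digits.
                out.append(String.format("\\u%04X", cp));
            }
        }
        return out.toString();
    }

    // Inverse of escape(); assumes the input was produced by escape().
    static String unescape(String escaped) {
        StringBuilder out = new StringBuilder(escaped.length());
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c != '\\') {
                out.append(c);
            } else if (escaped.charAt(i + 1) == '\\') {
                out.append('\\');
                i += 1; // skip the second backslash
            } else {
                // Backslash, the letter u, then four hex digits.
                out.append((char) Integer.parseInt(escaped.substring(i + 2, i + 6), 16));
                i += 5;
            }
        }
        return out.toString();
    }

    // Approach 2: ship the text as base64Binary, with UTF-8 as the agreed encoding.
    static String toBase64(String input) {
        return Base64.getEncoder().encodeToString(input.getBytes(StandardCharsets.UTF_8));
    }

    static String fromBase64(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String input = "a" + (char) 0x01 + "b\\c"; // contains U+0001 and a backslash
        String escaped = escape(input);
        System.out.println(escaped);
        System.out.println(unescape(escaped).equals(input));
        System.out.println(fromBase64(toBase64(input)).equals(input));
    }
}
```

Either way, the consumer has to know which scheme was used and undo it after parsing; the XML parser will not do that for you.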

If simply removing all characters that XML cannot carry is acceptable, then this regex should do the trick:

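import java.util.regex.Pattern;
// Matches every run of characters outside the XML 1.0 Char production (escapes are resolved by the regex engine).
Pattern XML_INVALID_CHARS = Pattern.compile("[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]+");
String xmlSafe = XML_INVALID_CHARS.matcher(input).replaceAll("");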

Note that the spec also contains a note encouraging document authors to be even more restrictive and to avoid further characters (certain control characters and permanently undefined code points). A regex covering those as well would be a bit longer.
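If you want to follow that stricter recommendation, a code-point filter may be easier to read than a longer regex. The following is only a sketch: the class and method names are hypothetical, the ranges are the discouraged ranges listed in the note in section 2.2 of the XML 1.0 spec, and the "compatibility characters" that the note also mentions are not handled here.

```java
// Hypothetical helper; all names here are illustrative only.
final class StrictXmlFilter {

    // True if the code point matches the Char production of XML 1.0.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // True for code points that are legal in XML 1.0 but discouraged by the spec:
    // control characters and permanently undefined code points.
    static boolean isDiscouraged(int cp) {
        return (cp >= 0x7F && cp <= 0x84)
                || (cp >= 0x86 && cp <= 0x9F)
                || (cp >= 0xFDD0 && cp <= 0xFDEF)
                // the last two code points of every supplementary plane
                || (cp >= 0x10000 && (cp & 0xFFFF) >= 0xFFFE);
    }

    // Removes every code point that is either forbidden or discouraged.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
                .filter(cp -> isXmlChar(cp) && !isDiscouraged(cp))
                .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

With this, StrictXmlFilter.strip(userInput) drops U+0001 as before, and additionally drops characters such as U+007F that are legal but discouraged.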
