I am trying to serialize DOM documents with supplementary Unicode characters such as U+1D49C (𝒜, mathematical script capital A). Creating a node with such a character is not a problem (I just set the node value to the UTF-16 equivalent, “\uD835\uDC9C”). When serializing, however, Xalan and XSLTC (with a Transformer) and Xerces (with LSSerializer) all create invalid character entities like “&#55349;&#56476;” (one per UTF-16 surrogate) instead of “&#119964;”. I tried the “normalize-characters” parameter for LSSerializer, but it is not supported. Only Saxon gets it right: it writes the character itself, without a character entity, when the output encoding is a Unicode one.
I cannot use Saxon in practice (among other reasons, I use Java applets and do not want to load another jar), so I am looking for a solution with the default JDK libraries. Is it possible to get valid XML documents serialized from a DOM document with supplementary Unicode characters?
[edit] I found someone else who encountered this problem: http://www.dragishak.com/?p=131
[edit2] Actually, it seems to work with LSSerializer when I don’t have Xerces on the classpath (the class used is com.sun.org.apache.xml.internal.serialize.DOMSerializerImpl). It does not work with a Transformer and com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl.
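For reference, here is a minimal sketch of how the problem can be reproduced with JDK classes only. The class and method names (SupplementaryCharRepro, hasSplitSurrogateEntity, and so on) are made up for illustration, and the check only looks at decimal character references. What gets printed depends on the JDK version and on which jars are on the classpath, so no expected output is shown.

```java
import java.io.StringWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical helper class, not part of any library.
final class SupplementaryCharRepro {

    // U+1D49C as a UTF-16 surrogate pair.
    static final String SCRIPT_A = "\uD835\uDC9C";

    // Builds <root>𝒜</root> in memory.
    static Document buildDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("root");
        root.appendChild(doc.createTextNode(SCRIPT_A));
        doc.appendChild(root);
        return doc;
    }

    // Serializes with whatever identity Transformer the classpath provides.
    static String withTransformer(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Serializes with whatever LSSerializer the DOM implementation provides.
    static String withLSSerializer(Document doc) {
        DOMImplementationLS ls = (DOMImplementationLS) doc.getImplementation();
        LSSerializer serializer = ls.createLSSerializer();
        return serializer.writeToString(doc);
    }

    // True if the XML contains a decimal character reference to a lone
    // surrogate (U+D800..U+DFFF), which is never valid in XML.
    static boolean hasSplitSurrogateEntity(String xml) {
        Matcher m = Pattern.compile("&#(\\d{1,7});").matcher(xml);
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildDocument();
        for (String xml : new String[] { withTransformer(doc), withLSSerializer(doc) }) {
            System.out.println(xml);
            System.out.println("split surrogate entities: " + hasSplitSurrogateEntity(xml));
        }
    }
}
```

Running it with and without Xerces or Xalan on the classpath should show which serializer is affected in a given setup.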
Since I didn’t see any answer coming, and other people seem to have the same problem, I looked into it further…
To find the origin of the bug, I used the serializer source code from Xalan 2.7.1, which is also used in Xerces.

- org.apache.xml.serializer.dom3.LSSerializerImpl uses org.apache.xml.serializer.ToXMLStream, which extends org.apache.xml.serializer.ToStream.
- ToStream.characters(final char chars[], final int start, final int length) handles the characters, and does not support supplementary Unicode characters properly. (Note: org.apache.xml.serializer.ToTextStream, which can be used with a Transformer, does a better job in its characters method, but it only handles plain text and ignores all markup. One would think that XML files are text, but for some reason ToXMLStream does not extend ToTextStream.)
- org.apache.xalan.transformer.TransformerIdentityImpl also uses org.apache.xml.serializer.ToXMLStream (which is returned by org.apache.xml.serializer.SerializerFactory.getSerializer(Properties format)), so it suffers from the same bug.
- ToStream uses org.apache.xml.serializer.CharInfo to check whether a character should be replaced by a String, so the bug could also be fixed there instead of directly in ToStream.
- CharInfo uses a property file, org.apache.xml.serializer.XMLEntities.properties, with a list of character entities, so changing this file could also be a way to fix the bug, although so far it is designed just for the special XML characters (quot, amp, lt, gt). The only way to make ToXMLStream use a different property file than the one in the package would be to put an org.apache.xml.serializer.XMLEntities.properties file earlier on the classpath, which would not be very clean…

With the default JDK (1.6 and 1.7):

- TransformerFactory returns a com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl, which uses com.sun.org.apache.xml.internal.serializer.ToXMLStream.
- In com.sun.org.apache.xml.internal.serializer.ToStream, characters() sometimes calls processDirty(), which calls accumDefaultEscape(), which could handle supplementary characters better, but in practice it does not seem to work (maybe processDirty is not called for these characters)…
- com.sun.org.apache.xml.internal.serialize.DOMSerializerImpl uses com.sun.org.apache.xml.internal.serialize.XMLSerializer, which supports supplementary characters.

Strangely enough, XMLSerializer comes from Xerces, and yet it is not used by Xerces when xalan or xsltc are on the classpath. This is because org.apache.xerces.dom.CoreDOMImplementationImpl.createLSSerializer uses org.apache.xml.serializer.dom3.LSSerializerImpl when it is available, instead of org.apache.xml.serialize.DOMSerializerImpl.

- With serializer.jar on the classpath, org.apache.xml.serializer.dom3.LSSerializerImpl is used. Warning: xalan.jar and xsltc.jar both reference serializer.jar in the manifest, so serializer.jar ends up on the classpath if it is in the same directory and either xalan.jar or xsltc.jar is on the classpath!
- If only xercesImpl.jar and xml-apis.jar are on the classpath, org.apache.xml.serialize.DOMSerializerImpl is used as the LSSerializer, and supplementary characters are properly handled.

CONCLUSION AND WORKAROUND: the bug lies in Apache’s org.apache.xml.serializer.ToStream class (renamed com.sun.org.apache.xml.internal.serializer.ToStream inside the JDK). A serializer that handles supplementary characters properly is org.apache.xml.serialize.DOMSerializerImpl (renamed com.sun.org.apache.xml.internal.serialize.DOMSerializerImpl inside the JDK). However, Apache prefers ToStream over DOMSerializerImpl when it is available, so maybe it behaves better for other things (or maybe it’s just a reorganization). On top of that, they went as far as deprecating DOMSerializerImpl in Xerces 2.9.0. Hence the following workaround, which might have side effects:

- when Xerces and Apache’s serializer are on the classpath, replace "(doc.getImplementation()).createLSSerializer()" by "new org.apache.xml.serialize.DOMSerializerImpl()"
- when Apache’s serializer is on the classpath (for instance because of xalan) but not Xerces, try to replace "(doc.getImplementation()).createLSSerializer()" by "new com.sun.org.apache.xml.internal.serialize.DOMSerializerImpl()" (a fallback is necessary because this class might disappear in the future)

These 2 workarounds produce a warning when compiling.
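The two replacements and the fallback can be combined in one helper. The sketch below is only one possible way to do it: it assumes that loading the classes by name through reflection is acceptable (this avoids a compile-time reference to them), and SerializerSelector and createSerializer are hypothetical names. On runtimes where neither class exists or is accessible, it simply falls back to the standard createLSSerializer() call, so the original bug may still be present there.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical helper class, not part of any library.
final class SerializerSelector {

    // Serializer classes named in the workaround, tried in this order.
    private static final String[] CANDIDATES = {
        "org.apache.xml.serialize.DOMSerializerImpl",
        "com.sun.org.apache.xml.internal.serialize.DOMSerializerImpl"
    };

    static LSSerializer createSerializer(Document doc) {
        for (String className : CANDIDATES) {
            try {
                Object candidate = Class.forName(className).getDeclaredConstructor().newInstance();
                if (candidate instanceof LSSerializer) {
                    return (LSSerializer) candidate;
                }
            } catch (ReflectiveOperationException | LinkageError | RuntimeException e) {
                // Class missing, inaccessible or not instantiable: try the next one.
            }
        }
        // Fallback: whatever the DOM implementation provides by default.
        DOMImplementationLS ls = (DOMImplementationLS) doc.getImplementation();
        return ls.createLSSerializer();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement("root"))
           .appendChild(doc.createTextNode("\uD835\uDC9C"));
        LSSerializer serializer = createSerializer(doc);
        System.out.println(serializer.getClass().getName());
        System.out.println(serializer.writeToString(doc));
    }
}
```

Printing the class name of the returned serializer makes it easy to see which branch was taken in a given environment.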
I don’t have a workaround for XSLT transforms, but this is beyond the scope of the question. I guess one could do a transform to another DOM document and use DOMSerializerImpl to serialize.

Some other workarounds, which might be a better solution for some people:
- use Saxon with a Transformer
- use XML documents with UTF-16 encoding
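For the Transformer case, the two JDK-only ideas mentioned above (transforming into another DOM document, and using UTF-16 as the output encoding) could look like the following sketch. TransformerWorkarounds and its method names are hypothetical, and I have not checked the output on every JDK and Xalan version, so treat it as a starting point rather than a guaranteed fix.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper class, not part of any library.
final class TransformerWorkarounds {

    // Runs the transform into a new DOM tree instead of a stream, so that
    // the stream serializer is not involved at this stage.
    static Document transformToDom(Transformer transformer, Document input)
            throws TransformerException {
        DOMResult result = new DOMResult();
        transformer.transform(new DOMSource(input), result);
        Node node = result.getNode();
        return node instanceof Document ? (Document) node : node.getOwnerDocument();
    }

    // Serializes with UTF-16 as the output encoding and decodes the bytes
    // back into a String for inspection.
    static String serializeAsUtf16(Transformer transformer, Document doc)
            throws TransformerException {
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(bytes));
        return new String(bytes.toByteArray(), StandardCharsets.UTF_16);
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement("root"))
           .appendChild(doc.createTextNode("\uD835\uDC9C"));
        // Identity transformer; a real stylesheet would be passed to newTransformer(Source).
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        Document copy = transformToDom(transformer, doc);
        System.out.println(copy.getDocumentElement().getTextContent());
        System.out.println(serializeAsUtf16(transformer, doc));
    }
}
```

The Document returned by transformToDom can then be passed to an LSSerializer that handles supplementary characters, instead of letting the Transformer write the stream itself.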