I have been struggling with this for a little while. I have a multilingual web app that outputs XML at some point. The XML can contain text in any language, so my approach to sanitization has been to disallow only the specific characters that break XML. I also wrap as much as I can in CDATA, but I have a ton of content in attributes. I don't want to disallow special characters wholesale, because completely valid characters like parentheses, periods, dashes, ticks and apostrophes are used all the time and they work.
What is the best way to strip out all characters that will break an XML attribute, but leave languages intact?
UPDATE:
I found http://en.wikipedia.org/wiki/CDATA#CDATA-type_attribute_value , which suggested to me that I could declare an attribute as a CDATA section using a DTD; however, that does not seem to be true.
<?xml version="1.0" ?>
<!DOCTYPE foo [
<!ELEMENT foo EMPTY>
<!ATTLIST foo a CDATA #REQUIRED>
]>
<foo a="&bull;"><![CDATA[ &bull; ]]> </foo>
Any validator will complain about bull not being an entity in the attribute. If you remove the attribute, it will be valid. Also, I hear schemas are the way to go, so if something like the above is possible using an XML Schema instead, that would be awesome.
Thanks!
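The validator complaint described above can be reproduced in a few lines. XML predefines only five named entities (amp, lt, gt, quot, apos), and bull is an HTML name, so it is undefined unless the DTD declares it. Here is a minimal sketch in Python, assuming the standard-library parser is a fair stand-in for "any validator"; the helper names is_well_formed and to_char_refs are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference.

    This does NOT escape &, < or quotes; it only rewrites non-ASCII characters.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


named = '<foo a="&bull;"/>'      # HTML entity name, not declared in XML
numeric = '<foo a="&#8226;"/>'   # numeric character reference for U+2022
literal = '<foo a="\u2022"/>'    # the bullet character itself

print(is_well_formed(named))     # rejected: undefined entity
print(is_well_formed(numeric))   # accepted
print(is_well_formed(literal))   # accepted
print(to_char_refs("\u2022 caf\u00e9"))
```

So the bullet itself is not the problem in an attribute, only the named reference is. Writing the character directly (with a matching document encoding) or as a numeric reference both parse.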
this is valid
You can translate special characters to HTML entities with htmlentities()
and reverse the translation with html_entity_decode()
see: http://www.php.net/manual/en/function.htmlentities.php
see also “html metacharacters”
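For the attribute case specifically, one possible approach is to escape only the XML metacharacters and drop the few code points that XML 1.0 forbids outright, leaving every other character (and therefore every language) untouched. The following is a rough sketch in Python rather than PHP, not a definitive implementation. The function name xml_attr and the decision to silently delete forbidden characters are assumptions of this example; quoteattr is from the standard library's xml.sax.saxutils.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Code points that may not appear in an XML 1.0 document at all:
# most C0 control characters, lone surrogates, U+FFFE and U+FFFF.
_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]")


def xml_attr(value: str) -> str:
    """Return value as a quoted, escaped XML attribute value.

    Forbidden characters are removed (an assumption of this sketch);
    &, <, > and the chosen quote character are escaped by quoteattr.
    """
    cleaned = _FORBIDDEN.sub("", value)
    return quoteattr(cleaned)


# Usage: quoteattr adds the surrounding quotes itself.
text = 'He said "it\'s" \u2022 caf\u00e9 & <b>'
element = "<foo a=" + xml_attr(text) + "/>"
print(element)

# Round trip: the parser hands back the original string.
print(ET.fromstring(element).get("a") == text)
```

Parentheses, periods, dashes and apostrophes pass through unchanged, since none of them are special in XML unless the apostrophe happens to be the attribute's quote character, which quoteattr takes care of.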