UPDATE: The invalid characters are actually in the attributes rather than the elements, which prevents me from using the CDATA solution suggested below.
In my application I receive the following XML as a string. There are two problems that prevent it from being accepted as valid XML.
I hope someone has a solution for fixing these bugs gracefully.
- There are ASCII character codes in the XML that aren’t allowed. It is not only the one displayed in the example; I would like to replace all such codes with their corresponding characters.
- Within an element a ‘<’ exists. I would like to remove all of these entire ‘inner elements’ (`<L CODE="C01">WWW.cars.com</L>`) from the XML.
<?xml version="1.0" encoding="ISO-8859-1"?>
<cars>
  <car model="ford" description="Argentinië love this"/>
  <car model="kia" description="a small family car"/>
  <car model="opel" description="great car <L CODE="C01">WWW.cars.com</L>"/>
</cars>
For a quick fix, you could load this not-XML into a string and add CDATA markers inside any XML tags that you know tend to contain invalid data. For example, if you only ever see bad data inside `<description>` tags, you could replace every `<description>` with `<description><![CDATA[` and every `</description>` with `]]></description>`. This would turn the tag into `<description><![CDATA[great car <L CODE="C01">WWW.cars.com</L>]]></description>`, which you could then process successfully: it would be a `<description>` tag that contains the simple string `great car <L CODE="C01">WWW.cars.com</L>`.

If the `<description>` tag could ever have any attributes, then this kind of string replacement would be fraught with problems. But if you can count on the open tag to always be exactly the string `<description>` with no attributes and no extra whitespace inside the tag, and if you can count on the close tag to always be `</description>` with no whitespace before the `>`, then this should get you by until you can convince whoever is producing your crap input that they need to produce well-formed XML.

Update
Since the malformed data is inside an attribute, CDATA won’t work. But you could use a regular expression to find everything inside those quote characters, and then do string manipulation to properly escape the `<`s and `>`s. They’re at least escaping embedded quotes, so a regex to go from `"` to `"` would work.

Keep in mind that it’s generally a bad idea to use regexes on XML. Of course, what you’re getting isn’t actually XML, but it’s still hard to get right for all the same reasons. So expect this to be brittle: it’ll work for your sample input, but it may break when they send you the next file, especially if they don’t escape `&` properly. Your best bet is still to convince them to give you well-formed XML.
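
As a rough illustration of the quote-to-quote regex idea, here is a minimal Python sketch. It is not a definitive fix and it rests on some assumptions: that embedded quotes inside attribute values arrive as `&quot;` (as described above; if they arrive as raw `"`, this pattern will mis-pair the quotes), that no element text contains a double quote, and that all attributes use double quotes. The names `escape_attribute_values`, `ATTR`, and `BARE_AMP` are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Matches name="value" pairs. Assumption: a value never contains a raw
# double quote (embedded quotes are expected to arrive as &quot;).
ATTR = re.compile(r'([\w:.-]+)="([^"]*)"')

# An ampersand that does not already start an entity or character reference.
BARE_AMP = re.compile(r'&(?!#\d+;|#x[0-9A-Fa-f]+;|\w+;)')


def escape_attribute_values(text):
    """Escape bare &, < and > inside every double-quoted attribute value."""
    def fix(match):
        name, value = match.group(1), match.group(2)
        value = BARE_AMP.sub('&amp;', value)  # must run before adding &lt;/&gt;
        value = value.replace('<', '&lt;').replace('>', '&gt;')
        return f'{name}="{value}"'
    return ATTR.sub(fix, text)


# Hypothetical input modelled on the sample, with embedded quotes as &quot;
broken = (
    '<cars>'
    '<car model="kia" description="a small family car"/>'
    '<car model="opel" description="great car '
    '<L CODE=&quot;C01&quot;>WWW.cars.com</L>"/>'
    '</cars>'
)

fixed = escape_attribute_values(broken)
root = ET.fromstring(fixed)  # now parses as well-formed XML
print(root[1].get('description'))
# prints: great car <L CODE="C01">WWW.cars.com</L>
```

After parsing, the attribute holds the inner markup as a plain string, so removing it (if that is still wanted) becomes ordinary string handling on already-valid data rather than surgery on the broken input.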