I am integrating with Magento ecommerce using their “SOAP” API, and the API returns “XML” results. The problem is that the XML is not always well-formed:
<product>
<entity_id>18</entity_id>
<price regular="2925 <span>Nok</span>"/>
...
In this specific case, the price regular attribute has both an invisible character 0xa0 (before the span tag), and < > within the attribute text.
I have no way to get proper well-formed XML from Magento it seems, so the alternative is to clean it up before I feed it to my XmlSerializer deserialization:
XmlSerializer serializer = new XmlSerializer(typeof(Responses.Product.product));
product = serializer.Deserialize(textReader) as Responses.Product.product;
The invisible character I can get rid of using a simple text replace, but I’m more unsure about the <> within the attribute text.
My question is: how can I clean it up so that it is valid XML?
The character 0x3c is the < character. For an invisible character you would rather be looking for something like the 0x09 TAB character.
To fix the broken markup you could look for that specific HTML tag in the content, using a regular expression to allow any currency within the tag:
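The exact pattern is not reproduced here. As a minimal sketch of this kind of replacement, assuming the currency is always exactly three letters, something like the following could be applied to the raw response text before deserialization. The class name XmlCleanup and the method name EscapeCurrencySpan are hypothetical, chosen only for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlCleanup
{
    // Assumption: the stray HTML is always <span>XXX</span>, where XXX is a
    // three-letter currency code such as "Nok". The letters are captured in
    // group 1 so they can be reused in the replacement.
    private static readonly Regex CurrencySpan =
        new Regex(@"<span>([A-Za-z]{3})</span>");

    // Hypothetical helper: escapes the angle brackets of the matched span so
    // that the attribute value becomes well-formed XML text.
    public static string EscapeCurrencySpan(string rawXml)
    {
        return CurrencySpan.Replace(rawXml, "&lt;span&gt;$1&lt;/span&gt;");
    }
}
```

Calling XmlCleanup.EscapeCurrencySpan on the price line from the question turns the attribute value into 2925 &lt;span&gt;Nok&lt;/span&gt; (any 0xa0 character is left untouched). The cleaned string can then be wrapped in a StringReader and passed to serializer.Deserialize as before.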
This works as long as there aren’t any span elements with three-character content in the XML code itself. You could do similar replacements for other HTML tags, but try to keep the pattern as specific as possible, to avoid false positives.