I often receive XML files that contain unescaped special characters such as &, <, >, " and ' inside element content. Because of that, I cannot load them with SimpleXML or DOM, so I cannot validate users' XML files against my XSD (below) before processing them further in PHP.
Is there any way of solving this problem?
I'm reading the XML file from a remote host, and its size can be anywhere between 10 KB and 10 MB.
Thanks in advance
Note: I'm posting only the invalid XML elements below because, for some reason, the whole XML file appears as plain text here.
XML
<url>http://www.amazon.co.uk/gp/product/B005MG8O96/ref=olp_product_details?ie=UTF8&me=&seller=</url>
<description>iPhone 4. The "fastest", <b>highest-resolution</b> iPhone.</description>
XSD
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="store">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" minOccurs="1" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="title_type" />
              <xs:element name="description" type="description_type" />
              <xs:element name="price" type="xs:decimal" />
              <xs:element name="url" type="url_type" />
              <xs:element name="images">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="image" minOccurs="1" maxOccurs="unbounded">
                      <xs:complexType>
                        <xs:attribute name="url" type="url_type" />
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
            <xs:attribute name="id" type="id_type" />
            <xs:attribute name="available" type="available_type" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="id" type="id_type" />
      <xs:attribute name="date" type="xs:date" />
      <xs:attribute name="time" type="xs:time" />
    </xs:complexType>
  </xs:element>
  <xs:simpleType name="title_type">
    <xs:restriction base="xs:string">
      <xs:minLength value="1" />
      <xs:maxLength value="100" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="description_type">
    <xs:restriction base="xs:string">
      <xs:minLength value="1" />
      <xs:maxLength value="255" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="url_type">
    <xs:restriction base="xs:anyURI">
      <xs:minLength value="10" />
      <xs:maxLength value="2000" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="id_type">
    <xs:restriction base="xs:string">
      <xs:minLength value="1" />
      <xs:maxLength value="100" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="available_type">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Yes" />
      <xs:enumeration value="No" />
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
You should get them to send you proper XML as the commenters said. If you are unable to, you can do the following:
For each element that might contain invalid characters, if its type is xs:string and the element name is unique in your schema, do a multiline search for its opening and closing tags. Between those tags, replace & with &amp;, replace < with &lt; and replace > with &gt;. Single and double quotes are not metacharacters outside tags, so once you have made those replacements you should have valid XML. It might not be the XML the sender intended, but it is the only unambiguous way I can think of to turn their non-XML into valid XML.
An alternative to the replacements above would be to always wrap the text content of those string elements in a CDATA section. But really, how hard is it to just require whoever generates these files to do that for you?
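If you do end up repairing the files yourself, the two approaches above could look roughly like the following in PHP. This is only a sketch under some assumptions: the function names (rewriteElementText, escapeText, wrapInCdata) are made up for illustration, the listed elements hold text only, are never nested inside themselves and are never self-closing, and PHP 8 or later is available. The sample input is a shortened version of the elements from the question, with example.com standing in for the real host.

```php
<?php
// Hypothetical helper (not part of any library): applies $fix to the raw text
// found between <name ...> and </name> for every listed element name.
function rewriteElementText(string $xml, array $names, callable $fix): string
{
    foreach ($names as $name) {
        $tag = preg_quote($name, '~');
        // "s" modifier: "." also matches newlines, so multi-line content works.
        // The lazy ".*?" stops at the first closing tag of the same name.
        $pattern = '~(<' . $tag . '(?:\s[^>]*)?>)(.*?)(</' . $tag . '\s*>)~s';
        $result = preg_replace_callback(
            $pattern,
            fn(array $m): string => $m[1] . $fix($m[2]) . $m[3],
            $xml
        );
        if ($result === null) {
            // For example, a PCRE backtrack limit hit on a very large file.
            throw new RuntimeException('Regex failed: ' . preg_last_error_msg());
        }
        $xml = $result;
    }
    return $xml;
}

// Option 1: escape &, < and >. ENT_NOQUOTES leaves quotes alone, and the
// final "false" turns off double encoding so an existing &amp; is kept as is.
// Caveat: HTML-only entities such as &nbsp; are also kept and are not valid XML.
function escapeText(string $text): string
{
    return htmlspecialchars($text, ENT_NOQUOTES | ENT_SUBSTITUTE, 'UTF-8', false);
}

// Option 2: wrap the text in a CDATA section, splitting any literal "]]>"
// so that it cannot end the section early.
function wrapInCdata(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

// Usage with a shortened sample input.
$broken = "<item>\n"
    . "<url>http://example.com/p?ie=UTF8&me=&seller=</url>\n"
    . "<description>The \"fastest\", <b>highest-resolution</b> phone.</description>\n"
    . "</item>";

$fixed = rewriteElementText($broken, ['url', 'description'], 'escapeText');

$doc = new DOMDocument();
if ($doc->loadXML($fixed)) {
    // The parser now sees the <b> markup as plain text inside <description>.
    echo $doc->getElementsByTagName('description')->item(0)->textContent, "\n";
}
```

Once the repaired string loads, DOMDocument::schemaValidate() can be pointed at the XSD file as usual. Note that with the CDATA option any entity the sender did escape correctly (for example &amp;) ends up as literal text, so the escaping variant is probably the safer default for mixed-quality input.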