I am reading an XML document using SAX (javax.xml.parsers.SAXParser). In that XML, some special characters such as &, <, >, " and ' appear among the child node values. SAX reads the XML successfully up to that point, but there it throws an org.xml.sax.SAXParseException.
For example, in the sample XML below, SAX reads the node value of <child1> successfully, but it throws org.xml.sax.SAXParseException at <child2>, because the value of the Name attribute contains <.
<Parent>
<child1>
LS-23541723
</child1>
<child2 id="2" Name="T-Shirt And Denim - T<D" Rate="500.00">
</child2>
<child3>
<![CDATA[This is the child 2]]>
</child3>
<child4>
<![CDATA[This is the child 4]]>
</child4>
</Parent>
I can't determine beforehand which nodes contain these special characters (it is dynamic). So what I want to do is read the XML with SAX while ignoring the nodes that contain such special characters. Put simply, I think I could do this if it were possible to read the XML with SAX and skip the nodes that raise the org.xml.sax.SAXParseException.
Is this possible and if yes how?
Note: I cannot simply replace them with entity references like &amp;, since sometimes the XML nodes themselves arrive with &lt; and &gt; (<child1> arrives as &lt;child1&gt;). So, before starting to read the XML with SAX, I replace all the entity references with the literal characters (replaceAll("&gt;", ">"), etc.).
I don't think SAX can handle this. The XML has to be well-formed, so you have to make a bunch of replacements before the text is submitted to SAX. Look for any ', " or < that is not in the right place: a " between two " delimiters, a ' between two ' delimiters, and a < that is not part of a start tag or end tag. That should be feasible. It is the second pass, after your first pass that replaces &lt; and &gt; by their literal counterparts. Ideally you should also watch for comments, CDATA sections, etc., to be sure they are well-formed.
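A rough sketch of what such a second pass could look like in Java, for the attribute case from the question only. It rests on some loud assumptions: attribute values are double-quoted, the quotes themselves are balanced (a stray " inside a value is not handled), and no attribute-like text such as name="a<b" appears inside CDATA sections or comments, because the regular expression would rewrite that too. The class and method names (AttributeSanitizer, sanitize, attributeValues) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AttributeSanitizer {
    // name="value" preceded by whitespace; assumes double quotes that are balanced.
    private static final Pattern ATTR =
            Pattern.compile("(\\s[\\w:.-]+\\s*=\\s*)\"([^\"]*)\"");
    // An ampersand that does not start an entity or character reference.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:[A-Za-z][\\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Escapes stray '&' and '<' inside double-quoted attribute values.
    static String sanitize(String xml) {
        Matcher m = ATTR.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Escape '&' first so the '&lt;' added afterwards is not escaped again.
            String value = BARE_AMP.matcher(m.group(2)).replaceAll("&amp;")
                    .replace("<", "&lt;");
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + "\"" + value + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Parses the XML with SAX and collects every value of the given attribute.
    static List<String> attributeValues(String xml, String attrName) {
        List<String> values = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                String v = atts.getValue(attrName);
                if (v != null) {
                    values.add(v);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Still not well-formed (or a parser setup problem): report it unchecked.
            throw new IllegalStateException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        String raw = "<Parent><child2 id=\"2\" Name=\"T-Shirt And Denim - T<D\" "
                + "Rate=\"500.00\"></child2></Parent>";
        System.out.println(attributeValues(sanitize(raw), "Name"));
    }
}
```

The parser turns &lt; back into < when it reports the attribute, so the handler sees the original value. Anything the regular expression misses will still surface as a SAXParseException, which is probably what you want: the pre-pass only repairs the cases you have decided are safe to repair.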