I am trying to parse some XML with Saxon so that I can run XPath queries on it, but I have hit two problems. The first is that Saxon takes a very long time to build a very short XHTML document.
Here is the code:
import java.io.File;
import net.sf.saxon.FeatureKeys; // package name as of Saxon 9.1
import net.sf.saxon.s9api.*;

// Phase 1, compile the XPath expression.
Processor processor = new Processor(false); // false = not schema-aware
processor.setConfigurationProperty(FeatureKeys.DTD_VALIDATION, false);
XPathCompiler xpathCompiler = processor.newXPathCompiler();
xpathCompiler.setBackwardsCompatible(false);
String expressionTitre = "//div[@class='score_global']/preceding-sibling::img[1]";
XPathExecutable xpathExecutable = xpathCompiler.compile(expressionTitre);
XPathSelector selector = xpathExecutable.load();
logger.info("Xpath compiled.");
// Phase 2, load xml document.
DocumentBuilder documentBuilder = processor.newDocumentBuilder();
documentBuilder.setSchemaValidator(null);
documentBuilder.setLineNumbering(false);
documentBuilder.setRetainPSVI(false);
XdmNode context = documentBuilder.build(new File("sample/sample.xml")); // This line takes ages to return.
What I don’t understand is that the same document loads at normal speed when I parse it with plain SAX :(.
What did I forget to configure in Saxon?
Java 1.6
Saxon 9.1.0.8
The second problem is that Saxon is unable to process accented characters. My XML root element looked like this:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
So I removed the xml:lang and lang attributes, but had no better luck 🙁
Do you have any ideas?
Thank you!
Well, after much reading, the fix was simply to define a CatalogResolver and download the XHTML DTDs locally, so that the parser no longer has to fetch them over the network. I dropped Saxon and used plain JAXP/SAXReader instead.
This page http://xml.apache.org/commons/components/resolver/resolver-article.html proved very interesting.
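For anyone who wants to see the idea without pulling in the Apache resolver, here is a minimal sketch using only the JDK. It is not the CatalogResolver setup described above: it is a hand-written EntityResolver that maps a DOCTYPE public identifier to a local copy of the DTD. The class name LocalDtdResolver and its register and parse methods are hypothetical names made up for this illustration, and the location of the local DTD files is an assumption you would adapt to your own layout.

```java
import java.io.File;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical helper: serves locally stored DTDs instead of letting
// the parser download them from the system identifier in the DOCTYPE.
class LocalDtdResolver implements EntityResolver {

    // DOCTYPE public identifier -> local copy of the DTD (assumed to exist on disk).
    private final Map<String, File> localCopies = new HashMap<String, File>();

    public void register(String publicId, File localDtd) {
        localCopies.put(publicId, localDtd);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        File local = (publicId == null) ? null : localCopies.get(publicId);
        if (local == null) {
            // Unknown entity: returning null lets the parser use its default lookup.
            return null;
        }
        // Passing a file: system identifier (rather than a stream) lets the parser
        // resolve relative references inside the DTD against the local directory.
        InputSource source = new InputSource(local.toURI().toString());
        source.setPublicId(publicId);
        return source;
    }

    // Parses an XML string with plain JAXP, routing DTD lookups through the resolver.
    public static Document parse(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

To use it, register each XHTML public identifier that appears in your documents (for example "-//W3C//DTD XHTML 1.0 Strict//EN") against the DTD file you downloaded, and keep the entity files the DTD refers to in the same directory. A catalog file with the Apache CatalogResolver does the same job declaratively and is probably the better choice once there are more than a couple of DTDs.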
Hope these considerations prove useful to someone 🙂