I’d like to perform XPath queries on an XML document fetched online. I’ve set up InputStreams that retrieve the content and prepend a <?xml ...?> header declaring the encoding given in the charset field of the HTTP response. Although it works, it’s painfully slow.
// bis is the BufferedInputStream with the content part of the HTTP reply
docBuilder = docBuilderFactory.newDocumentBuilder(); // throws exception
Document doc = docBuilder.parse(
    new PrependInputStream(bis,
        "<?xml version='1.0' encoding='" + charset + "' ?>\r\n"));
(please allow me not to put my whole source this time: I’m preparing an assignment for students).
Some strace analysis revealed that the program stalls when contacting w3.org:
send(8, "GET /TR/xhtml1/DTD/xhtml1-transitional.dtd HTTP/1.1\r\nUser-Agent: Java/1.6.0_17\r\nHost: www.w3.org\r\nAccept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2\r\nConnection: keep-alive\r\n\r\n", 186, 0)
recv(8, ...
As I don’t much care whether the HTML content is valid (well-formed should be enough), I tried docBuilderFactory.setValidating(false), but that doesn’t seem to prevent the online retrieval of the DTD.
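For reference, here is a minimal sketch of one way to stop the fetch altogether. It assumes the parser behind DocumentBuilderFactory is Xerces or the JDK's bundled derivative, since the load-external-dtd feature is Xerces-specific and other parsers may reject it. The class name NoDtdFetch and the method parseWithoutDtd are made up for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class NoDtdFetch {
    // Xerces-specific feature name (an assumption about the parser in use).
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    public static Document parseWithoutDtd(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // setValidating(false) only turns validation off; the external DTD
        // is still read unless the feature below is disabled as well.
        factory.setValidating(false);
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(in);
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<?xml version='1.0' encoding='UTF-8'?>\n"
            + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title>t</title></head><body><p>hi</p></body></html>";
        Document doc = parseWithoutDtd(
            new ByteArrayInputStream(xhtml.getBytes("UTF-8")));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

The trade-off is that entities declared only in the DTD (such as &nbsp;) are no longer defined, so this only suits documents where that does not matter.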
Trying to set a schema manually with docBuilderFactory.setSchema() (that was not a good idea), using the same DTD file retrieved by hand, results in an “org.xml.sax.SAXParseException: The markup in the document preceding the root element must be well-formed.”
Where am I over-complicating things?
(the XML backend seems to be com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadSchema, as far as I can tell from stack traces — if that’s of any use).
HTML DTDs are huge and pull in further files through includes, and you are right, fetching them takes forever. Use an XML catalog: it lets you store the DTDs locally and map them by their system ID.
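To illustrate the idea of mapping by system ID, here is a rough sketch that uses a plain SAX EntityResolver as a hand-rolled stand-in for a real catalog. The class LocalDtdResolver, its map method, and the one-line DTD stub in main are all hypothetical. In practice the map would point at local copies of the W3C DTD and of the .ent files it includes, each of which needs its own entry.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class LocalDtdResolver implements EntityResolver {
    // system ID -> local DTD text (real code would map to local files instead)
    private final Map<String, String> bySystemId = new HashMap<String, String>();

    public void map(String systemId, String dtdText) {
        bySystemId.put(systemId, dtdText);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String dtd = bySystemId.get(systemId);
        if (dtd == null) {
            // Unmapped entity: serve an empty source rather than go online.
            dtd = "";
        }
        InputSource source = new InputSource(new StringReader(dtd));
        source.setSystemId(systemId);
        return source;
    }

    public static Document parse(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        LocalDtdResolver resolver = new LocalDtdResolver();
        // Stub standing in for a local copy of the real DTD (illustration only).
        resolver.map("http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd",
                     "<!ENTITY nbsp \"&#160;\">");
        String xhtml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
            + "<html><body><p>a&nbsp;b</p></body></html>";
        Document doc = parse(xhtml, resolver);
        // The entity reference is expanded from the locally served declaration.
        System.out.println(doc.getDocumentElement().getTextContent().length());
    }
}
```

Because the declarations are still read, only from a local source, entity references in the page keep their proper replacement characters.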
If you use a build tool such as Maven, you will find plenty of pointers.
The advantage of a catalog over intercepting entities, as the answer linked by @sylvainulg suggests, is that you receive the correct characters: entity references are expanded from the real DTD.