I have a very strange case:
I tried to parse several XHTML-conforming websites using the default Java XML parser. The test blocks during parsing (not during downloading).
Could this be a bug, or does the parser try to download additional referenced resources during parsing (which would be a “nice” anti-feature)?
With simple data, it works. (TEST1)
With complex data, it blocks. (TEST2)
(I tried en.wikipedia.org and validator.w3.org)
When blocking occurs, CPU is idle.
Tested with JDK6 and JDK7, same results.
Please see test case, source is ready for copy + paste + run.
Source
import java.io.*;
import java.net.*;
import java.nio.charset.*;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;
import org.w3c.dom.*;

public class _XmlParsingBlocks {

    private static Document parseXml(String data)
            throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        DOMResult out = new DOMResult(b.newDocument());
        t.transform(new StreamSource(new StringReader(data)), out);
        return (Document) out.getNode();
    }

    private static byte[] streamToByteArray(InputStream is)
            throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        for (;;) {
            byte[] buffer = new byte[256];
            int count = is.read(buffer);
            if (count == -1) {
                is.close();
                break;
            }
            baos.write(buffer, 0, count);
        }
        return baos.toByteArray();
    }

    private static void test(byte[] data)
            throws Exception {
        String asString = new String(data, Charset.forName("UTF-8"));
        System.out.println("===== PARSING STARTED =====");
        Document doc = parseXml(asString);
        System.out.println("===== PARSING ENDED =====");
    }

    public static void main(String[] args)
            throws Exception {
        {
            System.out.println("********** TEST 1");
            test("<html>test</html>".getBytes("UTF-8"));
        }
        {
            System.out.println("********** TEST 2");
            URL url = new URL("http://validator.w3.org/");
            URLConnection connection = url.openConnection();
            InputStream is = connection.getInputStream();
            byte[] data = streamToByteArray(is);
            System.out.println("===== DOWNLOAD FINISHED =====");
            test(data);
        }
    }
}
Output
********** TEST 1
===== PARSING STARTED =====
===== PARSING ENDED =====
********** TEST 2
===== DOWNLOAD FINISHED =====
===== PARSING STARTED =====
[here it blocks]
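In the last few months the W3C has started blocking requests for common DTDs such as the XHTML DTD, because its servers can’t cope with the traffic generated. That would explain the symptom: the parser stalls while trying to fetch the DTD named in the page’s DOCTYPE, which is also why the CPU is idle. If you’re not using a proxy server that caches the DTDs, you will need to use an EntityResolver or a catalog to redirect the references to a local copy.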
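As a rough illustration of the EntityResolver route, here is a minimal sketch. It parses with a DocumentBuilder directly instead of the Transformer used in the question, since that is the simplest place to attach a resolver. The class name `LocalDtdResolverSketch` and the `dtd` directory are assumptions made for the example, and you would have to download the DTD and its `.ent` files into that directory yourself. If no local copy is found, the resolver returns an empty entity, so the parser never goes to the network.

```java
import java.io.File;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class LocalDtdResolverSketch {

    // Hypothetical directory holding local copies of the DTD and .ent files.
    static final File DTD_DIR = new File("dtd");

    // Redirects every external entity to a local file, never to the network.
    static final EntityResolver LOCAL_RESOLVER = new EntityResolver() {
        public InputSource resolveEntity(String publicId, String systemId) {
            if (systemId != null) {
                // Use the last path segment of the system ID as the local file name,
                // e.g. ".../DTD/xhtml1-strict.dtd" -> "dtd/xhtml1-strict.dtd".
                String name = systemId.substring(systemId.lastIndexOf('/') + 1);
                File local = new File(DTD_DIR, name);
                if (local.isFile()) {
                    InputSource source = new InputSource(local.toURI().toString());
                    source.setPublicId(publicId);
                    return source;
                }
            }
            // No local copy: hand back an empty entity instead of downloading.
            return new InputSource(new StringReader(""));
        }
    };

    static Document parseXml(String data) throws Exception {
        DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        b.setEntityResolver(LOCAL_RESOLVER);
        return b.parse(new InputSource(new StringReader(data)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title>t</title></head><body><p>test</p></body></html>";
        Document doc = parseXml(xhtml);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Note that the empty-entity fallback only suits pages that don't rely on the DTD. A document using named entities such as `&nbsp;` would most likely fail with an undeclared-entity error, so for real XHTML pages the local copies are the safer choice.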