I’m looking for an explanation of why my SAX parser fails when my XML file contains certain special UTF-8 characters.
To parse the XML file I use Document doc = builder.parse(inputSource);
However, when I build the InputSource from an InputStreamReader, it works fine:
DocumentBuilder builder = factory.newDocumentBuilder();
InputStream in = new FileInputStream(file);
InputSource inputSource = new InputSource(new InputStreamReader(in));
Document doc = builder.parse(inputSource);
I don’t quite understand why the latter works. I’ve seen examples of it being used, but none of them explain why it works.
Does the second version parse a string rather than a file, so that the encoding is assumed to be UTF-8?
I suspect your document isn’t really in the encoding you’ve declared. This line:
InputSource inputSource = new InputSource(new InputStreamReader(in));
will use the platform default encoding to convert the binary data into text within the InputStreamReader. The XML parser no longer gets to do that conversion itself, because it never sees the raw bytes.
If this is working, your XML file is probably subtly broken: it may declare that it’s in UTF-8 while actually being encoded in the platform default encoding (e.g. Windows-1252). Rather than rely on the workaround, you should fix the XML if you have any choice about it.
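To make the diagnosis concrete, here is a minimal sketch of the suspected situation. It is not taken from the asker's code: the class name, the helper methods and the sample document are all hypothetical. ISO-8859-1 stands in for the platform default encoding, because it is guaranteed to be available and encodes é as the same single byte (0xE9) that Windows-1252 uses. The sketch builds a document that declares UTF-8 but is encoded differently, shows that handing the raw bytes to the parser fails while going through a Reader succeeds, and then re-encodes the bytes so that they match the declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; none of these names come from the original question.
final class EncodingMismatchDemo {

    // Sample document (an assumption): the declaration says UTF-8, but the
    // bytes are ISO-8859-1, where e-acute is the single byte 0xE9.
    static byte[] mislabelledXml() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>caf\u00e9</name>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    // The parser receives raw bytes and decodes them according to the
    // declaration. Returns false if decoding or parsing fails.
    static boolean parsesFromRawBytes(byte[] data) throws Exception {
        try {
            newBuilder().parse(new ByteArrayInputStream(data));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    // The Reader decodes the bytes before the parser sees them, so the
    // declared encoding no longer decides how the bytes are interpreted.
    static String textViaReader(byte[] data, Charset charset) throws Exception {
        InputSource source = new InputSource(
                new InputStreamReader(new ByteArrayInputStream(data), charset));
        return newBuilder().parse(source).getDocumentElement().getTextContent();
    }

    // One way to repair the data: decode with the encoding it is really in,
    // then write it back out as genuine UTF-8 to match the declaration.
    static byte[] reencodeAsUtf8(byte[] data, Charset actual) {
        return new String(data, actual).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        byte[] broken = mislabelledXml();
        System.out.println("raw bytes parse: " + parsesFromRawBytes(broken));
        System.out.println("text length via reader: "
                + textViaReader(broken, StandardCharsets.ISO_8859_1).length());
        byte[] repaired = reencodeAsUtf8(broken, StandardCharsets.ISO_8859_1);
        System.out.println("repaired bytes parse: " + parsesFromRawBytes(repaired));
    }
}
```

The sketch passes the charset to InputStreamReader explicitly so that the result does not depend on the machine it runs on. The one-argument constructor used in the question picks up the platform default instead, and on Java 18 and later that default is UTF-8 unless it is overridden, so the workaround may stop helping there. Re-encoding the file, or correcting its declaration, lets the parser read the raw bytes directly again.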