I want to parse an XML file from a URL using JDOM. But when I try this:
SAXBuilder builder = new SAXBuilder();
builder.build(aUrl);
I get this exception:
Invalid byte 1 of 1-byte UTF-8 sequence.
I thought this might be a BOM issue, so I checked the source and saw a BOM at the beginning of the file. I tried reading from the URL using aUrl.openStream() and removing the BOM with the Commons IO BOMInputStream. But to my surprise, it didn’t detect any BOM.
I then tried reading from the stream, writing to a local file, and parsing the local file. I set the encoding to UTF-8 for both the InputStreamReader and the OutputStreamWriter, but when I opened the file it was full of garbage characters.
I thought the problem was the encoding of the source URL. But when I open the URL in a browser, save the XML to a file, and read that file through the process I described above, everything works fine.
I appreciate any help on the possible cause of this issue.
That HTTP server is sending the content in GZIPped form (Content-Encoding: gzip; see http://en.wikipedia.org/wiki/HTTP_compression if you don’t know what that means), so you need to wrap aUrl.openStream() in a GZIPInputStream that will decompress it for you. For example, pass new GZIPInputStream(aUrl.openStream()) to builder.build() instead of aUrl.
Edited to add, based on the follow-up comment: If you don’t know in advance whether the URL will be GZIPped, you can write a small helper method (warning: not tested) and pass the stream it returns to builder.build(). This is basically equivalent to the above, since aUrl.openStream() is explicitly documented to be a shorthand for aUrl.openConnection().getInputStream(), except that it examines the Content-Encoding header before deciding whether to wrap the stream in a GZIPInputStream.
See the documentation for java.net.URLConnection.
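The header check described above could be sketched roughly as follows. This is a minimal, hedged sketch and not a definitive implementation: the class name GzipAwareStreams and the method names wrapIfGzipped and open are hypothetical, and treating "identity" as uncompressed and "x-gzip" as an alias for gzip are assumptions on my part. It uses only java.net and java.util.zip, so the JDOM call is left out of the block.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class; the names are illustrative only.
final class GzipAwareStreams {

    // Wraps the raw stream in a GZIPInputStream only when the
    // Content-Encoding value says the body is gzip-compressed.
    static InputStream wrapIfGzipped(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding == null || contentEncoding.equalsIgnoreCase("identity")) {
            return raw; // no compression declared, so use the stream as-is
        }
        if (contentEncoding.equalsIgnoreCase("gzip") || contentEncoding.equalsIgnoreCase("x-gzip")) {
            return new GZIPInputStream(raw);
        }
        // Assumption: anything else (e.g. deflate) is not handled by this sketch.
        throw new IOException("Unexpected Content-Encoding: " + contentEncoding);
    }

    // Opens the URL and inspects the Content-Encoding response header
    // before deciding whether the body needs to be decompressed.
    static InputStream open(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        String contentEncoding = connection.getContentEncoding();
        InputStream raw = connection.getInputStream();
        try {
            return wrapIfGzipped(raw, contentEncoding);
        } catch (IOException e) {
            raw.close(); // do not leak the connection's stream on failure
            throw e;
        }
    }
}
```

With a helper like this, the parsing call from the question would become builder.build(GzipAwareStreams.open(aUrl)), since SAXBuilder.build also accepts an InputStream.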