Here is the starting point: I have an HTML file on disk. When I open it in a regular web browser it displays as it should, i.e. no matter what encoding is used, I see the correct national characters.
Then comes my part: my task is to load the same file, parse it, and print some pieces to the screen (console), let's say the text of all <hX> elements. Of course I would like to see only correct characters, not some mumbo-jumbo. The last step is to change some of the text and save the file.
So the parser has to handle the encoding in both directions, when reading and when writing. So far I am not aware of a parser that is even capable of loading the data correctly.
Question
What parser would you recommend?
Details
An HTML page generally declares its encoding in the header (in a meta tag), so the parser should use it. A scenario in which I have to look at the file in advance, check the encoding, and then set it manually in code is a no-go. For example, this is taken from the JSoup tutorials:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
I cannot do such a thing; the parser has to handle encoding detection by itself.
In C# I faced a similar problem with loading HTML. I used HTMLAgilityPack: first I ran encoding detection, then I used the result to decode the data stream, and after that I parsed the data. So I did both steps explicitly, but since the library provides both methods, that is fine with me.
Such explicit separation might be even better, because when the header is missing it would be possible to fall back to a probabilistic encoding detection method.
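To make the "detect first, then decode" separation concrete, here is a minimal sketch in plain Java with no parser involved. The class name `CharsetSniffer`, the 4096-byte window and the regular expression are my own assumptions for illustration, not part of any library. It only handles ASCII-compatible encodings (a UTF-16 file would need a BOM check of its own), and the fallback argument is the place where a probabilistic detector could be plugged in.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: detect the charset first, then decode as a separate step. */
public class CharsetSniffer {

    // Matches charset=... inside a <meta> tag, e.g.
    // <meta http-equiv="Content-Type" content="text/html; charset=windows-1250">
    // or <meta charset="utf-8">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?\\s*([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Step 1: take the encoding from a UTF-8 BOM or a meta tag, else use the fallback. */
    public static Charset detect(byte[] data, Charset fallback) {
        // A UTF-8 byte-order mark wins over anything declared in the markup.
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // Decode only the head of the file as ISO-8859-1: every byte maps to exactly
        // one char, so an ASCII meta tag stays readable in any ASCII-compatible encoding.
        int headLength = Math.min(data.length, 4096);
        String head = new String(data, 0, headLength, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: ignore it and use the fallback.
            }
        }
        return fallback;
    }

    /** Step 2: decode all bytes with whatever step 1 returned. */
    public static String decode(byte[] data, Charset fallback) {
        return new String(data, detect(data, fallback));
    }
}
```

The decoded string can then be handed to whichever parser you like, and the detected `Charset` kept around so the file can be written back in the same encoding.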
The Jsoup API reference says for that parse method that if you provide null as the second argument (the encoding one), it’ll use the http-equiv meta-tag to determine the encoding. So it looks like it already does the “parse a bit, determine encoding, re-parse with proper encoding” routine. Normally such parsers should be capable of resolving the encoding themselves using any means available to them. I know that SAX parsers in Java are supposed to use byte-order marks and the XML declaration to try and establish an encoding.
Apparently Jsoup will default to UTF-8 if no proper meta-tag is found. As they say in the documentation, this is “usually safe” since UTF-8 is compatible with a host of common encodings for the lower code points. But I take it that “usually safe” might not really be good enough in this case.
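As a rough sketch of what that could look like for the task in the question (print the headings, change some text, save the file), assuming Jsoup is on the classpath: the class name `HeadingEditor` and the method are made up for illustration. I am also assuming that the document's output settings carry the charset Jsoup settled on while parsing, so it is worth checking on your own files that the file really comes back in its original encoding.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/** Hypothetical example class, not part of Jsoup. */
public class HeadingEditor {

    /** Parses with charset auto-detection, prints all headings, edits the first one, saves. */
    public static Charset rewriteFirstHeading(File file, String newText) throws IOException {
        // null as charsetName: Jsoup determines the encoding from the meta tag,
        // falling back to UTF-8 if it finds none.
        Document doc = Jsoup.parse(file, null, "http://example.com/");

        // Print the text of every h1..h6 element.
        for (Element heading : doc.select("h1, h2, h3, h4, h5, h6")) {
            System.out.println(heading.tagName() + ": " + heading.text());
        }

        // Replace the text of the first heading, if there is one.
        Element first = doc.selectFirst("h1, h2, h3, h4, h5, h6");
        if (first != null) {
            first.text(newText);
        }

        // Assumption: the output charset matches the one used for parsing.
        Charset charset = doc.outputSettings().charset();
        Files.write(file.toPath(), doc.outerHtml().getBytes(charset));
        return charset;
    }
}
```

Note that Jsoup serializes its own normalized version of the document, so the saved file may differ from the original in whitespace and in tags the parser added, not only in the text you changed.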
If you don’t sufficiently trust Jsoup to detect the encoding, I see two alternatives: