I have
Document document = Jsoup.connect(link).get();
and sometimes, for some URLs, I get an exception:
Exception in thread "main" java.nio.charset.UnsupportedCharsetException: X-MAC-ROMAN
at java.nio.charset.Charset.forName(Unknown Source)
at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:86)
at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:469)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:147)
I have a catch block as:
catch (IOException e1)
I understand the exception occurs because Java is Unicode and that webpage/site is not following Unicode. How do I handle this issue? Also, the connect call is used for many websites, which include both Unicode and bytecode.
That’s not entirely correct. You’re likely confusing the statement “Java is Unicode” with the fact that Java uses Unicode to store strings and characters in memory. A computer’s memory can only store bytes (zeroes and ones), not characters, so characters need to be converted to bytes and back using a specific character encoding; Java uses Unicode (UTF-16) for its in-memory strings.
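To make the bytes-versus-characters point concrete, here is a minimal sketch using only the JDK. The class name EncodingRoundTrip, the helper decodeAs and the sample word are illustrative assumptions, not part of the original question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {

    // Hypothetical helper: turns raw bytes back into characters using the given charset.
    public static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", written with an escape so the source file encoding does not matter

        // The same four characters become different byte sequences per encoding.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);        // 5 bytes
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1); // 4 bytes
        System.out.println(utf8.length + " bytes in UTF-8, " + latin1.length + " bytes in ISO-8859-1");

        // Decoding with the charset that was used for encoding restores the text.
        System.out.println(text.equals(decodeAs(utf8, StandardCharsets.UTF_8)));      // true

        // Decoding with a different charset does not: this is how Mojibake appears.
        System.out.println(text.equals(decodeAs(utf8, StandardCharsets.ISO_8859_1))); // false
    }
}
```

The bytes received from a web server are in the same situation: they only become correct text when decoded with the charset the server actually used.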
This exception occurs because the Java runtime on the platform where your code runs doesn’t support this charset, so Java can’t convert the bytes obtained from the web server to characters in this encoding. This charset is specific to Mac OS platforms and you’re likely running Windows or so.
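Whether a given charset name is available can be checked at runtime. The following is a small sketch under the assumption that falling back to another charset is acceptable; the class CharsetSupportCheck and the method resolveOrFallback are hypothetical names. Whether X-MAC-ROMAN resolves depends on the Java runtime in use, so no output is shown for it.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetSupportCheck {

    // Returns the named charset if this runtime supports it, otherwise the given fallback.
    public static Charset resolveOrFallback(String name, Charset fallback) {
        try {
            return Charset.forName(name);
        } catch (UnsupportedCharsetException | IllegalCharsetNameException e) {
            // The name is unknown to this runtime, or is not a legal charset name at all.
            return fallback;
        }
    }

    public static void main(String[] args) {
        String declared = "X-MAC-ROMAN"; // the charset name from the stack trace
        System.out.println("Supported on this runtime: " + Charset.isSupported(declared));
        System.out.println("Charset used for decoding: "
                + resolveOrFallback(declared, StandardCharsets.ISO_8859_1));
    }
}
```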
Contact the website admin and report it as a bug. It’s their fault that they used a platform-specific (Mac OS) encoding instead of a universal (ISO/UTF) encoding.
As to Jsoup, your best bet is to get the website as an InputStream by URL#openStream() first and then feed it to Jsoup#parse() instead, wherein you explicitly specify a character encoding which is supported on your platform, such as ISO-8859-1.
Note that you still risk ending up with Mojibake when there are non-ASCII characters present. Also note that you shouldn’t do it for all links, but only for those which threw UnsupportedCharsetException (thus, perform the job in its catch block).
That is because Chrome is trying to be so kind to you that it ignores the unknown encoding and chooses a default encoding instead, which still risks the website being displayed as Mojibake; anything beyond the ASCII range might look malformed.
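The Jsoup workaround described above (open the stream yourself, parse it with an explicit charset, and do so only in the catch block for UnsupportedCharsetException) could look roughly like the sketch below. It assumes Jsoup is on the classpath; the class CharsetFallbackFetcher, its method names and the choice of ISO-8859-1 as the fallback are illustrative assumptions. Newer Jsoup versions may treat unknown charsets differently, so treat it as a starting point rather than a definitive fix.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.UnsupportedCharsetException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CharsetFallbackFetcher {

    // Assumed fallback; any charset supported on your platform would do.
    static final String FALLBACK_CHARSET = "ISO-8859-1";

    // Parses the stream with the explicit fallback charset instead of the one the site declared.
    public static Document parseWithFallbackCharset(InputStream input, String baseUri) throws IOException {
        return Jsoup.parse(input, FALLBACK_CHARSET, baseUri);
    }

    public static Document fetch(String link) throws IOException {
        try {
            // Normal path: let Jsoup pick the charset declared by the site.
            return Jsoup.connect(link).get();
        } catch (UnsupportedCharsetException e) {
            // Only links that declare an unsupported charset end up here.
            try (InputStream input = new URL(link).openStream()) {
                return parseWithFallbackCharset(input, link);
            }
        }
    }
}
```

Existing code would then call CharsetFallbackFetcher.fetch(link) in place of Jsoup.connect(link).get(), keeping its catch (IOException e1) block as it is. Non-ASCII characters on the affected pages may still come out as Mojibake.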
Please refresh your vocabulary on the meaning of the word “bytecode”. This has got absolutely nothing to do with character encodings.