Some pages have HTML special characters in their content, but they show up as a square (an unknown character).
What can I do?
Can I convert the String containing the characters to another format (UTF-8)? This happens in the conversion from InputStream to String. I really don’t know what causes it.
    public HttpURLConnection openConnection(String url) {
        try {
            URL urlDownload = new URL(url);
            HttpURLConnection con = (HttpURLConnection) urlDownload.openConnection();
            con.setInstanceFollowRedirects(true);
            con.connect();
            return con;
        } catch (Exception e) {
            return null;
        }
    }

    private String getContent(HttpURLConnection con) {
        try {
            return IOUtils.toString(con.getInputStream());
        } catch (Exception e) {
            System.out.println("Erro baixando página: " + e);
            return null;
        }
    }

    page.setContent(getContent(openConnection(con)));
You need to read the InputStream using InputStreamReader with the charset as specified in the Content-Type header of the downloaded HTML page. Otherwise the platform default charset will be used, which is apparently not the same as the HTML’s one in your case.

You can of course also use an HTML reader/parser like Jsoup which takes this automatically into account.
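To make the InputStreamReader approach concrete, here is a minimal JDK-only sketch. The class name and both helper methods are hypothetical, and the ISO-8859-1 fallback in main is an assumption for the case where the header carries no charset parameter. A byte array stands in for the HTTP response body.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetAwareReader {

    // Hypothetical helper: picks the charset parameter out of a Content-Type
    // value such as "text/html; charset=UTF-8". Returns the fallback when the
    // header is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String trimmed = param.trim();
            // Case-insensitive prefix check, independent of the default locale.
            if (trimmed.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = trimmed.substring(8).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    // Decodes the whole stream with an explicit charset instead of the
    // platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Simulated response: "café" encoded as UTF-8.
        byte[] body = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        Charset cs = charsetFromContentType("text/html; charset=UTF-8",
                StandardCharsets.ISO_8859_1);
        String text = readAll(new ByteArrayInputStream(body), cs);
        System.out.println(text.equals("caf\u00e9"));
    }
}
```

With a real connection, the header value would come from con.getContentType() and the stream from con.getInputStream().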
Update: as per your updated question, you seem to be using URLConnection to request the HTML page and IOUtils to convert InputStream to String. You need to pass the charset from the Content-Type header as the second argument of IOUtils.toString(), instead of calling it with the InputStream alone.

If you’re still having problems with getting the characters right, then it can only mean that the console/viewer where you’re printing those characters to doesn’t support the charset. E.g., when you print the String with System.out.println() in Eclipse, then you need to ensure that the Eclipse console uses UTF-8. You can set it by Window > Preferences > General > Workspace > Text File Encoding.
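To tell whether the String itself is wrong or only the console is misrendering it, you can print the characters in a form any console can show. This is a small sketch of that check; the class and method names are hypothetical and not part of any library.

```java
class CodePointDump {

    // Hypothetical helper: leaves ASCII characters alone and renders every
    // other char as a backslash-u escape with four hex digits, so the output
    // does not depend on the console encoding.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The literal below is "café" written with a Unicode escape.
        System.out.println(escapeNonAscii("caf\u00e9"));
    }
}
```

If the dump shows the code points you expect (00e9 for é here), the String is fine and the console is the culprit. If it shows fffd, the replacement character, the damage most likely already happened while decoding the stream.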
Or if you’re writing it to some file by FileWriter, then you should rather be using InputStream/OutputStream from the beginning on without converting it to String first. If converting to String is really an important step, then you need to write it to new OutputStreamWriter(output, "UTF-8").
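Both options from the last paragraph can be sketched with the JDK alone. The class and method names are hypothetical, an in-memory stream stands in for the file, and StandardCharsets.UTF_8 is used in place of the "UTF-8" string, which is equivalent but avoids the checked UnsupportedEncodingException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8Writing {

    // Option 1: copy the raw bytes, so no charset is involved at all.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    // Option 2: the content is already a String, so encode it explicitly
    // as UTF-8 rather than relying on the platform default.
    static void writeUtf8(String content, OutputStream output) throws IOException {
        try (Writer writer = new OutputStreamWriter(output, StandardCharsets.UTF_8)) {
            writer.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeUtf8("caf\u00e9", sink);
        // "café" is 3 ASCII bytes plus 2 bytes for the accented letter.
        System.out.println(sink.size()); // prints 5

        ByteArrayOutputStream copySink = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(sink.toByteArray()), copySink);
        System.out.println(copySink.size()); // prints 5
    }
}
```

For a real file, output would be a FileOutputStream instead of the ByteArrayOutputStream used here.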