I’m trying to download a web page in Java with the following:
URL url = new URL("http://www.jksfljasdlfas.com");
File to = new File("/home/test/test.html");
Reader in = new InputStreamReader(url.openStream(), "UTF-8");
Writer out = new OutputStreamWriter(new FileOutputStream(to), "UTF-8");
int c;
while((c = in.read()) != -1){
out.write(c);
}
in.close();
out.close();
The page downloads, but some characters are replaced by entities.
This:
<a href="http://www.generation276.org/film/?m=200812&paged=2" >Pagina successiva »</a>
becomes this:
<a href="http://www.generation276.org/film/?m=200812&paged=2" >Pagina successiva »</a>
Downloading the same page with Chrome, the & remains &.
I’m new to charsets and encodings; can anybody see what the problem is?
The Java part is working perfectly fine.
Chrome is tricking you there. In Firefox, when I select
View -> Page Source, I see this:
while with Firebug / Inspect Element I see this:
and it copies to the clipboard as this:
Browsers don’t always show you what’s really there.
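If you want to convince yourself that the copy loop is not the part introducing entities, here is a minimal sketch that runs the same read/write loop over an in-memory string instead of a URL. The class and method names (CopyCheck, copy, roundTrip) are made up for this illustration, and the sample markup in the usage note is an assumption about what the server might send.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper for illustration only.
class CopyCheck {
    // Same character-by-character loop as in the question.
    static void copy(Reader in, Writer out) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            out.write(c);
        }
    }

    // Pushes a string through the loop and returns what came out the other end.
    static String roundTrip(String source) {
        try (Reader in = new StringReader(source); Writer out = new StringWriter()) {
            copy(in, out);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling CopyCheck.roundTrip("?m=200812&amp;paged=2") should hand back the input unchanged, entities included. In other words, if entities end up in your file, they were most likely already in the bytes the server sent.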
The second part of your question is identical to this previous Question:
And hence the answer is also the same:
Use StringEscapeUtils.unescapeHtml(String) from the Apache Commons Lang project (in Commons Lang 3 the method is named unescapeHtml4).
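To give a rough idea of what such an unescape step does, here is a minimal sketch that uses only the JDK (Java 9+ for Map.of). It is not the Commons Lang implementation: EntityDecoder and its methods are hypothetical names, and it knows only a handful of named entities plus numeric references, whereas the library covers the full entity table.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Commons Lang.
class EntityDecoder {
    // Deliberately tiny table; a real library knows every named entity.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"",
            "raquo", "\u00BB", "laquo", "\u00AB");

    // Matches &name;, &#123; and &#x7B; style references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());          // text before the entity
            out.append(resolve(m.group(1), m.group())); // decoded entity, or the original
            last = m.end();
        }
        out.append(html, last, html.length());          // trailing text
        return out.toString();
    }

    private static String resolve(String body, String original) {
        try {
            if (body.startsWith("#x") || body.startsWith("#X")) {
                return new String(Character.toChars(Integer.parseInt(body.substring(2), 16)));
            }
            if (body.startsWith("#")) {
                return new String(Character.toChars(Integer.parseInt(body.substring(1))));
            }
        } catch (IllegalArgumentException e) {
            return original; // number too large or not a valid code point: leave as-is
        }
        String named = NAMED.get(body);
        return named != null ? named : original; // unknown names are left untouched
    }
}
```

You would call EntityDecoder.decode(text) on the downloaded text before writing it out. Keep in mind that unescaping a whole HTML page can turn valid markup into invalid markup (a bare & inside an href, for example), so it is usually safer to unescape only the text you extract from the page.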