I’m trying to read XML data from the Google weather web service. The response contains some Spanish characters, and the problem is that these characters are not displayed properly. I’ve tried to convert everything to UTF-8, but that does not seem to help. The code is given below:
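import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Class name is a placeholder; the original declaration was not included in the post.
public class Main {
    public static void main(String[] args) {
        try {
            URL url = new URL("http://www.google.com/ig/api?weather=Noja&hl=es");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            // Decodes the response as UTF-8 regardless of what the server actually sent
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    con.getInputStream(), "UTF-8"));
            String str = in.readLine();
            // this does not work either
            // String str = new String(in.readLine().getBytes("UTF-8"), "UTF-8");
            System.out.println(str);
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}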
The output is given below (trimmed to keep the post within limits). Notice “mi�” and “s�b”:
trimmed to keep max char limit
<day_of_week data="mi�"/><day_of_week data="s�b"/><low data="11"/><high data="16"/><icon data="/ig/images/weather/chance_of_rain.gif"/><condition data="Posibilidad de lluvia"/></forecast_conditions></weather></xml_api_reply>
If that page is XML, you should usually pass the InputStream directly to the XML parser and let it detect the encoding automatically. Otherwise, you should look at the charset parameter of the Content-Type response header to determine the correct encoding and create the appropriate InputStreamReader.
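As a rough sketch of the first option, the raw bytes can be handed to a DOM parser without wrapping them in a Reader. This assumes the document names its encoding in the XML declaration; the class name XmlEncodingSketch and the sample bytes are made up for illustration and stand in for con.getInputStream().

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class XmlEncodingSketch {
    // Passes the undecoded byte stream to the parser, which picks the
    // encoding from the XML declaration instead of a hard-coded charset.
    static Document parse(InputStream in) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
    }

    public static void main(String[] args) throws Exception {
        // Illustrative bytes standing in for an ISO-8859-1 encoded response.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<day_of_week data=\"s\u00e1b\"/>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parse(new ByteArrayInputStream(bytes));
        // The attribute value is now a correctly decoded Java String.
        System.out.println(doc.getDocumentElement().getAttribute("data"));
    }
}
```

Whether the printed text looks right still depends on the console's own encoding, but the String held in memory is decoded correctly.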
Edit: That server is indeed responding with different encodings to the browser and the Java client, probably depending on the Accept-Charset request header. For Firefox, the value of this header means that both charsets are accepted, with no preference for either one, and the server responds with a Content-Type header of text/xml; charset=UTF-8. The Java client does not send this header, and the server responds with text/xml; charset=ISO-8859-1.
To use the charset supplied by the server, read the charset parameter from the Content-Type response header and use it when creating the InputStreamReader.
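One way this could look is a small helper that pulls the charset parameter out of a Content-Type value. This is only a sketch, not the code from the original answer; the class name ContentTypeCharset, the method name and the fallback parameter are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.util.Locale;

final class ContentTypeCharset {
    // Returns the charset named in a value such as
    // "text/xml; charset=ISO-8859-1", or the fallback if none is usable.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // header missing
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Drop the "charset=" prefix and any surrounding quotes.
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback; // no charset parameter present
    }
}
```

With the connection from the question, the reader would then be created as new InputStreamReader(con.getInputStream(), ContentTypeCharset.fromContentType(con.getContentType(), StandardCharsets.UTF_8)), so the decoding follows whatever the server reports.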
Edit 2: It turns out the server decides which charset to use based on the User-Agent request header. If you send a suitable User-Agent header with the request, it responds with a charset of UTF-8.
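Setting that header on an HttpURLConnection might look like the sketch below. The header value "Mozilla/5.0" is a hypothetical browser-like string chosen for illustration, not the value from the original answer, and the class and method names are also made up.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class UserAgentSketch {
    // Hypothetical browser-like value; the exact string that triggers
    // UTF-8 on the server is an assumption.
    static final String USER_AGENT = "Mozilla/5.0";

    // Creates the connection object and sets the header. Nothing is sent
    // until the caller connects, e.g. by calling getInputStream().
    static HttpURLConnection open(String address) throws IOException {
        HttpURLConnection con =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        con.setRequestProperty("User-Agent", USER_AGENT);
        return con;
    }
}
```

Request properties have to be set before the connection is opened, so the call belongs right after openConnection() and before getInputStream().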
Anyway, the Content-Type response header contains the correct charset to use.