I have a simple web service that lists a variable number of foreign languages.
Some of them are listed in their native script (Chinese, for example).
I must read this list from a web page and dynamically add the entries to a JComboBox.
Currently I’m reading them this way:
public static Vector getSiteLanguages() {
    System.out.println("Reading Home from " + Constants.HOME);
    URL url;
    URLConnection connection;
    BufferedReader br;
    String inputLine;
    String regEx = "<option.*value=.([A-Z]*).>(.*)</option>";
    Pattern pattern = Pattern.compile(regEx);
    Matcher m;
    Vector siteLangs = new Vector();
    try {
        url = new URL(Constants.HOME);
        connection = url.openConnection();
        br = new BufferedReader(new InputStreamReader(connection.getInputStream()));
        while ((inputLine = br.readLine()) != null) {
            m = pattern.matcher(inputLine);
            while (m.find()) {
                System.out.println(m.group(1) + "->" + m.group(2));
                siteLangs.add(m.group(2));
            }
        }
        br.close();
    } catch (IOException e) {
        return siteLangs;
    }
    return siteLangs;
}
Then in the JFrame class I’m doing this:
Vector siteLangs = Language.getSiteLanguages();
JComboBox siteLangCombo = new JComboBox(siteLangs);
But this way all non-Latin languages are lost…
How do I preserve the non-Latin text in this situation?
The InputStreamReader uses the platform default character encoding by default to convert bytes to characters. The website is apparently using a different character encoding to convert characters to bytes in the HTTP response. You need to check the charset attribute of the HTTP Content-Type response header to find out which one it is.
Assuming that it’s UTF-8, which is these days the most commonly used character encoding among websites that strive for world domination, you should specify it as the second argument when constructing the InputStreamReader in your code, i.e. new InputStreamReader(connection.getInputStream(), "UTF-8").
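As a rough sketch of the two steps above (looking up the charset in the Content-Type header, then handing it to the InputStreamReader), something like the following could work. The class name CharsetAwareReader and the methods charsetOf and readLines are hypothetical names made up for this illustration, and falling back to UTF-8 when the header names no charset is an assumption, not something the server guarantees.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; names are illustrative and not part of the original code.
final class CharsetAwareReader {

    // Extracts the charset from a Content-Type value such as
    // "text/html; charset=UTF-8". Falls back to UTF-8 (an assumption) when the
    // header is null, has no charset parameter, or names an unknown charset.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // malformed or unsupported name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Reads all lines from the stream, decoding bytes with the given charset
    // instead of the platform default.
    static List<String> readLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // In-memory stand-in for an HTTP response body encoded as UTF-8.
        // The escapes \u4e2d\u6587 are the two characters of "Chinese" in Chinese.
        byte[] body = "<option value=\"ZH\">\u4e2d\u6587</option>".getBytes(StandardCharsets.UTF_8);
        Charset charset = charsetOf("text/html; charset=UTF-8");
        System.out.println(readLines(new ByteArrayInputStream(body), charset));
    }
}
```

With a real connection you would pass connection.getContentType() to charsetOf and connection.getInputStream() to readLines. Whether the console shows the characters correctly depends on the console's own encoding, which is a separate matter from reading them correctly.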
Unrelated to the concrete problem, Vector is a legacy class which has been superseded by the List interface and its ArrayList implementation since 1998. Are you sure that you’re reading up-to-date resources during your Java learning spree? Further, regex should not be your first choice when you just need to parse HTML. This is Java, not PHP. Use a normal HTML parser. You may find Jsoup helpful in this. The whole code you have so far can then be reduced to two or three lines.
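On the Vector remark: JComboBox has no constructor that takes a List, which may be why Vector ended up in the code in the first place. One possible way around that, sketched here under the assumption that the languages are plain strings, is to collect them in a List and wrap them in a DefaultComboBoxModel. The class name LanguageComboModels and the method modelOf are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import javax.swing.ComboBoxModel;
import javax.swing.DefaultComboBoxModel;

// Hypothetical helper; names are illustrative and not part of the original code.
final class LanguageComboModels {

    // Copies the list into an array-backed combo box model, so the
    // collecting code can use List/ArrayList instead of Vector.
    static ComboBoxModel<String> modelOf(List<String> languages) {
        return new DefaultComboBoxModel<>(languages.toArray(new String[0]));
    }

    public static void main(String[] args) {
        List<String> siteLangs = new ArrayList<>();
        siteLangs.add("English");
        siteLangs.add("\u4e2d\u6587"); // "Chinese" written in Chinese
        ComboBoxModel<String> model = modelOf(siteLangs);
        System.out.println(model.getSize() + " languages in the model");
    }
}
```

In the JFrame class this would then be used as new JComboBox<>(LanguageComboModels.modelOf(siteLangs)). Later changes to the list are not reflected in the model, since the elements are copied.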