I was fetching data from a website using its API which was returning the data in JSON format.
The problem appeared whenever the JSON contained umlaut characters. The API returned them as Unicode escapes, e.g. Münich came back as Mu\u0308nich.
When I passed this JSON string to the constructor of org.codehaus.jettison.json.JSONObject, Mu\u0308nich was converted to Munich with the umlaut on the n, which is wrong.
I realized this very late (after fetching all the data). Now I use the following method to convert it back to the escaped Unicode form, i.e. I pass Munich (n has an umlaut) to the method and it returns Mu\u0308nich.
I want to somehow convert this Mu\u0308nich to Münich. Any ideas?
Please note the conversion is needed only for u\u0308 to ü, o\u0308 to ö, a\u0308 to ä, and so on.
Method used to convert back –
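import java.util.Formatter;

// Replaces every non-ASCII char with its backslash-u hex escape; ASCII is copied unchanged.
public static String escapeUnicode(String input) {
    StringBuilder b = new StringBuilder(input.length());
    Formatter f = new Formatter(b); // writes formatted output directly into b
    for (char c : input.toCharArray()) {
        if (c < 128) {
            b.append(c);
        } else {
            f.format("\\u%04x", (int) c);
        }
    }
    return b.toString();
}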
These are called diacritics, and you can use java.text.Normalizer to combine each base letter and its diacritic into a single Unicode character.
Use the normalize method with Form.NFKC. This first decomposes the full string into base letters and diacritics and then composes them again to return 'real' Unicode umlauts. So 'München' stays 'München' and 'Mu\u0308nchen' becomes 'München'.
You will then have the string in a single, consistent form that no longer uses separate diacritics and is easily portable and displayable.
If you work with texts from different platforms, some normalization is crucial or you will end up with the problems you described.
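For illustration, here is a minimal sketch of the normalization step described above. It assumes the input string really contains the combining character U+0308 (not the six literal characters of the escape sequence); the class name UmlautNormalizer and the method name compose are made up for this example.

```java
import java.text.Normalizer;

class UmlautNormalizer {

    // Combines base letter + combining mark pairs (e.g. 'u' followed by U+0308)
    // into single precomposed characters (e.g. U+00FC) using form NFKC.
    static String compose(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // 'u' followed by the combining diaeresis U+0308
        String decomposed = "Mu\u0308nchen";
        String composed = compose(decomposed);

        System.out.println(decomposed.length());             // 8
        System.out.println(composed.length());               // 7
        System.out.println(composed.equals("M\u00fcnchen")); // true
    }
}
```

If you only want composition and nothing else changed, Form.NFC should be enough; NFKC additionally replaces compatibility characters (ligatures, for example) with their plain equivalents, which may or may not be what you want.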