While trying to process a JSON response with GSON (the output is from the Flickr API, in case you're wondering), I encountered what I'd describe as a pretty weird encoding of certain special chars:

Here’s a hex view of it:

The ‘u’ followed by the ‘double-dots’ is what’s supposed to be a German ‘ü’, and this is where my confusion starts. It’s as if someone took the char and ripped it in half, encoding each of the 2 pieces. The following image shows the hex encoding of what I’d expect it to be in case the ‘ü’ was correctly encoded:

Even weirder, in cases where I would expect problems to occur (namely, Asian character sets), everything seems to work fine, e.g. “title”: “ナガレテユク・・・”
Questions:
- Is this some Flickr API oddity, or is it correct JSON encoding for the response? Or is the JSON correctly encoded and it's GSON that's failing to ‘re-assemble’ the response into the original ‘ü’? Or did the author of the title simply get it wrong on his end?
- How do I solve the problem (in case it's either JSON or GSON that's at fault; obviously I can't do anything if it was the author)? And how do I know which ‘other’ chars are affected (ö and ä come to mind, but there are probably more ‘special cases’)?
What you’re seeing there is a case of Unicode decomposition:
Characters like German umlauts can be expressed in two ways:
- as a single precomposed character, ü (U+00FC), or
- as u followed by a combining diaeresis (U+0308), ̈_ (I had to use an underscore here to make it show up because it's not supposed to stand alone; it's really just the two “hovering dots”)
If you receive something like this, it's easily converted into precomposed form (NFC) by using java.text.Normalizer (available since Java 1.6). Applying NFC to an already precomposed string doesn't hurt.
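
A minimal sketch of that conversion, assuming the decomposed input is u followed by U+0308. The class and helper names (`NormalizeDemo`, `toNfc`, `codeUnits`) are made up for illustration; only `java.text.Normalizer` comes from the JDK:

```java
import java.text.Normalizer;

class NormalizeDemo {
    // Convert decomposed or precomposed input to precomposed form (NFC).
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Render each UTF-16 code unit as hex so the two forms can be told apart.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String decomposed = "u\u0308";   // u + combining diaeresis
        String precomposed = "\u00fc";   // single precomposed character

        System.out.println(codeUnits(decomposed));          // 0075 0308
        System.out.println(codeUnits(toNfc(decomposed)));   // 00fc
        System.out.println(codeUnits(toNfc(precomposed)));  // 00fc (unchanged)
        System.out.println(toNfc(decomposed).equals(precomposed)); // true
    }
}
```

In the GSON case, the same call would presumably be applied to each string field after parsing.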
Note that printing the String will look correct on any Unicode-capable terminal; only if you print the character array do you see the difference between decomposed and precomposed form.
A possible source might be MacOS, which tends to encode things in decomposed form. It's curious that Flickr doesn't normalize this stuff, though.
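
As for which other characters are affected: rather than listing them by hand, one option is to ask the JDK whether a string is already in NFC. The sketch below uses `Normalizer.isNormalized`; the class name `DecompositionCheck`, the helper `needsNfc`, and the sample strings are illustrative assumptions, not taken from the Flickr response:

```java
import java.text.Normalizer;

class DecompositionCheck {
    // True if NFC would rewrite the string, e.g. base letter + combining mark.
    static boolean needsNfc(String s) {
        return !Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Decomposed o, a, e with combining marks, then a precomposed and a plain sample.
        String[] samples = { "o\u0308", "a\u0308", "e\u0301", "\u00f6", "abc" };
        for (String s : samples) {
            System.out.println(s.length() + " code unit(s), needs NFC: " + needsNfc(s));
        }
    }
}
```

Since normalizing an already precomposed string is harmless, simply normalizing every incoming string is likely simpler than checking first; the check is mainly useful for diagnosing where decomposed text comes from.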