Many sequences of encoded Unicode characters have the same visual representation and the same computational meaning.
The ñ character can be coded two ways:
U+00F1: ñ (LATIN SMALL LETTER N WITH TILDE)
or:
U+006E: n (LATIN SMALL LETTER N)
U+0303: ~ (COMBINING TILDE)
This creates 10 different byte sequences that display as ñ:
U+00F1 in UTF-8, UTF-16LE, UTF-16BE, UTF-32BE, UTF-32LE
U+006E followed by U+0303 in UTF-8, UTF-16LE, UTF-16BE, UTF-32BE, UTF-32LE
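To make the count concrete, here is a small Python sketch that encodes both forms in the five encodings listed above. The codec names are the ones Python's standard library uses; the variable names are just for illustration.

```python
# Two code point sequences that both render as ñ.
precomposed = "\u00f1"    # LATIN SMALL LETTER N WITH TILDE
decomposed = "n\u0303"    # LATIN SMALL LETTER N + COMBINING TILDE

# Python codec names for the five encodings mentioned above.
encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-be", "utf-32-le"]

# Collect every distinct byte sequence produced by either form.
byte_sequences = set()
for text in (precomposed, decomposed):
    for enc in encodings:
        data = text.encode(enc)
        byte_sequences.add(data)
        print(f"{enc:10} {data.hex(' ')}")

print(len(byte_sequences))  # 10
```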
Is there any straightforward way to compare Unicode strings (I'm happy to work with Unicode characters that have already been decoded from the various UTF representations) and find out that they are the same? That is, I want something that tells me that U+00F1 is the same as U+006E U+0303.
Thanks.
The process is called normalization, and any decent Unicode library supports it. Backgrounder is here.
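As a minimal sketch of what that looks like in practice, assuming Python and its standard `unicodedata` module (the helper name `same_text` is hypothetical, not part of any library): normalize both strings to the same form, then compare. NFC composes to the precomposed form where one exists, and NFD decomposes; either works for comparison as long as both sides use the same form.

```python
import unicodedata

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form.

    Hypothetical helper for illustration only.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

precomposed = "\u00f1"    # U+00F1
decomposed = "n\u0303"    # U+006E U+0303

print(precomposed == decomposed)            # False: different code points
print(same_text(precomposed, decomposed))   # True: equal after normalization
```

If you also want compatibility characters (ligatures, full-width forms and the like) folded together, the NFKC and NFKD forms do that, at the cost of losing some distinctions.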