I’m writing some Java code that deals with Chinese characters, and I got some unexpected results: strings that should be equal were not. Here is one of the offending characters, which means “six” (pinyin: liù): 六. This character can be represented with either of two code points:
- U+F9D1, in the CJK Compatibility Ideographs block
- U+516D, in the CJK Unified Ideographs block
Wikipedia has a page about these character ranges, and the short section on compatibility ideographs does mention some duplicates, but the list omits this specific character.
So I’m wondering:
- Is there a list of duplicate Unicode characters somewhere so I can transform strings before comparing them?
- Is this normal when dealing with CJK characters, or have I done something else wrong?
Just normalize them. U+F9D1 has a canonical decomposition to U+516D, so it becomes U+516D under any of the four normalization forms (NFC, NFD, NFKC, NFKD).
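Since the question is about Java, here is a minimal sketch of normalizing before comparing, using the standard `java.text.Normalizer` class. The class name `NormalizedCompare` and the helpers `nfc` and `equalsNormalized` are made up for illustration, and NFC is just one reasonable choice of form for this character.

```java
import java.text.Normalizer;

public class NormalizedCompare {
    // Hypothetical helper: bring a string into NFC (canonical composition).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Hypothetical helper: compare two strings after normalizing both.
    static boolean equalsNormalized(String a, String b) {
        return nfc(a).equals(nfc(b));
    }

    public static void main(String[] args) {
        String compat = "\uF9D1";  // CJK Compatibility Ideographs
        String unified = "\u516D"; // CJK Unified Ideographs

        System.out.println(compat.equals(unified));            // false
        System.out.println(equalsNormalized(compat, unified)); // true

        // Show the first code point of the result under each form Java supports.
        for (Normalizer.Form form : Normalizer.Form.values()) {
            String out = Normalizer.normalize(compat, form);
            System.out.printf("%s -> U+%04X%n", form, out.codePointAt(0));
        }
    }
}
```

If you compare many strings, it is usually cheaper to normalize once when the text enters your program than to normalize at every comparison. Keep in mind that NFKC and NFKD also fold compatibility characters such as full-width letters, which may or may not be what you want.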
Many essential Unicode tools, including those, are available here.