We are testing our application for Unicode compatibility and have been selecting random characters outside the Latin character set for testing.
On both Latin and Japanese-collated systems the following equality is true (U+3422):
N'㐢㐢㐢㐢' = N'㐢㐢㐢'
but the following is not (U+30C1):
N'チチチチ' = N'チチチ'
This was discovered when a test case using the first example (using U+3422) violated a unique index. Do we need to be more selective about the characters we use for testing? Obviously we don’t know the semantic meaning of the above comparisons. Would this behavior be obvious to a native speaker?
Michael Kaplan has a blog post where he explains how Unicode strings are compared. It all comes down to the point that a string needs to have a weight, if it doesn’t it will be considered equal to the empty string.
In SQL Server this weight is influenced by the defined collation. Microsoft has added appropriate collations for CJK Unified Ideographs in Windows XP/2003 and SQL Server 2005. This post recommends to use
Chinese_Simplified_Pinyin_100_CI_ASorChinese_Simplified_Stroke_Order_100_CI_AS:So the following SQL statement would work as expected:
A list of all supported collations can be found in MSDN: