I’m using ICU with the Lithuanian (lt_LT) locale. The alphabet for this language is: a ą b c č d e ę ė <...> v z ž
However, when sorting, ICU’s collator treats, for example, a and ą (a with ogonek) as equivalent, so a list of Lithuanian words gets sorted like this:
a, ą, ab, aba, abadas, <...>, b, ba, <...>
The expected result would be:
a, ab, aba, abadas, <...>, ą, <...>, b, ba, <...>
The same happens with other “accented” letters (e – ę – ė, z – ž, etc.)
A more specific test case: running source/samples/coll/coll -locale lt_LT -source ą -target aa reports that the source is less than the target, which is not what I expect (see coll.cpp if you need to).
Is this behavior expected? Is this a bug or a feature? If it is expected, how can I prevent ICU’s collator from grouping “similar” letters together?
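These letters are listed as secondary differences in the CLDR tailoring, so this is the sort order you will get: an accent only breaks a tie between strings that are otherwise equal at the primary (base letter) level. If you think this is wrong, bring it up with CLDR; it is not an ICU problem. Mimer agrees.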
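
To make the primary/secondary distinction concrete, here is a minimal toy model in plain Python. It is not ICU's algorithm or CLDR's data: the BASE and SECONDARY tables and both key functions are hypothetical, cover only the accented letters mentioned in the question, and exist purely to show why the two orderings differ.

```python
# Toy model of multi-level collation. NOT ICU's implementation or CLDR data;
# the tables below are illustrative assumptions covering a few letters only.

# Base letter for each accented letter (partial, hypothetical table).
BASE = {"ą": "a", "č": "c", "ę": "e", "ė": "e", "ž": "z"}

# Secondary rank: 0 means "no accent"; higher values order the accented variants.
SECONDARY = {"ą": 1, "č": 1, "ę": 1, "ė": 2, "ž": 1}


def secondary_difference_key(word):
    """Accents are a secondary difference: compare all base letters first,
    and look at accents only if the base letters are identical."""
    primary = tuple(BASE.get(ch, ch) for ch in word)
    secondary = tuple(SECONDARY.get(ch, 0) for ch in word)
    return (primary, secondary)


def primary_difference_key(word):
    """Accents are a primary difference: each accented letter is its own
    letter, sorting immediately after its base letter."""
    return tuple((BASE.get(ch, ch), SECONDARY.get(ch, 0)) for ch in word)


words = ["b", "abadas", "ą", "ab", "a", "ba", "aba"]

observed = sorted(words, key=secondary_difference_key)
expected = sorted(words, key=primary_difference_key)

print(", ".join(observed))  # a, ą, ab, aba, abadas, b, ba
print(", ".join(expected))  # a, ab, aba, abadas, ą, b, ba
```

Under the first key, ą compares less than aa, which matches the test case in the question; under the second key it compares greater. In ICU itself, the analogous change would presumably be a custom tailoring (for example via RuleBasedCollator) whose rules use the primary-difference operator, as in &a < ą, instead of the secondary one; treat that as a pointer to investigate rather than tested code.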