I am looking for a way to compare and sort UTF-8 strings in C++ case-insensitively, for use in a custom collation function in SQLite.
- The method should ideally be locale-independent. However, I won’t hold my breath: as far as I know, collation is very language-dependent, so anything that works for languages other than English will do, even if it means switching locales.
- Options include the standard C or C++ library, or a third-party library that is small (suitable for an embedded system) and non-GPL (suitable for a proprietary system).
What I have so far:
- strcoll with C locales and std::collate/std::collate_byname are case-sensitive. (Are there case-insensitive versions of these?)
- I tried to use POSIX strcasecmp, but it seems not to be defined for locales other than "POSIX": "In the POSIX locale, strcasecmp() and strncasecmp() do upper to lower conversions, then a byte comparison. The results are unspecified in other locales."

And, indeed, the result of strcasecmp does not change between locales on Linux with GLIBC.

    #include <clocale>
    #include <cstdio>
    #include <cassert>
    #include <cstring>
    #include <strings.h>  // strcasecmp is declared here on POSIX systems

    const static char *s1 = "Äaa";
    const static char *s2 = "äaa";

    int main() {
        // Default "C" locale
        printf("strcasecmp('%s', '%s') == %d\n", s1, s2, strcasecmp(s1, s2));
        printf("strcoll('%s', '%s') == %d\n", s1, s2, strcoll(s1, s2));

        // Note: the setlocale calls vanish if this is compiled with NDEBUG
        assert(setlocale(LC_ALL, "en_AU.UTF-8"));
        printf("strcasecmp('%s', '%s') == %d\n", s1, s2, strcasecmp(s1, s2));
        printf("strcoll('%s', '%s') == %d\n", s1, s2, strcoll(s1, s2));

        assert(setlocale(LC_ALL, "fi_FI.UTF-8"));
        printf("strcasecmp('%s', '%s') == %d\n", s1, s2, strcasecmp(s1, s2));
        printf("strcoll('%s', '%s') == %d\n", s1, s2, strcoll(s1, s2));
    }

This is printed:

    strcasecmp('Äaa', 'äaa') == -32
    strcoll('Äaa', 'äaa') == -32
    strcasecmp('Äaa', 'äaa') == -32
    strcoll('Äaa', 'äaa') == 7
    strcasecmp('Äaa', 'äaa') == -32
    strcoll('Äaa', 'äaa') == 7
P. S.
And yes, I am aware of ICU, but we can’t use it on the embedded platform due to its enormous size.
What you really want is logically impossible. There is no locale-independent, case-insensitive way of sorting strings. The simple counter-example: is ‘i’ different from ‘I’? The naive answer is no, but in Turkish these strings are unequal: ‘i’ is uppercased to ‘İ’ (U+0130 Latin Capital Letter I with Dot Above), not to ‘I’.
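To make the Turkish counter-example concrete, here is a minimal sketch of a lowercasing function that takes the language as a parameter. The names `Language`, `to_lower` and `equal_ignoring_case` are made up for this illustration, and the mapping covers only ASCII plus the two Turkish special cases; it is not a complete case-mapping table.

```cpp
#include <cstddef>
#include <string>

// Hypothetical language switch, for illustration only.
enum class Language { Default, Turkish };

// Lowercase one code point. Covers ASCII plus the Turkish dotted/dotless i.
char32_t to_lower(char32_t c, Language lang) {
    if (lang == Language::Turkish) {
        if (c == U'I') return char32_t{0x0131};  // I -> dotless ı (U+0131)
        if (c == char32_t{0x0130}) return U'i';  // İ (U+0130) -> i
    }
    if (c >= U'A' && c <= U'Z') return static_cast<char32_t>(c + (U'a' - U'A'));
    return c;
}

// Compare two strings of code points after lowercasing each element.
bool equal_ignoring_case(const std::u32string& a, const std::u32string& b,
                         Language lang) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (to_lower(a[i], lang) != to_lower(b[i], lang)) return false;
    }
    return true;
}
```

With `Language::Default`, `U"I"` and `U"i"` compare equal; with `Language::Turkish` they do not, while `U"\u0130"` and `U"i"` do. The same pair of inputs gets two different answers, which is why the language has to be an input to the comparison.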
UTF-8 strings add extra complexity to the question. They’re perfectly valid multi-byte char* strings, if you have an appropriate locale. But neither the C nor the C++ standard defines such a locale; check with your vendor (too many embedded vendors, sorry, no general answer here). So you HAVE to pick a locale whose multi-byte encoding is UTF-8 for locale-aware multi-byte functions such as strcoll to work. This of course influences the sort order, which is locale-dependent. And if you have NO locale in which const char* is UTF-8, you can’t use this trick at all. (As I understand it, Microsoft’s CRT has historically suffered from this: its multi-byte code only handled characters up to 2 bytes, while UTF-8 needs up to 4.)
wchar_t is not the standard solution either. It is supposedly wide enough that you don’t have to deal with multi-byte encodings, but your collation will still depend on the locale (LC_COLLATE). However, using wchar_t means you can now choose locales that do not use UTF-8 for const char*.
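A rough sketch of the wchar_t route, using only standard library calls (`mbsrtowcs`, `towlower`, `wcscoll`). The function names `to_wide` and `wide_nocase_compare` are made up here. It assumes the caller has already selected a suitable locale with `setlocale`; in the default "C" locale it can only be relied on for ASCII, and the raw-byte fallback for undecodable input is just one possible choice.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <cwctype>
#include <string>

// Convert a multi-byte string to wide characters using the current LC_CTYPE.
// Returns false if the bytes are not valid in that locale's encoding.
static bool to_wide(const char* s, std::wstring& out) {
    std::mbstate_t st = std::mbstate_t();
    const char* p = s;
    // With a null destination, mbsrtowcs only counts the wide characters.
    std::size_t n = std::mbsrtowcs(nullptr, &p, 0, &st);
    if (n == static_cast<std::size_t>(-1)) return false;
    out.assign(n, L'\0');
    st = std::mbstate_t();
    p = s;
    if (n > 0) std::mbsrtowcs(&out[0], &p, n, &st);
    return true;
}

// Lowercase every wide character in place, according to the current locale.
static void lower_in_place(std::wstring& w) {
    for (wchar_t& c : w) {
        c = static_cast<wchar_t>(std::towlower(static_cast<std::wint_t>(c)));
    }
}

// Case-insensitive, locale-dependent comparison of two multi-byte strings.
int wide_nocase_compare(const char* a, const char* b) {
    std::wstring wa, wb;
    if (!to_wide(a, wa) || !to_wide(b, wb)) {
        return std::strcmp(a, b);  // fallback: plain byte comparison
    }
    lower_in_place(wa);
    lower_in_place(wb);
    return std::wcscoll(wa.c_str(), wb.c_str());  // order follows LC_COLLATE
}
```

Because `towlower` maps one character to one character, this cannot treat a single letter as equal to a two-letter sequence, and the result still changes with LC_CTYPE and LC_COLLATE.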
With this done, you can basically write your own ordering by converting strings to lowercase and comparing them. It’s not perfect. Do you expect L"ß" == L"ss"? They’re not even the same length. Yet for a German reader you have to consider them equal. Can you live with that?