I am developing a platform-independent library in ANSI C (no C++ and no non-standard libraries such as the MS CRT or glibc, …).
After some searching, I found that one of the best ways to handle internationalization in ANSI C is to use the UTF-8 encoding.
In UTF-8:
- strlen(s): always counts the number of bytes.
- mbstowcs(NULL, s, 0): counts the number of characters (provided the current locale's multibyte encoding is UTF-8).
But I run into a problem when I want random access to the elements (characters) of a UTF-8 string.
In ASCII encoding:
char get_char(char* ascii_str, int n)
{
    // It is very FAST.
    return ascii_str[n];
}
In UTF-16/32 encoding:
wchar_t get_char(wchar_t* wstr, int n)
{
    // It is very FAST.
    return wstr[n];
}
And here is my problem with UTF-8 encoding:
// What is the return type?
// Because a UTF-8 character can occupy 8, 16, 24, or 32 bits.
/*?*/ get_char(char* utf8str, int n)
{
    // I can find the Nth character of the string with a for loop.
    // But it is too slow.
    // What is the best way?
}
Thanks.
Perhaps you’re thinking about this the wrong way. UTF-8 is an encoding that is useful for serializing data, e.g. writing it to a file or sending it over the network. It is a non-trivial, variable-width encoding, though: each Unicode codepoint takes between one and four bytes, so the Nth codepoint of a string does not sit at any predictable byte offset.
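To make the variable width concrete, here is a minimal sketch of an encoder for a single codepoint. The name utf8_encode is hypothetical (it is not a standard library function), and uint32_t assumes a compiler that provides <stdint.h>, which strict ANSI C89 does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Encodes one codepoint as UTF-8 into out[0..3].
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate or lies beyond U+10FFFF and so cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 7 bits: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 16 bits: three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not valid */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 21 bits: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this sketch, U+0041 ("A") takes one byte, U+00E9 takes two, U+20AC (the euro sign) takes three, and U+1F600 takes four, which is why indexing by byte offset cannot find the Nth codepoint directly.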
What you should probably do, if you want to handle text (given your description), is to store raw, fixed-width strings internally. If you’re going for Unicode (which you should), then you need 21 bits per codepoint, so the nearest integral type is uint32_t. In short, store all your strings internally as arrays of integers. Then you can random-access each codepoint. Only encode to UTF-8 when you are writing to a file or console, and decode from UTF-8 when reading.
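A minimal sketch of that approach might look like the following. The names utf8_decode and get_char are hypothetical, uint32_t again assumes <stdint.h> is available, and the decoder is deliberately simplified: it rejects truncated or stray bytes but does not check for overlong forms or surrogates, which a production decoder should.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Decodes the NUL-terminated UTF-8 string s into out, which has room
 * for cap codepoints. Returns the number of codepoints written, or
 * (size_t)-1 if the input is malformed or out is too small. */
size_t utf8_decode(const char *s, uint32_t *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    assert(s != NULL && out != NULL);
    while (*p != 0) {
        uint32_t cp;
        int extra, i;
        /* The lead byte tells us how many continuation bytes follow. */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;       /* stray continuation or invalid byte */
        p++;
        for (i = 0; i < extra; i++) {
            /* Each continuation byte must look like 10xxxxxx; this
             * check also stops at a premature terminating NUL. */
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
            p++;
        }
        if (n == cap)
            return (size_t)-1;        /* output buffer too small */
        out[n++] = cp;
    }
    return n;
}

/* Once decoded, the Nth codepoint is a plain array lookup. */
uint32_t get_char(const uint32_t *str, size_t n)
{
    return str[n];
}
```

For example, decoding the seven bytes "a\xC3\xA9\xE2\x82\xAC" gives three array elements (U+0061, U+00E9, U+20AC), and get_char(buf, 2) returns 0x20AC in constant time. The decoding pass is linear, but it is paid once when the text is read rather than on every access.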
By the way, a Unicode codepoint is still a long way from a character. The concept of a character is just far too high-level to be captured by a simple, general mechanism. (E.g. “a” + “accent grave”: two codepoints, how many characters?)
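The "a" + "accent grave" case can be checked with a small sketch. The function name utf8_count_codepoints is hypothetical, and the count is only meaningful for well-formed UTF-8, since it simply counts every byte that is not a continuation byte (10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
 * Counts the codepoints in a well-formed, NUL-terminated UTF-8 string
 * by counting the bytes that start a sequence, i.e. every byte that
 * is not a continuation byte of the form 10xxxxxx. */
size_t utf8_count_codepoints(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

The precomposed form "\xC3\xA0" (U+00E0) counts as one codepoint, while the decomposed form "a\xCC\x80" (U+0061 followed by the combining grave accent U+0300) counts as two, even though both are normally displayed as the same single character.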