Suppose we have an arbitrary string, s.
s has the property of being from just about anywhere in the world. People from USA, Japan, Korea, Russia, China and Greece all write into s from time to time. Fortunately we don’t have time travellers using Linear A, however.
For the sake of discussion, let’s presume we want to do string operations such as:
- reverse
- length
- capitalize
- lowercase
- index into
and, just because this is for the sake of discussion, let’s presume we want to write these routines ourselves (instead of grabbing a library), and we have no legacy software to maintain.
There are 3 encoding forms for Unicode: UTF-8, UTF-16, and UTF-32, each with pros and cons. But let’s say I’m sorta dumb, and I want one Unicode encoding to rule them all (because rolling a dynamically adapting library for 3 different kinds of string encodings that hides the difference from the API user sounds hard).
- Which encoding is most general?
- Which encoding is supported by wchar_t?
- Which encoding is supported by the STL?
- Are these encodings all (or not at all) null-terminated?
—
The point of this question is to educate myself and others in useful and usable information for Unicode: reading the RFCs is fine, but there’s a ‘stack’ of information related to compilers, languages, and operating systems that the RFCs do not cover, but is vital to know to actually use Unicode in a real app.
Which encoding is most general
Probably UTF-32, though all three formats can store any character. UTF-32 has the property that every code point is encoded in a single code unit, whereas UTF-8 and UTF-16 may need several code units per code point.
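To make the code-unit difference concrete, here is a minimal sketch that computes how many code units one code point occupies in each encoding form. The function names (utf8_units, utf16_units, utf32_units) are made up for illustration and do not come from any library, and the input is assumed to be a valid Unicode scalar value.

```cpp
#include <cstddef>

// Hypothetical helpers: code units needed to encode one code point.
// Assumes cp is a valid Unicode scalar value (<= 0x10FFFF, not a surrogate).

std::size_t utf8_units(char32_t cp) {
    if (cp < 0x80)    return 1;  // ASCII range
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                    // supplementary planes
}

std::size_t utf16_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;  // a surrogate pair above the BMP
}

std::size_t utf32_units(char32_t /*cp*/) {
    return 1;  // always one 32-bit code unit
}
```

For U+20AC (the euro sign) these return 3, 1, and 1; for U+1F600 they return 4, 2, and 1. Keep in mind that one user-perceived character can still span several code points (a base letter plus combining marks), so even UTF-32 does not make "index into the n-th character" trivial.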
Which encoding is supported by wchar_t
None. That’s implementation-defined. On Windows, wchar_t is 16 bits wide and normally holds UTF-16; on most Unix platforms it is 32 bits wide and normally holds UTF-32.
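Since the width is up to the implementation, the only reliable way to know is to ask the compiler. The following is a rough sketch; likely_wide_encoding is a hypothetical helper, and its guess is based on common convention only, not on anything the standard guarantees.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on the current implementation.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: guess the wide encoding from the width alone.
// This reflects common platform convention, not a language guarantee.
const char* likely_wide_encoding() {
    return wchar_bits() == 16 ? "UTF-16"
         : wchar_bits() == 32 ? "UTF-32"
         : "unknown";
}
```

If you need a fixed width regardless of platform, C++11 added char16_t and char32_t (with std::u16string and std::u32string), which avoid depending on wchar_t at all.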
Which encoding is supported by the STL
None really. The STL can store any type of character you want. Just use the std::basic_string<T> template with a type large enough to hold your code point. Most operations (e.g. std::reverse) do not know about any sort of Unicode encoding though.
Are these encodings all (or not at all) null-terminated?
No. The null character (NUL, U+0000) is a legal value in any of those encodings. Technically, NUL is a legal character in plain ASCII too. Null termination is a C convention, not a property of the encoding.
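A small sketch of the difference, assuming plain std::string as a byte container: the string object tracks its own length, so it can hold a NUL in the middle, while any C-style API that receives the same buffer stops at the first NUL. The helper names are made up for this example.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Build a 5-byte string with a NUL in the middle.
// The length is passed explicitly, so the constructor does not stop at '\0'.
std::string with_embedded_nul() {
    return std::string("ab\0cd", 5);
}

// Length as a C API would see it: counting stops at the first NUL.
std::size_t c_length(const std::string& s) {
    return std::strlen(s.c_str());
}
```

Here with_embedded_nul().size() is 5, but c_length reports 2 for the same data, which is why strings that may contain NUL need to travel with an explicit length.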
Choosing how to do this has a lot to do with your platform. If you’re on Windows, use UTF-16 and wchar_t strings, because that’s what the Windows API uses to support Unicode. I’m not entirely sure what the best choice is for UNIX platforms, but I do know that most of them use UTF-8.
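Whichever encoding the platform pushes you toward, the caveat about std::reverse still applies: it reverses code units, not characters. Here is a minimal sketch contrasting a byte-wise reverse of UTF-8 data with a reverse over UTF-32 code points; both function names are invented for this example.

```cpp
#include <algorithm>
#include <string>

// Reverses raw bytes. On UTF-8 input this scrambles every multi-byte
// sequence, so the result is generally not valid UTF-8.
std::string reverse_bytes(std::string s) {
    std::reverse(s.begin(), s.end());
    return s;
}

// Reverses UTF-32 code units, which are whole code points, so the result
// is still well-formed UTF-32.
std::u32string reverse_code_points(std::u32string s) {
    std::reverse(s.begin(), s.end());
    return s;
}
```

Given the UTF-8 bytes for "a€" (0x61 0xE2 0x82 0xAC), reverse_bytes produces 0xAC 0x82 0xE2 0x61, which no longer decodes as a euro sign. reverse_code_points keeps each code point intact, but it can still detach a combining mark from its base letter, so a truly correct reverse has to work on grapheme clusters.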