Suppose we have an arbitrary string, s . s has the property of being

Question

Asked: May 13, 2026 (2026-05-13T19:45:04+00:00)

Suppose we have an arbitrary string, s.

s can come from just about anywhere in the world. People from the USA, Japan, Korea, Russia, China and Greece all write into s from time to time. Fortunately, we don't have time travellers using Linear A.

For the sake of discussion, let’s presume we want to do string operations such as:

reverse
length
capitalize
lowercase
index into

and, just because this is for the sake of discussion, let’s presume we want to write these routines ourselves (instead of grabbing a library), and we have no legacy software to maintain.

There are three encodings for Unicode: UTF-8, UTF-16, and UTF-32, each with pros and cons. But let's say I'm sorta dumb, and I want one Unicode encoding to rule them all (because rolling a dynamically adapting library that handles three different string encodings and hides the difference from the API user sounds hard).

Which encoding is most general?
Which encoding is supported by wchar_t?
Which encoding is supported by the STL?
Are these encodings all (or not at all) null-terminated?

—

The point of this question is to give myself and others useful, usable information about Unicode: reading the RFCs is fine, but there's a 'stack' of information related to compilers, languages, and operating systems that the RFCs do not cover, yet is vital to know to actually use Unicode in a real app.

1 Answer

Editorial Team · Answer 1 · 2026-05-13T19:45:04+00:00

Which encoding is most general
Probably UTF-32, though all three formats can store any character. UTF-32 has the property that every code point is encoded in a single 32-bit code unit, whereas UTF-8 and UTF-16 may need several code units for one code point.
Which encoding is supported by wchar_t
None. That's implementation-defined. On most Windows platforms it's UTF-16; on most Unix platforms it's UTF-32.
Which encoding is supported by the STL
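None, really. The STL can store any type of character you want. Just use the std::basic_string<T> template with a character type T large enough to hold your code point (since C++11, std::u32string is std::basic_string<char32_t>). Most operations (e.g. std::reverse) do not know about any sort of Unicode encoding, though: they work on individual elements of the string, not on characters.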
Are these encodings all(or not at all) null-terminated?
No. U+0000 (NUL) is a legal value in any of those encodings. Technically, NUL is a legal character in plain ASCII too. NUL termination is a C thing, not an encoding thing.
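To make the points above concrete, here is a minimal C++11 sketch. The helper names (utf8_code_points, utf16_code_points, reverse_code_points, with_embedded_nul) are made up for illustration and are not part of any library, and the counting functions assume their input is already well-formed; they do no validation.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Count code points in UTF-8 by skipping continuation bytes (10xxxxxx).
// Assumption: the input is valid UTF-8.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Count code points in UTF-16 by skipping low (trailing) surrogates.
// Assumption: the input is well-formed UTF-16.
std::size_t utf16_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t c : s) {
        if (c < 0xDC00 || c > 0xDFFF) ++n;
    }
    return n;
}

// In UTF-32 one element is one code point, so std::reverse is safe at
// the code point level. It can still separate a combining mark from
// its base character.
std::u32string reverse_code_points(std::u32string s) {
    std::reverse(s.begin(), s.end());
    return s;
}

// std::string stores its length explicitly, so U+0000 can appear in
// the middle of the data.
std::string with_embedded_nul() {
    return std::string("a\0b", 3);
}
```

For the three code points 'h', U+00E9 and U+1F600, the UTF-8 form occupies 7 bytes, the UTF-16 form 4 code units, and the UTF-32 form 3 code units, yet both counting helpers report 3. Calling std::reverse directly on the UTF-8 or UTF-16 form would scramble the multi-unit sequences, which is why the reversal here is done on UTF-32.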

Choosing how to do this has a lot to do with your platform. If you're on Windows, use UTF-16 and wchar_t strings, because that's what the Windows API uses to support Unicode. I'm not entirely sure what the best choice is for UNIX platforms, but I do know that most of them use UTF-8.
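Since the width of wchar_t differs between platforms, it can help to check it rather than assume. The sketch below uses two hypothetical helpers (wide_code_unit_bits and likely_wchar_encoding, named here for illustration only). The mapping from width to encoding is a heuristic based on common practice; the C++ standard does not guarantee it.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Width of one wchar_t element in bits on the current platform.
constexpr std::size_t wide_code_unit_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Heuristic guess only: 16-bit wchar_t usually means UTF-16 (Windows),
// 32-bit usually means UTF-32 (most Unix-like systems). The standard
// itself leaves the encoding implementation-defined.
std::string likely_wchar_encoding() {
    switch (wide_code_unit_bits()) {
        case 16: return "UTF-16";
        case 32: return "UTF-32";
        default: return "unknown";
    }
}
```

Code that must behave identically everywhere may be better served by char16_t or char32_t, whose widths do not vary the way wchar_t does.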
