You can write UTF-8/16/32 string literals in C++11 by prefixing the string literal with u8/u/U respectively. How must the compiler interpret a UTF-8 file that has non-ASCII characters inside these new types of string literals? I understand the standard does not specify file encodings, and that fact alone would seem to leave the interpretation of non-ASCII characters inside source code entirely up to the implementation, making the feature just a tad less useful.
I understand you can still escape single Unicode characters with \uNNNN, but that is not very readable for, say, a full Russian or French sentence, which typically contains more than one non-ASCII character.
What I understand from various sources is that L is effectively equivalent to u on current Windows implementations (where wchar_t is 16 bits wide) and to U on e.g. Linux implementations (where wchar_t is 32 bits wide). So with that in mind, I’m also wondering what the required behavior is for the old string literal modifiers…
For the code-sample monkeys:
std::string a = u8"L'hôtel de ville doit être là-bas. Ça c'est un fait!"; // C++11 to C++17; in C++20 a u8 literal has element type char8_t
std::u16string b = u"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
std::u32string c = U"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
In an ideal world, all of these strings produce the same content (as in: characters after conversion), but my experience with C++ has taught me that this is most definitely implementation defined and probably only the first will do what I want.
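In GCC, use -finput-charset=charset to tell the compiler which encoding the source file is in. Also check out the options -fexec-charset and -fwide-exec-charset, which select the encodings used for ordinary (narrow) and wide string literals in the compiled program.
Finally, about string literals:
The size modifier of the string literal (L, u, U) merely determines the type of the literal.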
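To illustrate that last point, here is a minimal sketch that checks the literal types at compile time. The helper code_units and the file name literals.cpp are hypothetical names introduced for this example, and the compile command in the comment assumes the source file is saved as UTF-8. The literals use \u and \U escapes so the result does not depend on the source file's encoding. The u8 literals are only passed through the generic helper, because their element type is char up to C++17 and char8_t from C++20.

```cpp
// Possible GCC invocation (assumption: this file is saved as UTF-8):
//   g++ -std=c++11 -finput-charset=UTF-8 -fexec-charset=UTF-8 -c literals.cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// The prefix alone fixes the element type of the literal.
static_assert(std::is_same<decltype(L"x"), const wchar_t (&)[2]>::value,
              "L gives wchar_t");
static_assert(std::is_same<decltype(u"x"), const char16_t (&)[2]>::value,
              "u gives char16_t");
static_assert(std::is_same<decltype(U"x"), const char32_t (&)[2]>::value,
              "U gives char32_t");

// U+00E9 (e with acute accent): the same character needs a different
// number of code units in each encoding.
static_assert(code_units(u8"\u00e9") == 2, "two UTF-8 code units");
static_assert(code_units(u"\u00e9") == 1, "one UTF-16 code unit");
static_assert(code_units(U"\u00e9") == 1, "one UTF-32 code unit");

// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs
// a surrogate pair.
static_assert(code_units(u8"\U0001F600") == 4, "four UTF-8 code units");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16 surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "one UTF-32 code unit");
```

If the file compiles, every assertion holds. Writing the same characters directly in the source instead of as escapes should give the same counts for u8, u and U literals, provided the compiler decodes the file with the right input charset.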