The functions c32rtomb and mbrtoc32 from <cuchar>/<uchar.h> are described in the C Unicode TR (draft) as performing conversions between UTF-32¹ and “multibyte characters”.
(…) If s is not a null pointer, the c32rtomb function determines the number of bytes needed to represent the multibyte character that corresponds to the wide character given by c32 (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s. (…)
What is this “multibyte character representation”? I’m actually interested in the behaviour of the following program:
#include <cassert>
#include <cuchar>
#include <string>

int main() {
    std::u32string u32 = U"this is a wide string";
    std::string narrow = "this is a wide string";
    std::string converted(1000, '\0');
    char* ptr = &converted[0];
    std::mbstate_t state {};
    for(auto u : u32) {
        ptr += std::c32rtomb(ptr, u, &state);
    }
    converted.resize(ptr - &converted[0]);
    assert(converted == narrow);
}
Is the assertion in it guaranteed to hold¹?
¹ Working under the assumption that __STDC_UTF_32__ is defined.
For the assertion to be guaranteed to hold, the multibyte encoding used by c32rtomb() must be the same as the encoding used for string literals, at least for the characters actually used in the string.
C99 7.11.1.1/2 specifies that setlocale() with the category LC_CTYPE affects the behavior of the character handling functions and the multibyte and wide character functions. I don’t see any explicit acknowledgement that the effect is to set the multibyte and wide character encodings used; however, that is the intent.
The program never calls setlocale(), and C99 7.11.1.1/4 specifies that the equivalent of setlocale(LC_ALL, "C") is executed at program startup. So the multibyte encoding used by c32rtomb() is the multibyte encoding from the default “C” locale.
C++11 2.14.3/2 specifies that the execution encoding, wide execution encoding, UTF-16, and UTF-32 are used for the corresponding character and string literals. Therefore std::string narrow uses the execution encoding to represent that string.
So is the “C” locale encoding of this string the same as the execution encoding of this string?
C99 7.11.1.1/3 specifies that the “C” locale provides “the minimal environment” for C translation. Such an environment would include not only character sets, but also the specific character codes used. So I believe this means not only that the “C” locale must support the characters required in translation (i.e., the basic character set), but additionally that those characters in the “C” locale must use the same character codes.
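The premise that the program runs in the “C” locale can be checked directly, since passing a null pointer to std::setlocale() queries the current locale without changing it. The following is a minimal sketch; the helper name ctype_locale_is_c is made up for illustration, and it assumes the implementation reports the startup locale under the name "C".

```cpp
#include <clocale>
#include <cstring>

// Hypothetical helper (not part of any library): reports whether the
// LC_CTYPE category currently in effect is named "C".
bool ctype_locale_is_c() {
    // A null second argument only queries the locale; nothing is modified.
    const char* name = std::setlocale(LC_CTYPE, nullptr);
    return name != nullptr && std::strcmp(name, "C") == 0;
}
```

Called at the top of main(), before any call to setlocale(), this should return true on a conforming implementation.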
All of the characters in your string literals are members of the basic character set, and therefore converting the char32_t representation to the char “C” locale representation must produce the same sequence of values as the compiler produces for the char string literal; the assertion must hold true.
I don’t see any suggestion that anything beyond the basic character set is supported in a compatible way between the execution encoding and the “C” locale, so if your string literal used any characters outside the basic character set then there would not be any guarantee that the assertion would hold. Even stipulating extended characters that exist in both the execution character set and the “C” locale, I don’t see any requirement that the representations match each other.
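Because characters outside the basic character set may not be representable in the “C” locale at all, a conversion loop that is meant to handle them probably ought to check the return value of c32rtomb(), which is (size_t)-1 on an encoding error. Here is a sketch of such a loop; the function name to_multibyte and its bool-plus-out-parameter shape are my own choices for illustration, not anything taken from the standard.

```cpp
#include <climits>  // MB_LEN_MAX
#include <cstddef>  // std::size_t
#include <cuchar>   // std::c32rtomb, std::mbstate_t
#include <string>

// Hypothetical helper: converts a UTF-32 string to the multibyte encoding of
// the current LC_CTYPE locale. Returns false if some character cannot be
// represented in that encoding; `out` then holds the part converted so far.
bool to_multibyte(const std::u32string& in, std::string& out) {
    out.clear();
    std::mbstate_t state{};  // zero-initialized: initial conversion state
    char buf[MB_LEN_MAX];    // large enough for any single multibyte character
    for (char32_t c : in) {
        std::size_t n = std::c32rtomb(buf, c, &state);
        if (n == static_cast<std::size_t>(-1)) {
            return false;    // encoding error reported by c32rtomb
        }
        out.append(buf, n);
    }
    return true;
}
```

With the string from the question, which uses only basic character set members, the call should succeed and yield a result equal to the narrow literal. For input containing extended characters, whether it succeeds in the “C” locale depends on the implementation.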