I found the C standard (C99 and C11) vague with respect to character/string code positions and encoding rules:
Firstly, the standard defines the source character set and the execution character set.
Essentially it provides a set of glyphs, but does not associate any numerical values
with them. So what is the default character set?
I’m not asking about encoding here but just the glyph/repertoire to numeric/code point mapping.
It does define universal character names in terms of ISO/IEC 10646, but does it say that
this is the default charset?
As an extension to the above: I couldn’t find anything which says what characters
the numeric escape sequences (octal, such as \0, and hexadecimal, \x...) represent.
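
To make the escape-sequence question concrete, here is a minimal sketch. It assumes my reading of C11 6.4.4.4 is right, namely that an octal or hexadecimal escape specifies the numeric value of the character directly rather than naming a glyph. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if octal and hex escapes store exactly the values written. */
int escapes_store_raw_values(void)
{
    /* Two code units, 0x41 and 0102 (octal, = 66), plus the terminating 0. */
    const char s[] = "\x41\102";
    assert(sizeof s == 3);
    return (unsigned char)s[0] == 0x41
        && (unsigned char)s[1] == 66
        && s[2] == 0;
}

/* Returns 1 only where the execution character set puts 'A' at 0x41
   (true for ASCII-compatible sets, not for EBCDIC, for example). */
int letter_a_is_0x41(void)
{
    return 'A' == '\x41';
}
```

The first function should return 1 on any conforming implementation. The second depends on the execution character set, which is exactly the part the standard leaves open.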
From the C standards (C99 and C11, I didn’t check ANSI C) I got the following
about character and string literals:
+---------+-----+------------+----------------------------------------------+
| Literal | Std | Type       | Meaning                                      |
+---------+-----+------------+----------------------------------------------+
| '...'   | C99 | int        | An integer character constant is a sequence  |
|         |     |            | of one or more multibyte characters          |
| L'...'  | C99 | wchar_t    | A wide character constant is a sequence of   |
|         |     |            | one or more multibyte characters             |
| u'...'  | C11 | char16_t   | A wide character constant is a sequence of   |
|         |     |            | one or more multibyte characters             |
| U'...'  | C11 | char32_t   | A wide character constant is a sequence of   |
|         |     |            | one or more multibyte characters             |
| "..."   | C99 | char[]     | A character string literal is a sequence of  |
|         |     |            | zero or more multibyte characters            |
| L"..."  | C99 | wchar_t[]  | A wide string literal is a sequence of zero  |
|         |     |            | or more multibyte characters                 |
| u"..."  | C11 | char16_t[] | A wide string literal is a sequence of zero  |
|         |     |            | or more multibyte characters                 |
| U"..."  | C11 | char32_t[] | A wide string literal is a sequence of zero  |
|         |     |            | or more multibyte characters                 |
| u8"..." | C11 | char[]     | A UTF-8 string literal is a sequence of zero |
|         |     |            | or more multibyte characters                 |
+---------+-----+------------+----------------------------------------------+
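
The Type column can be checked at compile time. This is only a sketch, assuming a C11 compiler that ships <uchar.h>. It compares sizes rather than types, because wchar_t, char16_t and char32_t are typedefs that may coincide with each other or with int on a given platform. The helper u32_literal_length is a hypothetical name.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t, wchar_t */
#include <uchar.h>    /* char16_t, char32_t (C11) */

/* Character constants: each has the size of the type listed in the table. */
static_assert(sizeof 'a'  == sizeof(int),      "'a' has type int in C");
static_assert(sizeof L'a' == sizeof(wchar_t),  "L'a' has type wchar_t");
static_assert(sizeof u'a' == sizeof(char16_t), "u'a' has type char16_t");
static_assert(sizeof U'a' == sizeof(char32_t), "U'a' has type char32_t");

/* String literals: arrays of the element type, terminator included. */
static_assert(sizeof "ab"   == 3,                    "3 chars");
static_assert(sizeof L"ab"  == 3 * sizeof(wchar_t),  "3 wchar_t elements");
static_assert(sizeof u"ab"  == 3 * sizeof(char16_t), "3 char16_t elements");
static_assert(sizeof U"ab"  == 3 * sizeof(char32_t), "3 char32_t elements");
static_assert(sizeof u8"ab" == 3,                    "3 one-byte elements");

/* Number of elements in U"ab", counting the terminating zero. */
size_t u32_literal_length(void)
{
    return sizeof U"ab" / sizeof(char32_t);
}
```

None of this says anything about the values stored in those elements, only about their types and counts.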
However, I couldn’t find anything about the encoding rules for these literals.
The name "UTF-8 string literal" does seem to hint at UTF-8 encoding, but I don’t think it’s explicitly mentioned
anywhere. Also, for the other types, is the encoding undefined or implementation-defined?
I’m not too familiar with the UNIX specification. Does the UNIX specification place any additional constraint(s) on these rules?
It would also help if anyone can tell me what charset/encoding scheme is used by GCC and MSVC.
C is not picky about character sets. There’s no such thing as a “default character set”: the choice is implementation-defined, although in practice it is ASCII or UTF-8 on most modern systems.
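
The one case where the encoding is pinned down is the u8 prefix: as far as I can tell, C11 6.4.5 says the elements of a UTF-8 string literal are initialized with the characters "as encoded in UTF-8". A minimal sketch of what that means, using a universal character name so the result does not depend on the source file's encoding (the function name is made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the bytes of u8"\u00e9" (U+00E9, e with acute accent) into out,
   without the terminator, and returns how many bytes were copied. */
size_t utf8_bytes_of_e_acute(unsigned char out[4])
{
    const char s[] = u8"\u00e9";
    size_t n = sizeof s - 1;      /* drop the terminating zero */
    assert(n <= 4);               /* UTF-8 needs at most 4 bytes per character */
    memcpy(out, s, n);
    return n;
}
```

With a C11 compiler this should give the two bytes 0xC3 0xA9. The same character in a plain "..." literal is encoded in whatever the execution character set is. GCC documents -fexec-charset for choosing that set and states a default of UTF-8, and MSVC has /execution-charset and /utf-8 for the same purpose.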