My OS is Debian, my default locale is UTF-8, and my compiler is gcc. By default, CHAR_BIT in limits.h is 8, which is OK for ASCII because in ASCII 1 char = 8 bits. But since I am using UTF-8, chars can be up to 32 bits, which contradicts the CHAR_BIT default value of 8.
If I modify CHAR_BIT to 32 in limits.h to better suit UTF-8, what do I have to do for this new value to take effect? I guess I have to recompile gcc? Do I have to recompile the Linux kernel? What about the default installed Debian packages, will they still work?
C and C++ define char as a byte, i.e., the integer type for which sizeof returns 1. It doesn’t have to be 8 bits, but the overwhelming majority of the time, it is. IMHO, it should have been named byte. But back in 1972 when C was created, Westerners didn’t have to deal with multi-byte character encodings, so you could get away with conflating the “character” and “byte” types.
You just have to live with the confusing terminology. Or typedef it away. But don’t edit your system header files. If you want a character type instead of a byte type, use wchar_t.
But a UTF-8 string is made of 8-bit code units, so char will work just fine. You just have to remember the distinction between char and character. For example, don’t uppercase a UTF-8 string one char at a time: toupper('a') works as expected, but toupper('\xC3') is a nonsensical attempt to uppercase half of a character.