This is an ANSI C question. I have the following code.
```c
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    wint_t c;   /* declared up front: ANSI C (C89) disallows declarations after statements */

    if (!setlocale(LC_CTYPE, "")) {
        printf("Can't set the specified locale! "
               "Check LANG, LC_CTYPE, LC_ALL.\n");
        return -1;
    }
    while ((c = getwc(stdin)) != WEOF) {
        printf("%lc", c);
    }
    return 0;
}
```
I need full UTF-8 support, but even at this simplest level, can I improve this somehow? Why is `wint_t` used, and not `wchar_t`, with appropriate changes?
UTF-8 is one possible encoding of Unicode. It uses 1, 2, 3, or 4 bytes per character. When you read through `getwc()`, it fetches one to four bytes and composes them into a single Unicode code point, which fits in a `wchar_t` (which can be 16 or even 32 bits wide, depending on the platform).

But since Unicode character values can occupy the entire range from `0x0000` to `0xFFFF`, there are no values left over to signal an error or end-of-file condition. (Some have pointed out that Unicode is larger than 16 bits, which is true; in those cases surrogate pairs are used. The point stands: Unicode can use every available `wchar_t` value, leaving none for EOF.)

End-of-file is reported as `WEOF`, which is typically defined as -1. If you stored the return value of `getwc()` in a `wchar_t`, there would be no way to distinguish it from the character `0xFFFF` (which, incidentally, is reserved anyway, but I digress).

So the answer is to use a wider type, `wint_t` (or `int`), which is guaranteed to represent every `wchar_t` value plus the distinct value `WEOF`. That leaves the lower bits for the real character, while any value outside that range means something other than an ordinary character was returned.

Why not use `wint_t` everywhere instead of `wchar_t`, then? Most string-related functions use `wchar_t` because on some platforms it is half the size of `wint_t`, so strings have a smaller memory footprint; `wint_t` exists mainly for return values that must also be able to carry `WEOF`.