I’m trying to store a wchar_t string as octets, but I’m positive I’m doing it wrong. Would anybody mind validating my attempt? What happens when one character takes 4 bytes?
unsigned int i;
const wchar_t *wchar1 = L"abc";
wprintf(L"%ls\r\n", wchar1);
for (i = 0; i < wcslen(wchar1); i++) {
    printf("(%d)", (wchar1[i]) & 255);
    printf("(%d)", (wchar1[i] >> 8) & 255);
}
Unicode text is always encoded. Popular encodings are UTF-8, UTF-16 and UTF-32. Only the last of these uses a fixed size per code point. UTF-16 uses surrogate pairs for code points outside the Basic Multilingual Plane, so where wchar_t is 16 bits wide (as on Windows) such a code point takes 2 wchar_t. UTF-8 is byte oriented and uses between 1 and 4 bytes to encode a code point.
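To make the surrogate mechanism concrete, here is a minimal sketch of how one code point maps to UTF-16 code units. The helper name utf16_encode is made up for illustration, and it works on uint16_t rather than wchar_t so that it behaves the same whatever the width of wchar_t on your platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units.
   Returns the number of units written (1 or 2), or 0 if cp is not a
   valid code point. utf16_encode is a hypothetical helper name. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    /* Reject values above U+10FFFF and the surrogate range itself. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                   /* BMP: a single unit */
        return 1;
    }

    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+0061 ('a') comes back as the single unit 0x0061, while U+1F600 comes back as the pair 0xD83D 0xDE00. Splitting each wchar_t into a low and a high byte, as in the question, only works while every unit fits in 16 bits.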
UTF-8 is an excellent choice if you need to transcode the text to a byte-oriented stream. It is a very common choice for text files and for HTML on the Internet. On Windows you can use WideCharToMultiByte() with CodePage = CP_UTF8. A good alternative is the ICU library.
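If you would rather not depend on a platform API or a library, the conversion is small enough to write by hand. What follows is only a sketch, not a replacement for WideCharToMultiByte() or ICU: the names utf8_encode and wcs_to_utf8 are made up, and replacing invalid input with U+FFFD is my assumption about what you would want. It combines surrogate pairs when it sees them, so it should work whether wchar_t is 16 or 32 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Write the UTF-8 form of one code point to out.
   Returns the number of bytes written (1 to 4). Hypothetical helper. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Convert a NUL-terminated wide string to NUL-terminated UTF-8.
   Returns the number of bytes written, not counting the NUL, or
   (size_t)-1 if dst (capacity cap bytes) is too small. */
size_t wcs_to_utf8(const wchar_t *src, unsigned char *dst, size_t cap)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return (size_t)-1;

    for (size_t i = 0; src[i] != L'\0'; i++) {
        uint32_t cp = (uint32_t)src[i];

        /* UTF-16 case: a high surrogate followed by a low surrogate
           forms one code point. src[i + 1] is at worst the NUL. */
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            uint32_t lo = (uint32_t)src[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i++;
            }
        }

        /* Assumption: out-of-range values and unpaired surrogates
           become U+FFFD (the replacement character). */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;

        unsigned char buf[4];
        size_t len = utf8_encode(cp, buf);
        if (n + len >= cap)                  /* keep room for the NUL */
            return (size_t)-1;
        for (size_t k = 0; k < len; k++)
            dst[n++] = buf[k];
    }
    dst[n] = '\0';
    return n;
}
```

Called on L"abc" with a large enough buffer, this writes the three bytes 0x61 0x62 0x63. A euro sign (U+20AC) becomes the three bytes 0xE2 0x82 0xAC, and U+1F600 becomes the four bytes 0xF0 0x9F 0x98 0x80.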
Be careful to avoid conversions that translate the text to a code page, as wcstombs() does in most locales. They are lossy: a character that has no corresponding code in the code page cannot be represented, so it is typically replaced by ? or makes the conversion fail.