I want to write a program in C++ that should work on Unix and Windows. This program should be able to run in both Unicode and non-Unicode environments. Its behavior should depend only on the environment settings.
One of the nice features that I want to have is to manipulate file names read from directories. These can be Unicode… or not.
What is the easiest way to achieve that?
First, make sure you understand the difference between how Unix supports Unicode and how Windows supports Unicode.
In the pre-Unicode days, both platforms were similar in that each locale had its own preferred character encodings. Strings were arrays of char. One char = one character, except in a few East Asian locales that used double-byte encodings (which were awkward to handle due to being non-self-synchronizing).
But they approached Unicode in two different ways.
Windows NT adopted Unicode in the early days when Unicode was intended to be a fixed-width 16-bit character encoding. Microsoft wrote an entirely new version of the Windows API using 16-bit characters (wchar_t) instead of 8-bit char. For backwards-compatibility, they kept the old “ANSI” API around and defined a ton of macros so you could call either the “ANSI” or “Unicode” version depending on whether _UNICODE was defined.
In the Unix world (specifically, Plan 9 from Bell Labs), developers decided it would be easier to expand Unix’s existing East Asian multi-byte character support to handle 3-byte characters, and created the encoding now known as UTF-8. In recent years, Unix-like systems have been making UTF-8 the default encoding for most locales.
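Since the question asks for behavior that depends only on the environment, here is a rough sketch of how a program on a POSIX system might ask the current locale which character encoding it uses. The helper names locale_codeset and locale_is_utf8 are made up for this example, and <langinfo.h> is POSIX-only, so this will not compile on Windows as written.

```cpp
#include <clocale>
#include <langinfo.h>  // POSIX only; not available on Windows
#include <string>

// Hypothetical helper: returns the codeset name of the user's locale,
// e.g. "UTF-8" or "ISO-8859-1". Note that it changes the process-wide
// LC_CTYPE setting as a side effect.
std::string locale_codeset() {
    // "" means "take the locale from the environment (LANG, LC_ALL, ...)".
    // If this fails, the previous locale (usually "C") stays in effect.
    std::setlocale(LC_CTYPE, "");
    const char* cs = nl_langinfo(CODESET);
    return cs ? std::string(cs) : std::string();
}

// Hypothetical helper: true if the environment's locale reports UTF-8.
// Assumption: the codeset is spelled "UTF-8", as glibc reports it;
// other systems may spell it differently.
bool locale_is_utf8() {
    return locale_codeset() == "UTF-8";
}
```

A program could call locale_is_utf8() once at startup and decide from the result whether file names and other char strings can be treated as UTF-8 directly or need converting first.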
Windows theoretically could expand their ANSI support to include UTF-8, but they still haven’t, because of hard-coded assumptions about the maximum size of a character. So, on Windows, you’re stuck with an OS API that doesn’t support UTF-8 and a C++ runtime library that doesn’t support UTF-8.
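The upshot of this is that UTF-8 (stored in char) is the usual Unicode encoding on Unix, while UTF-16 (stored in wchar_t) is the usual Unicode encoding on Windows.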
This creates just as much complication for cross-platform code as it sounds. It’s easier if you just pick one Unicode encoding and stick to it.
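To make "pick one and stick to it" a bit more concrete: one possible arrangement is to keep UTF-8 in std::string everywhere and convert to UTF-16 only at the point where a UTF-16 API has to be called. The function below is a minimal portable sketch of that conversion, not a tuned implementation. The name utf8_to_utf16 is my own, and it assumes a C++11 compiler for char16_t and std::u16string.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Sketch: decode UTF-8 bytes and re-encode them as UTF-16 code units.
// Throws std::invalid_argument on malformed input.
std::u16string utf8_to_utf16(const std::string& in) {
    // Smallest code point that legitimately needs a sequence of each length;
    // anything below it is an "overlong" encoding and is rejected.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte determines the sequence length.
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong forms, surrogates, and values past U+10FFFF.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

On Windows you would more likely call MultiByteToWideChar with CP_UTF8 to do the same job. The hand-written version is only here to show that the boundary conversion is small and can live in one place.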
Which encoding should that be?
See UTF-8 or UTF-16 or UTF-32 or UCS-2
In summary:
wchar_t is the standard C++ “wide character” type. But its encoding is not standardized: it’s UTF-16 on Windows and UTF-32 on Unix, except on those platforms that use locale-dependent wchar_t encodings as a legacy from East Asian programming.
If you want to use UTF-32, use a uint32_t or equivalent typedef to store characters. Or use wchar_t if __STDC_ISO_10646__ is defined and uint32_t otherwise.
The new C++ standard (C++11) adds char16_t and char32_t, which should help clear up the confusion on how to represent UTF-16 and UTF-32.
TCHAR is a Windows typedef for wchar_t (assumed to be UTF-16) when _UNICODE is defined and char (assumed to be “ANSI”) otherwise. It was designed to deal with the overloaded Windows API mentioned above.
In my opinion, TCHAR sucks. It combines the disadvantages of having platform-dependent char with the disadvantages of platform-dependent wchar_t. Avoid it.
The most important consideration
Character encodings are about information interchange. That’s what the “II” stands for in ASCII. Your program doesn’t exist in a vacuum. You have to read and write files, which are more likely to be encoded in UTF-8 than in UTF-16.
On the other hand, you may be working with libraries that use UTF-16 (or more rarely, UTF-32) characters. This is especially true on Windows.
My recommendation is to use the encoding form that minimizes the amount of conversion you have to do.
It would be much better to have your program work entirely in Unicode internally and only deal with legacy encodings for reading legacy data (or writing it, but only if explicitly asked to.)
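As an illustration of handling a legacy encoding only at the edge, here is a small sketch that converts legacy bytes to UTF-8 as they are read in. It assumes the legacy data is ISO-8859-1 (Latin-1), chosen only because each of its bytes maps directly to the Unicode code point of the same value. A real program has to know or be told which legacy encoding it is reading. The function name latin1_to_utf8 is made up for this example.

```cpp
#include <string>

// Sketch: convert ISO-8859-1 (Latin-1) bytes to UTF-8.
// Assumption: the input really is Latin-1. Windows-1252, for example,
// differs in the 0x80-0x9F range and would need a lookup table instead.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80) {
            // ASCII bytes are unchanged in UTF-8.
            out.push_back(static_cast<char>(b));
        } else {
            // U+0080..U+00FF encode as two bytes: 110000xx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

Everything past the point where the data is read can then assume UTF-8, and the legacy encoding never leaks into the rest of the program.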