I’m still trying to decide whether my (home) project should use UTF-8 strings (implemented in terms of std::string with additional UTF-8-specific functions when necessary) or some 16-bit string (implemented as std::wstring). The project is a programming language and environment (like VB, it’s a combination of both).
There are a few wishes/constraints:
- It would be cool if it could run on limited hardware, such as computers with limited memory.
- I want the code to run on Windows, Mac and (if resources allow) Linux.
- I’ll be using wxWidgets as my GUI layer, but I want the code that interacts with that toolkit confined in a corner of the codebase (I will have non-GUI executables).
- I would like to avoid working with two different kinds of strings when working with user-visible text and with the application’s data.
Currently, I’m working with std::string, with the intent of using UTF-8 manipulation functions only when necessary. It requires less memory, and seems to be the direction many applications are going anyway.
If you recommend a 16-bit encoding, which one: UTF-16? UCS-2? Another one?
I would recommend UTF-16 for any kind of data manipulation and UI. The Mac OS X and Win32 APIs use UTF-16, as do wxWidgets, Qt, ICU, Xerces, and others. UTF-8 might be better for data interchange and storage. See http://unicode.org/notes/tn12/.
But whatever you choose, I would definitely recommend against std::string with UTF-8 ‘only when necessary’.
Go all the way with UTF-16 or UTF-8, but do not mix and match; that is asking for trouble.
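To illustrate why "UTF-8 only when necessary" is fragile, here is a minimal sketch, assuming the std::string always holds valid UTF-8. The names is_continuation, utf8_length, utf8_prefix and check_examples are hypothetical helpers made up for this example, not part of any library. It shows that the byte-oriented std::string members (size, substr) disagree with code-point-aware operations as soon as the text contains a non-ASCII character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if the byte is a UTF-8 continuation byte (10xxxxxx).
inline bool is_continuation(unsigned char c) { return (c & 0xC0) == 0x80; }

// Hypothetical helper: number of code points, assuming s holds valid UTF-8.
// Every byte that is not a continuation byte starts a new code point.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if (!is_continuation(c)) ++count;
    }
    return count;
}

// Hypothetical helper: the first n code points of s, never cutting inside
// a multi-byte sequence (again assuming valid UTF-8).
std::string utf8_prefix(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    std::size_t seen = 0;
    while (i < s.size()) {
        if (!is_continuation(static_cast<unsigned char>(s[i]))) {
            if (seen == n) break;  // the next code point starts here
            ++seen;
        }
        ++i;
    }
    return s.substr(0, i);
}

// Usage example: the word "naive" written with a diaeresis on the i,
// which takes two bytes (0xC3 0xAF) in UTF-8.
void check_examples() {
    const std::string word = std::string("na") + "\xC3\xAF" + "ve";

    assert(word.size() == 6);        // bytes
    assert(utf8_length(word) == 5);  // code points

    // A byte-oriented cut splits the two-byte sequence and leaves invalid UTF-8.
    const std::string broken = word.substr(0, 3);
    assert(static_cast<unsigned char>(broken.back()) == 0xC3);

    // The code-point-aware cut keeps the sequence whole.
    assert(utf8_prefix(word, 3) == std::string("na") + "\xC3\xAF");
}
```

Any code that forgets to call the UTF-8-aware variant still compiles and works for ASCII input, which is why this kind of mix tends to fail only once real-world text shows up. The same caveat applies to UTF-16 with characters outside the Basic Multilingual Plane, which need two 16-bit code units.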