I am developing an application whose core code base should be cross-platform across Windows, iOS and Android.
My question is: how should I internally represent strings used by this app to be able to effectively use them on all three platforms?
It is important to note that I use DirectWrite heavily on Windows, and its API functions usually expect a wchar_t* to be passed (the API documentation describes the parameter as “A pointer to an array of Unicode characters.”; I don’t know whether this means that they are in UTF-16 encoding or not).
I see three different approaches (however, I find it quite difficult to grasp the details of handling Unicode strings with C++ in a cross-platform manner, so I may be missing some important concept):
- use std::string internally everywhere (and store the strings in UTF-8 encoding?), and convert them to wchar_t* where it is needed for the DirectWrite API (I don’t know what is needed by the text-processing APIs of Android and iOS yet).
- use std::wstring internally everywhere. If I understand things right, this wouldn’t be efficient in terms of memory usage, because a wchar_t is 4 bytes on iOS and Android (and does it mean that I would have to store the string in UTF-16 on Windows, and in UTF-32 on Android/iOS?)
- create an abstraction for strings with an abstract base class, and implement internal storing specifically for the different platforms.
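For the first approach, the conversion step could look roughly like the following. This is only a minimal sketch, not tied to any platform API: `utf8_to_utf16` is a hypothetical helper name, it assumes C++11 or later, and it does not reject overlong UTF-8 encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes held in a std::string and
// re-encodes the code points as UTF-16 code units.
// Simplification: overlong encodings are not detected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;   // decoded code point
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid Unicode code point");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits wide, the resulting code units could be copied into a std::wstring (for example `std::wstring w(u16.begin(), u16.end())`) before being handed to an API that expects wchar_t*.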
What would be the best solution? And by the way, are there any existing cross-platform libraries that abstract string handling? (and also, reading and serializing of Unicode strings)
(UPDATE: deleted the part with the question about the difference of char* and std::string.)
Part of my question came from misunderstanding, or not completely understanding, how the string and wstring classes work in C++ (I am coming from a C# background).
The differences between the two, along with their pros and cons, are described in this great answer: std::wstring VS std::string.
How string and wstring work
For me, the single most important discovery about the string and wstring classes was that semantically they do not represent a piece of encoded text, but simply a “string” of char or wchar_t. They are more like a simple data array with some string-specific operations (like append and substr) than a representation of text. Neither of them is aware of any kind of string encoding whatsoever; they handle each char or wchar_t element individually as a separate character.
Encodings
However, on most systems, if you create a string from a string literal with a special character like this:
The ű character will be represented by more than one byte in memory, but that has nothing to do with the std::string class; it is a feature of the compiler, which can encode string literals in UTF-8 (not every compiler does, though). String literals prefixed with L will be represented by wchar_t elements in UTF-16, UTF-32 or something else, depending on the compiler.
Thus the string “ű” will be represented in memory with two bytes: 0xC5 0xB1, and the std::string class won’t know that those two bytes semantically mean one character (one Unicode code point) in UTF-8, hence the sample code:
produces the following result (depending on the compiler; some compilers do not treat string literals as UTF-8, and some depend on the encoding of the source file):
The size() function returns 2, because the only thing the std::string knows is that it stores two bytes (two chars). And substr works “primitively” as well: it returns a string containing the single char 0xC5, which is displayed as �, because 0xC5 on its own is not a valid UTF-8 sequence (but that does not bother the std::string).
From that we can see that encodings are handled by the various text-processing APIs of the platform, like the simple cout, or DirectWrite.
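The behaviour described above can be checked with a small sketch. To avoid depending on how a particular compiler encodes literals, the two UTF-8 bytes of “ű” are spelled out explicitly; `kExample` and `utf8_code_points` are hypothetical names introduced only for this illustration, and the counting helper assumes its input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// "ű" (U+0171) written as its two UTF-8 bytes, so the example does not
// depend on the compiler's source or execution character set.
const std::string kExample = "\xC5\xB1";

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes (bit pattern 10xxxxxx). Assumes valid UTF-8 input.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;  // lead byte or ASCII byte starts a new code point
        }
    }
    return count;
}
```

With these definitions, `kExample.size()` is 2 and `kExample.substr(0, 1)` holds only the byte 0xC5, while `utf8_code_points(kExample)` is 1: the notion of “one character” only appears once some code interprets the bytes as UTF-8.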
My approach
In my application DirectWrite is very important, and it only accepts strings encoded in UTF-16 (in the form of wchar_t* pointers). So I decided to store the strings, both in memory and in files, encoded in UTF-16. However, I wanted a cross-platform implementation that can handle UTF-16 strings on Windows, Android and iOS, which is not possible with std::wstring, because its element size (and therefore the encoding it is suited to) is platform-dependent.
To create a cross-platform, strictly UTF-16 string class, I templated basic_string on a data type which is 2 bytes long. Quite surprisingly, at least for me, I found almost no information about this online; I based my solution on this approach. Here is the code:
Strings are stored with the above class, and the raw UTF-16 data is passed to the specific API functions of the various platforms, all of which at the moment seem to support UTF-16 encoding.
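As a minimal sketch of this idea (not the original listing), assuming C++11 or later: instead of a custom 2-byte element type, the standard char16_t type is used here, for which the library already provides char_traits, so basic_string works with it out of the box. The names `Utf16String` and `raw_utf16` are hypothetical.

```cpp
#include <string>

// Assumption: char16_t occupies exactly two bytes, which holds on
// mainstream platforms with 8-bit bytes.
static_assert(sizeof(char16_t) == 2, "char16_t is expected to be 2 bytes");

// basic_string instantiated on a fixed 16-bit code unit type.
// This is the same type as std::u16string.
using Utf16String = std::basic_string<char16_t>;

// Returns a pointer to the null-terminated UTF-16 code units, suitable
// for handing to a platform API after any cast that API requires.
inline const char16_t* raw_utf16(const Utf16String& s) {
    return s.c_str();
}
```

As with std::string, size() and substr() operate on code units rather than characters, so a code point outside the Basic Multilingual Plane counts as two elements. On Windows, passing the data to a function that expects wchar_t* would still need a cast or a copy into std::wstring.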
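Since the strings are also written to files as UTF-16, the byte order on disk has to be fixed so that every platform reads the same data. A minimal sketch, assuming a little-endian file format (an assumption of this example, not something decided above) and using std::u16string for the in-memory string; both function names are hypothetical:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Serializes UTF-16 code units as little-endian byte pairs,
// independent of the host's native byte order.
std::vector<std::uint8_t> to_utf16le_bytes(const std::u16string& s) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(s.size() * 2);
    for (char16_t unit : s) {
        bytes.push_back(static_cast<std::uint8_t>(unit & 0xFF));         // low byte
        bytes.push_back(static_cast<std::uint8_t>((unit >> 8) & 0xFF));  // high byte
    }
    return bytes;
}

// Rebuilds the code units from little-endian byte pairs.
// A trailing odd byte, if any, is ignored.
std::u16string from_utf16le_bytes(const std::vector<std::uint8_t>& bytes) {
    std::u16string s;
    s.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        s.push_back(static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return s;
}
```

Writing the returned bytes to a binary stream and reading them back with the second function gives the same code units on any of the three platforms.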
The implementation might not be perfect; however, the append, substr and size functions seem to work properly. I still don’t have much experience with string handling in C++, so feel free to comment/edit if I stated something incorrectly.