Looking at the Unicode standard, it recommends using plain chars for storing UTF-8 encoded strings. Does this work as expected with C++ and the basic std::string, or are there cases where the UTF-8 encoding can create problems?
For example, when computing the length, it may not be identical to the number of bytes – how is this supposed to be handled? Reading the standard, I’m probably fine using a char array for storage, but I’ll still need to write functions like strlen etc. on my own that work on encoded text, because as far as I understand the problem, the standard routines either handle ASCII only, or expect wide characters (16 bit or more), which the Unicode standard does not recommend. So far, the best source I found about the encoding stuff is a post on Joel on Software, but it does not explain what we poor C++ developers should use 🙂
There’s a library called ‘UTF8-CPP’, which lets you store your UTF-8 strings in standard std::string objects, and provides additional functions to enumerate and manipulate the UTF-8 characters.
I haven’t tested it yet, so I don’t know what it’s worth, but I am considering using it myself.