Using strings in C++ development is always a bit more complicated than in languages like Java or scripting languages. I think some of the complexity comes from a performance focus in C++ and some is just historical.
I know of the following major string systems and would like to find out if there are others and what specific drawbacks they have vs. each other:
- ICU : http://userguide.icu-project.org/strings#TOC-Using-Unicode-Strings-in-C-
- GLib::ustring : http://library.gnome.org/devel/gtkmm-tutorial/unstable/sec-basics-ustring.html.en
- MFC CString : http://msdn.microsoft.com/en-us/library/5bzxfsea%28VS.100%29.aspx
- std::basic_string : http://en.cppreference.com/w/cpp/string/basic_string
- Qt QString : http://doc.qt.nokia.com/4.6/qstring.html#details
I’ll admit that there can be no definitive answer, but I think SO’s voting system is uniquely suited to show the preferences (and thus the validity of arguments) of people actually using a certain string system.
Added from answers:
- UTF8-CPP : http://utfcpp.sourceforge.net/
I’d say it’s all historical. In particular, two pieces of history:
- In C, the concepts of `char` and “byte” are hopelessly confounded.
- C++ programmers wanted a string class rather than raw `char*`. Unfortunately, they had to wait 15 years for one to be officially standardized. In the meantime, people wrote their own string classes that we’re still stuck with today.
Anyhow, I’ve used two of the classes you mentioned:
MFC CString
MSDN documentation
There are actually two `CString` classes: `CStringA` uses `char` with “ANSI” encoding, and `CStringW` uses `wchar_t` with UTF-16 encoding. `CString` is a typedef of one of them depending on a preprocessor macro. (Lots of things in Windows come in “ANSI” and “Unicode” versions.)
You could use UTF-8 for the `char`-based version, but this has the problem that Microsoft refuses to support “UTF-8” as an ANSI code page. Thus, functions like `Trim(const char* pszTargets)`, which depend on being able to recognize character boundaries, won’t work correctly if you use them with non-ASCII characters.
Since UTF-16 is natively supported, you’ll probably prefer the `wchar_t`-based version.
Both CString classes have a fairly convenient interface, including a printf-like `Format` function, plus the ability to pass CString objects to this varargs function, due to the way the class is implemented.
The main disadvantages are:
- No `<<` and `>>` operators for streams.
- It’s Windows-only.
(That last point has caused me much frustration since I got put in charge of porting our code to Linux. Our company wrote our own string class that’s a clone of CString but cross-platform.)
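To make the `Trim` problem concrete, here is a minimal sketch in standard C++ of what happens when a trim routine treats every byte as a character. `trim_left_bytes` is a hypothetical helper written for this illustration; it is not CString’s implementation, only an assumption about how a byte-oriented trim behaves on UTF-8 data.

```cpp
#include <string>

// Hypothetical byte-wise trim (not part of MFC): removes leading bytes that
// appear anywhere in `targets`, with no knowledge of multi-byte characters.
std::string trim_left_bytes(std::string s, const std::string& targets) {
    // find_first_not_of compares individual char values (bytes);
    // if every byte is a target it returns npos and the whole string is erased.
    s.erase(0, s.find_first_not_of(targets));
    return s;
}
```

With UTF-8 input, trimming “é” (`"\xC3\xA9"`) from “ñé” (`"\xC3\xB1\xC3\xA9"`) should change nothing, since the string starts with “ñ”. The byte-wise version strips the shared lead byte `0xC3` and leaves `"\xB1\xC3\xA9"`, which is no longer valid UTF-8.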
std::basic_string
The good thing about `basic_string` is that it’s the standard.
The bad thing about it is that it doesn’t have Unicode support. OTOH, it doesn’t actively not support Unicode, as it lacks member functions like `upper()`/`lower()` that would depend on the character encoding. In that sense, it’s really more of a “dynamic array of code units” than a “string”.
There are libraries that let you use `std::string` with UTF-8, such as the above-mentioned UTF8-CPP and some of the functions in the Poco library.
For which size characters to use, see std::wstring vs std::string.
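The “dynamic array of code units” point can be seen in a short sketch. `count_code_points` is a hypothetical helper written for this illustration (it is not taken from UTF8-CPP or Poco), and it assumes its input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// skipping continuation bytes, which always have the bit pattern 10xxxxxx.
// Assumes `s` is valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte: start of a new code point
        }
    }
    return n;
}
```

For “naïve” stored as UTF-8 (`"na\xC3\xAF" "ve"`), `size()` reports 6 code units while `count_code_points` reports 5 code points. A real project would more likely rely on a tested library for this than on a hand-rolled loop.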