libxml2 seems to store all its strings in UTF-8, as xmlChar *.
/**
* xmlChar:
*
* This is a basic byte in an UTF-8 encoded string.
* It's unsigned allowing to pinpoint case where char * are assigned
* to xmlChar * (possibly making serialization back impossible).
*/
typedef unsigned char xmlChar;
As libxml2 is a C library, there are no provided routines to get a std::wstring out of an xmlChar *. I’m wondering whether the prudent way to convert xmlChar * to a std::wstring in C++11 is to use the mbstowcs C function, via something like this (work in progress):
std::wstring xmlCharToWideString(const xmlChar *xmlString) {
    if(!xmlString){abort();} //provided string was null
    int charLength = xmlStrlen(xmlString); //excludes null terminator
    wchar_t *wideBuffer = new wchar_t[charLength];
    size_t wcharLength = mbstowcs(wideBuffer, (const char *)xmlString, charLength);
    if(wcharLength == (size_t)(-1)){abort();} //mbstowcs failed
    std::wstring wideString(wideBuffer, wcharLength);
    delete[] wideBuffer;
    return wideString;
}
Edit: Just an FYI, I’m very aware of what xmlStrlen returns: it’s the number of xmlChar units used to store the string, i.e. the number of unsigned char bytes, not the number of characters. It would have been less confusing if I had named the variable byteLength, but I thought charLength would be clearer since I have both charLength and wcharLength. As for the correctness of the code, I believe wideBuffer will always be at least as large as needed to hold the converted string, since (I think) characters that require more space than a wchar_t will be truncated.
xmlStrlen() returns the number of UTF-8 encoded code units in the xmlChar * string. That is generally not the same as the number of wchar_t encoded code units needed when the data is converted, so do not use xmlStrlen() to allocate the size of your wchar_t string. You need to call std::mbstowcs() once to get the correct length, then allocate the memory, and call std::mbstowcs() again to fill the memory. You will also have to use std::setlocale() to tell std::mbstowcs() to use UTF-8 (messing with the locale may not be a good idea, especially if multiple threads are involved).
A better option, since you mention C++11, is to use std::codecvt_utf8 with std::wstring_convert instead, so you do not have to deal with locales.
An alternative option is to use an actual Unicode library, such as ICU or iconv, to handle Unicode conversions.
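The std::wstring_convert option could look roughly like the sketch below. This is only an illustration under a few assumptions, not a definitive implementation: the xmlChar typedef is repeated locally so the snippet compiles without libxml2 (with the real library you would include <libxml/xmlstring.h> instead), the function name is simply reused from the question, and returning an empty string for a null pointer is my own choice rather than anything libxml2 prescribes.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Stand-in for libxml2's typedef (quoted in the question) so this sketch
// compiles on its own; drop it when including the real libxml2 headers.
typedef unsigned char xmlChar;

// Hypothetical helper: converts a null-terminated UTF-8 xmlChar string
// to a std::wstring without touching the global C locale.
std::wstring xmlCharToWideString(const xmlChar *xmlString) {
    if (!xmlString) {
        // Assumption: treat a null pointer as an empty string instead of abort().
        return std::wstring();
    }
    // codecvt_utf8<wchar_t> yields UTF-32/UCS-4 where wchar_t is 32 bits
    // and UCS-2 where wchar_t is 16 bits (e.g. Windows).
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    // from_bytes() reads up to the null terminator and throws
    // std::range_error if the input is not valid UTF-8.
    return converter.from_bytes(reinterpret_cast<const char *>(xmlString));
}
```

Usage would be along the lines of std::wstring name = xmlCharToWideString(node->name);, assuming node is an xmlNodePtr you already have. Two caveats: on platforms with a 16-bit wchar_t, std::codecvt_utf8_utf16<wchar_t> is probably the better fit if you need characters outside the BMP, and std::wstring_convert and the <codecvt> facets were deprecated in C++17, so this is mainly suitable for code that targets C++11/14.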