libxml2 seems to store all its strings in UTF-8, as xmlChar *

Question

Asked: June 16, 2026

libxml2 seems to store all its strings in UTF-8, as xmlChar *.

/**
 * xmlChar:
 *
 * This is a basic byte in an UTF-8 encoded string.
 * It's unsigned allowing to pinpoint case where char * are assigned
 * to xmlChar * (possibly making serialization back impossible).
 */
typedef unsigned char xmlChar;

As libxml2 is a C library, there are no provided routines to get a std::wstring out of an xmlChar *. I’m wondering whether the prudent way to convert an xmlChar * to a std::wstring in C++11 is to use the mbstowcs C function, via something like this (work in progress):

std::wstring xmlCharToWideString(const xmlChar *xmlString) {
    if(!xmlString){abort();} //provided string was null
    int charLength = xmlStrlen(xmlString); //excludes null terminator
    wchar_t *wideBuffer = new wchar_t[charLength];
    size_t wcharLength = mbstowcs(wideBuffer, (const char *)xmlString, charLength);
    if(wcharLength == (size_t)(-1)){abort();} //mbstowcs failed
    std::wstring wideString(wideBuffer, wcharLength);
    delete[] wideBuffer;
    return wideString;
}

Edit: Just an FYI, I’m well aware of what xmlStrlen returns: the number of xmlChar (that is, unsigned char) units used to store the string, not the number of characters. Naming the variable byteLength would have been less confusing, but I chose charLength to pair with wcharLength. As for the correctness of the code, I believe wideBuffer will always be at least as large as needed to hold the converted string, since characters that require more space than a wchar_t will be truncated (I think).

1 Answer

Editorial Team · Answer 1 · 2026-06-16T15:24:03+00:00

xmlStrlen() returns the number of UTF-8 code units (bytes) in the xmlChar* string. That is generally not the number of wchar_t code units needed once the data is converted, so do not treat the xmlStrlen() result as the length of your wchar_t string. Note that std::mbtowc() converts only a single character; for a whole string you need std::mbstowcs(). Call it once with a null destination to get the required length (a POSIX extension that common implementations support), allocate the memory, then call it again to fill the memory. You will also have to use std::setlocale() to tell mbstowcs() to decode UTF-8. The locale is process-wide, so changing it may not be a good idea, especially if multiple threads are involved. For example:

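#include <clocale>
#include <cstdlib>
#include <string>
#include <libxml/xmlstring.h>

std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //provided string was null

    std::wstring wideString;

    if (xmlStrlen(xmlString) > 0)
    {
        //copy the locale name: the pointer returned by setlocale() may be invalidated by the next call
        const char *current = setlocale(LC_CTYPE, NULL);
        std::string origLocale(current ? current : "C");
        if (!setlocale(LC_CTYPE, "en_US.UTF-8")) { abort(); } //UTF-8 locale not installed

        //first pass: null destination, returns the required length (excludes null terminator)
        size_t wcharLength = mbstowcs(NULL, (const char*) xmlString, 0);
        if (wcharLength != (size_t)(-1))
        {
            //second pass: convert into the correctly sized string
            wideString.resize(wcharLength);
            mbstowcs(&wideString[0], (const char*) xmlString, wcharLength);
        }

        setlocale(LC_CTYPE, origLocale.c_str());
        if (wcharLength == (size_t)(-1)) { abort(); } //mbstowcs failed
    }

    return wideString;
}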
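
To see why the byte count and the converted length differ, here is a minimal sketch that counts both for the same string. The helper names utf8ByteLength and utf8CodePointCount are made up for illustration, the xmlChar typedef is a local stand-in so the snippet compiles without libxml2, and the input is assumed to be valid UTF-8. The code point count matches the wchar_t count only on platforms where wchar_t is 32 bits wide.

```cpp
#include <cassert>
#include <cstddef>

// Local stand-in for libxml2's typedef so the sketch compiles on its own.
typedef unsigned char xmlChar;

// Counts bytes up to the terminator, which is what xmlStrlen() reports.
std::size_t utf8ByteLength(const xmlChar *s)
{
    assert(s != 0);
    std::size_t n = 0;
    while (s[n] != 0) { ++n; }
    return n;
}

// Counts code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is already valid UTF-8; it does not validate anything.
std::size_t utf8CodePointCount(const xmlChar *s)
{
    assert(s != 0);
    std::size_t n = 0;
    for (; *s != 0; ++s)
    {
        if ((*s & 0xC0) != 0x80) { ++n; }
    }
    return n;
}
```

For the string "héllo", where é is encoded as the two bytes 0xC3 0xA9, the first helper counts 6 bytes and the second counts 5 code points.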

A better option, since you mention C++11, is to use std::codecvt_utf8 with std::wstring_convert instead, so you do not have to deal with locales at all (be aware that both were deprecated in C++17):

std::wstring xmlCharToWideString(const xmlChar *xmlString)
{    
    if (!xmlString) { abort(); } //provided string was null
    try
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
        return conv.from_bytes((const char*)xmlString);
    }
    catch(const std::range_error& e)
    {
        abort(); //wstring_convert failed
    }
}

An alternative option is to use an actual Unicode library, such as ICU or iconv, to handle Unicode conversions.
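
As a rough sketch of the iconv route, assuming a POSIX system whose iconv accepts the encoding name "WCHAR_T" (glibc and GNU libiconv do; other implementations may not): the function name utf8ToWideIconv is made up for illustration, and the xmlChar typedef is a local stand-in so the snippet compiles without libxml2. It throws instead of calling abort() purely to keep the sketch easy to exercise.

```cpp
#include <iconv.h>
#include <cassert>
#include <cstring>
#include <stdexcept>
#include <string>

// Local stand-in for libxml2's typedef so the sketch compiles on its own.
typedef unsigned char xmlChar;

std::wstring utf8ToWideIconv(const xmlChar *utf8)
{
    if (!utf8) { throw std::invalid_argument("null string"); }

    // Assumption: "WCHAR_T" is a supported target encoding on this platform.
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)(-1)) { throw std::runtime_error("iconv_open failed"); }

    size_t inLeft = std::strlen((const char*)utf8);
    // One wchar_t per input byte is always enough room for the output.
    std::wstring out(inLeft, L'\0');
    char *inPtr = (char*)utf8;            // iconv takes char**, but does not modify the input
    char *outPtr = (char*)&out[0];
    size_t outLeft = out.size() * sizeof(wchar_t);

    size_t rc = inLeft ? iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft) : 0;
    iconv_close(cd);
    if (rc == (size_t)(-1)) { throw std::runtime_error("invalid UTF-8 input"); }

    // iconv reports unused output space in bytes; trim the string to what was written.
    assert(outLeft % sizeof(wchar_t) == 0);
    out.resize(out.size() - outLeft / sizeof(wchar_t));
    return out;
}
```

Unlike the setlocale() approach, this does not touch the process-wide locale, since the conversion descriptor carries the source and target encodings itself.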
