I am trying to use ICU libraries to test if a string has invalid

Question

Asked: May 30, 2026 (2026-05-30T17:26:00+00:00)

I am trying to use the ICU libraries to test whether a string contains invalid UTF-8 characters. I created a UTF-8 converter, but invalid data never produces an error on conversion. I would appreciate your help.

Thanks,
Prashanth

#include <unicode/ucnv.h>
#include <cassert>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    string str ("AP1120 CorNet-IP v5.0 v5.0.1.22 òÀ MIB 1.5.3.50 Profile EN-C5000");
    //  string str ("example string here");
    //  string str (" ����������"     );
    UErrorCode status = U_ZERO_ERROR;
    UConverter *cnv = ucnv_open("utf-8", &status);
    assert(U_SUCCESS(status));

    // One UTF-8 byte never yields more than one UTF-16 code unit,
    // so 2 * sourceLength is more than enough room.
    int32_t sourceLength = static_cast<int32_t>(str.length());
    int32_t targetLimit = 2 * sourceLength;
    UChar *target = new UChar[targetLimit];

    ucnv_toUChars(cnv, target, targetLimit, str.c_str(), sourceLength, &status);
    cout << u_errorName(status) << endl;   // reports success even for invalid input
    assert(U_SUCCESS(status));

    delete[] target;
    ucnv_close(cnv);
}

1 Answer

Editorial Team · Answer 1 · 2026-05-30T17:26:01+00:00

I modified your program to print out the actual data: first the converted UTF-16 code units, then the raw input bytes:

#include <unicode/ucnv.h>
#include <string>
#include <iostream>
#include <cassert>
#include <cstdio>

int main()
{
    std::string str("22 òÀ MIB 1");
    UErrorCode status = U_ZERO_ERROR;
    UConverter * const cnv = ucnv_open("utf-8", &status);
    assert(U_SUCCESS(status));

    const int32_t targetLimit = 2 * static_cast<int32_t>(str.size());
    UChar *target = new UChar[targetLimit];

    // A source length of -1 tells ICU that the input is NUL-terminated.
    ucnv_toUChars(cnv, target, targetLimit, str.c_str(), -1, &status);

    // Converted UTF-16 code units (the "after" line).
    for (int32_t i = 0; i != targetLimit && target[i] != 0; ++i)
        std::printf("0x%04X ", static_cast<unsigned>(target[i]));
    std::cout << std::endl;
    // Raw input bytes (the "before" line).
    for (char c : str)
        std::printf("0x%02X ", static_cast<unsigned char>(c));
    std::cout << std::endl << "Status: " << u_errorName(status) << std::endl;

    delete[] target;
    ucnv_close(cnv);
}

Now, with default compiler settings, I get:

0x0032 0x0032 0x0020 0x00F2 0x00C0 0x0020 0x004D 0x0049 0x0042 0x0020 0x0031
0x32 0x32 0x20 0xC3 0xB2 0xC3 0x80 0x20 0x4D 0x49 0x42 0x20 0x31

That is, the input is already UTF-8. This is a conspiracy between my editor, which saved the file in UTF-8 (verifiable in a hex editor), and GCC, which sets its execution character set to UTF-8 by default.

You can coerce GCC to change those parameters. For example, forcing the execution character set to ISO-8859-1 (via -fexec-charset=iso-8859-1) produces this:

0x0032 0x0032 0x0020 0xFFFD 0xFFFD 0x0020 0x004D 0x0049 0x0042 0x0020 0x0031
0x32 0x32 0x20 0xF2 0xC0 0x20 0x4D 0x49 0x42 0x20 0x31

As you can see, the input is now ISO-8859-1-encoded. The bytes 0xF2 0xC0 are not valid UTF-8, so the conversion promptly fails on them and produces the “invalid character” code point U+FFFD for each.
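To see why the first byte dump decodes and the second does not, here is a minimal hand-rolled sketch of a UTF-8 well-formedness check. It is not part of ICU and not how ICU implements its converter; the name `looksLikeUtf8` is made up for illustration. It only checks lead bytes, continuation bytes, overlong forms, surrogates and the U+10FFFF limit.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an ICU function): returns true if `s` is
// structurally well-formed UTF-8.
bool looksLikeUtf8(const std::string &s)
{
    const auto *p = reinterpret_cast<const unsigned char *>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;

    while (i < n) {
        const unsigned char b = p[i];
        if (b < 0x80) {          // plain ASCII byte
            ++i;
            continue;
        }

        std::size_t trail;       // number of continuation bytes expected
        unsigned long cp;        // code point being assembled
        unsigned long min;       // smallest code point allowed for this length
        if ((b & 0xE0) == 0xC0)      { trail = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { trail = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { trail = 3; cp = b & 0x07; min = 0x10000; }
        else return false;       // stray continuation byte or invalid lead byte

        if (n - i - 1 < trail)
            return false;        // sequence is cut off at the end of the input

        for (std::size_t k = 1; k <= trail; ++k) {
            if ((p[i + k] & 0xC0) != 0x80)
                return false;    // expected a 10xxxxxx continuation byte
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }

        // Reject overlong encodings, surrogates, and values past U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += trail + 1;
    }
    return true;
}
```

With the two inputs from the dumps above, `looksLikeUtf8("22 \xC3\xB2\xC3\x80 MIB 1")` is true, because 0xC3 is a two-byte lead followed by a continuation byte each time. `looksLikeUtf8("22 \xF2\xC0 MIB 1")` is false, because 0xF2 announces a four-byte sequence but 0xC0 is not a continuation byte.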

However, the conversion operation still returns a “success” state. It appears that the library doesn’t consider a user data conversion error an error of the function call. Rather, the error status seems to be reserved for things like running out of space.
