I am trying to use the ICU libraries to test whether a string contains invalid UTF-8 characters. I created a UTF-8 converter, but invalid data never gives me an error on conversion. I would appreciate your help.
Thanks,
Prashanth
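#include <unicode/ucnv.h>
#include <cassert>
#include <iostream>
#include <string>
using namespace std;
int main()
{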
string str ("AP1120 CorNet-IP v5.0 v5.0.1.22 òÀ MIB 1.5.3.50 Profile EN-C5000");
// string str ("example string here");
// string str (" ����������" );
UErrorCode status = U_ZERO_ERROR;
UConverter *cnv;
const char *sourceLimit;
const char * source = str.c_str();
cnv = ucnv_open("utf-8", &status);
assert(U_SUCCESS(status));
UChar *target;
int sourceLength = str.length();
int targetLimit = 2 * sourceLength;
target = new UChar[targetLimit];
ucnv_toUChars(cnv, target, targetLimit, source, sourceLength, &status);
cout << u_errorName(status) << endl;
assert(U_SUCCESS(status));
// Release the output buffer and the converter.
delete[] target;
ucnv_close(cnv);
}
I modified your program to print out the actual strings, before and after:
Now, with default compiler settings, I get:
That is, the input is already UTF-8. This is a conspiracy between my editor, which saved the file in UTF-8 (verifiable in a hex editor), and GCC, which sets its execution character set to UTF-8 by default.
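If no hex editor is at hand, the same check can be done from inside the program by dumping the bytes of the string. The following is only a minimal sketch; the helper name hexDump is made up for illustration and is not part of ICU or of the original program. In UTF-8, "ò" is stored as the two bytes c3 b2, whereas in ISO-8859-1 it is the single byte f2.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not from ICU or the original post):
// returns the bytes of s as lowercase hex pairs separated by spaces.
std::string hexDump(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Cast through unsigned char so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02x",
                      static_cast<unsigned char>(s[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hexDump(str) on the string from the question and looking at the bytes for "òÀ" shows which encoding the compiler actually put into the binary.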
You can coerce GCC to change those parameters. For example, forcing the execution character set to ISO-8859-1 (via -fexec-charset=iso-8859-1) produces this:
As you can see, the input is now ISO-8859-1-encoded, and the conversion promptly fails and produces “invalid character” code points U+FFFD.
However, the conversion operation still returns a “success” state. It appears that the library doesn’t consider a user data conversion error an error of the function call. Rather, the error status seems to be reserved for things like running out of space.
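Given that behaviour, one possible workaround is to inspect the converted output for U+FFFD yourself. This is a rough sketch rather than an ICU-endorsed method: the function name containsReplacementChar is invented here, and it assumes the output buffer holds UTF-16 code units as char16_t (which is how recent ICU versions define UChar; older builds may use a different 16-bit type). ICU also provides ucnv_setToUCallBack with UCNV_TO_U_CALLBACK_STOP for stopping on bad input, but the sketch below does not depend on it.

```cpp
#include <cstddef>

// Hypothetical helper (not part of ICU): returns true if any of the
// first `length` UTF-16 code units is the replacement character U+FFFD.
bool containsReplacementChar(const char16_t* text, std::size_t length) {
    for (std::size_t i = 0; i < length; ++i) {
        if (text[i] == 0xFFFD) {
            return true;
        }
    }
    return false;
}
```

In the program from the question, the value returned by ucnv_toUChars is the output length, so it could be passed together with target to this helper. One caveat: input that legitimately contains U+FFFD (the bytes ef bf bd in UTF-8) would also be flagged, so this only approximates a validity check.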