Using the features currently available in PHP’s intl wrapper for ICU, how would you check the validity of a string’s encoding (e.g. check for valid UTF-8)?
I know it can be done with mbstring, iconv() and PCRE, but this question is specifically about intl.
I did some digging and found the ICU documentation for unorm2_normalize(). Its pErrorCode out parameter is interesting. The standard ICU error codes start around line 620 of utypes.h. So I tried this test script:
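A minimal sketch of such a test, not the original script: it assumes that PHP’s `Normalizer::normalize()` returns `false` when the input cannot be converted from UTF-8, and that `intl_get_error_code()` and `intl_get_error_message()` then report the ICU status. The helper name `probe_normalize()` and the sample strings are made up for illustration, and the default setting `intl.use_exceptions = 0` is assumed.

```php
<?php
// Hypothetical helper: run a string through the normalizer and collect
// whatever ICU status the intl extension reports afterwards.
function probe_normalize(string $s): array
{
    $result = Normalizer::normalize($s, Normalizer::FORM_C);
    $code   = intl_get_error_code();

    return [
        'ok'      => $result !== false,      // false signals a failure
        'code'    => $code,                  // numeric ICU status
        'name'    => intl_error_name($code), // symbolic name of the status
        'message' => intl_get_error_message(),
    ];
}

// Illustrative inputs: one well-formed string and two ill-formed ones.
$samples = [
    'valid'     => "caf\u{00E9}",
    'lone 0xFF' => "\xFF",
    'truncated' => "\xE2\x82",
];

foreach ($samples as $label => $s) {
    $r = probe_normalize($s);
    printf("%-10s ok=%s code=%d name=%s\n",
        $label, var_export($r['ok'], true), $r['code'], $r['name']);
}
```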
So I guess a test based on that and looking for the following three error codes would be a decent indication of bad UTF-8 encoding:
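The list of codes is not reproduced above, so the three used in this sketch are an assumption: `U_INVALID_CHAR_FOUND`, `U_TRUNCATED_CHAR_FOUND` and `U_ILLEGAL_CHAR_FOUND`, which ICU defines for conversion problems and which intl exposes as global constants. The function name `looks_like_bad_utf8()` is hypothetical.

```php
<?php
// Hypothetical check: treat a normalizer failure as "bad UTF-8" only when
// the reported ICU status is one of three conversion-related codes.
// Which three codes the test should look for is an assumption.
function looks_like_bad_utf8(string $s): bool
{
    if (Normalizer::normalize($s, Normalizer::FORM_C) !== false) {
        return false; // normalization succeeded, so the input was decodable
    }

    $suspect = [
        U_INVALID_CHAR_FOUND,
        U_TRUNCATED_CHAR_FOUND,
        U_ILLEGAL_CHAR_FOUND,
    ];

    return in_array(intl_get_error_code(), $suspect, true);
}
```

A failure with any other status makes this return `false`, so a caller that needs a strict yes/no answer would still have to decide how to treat those cases.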
Or when I’m feeling lazy I could just use
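One lazy variant, offered as a guess at the kind of shortcut meant here rather than as the original snippet, treats any `false` from `Normalizer::normalize()` as invalid input without inspecting the error code. `is_probably_utf8()` is a hypothetical name.

```php
<?php
// Hypothetical shortcut: any failure of the normalizer counts as invalid.
// This also returns false for failures unrelated to encoding.
function is_probably_utf8(string $s): bool
{
    return Normalizer::normalize($s) !== false;
}
```

The strict comparison matters because an empty string normalizes to `""`, which is falsy but not `false`.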
Btw: I’m confused by this line of the ICU API spec:
The phrase “the function returns immediately” is encouraging for the performance of my test, but does “the function” refer to unorm2_normalize() or U_SUCCESS()? Any ideas?