In PHP, we can use mb_check_encoding() to determine if a string is valid UTF-8.

Asked: June 8, 2026

In PHP, we can use mb_check_encoding() to determine if a string is valid UTF-8. But that's not a portable solution, as it requires the mbstring extension to be compiled in and enabled. Additionally, it only returns true or false; it won't tell us which bytes are invalid.

Is there a regular expression (or another 100% portable method) that can match invalid UTF-8 bytes in a given string?

That way, those bytes can be replaced if needed while keeping the binary information, for example when building a test-output XML file that includes binary data. Simply converting the string to UTF-8 would lose that information. So we may want to convert:

"foo" . chr(128) . chr(255)

Into

"foo<128><255>"

So just "detecting" that the string is invalid is not good enough; we'd need to be able to detect which bytes are invalid.

1 Answer

Editorial Team · Answer 1 · 2026-06-08T12:36:43+00:00

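You can use this PCRE regular expression to find byte sequences in a string that are not valid UTF-8. If the regex matches, the string contains invalid byte sequences. It's portable because it matches raw bytes and doesn't use the /u modifier, so it doesn't rely on PCRE_UTF8 being compiled in. One limitation: it does not flag encoded UTF-16 surrogates (\xED followed by \xA0-\xBF) or code points above U+10FFFF (\xF4 followed by \x90-\xBF), both of which RFC 3629 also forbids.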

$regex = '/(
    [\xC0-\xC1] # Invalid UTF-8 Bytes
    | [\xF5-\xFF] # Invalid UTF-8 Bytes
    | \xE0[\x80-\x9F] # Overlong encoding of prior code point
    | \xF0[\x80-\x8F] # Overlong encoding of prior code point
    | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
    | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
    | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Continuation byte without a valid lead (stray or surplus)
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';
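
Since the question asks which bytes are invalid, not only whether any exist, here is a minimal sketch of how the pattern could be used to report positions. The helper name findInvalidUtf8() is hypothetical, the pattern is repeated only so the snippet runs on its own, and the syntax assumes PHP 7.1 or later.

```php
<?php
// Pattern reproduced from the answer so that this snippet is self-contained.
$regex = '/(
    [\xC0-\xC1]
    | [\xF5-\xFF]
    | \xE0[\x80-\x9F]
    | \xF0[\x80-\x8F]
    | [\xC2-\xDF](?![\x80-\xBF])
    | [\xE0-\xEF](?![\x80-\xBF]{2})
    | [\xF0-\xF4](?![\x80-\xBF]{3})
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF]
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF]
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF])
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2})
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF])
)/x';

/**
 * Hypothetical helper (not part of the original answer).
 * Returns an array mapping byte offset => hex dump of the bytes matched there.
 */
function findInvalidUtf8(string $text, string $regex): array
{
    $found = [];
    // PREG_OFFSET_CAPTURE makes each match a [matched bytes, byte offset] pair.
    if (preg_match_all($regex, $text, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as [$bytes, $offset]) {
            $found[$offset] = bin2hex($bytes);
        }
    }
    return $found;
}

// Bytes 0x80 and 0xFF sit at offsets 3 and 4 of the question's sample string.
print_r(findInvalidUtf8("foo" . chr(128) . chr(255), $regex));
```

A single match can span two bytes (for example an overlong \xE0\x80 pair), which is why the sketch reports a hex dump per offset instead of a single byte value.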

We can test it by creating a few variations of text:

// Overlong two-byte encoding of code point 0 (lead byte 0xC0 is never valid)
$text = chr(0xC0) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Obsolete 5-byte form (lead byte 0xF8 is not allowed in UTF-8)
$text = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Obsolete 6-byte form (lead byte 0xFC is not allowed in UTF-8)
$text = chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Two-byte lead (0xD0) not followed by a continuation byte
$text = chr(0xD0) . chr(0x01);
var_dump(preg_match($regex, $text)); // int(1)

etc…

In fact, since the pattern matches the invalid bytes themselves, you can also pass it to preg_replace to strip them out:

$clean = preg_replace($regex, '', $text); // Remove the invalid UTF-8 bytes that the pattern matches
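
The question asks for the invalid bytes to be kept in a visible form, not dropped. One way that might look is sketched below with preg_replace_callback. The helper name markInvalidUtf8() is hypothetical, the `<N>` output format is taken from the question's example, and the pattern is repeated only so the snippet runs on its own (PHP 7.0 or later assumed).

```php
<?php
// Pattern reproduced from the answer so that this snippet is self-contained.
$regex = '/(
    [\xC0-\xC1]
    | [\xF5-\xFF]
    | \xE0[\x80-\x9F]
    | \xF0[\x80-\x8F]
    | [\xC2-\xDF](?![\x80-\xBF])
    | [\xE0-\xEF](?![\x80-\xBF]{2})
    | [\xF0-\xF4](?![\x80-\xBF]{3})
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF]
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF]
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF])
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2})
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF])
)/x';

/**
 * Hypothetical helper (not part of the original answer).
 * Rewrites every byte matched by the pattern as "<decimal value>".
 */
function markInvalidUtf8(string $text, string $regex): string
{
    $result = preg_replace_callback(
        $regex,
        function (array $m): string {
            $out = '';
            // A match may cover more than one byte, so convert byte by byte.
            foreach (str_split($m[0]) as $byte) {
                $out .= '<' . ord($byte) . '>';
            }
            return $out;
        },
        $text
    );
    if ($result === null) {
        // preg_replace_callback returns null when PCRE reports an error.
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    return $result;
}

echo markInvalidUtf8("foo" . chr(128) . chr(255), $regex), "\n"; // foo<128><255>
```

Because only the matched bytes are rewritten, valid multi-byte characters elsewhere in the string pass through untouched.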
