I’ve found a useful function on another answer and I wonder if someone could

Question

Asked: May 31, 2026

I’ve found a useful function in another answer, and I wonder if someone could explain to me what it is doing and whether it is reliable. I was using mb_detect_encoding(), but it gave the wrong result when reading from an ISO 8859-1 file on a Linux OS.

This function seems to work in all cases I tested.

Here is the question: Get file encoding

Here is the function:

function isUTF8($string){
    return preg_match('%(?:
    [\xC2-\xDF][\x80-\xBF]              # Non-overlong 2-byte
    |\xE0[\xA0-\xBF][\x80-\xBF]         # Excluding overlongs
    |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # Straight 3-byte
    |\xED[\x80-\x9F][\x80-\xBF]         # Excluding surrogates
    |\xF0[\x90-\xBF][\x80-\xBF]{2}      # Planes 1-3
    |[\xF1-\xF3][\x80-\xBF]{3}          # Planes 4-15
    |\xF4[\x80-\x8F][\x80-\xBF]{2}      # Plane 16
    )+%xs', $string);
}

Is this a reliable way of detecting UTF-8 strings?
What exactly is it doing? Can it be made more robust?

1 Answer

Editorial Team · Answer 1 · 2026-05-31T13:34:09+00:00

If you do not know the encoding of a string, you cannot guess it reliably, which is why mb_detect_encoding gives wrong answers like the one you saw. If, however, you know what encoding a string should be in, you can check whether it is valid in that encoding using mb_check_encoding. It more or less does what your regex does, probably a little more comprehensively, and it answers the question “Is this sequence of bytes valid in UTF-8?” with a clear yes or no.

That doesn’t necessarily mean the string actually is in that encoding, just that it may be. For example, it is impossible to distinguish one single-byte encoding that uses all 8 bits from another, since almost any byte sequence is valid in both. UTF-8 is rather distinguishable, though you can still produce, for instance, Latin-1 encoded strings that also happen to be valid UTF-8 byte sequences.
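To make the comparison concrete, here is a minimal sketch that runs the function from the question and mb_check_encoding side by side. The sample strings are illustrative assumptions, not data from the question. Because the regex is unanchored, preg_match reports a match as soon as one valid multi-byte sequence appears anywhere in the input, so it appears to accept strings that also contain invalid bytes and to reject plain ASCII, even though ASCII is valid UTF-8.

```php
<?php
// The function from the question, reproduced unchanged for comparison.
function isUTF8($string){
    return preg_match('%(?:
    [\xC2-\xDF][\x80-\xBF]              # Non-overlong 2-byte
    |\xE0[\xA0-\xBF][\x80-\xBF]         # Excluding overlongs
    |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # Straight 3-byte
    |\xED[\x80-\x9F][\x80-\xBF]         # Excluding surrogates
    |\xF0[\x90-\xBF][\x80-\xBF]{2}      # Planes 1-3
    |[\xF1-\xF3][\x80-\xBF]{3}          # Planes 4-15
    |\xF4[\x80-\x8F][\x80-\xBF]{2}      # Plane 16
    )+%xs', $string);
}

// Illustrative sample inputs (assumptions, not taken from the question).
$samples = [
    'utf8'   => "caf\xC3\xA9",      // "café" encoded as UTF-8
    'latin1' => "caf\xE9",          // "café" encoded as ISO 8859-1
    'mixed'  => "caf\xE9 \xC3\xA9", // stray Latin-1 byte, then a valid UTF-8 sequence
    'ascii'  => "cafe",             // plain ASCII, which is also valid UTF-8
];

foreach ($samples as $name => $bytes) {
    printf(
        "%-6s regex=%d mb_check_encoding=%s\n",
        $name,
        isUTF8($bytes),
        var_export(mb_check_encoding($bytes, 'UTF-8'), true)
    );
}
```

The two checks agree on the first two samples and disagree on the last two: the regex accepts the mixed string and rejects the ASCII one, while mb_check_encoding does the opposite. The regex only tells you that the string contains at least one well-formed multi-byte sequence, which is a weaker statement than "the whole string is valid UTF-8".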

In short, there’s no way to know for sure. If you expect UTF-8, check if the byte sequence you received is valid in UTF-8, then you can treat the string safely as UTF-8. Beyond that there’s hardly anything you can do.
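The "validate, then treat as UTF-8" step could be sketched roughly as follows. The helper name ensureUtf8 and the ISO-8859-1 fallback are assumptions made for illustration: the fallback only makes sense if you already know, as with the file in the question, what the non-UTF-8 input is encoded in. It is not a way to detect the encoding.

```php
<?php
// Hypothetical helper: the name and the default fallback are assumptions.
// It validates the bytes as UTF-8 and, only if that fails, converts from
// an encoding the caller already knows the data is in.
function ensureUtf8(string $bytes, string $knownFallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        // Valid UTF-8 byte sequence: safe to treat as UTF-8 from here on.
        return $bytes;
    }

    // Not valid UTF-8: convert from the encoding the caller says it is.
    return mb_convert_encoding($bytes, 'UTF-8', $knownFallback);
}

$fromLatin1File = "caf\xE9";     // "café" as ISO 8859-1 bytes
$alreadyUtf8    = "caf\xC3\xA9"; // "café" as UTF-8 bytes

var_dump(ensureUtf8($fromLatin1File) === $alreadyUtf8);
var_dump(ensureUtf8($alreadyUtf8) === $alreadyUtf8);
```

A Latin-1 string that happens to be a valid UTF-8 byte sequence would pass through unchanged here, which is the residual ambiguity the answer describes.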
