From this excellent UTF-8 all the way through question, I read about this: Unfortunately,

Question

Asked: June 13, 2026 (2026-06-13T05:56:52+00:00)

From this excellent “UTF-8 all the way through” question, I read about this:

Unfortunately, you should verify every submitted string as being valid
UTF-8 before you try to store it or use it anywhere. PHP’s
mb_check_encoding() does the trick, but you have to use it
religiously. There’s really no way around this, as malicious clients
can submit data in whatever encoding they want, and I haven’t found a
trick to get PHP to do this for you reliably.

Now, I’m still learning the quirks of encoding, and I’d like to know exactly what malicious clients can do to abuse encoding. What can one achieve? Can somebody give an example? Let’s say I save the user input into a MySQL database, or I send it through e-mail, how can a user create harm if I do not use the mb_check_encoding functionality?

1 Answer

Editorial Team · Answer 1 · 2026-06-13T05:56:54+00:00

how can a user create harm if I do not use the mb_check_encoding functionality?

This is about overlong encodings.

Due to an unfortunate quirk of UTF-8 design, it is possible to make byte sequences that, if parsed with a naïve bit-packing decoder, would result in the same character as a shorter sequence of bytes – including a single ASCII character.

For example, the character < is normally represented as the single byte 0x3C, but could also be represented using the overlong two-byte UTF-8 sequence 0xC0 0xBC (or even more redundant 3- or 4-byte sequences).
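To make the bit-packing concrete, here is a minimal sketch in Python (chosen only for illustration; the question itself concerns PHP). The function name `naive_decode` is hypothetical, and the decoder is deliberately wrong in the way described above: it packs payload bits together without checking that the shortest possible form was used.

```python
def naive_decode(data: bytes) -> str:
    """Decode UTF-8 by bit-packing alone, with NO overlong check (deliberately unsafe)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # 0xxxxxxx: plain ASCII
            cp, extra = b, 0
        elif b >> 5 == 0b110:        # 110xxxxx: one continuation byte follows
            cp, extra = b & 0x1F, 1
        elif b >> 4 == 0b1110:       # 1110xxxx: two continuation bytes follow
            cp, extra = b & 0x0F, 2
        elif b >> 3 == 0b11110:      # 11110xxx: three continuation bytes follow
            cp, extra = b & 0x07, 3
        else:
            raise ValueError("unexpected lead byte")
        # Append the low 6 bits of each continuation byte (10yyyyyy).
        for cont in data[i + 1:i + 1 + extra]:
            cp = (cp << 6) | (cont & 0x3F)
        out.append(chr(cp))
        i += 1 + extra
    return "".join(out)


# All four byte sequences collapse to the same character under this decoder.
for seq in (b"\x3C", b"\xC0\xBC", b"\xE0\x80\xBC", b"\xF0\x80\x80\xBC"):
    print(seq.hex(" "), "->", naive_decode(seq))
```

A conforming decoder must reject the last three sequences, because each encodes U+003C in more bytes than necessary.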

If you take this input and handle it in a Unicode-oblivious, byte-based tool, then any character processing step in that tool may be evaded. The canonical example would be submitting 0xC0 0xBC to PHP, which has native byte strings. The typical use of htmlspecialchars to HTML-encode the character < would fail here because the expected byte 0x3C is not present. So the output of the script would still include the overlong-encoded <, and any browser reading that output could potentially read the sequence 0xC0 0xBC 0x73 0x63 0x72 0x69 0x70 0x74 as <script and hey presto! XSS.
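The evasion can be sketched as follows, again in Python for illustration. Both helpers are hypothetical stand-ins: `escape_html_bytes` models a purely byte-level escaper (it is not PHP's htmlspecialchars), and `lenient_decode` models a decoder that accepts two-byte overlongs, as the older browsers mentioned in this answer reportedly did.

```python
def escape_html_bytes(data: bytes) -> bytes:
    """Byte-level stand-in for an HTML escaper: replaces only literal ASCII bytes."""
    return (data.replace(b"&", b"&amp;")   # escape & first to avoid double-escaping
                .replace(b"<", b"&lt;")
                .replace(b">", b"&gt;")
                .replace(b'"', b"&quot;"))


def lenient_decode(data: bytes) -> str:
    """Hypothetical lenient decoder: bit-packs 2-byte sequences, no overlong check."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b >> 5 == 0b110 and i + 1 < len(data):
            # 110xxxxx 10yyyyyy -> xxxxxyyyyyy, accepted even when overlong
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            # Every other byte is mapped straight to the code point of the same value.
            out.append(chr(b))
            i += 1
    return "".join(out)


# Overlong "<" (0xC0 0xBC) followed by the ASCII bytes for "script".
payload = bytes([0xC0, 0xBC, 0x73, 0x63, 0x72, 0x69, 0x70, 0x74])

escaped = escape_html_bytes(payload)
print(escaped == payload)               # the escaper found no 0x3C byte to replace
print(lenient_decode(escaped))          # what a lenient consumer would see
print(escape_html_bytes(b"<script"))    # the ordinary encoding is escaped as expected
```

The escaper behaves correctly on the ordinary one-byte <, so the hole only appears when producer and consumer disagree about what the bytes mean.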

Overlongs have long been invalid under the UTF-8 specification (RFC 3629 prohibits decoding them), and modern browsers no longer permit them. But this was a genuine problem for IE and Opera for a long time, and there’s no guarantee every browser is going to get it right in future. And of course this is only one example: any place where a byte-oriented tool processes Unicode strings you’ve potentially got similar problems. The best approach, therefore, is to reject or remove all overlongs at the earliest input phase.
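As a rough sketch of that input check, the following Python helper plays the role that mb_check_encoding plays in PHP. The name `is_valid_utf8` is hypothetical; it relies on Python's built-in strict UTF-8 decoder, which rejects overlong sequences.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8 (overlongs are rejected)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Reject bad input at the boundary, before storing or processing it.
for raw in (b"<script", b"\xC0\xBCscript", b"\xE0\x80\xBCscript"):
    print(raw, "accepted" if is_valid_utf8(raw) else "rejected")
```

Rejecting the whole submission is usually simpler to reason about than trying to repair it, since any repair step is itself another place where bytes get reinterpreted.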
