Quoted from here : Security may also be impacted by a characteristic of several

Question

Asked: May 26, 2026 (2026-05-26T02:34:09+00:00)

Quoted from here:

Security may also be impacted by a characteristic of several character
encodings, including UTF-8: the “same thing” (as far as a user can
tell) can be represented by several distinct character sequences. For
instance, an e with acute accent can be represented by the precomposed
U+00E9 E ACUTE character or by the canonically equivalent sequence
U+0065 U+0301 (E + COMBINING ACUTE). Even though UTF-8 provides a
single byte sequence for each character sequence, the existence of
multiple character sequences for “the same thing” may have security
consequences whenever string matching, indexing,

Is this a hidden feature of UTF-8 that I've never come across before?


1 Answer

Editorial Team · Answer 1 · 2026-05-26T02:34:10+00:00

This issue is not actually specific to UTF-8 at all. It happens with all encodings that can represent all (or at least most) Unicode codepoints.
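As a rough illustration of that point, the sketch below encodes both code point sequences from the quoted passage in three encodings available in Python's standard library. The helper name `encoded_forms` and the choice of encodings are just assumptions made for this example. In every encoding the two sequences produce different bytes, so the ambiguity lives at the code point level, not in UTF-8 itself.

```python
# Two ways to write "é" at the code point level.
precomposed = "\u00e9"   # U+00E9, a single code point
decomposed = "e\u0301"   # U+0065 followed by U+0301 (combining acute accent)


def encoded_forms(text):
    """Return the bytes of `text` in a few Unicode encodings (hypothetical helper)."""
    return {name: text.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}


pre = encoded_forms(precomposed)
dec = encoded_forms(decomposed)

for name in pre:
    # The byte sequences differ in every encoding, not only in UTF-8.
    print(name, pre[name].hex(), dec[name].hex(), pre[name] == dec[name])
```

Each encoding still maps a given code point sequence to exactly one byte sequence; there are simply two different code point sequences to start from.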

The general idea of Unicode is to avoid so-called precomposed characters (e.g. U+00E9 LATIN SMALL LETTER E WITH ACUTE). Instead, it usually provides the base character (e.g. U+0065 LATIN SMALL LETTER E) and the combining character (e.g. U+0301 COMBINING ACUTE ACCENT) separately. This has the advantage of not having to provide every possible combination as its own character.

Note: the U+xxxx notation is used to refer to Unicode codepoints. It's the encoding-independent way to refer to Unicode characters.

However, when Unicode was first designed, an important goal was round-trip compatibility with existing, widely-used encodings, so some precomposed characters were included (in fact, most of the diacritic characters from the Latin and related alphabets are included).

So yes (and tl;dr): in a correctly working Unicode-capable application U+00E9 should render the same way and be treated the same way as U+0065 followed by U+0301.

There's a non-trivial process called normalization that helps work with these differences by reducing a given string to one of four normal forms (NFC, NFD, NFKC, and NFKD).

For example, normalizing either string (U+00E9 or U+0065 U+0301) yields U+00E9 under NFC and U+0065 U+0301 under NFD.
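A minimal sketch of this in Python, using the standard library's `unicodedata.normalize`. The helper `same_text` is a hypothetical name introduced here for illustration, and comparing under NFC is only one possible policy.

```python
import unicodedata

precomposed = "\u00e9"   # U+00E9
decomposed = "e\u0301"   # U+0065 U+0301

# A plain comparison looks at code points, so the two strings differ.
raw_equal = precomposed == decomposed

# Normalizing either input gives the same result for a given form.
nfc = [unicodedata.normalize("NFC", s) for s in (precomposed, decomposed)]
nfd = [unicodedata.normalize("NFD", s) for s in (precomposed, decomposed)]


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form (hypothetical helper)."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(raw_equal)                           # False
print(nfc[0] == nfc[1] == "\u00e9")        # True
print(nfd[0] == nfd[1] == "e\u0301")       # True
print(same_text(precomposed, decomposed))  # True
```

Normalizing both sides to the same form before matching or indexing is the usual way to avoid the security problems the quoted passage describes.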
