Quoted from here:
Security may also be impacted by a characteristic of several character
encodings, including UTF-8: the “same thing” (as far as a user can
tell) can be represented by several distinct character sequences. For
instance, an e with acute accent can be represented by the precomposed
U+00E9 E ACUTE character or by the canonically equivalent sequence
U+0065 U+0301 (E + COMBINING ACUTE). Even though UTF-8 provides a
single byte sequence for each character sequence, the existence of
multiple character sequences for “the same thing” may have security
consequences whenever string matching, indexing,
Is this a hidden feature of UTF-8 that I’ve never tackled before?
This issue is not actually specific to UTF-8 at all. It happens with all encodings that can represent all (or at least most) Unicode codepoints.
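To illustrate that this is independent of the encoding, here is a minimal Python sketch (the variable names are my own, chosen only for this example). It encodes both spellings of "é" in three Unicode encodings and shows that the byte sequences differ in each of them, because the underlying codepoint sequences differ:

```python
# Two spellings of what a user sees as the same "é".
precomposed = "\u00e9"    # U+00E9
decomposed = "e\u0301"    # U+0065 followed by U+0301

# The mismatch shows up in every encoding that can represent both
# sequences, not only in UTF-8.
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    a = precomposed.encode(encoding)
    b = decomposed.encode(encoding)
    print(f"{encoding:10} {a.hex():>10} {b.hex():>18} equal={a == b}")
```

Each encoding maps a given codepoint sequence to exactly one byte sequence, so the ambiguity lives at the codepoint level and is carried through unchanged.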
The general idea of Unicode is to avoid providing so-called pre-composed characters (e.g. U+00E9 LATIN SMALL LETTER E WITH ACUTE). Instead, it usually provides the base character (e.g. U+0065 LATIN SMALL LETTER E) and the combining character (e.g. U+0301 COMBINING ACUTE ACCENT) separately. This has the advantage of not having to provide every possible combination as its own character.
Note: the U+xxxx notation is used to refer to Unicode codepoints. It’s the encoding-independent way to refer to Unicode characters.
However, when Unicode was first designed, an important goal was round-trip compatibility with existing, widely-used encodings, so some pre-composed characters were included (in fact, most of the characters with diacritics from the Latin and related alphabets are included).
So yes (and tl;dr): in a correctly working Unicode-capable application U+00E9 should render the same way and be treated the same way as U+0065 followed by U+0301.
There’s a non-trivial process called normalization that helps work with these differences by reducing a given string to one of four normal forms (NFC, NFD, NFKC and NFKD).
For example, normalizing either string (U+00E9 or U+0065 U+0301) will result in U+00E9 when using NFC and will result in U+0065 U+0301 when using NFD.
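As a rough sketch of how this looks in practice, the snippet below uses Python's standard `unicodedata.normalize`. The helper `same_text` is a hypothetical name introduced only for this illustration, and whether NFC or NFD is the better choice depends on the application:

```python
import unicodedata

precomposed = "\u00e9"    # U+00E9
decomposed = "e\u0301"    # U+0065 followed by U+0301

# Without normalization the two strings compare as different.
print(precomposed == decomposed)

# NFC composes both inputs to U+00E9; NFD decomposes both to U+0065 U+0301.
for form in ("NFC", "NFD"):
    for s in (precomposed, decomposed):
        result = unicodedata.normalize(form, s)
        print(form, [f"U+{ord(ch):04X}" for ch in result])


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form.

    Hypothetical helper for this example, not part of any library.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(same_text(precomposed, decomposed))
```

Normalizing both sides to the same form before matching, indexing or comparing is one way to avoid the mismatch described in the quoted text.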