If I write the character é to a file and open it with a hex editor, I can see the bytes 0xC3, 0xA9.
According to Wikipedia, the first byte is called the leading byte and the second the trailing byte. 0xC3 is a metadata byte that means the character is encoded with 1 byte, 0xA9, but the Unicode value for é is 0xE9.
I basically want to know why é is encoded with 0xA9 instead of 0xE9. How do text editors convert from 0xC3A9 to 0xE9? Is there a shift operation involved?
What makes you think that 0xC3 is “a metadata byte”?
Every byte in UTF-8 contains relevant information about the codepoint that is encoded.
The first byte of a UTF-8 encoded codepoint contains a marker (the number of leading 1s) that indicates the total number of bytes used to encode the codepoint(*), followed by the first few bits of the actual codepoint. All trailing bytes then contain a “continuation marker” (the bits 10) and 6 more bits of the encoded codepoint. In your example, 0xC3 is 110 00011 (marker 110, payload 00011) and 0xA9 is 10 101001 (marker 10, payload 101001); concatenating the payloads gives 00011101001, which is 0xE9.
The Wikipedia article on UTF-8 has a pretty good description of the process.
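To answer the "any shift operation?" part directly: yes, a decoder masks off the marker bits and shifts the remaining payload bits together. Here is a minimal Python sketch for the 2-byte case only. The function name `decode_two_byte` is made up for illustration, and the sketch skips checks a real decoder needs (for example, rejecting overlong encodings such as a lead byte of 0xC0 or 0xC1).

```python
def decode_two_byte(lead: int, trail: int) -> int:
    """Decode a 2-byte UTF-8 sequence into a codepoint (illustrative only)."""
    # The lead byte must look like 110xxxxx.
    if (lead & 0b11100000) != 0b11000000:
        raise ValueError("not the lead byte of a 2-byte sequence")
    # The trailing byte must look like 10xxxxxx.
    if (trail & 0b11000000) != 0b10000000:
        raise ValueError("not a continuation byte")
    # Keep the 5 payload bits of the lead byte and the 6 payload bits of
    # the trailing byte, then join them into one 11-bit value.
    return ((lead & 0b00011111) << 6) | (trail & 0b00111111)


code_point = decode_two_byte(0xC3, 0xA9)
print(hex(code_point), chr(code_point))
```

The result can be compared against Python's built-in decoder with `bytes([0xC3, 0xA9]).decode("utf-8")`.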
There is an encoding that uses the codepoint value directly: UTF-32 (a.k.a. UCS-4), which is basically “use the codepoint value as a 32-bit value”.
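As a quick check of that claim, the following Python sketch encodes é as big-endian UTF-32 and reads the four bytes back as an integer. It uses the `utf-32-be` codec, which writes no byte order mark; the variable names are only for illustration.

```python
text = "\u00e9"  # é, codepoint U+00E9

# Big-endian UTF-32 without a byte order mark: exactly 4 bytes per codepoint.
utf32 = text.encode("utf-32-be")
print(utf32.hex())

# Reading the 4 bytes as one big-endian integer gives the codepoint itself.
value = int.from_bytes(utf32, "big")
print(hex(value))
```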
(*) The marker is actually remarkably easy: if the byte starts with (i.e. its most significant bits are) 0, then it’s a single-byte encoding (i.e. a codepoint between 0 and 127). If it starts with 10, then it’s a continuation byte. If it’s 110, 1110 or 11110, then it’s the start of a 2-, 3- or 4-byte sequence, respectively. 111110 and 1111110 used to be defined as well, but are no longer valid in modern UTF-8 (since those are only needed to encode values that are guaranteed to never be used in the Unicode standard).
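The marker rules from the footnote can be sketched as a small Python function. The name `classify_byte` and the returned labels are invented for this example. It only inspects the bit prefix, so it does not catch bytes that match a valid prefix but can never appear in well-formed UTF-8 (such as 0xC0 or 0xF5).

```python
def classify_byte(b: int) -> str:
    """Classify a single byte by its UTF-8 marker bits (illustrative only)."""
    if (b & 0b10000000) == 0b00000000:
        return "single"        # 0xxxxxxx: codepoint 0..127
    if (b & 0b11000000) == 0b10000000:
        return "continuation"  # 10xxxxxx: carries 6 payload bits
    if (b & 0b11100000) == 0b11000000:
        return "lead2"         # 110xxxxx: start of a 2-byte sequence
    if (b & 0b11110000) == 0b11100000:
        return "lead3"         # 1110xxxx: start of a 3-byte sequence
    if (b & 0b11111000) == 0b11110000:
        return "lead4"         # 11110xxx: start of a 4-byte sequence
    return "invalid"           # 111110xx and above: not valid in modern UTF-8


for byte in (0x41, 0xC3, 0xA9):
    print(hex(byte), classify_byte(byte))
```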