Unicode simply assigns an integer to each character. UTF-8 and other encodings are used to convert these integers (“code points”) into a sequence of bytes to be stored in memory. Consequently, some languages have characters that require multiple bytes to represent them. My question is: why can’t we simply store each character as the binary representation of its Unicode value (the “code point”)? Wouldn’t it be easier to store them just as the binary of their code points?
Yes, we can, and that is what UTF-32 does: it stores every code point as a fixed 4-byte integer.
The problem is that UTF-32 wastes a lot of space. If the text is mostly European, Hebrew, or Arabic, UTF-8 typically takes only 1 to 2 bytes per code point, whereas UTF-32 always takes 4 bytes per code point.
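To make the size difference concrete, here is a small sketch that counts bytes for a few sample strings. The helper name `encoded_sizes` and the sample words are my own illustrations, not part of any standard API; `utf-32-le` is used so that the byte-order mark is not counted.

```python
def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-32 bytes) for a string."""
    return (
        len(text),                      # number of code points
        len(text.encode("utf-8")),      # variable width: 1 to 4 bytes each
        len(text.encode("utf-32-le")),  # fixed width: always 4 bytes each
    )

# Illustrative samples, written with escapes to keep the source ASCII-only.
samples = {
    "English": "hello",
    "Hebrew": "\u05e9\u05dc\u05d5\u05dd",
    "Chinese": "\u5a5a",
}

for name, text in samples.items():
    points, utf8, utf32 = encoded_sizes(text)
    print(f"{name}: {points} code points, {utf8} bytes in UTF-8, {utf32} bytes in UTF-32")
```

For the ASCII sample UTF-32 is four times larger, and for the Hebrew sample it is twice as large. Only for characters such as U+5A5A does the gap narrow to 3 bytes versus 4.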
If we stored the integer value with a variable size, e.g. 1 byte for 0 ~ 255, 2 bytes for 256 ~ 65535, and so on, we would have an ambiguity problem: should the bytes 5a 5a represent “ZZ” (U+005A twice) or “婚” (U+5A5A)? Basically, the solution is what we call UTF-8: it uses some special bits in each byte to indicate the length of the byte sequence, so that every byte stream has a unique decoding.
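The ambiguity, and how UTF-8 avoids it, can be sketched in a few lines. The function `utf8_sequence_length` below is a hypothetical helper written for this illustration; it only inspects the high bits of the first byte and does not validate the rest of the sequence.

```python
data = bytes([0x5A, 0x5A])

# Naive variable-width scheme: the same two bytes have two valid readings.
as_two_chars = "".join(chr(b) for b in data)    # two 1-byte values: U+005A, U+005A
as_one_char = chr(int.from_bytes(data, "big"))  # one 2-byte value: U+5A5A

def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 sequence occupies, judged from its lead byte."""
    if first_byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx
        return 1
    if first_byte & 0b1110_0000 == 0b1100_0000:  # 110xxxxx
        return 2
    if first_byte & 0b1111_0000 == 0b1110_0000:  # 1110xxxx
        return 3
    if first_byte & 0b1111_1000 == 0b1111_0000:  # 11110xxx
        return 4
    raise ValueError("continuation byte or invalid lead byte")

# In UTF-8 the two strings get different byte sequences, so no ambiguity remains.
zz_bytes = "ZZ".encode("utf-8")        # 5a 5a
hun_bytes = "\u5a5a".encode("utf-8")   # e5 a9 9a

print(as_two_chars, as_one_char)
print(zz_bytes.hex(" "), utf8_sequence_length(zz_bytes[0]))
print(hun_bytes.hex(" "), utf8_sequence_length(hun_bytes[0]))
```

The lead byte 0x5a starts with a 0 bit, so a decoder knows it is a complete 1-byte character, while 0xe5 starts with 1110 and announces a 3-byte sequence.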