Unicode simply assigns an integer to each character. UTF-8 or others are used to


Asked: May 16, 2026


Unicode simply assigns an integer to each character. UTF-8 and other encodings are used to encode these integers (“code points”) as a sequence of bytes to be stored in memory. My question is: why can’t we simply store each character as the binary representation of its Unicode value (the “code point”)? As it stands, some languages have characters that require multiple bytes to represent them. Wouldn’t it be easier to store them just as the binary form of their code points?


1 Answer

Editorial Team · Answer 1 · 2026-05-16T23:16:21+00:00

Yes, we can, and that is exactly what UTF-32 does: every code point is stored as a fixed 4-byte integer.
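As a quick check of that claim, here is a minimal Python sketch. The helper name `utf32_matches_code_point` is made up for illustration, and it uses the big-endian form without a byte order mark so the bytes line up directly with the integer.

```python
def utf32_matches_code_point(ch: str) -> bool:
    """Return True if the UTF-32 (big-endian) bytes of ch are just
    its code point written as a 4-byte integer."""
    code_point = ord(ch)
    return ch.encode("utf-32-be") == code_point.to_bytes(4, "big")


for ch in ["Z", "é", "婚", "😀"]:
    # Show the code point next to the raw UTF-32 bytes in hex.
    print(f"U+{ord(ch):04X}", ch.encode("utf-32-be").hex(),
          utf32_matches_code_point(ch))
```

The hex printed for each character is simply its code point padded with zeros to 4 bytes.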

The problem is that UTF-32 wastes a lot of space. For text that is mostly European, Hebrew, or Arabic, UTF-8 needs only 1 to 2 bytes for most code points, whereas UTF-32 always uses 4 bytes per code point.
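To get a feel for the size difference, a rough sketch in Python follows. The sample strings and the helper `encoded_sizes` are my own illustrative choices, not a benchmark.

```python
# Illustrative sample strings (assumed, not from the question).
samples = {
    "English": "hello world",
    "Greek": "\u03b3\u03b5\u03b9\u03b1 \u03c3\u03bf\u03c5",
    "Hebrew": "\u05e9\u05dc\u05d5\u05dd",
    "Arabic": "\u0645\u0631\u062d\u0628\u0627",
}


def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-32 byte count) for text.
    UTF-32 is encoded without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-32-be"))


for name, text in samples.items():
    utf8_len, utf32_len = encoded_sizes(text)
    print(f"{name}: {len(text)} code points, "
          f"UTF-8 = {utf8_len} bytes, UTF-32 = {utf32_len} bytes")
```

For ASCII text the UTF-32 form is four times larger, and for the two-byte scripts it is twice as large.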

If we instead stored the integer value in a variable number of bytes (say, 1 byte for 0 to 255, 2 bytes for 256 to 65535, and so on), decoding would be ambiguous: should the bytes 5a 5a be read as “ZZ” (U+005A twice) or as “婚” (U+5A5A)? The solution to this is essentially UTF-8: a few bits in each byte are reserved to indicate the length of the byte sequence, so a byte stream decodes in exactly one way.
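The ambiguity and the UTF-8 fix can be sketched in Python as below. The function `utf8_sequence_length` is a hypothetical helper that only inspects the leading byte. It is not a full UTF-8 validator, since it does not check continuation bytes or reject overlong forms.

```python
data = bytes([0x5A, 0x5A])

# Naive variable-width scheme: the same two bytes admit two readings.
as_two_chars = "".join(chr(b) for b in data)    # two 1-byte values
as_one_char = chr(int.from_bytes(data, "big"))  # one 2-byte value
print(as_two_chars, as_one_char)


def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with
    the byte `lead` occupies, based on its high bits."""
    if (lead & 0b1000_0000) == 0b0000_0000:   # 0xxxxxxx
        return 1
    if (lead & 0b1110_0000) == 0b1100_0000:   # 110xxxxx
        return 2
    if (lead & 0b1111_0000) == 0b1110_0000:   # 1110xxxx
        return 3
    if (lead & 0b1111_1000) == 0b1111_0000:   # 11110xxx
        return 4
    raise ValueError("continuation byte or invalid leading byte")


# In UTF-8 the two texts get different, self-describing byte sequences.
for text in ["ZZ", "婚"]:
    encoded = text.encode("utf-8")
    print(text, encoded.hex(), utf8_sequence_length(encoded[0]))
```

In UTF-8, “ZZ” stays as 5a 5a, while “婚” becomes the three bytes e5 a9 9a, whose first byte announces a 3-byte sequence, so the decoder never has to guess.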
