Possible Duplicate: Why UTF-32 exists whereas only 21 bits are necessary to encode every

Question


Asked: June 1, 2026 (2026-06-01T19:40:53+00:00)


Possible Duplicate:
Why UTF-32 exists whereas only 21 bits are necessary to encode every character?

The maximum Unicode code point is 0x10FFFF, so each UTF-32 code unit carries 21 information bits and 11 superfluous bits that are always zero. So why is there no UTF-24 encoding (i.e. UTF-32 with the high byte removed) for storing each code point in 3 bytes rather than 4?


1 Answer

Editorial Team · Answer 1 · 2026-06-01T19:40:54+00:00

Well, the truth is that UTF-24 was suggested in 2007:

http://unicode.org/mail-arch/unicode-ml/y2007-m01/0057.html

The pros and cons mentioned there were:

"UTF-24 
Advantages: 
 1. Fixed length code units. 
 2. Encoding format is easily detectable for any content, even if mislabeled. 
 3. Byte order can be reliably detected without the use of BOM, even for single-code-unit data. 
 4. If octets are dropped / inserted, decoder can resync at next valid code unit. 
 5. Practical for both internal processing and storage / interchange. 
 6. Conversion to code point scalar values is more trivial then for UTF-16 surrogate pairs 
    and UTF-7/8 multibyte sequences. 
 7. 7-bit transparent version can be easily derived. 
 8. Most compact for texts in archaic scripts. 
Disadvantages: 
 1. Takes more space then UTF-8/16, except for texts in archaic scripts. 
 2. Comparing to UTF-32, extra bitwise operations required to convert to code point scalar values. 
 3. Incompatible with many legacy text-processing tools and protocols. "
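To make disadvantage 2 concrete, here is a minimal sketch of the naive variant described in the question (UTF-32 with the high byte dropped, big-endian). It is not the layout from the 2007 mailing-list proposal, whose details are not reproduced here; the function names `utf24_encode` and `utf24_decode` are made up for illustration, and surrogate code points are not validated.

```python
def utf24_encode(text: str) -> bytes:
    """Pack each code point into 3 big-endian bytes (hypothetical 'UTF-24BE')."""
    out = bytearray()
    for ch in text:
        # Every Unicode code point is <= 0x10FFFF, so it always fits in 3 bytes.
        out += ord(ch).to_bytes(3, "big")
    return bytes(out)


def utf24_decode(data: bytes) -> str:
    """Inverse of utf24_encode: rebuild code points with shifts and ORs."""
    if len(data) % 3 != 0:
        raise ValueError("input length must be a multiple of 3 bytes")
    chars = []
    for i in range(0, len(data), 3):
        # The extra bitwise work compared with reading one aligned 32-bit unit.
        cp = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2]
        if cp > 0x10FFFF:
            raise ValueError(f"value {cp:#x} is outside the Unicode range")
        chars.append(chr(cp))
    return "".join(chars)
```

With UTF-32 the decoder can read each code unit as a single aligned 32-bit integer; here it has to assemble three separate bytes, which is the "extra bitwise operations" cost in the list above.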

As pointed out by David Starner in http://www.mail-archive.com/unicode@unicode.org/msg16011.html:

"Why? UTF-24 will almost invariably be larger then UTF-16, unless you
are talking a document in Old Italic or Gothic. The math alphanumberic
characters will almost always be combined with enough ASCII to make
UTF-8 a win, and if not, enough BMP characters to make UTF-16 a win.
Modern computers don’t deal with 24 bit chunks well; in memory, they’d
take up 32 bits a piece, unless you declared them packed, and then
they’d be a lot slower then UTF-16 or UTF-32. And if you’re storing to
disk, you may as well use BOCU or SCSU (you’re already going
non-standard), or use standard compression with UTF-8, UTF-16, BOCU or
SCSU. SCSU or BOCU compressed should take up half the space of UTF-24,
if that."
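The size argument is easy to check with a rough sketch. The helper `encoded_sizes` below is a hypothetical name, and the "utf-24" figure is simply assumed to be 3 bytes per code point, since no such codec exists in Python; the other sizes come from the standard codecs, without a BOM.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts for one string under several encodings (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        # Assumption: a fixed 3 bytes per code point; len() counts code points.
        "utf-24 (hypothetical)": 3 * len(text),
        "utf-32": len(text.encode("utf-32-le")),
    }


samples = {
    "ASCII": "hello",
    "Greek (BMP)": "\u03b1\u03b2\u03b3",
    "Gothic (supplementary plane)": "\U00010332\U0001033f\U00010344",
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

Only the Gothic sample, made up entirely of supplementary-plane characters, comes out smaller in the 3-byte form (9 bytes against 12 for the others); for the ASCII and Greek samples, UTF-8 and UTF-16 are no larger and usually smaller.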

You could also check the following Stack Overflow post:

Why UTF-32 exists whereas only 21 bits are necessary to encode every character?
