Can UTF-8 encode 5 or 6 byte sequences, allowing all Unicode characters to be encoded?

Asked: May 16, 2026 (2026-05-16T14:20:33+00:00)

Can UTF-8 encode 5 or 6 byte sequences, allowing all Unicode characters to be encoded? The standards I have read seem to conflict. I need to be able to support every Unicode character, not just those in the U+0000..U+10FFFF range.

(All quotes are from RFC 3629)

Section 3:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
accessible range) are encoded using sequences of 1 to 4 octets. The
only octet of a "sequence" of one has the higher-order bit set to 0,
the remaining 7 bits being used to encode the character number. In a
sequence of n octets, n>1, the initial octet has the n higher-order
bits set to 1, followed by a bit set to 0. The remaining bit(s) of
that octet contain bits from the number of the character to be
encoded. The following octet(s) all have the higher-order bit set to
1 and the following bit set to 0, leaving 6 bits in each to contain
bits from the character to be encoded.
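The bit layout described in this passage can be sketched in a few lines of Python. This is only an illustration of the quoted rules, not a replacement for a real codec; the function name `utf8_encode` is made up for the example, and the surrogate check is taken from the rest of RFC 3629 rather than from the paragraph above.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the 1- to 4-octet layout of RFC 3629."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if cp < 0x80:
        # One octet: high bit 0, remaining 7 bits carry the value.
        return bytes([cp])
    if cp < 0x800:
        # Two octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Three octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # Four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

Note that the four-octet branch covers everything from U+10000 to U+10FFFF, so characters outside the BMP are handled by this layout.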

So not all possible characters can be encoded with UTF-8? Does this mean I cannot encode characters from different planes than the BMP?

Section 2:

The octet values C0, C1, F5 to FF never appear.

This means we cannot encode UTF-8 values with 5 or 6 octets (or even some with 4 that aren’t within the above range)?
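To see why those octets never appear, it may help to look at what each lead byte could reach by bit layout alone. The sketch below uses hypothetical helper names (`contains_forbidden_octet`, `raw_four_byte_range`). It shows that a 4-octet sequence starting with F5 would begin above U+10FFFF, and that F4 can reach past the limit by raw layout, which is why the second octet after F4 is restricted to 80..8F. C0 and C1 could only start overlong 2-octet forms of U+0000..U+007F, and F8 and above would start 5- or 6-octet sequences or are unused.

```python
# Octets that RFC 3629, section 2 says never appear in UTF-8.
FORBIDDEN_OCTETS = frozenset({0xC0, 0xC1, *range(0xF5, 0x100)})

def contains_forbidden_octet(data: bytes) -> bool:
    """True if any octet in data is one that valid UTF-8 never uses."""
    return any(b in FORBIDDEN_OCTETS for b in data)

def raw_four_byte_range(lead: int) -> tuple[int, int]:
    """(min, max) code point a 4-octet lead byte could carry by bit layout alone."""
    if not 0xF0 <= lead <= 0xF7:
        raise ValueError("not a 4-octet lead byte pattern (11110xxx)")
    top = (lead & 0x07) << 18          # the 3 payload bits of the lead octet
    return top, top | 0x3FFFF          # 18 more bits from three continuation octets
```

For example, `raw_four_byte_range(0xF4)` gives `(0x100000, 0x13FFFF)`, while `raw_four_byte_range(0xF5)` starts at `0x140000`, already past U+10FFFF.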

Section 12:

Restricted the range of characters to 0000-10FFFF (the UTF-16
accessible range).

Looking at the previous RFC confirms this…they reduced the range of characters.

Section 10:

Another security issue occurs when encoding to UTF-8: the ISO/IEC
10646 description of UTF-8 allows encoding character numbers up to
U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore
a risk of buffer overflow if the range of character numbers is not
explicitly limited to U+10FFFF or if buffer sizing doesn’t take into
account the possibility of 5- and 6-byte sequences.

So these sequences are allowed per the ISO/IEC 10646 definition, but not the RFC 3629 definition? Which one should I follow?
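The buffer-sizing concern in that passage can be made concrete with a rough sketch. The function names here are invented for illustration. The thresholds follow from the older 31-bit layout: each extra octet adds 6 payload bits while the lead octet loses one.

```python
def legacy_utf8_length(cp: int) -> int:
    """Octets needed under the older definition that allowed up to U+7FFFFFFF."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    # Exclusive upper limits for sequences of 1, 2, 3, 4 and 5 octets.
    limits = (0x80, 0x800, 0x10000, 0x200000, 0x4000000)
    for length, limit in enumerate(limits, start=1):
        if cp < limit:
            return length
    return 6

def rfc3629_max_buffer(n_code_points: int) -> int:
    """Worst-case output size when input is limited to U+0000..U+10FFFF."""
    return 4 * n_code_points
```

Under this sketch `legacy_utf8_length(0x10FFFF)` is 4, so a buffer sized at 4 octets per code point is only safe if the encoder rejects anything above U+10FFFF.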

Thanks in advance.

1 Answer

Editorial Team · Answer 1 · 2026-05-16T14:20:34+00:00

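There are no Unicode characters beyond U+10FFFF. The BMP covers only U+0000 through U+FFFF; the supplementary planes run from U+10000 up to U+10FFFF and are encoded in UTF-8 as 4-byte sequences.

UTF-8 is well-defined for the whole U+0000..U+10FFFF range, so 5- and 6-byte sequences are never needed to represent a Unicode character.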
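A quick way to check this is with Python's built-in string type, which uses the same upper limit. This is a minimal sketch; `try_chr` is a small helper written for the example.

```python
import sys

def try_chr(cp: int):
    """Return the character for cp, or None if cp is not a valid code point."""
    try:
        return chr(cp)
    except ValueError:
        return None

# The largest code point a Python str can hold is U+10FFFF.
highest = sys.maxunicode

last = try_chr(0x10FFFF)      # the final code point in the Unicode codespace
beyond = try_chr(0x110000)    # one past the end: rejected

# A character outside the BMP still fits in four bytes of UTF-8.
supplementary = "\U0001F600".encode("utf-8")
```

Here `beyond` is `None`, and both `supplementary` and `last.encode("utf-8")` are 4 bytes long.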
