I have a need to manipulate UTF-8 byte arrays in a low-level environment. The

Question

Asked: May 28, 2026 (2026-05-28T05:48:09+00:00)

I need to manipulate UTF-8 byte arrays in a low-level environment. The strings will be prefix-similar and kept in a container that exploits this (a trie). To preserve this prefix-similarity as much as possible, I’d prefer to use a terminator at the end of my byte arrays, rather than (say) a byte-length prefix.

What terminator should I use? It seems 0xff is an illegal byte in every position of any UTF-8 string, but can someone confirm this concretely?

1 Answer

Editorial Team · Answer 1 · 2026-05-28T05:48:09+00:00

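The bytes 0xfe and 0xff cannot appear in any UTF-8 sequence, whether current or obsolete. The bytes 0xf8-0xfd can appear only as the first byte of the five- and six-byte forms, which current UTF-8 no longer permits.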

Every byte in a UTF-8 sequence must match one of the following bit patterns:

0xxxxxxx - Single-byte character (the lower 7 bits, i.e. ASCII).
10xxxxxx - Second and subsequent bytes in a multi-byte sequence.
110xxxxx - First byte of a two-byte sequence.
1110xxxx - First byte of a three-byte sequence.
11110xxx - First byte of a four-byte sequence.
111110xx - First byte of a five-byte sequence (obsolete).
1111110x - First byte of a six-byte sequence (obsolete).
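
The bit patterns above can be turned into a small classifier. What follows is only a sketch: the function name `classify_utf8_byte` and the labels it returns are made up for illustration, and it looks only at the leading bits of a single byte, so it does not validate whole sequences.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte by its leading bits (hypothetical helper)."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if (b & 0x80) == 0x00:
        return "ascii"         # 0xxxxxxx
    if (b & 0xC0) == 0x80:
        return "continuation"  # 10xxxxxx
    if (b & 0xE0) == 0xC0:
        return "lead2"         # 110xxxxx
    if (b & 0xF0) == 0xE0:
        return "lead3"         # 1110xxxx
    if (b & 0xF8) == 0xF0:
        return "lead4"         # 11110xxx
    if (b & 0xFC) == 0xF8:
        return "lead5"         # 111110xx, obsolete form only
    if (b & 0xFE) == 0xFC:
        return "lead6"         # 1111110x, obsolete form only
    return "never"             # 0xFE and 0xFF match no pattern
```

Only 0xfe and 0xff fall through every pattern, which is why they are the most conservative choice for a terminator.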

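There are no sequences of seven or more bytes. The current definition of UTF-8 (RFC 3629) only allows sequences of up to 4 bytes, which leaves 0xf8-0xff unused. (It also caps code points at U+10FFFF and forbids overlong encodings, so 0xc0, 0xc1 and 0xf5-0xf7 are unused as well.) It is possible, though, that a byte sequence that was valid under an obsolete definition of UTF-8 could include octets in 0xf8-0xfd. Either way, 0xff is safe to use as a terminator.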
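
For the current four-byte definition, the set of unused bytes can be checked by brute force. The sketch below assumes Python's built-in `utf-8` codec as the reference encoder; the names `TERMINATOR`, `bytes_used_by_utf8`, and `terminate` are illustrative and not part of any library. It encodes every code point except the surrogates (which cannot be encoded) and records which byte values occur.

```python
TERMINATOR = 0xFF  # assumed choice of terminator byte


def bytes_used_by_utf8() -> set:
    """Collect every byte value that occurs in some encoded code point."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        used.update(chr(cp).encode("utf-8"))
    return used


def terminate(s: str) -> bytes:
    """Encode a string and append the terminator byte (hypothetical helper)."""
    return s.encode("utf-8") + bytes([TERMINATOR])


# Byte values that never occur in any encoded code point.
unused = sorted(set(range(256)) - bytes_used_by_utf8())
print([hex(b) for b in unused])
```

Because the terminator never occurs inside an encoded string, two keys that share a prefix still share it byte for byte in the trie, and the terminator only ever appears as a leaf edge.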
