I need to manipulate UTF-8 byte arrays in a low-level environment. The strings will be prefix-similar and kept in a container that exploits this (a trie). To preserve this prefix-similarity as much as possible, I’d prefer to use a terminator at the end of my byte arrays, rather than (say) a byte-length prefix.
What terminator should I use? It seems 0xff is an illegal byte in all positions of any UTF-8 string, but perhaps someone knows concretely?
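The bytes 0xff and 0xfe cannot appear in a UTF-8 sequence under any version of the encoding. Under the current definition (RFC 3629), 0xf5 through 0xfd are invalid as well, as are 0xc0 and 0xc1.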
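All UTF-8 bytes, including those of the obsolete five- and six-byte forms, must match one of the following bit patterns:
0xxxxxxx: a single-byte (ASCII) character
10xxxxxx: a continuation byte (second or later byte of a multi-byte sequence)
110xxxxx: first byte of a two-byte sequence
1110xxxx: first byte of a three-byte sequence
11110xxx: first byte of a four-byte sequence
111110xx: first byte of a five-byte sequence (obsolete)
1111110x: first byte of a six-byte sequence (obsolete)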
There are no sequences of seven or more bytes. The current version of UTF-8 only allows sequences up to 4 bytes in length, which leaves 0xf8-0xff unused (and, because code points stop at U+10FFFF, 0xf5-0xf7 as well). It is possible, though, that a byte sequence could validly be called UTF-8 according to an obsolete version and include octets in 0xf8-0xfd.
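To make this concrete, here is a minimal sketch in C, assuming you settle on 0xff as the terminator. The names KEY_TERMINATOR, never_in_utf8 and key_length are made up for illustration and are not from any library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical terminator choice: 0xff is not produced by any version of UTF-8. */
#define KEY_TERMINATOR 0xffu

/* True if `b` can never occur in UTF-8 as restricted by RFC 3629:
 * 0xc0 and 0xc1 could only start overlong two-byte encodings, and
 * 0xf5..0xff would encode code points above U+10FFFF, start the
 * obsolete five- or six-byte forms, or (0xfe, 0xff) match no pattern. */
static bool never_in_utf8(uint8_t b)
{
    return b == 0xc0 || b == 0xc1 || b >= 0xf5;
}

/* Length in bytes of a terminated key, not counting the terminator.
 * Assumes the caller guarantees that a terminator is present. */
static size_t key_length(const uint8_t *key)
{
    size_t n = 0;
    while (key[n] != KEY_TERMINATOR)
        n++;
    return n;
}
```

One consequence of this choice: unlike a 0x00 terminator, which is itself the valid encoding of U+0000, 0xff lets keys contain embedded NUL bytes. It also compares greater than every byte that can occur in valid UTF-8, which may or may not matter for how your trie orders its children.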