I’m trying to figure out what “continuation bytes” are in the UTF-8 encoding (for curiosity’s sake).
Wikipedia introduces this term in the UTF-8 article without defining it at all, and a Google search returns no useful information either. I’m about to jump into the official specification, but would prefer to read a high-level summary first.
A continuation byte in UTF-8 is any byte where the top two bits are `10`. They are the subsequent bytes in multi-byte sequences. The following table may help:
Here you can see how the Unicode code points map to UTF-8 multi-byte byte sequences, and their equivalent binary values.
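To make that mapping concrete, here is a small Python sketch that prints the UTF-8 bytes of a few characters in binary. The helper name `utf8_bits` and the sample characters are my own choices for illustration, not part of any standard API.

```python
def utf8_bits(ch: str) -> list[str]:
    """Return the UTF-8 bytes of ch as 8-character binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]

# One sample each for 1-, 2-, 3- and 4-byte sequences.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", " ".join(utf8_bits(ch)))

# U+0041 01000001
# U+00E9 11000011 10101001
# U+20AC 11100010 10000010 10101100
# U+1F600 11110000 10011111 10011000 10000000
```

Every byte after the first in each multi-byte row starts with `10`; those are the continuation bytes.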
The basic rules are these:

1. If a byte starts with a `0` bit, it’s a single byte value less than 128.
2. If it starts with `11`, it’s the first byte of a multi-byte sequence, and the number of `1` bits at the start indicates how many bytes there are in total (`110xxxxx` has two bytes, `1110xxxx` has three and `11110xxx` has four).
3. If it starts with `10`, it’s a continuation byte.

This distinction allows quite handy processing, such as being able to back up from any byte in a sequence to find the first byte of that code point. Just search backwards until you find one not beginning with the `10` bits.

Similarly, it can also be used for a UTF-8 `strlen` by only counting non-`10xxxxxx` bytes.
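As a rough sketch of both tricks in Python: the function names `is_continuation`, `start_of_code_point` and `utf8_length` are made up for this example, and the code assumes the input is already valid UTF-8 (it does no error checking).

```python
def is_continuation(byte: int) -> bool:
    """True if the top two bits of the byte are 10."""
    return (byte & 0b1100_0000) == 0b1000_0000


def start_of_code_point(data: bytes, index: int) -> int:
    """Back up from index to the first byte of the code point containing it."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


def utf8_length(data: bytes) -> int:
    """Count code points by counting every byte that is not a continuation byte."""
    return sum(1 for b in data if not is_continuation(b))


data = "a\u20acb".encode("utf-8")     # bytes: 61 E2 82 AC 62
print(start_of_code_point(data, 3))   # 1 (index 3 is the last byte of the euro sign)
print(utf8_length(data))              # 3 (five bytes, three code points)
```

The same mask-and-compare test (`byte & 0xC0 == 0x80`) carries over directly to C or any other language that works on raw bytes.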