I’m looking to take some shortcuts when looking for non-printable ASCII characters in raw byte streams of text encoded using Unicode encoding schemes.
I know, for instance, that in UTF-8 encoding, if a character is encoded using multiple bytes, each byte will always be >= 128, so if a byte has a value < 32 I know it’s a non-printable ASCII character. I want to know if I can take similar shortcuts with UTF-16 and UTF-32.
I know UTF-16 and UTF-32 use zero padding for encoded ASCII characters, but wanted to know if individual bytes in non-ASCII range characters could ever be less than 32.
Basically I would like to know if I can scan bytes for ASCII characters below 32 reliably (as I can with UTF-8), without having to decode the stream into characters.
For reference I’m looking for line breaks (10, 13) to index text into lines, and looking at optimal ways of doing this i.e. without decoding into characters.
UTF-32 is a straightforward, no-frills encoding. Each character is represented directly by its 32-bit code point. Unlike UTF-8, it makes no guarantee that ASCII byte values never appear inside the encoding of a non-ASCII character. Line feed and carriage return are decimal 10 and 13, i.e. 0x0A and 0x0D, and any code point of the form
U+xxxx0A, U+xx0Axx, or U+0Axxxx (and likewise with 0D) will contain the byte 0x0A when “encoded” as UTF-32; U+0A0A and U+0D0A are two examples. However, because every character is always a full 32 bits, you can read the stream in 4-byte chunks and look for the 4-byte value
0x0000000A or 0x0000000D, taking the stream's byte order into account.
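A minimal sketch of that chunked scan, in Python. The function name `find_line_breaks_utf32` and its `byteorder` parameter are made up for illustration. It assumes the byte order is already known and that the buffer holds whole 4-byte units. It compares each unit against line feed and carriage return, which are decimal 10 and 13 (0x0A and 0x0D).

```python
def find_line_breaks_utf32(data: bytes, byteorder: str = "little") -> list[int]:
    """Return the byte offsets of LF (U+000A) and CR (U+000D) in UTF-32 data.

    Assumes `data` holds whole 4-byte code units in the given byte order
    ("little" or "big"); no decoding into characters is performed.
    """
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data length must be a multiple of 4")

    offsets = []
    for offset in range(0, len(data), 4):
        # Interpret the 4-byte chunk as one unsigned 32-bit code unit.
        unit = int.from_bytes(data[offset:offset + 4], byteorder)
        if unit in (0x0A, 0x0D):
            offsets.append(offset)
    return offsets


if __name__ == "__main__":
    sample = "a\nb\r\nc".encode("utf-32-le")
    print(find_line_breaks_utf32(sample))
```

Comparing whole 4-byte units rather than single bytes is what avoids false positives: U+0A0A contains the byte 0x0A twice, but its 32-bit value is 0x00000A0A, so it is not reported.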