Quick & dirty Q: Can I safely assume that a byte of a UTF-8,

Question

Asked: May 17, 2026

Quick & dirty Q: Can I safely assume that a byte of a UTF-8, UTF-16 or UTF-32 codepoint (character) will not be an ASCII whitespace character (unless the codepoint is representing one)?

I’ll explain:

Say that I have a UTF-8 encoded string. This string contains some characters that take more than one byte to store. I need to find out if any of the characters in this string are ASCII whitespace characters (space, horizontal tab, vertical tab, carriage return, linefeed etc – Unicode defines some more whitespace characters, but forget about them).

So what I do is that I loop through the string and check if any of the bytes match the bytes that define whitespace characters. Take e.g. 0D (hex) for carriage return. Note that we are talking bytes here, not characters.

Will this work? Will there be UTF-8 codepoints where the first byte will be 0D and the second byte something else – and this codepoint does not represent a carriage return? Maybe the other way around? Will there be codepoints where the first byte is something weird, and the second (or third, or fourth) byte is 0D – and this codepoint does not represent a carriage return?

UTF-8 is backwards compatible with ASCII, so I really hope that it will work for UTF-8. From what I know of it, it might, but I don’t know the details well enough to say for sure.

As for UTF-16 and UTF-32 I doubt it’ll work at all, but I barely know anything about the details of these, so feel free to surprise me there…

The reason for this whacky question is that I have code checking for whitespace that works for ASCII, and I need to know if it may break on Unicode. I have no choice but to check byte-for-byte, for a bunch of reasons. I’m hoping that the backwards compatibility with ASCII might give me at least UTF-8 support for free.

1 Answer

Editorial Team · Answer 1 · 2026-05-17T22:39:47+00:00

For UTF-8, yes, you can. All non-ASCII characters are encoded as bytes with the high bit set (80 to FF hex), and all ASCII characters are single bytes with the high bit unset (00 to 7F hex).

Just to be clear, every byte in the encoding of a non-ASCII character has the high bit set; this is by design.
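To illustrate the point, here is a minimal Python sketch of such a byte-level scan. The names `ASCII_WHITESPACE`, `has_ascii_whitespace` and `multibyte_encodings_avoid_ascii` are made up for this example, and the whitespace set (space, tab, linefeed, vertical tab, form feed, carriage return) is an assumption about which bytes you care about. The second function is a brute-force sanity check of the high-bit property, not a proof.

```python
# Hypothetical set of ASCII whitespace byte values: 0x20, 0x09, 0x0A, 0x0B, 0x0C, 0x0D.
ASCII_WHITESPACE = frozenset(b" \t\n\v\f\r")


def has_ascii_whitespace(data: bytes) -> bool:
    """Return True if any byte of the UTF-8 encoded data is an ASCII whitespace byte."""
    return any(b in ASCII_WHITESPACE for b in data)


def multibyte_encodings_avoid_ascii() -> bool:
    """Check every non-ASCII code point: no byte of its UTF-8 encoding is below 0x80."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not valid scalar values and cannot be encoded
        if any(b < 0x80 for b in chr(cp).encode("utf-8")):
            return False
    return True


print(has_ascii_whitespace("héllo wörld".encode("utf-8")))  # contains a real space
print(has_ascii_whitespace("héllo".encode("utf-8")))        # multi-byte chars only, no whitespace
print(multibyte_encodings_avoid_ascii())
```

Because a byte below 80 hex can only ever be a complete ASCII character, the scan needs no decoding and no knowledge of where multi-byte sequences start or end.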

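You should never operate on UTF-16 or UTF-32 at the byte level. A byte-for-byte whitespace check almost certainly won't work there: code units are 2 or 4 bytes wide, so a byte such as 0D or 20 can be just one part of an unrelated character (U+2020, for instance, is the two bytes 20 20 in UTF-16). In fact lots of things will break, since every second byte is likely to be '\0' (unless your text is mostly in a non-Latin script).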
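A small Python sketch of how the same byte scan produces false hits on UTF-16 and UTF-32. The helper name `false_hits` and the two sample characters are chosen purely for illustration; neither character is whitespace.

```python
# Same hypothetical whitespace byte set as a plain ASCII scanner would use.
ASCII_WHITESPACE = frozenset(b" \t\n\v\f\r")


def false_hits(text: str, encoding: str) -> list:
    """Return byte offsets in the encoded text whose value matches an ASCII whitespace byte."""
    return [i for i, b in enumerate(text.encode(encoding)) if b in ASCII_WHITESPACE]


dagger = "\u2020"      # a dagger sign; UTF-16 bytes are 20 20
malayalam = "\u0d0a"   # a Malayalam letter; UTF-16-BE bytes are 0D 0A

print(false_hits(dagger, "utf-16-le"))     # both bytes look like spaces
print(false_hits(malayalam, "utf-16-be"))  # looks like CR followed by LF
print(false_hits(dagger, "utf-32-le"))     # bytes are 20 20 00 00
print(false_hits(dagger, "utf-8"))         # no hits: bytes are E2 80 A0
print("A".encode("utf-16-le"))             # plain ASCII text picks up a zero byte per character
```

The zero bytes in the last line are what tends to break C-style string handling, independently of the whitespace problem.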
