The Archive Base

Editorial Team
Asked: May 17, 2026

Quick & dirty Q: Can I safely assume that no byte in the UTF-8, UTF-16 or UTF-32 encoding of a codepoint (character) will match an ASCII whitespace byte (unless the codepoint actually is that whitespace character)?

I’ll explain:

Say that I have a UTF-8 encoded string. This string contains some characters that take more than one byte to store. I need to find out if any of the characters in this string are ASCII whitespace characters (space, horizontal tab, vertical tab, carriage return, linefeed etc – Unicode defines some more whitespace characters, but forget about them).

So what I do is that I loop through the string and check if any of the bytes match the bytes that define whitespace characters. Take e.g. 0D (hex) for carriage return. Note that we are talking bytes here, not characters.
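A minimal sketch of the byte-wise check described above, in Python for brevity. The set of whitespace bytes and the name `find_ascii_whitespace` are assumptions for illustration (form feed is included under the "etc"); the real code may use a different set.

```python
# ASCII whitespace bytes: space, horizontal tab, linefeed, vertical tab,
# form feed, carriage return. Form feed is an assumption (the "etc").
ASCII_WHITESPACE = frozenset(b" \t\n\x0b\x0c\r")


def find_ascii_whitespace(data: bytes) -> list[int]:
    """Return the offset of every byte that equals an ASCII whitespace byte."""
    return [i for i, byte in enumerate(data) if byte in ASCII_WHITESPACE]


# "ï" and "é" each take two bytes in UTF-8, so offsets are byte offsets,
# not character offsets.
sample = "naïve café\r\n".encode("utf-8")
print(find_ascii_whitespace(sample))  # [6, 12, 13]
```

The offsets reported are the space, the carriage return and the linefeed; none of the bytes belonging to "ï" or "é" are flagged.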

Will this work? Will there be UTF-8 codepoints where the first byte will be 0D and the second byte something else – and this codepoint does not represent a carriage return? Maybe the other way around? Will there be codepoints where the first byte is something weird, and the second (or third, or fourth) byte is 0D – and this codepoint does not represent a carriage return?

UTF-8 is backwards compatible with ASCII, so I really hope that it will work for UTF-8. From what I know of it, it might, but I don’t know the details well enough to say for sure.

As for UTF-16 and UTF-32 I doubt it’ll work at all, but I barely know anything about the details of these, so feel free to surprise me there…


The reason for this whacky question is that I have code checking for whitespace that works for ASCII, and I need to know if it may break on Unicode. I have no choice but to check byte-for-byte, for a bunch of reasons. I’m hoping that the backwards compatibility with ASCII might give me at least UTF-8 support for free.

1 Answer

  1. Editorial Team
    Added an answer on May 17, 2026 at 10:39 pm

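    For UTF-8, yes, you can. All non-ASCII characters are represented by bytes with the high bit set, and all ASCII characters have the high bit unset.

    Just to be clear, every byte in the encoding of a non-ASCII character has the high bit set (lead bytes fall in 0xC2 to 0xF4, continuation bytes in 0x80 to 0xBF); this is by design. So a byte such as 0D in valid UTF-8 can only ever be a carriage return.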
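The high-bit claim can be checked by brute force. The sketch below (the function name is made up for illustration) encodes every Unicode code point from U+0080 upward with Python's built-in UTF-8 codec and confirms that no byte of any multi-byte sequence falls in the ASCII range. It walks about 1.1 million code points, so it may take a second or two.

```python
def multibyte_utf8_never_contains_ascii() -> bool:
    """Check that no byte of a multi-byte UTF-8 sequence is below 0x80."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        encoded = chr(cp).encode("utf-8")
        if any(byte < 0x80 for byte in encoded):
            return False
    return True


print(multibyte_utf8_never_contains_ascii())  # True

# Two concrete cases: every byte is 0x80 or above.
print("é".encode("utf-8").hex(" "))  # c3 a9
print("€".encode("utf-8").hex(" "))  # e2 82 ac
```

Since the ASCII whitespace bytes are all below 0x80, they cannot appear inside any of these sequences.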

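    You should never operate on UTF-16 or UTF-32 at the byte level; a byte-wise whitespace check will give false matches there. For example, U+200D (zero width joiner) is stored in UTF-16 as the bytes 20 and 0D, which look like a space and a carriage return. Lots of other things will break too, since for ASCII-range text every second byte in UTF-16 is '\0' (less so if your text is mostly in a non-Latin script).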
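To make the UTF-16/UTF-32 problem concrete, here is a small sketch that runs a byte-wise scan over one non-whitespace character in each encoding. The helper `whitespace_offsets` and the choice of U+200D (ZERO WIDTH JOINER) as the test character are illustrative assumptions; many other code points behave the same way.

```python
# Same assumed set of ASCII whitespace bytes: space, \t, \n, \v, \f, \r.
ASCII_WHITESPACE = frozenset(b" \t\n\x0b\x0c\r")


def whitespace_offsets(data: bytes) -> list[int]:
    """Byte-wise scan: offsets of bytes that look like ASCII whitespace."""
    return [i for i, byte in enumerate(data) if byte in ASCII_WHITESPACE]


zwj = "\u200d"  # ZERO WIDTH JOINER, which is not a whitespace character

print(zwj.encode("utf-16-be").hex(" "))          # 20 0d
print(whitespace_offsets(zwj.encode("utf-16-be")))  # [0, 1]  (false matches)
print(whitespace_offsets(zwj.encode("utf-32-be")))  # [2, 3]  (false matches)
print(whitespace_offsets(zwj.encode("utf-8")))      # []      (correct)

# Plain ASCII text in UTF-16 carries a zero byte per character.
print("A".encode("utf-16-le"))  # b'A\x00'
```

If UTF-16 or UTF-32 input has to be handled, decoding to code points (or at least stepping through it in 2- or 4-byte units) before comparing is the safer route.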

© 2021 The Archive Base. All Rights Reserved