The Archive Base

Editorial Team
Asked: May 23, 2026

I am converting from UTF-8 to the actual values in hex. However, there are some invalid byte sequences that I need to catch. Is there a quick way in C++ to check whether a character doesn’t belong in UTF-8?

1 Answer

  1. Editorial Team
    Added an answer on May 23, 2026 at 1:05 pm

    Follow the tables in chapter 3 of the Unicode standard. (I used the Unicode 5.1.0 version of the chapter, p103; the table is Table 3-7 on p94 of Unicode 6.0.0, p95 of Unicode 6.3, and p125 of Unicode 8.0.0.)

    Bytes 0xC0, 0xC1, and 0xF5..0xFF cannot appear in valid UTF-8.
    The valid sequences are documented; all others are invalid.
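
    As a first, cheap filter, the byte values listed above can be rejected without looking at any neighbouring bytes. This is only a sketch: the function name `never_valid_in_utf8` is made up for illustration, and a byte that passes this test can still be part of an invalid sequence.

    ```cpp
    #include <cstdint>

    // Returns true for byte values that cannot appear anywhere in valid UTF-8:
    // 0xC0, 0xC1 and 0xF5..0xFF. (Hypothetical helper name.)
    inline bool never_valid_in_utf8(std::uint8_t byte) {
        return byte == 0xC0 || byte == 0xC1 || byte >= 0xF5;
    }
    ```

    Calling `never_valid_in_utf8(0xC1)` returns true, while `never_valid_in_utf8(0xF4)` returns false because F4 is a legal first byte.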

    Table 3-7. Well-Formed UTF-8 Byte Sequences

    Code Points        First Byte Second Byte Third Byte Fourth Byte
    U+0000..U+007F     00..7F
    U+0080..U+07FF     C2..DF     80..BF
    U+0800..U+0FFF     E0         A0..BF      80..BF
    U+1000..U+CFFF     E1..EC     80..BF      80..BF
    U+D000..U+D7FF     ED         80..9F      80..BF
    U+E000..U+FFFF     EE..EF     80..BF      80..BF
    U+10000..U+3FFFF   F0         90..BF      80..BF     80..BF
    U+40000..U+FFFFF   F1..F3     80..BF      80..BF     80..BF
    U+100000..U+10FFFF F4         80..8F      80..BF     80..BF
    

    Note that the irregularities are in the second byte for certain values of the first byte; the third and fourth bytes, when needed, always fall in 80..BF. Not every code point within the ranges identified as valid has been allocated (and some are explicitly ‘non-characters’), so more validation may still be needed.
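
    One way to turn Table 3-7 into a check is to pick the allowed second-byte range from the first byte and require 80..BF for any remaining bytes. The following is a minimal sketch under that reading of the table; the name `is_well_formed_utf8` and its interface are assumptions for illustration, not part of any library. It checks well-formedness only and does not test whether a code point is allocated.

    ```cpp
    #include <cstddef>
    #include <string>

    // Returns true if every byte sequence in `text` matches a row of Table 3-7.
    // (Hypothetical helper; checks well-formedness only.)
    inline bool is_well_formed_utf8(const std::string& text) {
        const unsigned char* p = reinterpret_cast<const unsigned char*>(text.data());
        const std::size_t n = text.size();
        std::size_t i = 0;
        while (i < n) {
            const unsigned char b0 = p[i];
            std::size_t trailing = 0;            // continuation bytes expected
            unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte

            if (b0 <= 0x7F) { ++i; continue; }   // one-byte sequence (ASCII)
            else if (b0 >= 0xC2 && b0 <= 0xDF) { trailing = 1; }
            else if (b0 == 0xE0)               { trailing = 2; lo = 0xA0; }
            else if (b0 >= 0xE1 && b0 <= 0xEC) { trailing = 2; }
            else if (b0 == 0xED)               { trailing = 2; hi = 0x9F; }
            else if (b0 == 0xEE || b0 == 0xEF) { trailing = 2; }
            else if (b0 == 0xF0)               { trailing = 3; lo = 0x90; }
            else if (b0 >= 0xF1 && b0 <= 0xF3) { trailing = 3; }
            else if (b0 == 0xF4)               { trailing = 3; hi = 0x8F; }
            else return false;  // 80..BF, C0, C1 and F5..FF cannot start a sequence

            if (n - i - 1 < trailing) return false;            // sequence is truncated
            if (p[i + 1] < lo || p[i + 1] > hi) return false;  // irregular second byte
            for (std::size_t k = 2; k <= trailing; ++k) {
                if (p[i + k] < 0x80 || p[i + k] > 0xBF) return false;
            }
            i += trailing + 1;
        }
        return true;
    }
    ```

    With this sketch, `is_well_formed_utf8("\xE2\x82\xAC")` (the euro sign) returns true, and `is_well_formed_utf8("\xED\xA0\x80")` returns false because A0 is outside 80..9F for first byte ED.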

    The code points U+D800..U+DBFF are UTF-16 high surrogates and U+DC00..U+DFFF are UTF-16 low surrogates; they cannot appear in valid UTF-8 (values outside the BMP, the Basic Multilingual Plane, are encoded directly in UTF-8), which is why the table has no row for U+D800..U+DFFF: first byte ED allows only 80..9F as the second byte.

    Other excluded ranges (initial byte C0 or C1, or initial byte E0 followed by 80..9F, or initial byte F0 followed by 80..8F) are non-minimal encodings. For example, C0 80 would encode U+0000, but that’s encoded by 00, and UTF-8 defines that the non-minimal encoding C0 80 is invalid. And the maximum Unicode code point is U+10FFFF; UTF-8 encodings starting from F4 90 upwards generate values that are out of range.
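
    To see why these sequences are excluded, it can help to decode them using only the UTF-8 bit layout and look at the resulting values. The helper below is a hypothetical illustration (the name `raw_decode` is not from any library) and is deliberately not a validator: it assumes the caller passes 1 to 4 bytes with a lead byte of the matching length.

    ```cpp
    #include <cstddef>
    #include <cstdint>

    // Decodes one sequence from the bit layout alone, ignoring the second-byte
    // restrictions of Table 3-7. Assumes len is 1..4 and matches the lead byte.
    inline std::uint32_t raw_decode(const unsigned char* s, std::size_t len) {
        // Payload bits in the lead byte: 7, 5, 4 or 3 depending on the length.
        std::uint32_t cp = (len == 1) ? s[0]
                         : (len == 2) ? (s[0] & 0x1Fu)
                         : (len == 3) ? (s[0] & 0x0Fu)
                                      : (s[0] & 0x07u);
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t i = 1; i < len; ++i) {
            cp = (cp << 6) | (s[i] & 0x3Fu);
        }
        return cp;
    }
    ```

    Under these assumptions, C0 80 and E0 80 80 both decode to 0 (non-minimal forms of U+0000), ED A0 80 decodes to 0xD800 (a surrogate), and F4 90 80 80 decodes to 0x110000 (one past the maximum U+10FFFF).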
