Editorial Team
Asked: June 18, 2026

A problem with various character encodings is that the containing file is not always

A problem with various character encodings is that the containing file is not always clearly marked. There are inconsistent conventions for marking some of them with byte-order marks (BOMs). But in essence you have to be told what the file encoding is in order to read it accurately.

We build programming tools that read source files, and this gives us grief. We have means to specify defaults, sniff for BOMs, and so on, and we do pretty well with conventions and defaults. But one place where we (and I assume everybody else) get hung up is UTF-8 files that are not BOM-marked.
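BOM sniffing of the kind mentioned above amounts to comparing the first few bytes of the file against the known signatures. A minimal sketch, assuming the leading bytes of the file are already in memory; the `Bom` enum and `sniffBom` function are hypothetical names for illustration, not taken from any particular tool:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration; not taken from any particular tool.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Compares the first bytes of a buffer against the standard Unicode
// byte-order-mark signatures. Returns Bom::None if nothing matches,
// which is the case where a fallback heuristic or default is needed.
Bom sniffBom(const unsigned char *p, std::size_t len)
{
    assert(p != nullptr || len == 0);
    // Test UTF-32 before UTF-16: FF FE 00 00 begins with the UTF-16LE mark.
    if (len >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return Bom::Utf32LE;
    if (len >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return Bom::Utf32BE;
    if (len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return Bom::Utf8;
    if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return Bom::Utf16LE;
    if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}
```

Even this is a guess rather than a proof: a UTF-16LE file whose first character after the BOM is U+0000 starts with the same four bytes as a UTF-32LE BOM.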

Recent MS IDEs (e.g., Visual Studio 2010) will apparently “sniff” a file to determine whether it is UTF-8 encoded without a BOM. (Being in the tools business, we’d like to be compatible with MS because of their market share, even if it means having to go over the “stupid” cliff with them.) I’m specifically interested in what they use as a heuristic (although discussions of heuristics in general are fine). How can it be “right”? (Consider an ISO 8859-x encoded string interpreted this way.)
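To make the ISO 8859 concern concrete: the byte pair C3 A9 is a well-formed UTF-8 encoding of U+00E9 (é), but it is also a perfectly legal ISO 8859-1 string of two characters, U+00C3 and U+00A9 (Ã©). A small sketch of both readings; `decodeLatin1` and `decodeUtf8Pair` are hypothetical helper names, and the second handles only the 2-byte case:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Decode bytes as ISO 8859-1: every byte maps directly to the code point
// with the same value, so any byte sequence at all is "valid".
std::vector<char32_t> decodeLatin1(const unsigned char *p, std::size_t len)
{
    assert(p != nullptr || len == 0);
    return std::vector<char32_t>(p, p + len);
}

// Decode one 2-byte UTF-8 sequence (lead 110xxxxx, continuation 10xxxxxx).
// Returns 0 if the two bytes do not have that shape. Shape check only:
// overlong forms such as C1 80 are not rejected here.
char32_t decodeUtf8Pair(unsigned char b0, unsigned char b1)
{
    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return 0;
    return (char32_t(b0 & 0x1F) << 6) | char32_t(b1 & 0x3F);
}
```

The same two bytes decode without error under both encodings, so a sniffer can only say that UTF-8 is plausible. In the other direction, an ISO 8859-1 "é" (the single byte E9) followed by an ASCII letter does not have the UTF-8 shape, which is why real ISO 8859 text usually fails the test quickly.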

EDIT: This paper on detecting character encodings/sets is pretty interesting:
http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

EDIT December 2012: We ended up scanning the entire file to see whether it contains any violations of UTF-8 sequences; if it does not, we call it UTF-8. The bad part of this solution is that you have to process the characters twice if the file is UTF-8. (If it isn’t UTF-8, this test is likely to determine that fairly quickly, unless the file happens to be all 7-bit ASCII, at which point reading it as UTF-8 won’t hurt.)
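A whole-buffer scan of this kind might look like the sketch below. It is not our production code: `ScanResult` and `scanForUtf8` are hypothetical names, and the byte ranges follow the well-formed UTF-8 table in the Unicode Standard, so overlong forms, UTF-16 surrogates and values above U+10FFFF are rejected as well.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for illustration.
enum class ScanResult { Ascii, Utf8, NotUtf8 };

// Scans a buffer once and reports whether it is pure 7-bit ASCII,
// well-formed UTF-8 with at least one multi-byte sequence, or not UTF-8.
// Returns at the first violation.
ScanResult scanForUtf8(const unsigned char *p, std::size_t len)
{
    assert(p != nullptr || len == 0);
    bool sawMultiByte = false;
    std::size_t i = 0;
    while (i < len) {
        unsigned char b = p[i];
        if (b < 0x80) { ++i; continue; }            // ASCII byte

        std::size_t n;                               // total sequence length
        unsigned char lo = 0x80, hi = 0xBF;          // allowed range of 2nd byte
        if (b >= 0xC2 && b <= 0xDF) {
            n = 2;                                   // C0 and C1 would be overlong
        } else if (b >= 0xE0 && b <= 0xEF) {
            n = 3;
            if (b == 0xE0) lo = 0xA0;                // reject overlong forms
            if (b == 0xED) hi = 0x9F;                // reject UTF-16 surrogates
        } else if (b >= 0xF0 && b <= 0xF4) {
            n = 4;
            if (b == 0xF0) lo = 0x90;                // reject overlong forms
            if (b == 0xF4) hi = 0x8F;                // reject values above U+10FFFF
        } else {
            return ScanResult::NotUtf8;              // stray continuation or bad lead
        }

        if (len - i < n) return ScanResult::NotUtf8; // truncated sequence
        if (p[i + 1] < lo || p[i + 1] > hi) return ScanResult::NotUtf8;
        for (std::size_t k = 2; k < n; ++k)
            if ((p[i + k] & 0xC0) != 0x80) return ScanResult::NotUtf8;

        sawMultiByte = true;
        i += n;
    }
    return sawMultiByte ? ScanResult::Utf8 : ScanResult::Ascii;
}
```

The separate `Ascii` result covers the all-7-bit case: nothing distinguishes the encodings there, and reading the file as UTF-8 is harmless.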

1 Answer

  1. Editorial Team
    Added an answer on June 18, 2026 at 11:24 pm

    If the encoding is UTF-8, the first byte you see over 0x7F must be the start of a multi-byte UTF-8 sequence, so test for that. Here is the code we use:

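    typedef unsigned char unc; // assumed: "unc" is shorthand for unsigned char

    // Returns the length in bytes (2, 3 or 4) of the multi-byte UTF-8
    // sequence starting at cpt, or 0 if the bytes there do not form one.
    // Shown as a free function; the original is a class member.
    // cpt must point into a NUL-terminated buffer: a NUL fails the
    // continuation test, so the && chain stops before reading past the end.
    // This is a shape check only; overlong forms and surrogates pass.
    unc IsUTF8(const unc *cpt)
    {
        if (!cpt)
            return 0;

        if ((*cpt & 0xF8) == 0xF0) { // start of 4-byte sequence
            if (((*(cpt + 1) & 0xC0) == 0x80)
             && ((*(cpt + 2) & 0xC0) == 0x80)
             && ((*(cpt + 3) & 0xC0) == 0x80))
                return 4;
        }
        else if ((*cpt & 0xF0) == 0xE0) { // start of 3-byte sequence
            if (((*(cpt + 1) & 0xC0) == 0x80)
             && ((*(cpt + 2) & 0xC0) == 0x80))
                return 3;
        }
        else if ((*cpt & 0xE0) == 0xC0) { // start of 2-byte sequence
            if ((*(cpt + 1) & 0xC0) == 0x80)
                return 2;
        }
        return 0; // continuation byte, invalid lead byte, or malformed sequence
    }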
    

    If you get a return of 0, it is not valid UTF-8. Otherwise, skip the number of bytes returned and continue checking at the next byte over 0x7F.
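The skip-and-continue loop described above could be sketched as follows. This is an illustration rather than the code we ship: `sequenceLength` restates the same shape test so the example is self-contained, `looksLikeUtf8` is a hypothetical name, and the input is assumed to be a NUL-terminated byte string.

```cpp
#include <cassert>

// Same shape test as described above, restated so this sketch stands alone:
// length (2-4) of the sequence starting at p, or 0 if it is malformed.
static int sequenceLength(const unsigned char *p)
{
    int n;
    if ((*p & 0xF8) == 0xF0)      n = 4;
    else if ((*p & 0xF0) == 0xE0) n = 3;
    else if ((*p & 0xE0) == 0xC0) n = 2;
    else return 0;                      // continuation byte or invalid lead
    for (int k = 1; k < n; ++k)
        if ((p[k] & 0xC0) != 0x80)      // a terminating NUL also fails here,
            return 0;                   // so we never read past the end
    return n;
}

// Walks a NUL-terminated byte string; returns true if every byte over
// 0x7F begins a well-formed multi-byte sequence.
bool looksLikeUtf8(const unsigned char *text)
{
    assert(text != nullptr);
    while (*text) {
        if (*text <= 0x7F) { ++text; continue; }   // plain ASCII, move on
        int n = sequenceLength(text);
        if (n == 0) return false;                  // first violation decides
        text += n;                                 // skip the whole sequence
    }
    return true;
}
```

A pure ASCII string returns true as well, since it contains no byte over 0x7F to test.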
