Editorial Team
Asked: May 11, 2026

I have read Joel’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must

I have read Joel’s article ‘The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)’ but still don’t understand all the details. An example will illustrate my issues. Look at this file below:

alt text
(source: yart.com.au)

I have opened the file in a binary editor to closely examine the last of the three a’s next to the first Chinese character:

alt text
(source: yart.com.au)

According to Joel:

In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

So does the editor say:

  1. E6 (230) is above code point 128.
  2. Thus I will interpret the following bytes as either 2, 3, in fact, up to 6 bytes.

If so, what indicates that the interpretation is more than 2 bytes? How is this indicated by the bytes that follow E6?

Is my Chinese character stored in 2, 3, 4, 5 or 6 bytes?

1 Answer

  1. Added an answer on May 11, 2026 at 2:07 pm

    If the encoding is UTF-8, then the following table shows how a Unicode code point (up to 21 bits) is converted into UTF-8 encoding:

    Scalar Value                 1st Byte  2nd Byte  3rd Byte  4th Byte
    00000000 0xxxxxxx            0xxxxxxx
    00000yyy yyxxxxxx            110yyyyy  10xxxxxx
    zzzzyyyy yyxxxxxx            1110zzzz  10yyyyyy  10xxxxxx
    000uuuuu zzzzyyyy  yyxxxxxx  11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
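The bit patterns in the table can be turned into a small decoder. What follows is a minimal sketch in Python, not a production decoder: the function names `sequence_length` and `decode_one` are made up for this illustration, and the code checks only the bit patterns, so it does not reject overlong forms or surrogates.

```python
def sequence_length(lead: int) -> int:
    """Return the sequence length implied by the high bits of the lead byte."""
    if lead & 0b1000_0000 == 0b0000_0000:   # 0xxxxxxx
        return 1
    if lead & 0b1110_0000 == 0b1100_0000:   # 110yyyyy
        return 2
    if lead & 0b1111_0000 == 0b1110_0000:   # 1110zzzz
        return 3
    if lead & 0b1111_1000 == 0b1111_0000:   # 11110uuu
        return 4
    raise ValueError(f"0x{lead:02X} is not a valid lead byte")


def decode_one(data: bytes) -> int:
    """Decode the first code point in data using the bit layout in the table."""
    n = sequence_length(data[0])
    if len(data) < n:
        raise ValueError("truncated sequence")
    if n == 1:
        return data[0]
    # Keep the payload bits of the lead byte: 5, 4 or 3 bits for n = 2, 3, 4.
    code_point = data[0] & (0xFF >> (n + 1))
    for byte in data[1:n]:
        if byte & 0b1100_0000 != 0b1000_0000:   # continuation bytes are 10xxxxxx
            raise ValueError(f"0x{byte:02X} is not a continuation byte")
        code_point = (code_point << 6) | (byte & 0b0011_1111)
    return code_point


if __name__ == "__main__":
    print(sequence_length(0xE6))                           # 3
    print(f"U+{decode_one(bytes([0xE6, 0xBE, 0xB3])):04X}")  # U+6FB3
```

For the byte E6 from the question, the top four bits are 1110, so the lead byte itself announces a three-byte sequence; the two bytes that follow only need to start with the bits 10.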

    Some byte values are not allowed: in particular, bytes 0xC0, 0xC1, and 0xF5..0xFF can never appear in well-formed UTF-8. Certain other combinations are also forbidden; the irregularities are in the 1st byte and 2nd byte columns of the table below. Note that the code points U+D800..U+DFFF are reserved for UTF-16 surrogates and cannot appear in valid UTF-8.

    Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
    U+0000..U+007F       00..7F
    U+0080..U+07FF       C2..DF    80..BF
    U+0800..U+0FFF       E0        A0..BF    80..BF
    U+1000..U+CFFF       E1..EC    80..BF    80..BF
    U+D000..U+D7FF       ED        80..9F    80..BF
    U+E000..U+FFFF       EE..EF    80..BF    80..BF
    U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
    U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
    U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF

    These tables are lifted from the Unicode standard version 5.1.
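The byte-range table translates fairly directly into a well-formedness check. The sketch below is one possible way to write it in Python; the names `WELL_FORMED` and `is_well_formed` are hypothetical, and the rows are copied from the table above rather than from any library.

```python
# Each row: (first-byte low, first-byte high, second-byte low, second-byte high, length).
# Third and fourth bytes, where present, are always 80..BF.
WELL_FORMED = [
    (0x00, 0x7F, None, None, 1),
    (0xC2, 0xDF, 0x80, 0xBF, 2),
    (0xE0, 0xE0, 0xA0, 0xBF, 3),
    (0xE1, 0xEC, 0x80, 0xBF, 3),
    (0xED, 0xED, 0x80, 0x9F, 3),
    (0xEE, 0xEF, 0x80, 0xBF, 3),
    (0xF0, 0xF0, 0x90, 0xBF, 4),
    (0xF1, 0xF3, 0x80, 0xBF, 4),
    (0xF4, 0xF4, 0x80, 0x8F, 4),
]


def is_well_formed(data: bytes) -> bool:
    """Return True if data consists only of sequences allowed by the table."""
    i = 0
    while i < len(data):
        for lo, hi, lo2, hi2, length in WELL_FORMED:
            if lo <= data[i] <= hi:
                break
        else:
            return False          # first byte is 80..C1 or F5..FF
        seq = data[i:i + length]
        if len(seq) < length:
            return False          # sequence is cut off at the end of the data
        if length > 1 and not (lo2 <= seq[1] <= hi2):
            return False          # second byte outside the row's range
        if any(not (0x80 <= b <= 0xBF) for b in seq[2:]):
            return False          # third or fourth byte is not 80..BF
        i += length
    return True


if __name__ == "__main__":
    print(is_well_formed(bytes([0xE6, 0xBE, 0xB3])))  # True
    print(is_well_formed(bytes([0xC0, 0x80])))        # False: C0 never appears
    print(is_well_formed(bytes([0xED, 0xA0, 0x80])))  # False: would encode a surrogate
```

The restricted second-byte ranges are what rule out overlong encodings (E0 followed by 80..9F), surrogates (ED followed by A0..BF) and values above U+10FFFF (F4 followed by 90..BF).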


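    In the question, the bytes from offset 0x0010 to 0x008F decode as follows. Each Chinese character occupies three bytes, because its first byte falls in the range E1..EC: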

    0x61           = U+0061
    0x61           = U+0061
    0x61           = U+0061
    0xE6 0xBE 0xB3 = U+6FB3
    0xE5 0xA4 0xA7 = U+5927
    0xE5 0x88 0xA9 = U+5229
    0xE4 0xBA 0x9A = U+4E9A
    0xE4 0xB8 0xAD = U+4E2D
    0xE6 0x96 0x87 = U+6587
    0xE8 0xAE 0xBA = U+8BBA
    0xE5 0x9D 0x9B = U+575B
    0x2C           = U+002C
    0xE6 0xBE 0xB3 = U+6FB3
    0xE6 0xB4 0xB2 = U+6D32
    0xE8 0xAE 0xBA = U+8BBA
    0xE5 0x9D 0x9B = U+575B
    0x2C           = U+002C
    0xE6 0xBE 0xB3 = U+6FB3
    0xE6 0xB4 0xB2 = U+6D32
    0xE6 0x96 0xB0 = U+65B0
    0xE9 0x97 0xBB = U+95FB
    0x2C           = U+002C
    0xE6 0xBE 0xB3 = U+6FB3
    0xE6 0xB4 0xB2 = U+6D32
    0xE4 0xB8 0xAD = U+4E2D
    0xE6 0x96 0x87 = U+6587
    0xE7 0xBD 0x91 = U+7F51
    0xE7 0xAB 0x99 = U+7AD9
    0x2C           = U+002C
    0xE6 0xBE 0xB3 = U+6FB3
    0xE5 0xA4 0xA7 = U+5927
    0xE5 0x88 0xA9 = U+5229
    0xE4 0xBA 0x9A = U+4E9A
    0xE6 0x9C 0x80 = U+6700
    0xE5 0xA4 0xA7 = U+5927
    0xE7 0x9A 0x84 = U+7684
    0xE5 0x8D 0x8E = U+534E
    0x2D           = U+002D
    0x29           = U+0029
    0xE5 0xA5 0xA5 = U+5965
    0xE5 0xB0 0xBA = U+5C3A
    0xE7 0xBD 0x91 = U+7F51
    0x26           = U+0026
    0x6C           = U+006C
    0x74           = U+0074
    0x3B           = U+003B
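A listing like this can be reproduced with Python's built-in UTF-8 codec. The sketch below uses only the first seven entries from the listing as sample input; the helper name `describe` is an assumption made for this example.

```python
def describe(raw: bytes):
    """Return one 'bytes = code point' line per character in UTF-8 data."""
    lines = []
    for ch in raw.decode("utf-8"):       # raises UnicodeDecodeError if ill-formed
        encoded = ch.encode("utf-8")     # the bytes this character occupied
        hex_bytes = " ".join(f"0x{b:02X}" for b in encoded)
        lines.append(f"{hex_bytes} = U+{ord(ch):04X}")
    return lines


if __name__ == "__main__":
    # First seven entries of the listing: three a's, then four Chinese characters.
    sample = bytes.fromhex("61 61 61 E6 BE B3 E5 A4 A7 E5 88 A9 E4 BA 9A")
    for line in describe(sample):
        print(line)
```

Running it on the sample prints `0x61 = U+0061` three times, followed by the three-byte entries for U+6FB3, U+5927, U+5229 and U+4E9A.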