Question

Asked: May 11, 2026

I have read Joel’s article ‘The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)’ but still don’t understand all the details. An example will illustrate my issues. Look at this file below:

_{(source: yart.com.au)}

I have opened the file in a binary editor to closely examine the last of the three a’s next to the first Chinese character:

_{(source: yart.com.au)}

According to Joel:

In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

So does the editor say:

E6 (230) is above code point 128.
Thus I will interpret the following bytes as either 2, 3, in fact, up to 6 bytes.

If so, what indicates that the interpretation is more than 2 bytes? How is this indicated by the bytes that follow E6?

Is my Chinese character stored in 2, 3, 4, 5 or 6 bytes?

1 Answer

score 0 · Answer 1 · 2026-05-11T14:07:48+00:00

If the encoding is UTF-8, then the following table shows how a Unicode code point (up to 21 bits) is converted into UTF-8 encoding:

Scalar Value                 1st Byte  2nd Byte  3rd Byte  4th Byte
00000000 0xxxxxxx            0xxxxxxx
00000yyy yyxxxxxx            110yyyyy  10xxxxxx
zzzzyyyy yyxxxxxx            1110zzzz  10yyyyyy  10xxxxxx
000uuuuu zzzzyyyy  yyxxxxxx  11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
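To connect the table above to the question: the high bits of the first byte say how long the sequence is, and every following byte starts with the bits 10. The lead byte 0xE6 is 11100110 in binary, which matches the 1110zzzz row, so the character takes three bytes. Here is a minimal Python sketch of that reading. The function names sequence_length and decode_one are made up for illustration, and the code only follows the bit layout; it does not reject the ill-formed sequences discussed next.

```python
def sequence_length(lead: int) -> int:
    """Return the UTF-8 sequence length implied by the lead byte's high bits."""
    if lead & 0b1000_0000 == 0b0000_0000:   # 0xxxxxxx
        return 1
    if lead & 0b1110_0000 == 0b1100_0000:   # 110yyyyy
        return 2
    if lead & 0b1111_0000 == 0b1110_0000:   # 1110zzzz
        return 3
    if lead & 0b1111_1000 == 0b1111_0000:   # 11110uuu
        return 4
    # 10xxxxxx (a continuation byte) and 11111xxx cannot start a sequence.
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")


def decode_one(data: bytes) -> int:
    """Decode the first code point in data using only the bit layout."""
    n = sequence_length(data[0])
    if len(data) < n:
        raise ValueError("truncated sequence")
    if n == 1:
        return data[0]
    # Keep the payload bits of the lead byte: 5, 4, or 3 bits for n = 2, 3, 4.
    code_point = data[0] & (0xFF >> (n + 1))
    for b in data[1:n]:
        if b & 0b1100_0000 != 0b1000_0000:   # continuation bytes are 10xxxxxx
            raise ValueError(f"0x{b:02X} is not a continuation byte")
        code_point = (code_point << 6) | (b & 0b0011_1111)
    return code_point


lead_len = sequence_length(0xE6)                 # 3
first_char = decode_one(bytes([0xE6, 0xBE, 0xB3]))
print(lead_len, f"U+{first_char:04X}")           # 3 U+6FB3
```

So the bytes that follow E6 do not carry the length; they only confirm, through their 10 prefix, that they belong to the sequence E6 started.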

There are a number of non-allowed values: in particular, bytes 0xC0, 0xC1, and 0xF5..0xFF can never appear in well-formed UTF-8. There are also a number of other verboten combinations. The irregularities are in the 1st byte and 2nd byte columns of the table below. Note that the code points U+D800..U+DFFF are reserved for UTF-16 surrogates and cannot appear in valid UTF-8.

Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF    80..BF
U+0800..U+0FFF       E0        A0..BF    80..BF
U+1000..U+CFFF       E1..EC    80..BF    80..BF
U+D000..U+D7FF       ED        80..9F    80..BF
U+E000..U+FFFF       EE..EF    80..BF    80..BF
U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF
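The byte ranges in the table above can be turned almost directly into a validity check. The following Python sketch is one possible way to do it, not a reference implementation; the names WELL_FORMED and is_well_formed are hypothetical. Each row holds the lead byte range, the allowed range for the 2nd byte, and the number of trailing bytes. Third and fourth bytes are always 80..BF.

```python
# (lead byte range, allowed range for the 2nd byte, number of trailing bytes)
WELL_FORMED = [
    ((0x00, 0x7F), None, 0),
    ((0xC2, 0xDF), (0x80, 0xBF), 1),
    ((0xE0, 0xE0), (0xA0, 0xBF), 2),
    ((0xE1, 0xEC), (0x80, 0xBF), 2),
    ((0xED, 0xED), (0x80, 0x9F), 2),
    ((0xEE, 0xEF), (0x80, 0xBF), 2),
    ((0xF0, 0xF0), (0x90, 0xBF), 3),
    ((0xF1, 0xF3), (0x80, 0xBF), 3),
    ((0xF4, 0xF4), (0x80, 0x8F), 3),
]


def is_well_formed(data: bytes) -> bool:
    """Check data against the well-formed byte sequence table."""
    i = 0
    while i < len(data):
        lead = data[i]
        for (lo, hi), second, trailing in WELL_FORMED:
            if lo <= lead <= hi:
                break
        else:
            # No row matched: C0, C1, F5..FF, or a stray continuation byte.
            return False
        tail = data[i + 1 : i + 1 + trailing]
        if len(tail) != trailing:
            return False  # sequence is cut off at the end of the data
        for k, b in enumerate(tail):
            # Only the 2nd byte has a row-specific range.
            lo2, hi2 = second if k == 0 else (0x80, 0xBF)
            if not lo2 <= b <= hi2:
                return False
        i += 1 + trailing
    return True


print(is_well_formed(bytes.fromhex("616161e6beb3")))  # True
print(is_well_formed(b"\xc0\x80"))                    # False
print(is_well_formed(b"\xed\xa0\x80"))                # False
```

The second call fails because C0 is never a legal lead byte, and the third because ED A0 80 would encode U+D800, a surrogate.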

These tables are lifted from the Unicode standard version 5.1.

In the question, the material from offset 0x0010 .. 0x008F yields:

0x61           = U+0061
0x61           = U+0061
0x61           = U+0061
0xE6 0xBE 0xB3 = U+6FB3
0xE5 0xA4 0xA7 = U+5927
0xE5 0x88 0xA9 = U+5229
0xE4 0xBA 0x9A = U+4E9A
0xE4 0xB8 0xAD = U+4E2D
0xE6 0x96 0x87 = U+6587
0xE8 0xAE 0xBA = U+8BBA
0xE5 0x9D 0x9B = U+575B
0x2C           = U+002C
0xE6 0xBE 0xB3 = U+6FB3
0xE6 0xB4 0xB2 = U+6D32
0xE8 0xAE 0xBA = U+8BBA
0xE5 0x9D 0x9B = U+575B
0x2C           = U+002C
0xE6 0xBE 0xB3 = U+6FB3
0xE6 0xB4 0xB2 = U+6D32
0xE6 0x96 0xB0 = U+65B0
0xE9 0x97 0xBB = U+95FB
0x2C           = U+002C
0xE6 0xBE 0xB3 = U+6FB3
0xE6 0xB4 0xB2 = U+6D32
0xE4 0xB8 0xAD = U+4E2D
0xE6 0x96 0x87 = U+6587
0xE7 0xBD 0x91 = U+7F51
0xE7 0xAB 0x99 = U+7AD9
0x2C           = U+002C
0xE6 0xBE 0xB3 = U+6FB3
0xE5 0xA4 0xA7 = U+5927
0xE5 0x88 0xA9 = U+5229
0xE4 0xBA 0x9A = U+4E9A
0xE6 0x9C 0x80 = U+6700
0xE5 0xA4 0xA7 = U+5927
0xE7 0x9A 0x84 = U+7684
0xE5 0x8D 0x8E = U+534E
0x2D           = U+002D
0x29           = U+0029
0xE5 0xA5 0xA5 = U+5965
0xE5 0xB0 0xBA = U+5C3A
0xE7 0xBD 0x91 = U+7F51
0x26           = U+0026
0x6C           = U+006C
0x74           = U+0074
0x3B           = U+003B
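If you want to cross-check a listing like this yourself, Python's built-in UTF-8 decoder is enough. The snippet below is only a quick sketch that uses the first seven sequences from the listing (the three a's and the first four Chinese characters); the variable names are arbitrary.

```python
# First 15 bytes of the listing: three 1-byte and four 3-byte sequences.
raw = bytes.fromhex("61 61 61 E6 BE B3 E5 A4 A7 E5 88 A9 E4 BA 9A")

# bytes.decode raises UnicodeDecodeError if the input is not valid UTF-8.
text = raw.decode("utf-8")

# Format each decoded character as a U+XXXX code point.
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)
# ['U+0061', 'U+0061', 'U+0061', 'U+6FB3', 'U+5927', 'U+5229', 'U+4E9A']
```

Fifteen bytes decode to seven characters, which matches the listing: each a is one byte and each Chinese character here is three.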
