I’m really confused about UTF in Unicode.
there is UTF-8, UTF-16 and UTF-32.
my question is :
-
what UTF that are support all Unicode blocks ?
-
What is the best UTF(performance, size, etc), and why ?
-
What is different between these three UTF ?
-
what is endianness and byte order marks (BOM) ?
Thanks
All UTF encodings support all Unicode blocks – there is no UTF encoding that can’t represent any Unicode codepoint. However, some non-UTF, older encodings, such as UCS-2 (which is like UTF-16, but lacks surrogate pairs, and thus lacks the ability to encode codepoints above 65535/U+FFFF), may not.
For textual data that is mostly English and/or just ASCII, UTF-8 is by far the most space-efficient. However, UTF-8 is sometimes less space-efficient than UTF-16 and UTF-32 where most of the codepoints used are high (such as large bodies of CJK text).
UTF-8 encodes each Unicode codepoint from one to four bytes. The Unicode values 0 to 127, which are the same as they are in ASCII, are encoded like they are in ASCII. Bytes with values 128 to 255 are used for multi-byte codepoints.
UTF-16 encodes each Unicode codepoint in either two bytes (one UTF-16 value) or four bytes (two UTF-16 values). Anything in the Basic Multilingual Plane (Unicode codepoints 0 to 65535, or U+0000 to U+FFFF) are encoded with one UTF-16 value. Codepoints from higher plains use two UTF-16 values, through a technique called ‘surrogate pairs’.
UTF-32 is not a variable-length encoding for Unicode; all Unicode codepoint values are encoded as-is. This means that
U+10FFFFis encoded as0x0010FFFF.Endianness is how a piece of data, particular CPU architecture or protocol orders values of multi-byte data types. Little-endian systems (such as x86-32 and x86-64 CPUs) put the least-significant byte first, and big-endian systems (such as ARM, PowerPC and many networking protocols) put the most-significant byte first.
In a little-endian encoding or system, the 32-bit value
0x12345678is stored or transmitted as0x78 0x56 0x34 0x12. In a big-endian encoding or system, it is stored or transmitted as0x12 0x34 0x56 0x78.A byte order mark is used in UTF-16 and UTF-32 to signal which endianness the text is to be interpreted as. Unicode does this in a clever way — U+FEFF is a valid codepoint, used for the byte order mark, while U+FFFE is not. Therefore, if a file starts with
0xFF 0xFE, it can be assumed that the rest of the file is stored in a little-endian byte ordering.A byte order mark in UTF-8 is technically possible, but is meaningless in the context of endianness for obvious reasons. However, a stream that begins with the UTF-8 encoded BOM almost certainly implies that it is UTF-8, and thus can be used for identification because of this.
Benefits of UTF-8
Benefits of UTF-16
Benefits of UTF-32