public class UTF8 {
    public static void main(String[] args) {
        String s = "ｮ"; // U+FF6E
        System.out.println(s.getBytes().length); // number of bytes in the platform's default charset
        System.out.println(s.charAt(0)); // first char in the string
    }
}
output:
3
ｮ
Please help me understand this. I'm trying to understand how UTF-8 encoding works in Java.
As per the Java doc definition of char:
char: The char data type is a single 16-bit Unicode character.
Does that mean the char type in Java can only represent Unicode characters that fit in 2 bytes, and nothing larger?
In the program above, the number of bytes reported for the string is 3, yet the charAt(0) call returns a char (2 bytes in Java) that apparently holds a character 3 bytes long. How is that possible?
I'm really confused here.
Any good references on this concept, for Java or in general, would be really appreciated.
Nothing in your code example is directly using UTF-8. Java strings are encoded in memory using UTF-16 instead. Unicode codepoints that do not fit in a single 16-bit char will be encoded using a 2-char pair known as a surrogate pair.
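To see the surrogate-pair behavior for yourself, here is a minimal sketch. The class name SurrogateDemo and the helper describe are made up for illustration, and U+1F600 is just one arbitrary example of a code point that does not fit in 16 bits.

```java
class SurrogateDemo {
    // Hypothetical helper: reports how many 16-bit chars and how many
    // Unicode code points a string contains.
    static String describe(String s) {
        return "chars=" + s.length()
                + " codePoints=" + s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\uFF6E";          // U+FF6E fits in a single char
        String astral = "\uD83D\uDE00"; // U+1F600 needs a surrogate pair

        System.out.println(describe(bmp));    // chars=1 codePoints=1
        System.out.println(describe(astral)); // chars=2 codePoints=1

        // charAt() returns the individual UTF-16 code units, not the code point.
        System.out.println(Character.isHighSurrogate(astral.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(astral.charAt(1)));  // true

        // codePointAt() combines the pair back into the full code point.
        System.out.println(Integer.toHexString(astral.codePointAt(0)));  // 1f600
    }
}
```

So a char can only hold one 16-bit code unit, but a String can still represent any Unicode code point by using two chars where needed.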
If you do not pass a parameter value to String.getBytes(), it returns a byte array that has the String contents encoded using the platform's default charset. If you want to ensure a UTF-8 encoded array, then you need to use getBytes("UTF-8") or getBytes(StandardCharsets.UTF_8) instead.
Calling String.charAt() returns an original UTF-16 encoded char from the String's in-memory storage only.
So in your example, the Unicode character ｮ (U+FF6E) is stored in the String's in-memory storage using two bytes that are UTF-16 encoded (0x6E 0xFF or 0xFF 0x6E depending on endian), but is stored in the byte array from getBytes() using three bytes that are encoded using whatever the default charset is. In UTF-8, that particular Unicode character happens to use 3 bytes as well (0xEF 0xBD 0xAE).
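As a rough illustration of the difference between the in-memory char and the encoded bytes, the sketch below encodes the same one-char string with explicitly named charsets. EncodingDemo and hex are hypothetical names chosen for this example, and it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Hypothetical helper: formats a byte array as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\uFF6E"; // the character from the question

        // What getBytes() with no argument would use; varies by platform and JVM settings.
        System.out.println(Charset.defaultCharset());

        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // EF BD AE
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // FF 6E
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 6E FF

        // The char itself is just the 16-bit UTF-16 code unit.
        System.out.println(Integer.toHexString(s.charAt(0)));           // ff6e
    }
}
```

Passing the charset explicitly makes the byte count independent of the machine the code runs on, which is usually what you want when writing to files or sockets.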