I expected the byte data below to differ, but the two arrays are in fact the same. According to Wikipedia (http://en.wikipedia.org/wiki/UTF-8#Examples), the encodings should look different at the byte level, so why does Java print them as the same?
String a = "€";
byte[] utf16 = a.getBytes(); //Java default UTF-16
byte[] utf8 = null;
try {
    utf8 = a.getBytes("UTF-8");
} catch (UnsupportedEncodingException e) {
    throw new RuntimeException(e);
}
for (int i = 0; i < utf16.length; i++) {
    System.out.println("utf16 = " + utf16[i]);
}
for (int i = 0; i < utf8.length; i++) {
    System.out.println("utf8 = " + utf8[i]);
}
Although Java holds characters internally as UTF-16, calling String.getBytes() with no argument converts each character using the default platform encoding, which on Windows will likely be something like windows-1252. The results I'm getting are identical for both arrays, which indicates that the default encoding on my system is UTF-8.
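To see the difference the question was expecting, the charset has to be named explicitly. The following is a minimal sketch, not part of the original code: the class name EncodingDemo and the helper encode are made up for illustration, and it assumes Java 7 or later for StandardCharsets. It prints the default charset alongside the UTF-8 and UTF-16 encodings of the euro sign.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names here are illustrative only.
public class EncodingDemo {

    // Encode a string with an explicitly named charset.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    public static void main(String[] args) {
        // Euro sign written as a Unicode escape, so the source file encoding does not matter.
        String a = "\u20AC";

        // Platform dependent: this is the charset getBytes() uses when given no argument.
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("default  = " + Arrays.toString(a.getBytes()));

        // Explicit charsets give the same bytes on every platform.
        System.out.println("UTF-8    = " + Arrays.toString(encode(a, StandardCharsets.UTF_8)));    // [-30, -126, -84]
        System.out.println("UTF-16BE = " + Arrays.toString(encode(a, StandardCharsets.UTF_16BE))); // [32, -84]
        System.out.println("UTF-16   = " + Arrays.toString(encode(a, StandardCharsets.UTF_16)));   // [-2, -1, 32, -84]
    }
}
```

The values are negative because Java's byte is signed: 0xE2 prints as -30, and so on. The plain "UTF-16" charset also prepends a byte order mark (FE FF) when encoding, which is why it yields four bytes rather than two.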
Also note that the documentation for String.getBytes() has this comment:
"The behavior of this method when this string cannot be encoded in the default charset is unspecified."
Generally, though, you'll avoid confusion if you always specify an encoding, as you do with
a.getBytes("UTF-8")
Another thing that can cause confusion is including Unicode characters directly in your source file:
String a = "€";
That euro symbol has to be encoded to be stored as one or more bytes in a file. When Java compiles your program, it sees those bytes and decodes them back into the euro symbol. You hope. You have to be sure that the software that saves the euro symbol into the file (Notepad, Eclipse, etc.) encodes it the same way Java expects when it reads it back in. UTF-8 is becoming more popular, but it is not universal, and many editors will not write files in UTF-8.
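The source file mismatch can be simulated without involving an editor at all. The sketch below is an illustration under stated assumptions: the class SourceEncodingDemo and the helper saveThenRead are hypothetical names, and ISO-8859-1 stands in for whatever single-byte encoding the reader wrongly assumes. It encodes the euro sign as UTF-8 and then decodes the bytes with a different charset, which is roughly what happens when the compiler reads a file in an encoding other than the one the editor wrote.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
public class SourceEncodingDemo {

    // Simulates an editor saving text in one charset and a reader decoding it in another.
    static String saveThenRead(String text, Charset writtenAs, Charset readAs) {
        byte[] onDisk = text.getBytes(writtenAs);
        return new String(onDisk, readAs);
    }

    public static void main(String[] args) {
        // A Unicode escape is plain ASCII, so it survives any source file encoding.
        String euro = "\u20AC";

        // Writer and reader agree: the character round-trips intact.
        String ok = saveThenRead(euro, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(ok.equals(euro)); // true

        // Writer used UTF-8, reader assumed ISO-8859-1: one character becomes three.
        String garbled = saveThenRead(euro, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // 3
    }
}
```

Two common ways to sidestep the problem are writing the character as the escape \u20AC, as above, or telling the compiler which encoding the file uses, for example javac -encoding UTF-8.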