I looked up how Java represents a String internally, and found two sources that both look reliable but contradict each other.
One is:
http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451
and it says:
Java uses UTF-16 for the internal text representation and supports a non-standard modification of UTF-8 for string serialization.
The other is:
and it says:
Tcl also uses the same modified UTF-8[25] as Java for internal representation of Unicode data, but uses strict CESU-8 for external data.
Modified UTF-8? Or UTF-16? Which one is correct? And how many bytes does Java use for a char in memory?
Please let me know which one is correct and how many bytes it uses.
The representation for String, StringBuilder, and related classes in Java is UTF-16.
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/overview.html
At the JVM level, if you are using -XX:+UseCompressedStrings (which is the default for some updates of Java 6), the actual in-memory representation can be 8-bit ISO-8859-1, but only for strings which do not need UTF-16 encoding.
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
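Serialized Strings use modified UTF-8 (for example via DataOutputStream.writeUTF and Java object serialization). This is the non-standard variant of UTF-8 that your first source mentions; it is a serialization format, not the in-memory representation.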
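To see how the serialized form differs from standard UTF-8, here is a small sketch that compares the two encodings. The class and method names (ModifiedUtf8Demo, modifiedUtf8Length, standardUtf8Length) are made up for illustration; the only library calls are DataOutputStream.writeUTF and String.getBytes. The differences it shows are the ones documented for modified UTF-8: U+0000 is written as two bytes, and a supplementary character is written as two 3-byte surrogates rather than one 4-byte sequence.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo {
    // Number of payload bytes written by DataOutputStream.writeUTF,
    // not counting the 2-byte length prefix that writeUTF puts first.
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size() - 2;
    }

    // Number of bytes in the standard UTF-8 encoding of the same string.
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String nul = "\u0000";          // the NUL character
        String emoji = "\uD83D\uDE00";  // U+1F600, outside the BMP

        System.out.println(modifiedUtf8Length(nul) + " vs " + standardUtf8Length(nul));     // 2 vs 1
        System.out.println(modifiedUtf8Length(emoji) + " vs " + standardUtf8Length(emoji)); // 6 vs 4
        System.out.println(modifiedUtf8Length("abc") + " vs " + standardUtf8Length("abc")); // 3 vs 3
    }
}
```

For plain ASCII text the two encodings are byte-for-byte identical, which is why the difference is easy to miss.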
A char is always two bytes, if you ignore the need for padding in an Object.
Note: a code point (which allows characters > 65535) can use one or two chars, i.e. 2 or 4 bytes.
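A quick way to check the char versus code point distinction yourself is the sketch below. CharSizeDemo and utf16Bytes are illustrative names, not anything from the JDK; the code assumes Java 8 or later for Character.BYTES.

```java
import java.nio.charset.StandardCharsets;

class CharSizeDemo {
    // Size of the string when encoded as UTF-16 (big-endian, no byte-order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String bmp = "A";                                            // U+0041, fits in one char
        String supplementary = new String(Character.toChars(0x1F600)); // needs a surrogate pair

        System.out.println(Character.BYTES);                         // 2
        System.out.println(bmp.length() + " char(s), " + utf16Bytes(bmp) + " bytes");                     // 1 char(s), 2 bytes
        System.out.println(supplementary.length() + " char(s), " + utf16Bytes(supplementary) + " bytes"); // 2 char(s), 4 bytes
        // One code point even though length() reports two chars.
        System.out.println(supplementary.codePointCount(0, supplementary.length()));                     // 1
    }
}
```

Note that length() counts chars (UTF-16 code units), so use codePointCount when you need the number of actual Unicode characters.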