In the Java and C# implementations of String, is the underlying information a null-terminated char array like in C/C++?
(In addition to other information like size, etc.)
No. It is a sequence of UTF-16 code units and a length. Java and C# strings can contain embedded NULs.
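A quick way to check the embedded-NUL claim on the Java side is a minimal sketch like the following. The class and method names (`EmbeddedNulDemo`, `sample`) are made up for illustration; only `length`, `charAt` and `indexOf` are real `String` methods.

```java
final class EmbeddedNulDemo {
    /** The three-unit string used in this answer: newline, NUL, newline. */
    static String sample() {
        return "\n\0\n";
    }

    public static void main(String[] args) {
        String s = sample();
        // The NUL is an ordinary code unit; it does not end the string.
        System.out.println(s.length());        // 3
        System.out.println((int) s.charAt(1)); // 0
        System.out.println(s.indexOf('\0'));   // 1
    }
}
```

If the string were NUL-terminated as in C, the length would come out as 1 rather than 3.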
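Each UTF-16 code unit occupies two bytes, so you can think of the string `"\n\0\n"` as a `length` of 3 plus a `bytes` array holding the three code units, high byte first.

Note that the last byte in `bytes` is not 0. The `length` field tells how many of the bytes are used. This lets an implementation make `substring` very efficient: reuse the same byte array, but with a different length (and offset if your VM implementation can't point into an array). Older HotSpot releases shared the array this way; since Java 7u6, `substring` copies the characters instead.

From the javadoc for `String`: a `String` represents a string in the UTF-16 format, with supplementary characters represented by surrogate pairs, and index values refer to `char` code units.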
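To make the length-plus-code-units picture concrete, here is a rough sketch. `Utf16LayoutDemo` and its nested `Slice` type are hypothetical names invented for this example, not anything from the JDK; `Slice` only imitates the array-sharing idea and is not how any particular VM lays out `String`. The byte dump uses the real `String.getBytes(StandardCharsets.UTF_16BE)`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf16LayoutDemo {
    /** Hypothetical string view: a shared code-unit array plus offset and length. */
    static final class Slice {
        final char[] units;
        final int offset;
        final int length;

        Slice(char[] units, int offset, int length) {
            this.units = units;
            this.offset = offset;
            this.length = length;
        }

        /** Returns a view of [begin, end) that shares the backing array; nothing is copied. */
        Slice substring(int begin, int end) {
            if (begin < 0 || end > length || begin > end) {
                throw new IndexOutOfBoundsException();
            }
            return new Slice(units, offset + begin, end - begin);
        }

        @Override
        public String toString() {
            return new String(units, offset, length);
        }
    }

    /** Encodes the code units of s as bytes, high byte first, with no byte-order mark. */
    static byte[] utf16be(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // Six bytes for three code units; the last byte is 10, not 0.
        System.out.println(Arrays.toString(utf16be("\n\0\n"))); // [0, 10, 0, 0, 0, 10]

        Slice whole = new Slice("hello\0world".toCharArray(), 0, 11);
        Slice tail = whole.substring(6, 11);
        System.out.println(tail);                      // world
        System.out.println(tail.units == whole.units); // true
    }
}
```

The trade-off of sharing is that a small slice keeps the whole backing array alive, which is one reason an implementation might prefer to copy.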
C# `System.String` is defined similarly. I'm not sure whether C# guards against orphaned surrogates, but its documentation seems to mix the terms "scalar value" and "codepoint", which is confusing. A scalar value is defined by unicode.org as any Unicode code point except high-surrogate and low-surrogate code points.

Java definitely takes the codepoint view, and does not attempt to guard against invalid scalar values in strings.
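The Java half of that statement is easy to try out. The sketch below is illustrative only: `LoneSurrogateDemo` and `hasUnpairedSurrogate` are names invented here, while the `Character` and `String` calls are standard library methods. It builds a string containing a lone high surrogate, which Java accepts without complaint, and compares it with a properly paired supplementary character.

```java
final class LoneSurrogateDemo {
    /** Builds a one-unit string holding an unpaired high surrogate (U+D800). */
    static String loneHighSurrogate() {
        return String.valueOf((char) 0xD800);
    }

    /** Returns true if s contains a surrogate code unit that is not part of a valid pair. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String lone = loneHighSurrogate();
        System.out.println(lone.length());                            // 1
        System.out.println(Integer.toHexString(lone.codePointAt(0))); // d800
        System.out.println(hasUnpairedSurrogate(lone));               // true

        String paired = new String(Character.toChars(0x1F600));
        System.out.println(paired.length());                           // 2
        System.out.println(paired.codePointCount(0, paired.length())); // 1
        System.out.println(hasUnpairedSurrogate(paired));              // false
    }
}
```

Note that `codePointAt` simply hands back the surrogate value itself for the lone unit, which is not a Unicode scalar value.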
"Strings Immutability and Persistence" explains the efficiency benefits of this representation.
EDIT:
The above is true conceptually and in practice, but VMs and CLRs have freedom to do things differently in certain situations.
The Java Virtual Machine specification mandates that strings are laid out a certain way in `.class` files, and its JNI `jstring` type abstracts away in-memory representation details. So a VM could, in theory, represent a string in memory as a NUL-terminated UTF-8 string, with a two-byte form used for embedded NUL characters, instead of the `int32 length` and `uint16[] bytes` representation that allows for efficient random access to code units.

VMs don't do this in practice though. "The Most Expensive One-byte Mistake" argues that NUL-terminated strings were a huge mistake in C, so I doubt VMs will adopt them internally for efficiency reasons.
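The two-byte form for embedded NULs can be observed from plain Java, since `DataOutputStream.writeUTF` is documented to write the same "modified UTF-8" encoding. The following is only a sketch of that encoding, not of any VM's in-memory layout, and `ModifiedUtf8Demo` and `encode` are names made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class ModifiedUtf8Demo {
    /** Returns what writeUTF emits: a 2-byte big-endian byte count, then modified UTF-8. */
    static byte[] encode(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // Count prefix 0x0004, then 0x0A, 0xC0 0x80 (the NUL), 0x0A, shown as signed bytes.
        System.out.println(Arrays.toString(encode("\n\0\n"))); // [0, 4, 10, -64, -128, 10]
    }
}
```

Because the NUL becomes `0xC0 0x80`, the encoded payload never contains a zero byte, which is what would make a NUL-terminated form possible in principle.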