In the Java and C# implementation of String , is the underlying information a

Question

0

Asked: May 25, 20262026-05-25T12:25:47+00:00 2026-05-25T12:25:47+00:00

In the Java and C# implementation of String , is the underlying information a

0

In the Java and C# implementation of String, is the underlying information a null-terminated char array like in C/C++?

(In addition to other information like size, etc.)

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-25T12:25:48+00:00

No. It is a sequence of UTF-16 code units and a length. Java and C# strings can contain embedded NULs.

Each UTF-16 code-unit occupies two bytes, so you can think of the string "\n\0\n" as:

{
  length: 3,  // 3 pairs of bytes == 3 UTF-16 code units
  bytes:  [0, 10, // \n
           0, 0,  // \0
           0, 10] // \n
}

Note that the last byte in bytes is not 0. The length field tells how many of the bytes are used. This allows substring to be very efficient — reuse the same byte array, but with a different length (and offset if your VM implementation can’t point into an array).

UTF-16 (16-bit Unicode Transformation Format) is a character encoding for Unicode capable of encoding 1,112,064 numbers (called code points) in the Unicode code space from 0 to 0x10FFFF. It produces a variable-length result of either one or two 16-bit code units per code point.

From javadoc

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.

C# System.String is defined similarly

Each Unicode character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char. The resulting collection of Char objects constitutes the String.

I’m not sure whether C# guards against orphaned surrogates, but the above text seems to mix the terms "scalar value" and "codepoint" which is confusing. A scalar value is defined thus by unicode.org:

Any Unicode code point except high-surrogate and low-surrogate code points

Java definitely takes the codepoint view, and does not attempt to guard against invalid scalar values in strings.

"Strings Immutability and Persistence" explains the efficiency benefits of this representation.

One of the benefits of the immutable data types I’ve talked about here previously is that they are not just immutable, they are also "persistent". By "persistent", I mean an immutable data type such that common operations on that type (like adding a new item to a queue, or removing an item from a tree) can re-use most or all of the memory of an existing data structure. Since it is all immutable, you can re-use its parts without worrying about them changing on you.

EDIT:
The above is true conceptually and in practice, but VMs and CLRs have freedom to do things differently in certain situations.

The Java language specification mandates that strings are laid out a certain way in .class files, and its JNI jstring type abstracts away in-memory representation details so a VM could, in theory, represent a string in memory as a NUL-terminated UTF-8 string with a two-byte form used for embedded NUL characters instead of the int32 length and uint16[] bytes representation that allows for efficient random access to code-units.

VMs don’t do this in practice though. "The Most Expensive One-byte Mistake" argues that NUL-terminated strings were a huge mistake in C, so I doubt VMs will adopt them internally for efficiency reasons.

The best candidate I have been able to come up with is the C/Unix/Posix use of NUL-terminated text strings. The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end?

…

Thinking a bit about virtual memory systems settles that question for us. Optimizing the movement of a known-length string of bytes can take advantage of the full width of memory buses and cache lines, without ever touching a memory location that is not part of the source or destination string.

One example is FreeBSD’s libc, where the bcopy(3)/memcpy(3) implementation will move as much data as possible in chunks of "unsigned long," typically 32 or 64 bits, and then "mop up any trailing bytes" as the comment describes it, with byte-wide operations.2

If the source string is NUL terminated, however, attempting to access it in units larger than bytes risks attempting to read characters after the NUL. If the NUL character is the last byte of a [virtual memory] page and the next [virtual memory] page is not defined, this would cause the process to die from an unwarranted "page not present" fault.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

In the Java and C# implementation of String , is the underlying information a

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply