I’m having a difficult time understanding the rationale behind the semantics of the Java String(byte[]) constructors (Java 6). The length of the resulting String object is usually wrong. Perhaps someone here can explain why this makes any sense.
Consider the following small Java program:
import java.nio.charset.Charset;

public class Test {
    public static void main(String[] args) {
        String abc1 = new String("abc");
        byte[] bytes = new byte[32];
        bytes[0] = 0x61; // 'a'
        bytes[1] = 0x62; // 'b'
        bytes[2] = 0x63; // 'c'
        bytes[3] = 0x00; // NUL
        String abc2 = new String(bytes, Charset.forName("US-ASCII"));
        System.out.println("abc1: \"" + abc1 + "\" length: " + abc1.length());
        System.out.println("abc2: \"" + abc2 + "\" length: " + abc2.length());
        System.out.println("\"" + abc1 + "\" " +
            (abc1.equals(abc2) ? "==" : "!=") + " \"" + abc2 + "\"");
    }
}
The output of this program is:
abc1: "abc" length: 3
abc2: "abc" length: 32
"abc" != "abc"
The documentation for the String byte[] constructor states, “The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array.” Precisely true indeed, and in the US-ASCII character set, the length of the string “abc” is 3, and not 32.
Strangely, even though abc2 contains no whitespace characters, abc2.trim() returns the same string but with the length adjusted to the correct value of 3, and abc1.equals(abc2.trim()) returns true… Am I missing something obvious?
Yes, I realize I can pass an explicit length to the constructor; I’m just trying to understand the default semantics.
In Java, strings are not null-delimited. The string constructed from the byte array uses the entire length of the array. Since 0x00 converts one-to-one to the character '\0', the resulting string has the same length as the entire array: 32. When it is printed to System.out, the null characters typically display with zero width, so it looks like “abc”, but it is really “abc\0\0\0…” (“abc” followed by 29 null characters, 32 characters in total).
The reason trim() fixes this is that it treats '\0' as white space: it strips every leading and trailing character whose code is at or below '\u0020' (space).
Note that if you want to convert a null-delimited byte representation of a string to a String, you will need to find the index at which to stop. Then (as @Brian notes in his comment), you can use a different String constructor, one that takes an offset and a length along with the byte array.
However, this must be done with caution. You are using the US-ASCII character set, where the index of the first zero byte is probably a natural stopping place. However, in many character sets (such as UTF-16), zero bytes can occur as a normal part of the actual text.
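As a rough illustration of the points above, here is a minimal sketch that decodes only the bytes before the first zero byte. The class name NulTerminatedDemo and the helper decodeNulTerminated are hypothetical names made up for this example; only the String(byte[], int, int, Charset) constructor and trim() come from the JDK. It assumes a single-byte encoding such as US-ASCII, where a zero byte cannot be part of another character.

```java
import java.nio.charset.Charset;

public class NulTerminatedDemo {
    // Hypothetical helper (not part of the JDK): decodes the bytes that precede
    // the first zero byte, or the whole array if it contains no zero byte.
    // Assumption: the charset is one where a zero byte never occurs inside
    // ordinary text (true for US-ASCII, not true for UTF-16).
    public static String decodeNulTerminated(byte[] bytes, Charset charset) {
        int end = 0;
        while (end < bytes.length && bytes[end] != 0) {
            end++;
        }
        // Offset 0, length 'end': only the bytes before the terminator are decoded.
        return new String(bytes, 0, end, charset);
    }

    public static void main(String[] args) {
        Charset ascii = Charset.forName("US-ASCII");
        byte[] bytes = new byte[32]; // all elements start out as 0
        bytes[0] = 0x61; // 'a'
        bytes[1] = 0x62; // 'b'
        bytes[2] = 0x63; // 'c'

        // Decoding the whole array keeps every zero byte as a '\0' character.
        String whole = new String(bytes, ascii);
        System.out.println(whole.length());          // 32
        System.out.println(whole.charAt(3) == '\0'); // true
        System.out.println(whole.trim().length());   // 3

        // Stopping at the first zero byte yields just the text.
        String text = decodeNulTerminated(bytes, ascii);
        System.out.println(text.length());           // 3
        System.out.println("abc".equals(text));      // true
    }
}
```

Scanning for the terminator yourself, rather than relying on trim(), avoids also stripping leading or trailing spaces and control characters that may be a legitimate part of the text.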