Hi all, the code is as follows:
File file2 = new File("D://deploy//body.txt");
byte[] bytes = loadFile(file2);
System.out.println(bytes.length);
StringBuffer buffer = new StringBuffer();
InputStream inputStream = new ByteArrayInputStream(bytes);
InputStreamReader reader = new InputStreamReader(inputStream,"CP1252");
Reader in = new BufferedReader(reader);
int ch;
while ((ch = in.read()) > -1) {
    buffer.append((char) ch);
}
in.close();
System.out.println(buffer.toString().getBytes().length);
The two lengths printed are 1576 (the original byte array) and 2439 (the bytes obtained from the string). What is the proper way to convert a CP1252 byte array to a string and retain the original size? Thanks.
I noticed your phrase – “proper string”, and would like to point out that there is no such thing as a proper or improper string in your case. It’s the encoding that is either proper or improper.
You’re reading a sequence of cp1252 bytes and appending the decoded characters to a buffer. If the original file is in cp1252, there is no problem with this process. Under the hood, the InputStreamReader employs a CharsetDecoder that decodes the stream’s charset into a sequence of sixteen-bit Unicode characters (UTF-16). This happens because you are reading characters, not bytes, from the byte stream.
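To illustrate the decoding step, here is a minimal sketch. The class name Cp1252Decode, both method names and the three sample bytes are made up for the example; it also assumes your JVM provides the windows-1252 charset (the canonical name for cp1252). It decodes the same bytes once through an InputStreamReader and once with the String constructor, and both give the same UTF-16 string with one char per input byte.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class Cp1252Decode {
    // Assumption: the running JVM ships the windows-1252 (cp1252) charset.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode via a Reader, as in the question: each read() returns one UTF-16 char.
    static String decodeWithReader(byte[] bytes) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(bytes), CP1252)) {
            int ch;
            while ((ch = in.read()) != -1) {
                sb.append((char) ch);
            }
        }
        return sb.toString();
    }

    // Shorter equivalent when the whole byte array is already in memory.
    static String decodeDirect(byte[] bytes) {
        return new String(bytes, CP1252);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample: 'a', 0x80 (euro sign in cp1252), 0xE9 (e with acute).
        byte[] bytes = { 'a', (byte) 0x80, (byte) 0xE9 };
        String viaReader = decodeWithReader(bytes);
        System.out.println(viaReader.length());                      // number of chars
        System.out.println(viaReader.equals(decodeDirect(bytes)));   // same result both ways
        System.out.println(Integer.toHexString(viaReader.charAt(1))); // code unit for 0x80
    }
}
```

Note that the single byte 0x80 becomes the character U+20AC, so the string is already in a different representation from the file; its length in chars matches the byte count only because cp1252 is a single-byte encoding.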
As pointed out by bmargulies, when you execute
buffer.toString().getBytes()
you are transforming this sequence of UTF-16 characters into a byte sequence in the platform’s default encoding. Since that default is not cp1252 in your case, the lengths of the original byte array and the transformed one are not comparable. Specifying the charset to the getBytes() method causes a StringEncoder (this is an internal class with the Oracle/Sun JVM; other implementations might use a different class) to be used, to transform the UTF-16 character sequence to the sequence of bytes in the desired encoding (cp1252).
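A small sketch of the difference, under the same assumptions as before: the windows-1252 charset is available, and the class name Cp1252RoundTrip, the helper encodedLength and the sample bytes are invented for illustration. It decodes cp1252 bytes and then encodes the string back with an explicitly named charset, so the result does not depend on the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252RoundTrip {
    // Assumption: the running JVM ships the windows-1252 (cp1252) charset.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode cp1252 bytes to a String, then encode that String in the target charset
    // and report how many bytes the encoded form takes.
    static int encodedLength(byte[] cp1252Bytes, Charset target) {
        String text = new String(cp1252Bytes, CP1252);
        return text.getBytes(target).length;
    }

    public static void main(String[] args) {
        // Hypothetical sample: 'a', 0x80 (euro sign in cp1252), 0xE9 (e with acute).
        byte[] original = { 'a', (byte) 0x80, (byte) 0xE9 };

        // Encoding back to cp1252 gives one byte per character again.
        System.out.println(encodedLength(original, CP1252));                 // 3

        // Encoding to UTF-8 needs 1 + 3 + 2 bytes for the same three characters.
        System.out.println(encodedLength(original, StandardCharsets.UTF_8)); // 6
    }
}
```

In the code from the question, the equivalent change would be buffer.toString().getBytes("CP1252"), which should print the same length as the original array.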