We can convert a String to and from byte[] easily:
String s = "my string";
byte[] b = s.getBytes();
System.out.println(new String(b)); // my string
When compression is involved, however, there seem to be some issues. Suppose you have two methods, compress and uncompress (the code below works fine):
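// Imports needed by both methods of the Compressor class
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public static byte[] compress(String data)
        throws UnsupportedEncodingException, IOException {
    byte[] input = data.getBytes("UTF-8");
    Deflater df = new Deflater();
    df.setLevel(Deflater.BEST_COMPRESSION);
    df.setInput(input);
    ByteArrayOutputStream baos = new ByteArrayOutputStream(input.length);
    df.finish();
    byte[] buff = new byte[1024];
    // Drain the deflater until all compressed output has been written
    while (!df.finished()) {
        int count = df.deflate(buff);
        baos.write(buff, 0, count);
    }
    df.end(); // release the native zlib resources
    baos.close();
    byte[] output = baos.toByteArray();
    return output;
}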
public static String uncompress(byte[] input)
        throws UnsupportedEncodingException, IOException,
        DataFormatException {
    Inflater ifl = new Inflater();
    ifl.setInput(input);
    ByteArrayOutputStream baos = new ByteArrayOutputStream(input.length);
    byte[] buff = new byte[1024];
    // Drain the inflater until the end of the compressed stream
    while (!ifl.finished()) {
        int count = ifl.inflate(buff);
        baos.write(buff, 0, count);
    }
    ifl.end(); // release the native zlib resources
    baos.close();
    byte[] output = baos.toByteArray();
    // Decode with the same charset that compress used to encode
    return new String(output, "UTF-8");
}
My tests are as follows (they work fine):
String text = "some text";
byte[] bytes = Compressor.compress(text);
assertEquals(Compressor.uncompress(bytes), text); // works
For no reason other than "why not", I'd like to modify the first method to return a String instead of the byte[].
So I return new String(output) from the compress method and modify my tests to:
String text = "some text";
String compressedText = Compressor.compress(text);
assertEquals(Compressor.uncompress(compressedText.getBytes()), text); // fails
This test fails with java.util.zip.DataFormatException: incorrect header check
Why is that? What needs to be done to make it work?
The String(byte[]) constructor is the problem. You cannot simply take arbitrary bytes, convert them to a String and then back to a byte array. The String class decodes these bytes based on the desired charset. If a given byte sequence can't be represented, e.g. in Unicode, it will be discarded or converted to something else. The conversion from bytes to String and back to bytes is lossless only if these bytes really represented some String (in some encoding). Here is the simplest example:
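A minimal sketch of such a round trip, assuming UTF-8 as the charset. The sample byte -128 and the names LossyRoundTrip and roundTrip are assumptions chosen for illustration; any byte that is not a valid UTF-8 sequence on its own should behave the same way.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyRoundTrip {
    // Decode bytes as UTF-8 into a String, then encode that String back to bytes.
    static byte[] roundTrip(byte[] input) {
        return new String(input, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // -128 (0x80) is an assumed sample byte; alone it is not valid UTF-8,
        // so decoding replaces it with U+FFFD.
        System.out.println(Arrays.toString(roundTrip(new byte[] {-128}))); // [-17, -65, -67]
        // 127 (0x7F) is a valid single-byte UTF-8 sequence and survives unchanged.
        System.out.println(Arrays.toString(roundTrip(new byte[] {127})));  // [127]
    }
}
```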
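With UTF-8, round-tripping a single byte that is not a valid sequence on its own returns -17, -65, -67 (the encoding of the replacement character U+FFFD), while the input 127 returns the exact same output.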
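As for making the String-returning version work, one possible approach (not the only one) is to wrap the compressed bytes in a text-safe encoding such as Base64 instead of calling new String(output). The sketch below assumes Java 8 or later for java.util.Base64; the class name Base64Compressor and the choice to rethrow DataFormatException as IllegalArgumentException are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class Base64Compressor {
    // Compress the text and return the compressed bytes as a Base64 String.
    public static String compress(String data) {
        byte[] input = data.getBytes(StandardCharsets.UTF_8);
        Deflater df = new Deflater(Deflater.BEST_COMPRESSION);
        df.setInput(input);
        df.finish();
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] buff = new byte[1024];
        while (!df.finished()) {
            int count = df.deflate(buff);
            baos.write(buff, 0, count);
        }
        df.end();
        // Base64 maps every byte sequence to ASCII text without loss.
        return Base64.getEncoder().encodeToString(baos.toByteArray());
    }

    // Decode the Base64 String back to the compressed bytes and inflate them.
    public static String uncompress(String compressed) {
        byte[] input = Base64.getDecoder().decode(compressed);
        Inflater ifl = new Inflater();
        try {
            ifl.setInput(input);
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            byte[] buff = new byte[1024];
            while (!ifl.finished()) {
                int count = ifl.inflate(buff);
                // Guard against looping forever on truncated input.
                if (count == 0 && ifl.needsInput()) {
                    throw new IllegalArgumentException("truncated compressed data");
                }
                baos.write(buff, 0, count);
            }
            return new String(baos.toByteArray(), StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid compressed data", e);
        } finally {
            ifl.end();
        }
    }

    public static void main(String[] args) {
        String text = "some text";
        String compressedText = compress(text);
        System.out.println(text.equals(uncompress(compressedText))); // true
    }
}
```

Base64 makes the String about a third larger than the raw compressed bytes. Decoding and encoding with ISO-8859-1 on both sides is another option that maps each byte to exactly one char, though the resulting String may contain control characters.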