I ran a simple test, and it seems that Java's conversion between String and byte[] is not one-to-one, at least with UTF-8.
The code:
byte[] bytes1 = {-1, 127, 0, 38, 97, 104, 55, 110, 50, -24, -48, 59, -20, -6, 64, 1, 4, 107, 56, 54 };
String msg = new String( bytes1, "UTF-8" );
byte[] bytes2 = msg.getBytes( "UTF-8" );
for( byte curr : bytes1 ) {
    System.out.print( curr );
    System.out.print( ", " );
}
System.out.println();
for( byte curr : bytes2 ) {
    System.out.print( curr );
    System.out.print( ", " );
}
I expected to see two identical lines of output. Instead I got:
-1, 127, 0, 38, 97, 104, 55, 110, 50, -24, -48, 59, -20, -6, 64, 1, 4, 107, 56, 54,
-17, -65, -67, 127, 0, 38, 97, 104, 55, 110, 50, -17, -65, -67, -17, -65, -67, 59, -17, -65, -67, -17, -65, -67, 64, 1, 4, 107, 56, 54,
Why does this happen, and how can I get a one-to-one conversion? Does anybody know?
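You cannot for arbitrary bytes. Conversion between UTF-16 (the representation in a String) and UTF-8 is one-to-one only for well-formed input, and your array is not valid UTF-8: a byte such as -1 (0xFF) never occurs in UTF-8. The decoder replaces each malformed sequence with U+FFFD, which encodes back as -17, -65, -67, and that is exactly what shows up in your second line. See the Unicode standard at Unicode.org.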
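To see the failure explicitly instead of getting silent replacement, here is a minimal sketch that runs the bytes through a strict decoder. It assumes Java 7 or later (for `StandardCharsets`); the class and method names `Utf8Check` and `isValidUtf8` are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Check {
    // Hypothetical helper: true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Throw on bad input instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bytes1 = {-1, 127, 0, 38, 97, 104, 55, 110, 50, -24,
                         -48, 59, -20, -6, 64, 1, 4, 107, 56, 54};
        System.out.println(isValidUtf8(bytes1)); // false

        // The replacement character U+FFFD, encoded as UTF-8.
        byte[] replacement = "\uFFFD".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(replacement)); // [-17, -65, -67]
    }
}
```

Any array for which `isValidUtf8` returns false cannot survive a round trip through `new String(bytes, "UTF-8")`.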
It looks like what you really want is to pass “UTF-16” as the charset, which asks for a byte serialization of the String's UTF-16 code units instead of a conversion to UTF-8.
See http://docs.oracle.com/javase/6/docs/technotes/guides/intl/encoding.doc.html. If you don’t want a BOM, use one of the ‘unmarked’ variants (UTF-16BE or UTF-16LE).
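As a rough illustration of the difference between the marked and unmarked charsets, the sketch below round-trips the array from the question through both. It assumes Java 7 or later; `Utf16RoundTrip` and `roundTrip` are hypothetical names, not part of any library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf16RoundTrip {
    // Hypothetical helper: decode the bytes with a charset, then encode them again.
    static byte[] roundTrip(byte[] input, Charset charset) {
        return new String(input, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] bytes1 = {-1, 127, 0, 38, 97, 104, 55, 110, 50, -24,
                         -48, 59, -20, -6, 64, 1, 4, 107, 56, 54};

        // "UTF-16" writes a byte-order mark when encoding, so two bytes are added.
        byte[] marked = roundTrip(bytes1, StandardCharsets.UTF_16);
        System.out.println(marked.length); // 22

        // "UTF-16BE" is unmarked: no BOM, and this particular array comes back unchanged.
        byte[] unmarked = roundTrip(bytes1, StandardCharsets.UTF_16BE);
        System.out.println(Arrays.equals(bytes1, unmarked)); // true
    }
}
```

This works for the array in the question, but it is not a general guarantee: an odd number of bytes, or byte pairs that form an unpaired surrogate, are still malformed UTF-16 and get replaced. If the goal is to carry arbitrary bytes in a String, one alternative not mentioned in the original answer is ISO-8859-1, which in Java maps each of the 256 byte values to its own char.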