Can two different strings, when encoded with different encodings, have the same byte sequence?
That is, are there values for “string one” and “string two” in the example below that, when encoded using two different encodings
(Cp1252 and UTF-8 are just examples), will cause the test to pass?
    import java.io.UnsupportedEncodingException;
    import java.util.Arrays;

    import org.junit.Assert;
    import org.junit.Test;

    public class EncodingTest {

        @Test
        public void test() throws UnsupportedEncodingException {
            final byte[] sequence1 = "string one".getBytes("Cp1252");
            final byte[] sequence2 = "string two".getBytes("UTF-8");
            Assert.assertTrue(Arrays.equals(sequence1, sequence2));
        }
    }
A bug in my code hashes a byte sequence generated from a String using the JVM’s default encoding. I need to verify whether that can cause hash collisions when the code is run with different strings and different JVM file encodings (which can happen when it runs on Windows and on Linux, for example).
Since an encoding is a mapping between byte sequences and characters, I think there may be some strings and encodings that pass the above test. I just wanted to know whether there are any well-known examples, or good reasons why I shouldn’t rely on hash collisions not happening.
Thanks
PS: This is only about encodings supported by JDK 1.6, not made-up ones.
Yes. To take a simple example, the string “¡” (the inverted exclamation mark) encoded as ISO-8859-1 and the string “Ą” (capital A with ogonek) encoded as ISO-8859-2 both become the single-byte sequence A1 (hex). It is more or less obvious that such collisions happen among the very simple encodings that map each character to a single byte: they assign different characters to the same byte values; otherwise they would not be different encodings. It can happen when more complicated encoding schemes are involved, too.
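As a rough check of the above, here is a minimal sketch that reproduces the ISO-8859-1 / ISO-8859-2 collision, plus one for the Cp1252 / UTF-8 pair from the question: the two-character string U+00C3 U+00A9 in windows-1252 and the single character U+00E9 in UTF-8 both encode to the bytes C3 A9. The class name `EncodingCollisionDemo` and the helper `collides` are made up for illustration, and the sketch assumes the JVM provides the ISO-8859-2 and windows-1252 charsets (they are not among the charsets every Java platform is required to support, though common JDKs ship them). Unicode escapes are used so that the result does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class EncodingCollisionDemo {

    // Hypothetical helper: true if the two strings differ as Strings
    // but encode to identical byte sequences under the given charsets.
    static boolean collides(String a, String charsetA, String b, String charsetB) {
        byte[] bytesA = a.getBytes(Charset.forName(charsetA));
        byte[] bytesB = b.getBytes(Charset.forName(charsetB));
        return !a.equals(b) && Arrays.equals(bytesA, bytesB);
    }

    public static void main(String[] args) {
        // U+00A1 (inverted exclamation mark) in ISO-8859-1 and
        // U+0104 (capital A with ogonek) in ISO-8859-2 both encode to byte A1.
        System.out.println(collides("\u00A1", "ISO-8859-1", "\u0104", "ISO-8859-2"));

        // U+00C3 U+00A9 in windows-1252 (alias Cp1252) and
        // U+00E9 in UTF-8 both encode to bytes C3 A9.
        System.out.println(collides("\u00C3\u00A9", "windows-1252", "\u00E9", "UTF-8"));
    }
}
```

Both calls should print `true`. For the hashing bug in the question, this suggests passing an explicit charset to `getBytes` instead of relying on the platform default, so that the same string always yields the same bytes regardless of where the code runs.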