We have a database where we save byte arrays (HBase).
All our Strings are encoded as bytes, and we do the conversion manually.
However, some old data was saved incorrectly, and I wonder if there’s a way to recover it.
What happened is that we had some original text that was encoded in, let’s say, ISO_8859_1,
but the process that saved these Strings as byte arrays did something similar to new String(original_bytes, UTF8).getBytes(UTF8)
(where original_bytes holds the String encoded as ISO_8859_1).
I can’t find a way to recover the original_bytes array. Is it actually possible?
I tried to reproduce it using this simple Java sample code:
String s = "é";
System.out.println("s: " + s);
System.out.println("s.getBytes: " + Arrays.toString(s.getBytes()));
System.out.println("s.getBytes(UTF8): " + Arrays.toString(s.getBytes(Charsets.UTF_8)));
System.out.println("new String(s.getBytes()): " + new String(s.getBytes()));
System.out.println("new String(s.getBytes(), UTF-8): " + new String(s.getBytes(), Charsets.UTF_8));
byte [] iso = s.getBytes(Charsets.ISO_8859_1);
System.out.println("iso " + Arrays.toString(iso));
System.out.println("new String(iso)" + new String(iso));
System.out.println("new String(iso, ISO)" + new String(iso, Charsets.ISO_8859_1));
System.out.println("new String(iso).getBytes()" + Arrays.toString(new String(iso).getBytes()));
System.out.println("new String(iso).getBytes(ISO)" + Arrays.toString(new String(iso).getBytes(Charsets.ISO_8859_1)));
System.out.println("new String(iso, UTF8).getBytes()" + Arrays.toString(new String(iso, Charsets.UTF_8).getBytes()));
System.out.println("new String(iso, UTF8).getBytes(UTF8)" + Arrays.toString(new String(iso, Charsets.UTF_8).getBytes(Charsets.UTF_8)));
Output (on a computer whose default charset is UTF-8):
s: é
s.getBytes: [-61, -87]
s.getBytes(UTF8): [-61, -87]
new String(s.getBytes()): é
new String(s.getBytes(), UTF-8): é
iso [-23]
new String(iso)�
new String(iso, ISO)é
new String(iso).getBytes()[-17, -65, -67]
new String(iso).getBytes(ISO)[63]
new String(iso, UTF8).getBytes()[-17, -65, -67]
new String(iso, UTF8).getBytes(UTF8)[-17, -65, -67]
new String(new String(iso).getBytes(), Charsets.ISO_8859_1) �
Unfortunately no, it’s not possible in every case.
UTF-8 has quite a few byte sequences that are illegal and that will (usually) be replaced by some replacement character when decoded. When your original_bytes contained any of those byte sequences, then that information is lost for sure.
Your best bet is to do the reverse, which will probably get you as close to the original String as possible.
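To make the loss concrete, here is a minimal sketch (not necessarily what the answer above had in mind). It assumes the faulty save step was exactly new String(original_bytes, UTF8).getBytes(UTF8) on a standard JDK. The names corrupt and markLostBytes are hypothetical helpers invented for this illustration, and java.nio.charset.StandardCharsets is used in place of the Charsets class from the question. The idea: every byte that formed valid UTF-8 is stored unchanged, while each malformed sequence was stored as EF BF BD (the UTF-8 encoding of U+FFFD). So the stored array can be kept as-is apart from flagging those three-byte runs as lost.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyRoundTrip {

    // Hypothetical stand-in for the faulty save step described in the question.
    static byte[] corrupt(byte[] originalBytes) {
        return new String(originalBytes, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }

    // Best-effort recovery sketch: copy the stored bytes, but collapse every
    // EF BF BD run (U+FFFD in UTF-8) into a single placeholder byte.
    // Caveat: an original that really contained the bytes EF BF BD cannot be
    // told apart from a replaced sequence.
    static byte[] markLostBytes(byte[] stored, byte placeholder) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(stored.length);
        int i = 0;
        while (i < stored.length) {
            boolean isReplacement = i + 2 < stored.length
                    && stored[i] == (byte) 0xEF
                    && stored[i + 1] == (byte) 0xBF
                    && stored[i + 2] == (byte) 0xBD;
            if (isReplacement) {
                out.write(placeholder);
                i += 3;
            } else {
                out.write(stored[i]);
                i++;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] acute = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1); // last byte 0xE9
        byte[] grave = "caf\u00e8".getBytes(StandardCharsets.ISO_8859_1); // last byte 0xE8

        // Two different originals collapse to the same stored bytes,
        // so the save step cannot be inverted in general.
        System.out.println(Arrays.equals(corrupt(acute), corrupt(grave)));

        // The ASCII part survives; the accented byte is only flagged as lost.
        byte[] marked = markLostBytes(corrupt(acute), (byte) '?');
        System.out.println(new String(marked, StandardCharsets.ISO_8859_1));
    }
}
```

Under those assumptions, the first line printed should be true and the second caf?. The placeholder only tells you where something was lost, not what it was. Restoring the actual characters would need an outside source such as a dictionary or the upstream system.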
tl;dr: decoding non-UTF-8 data as UTF-8 is not generally a lossless operation. A conforming UTF-8 decoder will replace all malformed byte sequences with replacement characters (or even abort the decoding, depending on the decoder and its settings).
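As an illustration of the "abort the decoding" behaviour, the following sketch configures the JDK's CharsetDecoder to report malformed input instead of replacing it. The class and method names (StrictUtf8, decodeStrict, isValidUtf8) are made up for this example. A check like this could be run before saving so that non-UTF-8 input is rejected rather than silently turned into replacement characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {

    // Decodes bytes as UTF-8 and throws instead of inserting U+FFFD
    // when the input contains a malformed sequence.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Convenience wrapper: true only if the whole array is well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] iso = "\u00e9".getBytes(StandardCharsets.ISO_8859_1); // [-23]
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);     // [-61, -87]

        System.out.println(isValidUtf8(iso));  // false: a lone 0xE9 is a truncated sequence
        System.out.println(isValidUtf8(utf8)); // true
    }
}
```

Note that this only detects byte arrays that are not well-formed UTF-8. ISO-8859-1 text that happens to form valid UTF-8 (pure ASCII, for instance) passes the check, which is harmless here because such bytes also survive the faulty round trip unchanged.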