We have a database where we save byte arrays (HBase). All our Strings are

Asked: June 3, 2026

We have a database where we save byte arrays (HBase).
All our Strings are encoded as bytes, and we do the conversion manually.
However, some old data was saved incorrectly, and I wonder if there’s a way to recover it.

What happened is that we had some original text encoded in, let’s say, ISO_8859_1, but the process that saved these Strings as byte arrays did something similar to new String(original_bytes, UTF8).getBytes(UTF8)
(where original_bytes holds the String encoded as ISO_8859_1).

I can’t find a way to recover the original_bytes array. Is it actually possible?

I tried to reproduce it with this simple Java sample code:

String s = "é";
System.out.println("s: " + s);
System.out.println("s.getBytes: " + Arrays.toString(s.getBytes()));
System.out.println("s.getBytes(UTF8): " + Arrays.toString(s.getBytes(Charsets.UTF_8)));
System.out.println("new String(s.getBytes()): " + new String(s.getBytes()));
System.out.println("new String(s.getBytes(), UTF-8): " + new String(s.getBytes(), Charsets.UTF_8));

byte [] iso = s.getBytes(Charsets.ISO_8859_1);
System.out.println("iso " + Arrays.toString(iso));
System.out.println("new String(iso)" + new String(iso));
System.out.println("new String(iso, ISO)" + new String(iso, Charsets.ISO_8859_1));
System.out.println("new String(iso).getBytes()" + Arrays.toString(new String(iso).getBytes()));
System.out.println("new String(iso).getBytes(ISO)" + Arrays.toString(new String(iso).getBytes(Charsets.ISO_8859_1)));
System.out.println("new String(iso, UTF8).getBytes()" + Arrays.toString(new String(iso, Charsets.UTF_8).getBytes()));
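System.out.println("new String(iso, UTF8).getBytes(UTF8)" + Arrays.toString(new String(iso, Charsets.UTF_8).getBytes(Charsets.UTF_8)));
System.out.println("new String(new String(iso).getBytes(), Charsets.ISO_8859_1) " + new String(new String(iso).getBytes(), Charsets.ISO_8859_1));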

Output (on a computer whose default charset is UTF-8):

s: é
s.getBytes: [-61, -87]
s.getBytes(UTF8): [-61, -87]
new String(s.getBytes()): é
new String(s.getBytes(), UTF-8): é
iso [-23]
new String(iso)�
new String(iso, ISO)é
new String(iso).getBytes()[-17, -65, -67]
new String(iso).getBytes(ISO)[63]
new String(iso, UTF8).getBytes()[-17, -65, -67]
new String(iso, UTF8).getBytes(UTF8)[-17, -65, -67]
new String(new String(iso).getBytes(), Charsets.ISO_8859_1) ï¿½

1 Answer

Editorial Team · Answer 1 · 2026-06-03T22:32:04+00:00

Unfortunately no, it’s not possible in every case.

UTF-8 has quite a few byte sequences that are illegal and that will (usually) be replaced by some replacement character when decoded. When your original_bytes contained any of those byte sequences, then that information is lost for sure.
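
To make the loss concrete, here is a minimal sketch (the class and method names are made up for illustration, and `java.nio.charset.StandardCharsets` is used in place of the `Charsets` helper from the question). It pushes two different ISO-8859-1 bytes through the same decode-then-encode path. Both come out as the same three bytes, so nothing in the stored data can tell them apart afterwards, while plain ASCII passes through untouched.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossDemo {
    // Mimics the faulty save path from the question:
    // decode the bytes as UTF-8, then encode the result as UTF-8 again.
    static byte[] badRoundTrip(byte[] original) {
        return new String(original, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] eAcute = {(byte) 0xE9}; // 'é' in ISO-8859-1
        byte[] eGrave = {(byte) 0xE8}; // 'è' in ISO-8859-1
        byte[] ascii = "abc".getBytes(StandardCharsets.ISO_8859_1);

        // Both accented letters collapse to the encoding of U+FFFD.
        System.out.println(Arrays.toString(badRoundTrip(eAcute))); // [-17, -65, -67]
        System.out.println(Arrays.toString(badRoundTrip(eGrave))); // [-17, -65, -67]

        // Pure ASCII is valid UTF-8, so it survives unchanged.
        System.out.println(Arrays.equals(badRoundTrip(ascii), ascii)); // true
    }
}
```

Because two distinct inputs map to the same output, no function can invert the operation for those inputs.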

Your best bet is to do the reverse, which will probably get you as close to the original String as possible:

byte[] originalISOData = ...;
byte[] badUTF8 = new String(originalISOData, "UTF-8").getBytes("UTF-8");
String partialReconstruction = new String(badUTF8, "ISO-8859-1");
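
If it helps to know which stored values were actually damaged, one option is to scan the stored bytes for the UTF-8 encoding of U+FFFD (EF BF BD). The sketch below is an illustration only: `DamageScan` and `countReplacementMarkers` are hypothetical names, and it assumes the original text never legitimately contained the ISO-8859-1 characters "ï¿½", which would produce the same three bytes and be counted as a false positive.

```java
import java.nio.charset.StandardCharsets;

class DamageScan {
    // Counts occurrences of EF BF BD, the UTF-8 encoding of U+FFFD,
    // which marks each place where the faulty save path dropped input.
    static int countReplacementMarkers(byte[] stored) {
        int count = 0;
        for (int i = 0; i + 2 < stored.length; i++) {
            if ((stored[i] & 0xFF) == 0xEF
                    && (stored[i + 1] & 0xFF) == 0xBF
                    && (stored[i + 2] & 0xFF) == 0xBD) {
                count++;
                i += 2; // skip the rest of this marker
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "café au lait" encoded as ISO-8859-1, then saved the wrong way.
        byte[] original = "caf\u00e9 au lait".getBytes(StandardCharsets.ISO_8859_1);
        byte[] stored = new String(original, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);

        System.out.println(countReplacementMarkers(stored));   // 1
        System.out.println(countReplacementMarkers(original)); // 0
    }
}
```

A count of zero suggests the value went through the faulty path unchanged and can simply be decoded as ISO-8859-1. A non-zero count tells you how many places need to be repaired from another source.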

tl;dr decoding non-UTF-8 data as UTF-8 is not generally a lossless operation. A valid UTF-8 decoder will replace all malformed byte sequences with replacement characters (or even abort the decoding, depending on the decoder and its settings).
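
The "abort the decoding" behaviour can be used to stop this kind of damage before it is written. As a rough sketch (the class and method names here are hypothetical), a `CharsetDecoder` configured with `CodingErrorAction.REPORT` throws on malformed input instead of silently substituting replacement characters, so a save path could reject or re-route bytes that are not valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Returns true only if the bytes decode as UTF-8 without any
    // malformed or unmappable sequence.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] isoEAcute = {(byte) 0xE9};               // 'é' in ISO-8859-1
        byte[] utf8EAcute = {(byte) 0xC3, (byte) 0xA9}; // 'é' in UTF-8

        System.out.println(isValidUtf8(isoEAcute));  // false
        System.out.println(isValidUtf8(utf8EAcute)); // true
    }
}
```

Note that this check only catches input that is not valid UTF-8. Some ISO-8859-1 byte sequences happen to be valid UTF-8 as well, so passing the check does not prove the data was UTF-8 to begin with.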
