I have a String created from a byte[] array, using UTF-8 encoding.
However, it should have been created using another encoding (Windows-1252).
Is there a way to convert this String back to the right encoding?
I know it’s easy to do if you have access to the original byte array, but in my case it’s too late because the String is given by a closed source library.
As there seems to be some confusion about whether this is possible or not, I think I’ll need to provide an extensive example.
The question claims that the (initial) input is a `byte[]` that contains Windows-1252 encoded data. I’ll call that `byte[]` `ib` (for "initial bytes").

For this example I’ll choose the German word "Bär" (meaning bear) as the input, which Windows-1252 encodes as the three bytes 0x42 0xE4 0x72.
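A minimal sketch of that starting point, assuming the JVM provides the `windows-1252` charset. The class name `InitialBytes` and the helper `decodeCorrectly` are illustrative names, not something from the question:

```java
import java.nio.charset.Charset;

public class InitialBytes {
    // "Bär" in Windows-1252: B = 0x42, ä = 0xE4, r = 0x72
    static final byte[] IB = { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };

    // Decodes the bytes with the encoding they were actually written in.
    static String decodeCorrectly(byte[] bytes) {
        return new String(bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String correct = decodeCorrectly(IB);
        // Compare against an escaped literal so the check does not depend
        // on the source file or console encoding.
        System.out.println(correct.equals("B\u00E4r")); // true
    }
}
```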
(If your JVM doesn’t support that encoding, then you can use ISO-8859-1 instead, because those three letters (and most others) are at the same position in those two encodings).
The question goes on to state that some other code (that is outside of our influence) already converted that `byte[]` to a `String` using the UTF-8 encoding (I’ll call that `String` `is`, for "input String"). That `String` is the only input that is available to achieve our goal (if `ib` were available, it would be trivial).

Printing `is` obviously produces the incorrect output "B�".
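The wrong decoding step can be sketched as follows. `WrongDecode` and `decodeAsUtf8` are illustrative names standing in for whatever the closed source library does internally. The sketch prints code points rather than the text itself, because how many characters follow the `B` may depend on the JVM version (older decoders appear to swallow the trailing `r`, newer ones may keep it); in either case the `ä` is replaced by U+FFFD.

```java
import java.nio.charset.StandardCharsets;

public class WrongDecode {
    // "Bär" encoded as Windows-1252
    static final byte[] IB = { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };

    // Assumption: this mirrors what the library did, i.e. decode as UTF-8.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String is = decodeAsUtf8(IB);
        // 0xE4 announces a three-byte UTF-8 sequence, but no valid
        // continuation byte follows, so the decoder substitutes U+FFFD.
        for (char c : is.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```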
The goal would be to produce `ib` (or the correct decoding of that `byte[]`) with only `is` available.

Now some people claim that getting the UTF-8 encoded bytes from that `is` will return an array with the same values as the initial array. But that returns the UTF-8 encoding of the two characters `B` and `�`, and definitely returns the wrong result when re-interpreted as Windows-1252. Doing so produces the output "Bï¿½" (the replacement character is encoded as the three bytes 0xEF 0xBF 0xBD, and each of them is read as one Windows-1252 character), which is totally wrong (it is also the same output that would be the result if the initial array contained the non-word "Bür" instead).
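A sketch of that failed round trip, under the same assumptions as before (`RoundTripFails` and its helpers are illustrative names). It also runs the non-word "Bür" through the same steps so the two results can be compared; on the JVMs I would expect most readers to use, both should come out identical, but the comparison is printed rather than asserted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripFails {
    static final byte[] BAER = { (byte) 0x42, (byte) 0xE4, (byte) 0x72 }; // "Bär"
    static final byte[] BUER = { (byte) 0x42, (byte) 0xFC, (byte) 0x72 }; // "Bür"

    // Decode as UTF-8 (the mistake), then encode as UTF-8 again.
    static byte[] roundTrip(byte[] ib) {
        String is = new String(ib, StandardCharsets.UTF_8);
        return is.getBytes(StandardCharsets.UTF_8);
    }

    // Re-interpret the round-tripped bytes as Windows-1252.
    static String reinterpret(byte[] ib) {
        return new String(roundTrip(ib), Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        byte[] ob = roundTrip(BAER);
        // The original 0xE4 is gone; 0xEF 0xBF 0xBD (U+FFFD) took its place.
        System.out.println(Arrays.equals(ob, BAER)); // false
        // Whether two different inputs collapse to the same bytes:
        System.out.println(Arrays.equals(ob, roundTrip(BUER)));
    }
}
```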
So in this case you can’t undo the operation, because some information was lost.
There are in fact cases where such mis-encodings can be undone. It’s more likely to work when all possible (or at least all occurring) byte sequences are valid in the encoding that was wrongly applied. Since UTF-8 has several byte sequences that are simply not valid, you will have problems.
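For contrast, here is a sketch of a case that can be undone: the reverse mistake, where bytes that were really UTF-8 got decoded as ISO-8859-1. ISO-8859-1 assigns a character to every possible byte value, so nothing is replaced and the original bytes can be recovered. `ReversibleMisdecode` and `repair` are illustrative names, and this is not a fix for the situation in the question.

```java
import java.nio.charset.StandardCharsets;

public class ReversibleMisdecode {
    // Undoes a wrong ISO-8859-1 decoding of bytes that were really UTF-8.
    static String repair(String misdecoded) {
        // Every char in the String came from exactly one byte, so this
        // step restores the original byte array.
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = "B\u00E4r".getBytes(StandardCharsets.UTF_8); // 42 C3 A4 72
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1); // "BÃ¤r"
        System.out.println(repair(wrong).equals("B\u00E4r")); // true
    }
}
```

The repair only works because the wrong decoder never had to substitute a replacement character; as soon as one appears, the original byte is unrecoverable.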