I have string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.
For now, my implementation is the following:
public static string DecodeFromUtf8(this string utf8String)
{
// read the string as UTF-8 bytes.
byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);
// convert them into unicode bytes.
byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);
// builds the converted string.
return Encoding.Unicode.GetString(encodedBytes);
}
I am playing with the word "déjà". I have converted it into UTF-8 through this online tool, and so I started to test my method with the string "déjÃ".
Unfortunately, with this implementation the string just remains the same.
Where am I wrong?
So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C#
string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.This is easy, however it would be best to find the root cause; the location where someone is copying UTF-8 code units into 16 bit code units. The likely culprit is somebody converting bytes into a C#
stringusing the wrong encoding. E.g.Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).Alternatively, if you’re sure you know the incorrect encoding which was used to produce the string, and that incorrect encoding transformation was lossless (usually the case if the incorrect encoding is a single byte encoding), then you can simply do the inverse encoding step to get the original UTF-8 data, and then you can do the correct conversion from UTF-8 bytes: