I’ve tracked a problem I’m having down to the following inexplicable behaviour within the .NET System.Text.Encoding class:
byte[] original = new byte[] { 128 };
string encoded = System.Text.Encoding.UTF8.GetString(original);
byte[] decoded = System.Text.Encoding.UTF8.GetBytes(encoded);
Console.WriteLine(original[0] == decoded[0]);
Am I expecting too much that decoded should equal original in the above?
UTF8, UTF7, UTF32, Unicode and ASCII all produce various varieties of wrongness. What’s going on?
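In general you can’t round-trip bytes this way, and you shouldn’t expect to for an arbitrary encoding, least of all for any of the UTF encodings. Not every byte sequence is valid text in those encodings. A lone byte 128 (0x80), for example, is not valid UTF-8, so the decoder substitutes the replacement character U+FFFD, and encoding that string back produces different bytes from the ones you started with.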
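To see where the original byte goes, here is a minimal sketch of the UTF-8 case from the question. The class and method names (Utf8RoundTripDemo, RoundTrip) are made up for illustration; only the System.Text.Encoding calls come from the question.

```csharp
using System;
using System.Text;

public static class Utf8RoundTripDemo
{
    // Hypothetical helper: decode the bytes as UTF-8, then encode the string back.
    public static byte[] RoundTrip(byte[] input)
    {
        string text = Encoding.UTF8.GetString(input);
        return Encoding.UTF8.GetBytes(text);
    }

    public static void Main()
    {
        // 0x80 is a UTF-8 continuation byte; on its own it is not a valid sequence.
        byte[] original = { 128 };

        // The decoder replaces the invalid byte with U+FFFD.
        string text = Encoding.UTF8.GetString(original);
        Console.WriteLine(text[0] == '\uFFFD'); // True

        // Encoding U+FFFD yields three bytes, none of which is the original 0x80.
        Console.WriteLine(BitConverter.ToString(RoundTrip(original))); // EF-BF-BD
    }
}
```

The information in the original byte is lost at the GetString step, so nothing done to the string afterwards can recover it.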
However, there is an encoding that will let you round-trip all byte values: Latin1, aka ISO-8859-1, aka code page 28591. It maps each byte value 0 to 255 to the Unicode code point with the same number, so no byte is ever invalid. This encoding is similar but not identical to the default Windows ANSI encoding, and it is useful where round-tripping in this way matters, e.g. writing a stream that mixes text and control characters to a serial port.
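As a rough check of the Latin1 claim, the sketch below pushes every possible byte value through the encoding and back. The class and method names are made up for illustration; Encoding.GetEncoding(28591) is the standard way to look the encoding up by code page.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Latin1RoundTripDemo
{
    // Hypothetical helper: decode the bytes as Latin1, then encode the string back.
    public static byte[] RoundTrip(byte[] input)
    {
        Encoding latin1 = Encoding.GetEncoding(28591); // ISO-8859-1
        string text = latin1.GetString(input);
        return latin1.GetBytes(text);
    }

    public static void Main()
    {
        // Every byte value from 0 to 255, including 128 from the question.
        byte[] original = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();

        byte[] roundTripped = RoundTrip(original);
        Console.WriteLine(original.SequenceEqual(roundTripped)); // True
    }
}
```

If the goal is only to carry arbitrary binary data through a string, Convert.ToBase64String and Convert.FromBase64String are another option that avoids text encodings altogether.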
See this answer, or other questions that mention Latin1.