One thing I have never truly understood is the concept of character encoding. The way encoding is handled in memory and in code often baffles me, so I end up copying an example from the internet without truly understanding what it does. I feel it’s a really important and much overlooked subject that more people should take the time to get right (including myself).
I am looking for some good, to the point, resources for learning the different types of character encoding and converting between them (preferably in C#). Both books and online resources are welcome.
Thanks.
Edit 1:
Thanks for the responses so far. I am especially looking for some more info on how .NET handles encoding. I know this may seem vague, but I don’t really know what to ask for. I guess I am curious as to how encoding is represented in, say, the C# string class, and whether the class itself can manage different encoding types or whether there are separate classes for this.
I’d start with this question: what is a character?
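As a concrete starting point, here is a minimal sketch of transcoding a file in C#. The class and method names (Transcoder, Transcode) and the buffer size are my own choices for illustration. It assumes .NET Core or .NET 5+, where windows-1252 is only available after registering CodePagesEncodingProvider; on the classic .NET Framework that registration line should not be needed.

```csharp
using System;
using System.IO;
using System.Text;

public static class Transcoder
{
    // Reads inPath as inEncoding and writes the same text to outPath as outEncoding.
    public static void Transcode(string inPath, Encoding inEncoding,
                                 string outPath, Encoding outEncoding)
    {
        using (var reader = new StreamReader(inPath, inEncoding))
        using (var writer = new StreamWriter(outPath, false, outEncoding))
        {
            // The char buffer holds UTF-16 code units, whatever the file encodings are.
            var buffer = new char[1024];
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                writer.Write(buffer, 0, read);
            }
        }
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where code pages must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Encoding.UTF8 writes a byte order mark at the start of the output file.
        Transcode("in.txt", Encoding.GetEncoding("windows-1252"),
                  "out.txt", Encoding.UTF8);
    }
}
```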
This code transforms in.txt from windows-1252 to UTF-8 and saves it as out.txt.
Two transformations happen here. First, the bytes are decoded from windows-1252 to UTF-16 (little endian, I think) into the char buffer. Then the buffer is transformed into UTF-8.
Codepoints
Some example code points:
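One way to see code points for yourself is to print them from a C# string. This is only a sketch: the sample characters are my own picks (A, é, the euro sign, and U+1D50A, which lies outside the 16-bit range), and the helper name CodePoints.Of is hypothetical. It assumes the string is well-formed UTF-16.

```csharp
using System;
using System.Collections.Generic;

public static class CodePoints
{
    // Lists the Unicode code points in a string, treating a surrogate pair as one code point.
    public static List<int> Of(string s)
    {
        var result = new List<int>();
        for (int i = 0; i < s.Length; i++)
        {
            result.Add(char.ConvertToUtf32(s, i));
            // A high surrogate means the next char is the second half of the same code point.
            if (char.IsHighSurrogate(s[i])) i++;
        }
        return result;
    }

    public static void Main()
    {
        // Escapes are used so the example does not depend on the source file's own encoding.
        foreach (int cp in Of("A\u00E9\u20AC\U0001D50A"))
        {
            Console.WriteLine("U+{0:X4}", cp);
        }
        // Prints U+0041, U+00E9, U+20AC, U+1D50A (one per line).
    }
}
```

Note that the last character takes two C# chars (a surrogate pair) even though it is a single code point.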
Encodings
Anywhere you work with characters, it’ll be in an encoding of some form. C# uses UTF-16 for its char type, which it defines as 16 bits wide.
You can think of an encoding as a tabular mapping between codepoints and byte representations.
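The System.Text.Encoding class and its subclasses (such as UTF8Encoding) expose methods such as GetBytes and GetString to perform the transformations. A string itself is always UTF-16 in memory; other encodings only come into play when converting to or from bytes.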
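To make the "tabular mapping" idea concrete, here is a small sketch that encodes the same code point (the euro sign, U+20AC, chosen by me as an example) with several of the built-in Encoding instances. The class and helper names (EncodingDemo, Hex) are made up for illustration.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Returns the encoded bytes of text as a hex string like "E2-82-AC".
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string euro = "\u20AC";

        Console.WriteLine(Hex(euro, Encoding.UTF8));             // E2-82-AC
        Console.WriteLine(Hex(euro, Encoding.Unicode));          // AC-20 (UTF-16, little endian)
        Console.WriteLine(Hex(euro, Encoding.BigEndianUnicode)); // 20-AC (UTF-16, big endian)
        Console.WriteLine(Hex(euro, Encoding.UTF32));            // AC-20-00-00

        // Decoding with the same encoding gets the original string back.
        byte[] utf8 = Encoding.UTF8.GetBytes(euro);
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == euro); // True
    }
}
```

The string never changes here; only the byte representation produced by each Encoding does.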
Graphemes
The grapheme you see on the screen may be constructed from more than one codepoint. The character e-acute (é) can be represented with two codepoints, LATIN SMALL LETTER E U+0065 and COMBINING ACUTE ACCENT U+0301.
(‘é’ is more usually represented by the single codepoint U+00E9. You can switch between them using normalization. Not all combining sequences have a single character equivalent, though.)
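The two forms of é can be compared in C# roughly like this. It is a sketch using string.Normalize and System.Globalization.StringInfo; the class name GraphemeDemo is my own, and I am assuming a runtime with normal globalization support (normalization may behave differently where that is disabled).

```csharp
using System;
using System.Globalization;
using System.Text;

public static class GraphemeDemo
{
    public static void Main()
    {
        // LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
        string decomposed = "e\u0301";

        // Form C composes the sequence into the single code point U+00E9 where one exists.
        string composed = decomposed.Normalize(NormalizationForm.FormC);

        Console.WriteLine(decomposed.Length);    // 2
        Console.WriteLine(composed.Length);      // 1
        Console.WriteLine(composed == "\u00E9"); // True

        // Form D goes the other way and decomposes it again.
        Console.WriteLine(composed.Normalize(NormalizationForm.FormD) == decomposed); // True

        // StringInfo counts what the user sees: one text element either way.
        Console.WriteLine(new StringInfo(decomposed).LengthInTextElements); // 1
    }
}
```

Because == on strings compares code units, the two forms are not equal until you normalize them to the same form.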
Conclusions
(This is a little more long-winded than I intended, and probably more than you wanted, so I’ll stop. I wrote an even more long-winded post on Java encoding here.)