One thing I have never truly understood is the concept of character encoding. The

Asked: May 12, 2026

One thing I have never truly understood is the concept of character encoding. The way encoding is handled in memory and in code often baffles me, to the point that I just copy an example from the internet without truly understanding what it does. I feel it’s a really important and much overlooked subject that more people should take the time to get right (including myself).

I am looking for some good, to-the-point resources for learning the different types of character encoding and converting between them (preferably in C#). Both books and online resources are welcome.

Thanks.

Edit 1:

Thanks for the responses so far. I am especially looking for more info on how .NET handles encoding. I know this may seem vague, but I don’t really know what to ask for. I guess I am curious how encoding is represented in, say, the C# string class, and whether the class itself can manage different encoding types or whether there are separate classes for this.


1 Answer

Editorial Team · Answer 1 · 2026-05-12T12:53:22+00:00

I’d start with this question: what is a character?

The logical identity: a codepoint. Unicode assigns a number to each character that isn’t necessarily related to any bit/byte form. Encodings (like UTF-8) define the mapping to byte values.
The bits and bytes: the encoded form. One or more bytes per codepoint, values determined by the encoding used.
The thing you see on the screen: a grapheme. A grapheme is built from one or more codepoints. This is the stuff at the presentation end of things.

This code transforms in.txt from windows-1252 to UTF-8 and saves it as out.txt.

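using System;
using System.IO;
using System.Text;
public class Enc {
  public static void Main(String[] args) {
    // On .NET Core / .NET 5+, code page 1252 is only available after calling
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    Encoding win1252 = Encoding.GetEncoding(1252);
    Encoding utf8 = Encoding.UTF8;
    // The reader decodes windows-1252 bytes into chars; the writer encodes chars as UTF-8.
    using(StreamReader reader = new StreamReader("in.txt", win1252)) {
      using(StreamWriter writer = new StreamWriter("out.txt", false, utf8)) {
        char[] buffer = new char[1024];
        int r;
        // Read returns 0 at end of stream; testing Peek() > 0 would also stop early at a U+0000 char.
        while((r = reader.Read(buffer, 0, buffer.Length)) > 0) {
          writer.Write(buffer, 0, r);
        }
      }
    }
  }
}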

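Two transformations happen here. First, the reader decodes the windows-1252 bytes into the char buffer as UTF-16 code units (in memory these use the machine's native byte order, which is little endian on most hardware .NET runs on). Then the writer encodes the buffer's contents as UTF-8.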
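The same two steps can be sketched in memory, without files. This is a minimal illustration rather than a replacement for the program above: the class and method names are made up for the example, and it uses ISO-8859-1 instead of windows-1252 because that encoding is available without registering a code page provider (the two agree on the byte A3 for £).

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the original answer.
public static class TranscodeSketch
{
    // Step 1: decode source bytes into a UTF-16 string.
    // Step 2: encode that string into the target encoding.
    public static byte[] Transcode(byte[] input, Encoding source, Encoding target)
    {
        string text = source.GetString(input);  // bytes -> UTF-16 chars
        return target.GetBytes(text);           // UTF-16 chars -> bytes
    }

    public static void Main()
    {
        // Assumption: ISO-8859-1 stands in for windows-1252 here.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] input = { 0x41, 0xA3 };          // "A£" as single bytes
        byte[] output = Transcode(input, latin1, Encoding.UTF8);
        Console.WriteLine(BitConverter.ToString(output)); // 41-C2-A3
    }
}
```

The framework also ships a static Encoding.Convert(source, target, bytes) method that does the same decode-then-encode in one call; the sketch just spells out the intermediate string.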

Codepoints

Some example code points:

U+0041 is LATIN CAPITAL LETTER A (A)
U+00A3 is POUND SIGN (£)
U+042F is CYRILLIC CAPITAL LETTER YA (Я)
U+1D50A is MATHEMATICAL FRAKTUR CAPITAL G (𝔊)
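To poke at these code points from C#, something like the following should work. It is only a sketch (the class name and the Describe helper are invented for the example) built on char.ConvertFromUtf32 and char.ConvertToUtf32, which convert between a code point number and its UTF-16 string form.

```csharp
using System;

// Hypothetical helper for illustration; not part of the original answer.
public static class CodePointSketch
{
    // Reports how many 16-bit chars a single code point occupies in a string.
    public static string Describe(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        return string.Format("U+{0:X4} -> {1} char(s)", codePoint, s.Length);
    }

    public static void Main()
    {
        foreach (int cp in new[] { 0x41, 0xA3, 0x42F, 0x1D50A })
            Console.WriteLine(Describe(cp));

        // U+1D50A lies outside the 16-bit range, so it is stored as a surrogate pair.
        string g = char.ConvertFromUtf32(0x1D50A);
        Console.WriteLine(char.IsSurrogatePair(g, 0));            // True
        Console.WriteLine(char.ConvertToUtf32(g, 0) == 0x1D50A);  // True
    }
}
```

The first three code points fit in one char each; only the last one needs two.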

Encodings

Anywhere you work with characters, it’ll be in an encoding of some form. C# uses UTF-16 for its char type, which it defines as 16 bits wide.

You can think of an encoding as a tabular mapping between codepoints and byte representations.

CODEPOINT       UTF-16BE        UTF-8     WINDOWS-1252
U+0041 (A)         00 41           41               41
U+00A3 (£)         00 A3        C2 A3               A3
U+042F (Ya)        04 2F        D0 AF                -
U+1D50A      D8 35 DD 0A  F0 9D 94 8A                -

The System.Text.Encoding class exposes types/methods to perform the transformations.
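As a rough check, the UTF-16BE and UTF-8 columns of the table can be reproduced with Encoding.BigEndianUnicode and Encoding.UTF8. The class and the Hex helper below are names invented for this sketch, and the windows-1252 column is left out because that code page may need a provider registered on newer runtimes.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the original answer.
public static class EncodingTableSketch
{
    // Encodes one code point and returns its bytes as space-separated hex.
    public static string Hex(Encoding encoding, int codePoint)
    {
        byte[] bytes = encoding.GetBytes(char.ConvertFromUtf32(codePoint));
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    public static void Main()
    {
        foreach (int cp in new[] { 0x41, 0xA3, 0x42F, 0x1D50A })
        {
            Console.WriteLine("U+{0:X4}  UTF-16BE: {1,-12} UTF-8: {2}",
                cp, Hex(Encoding.BigEndianUnicode, cp), Hex(Encoding.UTF8, cp));
        }
    }
}
```

GetBytes on a string does not emit a byte order mark, so the output is just the encoded character.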

Graphemes

The grapheme you see on the screen may be constructed from more than one codepoint. The character e-acute (é) can be represented with two codepoints, LATIN SMALL LETTER E U+0065 and COMBINING ACUTE ACCENT U+0301.

(‘é’ is more usually represented by the single codepoint U+00E9. You can switch between them using normalization. Not all combining sequences have a single character equivalent, though.)
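Here is a small sketch of switching between the two forms of é with string.Normalize, and of counting what the framework calls text elements with System.Globalization.StringInfo. The class and method names are invented for the example, and it assumes the runtime has its normal globalization support available.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration; not part of the original answer.
public static class GraphemeSketch
{
    // Counts text elements (roughly, user-perceived characters) in a string.
    public static int TextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    public static void Main()
    {
        string decomposed = "e\u0301";  // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
        string composed = decomposed.Normalize(NormalizationForm.FormC);

        Console.WriteLine(decomposed.Length);         // 2
        Console.WriteLine(composed.Length);           // 1
        Console.WriteLine(composed == "\u00E9");      // True
        Console.WriteLine(TextElements(decomposed));  // 1
    }
}
```

NormalizationForm.FormD goes the other way and splits U+00E9 back into the two-codepoint sequence.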

Conclusions

When you encode a C# string to an encoding, you are performing a transformation from UTF-16 to that encoding.
Encoding can be a lossy transformation – most non-Unicode encodings can only encode a subset of existing characters.
Since not all codepoints can fit into a single C# char, the number of chars in a string may be greater than the number of codepoints, and the number of codepoints may be greater than the number of rendered graphemes.
The “length” of a string is context-sensitive, so you need to know what meaning you’re applying and use the appropriate algorithm. How this is handled is defined by the programming language you’re using.
The fact that the ASCII range (U+0000 to U+007F) has identical byte values in many encodings gives some people delusions of ASCII.
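A couple of these points can be seen in a few lines. The sketch below (all names are invented for the example) counts one string three ways and shows a lossy encode: Encoding.ASCII silently substitutes '?' by default, while an encoding created with EncoderFallback.ExceptionFallback throws instead.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration; not part of the original answer.
public static class LengthSketch
{
    // Counts code points by treating each surrogate pair as one unit.
    public static int CodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i)) i++;  // skip the low surrogate
            count++;
        }
        return count;
    }

    public static int Graphemes(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    // True if every character of s survives encoding without substitution.
    public static bool CanEncode(string s, string encodingName)
    {
        Encoding strict = Encoding.GetEncoding(encodingName,
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        try { strict.GetBytes(s); return true; }
        catch (EncoderFallbackException) { return false; }
    }

    public static void Main()
    {
        string s = "e\u0301" + char.ConvertFromUtf32(0x1D50A);
        Console.WriteLine(s.Length);       // 4 chars (UTF-16 code units)
        Console.WriteLine(CodePoints(s));  // 3 code points
        Console.WriteLine(Graphemes(s));   // 2 text elements

        byte[] lossy = Encoding.ASCII.GetBytes("\u042F");   // Я is not in ASCII
        Console.WriteLine(BitConverter.ToString(lossy));    // 3F, i.e. '?'
        Console.WriteLine(CanEncode("\u042F", "us-ascii")); // False
    }
}
```

Which of the three counts is the "length" depends on what you need it for: buffer sizes want chars or bytes, cursor movement wants text elements.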

(This is a little more long-winded than I intended, and probably more than you wanted, so I’ll stop. I wrote an even more long-winded post on Java encoding here.)
