The Archive Base

Editorial Team
Asked: May 12, 2026

One thing I have never truly understood is the concept of character encoding

One thing I have never truly understood is the concept of character encoding. The way encoding is handled in memory and in code often baffles me, so I end up copying an example from the internet without truly understanding what it does. I feel it’s a really important and much overlooked subject that more people should take the time to get right (including myself).

I am looking for some good, to-the-point resources for learning the different types of character encoding and how to convert between them (preferably in C#). Both books and online resources are welcome.

Thanks.


Edit 1:

Thanks for the responses so far. I am especially looking for more info on how .NET handles encoding. I know this may seem vague, but I don’t really know what to ask for. I guess I am curious as to how encoding is represented in, say, the C# string class, and whether the class itself can manage different encoding types or whether there are separate classes for this.

1 Answer

  1. Editorial Team
    Added an answer on May 12, 2026 at 12:53 pm

    I’d start with this question: what is a character?

    • The logical identity: a codepoint. Unicode assigns a number to each character that isn’t necessarily related to any bit/byte form. Encodings (like UTF-8) define the mapping to byte values.
    • The bits and bytes: the encoded form. One or more bytes per codepoint, values determined by the encoding used.
    • The thing you see on the screen: a grapheme. A grapheme is built from one or more codepoints. This is the stuff at the presentation end of things.

    This code transforms in.txt from windows-1252 to UTF-8 and saves it as out.txt.

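    using System;
    using System.IO;
    using System.Text;
    public class Enc {
      public static void Main(String[] args) {
        // On .NET Core / .NET 5+, code page 1252 is only available after
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
        Encoding win1252 = Encoding.GetEncoding(1252);
        Encoding utf8 = Encoding.UTF8;
        // The reader decodes windows-1252 bytes into UTF-16 chars.
        using(StreamReader reader = new StreamReader("in.txt", win1252)) {
          // The writer encodes UTF-16 chars as UTF-8 (false = overwrite).
          using(StreamWriter writer = new StreamWriter("out.txt", false, utf8)) {
            char[] buffer = new char[1024];
            int r;
            // Read returns 0 at the end of the stream.
            while((r = reader.Read(buffer, 0, buffer.Length)) > 0) {
              writer.Write(buffer, 0, r);
            }
          }
        }
      }
    }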
    

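    Two transformations happen here. First, the reader decodes the bytes from windows-1252 into UTF-16 code units in the char buffer (held in memory in the machine's native byte order, which is little-endian on most current hardware). Then the writer encodes the buffer's contents as UTF-8.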
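
    The same two steps can be sketched in memory, without files. This is a minimal illustration, not a definitive implementation: the class and method names (TwoStepTranscode, Transcode) are made up for this example, and ISO-8859-1 stands in for windows-1252 in the usage note because it is available without registering a code page provider.

    ```csharp
    using System;
    using System.Text;

    public static class TwoStepTranscode
    {
        // Hypothetical helper: decode with one encoding, re-encode with another.
        public static byte[] Transcode(byte[] input, Encoding source, Encoding target)
        {
            // Step 1: bytes in the source encoding -> UTF-16 string.
            string text = source.GetString(input);
            // Step 2: UTF-16 string -> bytes in the target encoding.
            return target.GetBytes(text);
        }
    }
    ```

    Calling Transcode(new byte[] { 0xA3 }, Encoding.GetEncoding("iso-8859-1"), Encoding.UTF8) turns the single byte A3 (£) into the two bytes C2 A3. The built-in Encoding.Convert(srcEncoding, dstEncoding, bytes) does the same job in one call.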

    Codepoints

    Some example code points:

    • U+0041 is LATIN CAPITAL LETTER A (A)
    • U+00A3 is POUND SIGN (£)
    • U+042F is CYRILLIC CAPITAL LETTER YA (Я)
    • U+1D50A is MATHEMATICAL FRAKTUR CAPITAL G (𝔊)
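
    To see these numbers from C#, one way is to walk a string and combine surrogate pairs as you go. A rough sketch, assuming the input is well-formed UTF-16 (char.ConvertToUtf32 throws on a lone surrogate); the CodePoints class and its Enumerate method are names invented for this example.

    ```csharp
    using System;
    using System.Collections.Generic;

    public static class CodePoints
    {
        // Hypothetical helper: list the Unicode code points in a string.
        public static List<int> Enumerate(string s)
        {
            var result = new List<int>();
            for (int i = 0; i < s.Length; i++)
            {
                // Combines a high/low surrogate pair into one code point.
                result.Add(char.ConvertToUtf32(s, i));
                // A pair occupies two chars, so skip the low surrogate.
                if (char.IsHighSurrogate(s[i])) i++;
            }
            return result;
        }
    }
    ```

    For the string "A\u00A3\u042F\uD835\uDD0A" (five chars) this yields the four values 0x41, 0xA3, 0x42F and 0x1D50A.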

    Encodings

    Anywhere you work with characters, it’ll be in an encoding of some form. C# uses UTF-16 for its char type, which it defines as 16 bits wide.

    You can think of an encoding as a tabular mapping between codepoints and byte representations.

    CODEPOINT       UTF-16BE        UTF-8     WINDOWS-1252
    U+0041 (A)         00 41           41               41
    U+00A3 (£)         00 A3        C2 A3               A3
    U+042F (Ya)        04 2F        D0 AF                -
    U+1D50A      D8 35 DD 0A  F0 9D 94 8A                -
    

    The System.Text.Encoding class exposes types/methods to perform the transformations.
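
    As a small sketch of that, the rows of the table can be reproduced by encoding a string and printing the bytes as hex. The EncodingTable class and Hex method are hypothetical names for this example; Encoding.BigEndianUnicode is .NET's name for UTF-16BE.

    ```csharp
    using System;
    using System.Text;

    public static class EncodingTable
    {
        // Hypothetical helper: encode text and format the bytes as hex.
        public static string Hex(Encoding encoding, string text)
        {
            byte[] bytes = encoding.GetBytes(text);
            // BitConverter.ToString gives "C2-A3"; swap the dashes for spaces.
            return BitConverter.ToString(bytes).Replace("-", " ");
        }
    }
    ```

    Hex(Encoding.UTF8, "\u00A3") gives "C2 A3", and Hex(Encoding.BigEndianUnicode, "\uD835\uDD0A") gives "D8 35 DD 0A", matching the table. Encoding.Unicode is the little-endian form, so the same pound sign comes out as "A3 00".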

    Graphemes

    The grapheme you see on the screen may be constructed from more than one codepoint. The character e-acute (é) can be represented with two codepoints, LATIN SMALL LETTER E U+0065 and COMBINING ACUTE ACCENT U+0301.

    (‘é’ is more usually represented by the single codepoint U+00E9. You can switch between them using normalization. Not all combining sequences have a single character equivalent, though.)
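
    In .NET, switching between the two forms is a call to string.Normalize. A minimal sketch, assuming default globalization settings; the EAcute class and its method names are invented for this example.

    ```csharp
    using System;
    using System.Text;

    public static class EAcute
    {
        // Form C composes: e + combining acute -> single codepoint U+00E9.
        public static string Compose(string s)
        {
            return s.Normalize(NormalizationForm.FormC);
        }

        // Form D decomposes: U+00E9 -> e + combining acute (U+0065 U+0301).
        public static string Decompose(string s)
        {
            return s.Normalize(NormalizationForm.FormD);
        }
    }
    ```

    Note that "e\u0301" == "\u00E9" is false in C#, because == on strings compares char values ordinally. Normalizing both sides to the same form first makes the comparison behave as a reader would expect.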

    Conclusions

    • When you encode a C# string to an encoding, you are performing a transformation from UTF-16 to that encoding.
    • Encoding can be a lossy transformation – most non-Unicode encodings can only encode a subset of existing characters.
    • Since not all codepoints fit into a single C# char, the number of chars in a string may be greater than the number of codepoints, and the number of codepoints may be greater than the number of rendered graphemes.
    • The “length” of a string is context-sensitive, so you need to know what meaning you’re applying and use the appropriate algorithm. How this is handled is defined by the programming language you’re using.
    • Giving Latin-1 characters identical values in many encodings gives some people delusions of ASCII.
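
    The points about length and lossiness can be sketched in a few lines. This is an illustration under stated assumptions rather than a definitive implementation: TextMeasures and its members are hypothetical names, and the grapheme count relies on System.Globalization.StringInfo, whose notion of a "text element" may differ in detail between .NET versions for complex sequences such as emoji.

    ```csharp
    using System;
    using System.Globalization;
    using System.Text;

    public static class TextMeasures
    {
        // Number of UTF-16 code units (what string.Length reports).
        public static int Chars(string s)
        {
            return s.Length;
        }

        // Number of Unicode code points: a surrogate pair counts once.
        public static int CodePoints(string s)
        {
            int count = 0;
            for (int i = 0; i < s.Length; i++)
            {
                count++;
                if (char.IsSurrogatePair(s, i)) i++; // skip the low surrogate
            }
            return count;
        }

        // Number of text elements (an approximation of graphemes).
        public static int Graphemes(string s)
        {
            return new StringInfo(s).LengthInTextElements;
        }

        // Encode to ASCII and decode again; unencodable chars become '?'.
        public static string AsciiRoundTrip(string s)
        {
            return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
        }
    }
    ```

    For "e\u0301\uD835\uDD0A" (e + combining acute, then the Fraktur G) the three counts are 4 chars, 3 codepoints and 2 graphemes. AsciiRoundTrip("\u00A3") returns "?", which shows the information loss: the pound sign cannot be recovered from the encoded bytes.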

    (This is a little more long-winded than I intended, and probably more than you wanted, so I’ll stop. I wrote an even more long-winded post on Java encoding here.)
