The Archive Base

Editorial Team
Asked: May 23, 2026

What Unicode character encoding does a char object correspond to in C#, Java, and JavaScript?

What Unicode character encoding does a char object correspond to in:

  • C#

  • Java

  • JavaScript (I know there is not actually a char type but I am assuming that the String type is still implemented as an array of Unicode characters)

In general, is there a common convention among programming languages to use a specific character encoding?

Update

  1. I have tried to clarify my question. The changes I made are discussed in the comments below.
  2. Re: “What problem are you trying to solve?”, I am interested in code generation from language independent expressions, and the particular encoding of the file is relevant.
1 Answer

  1. Editorial Team
    Added an answer on May 23, 2026 at 11:50 am

    I’m not sure that I am answering your question, but let me make a few remarks that hopefully shed some light.

    At the core, general-purpose programming languages like the ones we are talking about (C, C++, C#, Java, PHP) do not have a notion of “text”, merely of “data”. Data consists of sequences of integral values (i.e. numbers). There is no inherent meaning behind those numbers.

    The process of turning a stream of numbers into a text is one of semantics, and it is usually left to the consumer to assign the relevant semantics to a data stream.

    Warning: I will now use the word “encoding”, which unfortunately has multiple inequivalent meanings. The first meaning of “encoding” is the assignment of meaning to a number. The semantic interpretation of a number is also called a “character”. For example, in the ASCII encoding, 32 means “space” and 65 means “capital A”. ASCII only assigns meanings to 128 numbers, so every ASCII character can be conveniently represented by a single 8-bit byte (with the top bit always 0). There are many encodings which assign characters to 256 numbers, thus all using one byte per character. In these fixed-width encodings, a text string has as many characters as it takes bytes to represent. There are also other encodings in which characters take a variable number of bytes to represent.
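    As a small illustration of this first meaning, the following Java sketch looks up the numbers that ASCII assigns to two characters and checks that a one-byte encoding uses exactly one byte per character. The class and method names (AsciiMeaning, asciiNumber, encodeLatin1) are made up for this example, and ISO-8859-1 is just one of the many 256-number encodings mentioned above.

```java
import java.nio.charset.StandardCharsets;

public class AsciiMeaning {
    // Returns the number that the ASCII table assigns to an ASCII character.
    static int asciiNumber(char c) {
        if (c > 127) {
            throw new IllegalArgumentException("not an ASCII character");
        }
        return c; // a Java char widens to int without changing its value
    }

    // Encodes text with a fixed-width, one-byte-per-character encoding.
    static byte[] encodeLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(asciiNumber(' '));                 // 32
        System.out.println(asciiNumber('A'));                 // 65
        // In a one-byte encoding, the byte count equals the character count.
        System.out.println(encodeLatin1("caf\u00e9").length); // 4
    }
}
```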

    Now, Unicode is also an encoding, i.e. an assignment of meaning to numbers. On the first 128 numbers it is the same as ASCII, but it reserves room for far more: its numbers run from 0 to 0x10FFFF, a range that fits in 21 bits. Because there are lots of meanings which aren’t strictly “characters” in the sense of writing (such as zero-width joiners or diacritic modifiers), the term “codepoint” is preferred over “character”. Nonetheless, any integral data type that is at least 21 bits wide can represent one codepoint. Typically one picks a 32-bit type, and this encoding, in which every element stands for one codepoint, is called UTF-32 or UCS-4.
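    To make the “one element per codepoint” idea concrete, here is a minimal Java sketch that decodes a string into an int array, one 32-bit value per codepoint. The names CodePoints and toCodePoints are hypothetical; String.codePoints() and Character.MAX_CODE_POINT are standard Java.

```java
public class CodePoints {
    // Decodes a Java String into one int per Unicode codepoint (UTF-32 style).
    static int[] toCodePoints(String text) {
        return text.codePoints().toArray();
    }

    public static void main(String[] args) {
        // 'A', U+00E9 (e with acute accent), and U+1F600 (beyond the 16-bit range)
        int[] cps = toCodePoints("A\u00e9\uD83D\uDE00");
        for (int cp : cps) {
            System.out.printf("U+%04X%n", cp);
        }
        // The largest codepoint, 0x10FFFF, needs 21 bits.
        System.out.println(Integer.toBinaryString(Character.MAX_CODE_POINT).length()); // 21
    }
}
```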

    Now we have a second meaning of “encoding”: I can take a string of Unicode codepoints and transform it into a string of 8-bit or 16-bit values, thus further “encoding” the information. In this new, transformed form (called “Unicode Transformation Format”, or “UTF”), we now have strings of 8-bit or 16-bit values (called “code units”), but each individual value does not in general correspond to anything meaningful; the sequence first has to be decoded into a sequence of Unicode codepoints.
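    The sketch below, in Java, counts how many code units a single codepoint needs in UTF-8 and in UTF-16. CodeUnits, utf8Units and utf16Units are names invented for this example; the big-endian UTF-16 charset is chosen only because it writes no byte-order mark, which keeps the arithmetic simple.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnits {
    // Number of 8-bit code units the text needs in UTF-8.
    static int utf8Units(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of 16-bit code units the text needs in UTF-16.
    static int utf16Units(String text) {
        // UTF_16BE emits no byte-order mark, so bytes / 2 is the unit count.
        return text.getBytes(StandardCharsets.UTF_16BE).length / 2;
    }

    public static void main(String[] args) {
        String euro = "\u20AC";        // U+20AC EURO SIGN, one codepoint
        String emoji = "\uD83D\uDE00"; // U+1F600, one codepoint
        System.out.println(utf8Units(euro) + " " + utf16Units(euro));   // 3 1
        System.out.println(utf8Units(emoji) + " " + utf16Units(emoji)); // 4 2
    }
}
```

    In both cases a single code unit taken on its own (one of the three UTF-8 bytes, or one half of the UTF-16 pair) is not a codepoint.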

    Thus, from a programming perspective, if you want to modify text (not bytes), then you should store your text as a sequence of Unicode codepoints. Practically, that means that you need a 32-bit data type. The char data type in C and C++ is usually 8 bits wide (though that’s only a minimum), while in C# and Java it is always 16 bits wide; JavaScript has no char type, but its strings are likewise sequences of 16-bit values. An 8-bit char could conceivably be used to store a transformed UTF-8 string, and a 16-bit char could store a transformed UTF-16 string, but in order to get at the raw, meaningful Unicode codepoints (and in particular at the length of the string in codepoints) you will always have to perform decoding.
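    For Java specifically, the difference between counting 16-bit char values and counting codepoints can be sketched as follows. CharIsCodeUnit and its two helper methods are hypothetical names; Character.SIZE, String.length(), String.codePointCount() and Character.isHighSurrogate() are standard library members.

```java
public class CharIsCodeUnit {
    // Length in 16-bit code units, which is what String.length() reports.
    static int lengthInCodeUnits(String text) {
        return text.length();
    }

    // Length in Unicode codepoints, obtained by decoding surrogate pairs.
    static int lengthInCodePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00"; // 'a' followed by U+1F600
        System.out.println(Character.SIZE);                         // 16
        System.out.println(lengthInCodeUnits(s));                   // 3
        System.out.println(lengthInCodePoints(s));                  // 2
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```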

    Typically your text processing libraries will be able to do the decoding and encoding for you, so they will happily accept UTF-8 and UTF-16 strings (but at a price), but if you want to spare yourself this extra indirection, store your strings as raw Unicode codepoints in a sufficiently wide type.
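    As a rough sketch of working on raw codepoints, the Java example below decodes a string once, reverses it as an int array, and re-encodes the result. EditByCodePoint and reverseByCodePoint are made-up names. Note the assumption that a codepoint is the right unit for the edit: sequences such as a base letter plus a combining accent would still be split by this approach.

```java
public class EditByCodePoint {
    // Reverses text codepoint by codepoint, so surrogate pairs stay intact.
    static String reverseByCodePoint(String text) {
        int[] cps = text.codePoints().toArray(); // decode once into 32-bit values
        for (int i = 0, j = cps.length - 1; i < j; i++, j--) {
            int tmp = cps[i];
            cps[i] = cps[j];
            cps[j] = tmp;
        }
        return new String(cps, 0, cps.length);   // re-encode as a UTF-16 String
    }

    public static void main(String[] args) {
        String original = "a\uD83D\uDE00b";      // 'a', U+1F600, 'b'
        String reversed = reverseByCodePoint(original);
        System.out.println(reversed.equals("b\uD83D\uDE00a")); // true
    }
}
```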

© 2021 The Archive Base. All Rights Reserved