The Archive Base

Editorial Team
Asked: May 17, 2026 (2026-05-17T01:52:05+00:00)

I have a part of my Unicode library that decodes UTF-16 into raw Unicode

I have a part of my Unicode library that decodes UTF-16 into raw Unicode code points. However, it isn’t working as expected.

Here’s the relevant part of the code (omitting UTF-8 and string manipulation stuff):

typedef struct string {
    unsigned long length;
    unsigned *data;
} string;

string *upush(string *s, unsigned c) {
    if (!s->length) s->data = (unsigned *) malloc((s->length = 1) * sizeof(unsigned));
    else            s->data = (unsigned *) realloc(s->data, ++s->length * sizeof(unsigned));
    s->data[s->length - 1] = c;
    return s;
}

typedef struct string16 {
    unsigned long length;
    unsigned short *data;
} string16;

string u16tou(string16 old) {
    unsigned long i, cur = 0, need = 0;
    string new;
    new.length = 0;
    for (i = 0; i < old.length; i++)
        if (old.data[i] < 0xd800 || old.data[i] > 0xdfff) upush(&new, old.data[i]);
        else
            if (old.data[i] > 0xdbff && !need) {
                cur = 0; continue;
            } else if (old.data[i] < 0xdc00) {
                need = 1;
                cur = (old.data[i] & 0x3ff) << 10;
                printf("cur 1: %lx\n", cur);
            } else if (old.data[i] > 0xdbff) {
                cur |= old.data[i] & 0x3ff;
                upush(&new, cur);
                printf("cur 2: %lx\n", cur);
                cur = need = 0;
            }
    return new;
}

How does it work?

string is a struct that holds 32-bit values, and string16 is for 16-bit values like UTF-16. All upush does is add a full Unicode code point to a string, reallocating memory as needed.

u16tou is the part that I’m focusing on. It loops through the string16, passing non-surrogate values through as normal, and converting surrogate pairs into full code points. Misplaced surrogates are ignored.

The first surrogate in a pair has its lowest 10 bits shifted 10 bits to the left, so that they form the high 10 bits of the final code point. The second surrogate has its lowest 10 bits OR-ed into that value, and the result is then appended to the string.

The problem?

Let’s try the highest code point, shall we?

U+10FFFD, the highest code point that is not a noncharacter (U+10FFFE and U+10FFFF are reserved as noncharacters), is encoded as 0xDBFF 0xDFFD in UTF-16. Let’s try decoding that.

string16 b;
b.length = 2;
b.data = (unsigned short *) malloc(2 * sizeof(unsigned short));
b.data[0] = 0xdbff;
b.data[1] = 0xdffd;
string a = u16tou(b);
puts(utoc(a));
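
As a cross-check on the test input, here is a minimal sketch of the encoding direction, assuming the standard UTF-16 rule of subtracting 0x10000 before splitting the value into two 10-bit halves. The name `encode_pair` is hypothetical and not part of the library in the question.

```c
/* Hypothetical helper (not from the original post): split a supplementary
   code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair. */
static void encode_pair(unsigned long cp, unsigned short *hi, unsigned short *lo) {
    unsigned long v = cp - 0x10000UL;               /* 20-bit value */
    *hi = (unsigned short)(0xD800 | (v >> 10));     /* top 10 bits */
    *lo = (unsigned short)(0xDC00 | (v & 0x3FF));   /* bottom 10 bits */
}
```

Calling `encode_pair(0x10FFFD, &hi, &lo)` gives `hi == 0xDBFF` and `lo == 0xDFFD`, which matches the two units used in the test above.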

Using utoc (not shown, but I know it works; see below) to convert the result back to a UTF-8 char * for printing, I can see in my terminal that I’m getting U+0FFFFD, not U+10FFFD.

In the calculator

Doing all the conversions manually in gcalctool results in the same, wrong answer. So my syntax itself isn’t wrong, but the algorithm is. The algorithm seems right to me though, and yet it’s ending in the wrong answer.
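
For reference, the manual calculation can be reproduced with a small sketch that combines the low 10 bits of each surrogate exactly as described above. The function name `combine_low_bits` is hypothetical and only isolates the arithmetic from the rest of u16tou.

```c
/* Hypothetical helper: combine the low 10 bits of two surrogates the way
   the question describes, with no further adjustment. */
static unsigned long combine_low_bits(unsigned short w1, unsigned short w2) {
    unsigned long cur = (unsigned long)(w1 & 0x3FF) << 10;  /* high 10 bits */
    cur |= (unsigned long)(w2 & 0x3FF);                     /* low 10 bits */
    return cur;
}
```

`combine_low_bits(0xDBFF, 0xDFFD)` evaluates to 0xFFFFD, the same value the terminal and the calculator show.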

What am I doing wrong?

1 Answer

  1. Editorial Team
    Added an answer on May 17, 2026 at 1:52 am

    You need to add 0x10000 when decoding the surrogate pair. To quote RFC 2781, the step you’re missing is number 5:

        1) If W1 < 0xD800 or W1 > 0xDFFF, the character value U is the value
           of W1. Terminate.
    
        2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence
           is in error and no valid character can be obtained using W1.
           Terminate.
    
        3) If there is no W2 (that is, the sequence ends with W1), or if W2
           is not between 0xDC00 and 0xDFFF, the sequence is in error.
           Terminate.
    
        4) Construct a 20-bit unsigned integer U', taking the 10 low-order
           bits of W1 as its 10 high-order bits and the 10 low-order bits of
           W2 as its 10 low-order bits.
    
        5) Add 0x10000 to U' to obtain the character value U. Terminate.
    

    i.e., one fix would be to add an extra line right after you read the first (high) surrogate:

    cur = (old.data[i] & 0x3ff) << 10;
    cur += 0x10000;
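
As a rough illustration, the five quoted steps can also be written as a standalone function. This is only a sketch under assumptions: `decode_utf16` and its return convention (number of 16-bit units consumed, 0 on error) are hypothetical and not part of the library in the question, which skips misplaced surrogates instead of reporting them.

```c
#include <stddef.h>

/* Hypothetical helper: decode one code point from in[0..len), following
   the five steps quoted from RFC 2781. Returns the number of 16-bit
   units consumed (1 or 2), or 0 if the sequence is in error. */
static size_t decode_utf16(const unsigned short *in, size_t len, unsigned long *out) {
    unsigned short w1, w2;

    if (len == 0) return 0;
    w1 = in[0];

    /* Step 1: not a surrogate, so the unit is the code point itself. */
    if (w1 < 0xD800 || w1 > 0xDFFF) { *out = w1; return 1; }

    /* Step 2: W1 must be a high surrogate (0xD800..0xDBFF). */
    if (w1 > 0xDBFF) return 0;

    /* Step 3: W2 must exist and be a low surrogate (0xDC00..0xDFFF). */
    if (len < 2) return 0;
    w2 = in[1];
    if (w2 < 0xDC00 || w2 > 0xDFFF) return 0;

    /* Step 4: build the 20-bit value U' from the two 10-bit halves.
       Step 5: add 0x10000 to obtain the character value U. */
    *out = (((unsigned long)(w1 & 0x3FF) << 10) | (unsigned long)(w2 & 0x3FF))
           + 0x10000UL;
    return 2;
}
```

With the pair from the question, 0xDBFF 0xDFFD, this returns 2 and stores 0x10FFFD. A lone 0xDC00 or a trailing 0xD800 returns 0.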
    
© 2021 The Archive Base. All Rights Reserved