I have a multi-byte string containing a mixture of japanese and latin characters. I’m

Asked: May 12, 2026 (2026-05-12T07:26:32+00:00)

I have a multi-byte string containing a mixture of Japanese and Latin characters. I’m trying to copy parts of this string to a separate memory location. Since it’s a multi-byte string, some characters use one byte and others use two or more. When copying parts of the string, I must not copy “half” of a Japanese character. To do this properly, I need to determine where each character in the multi-byte string starts and ends.

As an example, if the string contains 3 characters that require [2 bytes][2 bytes][1 byte], I must copy either 2, 4 or 5 bytes to the other location and not 3, since copying 3 bytes would copy only half of the second character.

To figure out where characters in the multi-byte string start and end, I’m trying to use the Windows API functions CharNext and CharNextExA, but without luck. When I use these functions, they step through my string one byte at a time rather than one character at a time. According to MSDN, “The CharNext function retrieves a pointer to the next character in a string.”

Here’s some code to illustrate this problem:

#include <windows.h>
#include <stdio.h>
#include <wchar.h>
#include <string.h>

/* string consisting of six "asian" characters */
wchar_t wcsString[] = L"\u9580\u961c\u9640\u963f\u963b\u9644";

int main() 
{
   // Convert the asian string from wide char to multi-byte (UTF-8).
   // The size passed to WideCharToMultiByte matches the allocation.
   LPSTR mbString = new char[1000];
   WideCharToMultiByte(CP_UTF8, 0, wcsString, -1, mbString, 1000, NULL, NULL);

   // Count the number of characters in the string.
   int characterCount = 0;
   LPSTR currentCharacter = mbString;
   while (*currentCharacter)
   {
      characterCount++;

      currentCharacter = CharNextExA(CP_UTF8, currentCharacter, 0);
   }

   printf("characterCount = %d\n", characterCount);
   return 0;
}

(Please ignore the memory leak and the lack of error checking.)

Now, in the example above I would expect characterCount to become 6, since that’s the number of characters in the asian string. But instead, characterCount becomes 18, because mbString contains 18 bytes, which show up as 18 single-byte characters:

é–€é˜œé™€é˜¿é˜»é™„

I don’t understand how it’s supposed to work. How is CharNext supposed to know whether “é–€é” in the string is an encoded version of a Japanese character, or in fact the characters é – € and é?

Some notes:

I’ve read Joel’s blog post about what every developer needs to know about Unicode. I may have misunderstood something in it, though.
If all I wanted to do was to count the characters, I could count the characters in the asian string directly. Keep in mind that my real goal is copying parts of the multi-byte string to a separate location. The separate location only supports multi-byte, not widechar.
If I convert the content of mbString back to wide char using MultiByteToWideChar, I get the correct string (門阜陀阿阻附), which indicates that there’s nothing wrong with mbString.

EDIT:
Apparently the CharNext functions don’t support UTF-8, but Microsoft forgot to document that. I threw (copy-pasted) together my own routine, which I won’t use and which needs improving. I’m guessing it’s easy to crash.

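LPSTR CharMoveNext(LPSTR szString)
{
   // Null pointer or end of string: there is no next character.
   if (szString == 0 || *szString == 0)
      return 0;

   // The UTF-8 lead byte gives the length of the sequence.
   if ( (szString[0] & 0x80) == 0x00)       // 0xxxxxxx: 1 byte (ASCII)
      return szString + 1;
   else if ( (szString[0] & 0xE0) == 0xC0)  // 110xxxxx: 2 bytes
      return szString + 2;
   else if ( (szString[0] & 0xF0) == 0xE0)  // 1110xxxx: 3 bytes
      return szString + 3;
   else if ( (szString[0] & 0xF8) == 0xF0)  // 11110xxx: 4 bytes
      return szString + 4;
   else                                     // stray continuation or invalid byte
      return szString + 1;
}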
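
One way the routine above could be made harder to crash is sketched below in portable C (no Windows headers). It is only a sketch under stated assumptions: the input is a NUL-terminated string that is meant to be UTF-8, and the names utf8_seq_len, utf8_next and utf8_count are made up for this illustration, not part of any API. Unlike CharMoveNext, it returns a pointer to the terminator at the end of the string instead of 0, and it stops early when a sequence is cut short, so it never steps past the terminator.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence that starts at s, judged only by
   its lead byte. Stray continuation bytes and invalid lead bytes count
   as 1 so the caller always makes progress. */
static size_t utf8_seq_len(const char *s)
{
    unsigned char c = (unsigned char)s[0];
    if (c < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((c & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;
}

/* Hypothetical helper: advance to the next character. At the end of the
   string it returns a pointer to the terminator and never moves past it. */
const char *utf8_next(const char *s)
{
    assert(s != NULL);
    if (*s == '\0')
        return s;

    size_t len = utf8_seq_len(s);
    size_t i = 1;
    /* Stop early if the sequence is cut short by the terminator or by a
       byte that is not a continuation byte (10xxxxxx). */
    while (i < len && ((unsigned char)s[i] & 0xC0) == 0x80)
        i++;
    return s + i;
}

/* Count characters by stepping through the string with utf8_next. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    while (*s != '\0') {
        s = utf8_next(s);
        n++;
    }
    return n;
}
```

Called on the 18-byte mbString from the example, utf8_count should step three bytes at a time. The sketch does not reject overlong or otherwise ill-formed sequences; it only guarantees forward progress without reading past the terminator.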

1 Answer

Editorial Team · Answer 1 · 2026-05-12T07:26:32+00:00

There is a really good explanation of what is going on here at the Sorting it All Out blog: “Is CharNextExA broken?” In short, CharNext is not designed to work with UTF-8 strings.
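
Since CharNext will not help with UTF-8, the actual goal from the question (copying part of the string without splitting a character) can be approached by looking at the bytes directly. In UTF-8 every continuation byte has the form 10xxxxxx, so a cut point is safe exactly when the first byte left out is not a continuation byte. Below is a minimal sketch in portable C, assuming the source is valid, NUL-terminated UTF-8; utf8_safe_prefix and utf8_copy_prefix are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: largest n <= max_bytes such that the first n bytes
   of the UTF-8 string src end on a character boundary. */
size_t utf8_safe_prefix(const char *src, size_t max_bytes)
{
    assert(src != NULL);
    size_t len = strlen(src);
    if (max_bytes >= len)
        return len;

    size_t n = max_bytes;
    /* src[n] is the first byte that would be left out. If it is a
       continuation byte (10xxxxxx), the cut falls inside a character,
       so back up until it falls on a lead byte. */
    while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
        n--;
    return n;
}

/* Hypothetical helper: copy as many whole characters of src as fit into
   dst. dst_size is the capacity of dst including the terminator.
   Returns the number of bytes copied, not counting the terminator. */
size_t utf8_copy_prefix(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && dst_size > 0);
    size_t n = utf8_safe_prefix(src, dst_size - 1);
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

For a string laid out as [2 bytes][2 bytes][1 byte], asking for at most 3 bytes yields 2, matching the example in the question. This backward scan relies on a property of UTF-8; it should not be assumed to work for double-byte code pages such as Shift-JIS, where a trail byte can have the same value as a lead byte.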
