Sometimes manipulating character strings at the character level is unavoidable. Here I have a


Asked: June 13, 2026 (2026-06-13T00:12:04+00:00)


Sometimes manipulating character strings at the character level is unavoidable.

Here I have a function written for ANSI/ASCII based character strings that replaces CR/LF sequences with LF only, and also replaces CR with LF. We use this because incoming text files often have goofy line endings due to various text or email programs that have made a mess of them, and I need them to be in a consistent format to make parsing / processing / output work properly down the road.

Here’s a fairly efficient implementation of this compression from various line-endings to LF only, for single byte per character implementations:

// returns the in-place conversion of a Mac or PC style string to a Unix style string (i.e. no CR/LF or CR only, but rather LF only)
char * AnsiToUnix(char * pszAnsi, size_t cchBuffer)
{
    size_t i, j;
    for (i = 0, j = 0; pszAnsi[i]; ++i, ++j)
    {
        // bounds checking
        ASSERT(i < cchBuffer);
        ASSERT(j <= i);

        switch (pszAnsi[i])
        {
            case '\n':
                if (pszAnsi[i + 1] == '\r')
                    ++i;
                // store the LF: once an earlier pair has been collapsed, j lags behind i
                pszAnsi[j] = '\n';
                break;

            case '\r':
                if (pszAnsi[i + 1] == '\n')
                    ++i;
                pszAnsi[j] = '\n';
                break;

            default:
                if (j != i)
                    pszAnsi[j] = pszAnsi[i];
        }

    }

    // append null terminator if we changed the length of the string buffer
    if (j != i)
        pszAnsi[j] = '\0';

    // bounds checking
    ASSERT(pszAnsi[j] == 0);

    return pszAnsi;
}

I’m trying to transform this into something that will work correctly with multibyte/Unicode strings, where a single character can be multiple bytes wide.

So:

I need to look at a character only at a valid character-point (not in the middle of a character)
I need to copy whole characters, not just individual bytes, when shifting the text down over the removed characters

I understand that _mbsinc() will give me the address of the next start of a real character. But what is the equivalent for Unicode (UTF16), and are there already primitives to be able to copy a full character (e.g. length_character(wsz))?

1 Answer

Editorial Team · Answer 1 · 2026-06-13T00:12:06+00:00

One of the beautiful things about UTF-8 is that if you only care about the ASCII subset, your code doesn’t need to change at all. The non-ASCII characters get encoded to multi-byte sequences where all of the bytes have the upper bit set, keeping them out of the ASCII range themselves. Your CR/LF replacement should work without modification.
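To make that concrete, here is a minimal sketch of the same in-place loop applied to UTF-8 bytes. The name `Utf8ToUnix` and the choice to return the new length are assumptions made for illustration, not part of the original function. It never inspects character boundaries, because no byte of a multi-byte sequence can equal `'\r'` or `'\n'`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post): collapses CR LF, LF CR,
   and lone CR to LF in place and returns the new length in bytes.
   Safe for ASCII and UTF-8 alike: every byte of a multi-byte UTF-8 sequence
   is >= 0x80, so it can never be mistaken for '\r' (0x0D) or '\n' (0x0A). */
size_t Utf8ToUnix(char *psz)
{
    size_t i = 0, j = 0;            /* i = read index, j = write index */
    while (psz[i] != '\0')
    {
        char ch = psz[i++];
        if (ch == '\r')
        {
            if (psz[i] == '\n')
                ++i;                /* CR LF -> LF */
            ch = '\n';              /* lone CR -> LF */
        }
        else if (ch == '\n' && psz[i] == '\r')
        {
            ++i;                    /* LF CR -> LF, as the original function does */
        }
        assert(j < i);              /* the write index never passes the read index */
        psz[j++] = ch;              /* all other bytes are copied through unchanged */
    }
    psz[j] = '\0';
    return j;
}
```

Because the output is never longer than the input, the conversion can stay in place, and multi-byte characters are moved byte by byte without ever being split.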

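UTF-16 has the same property. A character that fits in a single 16-bit code unit never collides with the code units of a surrogate pair, because surrogates occupy the reserved range 0xD800 to 0xDFFF. CR (0x000D) and LF (0x000A) therefore only ever appear as themselves, and the same loop works on 16-bit units instead of bytes.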
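As a rough sketch of the UTF-16 case, the code below uses `uint16_t` as a stand-in for a 16-bit `wchar_t` (an assumption; on Windows you would likely use `wchar_t` directly). `Utf16CharUnits` and `Utf16ToUnix` are hypothetical names. The first helper shows how a "length of this character" check could look if you ever do need to step by whole characters; the line-ending loop itself does not need it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of 16-bit code units used by the character
   starting at p. A high surrogate (0xD800-0xDBFF) followed by a low
   surrogate (0xDC00-0xDFFF) forms one character of 2 units; anything else
   is treated as 1 unit. p must point into a null-terminated string. */
size_t Utf16CharUnits(const uint16_t *p)
{
    if (p[0] >= 0xD800 && p[0] <= 0xDBFF &&
        p[1] >= 0xDC00 && p[1] <= 0xDFFF)
        return 2;
    return 1;
}

/* Hypothetical helper: collapses CR LF, LF CR, and lone CR to LF in place
   and returns the new length in code units. Surrogates can never equal
   0x000D or 0x000A, so pairs are copied through intact unit by unit. */
size_t Utf16ToUnix(uint16_t *psz)
{
    size_t i = 0, j = 0;            /* i = read index, j = write index */
    while (psz[i] != 0)
    {
        uint16_t ch = psz[i++];
        if (ch == 0x000D)
        {
            if (psz[i] == 0x000A)
                ++i;                /* CR LF -> LF */
            ch = 0x000A;            /* lone CR -> LF */
        }
        else if (ch == 0x000A && psz[i] == 0x000D)
        {
            ++i;                    /* LF CR -> LF */
        }
        psz[j++] = ch;
    }
    psz[j] = 0;
    return j;
}
```

For example, a buffer holding `a`, a surrogate pair, CR LF, `é`, CR, `b` comes out as `a`, the same pair, LF, `é`, LF, `b`, with the pair untouched.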
