I’d like to iterate through Unicode characters, gobbling up all combining characters that follow

Question

Asked: June 2, 2026 (2026-06-02T23:46:53+00:00)

I’d like to iterate through Unicode characters, gobbling up all combining characters that follow the initial code point.

This is what I have so far, but it behaves inconsistently for some Unicode sequences I tried. For example, when I pass it “a̔” (U+0061 LATIN SMALL LETTER A followed by U+0314 COMBINING REVERSED COMMA ABOVE), it sees two characters rather than one. Other sequences, like “e︠” (U+0065 LATIN SMALL LETTER E followed by U+FE20 COMBINING LIGATURE LEFT HALF), are seen as one character.

int COMBINING[] = {
    0x0300, 0x036F,
    0x1DC0, 0x1DFF,
    0x20D0, 0x20FF,
    0xFE20, 0xFE2F,
    0 //sentinel
};

utf8_index_t ut_nextchar(utf8_t source, utf8_index_t curr)
{
    int c = decode_cp(source, &curr);
    int comb = 0;
    if (c == 0)
        return -1;
    while (COMBINING[comb] != 0)
    {
        for (comb = 0; COMBINING[comb] != 0; comb += 2)
        {
            if (c >= COMBINING[comb] && c <= COMBINING[comb + 1])
            {
                c = decode_cp(source, &curr);
                if (c == 0)
                    return -1;
                break;
            }
        }
    }
    return curr;
}

1 Answer

Editorial Team · Answer 1 · 2026-06-02T23:46:54+00:00

Actually, Unicode characters map mostly 1:1 to Unicode code points. What you’re interested in are Unicode grapheme clusters (defined in Unicode Standard Annex #29), which correspond to so-called user-perceived characters.

You can find my implementation of the algorithm, including property data, on Bitbucket.

If you’re not interested in the full algorithm, you can use

gc_break_property(c) == GC_BP_Extend

to check for characters with property Grapheme_Extend and

gc_break_property(c) & GC_FLAG_POSTFIX

if you want to include spacing marks as well.
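For a simplified version that avoids the full algorithm, the key point is to peek at the next code point and advance past it only if it is a combining mark. The loop in the question instead tests the code point it has already consumed, so the base character is compared against the ranges rather than the one that follows. Below is a minimal sketch of the look-ahead approach. It makes several assumptions: `next_cluster` and `is_combining` are hypothetical names, `decode_cp` is a stand-in for the question's decoder (whose real definition is not shown) and assumes well-formed, NUL-terminated UTF-8, and the range table is the one from the question, which covers only part of the Grapheme_Extend property.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive code point ranges copied from the question. This is only a
   subset of the characters that carry the Grapheme_Extend property. */
static const unsigned long COMBINING_RANGES[][2] = {
    {0x0300, 0x036F},
    {0x1DC0, 0x1DFF},
    {0x20D0, 0x20FF},
    {0xFE20, 0xFE2F},
};

/* Returns 1 if cp falls into one of the ranges above, 0 otherwise. */
int is_combining(unsigned long cp)
{
    size_t n = sizeof COMBINING_RANGES / sizeof COMBINING_RANGES[0];
    for (size_t i = 0; i < n; i++)
        if (cp >= COMBINING_RANGES[i][0] && cp <= COMBINING_RANGES[i][1])
            return 1;
    return 0;
}

/* Hypothetical stand-in for the question's decode_cp: decodes one code
   point from well-formed UTF-8 at s[*pos] and advances *pos. Returns 0 at
   the terminating NUL without advancing. No validation is performed. */
unsigned long decode_cp(const char *s, size_t *pos)
{
    const unsigned char *p = (const unsigned char *)s + *pos;
    unsigned long cp;
    int extra;

    if (p[0] == 0)
        return 0;
    if (p[0] < 0x80)      { cp = p[0];        extra = 0; }
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; extra = 1; }
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; extra = 2; }
    else                  { cp = p[0] & 0x07; extra = 3; }
    for (int i = 1; i <= extra; i++)
        cp = (cp << 6) | (p[i] & 0x3F);
    *pos += (size_t)extra + 1;
    return cp;
}

/* Returns the index just past the code point at pos and any combining
   code points that follow it, or -1 if pos is at the end of the string. */
long next_cluster(const char *s, size_t pos)
{
    assert(s != NULL);
    if (decode_cp(s, &pos) == 0)
        return -1;
    for (;;) {
        size_t peek = pos;              /* look ahead without committing */
        unsigned long cp = decode_cp(s, &peek);
        if (cp == 0 || !is_combining(cp))
            break;                      /* the next cluster starts here */
        pos = peek;                     /* absorb the combining mark */
    }
    return (long)pos;
}
```

With this sketch, `next_cluster("a\xCC\x94", 0)` returns 3, so U+0061 followed by U+0314 is treated as a single unit. Swapping `is_combining` for a property lookup such as the `gc_break_property` checks above would extend it beyond the four hard-coded ranges.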
