I’d like to iterate through Unicode characters, gobbling up all combining characters that follow the initial code point.
This is what I have so far, but it behaves inconsistently for some of the Unicode sequences I tried. For example, when I pass it “a̔” (U+0061 LATIN SMALL LETTER A followed by U+0314 COMBINING REVERSED COMMA ABOVE), it sees two characters rather than one. Other sequences, like “e︠” (U+0065 LATIN SMALL LETTER E followed by U+FE20 COMBINING LIGATURE LEFT HALF), are seen as one character. Here is the code:
    int COMBINING[] = {
        0x0300, 0x036F,
        0x1DC0, 0x1DFF,
        0x20D0, 0x20FF,
        0xFE20, 0xFE2F,
        0 //sentinel
    };

    utf8_index_t ut_nextchar(utf8_t source, utf8_index_t curr)
    {
        int c = decode_cp(source, &curr);
        int comb = 0;
        if (c == 0)
            return -1;
        while (COMBINING[comb] != 0)
        {
            for (comb = 0; COMBINING[comb] != 0; comb += 2)
            {
                if (c >= COMBINING[comb] && c <= COMBINING[comb + 1])
                {
                    c = decode_cp(source, &curr);
                    if (c == 0)
                        return -1;
                    break;
                }
            }
        }
        return curr;
    }
Actually, Unicode characters map mostly 1:1 to Unicode code points. What you’re interested in are Unicode grapheme clusters, which correspond to so-called user-perceived characters.
You can find my implementation of the algorithm, including property data, here at Bitbucket.
If you’re not interested in the full algorithm, you can use
to check for characters with property Grapheme_Extend and
if you want to include spacing marks as well.
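For the simplified approach, a minimal sketch of the lookahead loop might look like the following. It is not the full grapheme cluster algorithm: it reuses the four block ranges from the question, which only roughly approximate Grapheme_Extend. The names `decode_cp_sketch`, `is_combining` and `next_cluster_sketch` are hypothetical, and the decoder is a stand-in for the question's `decode_cp` that assumes well-formed, NUL-terminated UTF-8. The main difference from the code in the question is that the range test is applied to the code point that follows the base, and the index is advanced only when that code point is a combining mark.

```c
#include <stddef.h>

/* Inclusive [first, last] ranges copied from the question; 0 ends the table.
   These four blocks only approximate the Grapheme_Extend property. */
static const unsigned long COMBINING_RANGES[] = {
    0x0300, 0x036F,
    0x1DC0, 0x1DFF,
    0x20D0, 0x20FF,
    0xFE20, 0xFE2F,
    0
};

int is_combining(unsigned long cp)
{
    size_t i;
    for (i = 0; COMBINING_RANGES[i] != 0; i += 2) {
        if (cp >= COMBINING_RANGES[i] && cp <= COMBINING_RANGES[i + 1])
            return 1;
    }
    return 0;
}

/* Hypothetical stand-in for decode_cp: decodes one code point at *idx and
   advances *idx past it. Returns 0 at the end of the string. Assumes
   well-formed UTF-8; no validation is performed. */
unsigned long decode_cp_sketch(const char *s, long *idx)
{
    unsigned char b = (unsigned char)s[*idx];
    unsigned long cp;
    int extra;

    if (b == 0)
        return 0;
    if (b < 0x80)      { cp = b;        extra = 0; }
    else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
    else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
    else               { cp = b & 0x07; extra = 3; }
    ++*idx;
    /* Fold in continuation bytes, stopping early at an unexpected NUL. */
    while (extra-- > 0 && s[*idx] != 0) {
        cp = (cp << 6) | ((unsigned char)s[*idx] & 0x3F);
        ++*idx;
    }
    return cp;
}

/* Returns the index just past the base code point and any combining marks
   that follow it, or -1 if curr is already at the end of the string. */
long next_cluster_sketch(const char *s, long curr)
{
    long next;
    unsigned long cp = decode_cp_sketch(s, &curr);

    if (cp == 0)
        return -1;
    for (;;) {
        next = curr;                    /* look ahead without committing */
        cp = decode_cp_sketch(s, &next);
        if (cp == 0 || !is_combining(cp))
            break;                      /* next code point starts a new unit */
        curr = next;                    /* the mark belongs to this unit */
    }
    return curr;
}
```

With the inputs from the question, `next_cluster_sketch("a\xCC\x94", 0)` returns 3 and `next_cluster_sketch("e\xEF\xB8\xA0", 0)` returns 4, so in both cases the base letter and its combining mark are consumed together.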