I wrote a function that extends isalnum to recognize UTF-8 encoded umlauts.
Is there a more elegant way to solve this?
The code is as follows:
bool isalnumlaut(const char character) {
    int cr = (int) (unsigned char) character;
    if (isalnum(character)
        || cr == 195 // UTF-8
        || cr == 132 // Ä
        || cr == 164 // ä
        || cr == 150 // Ö
        || cr == 182 // ö
        || cr == 156 // Ü
        || cr == 188 // ü
        || cr == 159 // ß
        ) {
        return true;
    } else {
        return false;
    }
}
EDIT:
I have now tested my solution several times, and it seems to do the job for my purposes. Any strong objections?
Your code doesn’t do what you’re claiming.
The UTF-8 representation of Ä is two bytes: 0xC3, 0x84. A lone byte with a value above 0x7F is meaningless in UTF-8.
Some general suggestions:
Unicode is large. Consider using a library that has already handled the issues you’re seeing, such as ICU.
It doesn’t often make sense for a function to operate on a single code unit or code point. It makes much more sense to have functions that operate on either ranges of code points or single glyphs (see here for definitions of those terms).
Your concept of alpha-numeric is likely to be underspecified for a character set as large as the Universal Character Set; do you want to treat the characters in the Cyrillic alphabet as alphanumerics? Unicode’s concept of what is alphabetic may not match yours – especially if you haven’t considered it.
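To illustrate the point about two-byte sequences, here is a minimal sketch in C of a check that looks at both bytes together instead of testing each byte on its own. The names alnumlaut_len and is_alnumlaut_string are hypothetical, and the sketch assumes the input is a NUL-terminated UTF-8 string and that only ASCII alphanumerics plus Ä ä Ö ö Ü ü ß should be accepted. It is not a general Unicode solution; a library such as ICU is still the better choice for anything broader.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns the number of bytes occupied by the
 * character starting at s if it is an ASCII alphanumeric (1) or one of
 * the characters Ä ä Ö ö Ü ü ß encoded in UTF-8 (2); returns 0 otherwise.
 * Assumes s points into a NUL-terminated string. */
static size_t alnumlaut_len(const char *s)
{
    const unsigned char lead = (unsigned char)s[0];

    /* Only pass ASCII values to isalnum, so the result does not depend
     * on how the locale treats bytes above 0x7F. */
    if (lead < 0x80)
        return isalnum(lead) ? 1 : 0;

    /* All seven accepted characters share the lead byte 0xC3. */
    if (lead != 0xC3)
        return 0;

    /* Safe to read s[1]: s[0] is not NUL, so s[1] is at worst the
     * terminator, which falls through to the default case. */
    switch ((unsigned char)s[1]) {
    case 0x84: /* Ä U+00C4 */
    case 0xA4: /* ä U+00E4 */
    case 0x96: /* Ö U+00D6 */
    case 0xB6: /* ö U+00F6 */
    case 0x9C: /* Ü U+00DC */
    case 0xBC: /* ü U+00FC */
    case 0x9F: /* ß U+00DF */
        return 2;
    default:
        return 0;
    }
}

/* Hypothetical helper: returns true if every character in s is an ASCII
 * alphanumeric or one of the seven characters listed above. */
static bool is_alnumlaut_string(const char *s)
{
    while (*s != '\0') {
        const size_t n = alnumlaut_len(s);
        if (n == 0)
            return false;
        s += n; /* advance by one whole character, not one byte */
    }
    return true;
}
```

Unlike the single-byte version, this rejects a lone 0xC3 or a lone 0x84, and it rejects other characters that happen to start with 0xC3, such as é (0xC3, 0xA9).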