I am using the POSIX C regex library (regcomp/regexec) in my search application. The application supports different languages, including those that use multi-byte characters. I’m encountering a problem when using the word boundary metacharacter (\b). For single-byte strings, it works just fine, e.g.:
‘\bpaper\b’ matches ‘paper’
However, if the regex and the query string are multi-byte, it doesn’t seem to work correctly, e.g.:
‘\b紙張\b’ doesn’t match ‘紙張’
Am I missing something? Any help would be highly appreciated.
Requested Info:
- Programming Language: C
- Regex Library: GNU C (regex.h)
Thanks.
What is “multi-byte” in this context? A string encoded into UTF-8 bytes? A locale-specific multibyte encoding such as GB?
If you’re not dealing with wide (Unicode) strings natively, you can’t expect any more support for non-ASCII characters than just detecting they’re there. POSIX regex doesn’t specify any character classes for bytes outside the ASCII range, so it doesn’t know that any of the bytes in ‘\xe7\xb4\x99’ (the UTF-8 representation of ‘紙’) could be considered word-letters; hence it sees no word boundaries.
What constitutes a letter or a word in Unicode is a more involved question than simple ASCII regex can cope with. (And obviously, what constitutes a ‘word’ in Chinese is arguable in itself.) If all you want to detect is plain old spaces, you could do that explicitly:
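A minimal sketch of that explicit approach in C, for illustration only. The helper name `matches_delimited` is hypothetical, and it assumes that the word is UTF-8 (or plain ASCII) with no regex metacharacters in it, and that the program runs in the default "C" locale so the regex engine treats the input as plain bytes. Instead of \b, the pattern requires whitespace or the start/end of the string on each side of the word:

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper, not part of regex.h.
 * Returns 1 if `word` occurs in `text` with whitespace or the string
 * start/end on both sides, 0 if it does not, -1 on error.
 * Assumes `word` contains no regex metacharacters. */
int matches_delimited(const char *word, const char *text)
{
    char pattern[256];
    regex_t re;
    int n, rc;

    assert(word != NULL && text != NULL);

    /* Spell the "boundary" out explicitly instead of relying on \b. */
    n = snprintf(pattern, sizeof pattern,
                 "(^|[[:space:]])%s([[:space:]]|$)", word);
    if (n < 0 || (size_t)n >= sizeof pattern)
        return -1;              /* word too long for the buffer */

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;              /* pattern failed to compile */

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

Called as `matches_delimited("paper", "newspaper")`, this should report no match, while `matches_delimited("紙張", "紙張")` should match because the pattern is anchored on the string start and end rather than on a word-character test. It only helps for text that actually separates words with whitespace, which Chinese text usually does not.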