I am attempting to use the re module in Python 2.7.3 with Unicode-encoded Devanagari text. I have added from __future__ import unicode_literals to the top of my code, so all string literals should be unicode objects.
However, I am running into some odd problems with Python’s regex matching. For instance, consider this name: “किशोरी”. This is a (mis-spelled) name, in Hindi, entered by one of my users. Any Hindi reader would recognise this as a word.
The following returns a match, as it should:
re.search("^[\w\s][\w\s]*","किशोरी",re.UNICODE)
But this does not:
re.search("^[\w\s][\w\s]*$","किशोरी",re.UNICODE)
Some spelunking revealed that only one character in this string, character 0915 (क), is recognised as falling within the \w character class. This is incorrect, as the Unicode Character Database file on “derived core properties” lists other characters (I have not checked all) in this string as alphabetic ones – as indeed they are.
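A per-codepoint check of this kind can be sketched as follows. It is only an illustration, written for Python 3: the names `WORD` and `word_char_report` are made up for this example, and the word is spelled with escapes so that its six codepoints are explicit.

```python
import re

# The name from the question, written as escapes: one entry per codepoint.
WORD = "\u0915\u093f\u0936\u094b\u0930\u0940"


def word_char_report(word):
    # Test every codepoint on its own against the \w class with re.UNICODE set.
    pattern = re.compile(r"^\w$", re.UNICODE)
    return [("U+%04X" % ord(ch), pattern.match(ch) is not None) for ch in word]


# Print each codepoint next to whether re treats it as a word character.
for codepoint, is_word_char in word_char_report(WORD):
    print(codepoint, is_word_char)
```

Any codepoint reported as False here is one that makes the anchored pattern with `$` fail.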
Is this just a bug in Python’s implementation? I could get around this by manually defining all the Devanagari alphanumeric characters as a character range, but that would be painful. Or am I doing something wrong?
It is a bug in the `re` module and it is fixed in the `regex` module:

The output shows that there are 6 codepoints in `"किशोरी"`, but only 3 user-perceived characters (extended grapheme clusters). It would be wrong to break a word inside a character. Unicode Text Segmentation says (here and further, emphasis is mine):
A word boundary `\b` is defined as a transition from `\w` to `\W` (or in reverse) in the docs:

Therefore either all codepoints that form a single character are `\w` or they are all `\W`. In this case `"किशोरी"` matches `^\w{6}$`.

From the docs for `\w` in Python 2:

in Python 3:
From `regex` docs:

According to unicode.org, `U+093F` (DEVANAGARI VOWEL SIGN I) is alnum and alphabetic, so `regex` is also correct to consider it `\w` even if we follow definitions that are not based on word boundaries.
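To look at the codepoints discussed above, a minimal sketch along these lines may help. It assumes Python 3. The names `WORD`, `describe` and `regex_full_match` are invented for this illustration, and the third-party `regex` module is treated as optional (the helper returns `None` when it is not installed).

```python
import unicodedata

# The name from the question, written as escapes: six codepoints in total.
WORD = "\u0915\u093f\u0936\u094b\u0930\u0940"


def describe(word):
    # One (codepoint, general category, Unicode name) tuple per codepoint.
    return [
        ("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch))
        for ch in word
    ]


def regex_full_match(word):
    # Try the anchored six-character pattern with the third-party regex module.
    # Returns None when regex is not installed, otherwise True or False.
    try:
        import regex  # third-party: pip install regex
    except ImportError:
        return None
    return regex.search(r"^\w{6}$", word, flags=regex.UNICODE) is not None


if __name__ == "__main__":
    for row in describe(WORD):
        print(*row)
    print("regex full match:", regex_full_match(WORD))
```

The letters are in general category `Lo`, while the three vowel signs are spacing combining marks (`Mc`). A `\w` definition built only on letter and digit categories therefore misses half of the word.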