I am attempting to use the re module in Python 2.7.3 with Unicode encoded

Asked: June 12, 2026

I am attempting to use the re module in Python 2.7.3 with Unicode-encoded Devanagari text. I have added from __future__ import unicode_literals to the top of my code, so all string literals should be unicode objects.

However, I am running into some odd problems with Python’s regex matching. For instance, consider this name: “किशोरी”. This is a (mis-spelled) name, in Hindi, entered by one of my users. Any Hindi reader would recognise this as a word.

The following returns a match, as it should:

re.search("^[\w\s][\w\s]*","किशोरी",re.UNICODE)

But this does not:

re.search("^[\w\s][\w\s]*$","किशोरी",re.UNICODE)

Some spelunking revealed that only one character in this string, character 0915 (क), is recognised as falling within the \w character class. This is incorrect, as the Unicode Character Database file on “derived core properties” lists other characters (I have not checked all) in this string as alphabetic ones – as indeed they are.
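For reference, a minimal sketch of this kind of per-character check, using only the standard unicodedata module. The helper name describe is hypothetical, and the word is written with escapes so the snippet does not depend on the source file encoding. It should show the three letters as category Lo and the three vowel signs as category Mc (spacing combining marks).

```python
import unicodedata

# The name from the question, spelled out as escaped code points.
word = u"\u0915\u093f\u0936\u094b\u0930\u0940"


def describe(text):
    """Return (code point, general category, character name) per character.

    Hypothetical helper for illustration; not part of any library.
    """
    return [
        (u"U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


for code_point, category, name in describe(word):
    print("%s %s %s" % (code_point, category, name))
```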

Is this just a bug in Python’s implementation? I could get around this by manually defining all the Devanagari alphanumeric characters as a character range, but that would be painful. Or am I doing something wrong?

1 Answer

Editorial Team · Answer 1 · 2026-06-12T15:28:23+00:00

It is a bug in the re module and it is fixed in the regex module:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import unicodedata
import re
import regex  # $ pip install regex

word = "किशोरी"


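def test(re_):
    # The whole word should consist of word characters only.
    assert re_.search("^\\w+$", word, flags=re_.UNICODE)

# General category of each code point: letters (Lo) alternate with vowel signs (Mc).
print([unicodedata.category(cp) for cp in word])
# \X matches one extended grapheme cluster (regex module only).
print(" ".join(regex.findall("\\X", word)))
# regex treats a Latin letter, a vowel sign, and a Devanagari letter all as \w.
assert all(regex.match("\\w$", c) for c in ["a", "\u093f", "\u0915"])

test(regex)
test(re)  # fails with AssertionError: the stdlib re rejects the vowel signs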

The output shows that there are 6 codepoints in "किशोरी", but only 3 user-perceived characters (extended grapheme clusters). It would be wrong to break a word inside a character. Unicode Text Segmentation says:

Word boundaries, line boundaries, and sentence boundaries should not
occur within a grapheme cluster: in other words, a grapheme cluster
should be an atomic unit with respect to the process of determining
these other boundaries.

(Emphasis here and below is mine.)
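The "6 codepoints, 3 user-perceived characters" count can be roughly reproduced without the regex module. The sketch below is only an approximation of extended grapheme clusters: it attaches every combining mark (general category M*) to the preceding character and ignores the other rules of Unicode Text Segmentation, so regex's \X remains the more reliable tool. The function name approximate_clusters is hypothetical.

```python
import unicodedata

# The word from the question, spelled out as escaped code points.
word = u"\u0915\u093f\u0936\u094b\u0930\u0940"


def approximate_clusters(text):
    """Group each combining mark with the character that precedes it.

    Rough stand-in for extended grapheme clusters; assumes marks are the
    only characters that extend a cluster, which is not true in general.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # mark: extend the current cluster
        else:
            clusters.append(ch)  # base character: start a new cluster
    return clusters


clusters = approximate_clusters(word)
print("%d code points, %d clusters" % (len(word), len(clusters)))
```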

A word boundary \b is defined as a transition from \w to \W (or in reverse) in the docs:

Note that formally, \b is defined as the boundary between a \w and a
\W character (or vice versa), or between \w and the beginning/end of
the string, …

Therefore either all codepoints that form a single character are \w or they are all \W.
In this case "किशोरी" matches ^\w{6}$.

From the docs for \w in Python 2:

If UNICODE is set, this will match the characters [0-9_] plus
whatever is classified as alphanumeric in the Unicode character
properties database.

in Python 3:

Matches Unicode word characters; this includes most characters that
can be part of a word in any language, as well as numbers and the
underscore.

From regex docs:

Definition of ‘word’ character (issue #1693050):

The definition of a ‘word’ character has been expanded for Unicode. It now conforms to the Unicode specification at
http://www.unicode.org/reports/tr29/. This applies to \w, \W, \b and
\B.

According to unicode.org, U+093F (DEVANAGARI VOWEL SIGN I) is both alnum and alphabetic, so regex is also correct to consider it \w even if we follow definitions that are not based on word boundaries.
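If installing regex is not an option, one possible stdlib-only workaround is to skip \w and test general categories directly. This is a sketch under a stated assumption: a "word character" is taken to be an underscore or any code point whose category is a letter (L*), mark (M*), or number (N*). That is an approximation, not the exact definition used by re or by regex, and the names is_word_char and is_word are hypothetical.

```python
import unicodedata

# Assumption: letters, marks, and numbers count as word characters.
WORD_CATEGORY_PREFIXES = (u"L", u"M", u"N")


def is_word_char(ch):
    """True for an underscore or a letter/mark/number code point."""
    return ch == u"_" or unicodedata.category(ch)[0] in WORD_CATEGORY_PREFIXES


def is_word(text):
    """Rough equivalent of matching ^\\w+$ with marks allowed."""
    return len(text) > 0 and all(is_word_char(ch) for ch in text)


# The word from the question, spelled out as escaped code points.
word = u"\u0915\u093f\u0936\u094b\u0930\u0940"
print(is_word(word))
```

Unlike the pattern in the question, this check does not accept whitespace; a name containing spaces would need to be split first or have whitespace allowed explicitly.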
