Editorial Team
Asked: June 12, 2026 (2026-06-12T15:28:21+00:00)

I am attempting to use the re module in Python 2.7.3 with Unicode encoded


I am attempting to use the re module in Python 2.7.3 with Unicode-encoded Devnagari text. I have added from __future__ import unicode_literals to the top of my code, so all string literals should be unicode objects.

However, I am running into some odd problems with Python’s regex matching. For instance, consider this name: “किशोरी”. This is a (mis-spelled) name, in Hindi, entered by one of my users. Any Hindi reader would recognise this as a word.

The following returns a match, as it should:

    re.search("^[\w\s][\w\s]*","किशोरी",re.UNICODE)

But this does not:

    re.search("^[\w\s][\w\s]*$","किशोरी",re.UNICODE)

Some spelunking revealed that only one character in this string, character 0915 (क), is recognised as falling within the \w character class. This is incorrect, as the Unicode Character Database file on “derived core properties” lists other characters (I have not checked all) in this string as alphabetic ones – as indeed they are.

Is this just a bug in Python’s implementation? I could get around this by manually defining all the Devnagari alphanumeric characters as a character range, but that would be painful. Or am I doing something wrong?

1 Answer

  1. Editorial Team
    Added an answer on June 12, 2026 at 3:28 pm

    It is a bug in the re module and it is fixed in the regex module:

    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import unicodedata
    import re
    import regex  # $ pip install regex
    
    word = "किशोरी"
    
    
    def test(re_):
        assert re_.search("^\\w+$", word, flags=re_.UNICODE)
    
    print([unicodedata.category(cp) for cp in word])
    print(" ".join(ch for ch in regex.findall("\\X", word)))
    assert all(regex.match("\\w$", c) for c in ["a", "\u093f", "\u0915"])
    
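    test(regex)  # passes: regex accepts the vowel signs as word characters
    test(re)  # fails with AssertionError: re does not treat the vowel signs (category Mc) as word characters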
    

    The output shows that there are 6 codepoints in "किशोरी", but only 3 user-perceived characters (extended grapheme clusters). It would be wrong to break a word inside a character. Unicode Text Segmentation says:
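The gap between the codepoint count and the count of user-perceived characters can be reproduced with the standard library alone. The following is a minimal sketch, not a full implementation of Unicode Text Segmentation: the helper `rough_clusters` is a hypothetical name, and it assumes that attaching every combining mark (general category starting with `M`) to the preceding codepoint is a good enough approximation for this word.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import unicodedata

# The word from the question, "किशोरी", written with escapes so the
# individual codepoints are visible.
word = "\u0915\u093f\u0936\u094b\u0930\u0940"


def rough_clusters(text):
    """Group codepoints by attaching each combining mark to its base.

    Assumption: this is a simplification of extended grapheme clusters,
    sufficient for this example but not for arbitrary text.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # a mark joins the preceding cluster
        else:
            clusters.append(ch)  # a base character starts a new cluster
    return clusters


categories = [unicodedata.category(ch) for ch in word]
clusters = rough_clusters(word)

print(categories)                 # general category of each codepoint
print(len(word), len(clusters))   # codepoints vs. approximate clusters
```

Each consonant here is a letter (`Lo`) and each vowel sign is a spacing combining mark (`Mc`), which is why a letter-only definition of a word character stops after the first codepoint.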

    Word boundaries, line boundaries, and sentence boundaries should not
    occur within a grapheme cluster: in other words, a grapheme cluster
    should be an atomic unit with respect to the process of determining
    these other boundaries.

    here and further emphasis is mine

    A word boundary \b is defined as a transition from \w to \W (or in reverse) in the docs:

    Note that formally, \b is defined as the boundary between a \w and a
    \W character (or vice versa), or between \w and the beginning/end of
    the string, …

    Therefore either all codepoints that form a single character are \w or they are all \W.
    In this case, with the regex module, "किशोरी" matches ^\w{6}$.


    From the docs for \w in Python 2:

    If UNICODE is set, this will match the characters [0-9_] plus
    whatever is classified as alphanumeric in the Unicode character
    properties database.

    in Python 3:

    Matches Unicode word characters; this includes most characters that
    can be part of a word in any language, as well as numbers and the
    underscore.

    From regex docs:

    Definition of ‘word’ character (issue #1693050):

    The definition of a ‘word’ character has been expanded for Unicode. It now conforms to the Unicode specification at
    http://www.unicode.org/reports/tr29/. This applies to \w, \W, \b and
    \B.

    According to unicode.org U+093F (DEVANAGARI VOWEL SIGN I) is alnum and alphabetic so regex is also correct to consider it \w even if we follow definitions that are not based on word boundaries.
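If installing the regex module is not an option, one possible workaround is to test the codepoints against their general categories instead of relying on \w. This is only a sketch under stated assumptions: `is_word_char` and `is_word` are hypothetical helpers, and treating letters (`L*`), marks (`M*`), numbers (`N*`) and connector punctuation (`Pc`, which includes the underscore) as word characters is an approximation of the Unicode definition, not the exact rule either module applies.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import unicodedata


def is_word_char(ch):
    """Return True for letters, marks, numbers and connector punctuation.

    Assumption: this category set approximates a Unicode-aware \\w.
    """
    category = unicodedata.category(ch)
    return category[0] in "LMN" or category == "Pc"


def is_word(text):
    """Return True if text is non-empty and made only of word characters."""
    return bool(text) and all(is_word_char(ch) for ch in text)


name = "\u0915\u093f\u0936\u094b\u0930\u0940"  # "किशोरी"
print(is_word(name))
print(is_word("\u0915\u093f \u0936\u094b"))  # contains a space
```

Because marks are accepted alongside letters, a base consonant and its vowel sign are always classified the same way, which keeps the check from splitting a user-perceived character.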
