I am trying to do this:
val = re.sub(r'\b' + u_word + r'\b', unicode(new_word), u_text)
(All strings are non-latin.)
It does not work at all.
Is it possible to find-replace non-latin words (whole words) in a non-latin text with regex?
How?
EDIT:
If you want to test, try these strings:
>>> u_word = u'αβ'
>>> u_text = u'αβγ αβ αβγδ δαβ'
>>> new_word = u'χχ'
>>> val = re.sub(r'\b' + u_word +r'\b', unicode(new_word), u_text)
>>> val
u'\u03b1\u03b2\u03b3 \u03b1\u03b2 \u03b1\u03b2\u03b3\u03b4 \u03b4\u03b1\u03b2'
>>> u_text
u'\u03b1\u03b2\u03b3 \u03b1\u03b2 \u03b1\u03b2\u03b3\u03b4 \u03b4\u03b1\u03b2'
>>>
You need to pass the re.UNICODE flag to re.sub, for example as the keyword argument flags=re.UNICODE (the flags keyword is available from Python 2.7).

\b is a word boundary. Without the re.UNICODE flag, a “word” contains only characters from the set [a-zA-Z0-9_], so αβ isn’t seen as a “word”. For more information, see the re documentation (specifically \b, \w, and re.UNICODE).

FYI: if new_word is already a unicode string (as in your example), unicode(new_word) is superfluous; it returns new_word unmodified. (Python 3 has no unicode(), which was removed because it’s no longer necessary.)
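The fix described above can be sketched roughly as follows, using the test strings from the question. The helper name replace_whole_word is hypothetical and only there for illustration. The calls to re.escape and the function-style replacement are extra precautions that the question did not ask for: they keep regex metacharacters in the word and backslashes in the replacement literal. In Python 3, str patterns are Unicode-aware by default, so the flag is redundant there but harmless.

```python
# -*- coding: utf-8 -*-
import re


def replace_whole_word(word, replacement, text):
    """Replace whole-word occurrences of `word` in `text`.

    Hypothetical helper for illustration; not part of the re module.
    """
    # re.UNICODE makes \b and \w use Unicode character properties,
    # so Greek letters count as word characters.
    # re.escape guards against regex metacharacters inside `word`.
    pattern = re.compile(r'\b' + re.escape(word) + r'\b', flags=re.UNICODE)
    # A function replacement keeps any backslashes in `replacement` literal.
    return pattern.sub(lambda match: replacement, text)


u_word = u'αβ'
u_text = u'αβγ αβ αβγδ δαβ'
new_word = u'χχ'

val = replace_whole_word(u_word, new_word, u_text)
print(val)  # αβγ χχ αβγδ δαβ
```

Only the standalone αβ is replaced. The occurrences inside αβγ, αβγδ, and δαβ are left alone because there is no word boundary on both sides of them.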