I’m matching identifiers, but now I have a problem: my identifiers are allowed to

Question

0

Asked: May 10, 2026 (2026-05-10T18:16:08+00:00)


I’m matching identifiers, but now I have a problem: my identifiers are allowed to contain Unicode characters, so the old way of doing things is no longer enough:

t_IDENTIFIER = r'[A-Za-z](\\.|[A-Za-z_0-9])*'

In my markup language parser I match Unicode characters by allowing every character except those I explicitly use, because my markup language only has two or three characters that I need to escape that way.

How do I match all Unicode characters with Python regexes and PLY? Also, is this a good idea at all?

I’d like to let people use names like Ω » « ° foo² väli π as identifiers (variable names and such) in their programs. Heck, I want people to be able to write programs in their own language if it’s practical! Anyway, Unicode is supported in a wide variety of places nowadays, and that should spread.

Edit: POSIX character classes don’t seem to be recognised by Python regexes.

>>> import re
>>> item = re.compile(r'[[:word:]]')
>>> print item.match('e')
None

Edit: To explain better what I need: I need a regex that matches all printable Unicode characters but no ASCII characters at all.

Edit: r'\w' does some of what I want, but it does not match « », and I also need a regex that does not match digits.


1 Answer

score 0 · Answer 1 · 2026-05-10T18:16:09+00:00

The re module supports the \w syntax, which the documentation describes as follows:

If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
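To make the quoted behaviour concrete, here is a minimal sketch. It assumes Python 3, where str patterns are Unicode-aware by default, so the UNICODE flag does not have to be set explicitly. The sample characters and the names `word`, `samples` and `results` are only illustrative.

```python
import re

# Assumption: Python 3, where \w on a str pattern already uses Unicode properties.
word = re.compile(r'\w')

# A few characters from the question, plus a digit and an underscore.
samples = ['ö', 'π', 'Ω', '9', '_', '«', '»', '°']

# Map each sample character to whether \w matches it.
results = {ch: word.fullmatch(ch) is not None for ch in samples}

for ch, matched in results.items():
    print(repr(ch), 'matches \\w' if matched else 'does not match \\w')
```

Letters, digits and the underscore match, while punctuation and symbols such as « » ° do not. That is why \w alone cannot cover every name listed in the question.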

Therefore, the following example shows how to match Unicode identifiers:

>>> import re
>>> m = re.compile('(?u)[^\W0-9]\w*')
>>> m.match('a')
<_sre.SRE_Match object at 0xb7d75410>
>>> m.match('9')
>>> m.match('ab')
<_sre.SRE_Match object at 0xb7c258e0>
>>> m.match('a9')
<_sre.SRE_Match object at 0xb7d75410>
>>> m.match('unicöde')
<_sre.SRE_Match object at 0xb7c258e0>
>>> m.match('ödipus')
<_sre.SRE_Match object at 0xb7d75410>

So the expression you are looking for is: (?u)[^\W0-9]\w*
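As a rough sketch of how this could be carried over to Python 3 and extended to the symbols mentioned in the question: the names `PLAIN`, `EXTENDED` and `is_identifier`, and the particular set of extra symbols « » °, are assumptions made for illustration, not part of the re module or of PLY.

```python
import re

# The expression from above; in Python 3 the (?u) flag is implied for str patterns.
PLAIN = re.compile(r'[^\W0-9]\w*')

# Hypothetical extension: also accept « » ° anywhere in an identifier.
# First character: a word character other than 0-9, or one of the extra symbols.
# Remaining characters: any word character or one of the extra symbols.
EXTENDED = re.compile(r'(?:[^\W0-9]|[«»°])[\w«»°]*')

def is_identifier(text, pattern=EXTENDED):
    """Return True if the whole string is matched by the identifier pattern."""
    return pattern.fullmatch(text) is not None

for candidate in ['väli', 'π', 'Ω', 'a9', '9lives', '«', '°', 'foo bar']:
    print(repr(candidate), is_identifier(candidate))
```

In a PLY lexer the same pattern string could presumably be assigned to t_IDENTIFIER in place of the ASCII-only one. That is untested here and worth checking against the PLY version in use.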
