I’m working on a little Python script that is supposed to match a series of authors and I’m using the re-module for that. I came across something unexpected and I have been able to reduce it to the following very simple example:
>>> import re
>>> s = "$word1$, $word2$, $word3$, $word4$"
>>> word = r'\$(word\d)\$'
>>> m = re.match(word+'(?:, ' + word + r')*', s)
>>> m.groups()
('word1', 'word4')
So I’m defining a ‘basic’ regexp that matches the main parts of my input, with some recognizable features (in this case I used the $-signs), and then I try to match one word plus an optional list of additional words.
I’d have expected that m.groups() would’ve displayed:
>>> m.groups()
('word1', 'word2', 'word3', 'word4')
But apparently I’m doing something wrong. I’d like to know why this solution does not work and how to change it, such that I get the result I’m looking for. BTW, this is with Python 2.6.6 on a Linux machine, in case that matters.
Although your regex matches every $word#$, the second capture group is repeatedly overwritten, so it ends up holding only the last item matched. Let’s take a look at the debugger:
As you can see, there are only 2 capture groups: subpattern 1 and subpattern 2. Every time another $word#$ is found, subpattern 2 gets overwritten.
As for a potential solution, I would recommend using re.findall() instead of re.match():
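A minimal sketch of that suggestion, reusing s and word from the question (the names pattern and words are only illustrative). It also checks the compiled pattern's group count, which shows why m.groups() can never return more than two items here:

```python
import re

s = "$word1$, $word2$, $word3$, $word4$"
word = r'\$(word\d)\$'

# The original pattern contains exactly two capturing groups, no matter
# how many times the repeated part (...)* matches.
pattern = word + '(?:, ' + word + r')*'
print(re.compile(pattern).groups)   # 2

# A repeated group only keeps its last match, hence 'word4'.
m = re.match(pattern, s)
print(m.groups())                   # ('word1', 'word4')

# findall() scans the whole string and returns the captured group
# for every non-overlapping match of the single-word pattern.
words = re.findall(word, s)
print(words)                        # ['word1', 'word2', 'word3', 'word4']
```

Note that findall() does not check that the whole string has the expected comma-separated form. If that matters, one option is to keep the re.match() call as a validation step and use findall() only to extract the words.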