I came across this problem while working on the Python Challenge, number 10 to be exact. I decided to try to solve it using regexes: pulling out the repeating sequences, counting their length, and building the next item in the sequence from that.
So the regex I developed was: '(\d)\1*'
It worked well on the online regex tester, but when using it in my script it didn’t perform the same:
import re
regex = re.compile('(\d)\1*')
text = '111122223333'
re.findall(regex, text)
> ['1', '1', '1', '1', '2', '2', '2',...]
And so on and so forth. So I learn about raw string notation (the r prefix) from the re module docs for Python, which brings me to my first question: can someone please explain what exactly this does? The doc described it as reducing the need to escape backslashes, but it doesn’t appear to be required for simpler regexes such as \d+ and I don’t understand why.
So I change my regex to r'(\d)\1*' and now try to use findall() to make a list of the sequences. And I get:
> ['1', '2', '3']
Very confused again. I still don’t understand this. Help please?
To get around this, I decided to use finditer() instead:
[m.group() for m in regex.finditer(text)]
> ['1111', '2222', '3333']
And I get what I’ve been looking for. Then, based on this thread, I try findall() with a group around the whole regex: r'((\d)\2*)'.
I end up getting:
> [('1111', '1'), ('2222', '2'), ('3333', '3')]
At this point I’m all kinds of confused. I know that this result has something to do with multiple groups, but I’m just not sure.
Also, this is my first time posting so I apologize if my etiquette isn’t correct. Please feel free to correct me on that as well. Thanks!
Since this is a challenge, I won’t give you a complete answer. You are on the right track, however.
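The behaviours described in the question can be reproduced with a short sketch. It is only an illustration, not a solution to the challenge: the names broken, fixed, whole and first are made up for this example, and the non-raw pattern is spelled out with explicit escapes so that it shows exactly which characters the regex engine receives.

```python
import re

text = '111122223333'

# In an ordinary string literal, '\1' is an octal escape: one character, chr(1).
# In a raw literal the backslash survives, so the regex engine sees a backreference.
assert '\1' == chr(1) and len('\1') == 1
assert len(r'\1') == 2

# What the engine received without the r prefix:
# a digit followed by zero or more chr(1) characters.
# ('\d' survived only because Python has no string escape with that letter.)
broken = re.compile('(\\d)\x01*')

# With the r prefix: a digit followed by repeats of that same digit.
fixed = re.compile(r'(\d)\1*')

broken_result = broken.findall(text)  # one entry per digit
fixed_result = fixed.findall(text)    # contents of group 1 only

# finditer yields match objects: group(0) is the whole match,
# group(1) is whatever the first parenthesised group captured.
whole = [m.group(0) for m in fixed.finditer(text)]
first = [m.group(1) for m in fixed.finditer(text)]

print(broken_result)
print(fixed_result)
print(whole)
print(first)
```

Run as written, the first list has twelve single-digit entries, the second and fourth are ['1', '2', '3'], and the third is ['1111', '2222', '3333'], matching the outputs reported in the question.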
The finditer method returns MatchObject instances. You want to look at the .group() method on these and read the documentation carefully. Think about what the difference is between .group(0) and .group(1) there; plain .group() is the same as .group(0).
As for the \d escape character; because that particular escape combination has no meaning as a Python string escape character, Python ignores it and leaves it as a backslash and letter d. It would indeed be better to use the r'' literal string format, as it would prevent nasty surprises when you do want to use a regular expression character set that also happens to be an escape sequence Python does recognize. See the Python documentation on string literals for more information.
Your .findall() with the r'((\d)\2*)' expression returns 2 elements per match as you have 2 groups in your pattern; the outer, whole group matching (\d)\2* and the inner group matching \d. From the .findall() documentation: