I wrote a small, naive regular expression that was supposed to find text inside parentheses:
re.search(r'\((.|\s)*\)', name)
I know this is not the best way to do it for a few reasons, but it was working just fine. What I am looking for is simply an explanation as to why for some strings this expression starts taking exponentially longer and then never finishes. Last night, after running this code for months, one of our servers suddenly got stuck matching a string similar to the following:
x (y) z
I’ve experimented with it and determined that the time taken doubles for every space in between the ‘y’ and ‘z’:
In [62]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (22 * ' ') + 'z')
1 loops, best of 3: 1.23 s per loop
In [63]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (23 * ' ') + 'z')
1 loops, best of 3: 2.46 s per loop
In [64]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (24 * ' ') + 'z')
1 loops, best of 3: 4.91 s per loop
But also that characters other than a space do not have the same effect:
In [65]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (24 * 'a') + 'z')
100000 loops, best of 3: 5.23 us per loop
Note: I am not looking for a better regular expression or another solution to this problem. We are no longer using it.
Catastrophic Backtracking
As CaffGeek’s answer correctly implies, the problem is due to one form of catastrophic backtracking. Both alternatives match a space (or tab), and the group is repeated an unlimited number of times, greedily. In addition, the dot matches the closing parenthesis, so once the opening paren is matched, the expression always runs to the very end of the string before it must painstakingly backtrack to find the closing paren. During this backtracking, the other alternative is tried at each position (and it also succeeds for spaces or tabs). Thus every possible combination of alternatives must be tried before the engine can backtrack one position. With n spaces after the closing paren that is on the order of 2^n attempts, which is exactly the doubling per space measured in the question. A character such as 'a' matches only the dot, so there is just one way to consume it and no blow-up.
The specific problem for the case where there is a matching close paren can be solved by simply making the star quantifier lazy (i.e. r'\((.|\s)*?\)'), but the runaway regex problem still exists for the non-matching case, where there is an opening paren with no matching close paren in the subject string.
The original regex is really, really bad! (It also fails to pair up closing parens correctly when there is more than one pair.)
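The behaviour described above can be checked with a short timing sketch. The helper name timed_search, the constants GREEDY and LAZY, and the space counts are all made up for illustration, and the counts are kept small so the script finishes quickly. Absolute timings will differ from machine to machine.

```python
import re
import time

GREEDY = re.compile(r'\((.|\s)*\)')   # the pattern from the question
LAZY = re.compile(r'\((.|\s)*?\)')    # same pattern with a lazy star


def timed_search(pattern, subject):
    """Run one search; return (matched text or None, elapsed seconds)."""
    start = time.perf_counter()
    match = pattern.search(subject)
    elapsed = time.perf_counter() - start
    return (match.group(0) if match else None), elapsed


if __name__ == "__main__":
    for n in (12, 14, 16):
        pad = " " * n
        # Matching case: a close paren exists, followed by n spaces.
        _, greedy_t = timed_search(GREEDY, "x (y)" + pad + "z")
        _, lazy_t = timed_search(LAZY, "x (y)" + pad + "z")
        # Non-matching case: no close paren at all.
        _, lazy_fail_t = timed_search(LAZY, "x (y" + pad + "z")
        print(f"n={n:2d}  greedy={greedy_t:.4f}s  "
              f"lazy={lazy_t:.6f}s  lazy, no ')'={lazy_fail_t:.4f}s")
```

If the explanation holds, the greedy column and the "no ')'" column should grow by roughly a factor of four per row (two extra spaces each), while the lazy column for the matching case stays flat.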
The correct expression to match innermost parentheses (which is very fast for both matching and non-matching cases), is of course:
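One widely used pattern for this job is a negated character class, \([^()]*\). The sketch below assumes that pattern (it is not quoted from the answer text) and uses a hypothetical helper named innermost_parens. Because [^()] can consume each character in only one way, the engine has no alternative combinations to retry, so a failed match costs roughly linear time instead of exponential time.

```python
import re

# Assumed pattern: an open paren, any run of non-paren characters,
# then a close paren. Not quoted from the answer above.
INNERMOST = re.compile(r'\([^()]*\)')


def innermost_parens(subject):
    """Return every innermost parenthesised chunk found in subject."""
    return INNERMOST.findall(subject)


if __name__ == "__main__":
    print(innermost_parens("x (y) z"))
    print(innermost_parens("a (b (c) d) e"))
    # An unmatched open paren followed by many spaces fails quickly.
    print(innermost_parens("x (y" + " " * 5000 + "z"))
```

Note that the character class already covers whitespace, including newlines, so no separate \s alternative is needed.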
All regex authors should read MRE3!
This is all explained in great detail (with thorough examples and recommended best practices) in Jeffrey Friedl’s must-read-for-regex-authors: Mastering Regular Expressions (3rd Edition). I can honestly say that this is the most useful book I’ve ever read. Regular expressions are a very powerful tool but, like a loaded weapon, must be applied with great care and precision (or you will shoot yourself in the foot!).