I found a problem while testing my NLP system. I have a Java regex "(.*\\.\\s*)*Dendryt.*" and for the string "v Table of Contents List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . " it just doesn't stop computing.
It's clear that the complexity of this regex is very high, so I will try to refactor it. Do you have any suggestions for future regex development?
Thanks.
You’re running into catastrophic backtracking by repeating a group containing repeated quantifiers. The combinatorial explosion that follows will then (given enough input) lead to a (tada!) Stack Overflow.
Simplified, your regex tries to
(.*\.\s*): match any succession of characters including dots and spaces, followed by a dot, followed by zero or more spaces, then
(...)*: repeat this any number of times.
Dendryt: Only then it tries to match “Dendryt”.
Since this fails, the engine backtracks, trying a different permutation. The possibilities are nearly endless…
To illustrate, here’s a screenshot of RegexBuddy’s regex debugger on a simplified version of your data:
RegexBuddy Screen Shot http://img714.imageshack.us/img714/3275/screen017.png
The engine gives up after 1 million permutations.
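To get a feel for the growth without hanging your program, here is a minimal sketch that times the original pattern on very short dotted lines. The class name `BacktrackingDemo` and the helpers `dottedLine` and `timeFailedMatch` are made up for this illustration, and the timings depend on your JDK version and machine, so treat the numbers as indicative only.

```java
import java.util.regex.Pattern;

public class BacktrackingDemo {
    // The pattern from the question, written as a Java string literal.
    static final Pattern ORIGINAL = Pattern.compile("(.*\\.\\s*)*Dendryt.*");

    // Hypothetical helper: builds a short dotted line such as ". . . . ".
    static String dottedLine(int dots) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < dots; i++) {
            sb.append(". ");
        }
        return sb.toString();
    }

    // Hypothetical helper: times one failed match attempt, in milliseconds.
    static long timeFailedMatch(int dots) {
        String input = dottedLine(dots);
        long start = System.nanoTime();
        boolean matched = ORIGINAL.matcher(input).matches();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (matched) {
            throw new IllegalStateException("input contains no \"Dendryt\", so it must not match");
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        // Keep the inputs tiny: every extra dot multiplies the number of
        // ways the repeated group can split the text.
        for (int dots = 4; dots <= 12; dots += 2) {
            System.out.println(dots + " dots: " + timeFailedMatch(dots) + " ms");
        }
    }
}
```

The loop deliberately stops at 12 dots. The string in the question has far more, which is why the match never seems to finish.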
Your regex would be a little better like this (don’t forget to escape the backslashes when converting it to a Java string):
In this case the *+, a so-called possessive quantifier, will refuse to backtrack once it has matched. That way, the regex engine can fail much faster, but it’s still bad because (.*) matches anything, even the dots.
[…] is safe, unless your data can contain dots before the “dotted line bit”. All in all, please state your requirements a bit more clearly, then a better regex can be built.
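As a rough illustration of the possessive quantifier in Java, here is a minimal sketch. It assumes the only change is making the outer group possessive, i.e. `(.*\.\s*)*+Dendryt.*`. That exact pattern is my assumption, not necessarily the one meant above, and the class name `PossessiveDemo` is made up for the example.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // Assumption: the original pattern with the outer group made possessive (*+).
    // Note the doubled backslashes required in a Java string literal.
    static final Pattern POSSESSIVE = Pattern.compile("(.*\\.\\s*)*+Dendryt.*");

    static boolean matches(String input) {
        return POSSESSIVE.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Rebuild the problematic input from the question: a heading plus a dotted line.
        StringBuilder sb = new StringBuilder("v Table of Contents List of Tables ");
        for (int i = 0; i < 37; i++) {
            sb.append(". ");
        }
        String dotted = sb.toString();

        // Fails quickly: once the group has consumed the dotted line,
        // the engine may not go back and try other ways of splitting it.
        System.out.println(matches(dotted));                      // false

        // Still matches when "Dendryt" follows the last dot.
        System.out.println(matches("Intro. Dendryt rocks"));      // true

        // Side effect of never backtracking: a dot after "Dendryt" gets
        // swallowed by the group, so this no longer matches.
        System.out.println(matches("Intro. Dendryt rocks. End")); // false
    }
}
```

The last case is why this is only a stopgap: the possessive version is fast, but it changes what the pattern accepts, so the real fix depends on what exactly you want to match.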