I noticed that the following regex takes a very long time to run over an XML file with 3000 lines [1]:
\(<Annotation\(\s*\w\+='[^']\{-}'\s\{-}\)*>\)\@<=\(\(<\/Annotation\)\@!\_.\)\{-}'MATCH\_.\{-}\(<\/Annotation>\)\@=
I always thought that regexes are efficient. Why does this one take so long to finish?
Whether a regular expression is efficient depends on the expression itself. What makes it inefficient is backtracking. To avoid backtracking, the regular expression has to be as distinct as possible, so that each part of the pattern can match in only one way.
Let’s take the regular expression `a.*b` as an example and apply it to the string `abcdef`. The algorithm first matches the literal `a` in `a.*b` to the `a` in `abcdef`. Next, the expression `.*` is processed. In the normal greedy mode, where quantifiers are expanded to the maximum, it matches the whole rest, `bcdef`, of `abcdef`. Then the last literal `b` in `a.*b` is processed. But because the end of the string has already been reached and a quantifier is in use, the algorithm backtracks in order to match the whole pattern. The match of `.*` (`bcdef`) is shortened by one character (`bcde`) and the algorithm tries to match the rest of the pattern. But the `b` in `a.*b` does not match the `f` in `abcdef`. So `.*` is shortened by one more character, and so on, until it matches the empty string (that is, `.` is repeated zero times) and the `b` in `a.*b` matches the `b` in `abcdef`.

As you can see, `a.*b` applied to `abcdef` needs six attempts for `.*` (five backtracking steps, from five characters down to zero) until the whole regular expression matches. But if we alter the regular expression and make it distinct by using `a[^b]*b` instead, no backtracking is necessary and the regular expression matches on the first attempt.

And if you now consider using lazy quantifiers instead, I have to tell you that these rules apply to every quantifier, both greedy and lazy. The only difference is the direction: a greedy quantifier first expands the match to the maximum and then backtracks by shortening it one character at a time, whereas a lazy quantifier first takes the minimum match and then extends it one character at a time.
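
To make the counting above concrete, here is a minimal sketch of a backtracking matcher for the fixed shape "`a`, then a repeated item, then `b`". It is not how any particular regex engine is implemented; the function `count_star_attempts` and the helpers `any_char` and `not_b` are hypothetical names introduced only for this illustration. It counts how many lengths the repeated item has to try before the final `b` matches.

```python
def any_char(c):
    """Stands in for `.` (any character except a newline)."""
    return c != "\n"


def not_b(c):
    """Stands in for the negated class `[^b]`."""
    return c != "b"


def count_star_attempts(text, repeatable, lazy=False):
    """Match 'a', then X*, then 'b' at the start of `text`.

    `repeatable` decides which characters X accepts.
    Returns (matched_text_or_None, number_of_lengths_tried_for_X*).
    """
    if not text.startswith("a"):
        return None, 0

    # Longest run that X* could consume right after the leading 'a'.
    max_len = 0
    while 1 + max_len < len(text) and repeatable(text[1 + max_len]):
        max_len += 1

    # Greedy: start at the maximum and shrink. Lazy: start at zero and grow.
    if lazy:
        lengths = range(0, max_len + 1)
    else:
        lengths = range(max_len, -1, -1)

    attempts = 0
    for n in lengths:
        attempts += 1
        end = 1 + n  # index where the final 'b' would have to be
        if end < len(text) and text[end] == "b":
            return text[:end + 1], attempts
    return None, attempts


if __name__ == "__main__":
    print(count_star_attempts("abcdef", any_char))             # a.*b
    print(count_star_attempts("abcdef", not_b))                # a[^b]*b
    print(count_star_attempts("abcdef", any_char, lazy=True))  # a.*?b
    print(count_star_attempts("axxxb", any_char, lazy=True))   # a.*?b
```

On `abcdef` the greedy `.` version needs six attempts, while the `[^b]` version and the lazy version each need one. On `axxxb`, however, the lazy version needs four attempts because it has to grow the match one character at a time, which illustrates that lazy quantifiers do not avoid the extra work, they only change its direction.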