Using the ICU 4.0 regex library, I find that the following regex exhibits exponential running time:
Actual regex: '[^<]*<\?' As a C string literal: '[^<]*<\\?'
Aim: find ‘<?’ where there is no other ‘<’ before it.
When I run this regex on plain text with no ‘<’ characters at all, it appears to take exponential time. If the text contains at least one ‘<’, it is quick. I don’t understand why.
Shouldn’t the required match on ‘<?’ prevent this from needing to backtrack? I would have thought that it would find the first ‘<’ and then test the rest of the expression. If it can’t find a ‘<’, it would give up, because the pattern obviously can’t match.
Is this a bug in the ICU regex engine, or is it expected behavior?
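You will find an explanation in Russ Cox's article "Regular Expression Matching Can Be Simple And Fast".
As MizardX said, if the match fails at position 0, the engine will try again at position 1, 2, etc. With no ‘<’ in the text, each attempt lets [^<]* run all the way to the end of the input before failing, so every retry rescans nearly the whole remaining text. If the text is long, be ready to wait for quite some time…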
The solution is to anchor your expression:
'^[^<]*<\?'
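To make the difference concrete, here is a minimal sketch in plain C that models the retry behavior described above. It does not call ICU at all: `match_at`, `search_unanchored` and `search_anchored` are hypothetical helpers written for this illustration, and the step counter only counts characters consumed by `[^<]*`. How closely ICU 4.0's internals follow this model is an assumption; the point is only to show how the work grows when every start offset is retried.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an ICU function): try to match [^<]*<\?
 * starting exactly at text[start]. Each character consumed by [^<]*
 * adds one to *steps. Backtracking cannot rescue a failed attempt,
 * because the characters given back are never '<'. */
int match_at(const char *text, size_t len, size_t start, unsigned long *steps)
{
    size_t i = start;

    /* [^<]* : greedily consume everything that is not '<'. */
    while (i < len && text[i] != '<') {
        ++*steps;
        ++i;
    }
    /* <\? : the next two characters must be a literal "<?". */
    return i + 1 < len && text[i] == '<' && text[i + 1] == '?';
}

/* Models the unanchored pattern: on failure, retry at the next offset.
 * Returns the start offset of the match, or -1 if there is none. */
long search_unanchored(const char *text, size_t len, unsigned long *steps)
{
    assert(text != NULL && steps != NULL);
    for (size_t start = 0; start <= len; ++start) {
        if (match_at(text, len, start, steps))
            return (long)start;
    }
    return -1;
}

/* Models the anchored pattern ^[^<]*<\? : only offset 0 is ever tried. */
long search_anchored(const char *text, size_t len, unsigned long *steps)
{
    assert(text != NULL && steps != NULL);
    return match_at(text, len, 0, steps) ? 0 : -1;
}
```

In this model, a text of n characters with no ‘<’ costs n + (n-1) + ... + 1 = n(n+1)/2 steps unanchored, but only n steps anchored, so the slowdown is quadratic in the text length here rather than truly exponential. The anchor also matters for correctness: on the input `a<b<?`, `search_unanchored` reports a match starting at offset 2 even though an earlier ‘<’ exists, while `search_anchored` correctly reports no match.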