I have a program that runs a large number of regular expressions (10+) on a fairly long set of texts (5-15 texts of about 1000 words each).
Every time that is done I feel like I forgot a Thread.Sleep(5000) in there somewhere. Are regular expressions really processor-heavy or something? It’d seem like a computer should crank through a task like that in a millisecond.
Should I try and group all the regular expressions into ONE monster expression? Would that help?
Thanks
EDIT: Here’s a regex that runs right now:
Regex rgx = new Regex(@"((.*(\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*).*)|(.*(keyword1)).*|.*(keyword2).*|.*(keyword3).*|.*(keyword4).*|.*(keyword5).*|.*(keyword6).*|.*(keyword7).*|.*(keyword8).*|.*(count:\n[0-9]|count:\n\n[0-9]|Count:\n[0-9]|Count:\n\n[0-9]|Count:\n).*|.*(keyword10).*|.*(summary: \n|Summary:\n).*|.*(count:).*)", RegexOptions.Compiled | RegexOptions.IgnoreCase);
Regex regex = new Regex(@".*(\.com|\.biz|\.net|\.org|\.co\.uk|\.bz|\.info|\.us|\.cm|(<a href=)).*", RegexOptions.Compiled | RegexOptions.IgnoreCase);
It’s pretty huge, no doubt about it. The idea is if it gets to any of the keywords or the link it will just take out the whole paragraph surrounding it.
Regexes don’t kill CPUs, regex authors do. 😉
But seriously, if regexes always ran as slowly as you describe, nobody would be using them. Before you start loading up silver bullets like the Compiled option, you should go back to your regex and see if it can be improved.
And it can. Each keyword is in its own branch/alternative, and each branch starts with .*, so the first thing each branch does is consume the remainder of the current paragraph (i.e., everything up to the next newline). Then it starts backtracking as it tries to match the keyword. If it gets back to the position it started from, the next branch takes over and does the same thing.
When all branches have reported failure, the regex engine bumps ahead one position and goes through all the branches again. That’s over a dozen branches, times the number of characters in the paragraph, times the number of paragraphs… I think you get the point. Compare that to this regex:
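As a rough illustration of the difference in shape, here is a minimal sketch that puts a cut-down version of the question's per-branch pattern next to a restructured one. The class name ParagraphPatterns, the Strip helper, and the shortened three-keyword alternation are assumptions made for this example; it is a guess at the general form, not the exact expression.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParagraphPatterns
{
    // Shape used in the question (shortened): every branch carries its own
    // greedy .* on both sides, so each branch scans the whole line and backtracks.
    public static readonly Regex PerBranch = new Regex(
        @".*(keyword1).*|.*(keyword2).*|.*(count:).*",
        RegexOptions.IgnoreCase);

    // Restructured shape (hypothetical): one anchored, non-greedy prefix
    // shared by all alternatives, and one trailing .* up to the end of the line.
    public static readonly Regex Factored = new Regex(
        @"^.*?(keyword1|keyword2|count:).*$",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // Blanks out every line that contains one of the keywords.
    public static string Strip(Regex pattern, string text)
    {
        return pattern.Replace(text, string.Empty);
    }
}
```

Both patterns remove the same lines from a newline-separated text; the difference lies in how many positions and branches the engine has to try before it can give up on a line that contains no keyword.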
There are three major changes:
1. The leading and trailing .* are factored out of the individual branches, so each appears only once.
2. The leading .* becomes .*?, making it non-greedy.
3. The whole expression is anchored to the paragraph (^ and $ in Multiline mode).
Now it only makes one match attempt per paragraph (pass or fail), and it practically never backtracks. I could probably make it even more efficient if I knew more about your data. For example, if every keyword/token/whatever starts with a letter, a word boundary would have an appreciable effect (e.g. ^.*?\b(\w+...)).
The ExplicitCapture option makes all the “bare” groups ((...)) act like non-capturing groups ((?:...)), reducing the overhead a little more without adding clutter to the regex. If you want to capture the token, just change that first group to a named group (e.g. (?<token>\w+...)).
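To make the ExplicitCapture and named-group advice concrete, here is a minimal sketch. The class name TokenFinder, its two methods, and the reduced keyword list are assumptions for illustration; the e-mail sub-pattern is copied from the question, and the \b assumes every token starts with a word character.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenFinder
{
    // With ExplicitCapture, the bare (...) groups inside the e-mail
    // sub-pattern do not capture; only the named group "token" does.
    private static readonly Regex Pattern = new Regex(
        @"^.*?\b(?<token>\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*|keyword1|keyword2).*$",
        RegexOptions.ExplicitCapture | RegexOptions.Multiline | RegexOptions.IgnoreCase);

    // Returns the first token found on each line that contains one.
    public static List<string> FindTokens(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in Pattern.Matches(text))
        {
            tokens.Add(m.Groups["token"].Value);
        }
        return tokens;
    }

    // Number of groups the regex tracks, including group 0 (the whole match).
    public static int CaptureGroupCount()
    {
        return Pattern.GetGroupNumbers().Length;
    }
}
```

Called on a text whose lines contain an e-mail address or a keyword, FindTokens returns one token per matching line, and CaptureGroupCount reports only the whole match plus the named group, since the bare groups no longer capture.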