I have a program that runs a large number of regular expressions (10+) on

Question

Asked: May 28, 2026 (2026-05-28T02:11:40+00:00)

I have a program that runs a large number of regular expressions (10+) on a fairly long set of texts (5-15 texts of about 1000 words each).

Every time that is done I feel like I forgot a Thread.Sleep(5000) in there somewhere. Are regular expressions really processor-heavy or something? It’d seem like a computer should crank through a task like that in a millisecond.

Should I try and group all the regular expressions into ONE monster expression? Would that help?

Thanks

EDIT: Here’s a regex that runs right now:

Regex rgx = new Regex(@"((.*(\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*).*)|(.*(keyword1)).*|.*(keyword2).*|.*(keyword3).*|.*(keyword4).*|.*(keyword5).*|.*(keyword6).*|.*(keyword7).*|.*(keyword8).*|.*(count:\n[0-9]|count:\n\n[0-9]|Count:\n[0-9]|Count:\n\n[0-9]|Count:\n).*|.*(keyword10).*|.*(summary: \n|Summary:\n).*|.*(count:).*)", RegexOptions.Compiled | RegexOptions.IgnoreCase);

Regex regex = new Regex(@".*(\.com|\.biz|\.net|\.org|\.co\.uk|\.bz|\.info|\.us|\.cm|(<a href=)).*", RegexOptions.Compiled | RegexOptions.IgnoreCase);

It’s pretty huge, no doubt about it. The idea is if it gets to any of the keywords or the link it will just take out the whole paragraph surrounding it.


1 Answer

Editorial Team · Answer 1 · 2026-05-28T02:11:41+00:00

Regexes don’t kill CPUs, regex authors do. 😉

But seriously, if regexes always ran as slowly as you describe, nobody would be using them. Before you start loading up silver bullets like the Compiled option, you should go back to your regex and see if it can be improved.

And it can. Each keyword is in its own branch/alternative, and each branch starts with .*, so the first thing each branch does is consume the remainder of the current paragraph (i.e., everything up to the next newline). Then it starts backtracking as it tries to match the keyword. If it gets back to the position it started from, the next branch takes over and does the same thing.

When all branches have reported failure, the regex engine bumps ahead one position and goes through all the branches again. That’s over a dozen branches, times the number of characters in the paragraph, times the number of paragraphs… I think you get the point. Compare that to this regex:

Regex re = new Regex(@"^.*?(\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*|keyword1|keyword2|keyword3|keyword4|keyword5|keyword6|keyword7|keyword8|count:(\n\n?[0-9]?)?|keyword10|summary: ?\n).*$",
    RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

There are three major changes:

1. I factored out the leading and trailing .*
2. I changed the leading one to .*?, making it non-greedy
3. I added start-of-line and end-of-line anchors (^ and $ in Multiline mode)

Now it only makes one match attempt per paragraph (pass or fail), and it practically never backtracks. I could probably make it even more efficient if I knew more about your data. For example, if every keyword/token/whatever starts with a letter, a word boundary would have an appreciable effect (e.g. ^.*?\b(\w+...).
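
If you want to see the difference on your own machine, here is a rough timing sketch. None of it comes from the original program: the class name RegexShapeComparison, the helper names, the cut-down three-keyword patterns, and the synthetic text are all placeholders, and it assumes paragraphs are separated by a bare \n. Both shapes should flag the same paragraphs; only the elapsed time should differ, and the actual numbers will depend on your hardware and .NET version.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class RegexShapeComparison
{
    // Shape of the original: every branch carries its own leading and trailing .*
    public static readonly Regex PerBranch = new Regex(
        @".*(keyword1).*|.*(keyword2).*|.*(keyword3).*",
        RegexOptions.IgnoreCase);

    // Shape of the rewrite: one anchored, lazy prefix shared by all alternatives.
    public static readonly Regex Factored = new Regex(
        @"^.*?(keyword1|keyword2|keyword3).*$",
        RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Synthetic input (an assumption): one paragraph per line,
    // and every tenth paragraph contains a keyword.
    public static string BuildText(int paragraphs)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < paragraphs; i++)
        {
            sb.Append("lorem ipsum dolor sit amet consectetur adipiscing elit ");
            if (i % 10 == 0) sb.Append("keyword2 ");
            sb.Append("sed do eiusmod tempor incididunt\n");
        }
        return sb.ToString();
    }

    // Number of paragraphs the regex flags.
    public static int CountMatches(Regex regex, string text)
    {
        return regex.Matches(text).Count;
    }

    // Wall-clock time for one full scan of the text.
    public static TimeSpan Time(Regex regex, string text)
    {
        var sw = Stopwatch.StartNew();
        CountMatches(regex, text);
        sw.Stop();
        return sw.Elapsed;
    }

    // Prints match counts and timings for both shapes.
    public static void Run()
    {
        string text = BuildText(2000);
        Console.WriteLine($"per-branch: {CountMatches(PerBranch, text)} matches, {Time(PerBranch, text).TotalMilliseconds} ms");
        Console.WriteLine($"factored:   {CountMatches(Factored, text)} matches, {Time(Factored, text).TotalMilliseconds} ms");
    }
}
```

Call RegexShapeComparison.Run() from your Main method to print the comparison. A single Stopwatch run is noisy, so treat it as a sanity check and not a benchmark.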

The ExplicitCapture option makes all the “bare” groups ((...)) act like non-capturing groups ((?:...)), reducing the overhead a little more without adding clutter to the regex. If you want to capture the token, just change that first group to a named group (e.g. (?<token>\w+...).
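
As a small sketch of the named-group idea, something like the following could pull out the token and drop the flagged paragraphs. This is an illustration, not the asker's actual code: the ParagraphFilter class and its method names are made up, the keyword list is cut down to two placeholders, and the trailing \n? is an extra I added so that Replace also removes the line break after a flagged paragraph (it assumes bare \n line endings).

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParagraphFilter
{
    // Under ExplicitCapture only the named group "token" captures;
    // the bare (...) groups inside the email pattern do not.
    private static readonly Regex Flagged = new Regex(
        @"^.*?(?<token>\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*|keyword1|keyword2).*$\n?",
        RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Removes every paragraph (line) that contains an email address or a keyword.
    public static string RemoveFlagged(string text)
    {
        return Flagged.Replace(text, string.Empty);
    }

    // Returns the first token that caused a paragraph to be flagged,
    // or an empty string if nothing matched.
    public static string FirstToken(string text)
    {
        Match m = Flagged.Match(text);
        return m.Success ? m.Groups["token"].Value : string.Empty;
    }

    // Number of groups the regex tracks: the whole match plus "token".
    public static int GroupCount()
    {
        return Flagged.GetGroupNumbers().Length;
    }
}
```

For example, ParagraphFilter.RemoveFlagged("keep me\nmail bob@example.com now\nkeep me too\n") should leave only the two "keep" lines, and FirstToken on the same input should give back the email address.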
