The code below contains a regular expression designed to extract a C# string literal, but the performance of the regex matching for input strings of more than a few characters is woeful.

    using System.Diagnostics;
    using System.Text.RegularExpressions;

    class Program
    {
        private static void StringMatch(string s)
        {
            // regex: quote, zero-or-more-(zero-or-more-non-backslash-quote, optional-backslash-anychar), quote
            Match m = Regex.Match(s, "\"(([^\\\\\"]*)(\\\\.)?)*\"");
            if (m.Success)
                Trace.WriteLine(m.Value);
            else
                Trace.WriteLine("no match");
        }

        public static void Main()
        {
            // this first string is unterminated (so the match fails), but it returns instantly
            StringMatch("\"OK");
            // this string is terminated (the match succeeds)
            StringMatch("\"This is a longer terminated string - it matches and returns instantly\"");
            // this string is unterminated (so the match will fail), but it never returns
            StringMatch("\"This is another unterminated string and takes FOREVER to match");
        }
    }

I can refactor the regex into a different form, but can anyone offer an explanation why the performance is so bad?
You’re running into catastrophic backtracking.
Let’s simplify the regex a bit (without the escaped quotes and without the second optional group because, as in your comment, it’s irrelevant for the tested strings):

    "(([^\\"]*))*"

([^\\"]*) matches any run of characters that contains no quotes or backslashes, including the empty string. This again is enclosed in an optional group that can repeat any number of times.

Now for the string "ABC, the regex engine needs to try the following permutations:

    ", ABC
    ", ABC, <empty string>
    ", AB, C
    ", AB, C, <empty string>
    ", AB, <empty string>, C
    ", AB, <empty string>, C, <empty string>
    ", <empty string>, AB, C
    ", <empty string>, AB, C, <empty string>
    ", <empty string>, AB, <empty string>, C, <empty string>
    ", <empty string>, AB, <empty string>, C
    ", A, BC
    ", A, BC, <empty string>
    ", A, <empty string>, BC
    ", <empty string>, A, BC
    ", A, B, C
    ", A, B, C, <empty string>
    ", A, B, <empty string>, C

each of which then fails because there is no following ". The number of such combinations grows exponentially with the length of the string, which is why the short unterminated string fails instantly while the long one never seems to finish.

Also, you’re only testing for substrings instead of forcing the regex to match the entire string. And you usually want to use verbatim strings for regexes to cut down on the number of backslashes you need. How about this:
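A minimal sketch of a pattern along those lines, assuming the goal is to accept the input only when the whole string is a single double-quoted literal with backslash escapes. The names StringLiteralDemo and IsStringLiteral are made up for illustration, and the sketch ends with \z (strict end of input); \Z would also tolerate one trailing newline. Each pass through the loop consumes either one escape pair or one ordinary character, and the two alternatives cannot match the same text, so there is only one way to split the body and a failed match gives up after a linear number of steps.

```csharp
using System;
using System.Text.RegularExpressions;

class StringLiteralDemo
{
    // Sketch only: a verbatim string, so a literal quote is written as two quotes
    // and backslashes are not doubled for the C# compiler.
    private static readonly Regex Literal = new Regex(
        @"\A            # start of the input
          ""            # opening quote
          (?:           # each item of the body is either
              \\.       #   a backslash followed by any character
            |           # or
              [^\\""]   #   one character that is neither backslash nor quote
          )*            # repeated any number of times
          ""            # closing quote
          \z            # end of the input",
        RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: true only if s as a whole is one quoted literal.
    public static bool IsStringLiteral(string s) => Literal.IsMatch(s);

    public static void Main()
    {
        // unterminated, short
        Console.WriteLine(IsStringLiteral("\"OK"));                       // False
        // terminated
        Console.WriteLine(IsStringLiteral(
            "\"This is a longer terminated string - it matches and returns instantly\"")); // True
        // unterminated, long: fails quickly instead of hanging
        Console.WriteLine(IsStringLiteral(
            "\"This is another unterminated string and takes FOREVER to match")); // False
        // escaped quote inside the literal
        Console.WriteLine(IsStringLiteral("\"a\\\"b\""));                 // True
    }
}
```

If the literal has to be found inside a larger piece of text rather than validated as a whole, dropping \A and \z and calling Regex.Match should keep the same linear behaviour, since the loop body is unchanged.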