Can someone help me rewrite this regex to be non-exponential?
I’m using Perl to parse email data, and I want to extract email addresses from it. Here is a shortened version of the regex that I’ve been using:
my $email_address = qr/(?:[^\s@<>,":;\[\]\(\)\\]+?|"[^\"]+?")@/i;
For simplicity I’ve removed the later domain part of the regex. (It isn’t causing any problems.)
This will find the local part of an RFC-compliant email address: either a run of characters that are not email metacharacters OR a “quoted” string, followed by @. Using the OR ‘|’ part of the regex with the two different multicharacter patterns creates an exponential problem.
The problem shows up when I unleash this on a single line of data that is hundreds of thousands of characters long:
$ wc line7.txt
1 221 497819 line7.txt
(I’m sorry but I cannot provide input data at this time, I may be able to mock some up later.)
Much like rewriting (a*b*)* to (a|b)*, I need to rewrite this regex.
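To make the analogy concrete, here is a small sketch showing that the two forms accept the same strings even though the first nests one quantifier inside another. The variable names and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Nested quantifiers: many ways to split the same run of a's and b's.
my $nested = qr/\A(?:a*b*)*\z/;
# Flat form: each iteration consumes exactly one character.
my $flat   = qr/\A(?:a|b)*\z/;

for my $s ('', 'a', 'abba', 'bbbaab', 'abc', 'ca') {
    my $n = $s =~ $nested ? 1 : 0;
    my $f = $s =~ $flat   ? 1 : 0;
    printf "%-10s nested=%d flat=%d\n", "'$s'", $n, $f;
}
```

Both patterns agree on every input; the difference is only in how many ways the engine can try to carve up the string before giving up.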
Splitting it into two separate regexes would solve my problem, but it creates more work in code changes than I am willing to perform at this point.
The eventual target machine is on a Hadoop cluster, so I would like to avoid CPAN modules that don’t come with Hadoop’s version of Perl. (I’ll have to check if Email::Find can even be used.) This is a problem I encountered at work.
The (?>expression) part prevents backtracking. It should be safe because there can be no overlap between the non-quoted part and the quoted part.

I removed the lazy repeats +? because the parts of the alternation already look for the @ and " respectively. Phrases could be a large source of backtracking, so I looked at the Wikipedia article, which states that the local part (before the @) can be only 64 characters long; subtracting the two quotes yields {0,62} (if ""@ is not valid, then change it to {1,62}…).

I do not intend for this to be a completely functional email parser. That is your job. I simply provide help for the catastrophic backtracking. Best of luck!
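As a rough sketch of the changes described above (an atomic group, greedy instead of lazy repeats, and a {0,62} cap on the quoted part), assuming the character class from the question: the name $local_part and the sample line are made up for illustration, and this is not necessarily the exact regex the answer had in mind.

```perl
use strict;
use warnings;

# Hypothetical reconstruction of the local-part pattern; the domain part is omitted.
my $local_part = qr/
    (?>                                # atomic group: no backtracking into a finished alternative
        [^\s\@<>,":;\[\]\(\)\\]+       # unquoted run, greedy instead of lazy
      |
        "[^"]{0,62}"                   # quoted string, at most 62 characters between the quotes
    )
    \@
/x;

# Made-up sample input with one quoted and two unquoted local parts.
my $line  = 'To: "John Doe"@example.com, jane.roe@example.org; <bob@example.net>';
my @found = $line =~ /($local_part)/g;
print "$_\n" for @found;
```

Because neither alternative can match a character the other one starts with, the atomic group should not change which local parts are found; it only stops the engine from retrying shorter matches inside the group once the @ fails to follow.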