I have the following script, which grabs a webpage, then does a regex to find items I’m looking for:
use warnings;
use strict;
use LWP::Simple;
my $content=get('http://mytempscripts.com/2011/09/temporary-post.html') or die $!;
$content=~s/\n//g;
$content=~s/&nbsp;/ /g;
$content=~/<b>this is a temp post<\/b><br \/><br \/>(.*?)<div style='clear: both;'><\/div>/;
my $temp=$1;
while($temp=~/((.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9]{1,})(.*?)\s+)/g){
print "found a match\n";
}
This works, but takes a long, long time. When I shorten the regex to the following, I get the results in less than a second. Why does my original regex take so long? How do I correct it?
while($temp=~/((.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9]{1,})(.*?)\s+)/g){
print "found a match\n";
}
Regular expressions are like the sort function in Perl. You think it’s pretty simple because it’s just a single command, but in the end, it uses a lot of processing power to do the job.
There are certain things you can do to help out:
.*).
The wretched truth is that after decades of writing in Perl, I’ve never mastered the deep dark secrets of regular expression parsing. I’ve tried many times to understand it, but that usually means doing research on the Web, and… well… I get distracted by all of the other stuff on the Web.
And it’s not that difficult: any half-decent developer with an IQ of 240 and a penchant for sadism should easily be able to pick it up.
Let’s take a simple example:
What will $result be? It will be r. You see how that works? Let’s see what happens.
Originally, the regular expression is broken into tokens, and the first token foo.* is used. That actually matches the whole string:
However, if the first regular expression token captures the whole string, the rest of the regular expression fails. Therefore, the regular expression matching algorithm has to backtrack:
Now, the same happens for the rest of the string:
Backtracking tends to happen with the .* and .+ expressions since they’re so loose. I see you’re using non-greedy matches, which can help, but it can still be an issue if you are not careful, especially if you have very long and complex regular expressions.
I hope this helps explain backtracking.
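The backtracking walk-through can be reproduced with a small sketch. The string 'foobarfubar' and the pattern below are assumptions chosen to fit the description (a leading foo.* token and a final result of r); they are not necessarily the example originally shown. The extra capture groups are added only to reveal what each wildcard holds once the engine has finished giving characters back.

```perl
use strict;
use warnings;

# Return what each wildcard ended up holding once the match succeeds.
# The pattern is an illustrative stand-in, not the original example.
sub wildcard_captures {
    my ($string) = @_;
    my @parts = $string =~ /foo(.*)bar(.*)(.+)/;
    return @parts;
}

# The first .* initially grabs everything after "foo", then hands
# characters back until "bar" and the final (.+) can both match.
my @held = wildcard_captures('foobarfubar');
print join('|', map { "[$_]" } @held), "\n";   # prints "[]|[fuba]|[r]"
```

The first wildcard ends up empty and the last group is left with the single character r, which is the backtracking described above.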
The issue you’re running into isn’t that your program doesn’t work, but that it takes a long, long time.
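One way to leave the engine less room to backtrack is to swap the loose (.*?) pieces for negated character classes. The following is a minimal sketch, assuming each item is a run of digits, optionally followed by non-space characters, then whitespace. That is narrower than what your pattern accepts, so it may need adjusting for the real page. The sub name count_triples and the sample text are hypothetical.

```perl
use strict;
use warnings;

# Count non-overlapping runs of three "number followed by whitespace" items.
# \D* cannot skip past a digit and \S* cannot skip past whitespace, so each
# piece has far fewer ways to match than a (.*?) piece does.
sub count_triples {
    my ($text) = @_;
    my $count = 0;
    $count++ while $text =~ /(?:\D*\d+\S*\s+){3}/g;
    return $count;
}

# Hypothetical sample text standing in for the extracted $temp.
my $sample = "a 1 b 2 c 3 d 4 e 5 f 6 ";
print "found ", count_triples($sample), " matches\n";   # found 2 matches
```

The capture groups from the original pattern are dropped here for brevity; they can be added back around \d+ if the numbers themselves are needed.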
I was hoping that the general gist of my answer is that regular expression parsing isn’t as simple as Perl makes it out to be. I can see the command sort @foo; in a program, but forget that if @foo contains a million or so entries, it might take a while. In theory, Perl could be using a bubble sort, and thus the algorithm is O(n²). I hope that Perl is actually using a more efficient algorithm and my actual time will be closer to O(n log n). However, all this is hidden by my simple one-line statement.
I don’t know if backtracking is an issue in your case, but you’re treating an entire webpage output as a single string to match against a regular expression, which could result in a very long string. You attempt to match it against another regular expression, which you do over and over again. Apparently, that is quite a process-intensive step which is hidden by the fact it’s a single Perl statement (much like sort @foo hides its complexity).
Thinking about this on and off over the weekend, you really should not attempt to parse HTML or XML with regular expressions because it is so sloppy. You end up with something rather inefficient and fragile.
In cases like this, you may be better off using something like HTML::Parser, or XML::Simple, which I’m more familiar with but which doesn’t necessarily work with poorly formatted HTML.
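As a rough sketch of the parser route, assuming HTML::Parser is installed (it is a CPAN module, not part of core Perl): let the parser strip the tags and decode entities, then run a short regular expression over the plain text only. The helper name html_text and the sample markup are hypothetical, and the real page may call for handlers that look at specific tags.

```perl
use strict;
use warnings;
use HTML::Parser;

# Collect the decoded text of an HTML fragment, ignoring all tags.
sub html_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes each text chunk with entities such as &nbsp; decoded.
        text_h      => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;    # flush any buffered text
    return $text;
}

# Hypothetical sample markup loosely modelled on the page in the question.
my $sample  = '<b>this is a temp post</b><br /><br />12 apples&nbsp;and 7 pears<div></div>';
my @numbers = html_text($sample) =~ /(\d+)/g;
print "@numbers\n";   # prints "12 7"
```

The regular expression now only has to deal with short plain text, so there is much less for it to backtrack over.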
Perl regular expressions are nice, but they can easily get out of our control.