I have the following script, which grabs a webpage, then does a regex to


Asked: May 25, 2026 (2026-05-25T15:46:20+00:00)


I have the following script, which grabs a webpage, then does a regex to find items I’m looking for:

use warnings;
use strict;
use LWP::Simple;

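# get() returns undef on failure and does not set $!, so report our own message.
my $content=get('http://mytempscripts.com/2011/09/temporary-post.html') or die "Could not fetch the page\n";
$content=~s/\n//g;
$content=~s/&nbsp;/ /g;
# Only use $1 if the match actually succeeded.
$content=~/<b>this is a temp post<\/b><br \/><br \/>(.*?)<div style='clear: both;'><\/div>/ or die "Marker text not found\n";
my $temp=$1;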


while($temp=~/((.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9]{1,})(.*?)\s+)/g){
print "found a match\n";
}

This works, but takes a long, long time. When I shorten the regex to the following, I get the results in less than a second. Why does my original regex take so long? How do I correct it?

while($temp=~/((.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9]{1,})(.*?)\s+)/g){
print "found a match\n";
}


1 Answer

Editorial Team · Answer 1 · 2026-05-25T15:46:21+00:00

Regular expressions are like the sort function in Perl. You think it’s pretty simple because it’s just a single command, but in the end, it uses a lot of processing power to do the job.

There are certain things you can do to help out:

Keep your syntax as simple as possible.
Precompile your regular expression pattern with qr// if you build it from variables and use it in a loop. That can spare Perl from recompiling the pattern on each pass through the loop.
Try to avoid regular expression syntax that forces backtracking. This usually means the most general matching patterns (such as .*).
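As a minimal sketch of the qr// tip above: the pattern below is assembled from a variable, compiled once, and then reused inside the loop. The sample lines and the $unit variable are made up for illustration. Perl already compiles a pattern with no interpolated variables only once, so the benefit is mainly for patterns built from variables.

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from the original page.
my $unit  = 'kg';
my @lines = ('flour 12 kg', 'no weight here', 'sugar 7kg');

# Compile the pattern once, outside the loop.
# \Q...\E quotes any regex metacharacters in $unit.
my $weight_re = qr/(\d+)\s*\Q$unit\E/;

my @found;
for my $line (@lines) {
    # Reuse the compiled pattern on every iteration.
    push @found, $1 if $line =~ $weight_re;
}

print "found: @found\n";
```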

The wretched truth is that after decades of writing in Perl, I’ve never mastered the deep dark secrets of regular expression parsing. I’ve tried many times to understand it, but that usually means doing research on the Web, and …well… I get distracted by all of the other stuff on the Web.

And it’s not that difficult: any half-decent developer with an IQ of 240 and a penchant for sadism should easily be able to pick it up.

@David W.: I guess I’m confused on backtracking. I had to read your link several times but still don’t quite understand how to implement it (or, not implement it) in my case. – user522962

Let’s take a simple example:

my $string = 'foobarfubar';
$string =~ /foo.*bar.*(.+)/;
my $result = $1;

What will $result be? It will be r. You see how that works? Let’s see what happens.

Initially, the regular expression is broken into tokens, and the first token foo.* is tried. Because .* is greedy, that actually matches the whole string:

"foobarfubar" =~ /foo.*/

However, if the first regular expression token captures the whole string, the rest of the regular expression fails. Therefore, the regular expression matching algorithm has to backtrack, giving up one character at a time:

"foobarfubar" =~ /foo.*/    #/bar.*/ doesn't match
"foobarfuba" =~ /foo.*/     #/bar.*/ doesn't match.
"foobarfub" =~ /foo.*/      #/bar.*/ doesn't match.
"foobarfu" =~ /foo.*/       #/bar.*/ matches the last "bar", but nothing is left for /(.+)/.
"foobarf" =~ /foo.*/        #/bar.*/ doesn't match.
"foobar" =~ /foo.*/         #/bar.*/ doesn't match.
 ...
"foo" =~ /foo.*/            #Now /bar.*/ can match, with text left over!

Now, the same happens for the rest of the string:

"foobarfubar" =~ /foo.*bar.*/  #But the final /.+/ doesn't match
"foobarfuba"  =~ /foo.*bar.*/  #And the final /.+/ can match the "r"!

Backtracking tends to happen with the .* and .+ expressions since they’re so loose. I see you’re using non-greedy matches, which can help, but backtracking can still be an issue if you are not careful, especially if you have very long and complex regular expressions.
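To see where the engine ends up after backtracking, here is a small sketch that adds capture groups around each .* in the example pattern (the extra groups are my addition, purely for inspection) and compares the greedy version with a non-greedy one.

```perl
use strict;
use warnings;

my $string = 'foobarfubar';

# Greedy: each .* grabs as much as it can, then gives characters back
# until the rest of the pattern can match.
my @greedy = $string =~ /foo(.*)bar(.*)(.+)/;

# Non-greedy: each .*? starts with nothing and only grows when needed.
my @lazy = $string =~ /foo(.*?)bar(.*?)(.+)/;

# Greedy ends with ('', 'fuba', 'r'); non-greedy with ('', '', 'fubar').
print "greedy: [", join('][', @greedy), "]\n";
print "lazy:   [", join('][', @lazy),   "]\n";
```

Both patterns match the same string, but the final group captures different text, so switching between greedy and non-greedy quantifiers can change results as well as speed.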

I hope this helps explain backtracking.

The issue you’re running into isn’t that your program doesn’t work, but that it takes a long, long time.

I was hoping that the general gist of my answer is that regular expression parsing isn’t as simple as Perl makes it out to be. I can see the command sort @foo; in a program, but forget that if @foo contains a million or so entries, it might take a while. In theory, Perl could be using a bubble sort, which would make the algorithm O(n²). I hope that Perl is actually using a more efficient algorithm and my actual time will be closer to O(n log n). However, all this is hidden by my simple one-line statement.

I don’t know if backtracking is an issue in your case, but you’re treating the entire webpage output as a single string, which could be very long. You then match part of it against another regular expression over and over again. Apparently, that is quite a processing-intensive step, which is hidden by the fact that it’s a single Perl statement (much like sort @foo hides its complexity).
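If backtracking does turn out to be the problem, one common way to reduce it is to replace a loose .*? with a negated character class that cannot run past the thing you are looking for. The sketch below uses a made-up sample string and a much simpler pattern than the one in the question, so treat it as an illustration of the idea rather than a drop-in replacement.

```perl
use strict;
use warnings;

# Hypothetical sample text, not taken from the original page.
my $text = 'abc 123 def 456 ';

# Loose version: .*? may have to be extended one character at a time,
# with the rest of the pattern retried at every position.
my @loose = $text =~ /(.*?)([0-9]+)\s+/g;

# Tighter version: [^0-9]* can only match non-digits, so it stops at the
# first digit and leaves the engine far fewer alternatives to retry.
my @tight = $text =~ /([^0-9]*)([0-9]+)\s+/g;

print "loose: [", join('][', @loose), "]\n";
print "tight: [", join('][', @tight), "]\n";
```

On this input both patterns capture the same pieces. On other inputs they can differ, so any such rewrite needs to be checked against the real data.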

Thinking about this on and off over the weekend, you really should not attempt to parse HTML or XML with regular expressions because the markup is so sloppy. You end up with something rather inefficient and fragile.

In cases like this, you may be better off using something like HTML::Parser, or XML::Simple, which I’m more familiar with but which doesn’t necessarily work with poorly formatted HTML.
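As a rough sketch of the parser approach, the following uses HTML::Parser's version 3 event handlers to collect the text that follows the bold marker and stops at the next div. The HTML fragment and the variable names are invented for illustration, and the real page may need different logic.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical fragment shaped like the markup in the question.
my $html = q{<p><b>this is a temp post</b><br /><br />10 apples 20 pears}
         . q{<div style='clear: both;'></div></p>};

my ($in_bold, $capturing, $captured) = (0, 0, '');

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag) = @_;
        $in_bold   = 1 if $tag eq 'b';
        $capturing = 0 if $tag eq 'div';   # stop at the closing marker
    }, 'tagname' ],
    end_h => [ sub {
        my ($tag) = @_;
        $in_bold = 0 if $tag eq 'b';
    }, 'tagname' ],
    text_h => [ sub {
        my ($text) = @_;
        if ($in_bold && $text eq 'this is a temp post') {
            $capturing = 1;                # start collecting after the marker
            return;
        }
        $captured .= $text if $capturing;
    }, 'dtext' ],
);
$parser->unbroken_text(1);   # deliver each text run as a single event
$parser->parse($html);
$parser->eof;

# A small regex on the extracted text is now cheap and easy to read.
my @numbers = $captured =~ /(\d+)/g;
print "numbers: @numbers\n";
```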

Perl regular expressions are nice, but they can easily get out of our control.
