I am currently attempting to create a Perl webspider using WWW::Mechanize. What I am

Asked: June 14, 2026 (2026-06-14T05:37:11+00:00)

I am currently attempting to create a Perl webspider using WWW::Mechanize.

What I am trying to do is create a webspider that will crawl the whole site of the URL (entered by the user) and extract all of the links from every page on the site.

But I have a problem with how to spider the whole site to get every link, without duplicates.
What I have done so far (the part I'm having trouble with, anyway):

foreach (@nonduplicates) {   # array contains URLs like www.tree.com/contact-us, www.tree.com/varieties, ...
    $mech->get($_);
    my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);  # find all links on this page that start with http://www.tree.com

    # NOW THIS IS WHAT I WANT IT TO DO AFTER THE ABOVE (IN PSEUDOCODE), BUT CAN'T GET WORKING
    # foreach (@list) {
    #     if $_ is already in @nonduplicates
    #         then do nothing because that link has already been found
    #     } else {
    #         append the link to the end of @nonduplicates so that if it has not been crawled for links already, it will be
    #     }
    # }
}

How would I be able to do the above?

I am doing this to try and spider the whole site to get a comprehensive list of every URL on the site, without duplicates.

If you think this is not the best/easiest method of achieving the same result I’m open to ideas.

Your help is much appreciated, thanks.


1 Answer

Editorial Team · Answer 1 · 2026-06-14T05:37:12+00:00

Create a hash to track which links you’ve seen before and put any unseen ones onto @nonduplicates for processing:

use Data::Dumper;

$| = 1;    # Unbuffer STDOUT so the progress line updates in place.
my $scanned = 0;

my @nonduplicates = ( $urlToSpider ); # Add the first link to the queue.
my %link_tracker = map { $_ => 1 } @nonduplicates; # Keep track of what links we've found already.

# pop takes the most recently queued link first; the loop ends when the queue is empty.
while (my $queued_link = pop @nonduplicates) {
    $mech->get($queued_link);
    my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);

    for my $new_link (@list) {
        # url_abs() returns a URI object; stringify it once for the hash key and the queue.
        my $url = $new_link->url_abs()->as_string;

        # Add the link to the queue unless we already encountered it.
        # Increment so we don't add it again.
        push @nonduplicates, $url unless $link_tracker{$url}++;
    }
    printf "\rPages scanned: [%d] Unique Links: [%d] Queued: [%d]", ++$scanned, scalar keys %link_tracker, scalar @nonduplicates;
}
print "\n";
print Dumper(\%link_tracker);
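
The `unless $link_tracker{...}++` idiom can be tried without any network access. The following is only a minimal sketch under stated assumptions: the `%site` hash is a made-up stand-in for the pages `$mech` would fetch, and `crawl`, `%seen` and `@queue` are hypothetical names rather than part of WWW::Mechanize. It uses `shift` rather than `pop`, so pages are visited in the order they were discovered; either choice reaches the same set of URLs.

```perl
use strict;
use warnings;

# Hypothetical in-memory "site": each page maps to the links found on it.
my %site = (
    'http://www.tree.com/'              => [ 'http://www.tree.com/contact-us', 'http://www.tree.com/varieties' ],
    'http://www.tree.com/contact-us'    => [ 'http://www.tree.com/', 'http://www.tree.com/varieties' ],
    'http://www.tree.com/varieties'     => [ 'http://www.tree.com/varieties/oak', 'http://www.tree.com/' ],
    'http://www.tree.com/varieties/oak' => [ 'http://www.tree.com/varieties' ],
);

# Crawl from $start; return the sorted unique URLs and the number of pages visited.
sub crawl {
    my ($start, $links_for) = @_;
    my @queue   = ($start);          # pages still to visit
    my %seen    = ($start => 1);     # every URL discovered so far
    my $fetches = 0;

    while (defined(my $page = shift @queue)) {
        $fetches++;
        for my $url (@{ $links_for->{$page} // [] }) {
            # Post-increment yields the old count, which is false only the first time.
            push @queue, $url unless $seen{$url}++;
        }
    }
    return ([ sort keys %seen ], $fetches);
}

my ($urls, $fetches) = crawl('http://www.tree.com/', \%site);
print "$_\n" for @$urls;
print "Pages fetched: $fetches\n";
```

Because a URL is queued only the first time it is seen, each page is visited once even though the pages link back to one another.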
