I’m trying to build a simple crawler, but the threads never seem to finish, even when the queue is empty:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use LWP::UserAgent;
    use HTML::LinkExtor;

    my $ua = new LWP::UserAgent;
    my %visited = ();
    my $workQueue = new Thread::Queue;

    sub work {
        my ($ua, $queue, $hashref) = @_;
        my $tid = threads->self->tid;
        my $linkExtor = new HTML::LinkExtor;
        while (my $next = $queue->dequeue)
        {
            print "Processing ($tid): ", $next, "\n";
            my $resp = $ua->get ($next);
            if ($resp->is_success)
            {
                $linkExtor->parse ($resp->content);
                my @links = map { my($tag, %attrs) = @$_;
                                  ($tag eq 'a')
                                  ? $attrs{href} : () } $linkExtor->links;
                $queue->enqueue (@links);
            }
        }
    };

    $workQueue->enqueue ('http://localhost');
    my @threads = map { threads->create (\&work, $ua, $workQueue, \%visited) } 1..10;
    $_->join for @threads;
So what’s the right way to wait for those threads to finish? They never jump out of that while loop.
Your $queue->dequeue is blocking and waiting for another thread to enqueue something. From the perldoc: dequeue_nb() will return undef if the queue is empty. But in this case, if one thread has dequeued the first URL, the rest will have stopped before any items are queued.

Off the top of my head, an alternative approach might be to keep a count of threads that are currently engaged in some activity and terminate when that hits 0?
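A minimal sketch of that counting idea, with a few assumptions that are mine rather than the asker's: instead of counting busy threads it counts URLs that have been enqueued but not yet fully processed (a close variant that is easier to keep race-free), it uses $queue->end (available in Thread::Queue 3.01 and later) to wake the blocked workers, and it replaces the LWP fetch with a hypothetical in-memory link table (%links_of) so that it runs without a web server.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;    # end() needs Thread::Queue 3.01 or later

# Hypothetical stand-in for "fetch the page and extract its links".
my %links_of = (
    'http://localhost/'  => ['http://localhost/a', 'http://localhost/b'],
    'http://localhost/a' => ['http://localhost/b', 'http://localhost/'],
    'http://localhost/b' => [],
);

my $queue = Thread::Queue->new;
my %visited :shared;        # URLs that have already been scheduled
my $pending :shared = 0;    # URLs enqueued but not yet fully processed

# Enqueue a URL at most once and count it as outstanding work.
sub schedule {
    my ($url) = @_;
    lock(%visited);
    return if $visited{$url}++;
    { lock($pending); $pending++; }
    $queue->enqueue($url);
}

sub work {
    my $tid = threads->self->tid;
    # dequeue blocks while the queue is empty and returns undef after end()
    while (defined(my $url = $queue->dequeue)) {
        print "Processing ($tid): $url\n";
        # Children are counted before this item is marked done,
        # so $pending cannot reach 0 while work remains.
        schedule($_) for @{ $links_of{$url} || [] };
        lock($pending);
        $queue->end if --$pending == 0;    # nothing left: release all workers
    }
}

schedule('http://localhost/');
my @threads = map { threads->create(\&work) } 1 .. 4;
$_->join for @threads;
print "Visited ", scalar(keys %visited), " URLs\n";
```

In a real crawler the lookup in %links_of would be replaced by the $ua->get and HTML::LinkExtor code from the question. The loop tests defined(...) rather than truthiness, so a queued item such as "0" would not end a worker early.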