In this post I learned that Mechanize in Ruby/Perl is easier to use than HTML::TreeBuilder 3 in that particular example.
Is Mechanize superior to HTML::TokeParser?
Would the below also have been easier to write in Ruby using Mechanize?

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTML::TokeParser;

    sub get_img_page_urls {
        my $url = shift;

        my $ua = LWP::UserAgent->new;
        $ua->agent("Mozilla/8.0");    # the earlier "$0/0.1 ..." agent string was overwritten by this call anyway

        my $req = HTTP::Request->new(GET => $url);
        $req->header('Accept' => 'text/html');

        my $response_u = $ua->request($req);    # send request
        die "Error: ", $response_u->status_line unless $response_u->is_success;

        my $html   = $response_u->content;
        my $stream = HTML::TokeParser->new(\$html);

        my %urls;
        my $found_thumbnails = 0;
        my $found_thumb      = 0;

        while (my $token = $stream->get_token) {
            # Only start tags ('S') matter here; default class to '' to avoid undef warnings
            my $is_start = $token->[0] eq 'S';
            my $class    = $is_start ? ($token->[2]{class} // '') : '';

            # <div class="thumb-box" ... >
            if ($is_start and $token->[1] eq 'div' and $class eq 'thumb-box') {
                $found_thumbnails = 1;
            }

            # <div class="thumb" ... >
            if ($is_start and $token->[1] eq 'div' and $class eq 'thumb') {
                $found_thumb = 1;
            }

            # <a ... >
            if ($found_thumbnails and $found_thumb and $is_start and $token->[1] eq 'a') {
                $urls{'http://example.com' . ($token->[2]{href} // '')} = 1;

                # One URL has been found. Now start all over.
                $found_thumb      = 0;
                $found_thumbnails = 0;
            }
        }

        return %urls;
    }

Mechanize is more than a parser. It adds an emulated browser, which lets you navigate a site, fill out forms, and so on. It also includes a parser, which makes web scraping very simple. Here’s your method rewritten using Ruby's Mechanize: