I have a HUGE html which has many things I don’t need, but inside

Question

Asked: May 16, 2026 (2026-05-16T23:34:45+00:00)

I have a HUGE html which has many things I don’t need, but inside it has URLs that are provided in the following format:

<a href="http://www.retailmenot.com/" class=l

I’m trying to extract the URLs… I tried, to no avail:

open(FILE,"<","HTML.htm") or die "$!";
my @str = <FILE>;

my @matches = grep { m/a href="(.+?") class=l/ } @str

Any idea on how to match this?

1 Answer

Editorial Team · Answer 1 · 2026-05-16T23:34:46+00:00

Use HTML::SimpleLinkExtor, HTML::LinkExtor, or one of the other link-extracting Perl modules. You don’t need a regex at all.

Here’s a short example. You don’t have to subclass. You just have to tell %HTML::Tagset::linkElements which attributes to collect:

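#!perl
use strict;
use warnings;
use HTML::LinkExtor;

# Collect the class attribute of <a> tags along with href.
$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

my $p = HTML::LinkExtor->new;
$p->parse( do { local $/; <> } );   # slurp the file(s) named on the command line
$p->eof;                            # signal end of document to the parser

# Each link is [ $tag, attribute => value, ... ]; keep those with the wanted class.
my @links = grep {
    my( $tag, %hash ) = @$_;
    no warnings 'uninitialized';    # not every link has a class
    $hash{class} eq 'foo';          # 'foo' is a placeholder class name
    } $p->links;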
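
The grep above keeps whole link records, not bare URLs. As a minimal sketch of getting from those records to a plain list of URLs, the following assumes the class is `l` as in the question's markup; the inline sample HTML and the `@urls` name are made up for illustration:

```perl
#!perl
use strict;
use warnings;
use HTML::LinkExtor;

# Collect the class attribute of <a> tags along with href.
$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

# Hypothetical sample input, modeled on the markup in the question.
my $html = <<'HTML';
<a href="http://www.retailmenot.com/" class=l>RetailMeNot</a>
<a href="http://www.example.com/" class=other>Other</a>
HTML

my $p = HTML::LinkExtor->new;
$p->parse( $html );
$p->eof;

# Each element of $p->links is [ $tag, attribute => value, ... ].
# Filter on class first, then pull out just the href value.
my @urls = map  { my( $tag, %attr ) = @$_; $attr{href} }
           grep { my( $tag, %attr ) = @$_; ( $attr{class} // '' ) eq 'l' }
           $p->links;

print "$_\n" for @urls;
```

The `// ''` default does the same job as `no warnings 'uninitialized'` for links that have no class attribute.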

If you need to collect URLs for any other tags, make similar adjustments to their entries in %HTML::Tagset::linkElements.
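
As one possible illustration of such an adjustment, this sketch does the same thing for `<img>` tags, collecting `src` together with `class`. The sample HTML, the `banner` class name, and the `@images` variable are assumptions made up for the example:

```perl
#!perl
use strict;
use warnings;
use HTML::LinkExtor;

# Collect the class attribute of <img> tags along with src.
$HTML::Tagset::linkElements{'img'} = [ qw( src class ) ];

# Hypothetical sample input.
my $html = <<'HTML';
<img src="/images/logo.png" class="banner">
<img src="/images/pixel.gif" class="tracker">
<a href="http://www.example.com/">Example</a>
HTML

my $p = HTML::LinkExtor->new;
$p->parse( $html );
$p->eof;

my @images;
for my $link ( $p->links ) {
    my( $tag, %attr ) = @$link;
    next unless $tag eq 'img';    # skip <a> and other link-bearing tags
    push @images, $attr{src} if ( $attr{class} // '' ) eq 'banner';
}

print "$_\n" for @images;
```

Because no base URL is passed to the constructor, the `src` values come back exactly as written in the document.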

If you’d rather have a callback routine, that’s not so hard either. You can watch the links as the parser runs into them:

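use strict;
use warnings;
use HTML::LinkExtor;

$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

my @links;
# Called once per link-bearing tag with the tag name and its attribute pairs.
my $callback = sub {
    my( $tag, %hash ) = @_;
    no warnings 'uninitialized';    # not every tag has a class
    push @links, $hash{href} if $hash{class} eq 'foo';
    };

my $p = HTML::LinkExtor->new( $callback );
$p->parse( do { local $/; <DATA> } );   # reads the HTML from the script's __DATA__ section
$p->eof;                                # signal end of document to the parser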
