I have a HUGE HTML file that has many things I don’t need, but inside it there are URLs in the following format:
<a href="http://www.retailmenot.com/" class=l
I’m trying to extract the URLs… I tried, to no avail:
open(FILE,"<","HTML.htm") or die "$!";
my @str = <FILE>;
my @matches = grep { m/a href="(.+?") class=l/ } @str;
Any idea on how to match this?
Use HTML::SimpleLinkExtor, HTML::LinkExtor, or one of the other link extracting Perl modules. You don’t need a regex at all.
Here’s a short example. You don’t have to subclass. You just have to tell %HTML::Tagset::linkElements which attributes to collect.
If you need to collect URLs for any other tags, you make similar adjustments.
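A minimal sketch of the %HTML::Tagset::linkElements approach, assuming HTML::LinkExtor is installed. The sample HTML, the variable names, and the choice to filter on class "l" are illustrative assumptions modeled on the question, not code from the original answer:

```perl
use strict;
use warnings;
use HTML::LinkExtor;    # also loads HTML::Tagset

# Ask the extractor to report class alongside href for <a> tags.
$HTML::Tagset::linkElements{a} = [qw(href class)];

# Made-up sample input; for a real file, call
# $parser->parse_file('HTML.htm') instead of parse().
my $html = <<'HTML';
<p><a href="http://www.retailmenot.com/" class=l>RetailMeNot</a>
<a href="http://example.com/skip" class=other>skip</a>
<a href="http://example.com/plain">plain</a></p>
HTML

my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

# Each entry returned by links() is [ $tag, attr => value, ... ].
my @urls;
for my $link ($parser->links) {
    my ($tag, %attr) = @$link;
    next unless $tag eq 'a' && defined $attr{href};
    next unless defined $attr{class} && $attr{class} eq 'l';
    push @urls, $attr{href};
}

print "$_\n" for @urls;
```

With this sample input, only the first link should be printed, since the other two either carry a different class or none at all.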
If you’d rather have a callback routine, that’s not so hard either. You can watch the links as the parser runs into them:
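A rough sketch of the callback style, again assuming HTML::LinkExtor. The callback passed to new() is invoked with the tag name followed by attribute/value pairs; the sample HTML and the names $on_link and @seen are hypothetical:

```perl
use strict;
use warnings;
use HTML::LinkExtor;

my @seen;

# Called once per link-bearing tag: tag name, then attr => value pairs.
my $on_link = sub {
    my ($tag, %attr) = @_;
    return unless $tag eq 'a' && defined $attr{href};
    push @seen, $attr{href};
};

my $parser = HTML::LinkExtor->new($on_link);

# Made-up sample input; use parse_file('HTML.htm') for a real file.
$parser->parse('<a href="http://www.retailmenot.com/" class=l>x</a><img src="logo.png">');
$parser->eof;

print "$_\n" for @seen;
```

Because a callback is supplied, the parser hands each link to it as the link is found instead of storing it for links(), which keeps memory use low on a very large file.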