I have a really weird problem: i searching for URLs on a html site and want only a specific part of the url. In my test html page the link occurs only once, but instead of one result i get about 20…
this is my regex im using:
perl -ne 'm/http\:\/\myurl\.com\/somefile\.php.+\/afolder\/(.*)\.(rar|zip|tar|gz)/; print "$1.$2\n";'
sample input would be something like this:
<html><body><a href="http://myurl.com/somefile.php&x=foo?y=bla?z=sdf?path=/foo/bar/afolder/testfile.zip?more=arguments?and=evenmore">Somelinknme</a></body></html>
which is a very easy example. so in real the link would apper on a normal website with content around…
my result should be something like this:
testfile.zip
but instead i see this line very often… Is this a problem with the regex or with something else?
Yes, the regex is greedy.
Use an appropriate tool for HTML instead: HTML::LinkExtor or one of the link methods in WWW::Mechanize, then URI to extract a specific part.