I have a really weird problem: I am searching for URLs on an HTML site

Asked: June 3, 2026 (2026-06-03T20:23:25+00:00)

I have a really weird problem: I am searching for URLs on an HTML site and want only a specific part of each URL. In my test HTML page the link occurs only once, but instead of one result I get about 20…

This is the regex I'm using:

perl -ne 'm/http\:\/\/myurl\.com\/somefile\.php.+\/afolder\/(.*)\.(rar|zip|tar|gz)/; print "$1.$2\n";'

Sample input would be something like this:

<html><body><a href="http://myurl.com/somefile.php&x=foo?y=bla?z=sdf?path=/foo/bar/afolder/testfile.zip?more=arguments?and=evenmore">Somelinknme</a></body></html>

That is a very simple example; in reality the link would appear on a normal website with other content around it…

My result should be something like this:

testfile.zip

But instead I see this line many times… Is this a problem with the regex or with something else?


1 Answer

Editorial Team · Answer 1 · 2026-06-03T20:23:27+00:00

Yes, the regex is greedy: both .+ and (.*) match as much of the line as they can.
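If you do stay with the one-liner, a second thing worth checking is that it prints on every input line, whether or not the match succeeded. Perl's capture variables generally keep their values from the last successful match, so later lines can repeat an earlier result. Below is a minimal sketch that prints only when the match succeeds. The temporary sample file and the tightened capture `[^/"?&]+` are illustrative assumptions, not part of the question:

```shell
# Hypothetical sample page written to a temporary file (contents are an assumption).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html><body>
<p>some unrelated content</p>
<a href="http://myurl.com/somefile.php&x=foo?y=bla?z=sdf?path=/foo/bar/afolder/testfile.zip?more=arguments?and=evenmore">Somelinknme</a>
<p>more unrelated content</p>
</body></html>
EOF

# Print only if the match succeeded on this line.
# m{...} avoids escaping every slash; [^/"?&]+ keeps the capture within one file name.
result=$(perl -ne 'print "$1.$2\n" if m{http://myurl\.com/somefile\.php.+?/afolder/([^/"?&]+)\.(rar|zip|tar|gz)};' "$sample")

echo "$result"
rm -f "$sample"
```

With the statement modifier `if`, lines that do not contain the link produce no output at all, so the sample page yields a single line.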

Use an appropriate tool for HTML instead: HTML::LinkExtor or one of the link methods in WWW::Mechanize, then URI to extract a specific part.

use 5.010;
use WWW::Mechanize qw();
use URI qw();
use URI::QueryParam qw();

my $w = WWW::Mechanize->new;
$w->get('file:///tmp/so10549258.html');
for my $link ($w->links) {
    my $u = URI->new($link->url);
    # 'http://myurl.com/somefile.php?x=foo&y=bla&z=sdf&path=/foo/bar/afolder/testfile.zip&more=arguments&and=evenmore'
    say $u->query_param('path');
    # '/foo/bar/afolder/testfile.zip'
    $u = URI->new($u->query_param('path'));
    say (($u->path_segments)[-1]);
    # 'testfile.zip'
}
