I need to pull all the links for a page that resides on an

Question

Asked: June 5, 2026

I need to pull all the links for a page that resides on an intranet, but I am unsure how best to do it. The structure of the site is as follows:

List of topics

Topic 1
Topic 2
Topic 3

etc

The links reside in the individual topic pages. I want to avoid going through more than 500 topic pages manually to extract each URI.

Each topic page has a URL of the following form:

http://alias/filename.php?cat=6&number=1

The cat parameter refers to the category and the number parameter refers to the topic.

Within each topic page, the URI I need to extract also follows a fixed format:

http://alias/value?id=somevalue

Caveats

I don’t have access to the database, so trawling through it is not an option
There is only ever a single URI in each topic page
I need to extract the list to a file that simply lists each URI on a new line

I would like a script I can run from the terminal via Bash that will visit each topic URI in turn and then extract the URI found in that topic page.

In a nutshell

How can I write a script, runnable from Bash, that goes through the whole list of topics, extracts the URI from each topic page, and writes a text file with each extracted URI on a new line?

1 Answer

Editorial Team · Answer 1 · 2026-06-05T15:05:33+00:00

I would implement this in Perl, using the HTML::TokeParser and WWW::Mechanize modules:

use strict;
use warnings;
use HTML::TokeParser;
use WWW::Mechanize;

# autocheck => 1 makes the script die on any failed request
my $site = WWW::Mechanize->new(autocheck => 1);
my $topicmax = 500;  # Note: adjust this to the number of topic pages you have

# loop through each topic page
foreach my $number (1 .. $topicmax) {
    my $topicurl = "http://alias/filename.php?cat=6&number=$number";

    # get the page and hand its HTML to the parser
    $site->get($topicurl);
    my $html = $site->content;
    my $p = HTML::TokeParser->new(\$html);

    # parse the page and extract the links
    while (my $token = $p->get_tag("a")) {
        my $url = $token->[1]{href};
        next unless defined $url;  # skip anchors without an href
        # use a regex to test for the link format we want
        if ($url =~ m{^http://alias/value\?id=}) {
            print "$url\n";
        }
    }
}

The script prints to stdout, so you just need to redirect it to a file.
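
Since the question asks for something runnable from Bash, the same idea can be roughed out in shell. This is only a minimal sketch, assuming curl and grep are installed; the function names extract_uri and fetch_all and the default output file links.txt are hypothetical, chosen for illustration. Unlike the Perl script, it does a plain text match on the page source rather than parsing the HTML, so it would also pick up a matching URI that appears outside an href.

```shell
# Read HTML on stdin and print every URI of the form
# http://alias/value?id=... (the same format the Perl regex tests for).
extract_uri() {
    grep -oE "http://alias/value\?id=[^\"'<> ]+"
}

# Hypothetical driver: fetch topic pages 1..topicmax and collect the URIs.
# Both defaults (500 pages, links.txt) are assumptions; adjust as needed.
fetch_all() {
    local topicmax="${1:-500}"
    local out="${2:-links.txt}"
    local n=1

    : > "$out"    # start with an empty output file
    while [ "$n" -le "$topicmax" ]; do
        # -f: treat HTTP errors as failures; -sS: quiet, but still show errors
        curl -fsS "http://alias/filename.php?cat=6&number=$n" | extract_uri >> "$out"
        n=$((n + 1))
    done
}
```

Calling fetch_all 500 links.txt would then write one URI per line to links.txt. A topic page that fails to load or contains no matching URI simply contributes nothing to the file.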
