I need to pull all the links from a site that resides on an intranet, but I am unsure how best to do it. The structure of the site is as follows:
List of topics
- Topic 1
- Topic 2
- Topic 3
- etc.
The links I need reside in the individual topic pages. I want to avoid going through more than 500 topic pages manually to extract the URIs.
Each topic page has a URL of the following form:
http://alias/filename.php?cat=6&number=1
The cat parameter refers to the category and the number parameter refers to the topic.
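A minimal sketch of how such topic URLs could be generated from the two parameters. `topic_url` is a hypothetical helper, and the assumption that topic numbers run consecutively within a category is not confirmed anywhere above:

```shell
#!/usr/bin/env bash
# Build a topic-page URL from a category ($1) and a topic number ($2).
# "alias" and "filename.php" are the placeholders used in the question.
topic_url() {
    printf 'http://alias/filename.php?cat=%s&number=%s\n' "$1" "$2"
}

# Example: print the first three topic URLs of category 6.
# Consecutive numbering is an assumption, not something the site guarantees.
for n in 1 2 3; do
    topic_url 6 "$n"
done
```

If the numbering has gaps, reading the topic links from the list page is likely the safer route.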
Within each topic page, the URI I need to extract also follows a fixed format:
http://alias/value?id=somevalue
Caveats
- I don’t have access to the database, so trawling through it is not an option
- There is only ever a single URI in each topic page
- I need to extract the list to a file that simply lists each URI on its own line
I would like a script that I can run from the terminal in Bash that trawls through the topic URIs and then extracts the URI from each topic page.
In a nutshell:
How can I write a script, runnable from Bash, that goes through the whole list of topics, extracts the URI from each topic page, and writes a text file with each extracted URI on its own line?
I implemented this in Perl, using the HTML::TokeParser and WWW::Mechanize modules:
The script prints to stdout, so you just need to redirect it to a file.
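Since the question asks for Bash, the same crawl can also be sketched in shell with `curl`, `grep`, `sed` and `awk`. This is a rough sketch under stated assumptions, not a port of the Perl script: `LIST_URL` is a hypothetical address for the list-of-topics page, and the topic links are assumed to appear in that page as relative `filename.php?cat=N&number=M` links (possibly with `&amp;`). Regex matching on HTML is fragile, so treat it as a starting point.

```shell
#!/usr/bin/env bash
# Sketch: read the topic list, visit every topic page, print its one target URI.

# Hypothetical list-of-topics URL; override it from the environment.
LIST_URL="${LIST_URL:-http://alias/list.php}"

# Fetch a page quietly and fail on HTTP errors.
fetch() {
    curl -fsS "$1"
}

# Read list-page HTML on stdin; print each distinct topic URL once, in page order.
# Assumes links look like filename.php?cat=N&number=M relative to http://alias/.
topic_urls() {
    grep -oE 'filename\.php\?cat=[0-9]+&(amp;)?number=[0-9]+' \
        | sed -e 's/&amp;/\&/g' -e 's|^|http://alias/|' \
        | awk '!seen[$0]++'
}

# Read topic-page HTML on stdin; print the first URI of the form
# http://alias/value?id=... (the question says there is only one per page).
target_uri() {
    grep -oE "http://alias/value\?id=[^\"'<>[:space:]]+" | head -n 1
}

# Walk the list page, then each topic page; one extracted URI per output line.
crawl() {
    fetch "$LIST_URL" | topic_urls | while IFS= read -r url; do
        fetch "$url" | target_uri
    done
}
```

To use it, append a line such as `crawl > uris.txt` to the script. Like the Perl version, it writes to stdout, so the redirect decides where the list ends up.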