I want to download about 200 different HTML files over HTTPS and extract the title of the page from each file and put the titles into a text document.
How would I go about using Perl to download files using HTTPS? I searched Google, but I didn’t find very much helpful information or examples.
A good place to look for information on the downloading part is the libwww-perl cookbook.
Here’s some rudimentary sample code. It isn’t necessarily the best way, but it should work, assuming you have the LWP module installed (available from CPAN).
You might want to add more bells and whistles, such as unescaping HTML entities, handling error conditions, fetching URLs in parallel with multiple threads, faking the User-Agent string as Mozilla, etc. 🙂
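The original sample code did not survive in this copy, so here is a minimal sketch of what such a script might look like. The function name `extract_title` and the naive regex are my assumptions, not the original poster's exact code; note that HTTPS support in LWP also requires the LWP::Protocol::https module from CPAN.

#!/usr/bin/perl
# titlegrab.pl -- minimal sketch (names and regex are assumptions).
# Reads one URL per line from STDIN, fetches each page, prints its title.
use strict;
use warnings;
use LWP::UserAgent;   # HTTPS also needs LWP::Protocol::https installed

# Naive title extraction; robust code would use HTML::Parser instead.
sub extract_title {
    my ($html) = @_;
    return ($html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is) ? $1 : '(no title)';
}

my $ua = LWP::UserAgent->new(timeout => 10);

while (my $url = <STDIN>) {
    chomp $url;
    next unless $url;
    my $res = $ua->get($url);
    if ($res->is_success) {
        print extract_title($res->decoded_content), "\n";
    }
    else {
        warn "$url: ", $res->status_line, "\n";
    }
}

Regex-based HTML parsing is fragile, but for grabbing a single `<title>` tag from a couple hundred pages it is usually good enough.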
If you saved this as titlegrab.pl, and you had a list of sites in sites.list (one URL per line), you could use this with
$ cat sites.list | perl titlegrab.pl
to see all the titles. Or redirect the output to a file, e.g.
$ cat sites.list | perl titlegrab.pl > results.txt