I often use wget to mirror very large websites. Sites that contain hotlinked content (be it images, video, css, js) pose a problem, as I seem unable to specify that I would like wget to grab page requisites that are on other hosts, without having the crawl also follow hyperlinks to other hosts.
For example, let’s look at this page:
https://dl.dropbox.com/u/11471672/wget-all-the-things.html
Let’s pretend that this is a large site that I would like to completely mirror, including all page requisites – including those that are hotlinked.
wget -e robots=off -r -l inf -pk
^^ gets everything but the hotlinked image
wget -e robots=off -r -l inf -pk -H
^^ gets everything, including the hotlinked image, but goes wildly out of control and proceeds to download the entire web
wget -e robots=off -r -l inf -pk -H --ignore-tags=a
^^ gets the first page, including both the hotlinked and the local image, and does not follow the hyperlink to the out-of-scope site, but obviously also does not follow the hyperlink to the next page of the site.
I know that there are various other tools and methods of accomplishing this (HTTrack and Heritrix allow for the user to make a distinction between hotlinked content on other hosts vs hyperlinks to other hosts) but I’d like to see if this is possible with wget. Ideally this would not be done in post-processing, as I would like the external content, requests, and headers to be included in the WARC file I’m outputting.
You can’t tell wget to span hosts for page requisites only; -H is all or nothing. Since -r combined with -H will pull down the entire Internet, you’ll want to split the crawls that use them. To grab hotlinked page requisites, you’ll have to run wget twice: once to recurse through the site’s structure, and once to grab the hotlinked requisites. I’ve had luck with this method:
1) wget -r -l inf [other non-H non-p switches] http://www.example.com
2) Build a list of all HTML files in the site structure (find . | grep html) and pipe it to a file.
3) wget -pH [other non-r switches] -i [infile]
Step 1 builds the site’s structure on your local machine, and gives you any HTML pages in it. Step 2 gives you a list of the pages, and step 3 wgets all assets used on those pages. This will build a complete mirror on your local machine, so long as the hotlinked assets are still live.
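The three steps could be wrapped up roughly like this. It is only a sketch, not a tested recipe: the function names are made up for illustration, http://www.example.com is a placeholder, and it assumes wget's default layout where a recursive crawl saves files under a directory named after the host. Because -i expects URLs rather than local paths, the sketch turns each saved path back into a URL by swapping the leading "./" for a scheme (http unless you pass another one). That guess can be wrong for pages that were fetched as a bare directory URL and saved as index.html.

```shell
#!/bin/sh
# Two-pass mirror sketch. All function names here are hypothetical helpers.

# Step 1: recurse through the site's own structure (no -H, no -p).
# Usage: crawl_structure URL [extra wget switches]
crawl_structure() {
    url=$1
    shift
    wget -r -l inf "$@" "$url"
}

# Step 2: list every saved .html file under the current directory as a URL.
# Assumes wget's default host-prefixed directories, so
# ./www.example.com/a.html becomes http://www.example.com/a.html.
# Usage: list_page_urls [scheme]
list_page_urls() {
    scheme=${1:-http}
    find . -type f -name '*.html' | sed "s|^\./|$scheme://|"
}

# Step 3: fetch the requisites of each listed page, spanning hosts (no -r).
# Usage: fetch_requisites LISTFILE [extra wget switches]
fetch_requisites() {
    list=$1
    shift
    wget -p -H "$@" -i "$list"
}

# Example run, from the directory where the mirror should live:
#   crawl_structure http://www.example.com -e robots=off
#   list_page_urls http > pages.txt
#   fetch_requisites pages.txt -e robots=off -k
```

Keeping -r out of the third step is what stops the second pass from wandering off to other sites: with -p and -H alone, wget only fetches each listed page plus the assets that page needs, wherever they are hosted.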