The Archive Base

Editorial Team
Asked: June 13, 2026 (2026-06-13T05:08:46+00:00)


I often use wget to mirror very large websites. Sites that contain hotlinked content (images, video, CSS, JS) pose a problem: I can't find a way to tell wget to grab page requisites that live on other hosts without the crawl also following hyperlinks to other hosts.

For example, let’s look at this page
https://dl.dropbox.com/u/11471672/wget-all-the-things.html

Let’s pretend that this is a large site that I would like to completely mirror, including all page requisites – including those that are hotlinked.

wget -e robots=off -r -l inf -pk 

^^ gets everything but the hotlinked image

wget -e robots=off -r -l inf -pk -H

^^ gets everything, including hotlinked image, but goes wildly out of control, proceeding to download the entire web

wget -e robots=off -r -l inf -pk -H --ignore-tags=a

^^ gets the first page, including both hotlinked and local image, does not follow the hyperlink to the site outside of scope, but obviously also does not follow the hyperlink to the next page of the site.

I know that there are various other tools and methods of accomplishing this (HTTrack and Heritrix allow for the user to make a distinction between hotlinked content on other hosts vs hyperlinks to other hosts) but I’d like to see if this is possible with wget. Ideally this would not be done in post-processing, as I would like the external content, requests, and headers to be included in the WARC file I’m outputting.

1 Answer

  1. Editorial Team
    Added an answer on June 13, 2026 at 5:08 am (2026-06-13T05:08:47+00:00)

    wget can't span hosts for page requisites only; -H is all or nothing. Because combining -r with -H lets the crawl follow links onto every host it finds, you'll want to split the crawls that use them. To grab hotlinked page requisites, run wget twice: once to recurse through the site's structure, and once to fetch the hotlinked requisites. I've had luck with this method:

    1) wget -r -l inf [other non-H non-p switches] http://www.example.com

    2) build a list of all HTML files in the site structure (for example, find . -name '*.html') and write it to a file; since wget -i reads URLs, convert each local path back to its URL first

    3) wget -pH [other non-r switches] -i [infile]

    Step 1 builds the site’s structure on your local machine, and gives you any HTML pages in it. Step 2 gives you a list of the pages, and step 3 wgets all assets used on those pages. This will build a complete mirror on your local machine, so long as the hotlinked assets are still live.
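
The three steps above can be sketched as a small set of shell functions. This is a minimal sketch, not a tested mirroring tool: the function names and the http:// scheme are assumptions for illustration, and the path-to-URL conversion assumes wget's default layout, in which each page is saved under a directory named after its host. Pages whose saved name differs from their URL (for example a directory URL saved as index.html, or pages not ending in .html) would need extra handling.

```shell
#!/bin/sh
# Sketch of the two-pass mirror. All function names are hypothetical.

# Step 1: recurse through the site's own pages only (no -H, no -p).
# -e robots=off is carried over from the question; add other switches as needed.
crawl_structure() {
    # $1 = start URL
    wget -e robots=off -r -l inf "$1"
}

# Step 2: print one URL per mirrored HTML file.
# Assumes files were saved as ./<host>/<path> and that the site is
# reachable over http:// (an assumption; adjust the scheme if not).
list_page_urls() {
    # $1 = directory that contains the mirror
    ( cd "$1" && find . -type f -name '*.html' ) | sed 's|^\./|http://|' | sort
}

# Step 3: fetch each listed page again together with its requisites,
# spanning hosts. No -r here, so hyperlinks are not followed.
fetch_requisites() {
    # $1 = file with one URL per line
    wget -e robots=off -p -H -i "$1"
}
```

One possible invocation, starting in an empty directory: run crawl_structure http://www.example.com, then list_page_urls . > urls.txt, then fetch_requisites urls.txt. If WARC output is wanted, wget's --warc-file option could be added to both passes, which would yield one WARC file per pass rather than a single combined file.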

© 2021 The Archive Base. All Rights Reserved