I only want the folder structure, but I couldn’t figure out how with wget.

Editorial Team

Asked: June 12, 2026

I only want the folder structure of a site, but I couldn’t figure out how to get just that with wget. Instead I am using this:

wget -R pdf,css,gif,txt,png -np -r http://example.com

This should reject every file type listed after -R, but it looks like wget still downloads each file and then deletes it.

Is there a better way to just get the folder structure?

HTTP request sent, awaiting response… 200 OK
Length: 136796 (134K) [application/x-download]
Saving to: “example.com/file.pdf”

100%[=====================================>] 136,796 853K/s in 0.2s

2012-10-03 03:51:41 (853 KB/s) – “example.com/file.pdf” saved [136796/136796]

Removing example.com/file.pdf since it should be rejected.

In case anyone is wondering, this is for a client. They could tell me the structure, but it’s a hassle because their IT guy would have to do it, so I wanted to just get it myself.

1 Answer

Editorial Team · Answer 1 · 2026-06-12T07:43:54+00:00

That appears to be how wget was designed to work. When performing recursive downloads, non-leaf files that match the reject list are still downloaded so they can be harvested for links, then deleted.

From the in-code comments (recur.c):

Either --delete-after was specified, or we loaded this
otherwise rejected (e.g. by -R) HTML file just so we
could harvest its hyperlinks -- in either case, delete
the local file.

We ran into this on a past project where we had to mirror an authenticated site, and wget kept hitting the logout pages even though it was meant to reject those URLs. We could not find any option that changes this behaviour.

The solution we ended up with was to download, hack and build our own version of wget. There’s probably a more elegant approach to this, but the quick fix we used was to add the following rules to the end of the download_child_p() routine (modified to match your requirements):

  /* Extra rules */
  if (match_tail(url, ".pdf", 0)) goto out;
  if (match_tail(url, ".css", 0)) goto out;
  if (match_tail(url, ".gif", 0)) goto out;
  if (match_tail(url, ".txt", 0)) goto out;
  if (match_tail(url, ".png", 0)) goto out;
  /* --- end extra rules --- */

  /* The URL has passed all the tests.  It can be placed in the
     download queue. */
  DEBUGP (("Decided to load it.\n"));

  return 1;

 out:
  DEBUGP (("Decided NOT to load it.\n"));

  return 0;
}
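
If the list of extensions grows, the repeated rules above could be folded into one table-driven check. The following is only a sketch under some assumptions: url_has_rejected_suffix() and ends_with() are hypothetical helpers, not part of wget, and ends_with() is a case-sensitive stand-in for what match_tail(url, tail, 0) appears to do in the rules above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reject list, mirroring the -R list from the question. */
static const char *const rejected_suffixes[] = {
  ".pdf", ".css", ".gif", ".txt", ".png"
};

/* Return true if STR ends with TAIL (case-sensitive).
   Stand-in for wget's match_tail(); not wget's own code. */
static bool
ends_with (const char *str, const char *tail)
{
  size_t slen = strlen (str);
  size_t tlen = strlen (tail);

  if (tlen > slen)
    return false;
  return strcmp (str + slen - tlen, tail) == 0;
}

/* Hypothetical helper: true if URL ends with any rejected suffix. */
static bool
url_has_rejected_suffix (const char *url)
{
  size_t n = sizeof rejected_suffixes / sizeof rejected_suffixes[0];

  for (size_t i = 0; i < n; i++)
    if (ends_with (url, rejected_suffixes[i]))
      return true;
  return false;
}
```

With a helper like this defined earlier in recur.c, the extra rules would shrink to a single line such as `if (url_has_rejected_suffix (url)) goto out;`. Note that a plain suffix test does not catch URLs with a query string after the extension (for example file.pdf?v=2), and that holds for the original rules as well.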
