I have to scrape a website and save all pages as HTML and put

Question

Asked: June 7, 2026

I have to scrape a website and save all pages as HTML and put it entirely on a DVD. I’ve done this, but now all the links start with a /, and that grabs my root directory. I would like to change all hrefs of all files (1500 pages) to href="./" so it always grabs the working directory.

I’ve seen things about sed in bash, but I didn’t quite catch how to dynamically grab all hrefs and change them.

How could I do this in an efficient way?

1 Answer

Editorial Team · Answer 1 · 2026-06-07T05:29:35+00:00

As I said in my comment above, depending on which tool you’re using to scrape the site, you could start by checking whether it supports rewriting links. wget does exactly this when you pass the -k option:

-k,  --convert-links      make links in downloaded HTML or CSS point to
                          local files.

I don’t think Ugo Méda’s suggestion, the base tag, will work. Your URLs start with a /, so they already carry a full path from the root, and the base tag only supplies a base for resolving relative URLs:

href = uri [CT]
This attribute specifies an absolute URI that acts as the base URI for resolving relative URIs.
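
To make the limitation concrete, here is a minimal sketch using Python's standard urllib.parse.urljoin, which follows the same resolution rules a browser applies. The base URL file:///media/dvd/ is an assumption standing in for wherever the DVD happens to be mounted; it is not from the question.

```python
from urllib.parse import urljoin

# Hypothetical base, as if <base href="file:///media/dvd/"> were added to each page.
BASE = "file:///media/dvd/"


def resolve(href: str, base: str = BASE) -> str:
    """Resolve href against base using standard URL resolution rules."""
    return urljoin(base, href)


# A relative link is resolved inside the base directory.
relative = resolve("bar/baz.html")

# A link with a leading slash discards the base path entirely.
root_relative = resolve("/bar/baz.html")

print(relative)
print(root_relative)
```

The second result ends up at the filesystem root rather than under the DVD directory, which is the behaviour the question describes.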

Rewriting every href is tricky, because it is hard to know you’re doing the right thing: the correct result depends on the structure of the site. Consider the following example:

/foo/bar.html:

<a href="/bar/baz.html">baz</a>

If you rewrite that per your suggestion, it will be:

<a href="./bar/baz.html">baz</a>

But that won’t work, since the browser resolves ./ against the directory of the current page, giving [SOME DIR]/foo/bar/baz.html, when the file is really at [SOME DIR]/bar/baz.html. In that case, you really want:

<a href="../bar/baz.html">baz</a>
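
The three variants of the link can be compared with a short sketch. It assumes, purely for illustration, that the site was copied to file:///media/dvd/ so that the page lives at file:///media/dvd/foo/bar.html; substitute your own location.

```python
from urllib.parse import urljoin

# Hypothetical location of /foo/bar.html once the site is copied to a DVD.
PAGE = "file:///media/dvd/foo/bar.html"


def resolve_from_page(href: str, page: str = PAGE) -> str:
    """Return the URL that href points to when it appears inside page."""
    return urljoin(page, href)


original = resolve_from_page("/bar/baz.html")      # leading slash: jumps to the root
dot_slash = resolve_from_page("./bar/baz.html")    # stays inside /foo/
dot_dot = resolve_from_page("../bar/baz.html")     # climbs out of /foo/ first

for label, url in [("/", original), ("./", dot_slash), ("../", dot_dot)]:
    print(f"{label:4} -> {url}")
```

Only the ../ form lands on the copy of bar/baz.html that sits next to the foo directory.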

What I’m trying to say is that the correct rewritten URL always depends on both the location of the current file and the location of the target file. In summary, I think your best bet is to use wget or another tool that supports URL rewriting. Otherwise you will need something more capable than sed alone, which lacks the context needed to convert each link correctly.
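
If the pages are already downloaded and re-scraping is not an option, a small script can supply the context that sed lacks. The following is only a sketch under several assumptions: the function names relativize and rewrite_tree are made up for this example, the files are UTF-8 and end in .html, the directory layout on disk mirrors the URL paths, and a regular expression is good enough to find the href attributes (a real HTML parser would be more robust). It touches href only, not src or CSS url() references.

```python
import posixpath
import re
from pathlib import Path

# Matches href="/..." or href='/...', but not protocol-relative href="//host/...".
HREF_RE = re.compile(r'''(href\s*=\s*)(["'])(/(?!/)[^"']*)\2''', re.IGNORECASE)


def relativize(html: str, page_path: str) -> str:
    """Rewrite root-relative hrefs in html so they are relative to page_path.

    page_path is the page's own path within the site, e.g. "/foo/bar.html".
    """
    page_dir = posixpath.dirname(page_path) or "/"

    def fix(match: re.Match) -> str:
        prefix, quote, target = match.groups()
        # Split off any query string or fragment; only the path is relativized.
        cut = len(target)
        for sep in "?#":
            idx = target.find(sep)
            if idx != -1:
                cut = min(cut, idx)
        path, suffix = target[:cut], target[cut:]
        rel = posixpath.relpath(path, start=page_dir)
        # relpath drops a trailing slash; restore it for directory links.
        if path.endswith("/") and not rel.endswith("/"):
            rel += "/"
        return f"{prefix}{quote}{rel}{suffix}{quote}"

    return HREF_RE.sub(fix, html)


def rewrite_tree(root: Path) -> None:
    """Rewrite every .html file under root in place (work on a copy first)."""
    for file in root.rglob("*.html"):
        page_path = "/" + file.relative_to(root).as_posix()
        text = file.read_text(encoding="utf-8")
        file.write_text(relativize(text, page_path), encoding="utf-8")
```

Calling rewrite_tree(Path("mirror")) on a copy of the download directory (the name mirror is a placeholder) would process all pages in one pass. Note that links to a directory, such as href="/docs/", will not open an index page automatically when browsed from disc, so those may still need manual attention.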
