I have to scrape a website, save all of its pages as HTML, and put the whole thing on a DVD. I’ve done this, but now all the links start with a /, so they resolve against my root directory. I would like to change every href in all files (1500 pages) so it starts with href="./" and always resolves against the working directory.
I’ve seen things about sed in bash, but I didn’t quite catch how to dynamically grab all hrefs and change them.
How could I do this in an efficient way?
As I said in my comment above, depending on which tool you’re using to scrape the site, you could start with checking whether it supports rewriting links.
`wget` will let you do exactly this by passing the `-k` (`--convert-links`) option, which converts the links in the downloaded pages so they are suitable for local viewing.

I don’t think Ugo Méda’s suggestion, the `base` tag, will work, since your URLs are absolute paths (they start with `/`), and the `base` tag only lets you specify a base for relative URLs.

Rewriting every `href` is tricky, since it’s hard to know you’re doing the right thing; it depends on the structure of the site. Consider a link to `/bar/baz.html` in the page `/foo/bar.html`. If you rewrite that per your suggestion, it becomes `./bar/baz.html`. But that won’t work, since the browser will resolve that to `/foo/bar/baz.html`, when the file is really at `[SOME DIR]/bar/baz.html`. In that case, you really want `../bar/baz.html`.

What I’m trying to say is that the correct (rewritten) URL always depends on the location of the current file and the location of the target file. In summary, I think your best bet is using `wget` or some other tool which supports URL rewriting; otherwise you will need a more advanced program than just `sed`, which lacks the context needed to correctly convert the link.
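
If you do end up writing that "more advanced program" yourself, here is a minimal Python sketch of the idea, not a finished tool. It assumes the saved files sit on disk in the same layout as the URL paths and are UTF-8 encoded, and it uses a regex where a real HTML parser would be sturdier. The names `relative_href`, `rewrite_html` and `rewrite_tree` are made up for this example. It only touches `href` values that start with a single `/`, and it does not handle links to directories (such as `/` or `/foo/`) specially.

```python
import posixpath
import re
from pathlib import Path

# Matches href="/..." or href='/...', but not protocol-relative
# links such as href="//cdn.example.com/x".
HREF_RE = re.compile(r'href=(["\'])/(?!/)([^"\']*)\1')


def relative_href(page_path: str, target_path: str) -> str:
    """Express target_path relative to the directory containing page_path.

    Both arguments are paths from the site root, e.g. "/foo/bar.html".
    """
    page_dir = posixpath.dirname(page_path)
    return posixpath.relpath(target_path, start=page_dir)


def rewrite_html(html: str, page_path: str) -> str:
    """Rewrite every root-absolute href in html, given where the page lives."""
    def fix(match: re.Match) -> str:
        quote = match.group(1)
        target = "/" + match.group(2)
        return f"href={quote}{relative_href(page_path, target)}{quote}"

    return HREF_RE.sub(fix, html)


def rewrite_tree(site_root: Path) -> None:
    """Rewrite all .html files under site_root in place (assumes UTF-8)."""
    for file in site_root.rglob("*.html"):
        # Path of this page as seen from the site root, e.g. "/foo/bar.html".
        page_path = "/" + file.relative_to(site_root).as_posix()
        text = file.read_text(encoding="utf-8")
        file.write_text(rewrite_html(text, page_path), encoding="utf-8")
```

For the example above, `relative_href("/foo/bar.html", "/bar/baz.html")` gives `../bar/baz.html`, which is the link you actually want. Run `rewrite_tree` on a copy of the scraped directory first, since it overwrites the files.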