I have about 150 HTML files in a given directory that I’d like to make some changes to. Some of the anchor tags have an href along the following lines: index.php?page=something. I’d like all of those to be changed to something.html. Simple regex, simple script. I can’t seem to get it correct, though. Can somebody weigh in on what I’m doing wrong?
Sample HTML, before and after:
<!-- Before -->
<ul>
<li><a href="#">Apple</a></li>
<li><a href="index.php?page=dandelion">Dandelion</a></li>
<li><a href="index.php?page=elephant">Elephant</a></li>
<li><a href="index.php?page=resonate">Resonate</a></li>
</ul>
<!-- After -->
<ul>
<li><a href="#">Apple</a></li>
<li><a href="dandelion.html">Dandelion</a></li>
<li><a href="elephant.html">Elephant</a></li>
<li><a href="resonate.html">Resonate</a></li>
</ul>
Script file:
#! /bin/bash
for f in *.html
do
sed s/\"index\.php?page=\([.]*\)\"/\1\.html/g < $f >! $f
done
It’s your regex, and the fact that the shell is trying to interpret bits of your regex.
First, the [.]* matches any number of literal dots. Change it to .* instead.
Secondly, enclose the entire regex in single quotes (') to prevent the bash shell from interpreting any of it.
Also, instead of < $f >! $f you can just pass the -i switch to sed to have it operate in place.
(As another point, I think in your replacement you want double quotes around the \1.html so that the new URL is quoted within the HTML. I also quoted your $f as "$f", because if the file name contains spaces bash will split it into separate arguments.)
EDIT: as @TimPote notes, the standard way to match something within quotes is either ".*?" (so that the .* is non-greedy) or "[^"]+". Sed doesn't support the former, so use the latter. This is to prevent (for example) <a href="index.php?page=asdf">"asdf"</a> from being turned into <a href="asdf">"asdf.html"</a> (where the (.*) captured asdf">"asdf, being greedy).
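Putting those points together, the script might look something like the sketch below. It is not the exact command from the answer: the names subst and rewrite_links are made up here for illustration, [^"][^"]* is used as a portable basic-regex spelling of "[^"]+", and -i with no suffix assumes GNU sed (BSD/macOS sed expects -i '' instead).

```shell
#!/bin/bash
# Substitution kept in a variable (name chosen for this sketch) so it can be
# tried on a single line before touching any files.
# - single quotes stop bash from interpreting the regex
# - [^"][^"]* matches one or more non-quote characters, so the capture
#   cannot run past the closing quote of the href
# - the replacement re-adds the double quotes around the new URL
subst='s/"index\.php?page=\([^"][^"]*\)"/"\1.html"/g'

# Hypothetical helper: rewrite every .html file in the given directory.
rewrite_links() {
    local dir=$1 f
    for f in "$dir"/*.html
    do
        [ -e "$f" ] || continue   # glob matched nothing, skip the literal pattern
        sed -i "$subst" "$f"      # assumes GNU sed; edits the file in place
    done
}

rewrite_links .
```

To check the expression on the tricky case first, pipe one line through it without -i: printf '%s\n' '<a href="index.php?page=asdf">"asdf"</a>' | sed "$subst"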