The problem I'm facing is badly named links.
There are a few hundred bad links in different files.
So I want to write a bash script to replace links such as
<a href="../../../external.html?link=http://www.twitter.com">
<a href="../../external.html?link=http://www.facebook.com/pages/somepage/">
<a href="../external.html?link=http://www.tumblr.com/">
with direct links like
<a href="http://www.twitter.com">
I know the pattern ../ repeats one or more times. The external.html?link= part should also be removed.
How would you recommend doing this? awk, sed, maybe Python?
Will I need a regex?
Thanks for your opinions.
This could be a place where regular expressions are the correct solution. You are only searching for text in attributes, and the contents are regular, fitting a pattern.
A Python regular expression can locate these links for you.
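A minimal sketch of such an expression, written to match the description that follows (two groups inside an href attribute). The names LINK_RE and sample are only illustrative, and the sample markup is taken from the question:

```python
import re

# Group 1: the run of "../" plus "external.html?link=".
# Group 2: everything after the equals sign, up to the closing quote.
LINK_RE = re.compile(r'href="((?:\.\./)+external\.html\?link=)([^"]+)"')

sample = (
    '<a href="../../../external.html?link=http://www.twitter.com">'
    '<a href="../external.html?link=http://www.tumblr.com/">'
)

for match in LINK_RE.finditer(sample):
    # Show the redirect prefix and the real target for each bad link.
    print(match.group(1), '->', match.group(2))
```

The non-capturing (?:\.\./)+ handles any depth of ../, and the dots and the question mark are escaped so they match literally.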
The pattern we look for is something inside a href="" chunk of text, where that 'something' starts with one or more instances of ../, followed by external.html?link=, then followed by any text that does not contain a " quote. The matched text after the equals sign is captured in group 2 for easy retrieval; group 1 holds the ../../external.html?link= part.

If all you want to do is remove the ../../external.html?link= part altogether (so the links point directly to the endpoint instead of going via the redirect page), leave off the first group and do a simple .sub() on your HTML files.

Note that this could also match body text (that is, text outside HTML tags); this is not an HTML-aware solution. Chances are there is no such body text, though. If there is, you'll need a full-blown HTML parser like BeautifulSoup or lxml instead.
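The substitution described above (dropping the first group and calling .sub()) could look roughly like this. REDIRECT_RE and strip_redirects are hypothetical names chosen for the sketch, and the pattern is deliberately not anchored to href, which is why it would also touch matching body text:

```python
import re

# Matches only the redirect prefix: one or more "../" then "external.html?link=".
REDIRECT_RE = re.compile(r'(?:\.\./)+external\.html\?link=')

def strip_redirects(html):
    """Return html with every redirect prefix removed, leaving the direct URL."""
    return REDIRECT_RE.sub('', html)

sample = '<a href="../../external.html?link=http://www.facebook.com/pages/somepage/">'
print(strip_redirects(sample))
# prints: <a href="http://www.facebook.com/pages/somepage/">
```

To process the actual files, read each file's text, pass it through strip_redirects, and write the result back; keeping a backup of the originals first is a sensible precaution.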