The problem I’m facing is badly named links… There are a few hundred bad links


Asked: June 10, 2026 (2026-06-10T13:11:13+00:00)


The problem I’m facing is badly named links.
There are a few hundred bad links in different files.

So I want to write a bash script that replaces links like

<a href="../../../external.html?link=http://www.twitter.com">
<a href="../../external.html?link=http://www.facebook.com/pages/somepage/">
<a href="../external.html?link=http://www.tumblr.com/">

with direct links like

<a href="http://www.twitter.com">

I know the pattern ../ repeats one or more times. The external.html?link= part should also be removed.

What would you recommend for this? awk, sed, maybe Python?
Will I need a regex?

Thanks for your opinions.

1 Answer

Editorial Team · Answer 1 · 2026-06-10T13:11:15+00:00

This could be a place where regular expressions are the correct solution. You are only searching for text in attributes, and the contents are regular, fitting a pattern.

The following Python regular expression would locate these links for you:

r'href="((?:\.\./)+external\.html\?link=)([^"]+)"'

The pattern we look for is something inside an href="" chunk of text, where that ‘something’ starts with one or more instances of ../, followed by external.html?link=, followed by any text that does not contain a " quote.

The matched text after the equals sign is captured in group 2 for easy retrieval; group 1 holds the ../../external.html?link= part.
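To make the two groups concrete, here is a small sketch that runs the pattern above against the three sample links from the question. The names LINK_RE, samples and split_link are made up for this illustration and are not part of the original answer.

```python
import re

# Pattern quoted from the answer: group 1 is the redirect prefix,
# group 2 is the target URL.
LINK_RE = re.compile(r'href="((?:\.\./)+external\.html\?link=)([^"]+)"')

# Sample links taken from the question.
samples = [
    '<a href="../../../external.html?link=http://www.twitter.com">',
    '<a href="../../external.html?link=http://www.facebook.com/pages/somepage/">',
    '<a href="../external.html?link=http://www.tumblr.com/">',
]


def split_link(tag):
    """Return (prefix, target) for a redirect link, or None if nothing matches."""
    match = LINK_RE.search(tag)
    return match.groups() if match else None


for tag in samples:
    print(split_link(tag))
```

For the first sample, group 1 is ../../../external.html?link= and group 2 is http://www.twitter.com.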

If all you want to do is remove the ../../external.html?link= part altogether (so the links point directly to the endpoint instead of going via the redirect page), leave off the first group and do a simple .sub() on your HTML files:

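import re

# Match the redirect prefix and capture only the target URL (group 1).
redirects = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')

# ...
# somehtmlstring holds the contents of one HTML file.
cleaned = redirects.sub(r'href="\1"', somehtmlstring)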
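Since the question mentions a few hundred links spread over different files, the substitution could be applied to a whole directory tree along these lines. This is only a sketch: the helper names rewrite_links and rewrite_tree, the *.html glob and the UTF-8 encoding are assumptions, not something stated in the question. It rewrites files in place, so try it on a copy first.

```python
import re
from pathlib import Path

# Same pattern as in the answer: only the target URL is captured (group 1).
REDIRECTS = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')


def rewrite_links(html):
    """Return (new_html, number_of_links_rewritten)."""
    return REDIRECTS.subn(r'href="\1"', html)


def rewrite_tree(root, pattern="*.html", encoding="utf-8"):
    """Rewrite redirect links in place in every matching file under root.

    Assumption: files are named *.html and encoded as UTF-8.
    Returns the total number of links rewritten.
    """
    total = 0
    for path in Path(root).rglob(pattern):
        original = path.read_text(encoding=encoding)
        updated, count = rewrite_links(original)
        if count:  # only touch files that actually changed
            path.write_text(updated, encoding=encoding)
            total += count
    return total
```

Calling rewrite_tree("site") would process every .html file below a (hypothetical) site directory and report how many links were changed.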

Note that this could also match body text outside HTML tags; this is not an HTML-aware solution. Chances are there is no such body text, though. If there is, you’ll need a full-blown HTML parser like BeautifulSoup or lxml instead.
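Short of a full parser, one possible middle ground is to require the href to sit inside an opening <a ...> tag. This variant is not from the original answer. It is still plain regex, so it can be fooled by comments, scripts or unusual quoting, and it assumes double-quoted attributes. The names ANCHORED and rewrite_anchor_links are made up for this sketch.

```python
import re

# Assumption: the links of interest always live in the href of an <a> tag
# and use double quotes. Group 1 keeps everything up to and including
# href=", group 2 is the target URL.
ANCHORED = re.compile(
    r'(<a\b[^>]*?\bhref=")(?:\.\./)+external\.html\?link=([^"]+)"'
)


def rewrite_anchor_links(html):
    """Rewrite redirect links only when they appear inside an <a ...> tag."""
    return ANCHORED.sub(r'\1\2"', html)


sample = (
    '<p>href="../external.html?link=http://a.example/"</p>'
    '<a class="x" href="../../external.html?link=http://www.twitter.com">t</a>'
)
print(rewrite_anchor_links(sample))
```

In the sample, the href text inside the paragraph is left alone while the real link is rewritten, and other attributes on the tag are preserved.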
