I’m writing a script to grab the URLs from my blog posts and run


Asked: May 27, 2026 (2026-05-27T12:19:33+00:00)


I’m writing a script to grab the URLs from my blog posts and run curl -I over them so I can check they are still good. However, I am having trouble writing the grep pattern.

<p><a href="http://example.com/fujipol/2004/may/5/16:10:47/400x345">foobar</a></p>

So here I want just http://example.com/fujipol/2004/may/5/16:10:47/400x345.

Or in markdown like:

[Example markdown link](https://example.com)

Here I want https://example.com

<http://example.com/?foo=bar>

In this case I need http://example.com/?foo=bar


1 Answer

Editorial Team · Answer 1 · 2026-05-27T12:19:33+00:00

I created a file with the links from your examples:

$> cat ./text
<p><a href="http://example.com/fujipol/2004/may/5/16:10:47/400x345">foobar</a></p>
[Example markdown link](https://example.com)
<http://example.com/?foo=bar>
<a href="http://people.debian.org/~dilinger/backports/wordpress">http://people.debian.org/~dilinger/backports/wordpress</a>

Then I grepped it with a regular expression and got all the URLs out of it:

$> grep --only-matching --perl-regexp "http(s?):\/\/[^ \"\(\)\<\>]*" ./text
http://example.com/fujipol/2004/may/5/16:10:47/400x345
https://example.com
http://example.com/?foo=bar
http://people.debian.org/~dilinger/backports/wordpress
http://people.debian.org/~dilinger/backports/wordpress

Done.

http(s?):\/\/[^ \"\(\)\<\>]*

What we’ve done here is match http(s?) (a URL can start with http:// or https://), then match :// with the slashes escaped (the escaping is harmless, but grep does not require it). Finally, we match a sequence of characters that are not a space, ", (, ), <, or >.
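The same pattern can be written with less escaping. This is only a minimal sketch, assuming a grep that supports -o, -h and -E (GNU and BSD grep both do); the function name extract_urls is made up for illustration. Inside single quotes and a bracket expression, none of the characters need a backslash:

```shell
#!/bin/sh
# extract_urls FILE...: print every http:// or https:// URL found, one per line.
#   https?       "http" followed by an optional "s"
#   ://          a literal colon and two slashes
#   [^ "()<>]*   any run of characters that are not space, ", (, ), < or >
# -o prints only the matched text, -h suppresses file name prefixes.
extract_urls() {
    grep -ohE 'https?://[^ "()<>]*' "$@"
}
```

Running extract_urls ./text on the sample file should print the same five lines as the command above.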

The whole problem in tasks like this is deciding where the section we need starts (http(s):// in this case) and where it ends (a space, ", (, ), <, or >).
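To see how much the choice of end characters matters, here is a small sketch with two invented input lines (the example.com paths are made up, and the pattern is the same one rewritten for grep -E). One URL contains a terminator character, the other is followed by punctuation that is not a terminator:

```shell
#!/bin/sh
# Same pattern as in the answer, written for grep -E.
pattern='https?://[^ "()<>]*'

# A URL that legitimately contains parentheses is cut at the first "(".
cut_short=$(printf '%s\n' 'See http://example.com/a_(b) for details' | grep -oE "$pattern")

# A full stop that ends the sentence is not a terminator, so it is kept.
too_long=$(printf '%s\n' 'Read http://example.com/page.' | grep -oE "$pattern")

printf '%s\n' "$cut_short"   # http://example.com/a_
printf '%s\n' "$too_long"    # http://example.com/page.
```

Whether either case matters depends on what the blog posts actually contain; adding or removing characters in the bracket expression shifts the trade-off one way or the other.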

Frankly speaking, this solution is not perfect. The URL standards say much more about which characters a URL can and cannot contain, so sooner or later you will find a URL for which the regex in my answer gives the wrong result. But for the cases you described, it works well.
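For the original goal of running curl -I over the links, the extraction step could be wired up roughly as follows. This is a sketch under assumptions, not a finished checker: the function names unique_urls and check_urls are hypothetical, curl is assumed to be installed, and the 10 second timeout is an arbitrary choice. Some servers also answer HEAD requests differently from GET, so an odd status is not always a dead link.

```shell
#!/bin/sh
# unique_urls FILE...: extract URLs and drop duplicates
# (the sample file lists the debian.org link twice).
unique_urls() {
    grep -ohE 'https?://[^ "()<>]*' "$@" | sort -u
}

# check_urls FILE...: send a HEAD request to each URL and print "STATUS URL".
check_urls() {
    unique_urls "$@" | while IFS= read -r url; do
        # -s silent, -I HEAD request, -o /dev/null discards the headers,
        # -w prints only the HTTP status code; 000 means the request failed.
        status=$(curl -s -I -o /dev/null -w '%{http_code}' --max-time 10 "$url")
        printf '%s %s\n' "$status" "$url"
    done
}
```

Calling check_urls ./text (or check_urls on several post files at once) would then print one status line per distinct URL.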
