I’m writing a script to grab the URLs from my blog posts and run curl -I over them so I can check that they still work. However, I am having trouble writing the grep pattern.
<p><a href="http://example.com/fujipol/2004/may/5/16:10:47/400x345">foobar</a></p>
So here I want just http://example.com/fujipol/2004/may/5/16:10:47/400x345.
Or in markdown like:
[Example markdown link](https://example.com)
Here I want https://example.com.
<http://example.com/?foo=bar>
In this case I need http://example.com/?foo=bar
I created a file with the links from your examples, then grepped it with a regular expression and got all the URLs out of it.
Done.
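A minimal sketch of those two steps, assuming a hypothetical file name links.txt and a pattern rebuilt from the description in this answer (the exact command may have differed). The slashes are left unescaped here because / has no special meaning in a grep pattern:

```shell
# Hypothetical sample file holding the three link styles from the question.
cat > links.txt <<'EOF'
<p><a href="http://example.com/fujipol/2004/may/5/16:10:47/400x345">foobar</a></p>
[Example markdown link](https://example.com)
<http://example.com/?foo=bar>
EOF

# Assumed pattern (the variable name is illustrative):
#   https?://    matches http:// or https://
#   [^ "()<>]*   matches everything up to a space, quote, parenthesis or angle bracket
pattern='https?://[^ "()<>]*'

# -E: extended regular expressions, -o: print only the matched part of each line.
grep -Eo "$pattern" links.txt
# prints:
#   http://example.com/fujipol/2004/may/5/16:10:47/400x345
#   https://example.com
#   http://example.com/?foo=bar
```

Each matched URL lands on its own line, so the output can be fed line by line to curl -I.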
What we’ve done here is match http(s) (a URL can start with http:// or https://), then match // and escape it. Finally, we match a sequence of characters that are not a space, ", (, ), < or >.
The whole problem in tasks like this is deciding where the section we need starts (http(s):// in this case) and where it ends (a space, ", (, ), < or >).
Frankly speaking, this solution is not perfect. The URL standards say much more about which characters a URL can and cannot include, so sooner or later you will find that the regex in my answer is not strictly valid. But for the cases you described it works well.
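To see where this kind of pattern falls short, here is a small sketch. It assumes a pattern rebuilt from the description above (the exact original may differ) and uses made-up example.com URLs:

```shell
# Assumed pattern; the variable name and sample URLs are illustrative only.
pattern='https?://[^ "()<>]*'

# A URL that legitimately contains parentheses is cut off at the first "(".
printf '%s\n' 'See https://example.com/wiki/Foo_(bar) for details' | grep -Eo "$pattern"
# prints: https://example.com/wiki/Foo_

# Trailing sentence punctuation is kept, because "," is not in the excluded set.
printf '%s\n' 'Visit https://example.com/page, then come back' | grep -Eo "$pattern"
# prints: https://example.com/page,
```

For links embedded in HTML attributes, Markdown parentheses or angle brackets, as in the question, neither case comes up, which is why the simple pattern is good enough there.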