I’m trying to cut the site name out of a web link.
Say, for example, I want to take http://site.com/path/to/site.html and print out just ‘site.com’ or ‘http://site.com’.
This is the closest I can figure out but it’s not working right:
echo "https://site.com/shisad/sadh" | sed -n "s/.*\(http.*\/\).*/\1/p"
which prints: https://site.com/shisad/
I think it’s something I’m doing wrong with the special character ‘/’. Any ideas?
When you’re using `sed` to match path names, or other patterns containing slashes, use a character other than slash to delimit the regular expression; it makes life a lot easier.

The `.*` pattern is greedy; it matches the longest possible string, which is why your command runs on to the last slash in the input. You want a more constrained expression, such as `[^/]*`, which stops at the first slash after the host name.

To print out `http://site.com`, capture the scheme together with the `[^/]*` part. To print out `site.com`, capture only the `[^/]*` part. If you think you might have a site without the slash after the host name (so the input only contains `http://site.com`), the pattern must not insist on that slash.

Note that these patterns accept all sorts of punctuation characters as ‘valid’; you can be more discriminating if you wish by using, perhaps, `[-a-zA-Z0-9_.]*` in place of `[^/]*`, but beware of internationalized domain names. The two-pattern version doesn’t stop at a blank after the URL; it would include the close parenthesis of `(http://example.com)`. This is a corollary of the point about which characters are valid.