I have a URL which can be in any of the following formats:
http://example.com
https://example.com
http://example.com/foo
http://example.com/foo/bar
www.example.com
example.com
foo.example.com
www.foo.example.com
foo.bar.example.com
http://foo.bar.example.com/foo/bar
example.net/foo/bar
Essentially, I need to be able to match any normal URL. How can I extract example.com (or .net, or whatever the TLD happens to be; I need this to work with any TLD) from all of these via a single regex?
Well, you can use parse_url to get the host. Then you can do some fancy stuff to get only the TLD and the host name.
Not very elegant, but should work.
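The approach could be sketched roughly as follows. This is a minimal sketch, not the original answer's code: the function name base_host is made up for illustration, and the scheme-prepending step is an added assumption, because parse_url only reports a host when the URL has a scheme, which inputs like www.example.com or example.net/foo/bar lack.

```php
<?php
// Hypothetical helper; the name base_host is an assumption for illustration.
function base_host(string $url): ?string
{
    // parse_url() only fills in the host when a scheme is present, so
    // prepend one for inputs such as "www.example.com" or "example.net/foo/bar".
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }

    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // no host could be found
    }

    // Split the host on periods and count the pieces.
    $parts = explode('.', $host);
    $count = count($parts);
    if ($count < 2) {
        return $host; // single-label host such as "localhost"
    }

    // Second-to-last label plus the last label (the TLD).
    return $parts[$count - 2] . '.' . $parts[$count - 1];
}
```

For example, base_host('http://foo.bar.example.com/foo/bar') should give example.com and base_host('example.net/foo/bar') should give example.net. Since it treats only the last label as the TLD, a host like example.co.uk would come back as co.uk.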
If you want an explanation, here it goes:
First we grab everything between the scheme (http://, etc.) and the path, by using parse_url's capabilities to... well... parse URLs. 🙂
Then we take the host name and separate it into an array based on where the periods fall, so test.world.hello.myname would become an array of four strings: test, world, hello, and myname.
After that, we take the number of elements in the array (4).
Then we subtract 2 from that count to get the index of the second-to-last string (the hostname, or example, in your example).
Then we subtract 1 from the count to get the index of the last string (because array keys start at 0), also known as the TLD.
Then we combine those two parts with a period, and you have your base host name.
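To make the index arithmetic concrete, here is the walkthrough traced on the test.world.hello.myname example. It starts from a host name that has already been extracted, and the variable names are chosen only for illustration.

```php
<?php
$host  = 'test.world.hello.myname';

// Split on periods: ['test', 'world', 'hello', 'myname']
$parts = explode('.', $host);

// Number of elements in the array: 4
$count = count($parts);

// 4 - 2 = index 2, the second-to-last string: 'hello'
$name = $parts[$count - 2];

// 4 - 1 = index 3, the last string (keys start at 0): 'myname'
$tld = $parts[$count - 1];

// Join the two pieces with a period.
$base = $name . '.' . $tld;

echo $base, PHP_EOL; // hello.myname
```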