Given a URL, I want to extract the domain name (it should not include the ‘www’ part). The URL can contain http/https. Here is the Java code that I wrote. Though it seems to work fine, is there a better approach, or are there edge cases that could fail?
    public static String getDomainName(String url) throws MalformedURLException {
        if (!url.startsWith("http") && !url.startsWith("https")) {
            url = "http://" + url;
        }
        URL netUrl = new URL(url);
        String host = netUrl.getHost();
        if (host.startsWith("www")) {
            host = host.substring("www".length() + 1);
        }
        return host;
    }
Input: http://google.com/blah
Output: google.com
If you want to parse a URL, use `java.net.URI`. `java.net.URL` has a bunch of problems: its `equals` method does a DNS lookup, which means code using it can be vulnerable to denial-of-service attacks when used with untrusted inputs. "Mr. Gosling -- why did you make url equals suck?" explains one such problem. Just get in the habit of using `java.net.URI` instead.
Parsing the input with `java.net.URI`, calling `getHost()`, and stripping a leading `www.` should do what you want.
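A minimal sketch of that approach, assuming you are happy to get `null` back when the input has no host (for example, a relative URL). The class name `DomainNames` and the null-handling are illustrative choices, not part of the question:

```java
import java.net.URI;
import java.net.URISyntaxException;

final class DomainNames {
    /**
     * Returns the host of the given URL with a leading "www." removed,
     * or null when the parsed URI has no host component.
     */
    static String getDomainName(String url) throws URISyntaxException {
        URI uri = new URI(url);
        String host = uri.getHost();
        if (host == null) {
            // Relative references such as "www/foo" have no host at all.
            return null;
        }
        // Match "www." including the dot, so "wwwexample.com" is left alone.
        return host.startsWith("www.") ? host.substring("www.".length()) : host;
    }
}
```

With this sketch, `DomainNames.getDomainName("http://google.com/blah")` returns `google.com`, matching the example in the question.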
Your code as written fails for these valid URLs:
- `httpfoo/bar`: a relative URL with a path component that starts with `http`.
- `HTTP://example.com/`: the protocol is case-insensitive.
- `//example.com/`: a protocol-relative URL with a host.
- `www/foo`: a relative URL with a path component that starts with `www`.
- `wwwexample.com`: a domain name that does not start with `www.` but does start with `www`.

Hierarchical URLs have a complex grammar. If you try to roll your own parser without carefully reading RFC 3986, you will probably get it wrong. Just use the one that’s built into the core libraries.
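To see how the built-in parser treats inputs like the ones listed above, a small helper can print the components that `java.net.URI` extracts. This is only an illustrative sketch; the class name `UriEdgeCases` and the `describe` helper are made up for the example:

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriEdgeCases {
    /** Summarizes how java.net.URI splits the input into scheme, host and path. */
    static String describe(String input) throws URISyntaxException {
        URI uri = new URI(input);
        // getScheme() and getHost() return null when that component is absent.
        return "scheme=" + uri.getScheme()
                + " host=" + uri.getHost()
                + " path=" + uri.getPath();
    }
}
```

For the relative inputs `httpfoo/bar` and `www/foo`, the parser reports no scheme and no host, so there is no domain to extract. For `//example.com/` it reports the host `example.com` even though no scheme is present.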
If you really need to deal with messy inputs that `java.net.URI` rejects, see RFC 3986 Appendix B: