I have a series of strings (URLs) in different forms, such as:
http://domain name.anything/anypath
https://domain name.anything/anypath
http://www.domain name.anything/anypath
https://www.domain name.anything/anypath
These strings are saved in a CSV file. I need to parse every URL to get only the domain name, domain name.anything, i.e. the part after the www. (when present) and before the first / of the path.
I separated the strings using the split method, converted each string to a URL, and then called getAuthority to get only the domain name. The problem is that getAuthority and getHost do the same job for me: both include the www. that I don't want. Yet from the Oracle tutorial, it seems that getAuthority is supposed to return the domain name without the www. prefix.
How can I extract only the domain name part of the URL, without the www.?
To really understand this, you should read the URI specification, RFC 2396.
The short answer is that the authority component consists of the host component together with an optional port number, username and password, depending on the URL scheme that is used. Neither component treats "www." specially: it is simply part of the host name, so both methods return it.
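To make the difference between the two components concrete, here is a minimal sketch. It assumes java.net.URI rather than java.net.URL (the accessors getAuthority() and getHost() have the same names on both), and the URL, class name and helper names are made up for illustration.

```java
import java.net.URI;

public class AuthorityVsHost {
    // Authority: optional user info, the host, and an optional port.
    static String authorityOf(String url) {
        return URI.create(url).getAuthority();
    }

    // Host: just the host name, with "www." kept if the URL contains it.
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    public static void main(String[] args) {
        // Hypothetical URL with user info and a port, for illustration only.
        String url = "https://user@www.example.com:8443/anypath";
        System.out.println(authorityOf(url)); // user@www.example.com:8443
        System.out.println(hostOf(url));      // www.example.com
    }
}
```

For URLs without user info or a port, such as the ones in the question, the two values come out identical, which matches what the question describes.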
You call getHost(), test if it starts with the string "www.", and if it does, you remove it.
But before you start doing things like that, you need to understand that removing the "www." may give you a URL that doesn't work, or that resolves to a document or service that is different from the one the original URL resolves to. It is a bad idea to gratuitously tidy up URLs unless you have detailed knowledge of how the sites in question are organized.
The convention that “foo.com” and “www.foo.com” are the same place is just a convention, and a lot of sites don’t implement it. Removing “www.” would be a bad idea because it is liable to turn resolvable URLs into URLs that don’t resolve.
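If the prefix is still to be stripped despite these caveats, for example only to produce a label and not a URL that will be fetched, the getHost() approach could look roughly like the sketch below. It assumes java.net.URI instead of java.net.URL, uses a hypothetical helper named domainOf, and assumes each CSV field is a syntactically valid URL (URI.create throws IllegalArgumentException otherwise, for instance if the host contains a space).

```java
import java.net.URI;

public class DomainExtractor {
    // Returns the host of the given URL with a leading "www." removed,
    // or null when the URL has no host component.
    static String domainOf(String url) {
        String host = URI.create(url.trim()).getHost();
        if (host == null) {
            return null;
        }
        // Host names are case-insensitive, so compare the prefix ignoring case.
        if (host.regionMatches(true, 0, "www.", 0, 4)) {
            return host.substring(4);
        }
        return host;
    }

    public static void main(String[] args) {
        // Hypothetical inputs standing in for the fields read from the CSV file.
        String[] urls = {
            "http://example.com/anypath",
            "https://www.example.com/anypath"
        };
        for (String url : urls) {
            System.out.println(domainOf(url)); // example.com in both cases
        }
    }
}
```

Only a leading "www." is removed; any other subdomain is left untouched, because the code has no way of knowing which labels belong to the registered domain.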