How would you extract the domain name from a URL, excluding any subdomains?
My initial simplistic attempt was:
'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])
This works for http://www.foo.com, but not http://www.foo.com.au.
Is there a way to do this properly without using special knowledge about valid TLDs (Top Level Domains) or country codes (because they change)?
Thanks.
No, there is no “intrinsic” way of knowing that (e.g.) zap.co.it is a subdomain (because Italy’s registrar DOES sell domains such as co.it) while zap.co.uk isn’t (because the UK’s registrar DOESN’T sell domains such as co.uk, but only like zap.co.uk).
You’ll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK’s and Australia’s; there’s no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source that source will also change accordingly, one hopes!-).
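To make the "auxiliary table" idea concrete, here is a minimal sketch in Python 3 (so it uses urllib.parse rather than the urlparse module from the question). The names SAMPLE_SUFFIXES and registered_domain are hypothetical, and the table is a tiny hand-picked sample that simply encodes the examples discussed above. It is not a complete or current list; in practice you would load the table from a maintained source such as the Public Suffix List.

```python
from urllib.parse import urlparse

# Hypothetical, hand-picked sample of suffixes under which domains are sold.
# It only mirrors the examples in this thread and is NOT complete or current.
SAMPLE_SUFFIXES = {"com", "org", "net", "it", "co.uk", "com.au"}


def registered_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return the suffix plus one extra label, or None if nothing matches."""
    host = urlparse(url).hostname or ""
    labels = host.lower().split(".")
    # Walk from the longest candidate suffix to the shortest, so that
    # "co.uk" is matched before a bare "uk" would be.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the host is itself a bare suffix
            return ".".join(labels[i - 1:])
    return None  # suffix not in the table: we cannot know


if __name__ == "__main__":
    for u in ("http://www.foo.com", "http://www.foo.com.au", "http://zap.co.it"):
        print(u, "->", registered_domain(u))
```

Returning None for an unknown suffix is a deliberate choice in this sketch: guessing "last two labels" is exactly what breaks on hosts like www.foo.com.au, so it seems safer to report that the table has no answer.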