I’m trying to create a regular expression that will match the third instance of a / in a URL, so that only the website’s name itself is kept and nothing else.
So http://www.stackoverflow.com/questions/answers/help/ after being put through the regex will be http://www.stackoverflow.com
I’ve been playing about with them myself and come up with:
base_url = re.sub(r'[/].*', r'', url)
but all this does is reduce a link to http: – so it’s obvious I need to match the third instance of / – can anyone explain how I would do this?
Thanks!
I suggest you use urlparse for parsing URLs rather than a regex.
The .netloc attribute of the parsed result includes the port number if present (e.g. www.stackoverflow.com:80); if you don’t want the port number, use .hostname instead.
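A minimal sketch of this, assuming Python 3, where urlparse lives in urllib.parse (in Python 2 it is in the urlparse module). The helper names base_url, base_url_no_port and base_url_regex are made up for illustration. The last one is a regex alternative for comparison, not part of the suggestion above. Note that .hostname also lowercases the host.

```python
import re
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse


def base_url(url):
    """Scheme plus network location; keeps the port if one is present."""
    parts = urlparse(url)
    return parts.scheme + "://" + parts.netloc


def base_url_no_port(url):
    """Scheme plus hostname only; .hostname drops the port and lowercases."""
    parts = urlparse(url)
    # .hostname is None when the URL has no host, so fall back to "".
    return parts.scheme + "://" + (parts.hostname or "")


def base_url_regex(url):
    """Regex alternative: match up to, but not including, the third '/'."""
    match = re.match(r"[^/]*//[^/]*", url)
    return match.group(0) if match else url


url = "http://www.stackoverflow.com/questions/answers/help/"
print(base_url(url))
print(base_url_no_port("http://www.stackoverflow.com:80/questions/"))
print(base_url_regex(url))
```

The original re.sub(r'[/].*', r'', url) cuts at the first /, which is why only http: is left. The regex version above instead matches forward through the // and stops before the next /.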