I’m quite new to Python. I’m trying to parse a file of URLs to leave only the domain name.
Some of the URLs in my log file begin with http://, some begin with www., and some begin with both.
This is the part of my code which strips the http:// part. What do I need to add to it so that it looks for both http:// and www. and removes both?
line = re.findall(r'(https?://\S+)', line)
Currently, when I run the code, only http:// is stripped. If I change the code to the following:
line = re.findall(r'(https?://www.\S+)', line)
Only domains starting with both are affected.
I need the code to be more conditional.
TIA
Edit: here is my full code:
import re
import sys
from urlparse import urlparse

f = open(sys.argv[1], "r")
for line in f.readlines():
    line = re.findall(r'(https?://\S+)', line)
    if line:
        parsed = urlparse(line[0])
        print parsed.hostname
f.close()
I mistagged my original post as regex. It is indeed using urlparse.
You can do without regexes here.
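The snippet that accompanied this answer is not reproduced here. As a rough sketch of the idea, assuming Python 3 (the question's code is Python 2) and a hypothetical helper name `strip_prefixes`, plain string replacement might look like this:

```python
def strip_prefixes(text):
    """Remove scheme and 'www.' prefixes using plain string replacement.

    strip_prefixes is a hypothetical helper name, not from the original answer.
    """
    # str.replace removes every occurrence of the substring, not just a leading one.
    for prefix in ("https://", "http://", "www."):
        text = text.replace(prefix, "")
    return text


# Made-up sample lines covering the three cases from the question.
sample = "http://www.example.com\nhttp://example.org\nwww.example.net\n"
print(strip_prefixes(sample))
```

Because `replace` works on the whole string, the same call can be applied to the full file contents at once or to each line in turn.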
Example file input:
Output:
Edit:
There could be a tricky URL like foobarwww.com, and the above approach would strip the www. from it. We would then have to fall back on regexes.
Replace the line
lines = lines.replace("www.", "")
with
lines = re.sub(r'(www\.)(?!com)', r'', lines)
Of course, every possible TLD should be used for the not-match pattern.
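To illustrate the negative lookahead with more than one TLD, here is a small sketch in Python 3. The `TLDS` tuple is a deliberately incomplete, hypothetical list, `strip_www` is a made-up helper name, and the `\b` after the TLD group is an added assumption so that a host such as www.community.org is not mistaken for a `.com` ending:

```python
import re

# Hypothetical, incomplete TLD list; extend it for real data.
TLDS = ("com", "org", "net")

# Match "www." only when it is NOT immediately followed by a bare TLD.
# The \b keeps "www.community.org" from counting as "www." + "com".
WWW_PREFIX = re.compile(r"www\.(?!(?:%s)\b)" % "|".join(TLDS))


def strip_www(text):
    """Remove 'www.' unless it is directly followed by a TLD."""
    return WWW_PREFIX.sub("", text)


print(strip_www("www.example.com"))
print(strip_www("foobarwww.com"))
```

The first call drops the leading www., while the second leaves foobarwww.com untouched because the www. there is followed directly by com.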