I’m quite new to python. I’m trying to parse a file of URLs to

Question

Editorial Team

Asked: June 18, 2026 (2026-06-18T03:07:48+00:00)

I’m quite new to Python. I’m trying to parse a file of URLs so that only the domain name is left.

Some of the URLs in my log file begin with http:// and some begin with www. Some begin with both.

This is the part of my code that strips the http:// part. What do I need to add so that it looks for both http:// and www. and removes both?

line = re.findall(r'(https?://\S+)', line)

Currently, when I run the code, only http:// is stripped. If I change the code to the following:

line = re.findall(r'(https?://www.\S+)', line)

Only domains starting with both are affected.
I need the code to be more conditional.
TIA

Edit: here is my full code.

import re
import sys
from urlparse import urlparse

f = open(sys.argv[1], "r")

for line in f.readlines():
 line = re.findall(r'(https?://\S+)', line)
 if line:
  parsed=urlparse(line[0])
  print parsed.hostname
f.close()

I mistagged my original post as regex. It is in fact using urlparse.

1 Answer

Editorial Team · Answer 1 · 2026-06-18T03:07:50+00:00

You can do without regexes here.

with open("file_path","r") as f:
    lines = f.read()
    lines = lines.replace("http://","")
    lines = lines.replace("www.", "") # May replace some false positives ('www.com')
    urls = [url.split('/')[0] for url in lines.split()]
    print '\n'.join(urls)

Example file input:

http://foo.com/index.html
http://www.foobar.com
www.bar.com/?q=res
www.foobar.com

Output:

foo.com
foobar.com
bar.com
foobar.com
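If you would rather keep urlparse, as in the question's code, a rough sketch might look like the following. It is written for Python 3, where the module is urllib.parse, and the names domain_of and sample are made up for illustration. It assumes one URL per whitespace-separated token and that entries without a scheme start directly with the host.

```python
# Python 3; in Python 2 the import is `from urlparse import urlparse`.
from urllib.parse import urlparse


def domain_of(url):
    """Return the hostname of `url` without a leading 'www.' (hypothetical helper)."""
    # urlparse only recognises the host when a scheme or '//' precedes it,
    # so prepend '//' to bare entries such as 'www.bar.com/?q=res'.
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    # Strip 'www.' only when it is a prefix of the hostname.
    if host.startswith("www."):
        host = host[len("www."):]
    return host


# Same sample input as the example file above.
sample = """http://foo.com/index.html
http://www.foobar.com
www.bar.com/?q=res
www.foobar.com"""

domains = [domain_of(u) for u in sample.split()]
print("\n".join(domains))
```

Because the check is a prefix test on the hostname, a name such as foobarwww.com is left as it is.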

Edit:

There could be a tricky URL like foobarwww.com, in which case the approach above would strip the "www." from the middle of the domain and leave foobarcom. To avoid that, we have to go back to using a regex.

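Replace the line lines = lines.replace("www.", "") with lines = re.sub(r'www\.(?!com)', '', lines), and add import re at the top of the script. The dot is escaped so that it matches a literal "." rather than any character. Of course, every possible TLD would have to be listed in the negative lookahead for this to be complete.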
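Another way to avoid listing TLDs, offered only as a sketch, is to anchor the pattern at the start of each URL so that "www." is removed only when it is a prefix. The names PREFIX and strip_prefix are hypothetical, and the lookahead is an extra assumption on my part: it strips "www." only if another dot follows in the host, so www.com itself is kept.

```python
import re

# Optional scheme, then an optional 'www.' that is only consumed when
# another dot appears later in the host part (before any '/').
PREFIX = re.compile(r'^(?:https?://)?(?:www\.(?=[^/]*\.))?')


def strip_prefix(url):
    """Drop the scheme and a leading 'www.', then keep only the host part."""
    return PREFIX.sub('', url).split('/')[0]


for url in ["http://www.foobar.com", "www.bar.com/?q=res", "foobarwww.com", "www.com"]:
    print(strip_prefix(url))
```

Since the pattern starts with ^, it has to be applied to each URL separately (for example inside the list comprehension) rather than to the whole file contents at once.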
