I need to read through a log file, extracting all paths, and return a sorted list of the paths containing no duplicates. What’s the best way to do it? Using a set?
I thought about something like this:
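import re

def geturls(filename):
    s = set()  # creates an empty set?
    with open(filename) as f:  # the file is closed when the block exits
        for line in f:
            # see if the line matches some regex; the pattern here is an
            # assumption: the path is the token between GET and HTTP
            match = re.search(r'GET\s+(\S+)\s+HTTP', line)
            if match:
                s.add(match.group(1))
    return sorted(s)  # sorted() returns a new list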
EDIT
The items put in the set are path strings, which the function should return as a list sorted in alphabetical order.
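One detail about "alphabetical order" is worth checking: `sorted()` compares strings by code point, so uppercase letters sort before lowercase ones. The sketch below uses made-up paths (they do not come from the log) to show the difference between the default order and a case-insensitive one using `key=str.lower`.

```python
# Hypothetical paths for illustration only; "/favicon.ico" is repeated on purpose.
paths = {"/favicon.ico", "/Zebra.html", "/apple.html", "/favicon.ico"}

# Default ordering compares code points, so "Z" (90) sorts before "a" (97).
default_order = sorted(paths)

# Case-insensitive ordering, if that is what "alphabetical" should mean.
caseless_order = sorted(paths, key=str.lower)

print(default_order)
print(caseless_order)
```

If all paths in the log are lowercase, both orderings give the same result and plain `sorted(s)` is enough.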
EDIT 2
Here is some sample data:
10.254.254.28 - - [06/Aug/2007:00:12:20 -0700] "GET /keyser/22300/ HTTP/1.0" 302 528 "-" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4"
10.254.254.58 - - [06/Aug/2007:00:10:05 -0700] "GET /edu/languages/google-python-class/images/puzzle/a-baaa.jpg HTTP/1.0" 200 2309 "-" "googlebot-mscrawl-moma (enterprise; bar-XYZ; foo123@google.com,foo123@google.com,foo123@google.com,foo123@google.com)"
10.254.254.28 - - [06/Aug/2007:00:11:08 -0700] "GET /favicon.ico HTTP/1.0" 302 3404 "-" "googlebot-mscrawl-moma (enterprise; bar-XYZ;
The interesting parts are the URLs between GET and HTTP. Maybe I should have mentioned that this is part of an exercise, not real-world data.
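To make the idea concrete, here is a minimal sketch of the set-based approach run against the sample requests. It assumes each log entry sits on a single line in the real file and that the path is the non-space token between GET and HTTP. The names `PATH_RE` and `extract_paths` are hypothetical, the user-agent fields are trimmed for brevity, and the last sample line is a duplicate added only to show deduplication.

```python
import re

# Assumed pattern: the request path is the non-space token between GET and HTTP.
PATH_RE = re.compile(r'GET\s+(\S+)\s+HTTP')

def extract_paths(lines):
    """Return the unique request paths found in lines as a sorted list."""
    paths = set()
    for line in lines:
        match = PATH_RE.search(line)
        if match:
            paths.add(match.group(1))  # adding an existing path is a no-op
    return sorted(paths)

# Sample entries shortened from the data above; the final one is a repeat.
sample = [
    '10.254.254.28 - - [06/Aug/2007:00:12:20 -0700] "GET /keyser/22300/ HTTP/1.0" 302 528 "-"',
    '10.254.254.58 - - [06/Aug/2007:00:10:05 -0700] "GET /edu/languages/google-python-class/images/puzzle/a-baaa.jpg HTTP/1.0" 200 2309 "-"',
    '10.254.254.28 - - [06/Aug/2007:00:11:08 -0700] "GET /favicon.ico HTTP/1.0" 302 3404 "-"',
    '10.254.254.28 - - [06/Aug/2007:00:11:08 -0700] "GET /favicon.ico HTTP/1.0" 302 3404 "-"',
]

result = extract_paths(sample)
print(result)
```

Taking an iterable of lines rather than a filename keeps the function easy to try out; an open file object can be passed in the same way, since iterating over it yields lines.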