I have two CSV files. The first, input, consists of street addresses with various errors. The second, ref, is a clean street address table. Records in input need to be matched to records in ref. Converting the files to lists of unique records is fast, but the matching step is dreadfully slow: it takes a full 85 seconds just to match two addresses in input against ref, without any regular expressions! I realize that the size of ref is the issue here; it has over 1 million records and the file is 30 MB. I was anticipating some performance issues at these sizes, but taking this long for only two records is unacceptable (realistically, I may have to match 10,000 records or more). Additionally, I will eventually need to embed some regex in the ref items to allow for more flexible matching. Testing with the new regex module is even worse, taking a whopping 185 seconds for the same two input records. Does anybody know the best way to speed things up substantially? Can I somehow index by zip code, for example?
Here are sample addresses from input and ref, respectively (after preprocessing):
60651 N SPRINGFIELD AVE CHICAGO
60061 BROWNING CT VERNON HILLS
Here is what I have so far (being a novice, I realize there are probably all kinds of inefficiencies in my code, but that's not the issue):
import csv, re
f = csv.reader(open('/Users/benjaminbauman/Documents/inputsample.csv','rU'))
columns = zip(*f)
l = list(columns)
inputaddr = l[0][1:]
f = csv.reader(open('/Users/benjaminbauman/Documents/navstreets.csv','rU'))
f.next()
reffull = []
for row in f:
    row = str(row[0:7]).strip(r'['']').replace("\'","")
    if not ", , , , ," in row: reffull.append(row)
input = list(set(inputaddr))
ref1 = list(set(reffull))
ref2 = ref1
input_scrub = []
for i in inputaddr:
    t = i.replace(',',' ')
    input_scrub.append(' '.join(t.split()))
ref_scrub = []
for i in ref1:
    t = i.replace(',',' ')
    ref_scrub.append(' '.join(t.split()))
output_iter1 = dict([ (i, [ r for r in ref_scrub if re.match(r, i) ]) for i in input_scrub ])
unmatched_iter1 = [i for i, j in output_iter1.items() if len(j) < 1]
matched_iter1 = {i: str(j[0][1]).strip(r'['']') for i, j in output_iter1.items() if len(j) == 1}
tied_iter1 = {k: zip(*(v))[1] for k,v in output_iter1.iteritems() if len(v) > 1}
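On the zip code idea: a minimal sketch of what such an index could look like, assuming every scrubbed address begins with its zip code as the first whitespace-separated token (as in the two samples above). The names build_zip_index and match_by_zip and the sample data are made up for illustration, and a plain startswith stands in for the re.match call, which behaves the same only while the ref strings contain no regex metacharacters.

```python
from collections import defaultdict

def build_zip_index(ref_addresses):
    """Group reference addresses by their leading zip code token."""
    index = defaultdict(list)
    for addr in ref_addresses:
        parts = addr.split(None, 1)
        if parts:  # skip empty strings
            index[parts[0]].append(addr)
    return index

def match_by_zip(input_addresses, index):
    """Compare each input only against ref addresses sharing its zip code."""
    results = {}
    for addr in input_addresses:
        parts = addr.split(None, 1)
        # .get() avoids creating empty buckets for unknown zip codes
        candidates = index.get(parts[0], []) if parts else []
        # Stand-in for re.match(r, addr) on plain (non-regex) ref strings
        results[addr] = [r for r in candidates if addr.startswith(r)]
    return results

# Made-up sample data in the same shape as the scrubbed lists above
ref_scrub = [
    "60061 BROWNING CT VERNON HILLS",
    "60651 N SPRINGFIELD AVE CHICAGO",
    "60651 W DIVISION ST CHICAGO",
]
input_scrub = ["60651 N SPRINGFIELD AVE CHICAGO", "60099 UNKNOWN RD NOWHERE"]

zip_index = build_zip_index(ref_scrub)          # built once, one pass over ref
matches = match_by_zip(input_scrub, zip_index)  # each lookup scans one bucket
```

The index is built in a single pass over ref, after which each input address is compared only against the addresses in its own zip code bucket rather than against all million rows. If regex patterns are later embedded in the ref items, the startswith test could be swapped back to re.match over the same small candidate list.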
Instead of fuzzy regex in the new module, maybe you could use the difflib module, if the execution time is acceptable:
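A minimal sketch of the difflib idea, not necessarily the code originally posted here. It uses difflib.get_close_matches from the standard library; the helper name closest_refs, the cutoff of 0.8, and the sample addresses (including the deliberate typo) are assumptions for illustration.

```python
import difflib

# Made-up reference addresses; in practice this would be a small candidate
# list (for example, only the ref rows sharing the input's zip code).
ref_candidates = [
    "60651 N SPRINGFIELD AVE CHICAGO",
    "60651 W DIVISION ST CHICAGO",
    "60061 BROWNING CT VERNON HILLS",
]

def closest_refs(address, candidates, n=3, cutoff=0.8):
    """Return up to n candidates whose similarity ratio is at least cutoff,
    ordered from most to least similar."""
    return difflib.get_close_matches(address, candidates, n=n, cutoff=cutoff)

# Input address with a transposition typo in the street name
best = closest_refs("60651 N SPRINGFEILD AVE CHICAGO", ref_candidates)

# An address unlike anything in the candidate list yields no matches
none_found = closest_refs("99999 ZZZ", ref_candidates)
```

get_close_matches scores every candidate it is given, so it is only practical when the candidate list is small; running it against the full ref list for each input would likely be as slow as the regex approach. The cutoff is a tuning knob: raising it reduces false matches, lowering it tolerates more errors in the input.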
the result is: