I am looking to compare multiple CSV files with Python, and output a report. The number of CSV files to compare will vary, so I am having it pull a list from a directory. Each CSV has 2 columns: the first being an area code and exchange, the second being a price.
e.g.
1201007,0.006
1201032,0.0119
1201040,0.0106
1201200,0.0052
1201201,0.0345
The files will not all contain the same area codes and exchanges, so rather than a line-by-line comparison, I need to use the first field as the key. I then need to generate a report that says: file1 had 200 mismatches to file2, 371 lower prices than file2, and 562 higher prices than file2. I need to generate this for every pair of files, so this step would be repeated for file1 against file3, file4, …, and then for file2 against file3, etc. I would consider myself a relative noob to Python. Below is the code I have so far, which just grabs the files in the directory and prints the prices from all files with a total tally.
import csv
import os

count = 0
# directory containing the CSV files
csvdir = "tariff_compare"
dirList = os.listdir(csvdir)
# index all files for later use
for idx, fname in enumerate(dirList):
    print(fname)
    # os.listdir returns bare file names, so join them with the directory
    dic_read = csv.reader(open(os.path.join(csvdir, fname)))
    for row in dic_read:
        key = row[0]
        price = row[1]
        print(price)
        count += 1
print(count)
This assumes that all your data can fit in memory; if not, you will have to try loading only some sets of files at a time, or even just two files at a time.
It does the comparison and writes the output to a summary.csv file, one row per pair of files.
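As a rough sketch of what such an in-memory comparison could look like in Python 3: the helper names load_prices, compare, and write_summary are illustrative assumptions, and a "mismatch" is assumed here to mean an exchange that appears in only one of the two files, while lower and higher prices are counted only for exchanges both files share. Prices are parsed as Decimal so that values such as 0.0119 compare exactly.

```python
import csv
import itertools
import os
from decimal import Decimal


def load_prices(path):
    """Read one two-column CSV into a dict mapping exchange -> price."""
    prices = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            prices[row[0].strip()] = Decimal(row[1].strip())
    return prices


def compare(a, b):
    """Return (mismatches, lower, higher) for price dict a relative to b."""
    shared = a.keys() & b.keys()
    # Assumption: a mismatch is an exchange found in only one of the files.
    mismatches = len(a.keys() ^ b.keys())
    lower = sum(1 for k in shared if a[k] < b[k])
    higher = sum(1 for k in shared if a[k] > b[k])
    return mismatches, lower, higher


def write_summary(csvdir, out_path="summary.csv"):
    """Compare every pair of CSV files in csvdir, one output row per pair."""
    names = sorted(n for n in os.listdir(csvdir) if n.lower().endswith(".csv"))
    # All files are loaded up front, so everything must fit in memory.
    tables = {n: load_prices(os.path.join(csvdir, n)) for n in names}
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file_a", "file_b", "mismatches", "a_lower", "a_higher"])
        for a, b in itertools.combinations(names, 2):
            writer.writerow([a, b] + list(compare(tables[a], tables[b])))
```

Calling write_summary("tariff_compare") would then produce one row for file1 vs. file2, one for file1 vs. file3, and so on. Keep the output file outside the input directory so that it is not picked up as an input on the next run.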
Edit: user1277476 makes a good point; if you pre-sort your files by exchange (or if they are already in sorted order), you could iterate simultaneously through all your files, keeping nothing but the current line for each in memory.
This would let you do a more in-depth comparison for each exchange entry – number of files containing a value, or top or bottom N values, etc.
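A minimal sketch of that simultaneous iteration in Python 3, assuming every file is already sorted by its first column as text (which matches numeric order when all keys have the same number of digits, as in the sample) and that each exchange appears at most once per file. The names tagged_rows and merged_by_exchange are hypothetical helpers, not part of any library; heapq.merge and itertools.groupby are from the standard library.

```python
import csv
import heapq
from itertools import groupby
from operator import itemgetter


def tagged_rows(path):
    """Yield (exchange, path, price) for each row of one pre-sorted CSV."""
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) >= 2:
                yield row[0], path, float(row[1])


def merged_by_exchange(paths):
    """Yield (exchange, {path: price}) in key order across all files."""
    streams = [tagged_rows(p) for p in paths]
    # heapq.merge pulls one row at a time from each already-sorted stream.
    merged = heapq.merge(*streams, key=itemgetter(0))
    for exchange, group in groupby(merged, key=itemgetter(0)):
        yield exchange, {path: price for _, path, price in group}
```

Each yielded dictionary holds the prices for one exchange only, so len(prices) gives the number of files containing that exchange and min(prices, key=prices.get) gives the file with the lowest price for it.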