I have a CSV file with multiple entries. Example csv:
user, phone, email
joe, 123, joe@x.com
mary, 456, mary@x.com
ed, 123, ed@x.com
I’m trying to remove the duplicates by a specific column in the CSV; however, with the code below I’m getting a “list index out of range” error. I thought that by comparing row[1] with newrows[1] I would find all duplicates and only write the unique entries to file2.csv. This doesn’t work, though, and I can’t understand why.
    f1 = csv.reader(open('file1.csv', 'rb'))
    newrows = []
    for row in f1:
        if row[1] not in newrows[1]:
            newrows.append(row)
    writer = csv.writer(open("file2.csv", "wb"))
    writer.writerows(newrows)
My end goal is a list that maintains the order of the file (a set won’t work…right?), which should look like this:
user, phone, email
joe, 123, joe@x.com
mary, 456, mary@x.com
row[1] refers to the second column in the current row (phone). That’s all well and good. However, newrows[1] is not a column: it is the second row stored in newrows. The list is empty on the first pass through the loop, so newrows[1] raises “list index out of range”.

With newrows.append(row) you add the entire row to the list. When you check row[1] in newrows, you are checking the individual phone number against a list of complete rows. But that’s not what you want to do. You need to check against a list or set of just phone numbers. For that, you probably want to keep track of both the rows and a set of the observed phone numbers. Something like:
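A minimal sketch of that idea, assuming Python 3 (the question's 'rb'/'wb' modes are Python 2 style) and the file names from the question. The helper names dedupe_by_column and dedupe_file are hypothetical, chosen only for illustration:

```python
import csv


def dedupe_by_column(rows, col=1):
    """Keep the first row seen for each distinct value in column `col`."""
    seen = set()   # values already observed in the chosen column
    unique = []    # rows to keep, in their original order
    for row in rows:
        key = row[col]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def dedupe_file(src="file1.csv", dst="file2.csv", col=1):
    """Read src, drop rows whose value in `col` was already seen, write dst."""
    # newline="" is what the csv module docs recommend for files in Python 3
    with open(src, newline="") as f_in:
        rows = dedupe_by_column(csv.reader(f_in), col)
    with open(dst, "w", newline="") as f_out:
        csv.writer(f_out).writerows(rows)
```

The list preserves the file's order, while the set only answers "have I seen this phone number before?", so a set does work here as long as it sits alongside the list rather than replacing it. Calling dedupe_file() on the example data should keep the header, joe, and mary, and drop ed, whose phone number was already seen on joe's row.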