I have two CSV files that I need to compare: the first has three columns, and I want to extract the matching rows from the second, which has more columns (six in the example below). Example contents of the files:
File1:
ATGCGCGACAGT, ch3, 123546
ATGCATACAGGATAT, ch2, 5141561615
... and so on, approximately 100 entries
File2:
ATGCGGCGACAGT,ch3, 123456,mi141515, AUCAGCUAUAUAU, UACGCAGAUAUAUA
ATCAGACGATTATGA, ch4, 4564764, mi653453, AUCAGCAAUUUUCG, AUACAGACAAAAA
... and so on, approximately 50000 entries
I need to match columns 1, 2 and 3 of both files, so that all three columns of a row in file1 match the same columns of a row in file2. When they do, I want to extract columns 4, 5 and 6 of that file2 row for further processing.
I was thinking of:
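    import csv

    # file1 and file2 are assumed to be lists of rows already read from the two CSV files
    fhout = csv.writer(open('parsed_out', 'w'), delimiter=',')
    for row1 in file1:
        a, b, c = row1[0], row1[1], row1[2]
        for row2 in file2:
            d, e, f = row2[0], row2[1], row2[2]
            g, h, i = row2[3], row2[4], row2[5]
            # write columns 4-6 only when all three key columns agree
            if a == d and b == e and c == f:
                fhout.writerow([g, h, i])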
But somebody told me that I could use hashing, or some better approach, rather than writing such big loops for 10,000 or more entries in file1.
Please suggest a better way to achieve this. Both file 1 and file 2 are parsed from more complex files.
Try something like:
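A minimal sketch of the hashing idea, assuming both files fit in memory and the first three columns together form the key. The sample strings and the helper names `read_rows` and `match_rows` are illustrative assumptions, not part of the question:

```python
import csv
import io

# Hypothetical sample data standing in for the two files;
# real code would pass open('file_1.csv', newline='') and so on instead.
FILE_1 = "ATGC,ch3,123\nGGCC,ch2,456\n"
FILE_2 = "ATGC,ch3,123,mi1,AUCA,UACG\nTTAA,ch4,789,mi2,AUCC,AUAC\n"


def read_rows(text):
    """Parse CSV text into a list of rows, stripping spaces around each field."""
    reader = csv.reader(io.StringIO(text))
    return [[field.strip() for field in row] for row in reader if row]


def match_rows(rows_1, rows_2):
    """Return columns 4-6 of every file-2 row whose first three columns appear in file 1."""
    # Index file 2 by its first three columns; dict lookups are hash lookups,
    # so each file-1 row is checked without rescanning file 2.
    index = {}
    for row in rows_2:
        index.setdefault(tuple(row[:3]), []).append(row[3:6])

    matches = []
    for row in rows_1:
        matches.extend(index.get(tuple(row[:3]), []))
    return matches


matched = match_rows(read_rows(FILE_1), read_rows(FILE_2))
print(matched)
```

Each file is read once, so the work grows with the sum of the two file sizes rather than their product, as it does with the nested loops.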
When run with the following data:
file_1.csv
file_2.csv
it produces the output
EDIT: A slightly nicer implementation using sets and list comprehensions:
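A sketch of what a set-based version could look like, assuming the smaller file's keys fit in memory. The sample strings and the function name `extract_matches` are hypothetical:

```python
import csv
import io

# Hypothetical sample data; real code would pass open file handles instead.
FILE_1 = "ATGC,ch3,123\nGGCC,ch2,456\n"
FILE_2 = "ATGC,ch3,123,mi1,AUCA,UACG\nTTAA,ch4,789,mi2,AUCC,AUAC\n"


def extract_matches(file_1, file_2):
    """Return columns 4-6 of the file-2 rows whose first three columns occur in file 1."""
    # Set of (col1, col2, col3) tuples from file 1; membership tests are hash lookups.
    keys = {tuple(row[:3]) for row in csv.reader(file_1) if row}
    # Keep only the extra columns of the file-2 rows whose key is in the set.
    return [row[3:6] for row in csv.reader(file_2) if tuple(row[:3]) in keys]


result = extract_matches(io.StringIO(FILE_1), io.StringIO(FILE_2))
print(result)
```

Only the keys of the smaller file are held in the set, and the larger file is streamed through once.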
EDIT 2: Note that this implementation is sensitive to whitespace within the CSV file. If the spacing in your CSV file is inconsistent, use something like
    row = [element.strip(' ') for element in row]
to strip the leading and trailing spaces from each field.
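As an illustration of where that line might sit, here is a small sketch that normalises each row as it is read. The sample line and the generator name `normalised_rows` are assumptions made for the example:

```python
import csv
import io

# Hypothetical line with inconsistent spacing, modelled on the question's format.
SAMPLE = "ATGC, ch3,123 ,mi1\n"


def normalised_rows(handle):
    """Yield each CSV row with leading and trailing spaces removed from every field."""
    for row in csv.reader(handle):
        yield [element.strip(' ') for element in row]


raw = list(csv.reader(io.StringIO(SAMPLE)))
clean = list(normalised_rows(io.StringIO(SAMPLE)))
print(raw)
print(clean)
```

Applying the same normalisation to both files before building the keys means that `ch3` and ` ch3` compare as equal.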