I have two CSV files: one is 98 MB and the other is 152 KB. The smaller file is a random subset of the bigger one, and I want to write a third file containing the rows of the big CSV that correspond to each line in the smaller CSV file.
Big file (excerpt):
ZINC_ID MWT LogP Desolv_apolar Desolv_polar HBD HBA tPSA Charge NRB SMILES
ZINC00000017 281.337 1.33 3.07 -19.2 2 6 87 0 4 CCC[S@](=O)c1ccc2c(c1)[nH]/c(=N/C(=O)OC)/[nH]2
ZINC00000036 151.141 0.37 3.51 -45.3 1 3 60 -1 2 c1ccc(cc1)[C@@H](C(=O)[O-])O
ZINC00000048 222.24 2.42 3.78 -8.68 0 4 37 0 4 COc1cc(c(c2c1OCO2)OC)CC=C
ZINC00000053 179.151 1.43 6.59 -56.84 0 4 66 -1 3 CC(=O)Oc1ccccc1C(=O)[O-]
Small file (excerpt):
SMILES
CCOc1ccc(cc1)NC(=O)C[C@@H](C)O
C[C@@H](c1ccc2c(c1)nc(o2)c3ccc(cc3)Cl)C(=O)[O-]
CC(=O)Oc1ccccc1C(=O)[O-]
COc1cc(c(c2c1OCO2)OC)CC=C
Here is my code:
import csv
writer = csv.writer(open('/Users/Eric/Desktop/newZincSubset.csv','wb'))
count = 0
with open('/Users/Eric/Desktop/test700.csv','rU') as i:
    with open('/Users/Eric/Desktop/initial_data.csv','rU') as j:
        subject = csv.reader(i)
        reference = csv.reader(j)
        for row in subject:
            smiles = row[0]
            for reference_row in reference:
                suspect = reference_row[10]
                if (smiles == suspect):
                    writer.writerow(reference_row)
It seems to write the header (ZINC_ID MWT LogP) just fine, but then it stops finding matches for any of the remaining lines. Is it a memory issue, or is something wrong with my code?
Thanks!
A CSV reader can be iterated only once. After the first pass of the inner loop is done, the underlying file object has reached the end of the file, so when you try to iterate over the reference reader a second time there is nothing more to read. That is why only the first line of the small file (its SMILES header, which matches the big file's header row) ever gets written.

I’d recommend that you first read the small file into a dictionary, and then iterate over the larger file, checking each row against the data in memory. You can key the dictionary by the value you will end up looking up (the SMILES string that you compare against reference_row[10], I think), so there is no need for nested loops.
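A minimal sketch of that approach, under a few assumptions: it is written for Python 3 (so the files are opened with newline='' instead of the 'rU' and 'wb' modes from the question), it uses a set rather than a dictionary because only membership is being tested, and the function name write_matching_rows and its parameters are made up for illustration. It also assumes that both files are comma-delimited and that the SMILES string sits in column 10 of the big file, as in the question's code.

```python
import csv


def write_matching_rows(small_path, big_path, out_path, smiles_col=10):
    """Copy rows of the big CSV whose SMILES value appears in the small CSV.

    Hypothetical helper: the name, parameters, and default column index
    are assumptions based on the code in the question.
    Returns the number of data rows written (header not counted).
    """
    # Pass 1: load every SMILES string from the small file into memory.
    with open(small_path, newline='') as small:
        reader = csv.reader(small)
        next(reader, None)  # skip the "SMILES" header line
        wanted = {row[0] for row in reader if row}

    written = 0
    # Pass 2: read the big file exactly once and keep the matching rows.
    with open(big_path, newline='') as big, \
            open(out_path, 'w', newline='') as out:
        reader = csv.reader(big)
        writer = csv.writer(out)

        header = next(reader, None)
        if header is not None:
            writer.writerow(header)  # carry the column names over

        for row in reader:
            # Guard against short or blank rows before indexing.
            if len(row) > smiles_col and row[smiles_col] in wanted:
                writer.writerow(row)
                written += 1
    return written
```

With the paths from the question, the call would look like write_matching_rows('/Users/Eric/Desktop/test700.csv', '/Users/Eric/Desktop/initial_data.csv', '/Users/Eric/Desktop/newZincSubset.csv'). Each file is read only once, and only the small file is held in memory, so the 98 MB file should not be a problem.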