I am new to Python, so I apologize if this example is trivial.
I am trying to write a simple script that will parse and extract parts of two large data files (~40 GB each) into one resulting file with a slightly altered format. I originally tried to use readlines(), but that reads all of the files into memory, and our instance only has 28 GB of memory. Using the sizehint parameter only parses a portion of the file.
I am now iterating over the file. The problem is that I store the output of the text parsing in three lists that grow to be rather large, eclipsing the available memory. I thought this would just switch to using the swap, which would be fine, but it instead just exits with a “MemoryError”.
This works fine with small sample files, but chokes on our actual data.
The script:
import sys
a = []
b = []
c = []
file1 = open(sys.argv[1],"r")
for line in file1:
    if '@' in line:
        a.append(line.lstrip('@').rstrip('\n'))
        b.append(file1.next().rstrip('\n'))
file1.close()
file2 = open(sys.argv[2],"r")
for line in file2:
    if '@' in line:
        c.append(file2.next().rstrip('\n'))
file2.close()
file3 = open(sys.argv[3],"w")
for i in xrange(len(a)):
    file3.write("".join([">",a[i],'\n',b[i],":",c[i],"\n"]))
What I have found online suggests creating some sort of database to store the variables, but that shouldn’t be required. Do you have any ideas how I should deal with this?
For completeness, this is what I’m trying to do (from our example test data):
file1:
@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
file2:
@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
file3 (output):
>Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG:TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT
I haven’t tried this myself, but it seems like the following should work:
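A minimal sketch of that streaming approach, assuming both files list their reads in the same order and that only header lines begin with '@'. The names records and merge are made up for illustration, and it is written for Python 3 (on Python 2, swap zip for itertools.izip so the pairs are not all built in memory at once).

```python
def records(handle):
    """Yield (header, sequence) pairs one at a time from an open file."""
    for line in handle:
        if line.startswith('@'):
            header = line.lstrip('@').rstrip('\n')
            # The sequence is assumed to sit on the line right after the header.
            sequence = next(handle, '').rstrip('\n')
            yield header, sequence


def merge(path1, path2, out_path):
    """Walk both inputs in lockstep and write each merged record immediately."""
    with open(path1) as f1, open(path2) as f2, open(out_path, 'w') as out:
        # zip is lazy in Python 3, so only one record per file is held at once.
        for (header, seq1), (_, seq2) in zip(records(f1), records(f2)):
            out.write('>%s\n%s:%s\n' % (header, seq1, seq2))
```

To keep the original command-line interface, call merge(sys.argv[1], sys.argv[2], sys.argv[3]) after importing sys.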
In this case, you’re only keeping one line in memory at a time for each file…which should be fine unless the files have really long lines ;^).
Also, since you’re lstrip-ing the ‘@’ character, you might want to consider using if line.startswith('@') instead of if '@' in line.
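To illustrate the difference between the two tests, here is a small sketch. The quality string is hypothetical (the sample data above only contains 'A' characters), and the helper names are made up for this example.

```python
def is_header_loose(line):
    # The check used in the question: true if '@' appears anywhere in the line.
    return '@' in line


def is_header_strict(line):
    # The suggested check: true only if '@' is the first character.
    return line.startswith('@')


header = '@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1\n'
quality = 'AAAA@AAAA\n'  # hypothetical quality line, not taken from the question

print(is_header_loose(header), is_header_strict(header))
print(is_header_loose(quality), is_header_strict(quality))
```

The loose check treats the hypothetical quality line as a header, which would shift every record after it; the strict check only matches when '@' is the first character.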