I am new to python, so I apologize if this example is trivial. I

Question

Asked: June 7, 2026 (2026-06-07T20:10:10+00:00)

I am new to python, so I apologize if this example is trivial.

I am trying to write a simple script that will parse and extract parts of two large data files (~40 GB each) into one resulting file with a slightly altered format. I originally tried to use readlines(), but that reads the whole file into memory, and our instance only has 28 GB of memory. Using the sizehint parameter only parses a portion of the file.

I am now iterating over the file. The problem is that I store the output of the text parsing in three lists that grow to be rather large, eclipsing the available memory. I thought this would just switch to using the swap, which would be fine, but it instead just exits with a “MemoryError”.

This works fine with small sample files, but chokes on our actual data.

The script:

import sys

a = []
b = []
c = []

file1 = open(sys.argv[1],"r")
for line in file1:
    if '@' in line:
        a.append(line.lstrip('@').rstrip('\n'))
        b.append(file1.next().rstrip('\n'))
file1.close()

file2 = open(sys.argv[2],"r")
for line in file2:
    if '@' in line: 
        c.append(file2.next().rstrip('\n'))
file2.close()

file3 = open(sys.argv[3],"w")
for i in xrange(len(a)):
    file3.write("".join([">",a[i],'\n',b[i],":",c[i],"\n"]))

What I have found online suggests creating some sort of database to store the variables, but that shouldn’t be required. Do you have any ideas how I should deal with this?

For completeness, this is what I’m trying to do (from our example test data):

file1: 

@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

file2:

@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

file3 (output):

>Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG:TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT


1 Answer

Editorial Team · Answer 1 · 2026-06-07T20:10:12+00:00

I haven’t tried this myself, but it seems like the following should work:

import sys

file1 = open(sys.argv[1],"r")
file2 = open(sys.argv[2],"r")
file3 = open(sys.argv[3],"w")

for line1 in file1:
    if '@' in line1:  # line1.startswith('@') is probably better here
        a=line1.lstrip('@').rstrip('\n')
        b=next(file1).rstrip('\n')  # next(f) works in Python 2.6+ and 3
        # advance file2 to its next header and take the line after it
        for line2 in file2:
            if '@' in line2:
                c=next(file2).rstrip('\n')
                break
        else:
            break  # file2 ran out of records; stop instead of reusing an old c
        file3.write(">%s\n%s:%s\n"%(a,b,c))

file1.close()
file2.close()
file3.close()

In this case, you’re only keeping one line in memory at a time for each file…which should be fine unless the files have really long lines ;^).
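The same one-record-at-a-time idea can also be written with a small generator and zip(). This is only a sketch, assuming Python 3 (where zip() is lazy) and assuming every header line starts with '@' and is immediately followed by its sequence line, as in the sample data. It would misfire on any other line that happens to start with '@'. The names iter_records and merge_streams are made up for this example.

```python
def iter_records(handle):
    """Yield (name, sequence) for each record in an open text file.

    Assumes a header line starting with '@' is immediately followed
    by the sequence line, as in the sample data from the question.
    """
    for line in handle:
        if line.startswith('@'):
            name = line[1:].rstrip('\n')         # drop only the leading '@'
            seq = next(handle, '').rstrip('\n')  # '' if the file ends early
            yield name, seq


def merge_streams(in1, in2, out):
    """Write one '>name', 'seq1:seq2' pair of lines per record; return the count."""
    count = 0
    # zip() pulls one record from each input per step, so memory use stays
    # at roughly one record per file no matter how large the files are.
    for (name, seq1), (_name2, seq2) in zip(iter_records(in1), iter_records(in2)):
        out.write(">%s\n%s:%s\n" % (name, seq1, seq2))
        count += 1
    return count


# Hypothetical usage with the three paths from the command line:
#   import sys
#   with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2, \
#        open(sys.argv[3], "w") as f3:
#       merge_streams(f1, f2, f3)
```

Using with blocks, as in the usage comment, closes all three files even if an error occurs partway through. zip() stops at the shorter input, so trailing records in the longer file are silently ignored.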

Also, since you’re stripping the leading ‘@’ character with lstrip, you might want to consider using if line.startswith('@') instead of if '@' in line.
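To make the difference concrete, here is a small sketch. The two sample lines are invented for illustration: one looks like a header, and the other is a made-up non-header line that happens to contain '@' in the middle.

```python
header_line = "@Read.example/1\n"   # hypothetical header line
other_line = "AAAA@AAAA\n"          # hypothetical non-header line containing '@'


def is_header(line):
    """True only when '@' is the first character of the line."""
    return line.startswith('@')


# The substring test matches both lines, so it would treat other_line
# as a header as well.
matches_with_in = [('@' in line) for line in (header_line, other_line)]

# The prefix test matches only the real header.
matches_with_startswith = [is_header(line) for line in (header_line, other_line)]

# lstrip('@') removes every leading '@', while slicing removes exactly one.
stripped_all = "@@name".lstrip('@')   # 'name'
stripped_one = "@@name"[1:]           # '@name'
```

If a read name could itself begin with '@', slicing with line[1:] keeps it intact, whereas lstrip('@') would not.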
