I need to process very large text input files, and I usually call .readlines() to read the whole file into a list first.
I know this uses a lot of memory and can be quite slow, but I also need list indexing to get at specific lines, as in the code below:
#!/usr/bin/python
import os,sys
import glob
import commands
import gzip
path= '/home/xxx/scratch/'
fastqfiles1=glob.glob(path+'*_1.recal.fastq.gz')
for fastqfile1 in fastqfiles1:
    filename = os.path.basename(fastqfile1)
    job_id = filename.split('_')[0]
    fastqfile2 = os.path.join(path+job_id+'_2.recal.fastq.gz')
    newfastq1 = os.path.join(path+job_id+'_1.fastq.gz')
    newfastq2 = os.path.join(path+job_id+'_2.fastq.gz')
    l1= gzip.open(fastqfile1,'r').readlines()
    l2= gzip.open(fastqfile2,'r').readlines()
    f1=[]
    f2=[]
    for i in range(0,len(l1)):
        if i % 4 == 3:
            b1=[ord(x) for x in l1[i]]
            ave1=sum(b1)/float(len(l1[i]))
            b2=[ord(x) for x in str(l2[i])]
            ave2=sum(b2)/float(len(l2[i]))
            if (ave1 >= 20 and ave2>= 20):
                f1.append(l1[i-3])
                f1.append(l1[i-2])
                f1.append(l1[i-1])
                f1.append(l1[i])
                f2.append(l2[i-3])
                f2.append(l2[i-2])
                f2.append(l2[i-1])
                f2.append(l2[i])
    output1=gzip.open(newfastq1,'w')
    output1.writelines(f1)
    output1.close()
    output2=gzip.open(newfastq2,'w')
    output2.writelines(f2)
    output2.close()
Basically, I check every 4th line of each file, and if that line meets the desired condition in both files, I append it and the three lines before it to the output.
Can I do this without readlines()?
Thanks.
EDIT:
Actually, I found a faster way myself:
import commands
l1=commands.getoutput('zcat ' + fastqfile1).splitlines(True)
l2=commands.getoutput('zcat ' + fastqfile2).splitlines(True)
‘zcat’ seems to be much faster.
Reading the files with readlines() took around 15 minutes, while zcat took only about 1 minute.
If you can refactor your code to read through the file linearly, then you can just say ‘for line in file’ to iterate through each line of the file without reading it all into memory at once. But, since your file access looks more complicated, you could use a generator to replace readlines(). One way to do this would be to use itertools.izip or itertools.izip_longest:
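A minimal sketch of this approach, assuming the goal is the same pairwise filter as in the question. The helper names (records, mean_code, filter_pairs, filter_gz) are made up for illustration, and the quality test is copied from the question unchanged: it averages the raw character codes of the whole 4th line, newline included. The import fallback is only there so the same code also runs on Python 3, where the built-in zip is already lazy.

```python
import gzip

try:
    from itertools import izip  # Python 2: lazy version of zip
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy


def records(lines):
    """Group an iterable of lines into 4-line tuples, one record at a time."""
    it = iter(lines)
    # The same iterator is passed four times, so each tuple consumes 4 lines.
    # A trailing group of fewer than 4 lines is silently dropped.
    return izip(it, it, it, it)


def mean_code(line):
    """Mean character code of a line, mirroring the check in the question."""
    # bytearray() gives integer byte values for a Python 2 str
    # and for Python 3 bytes (what gzip.open(..., 'rb') yields).
    return sum(bytearray(line)) / float(len(line))


def filter_pairs(lines1, lines2, out1, out2, threshold=20):
    """Write the record pairs whose 4th lines both reach the threshold.

    Only one record from each input is held in memory at a time.
    Returns the number of pairs written.
    """
    kept = 0
    for rec1, rec2 in izip(records(lines1), records(lines2)):
        if mean_code(rec1[3]) >= threshold and mean_code(rec2[3]) >= threshold:
            out1.writelines(rec1)
            out2.writelines(rec2)
            kept += 1
    return kept


def filter_gz(path1, path2, new1, new2, threshold=20):
    """Hypothetical wrapper: run filter_pairs on gzip-compressed files."""
    in1, in2 = gzip.open(path1, 'rb'), gzip.open(path2, 'rb')
    out1, out2 = gzip.open(new1, 'wb'), gzip.open(new2, 'wb')
    try:
        return filter_pairs(in1, in2, out1, out2, threshold)
    finally:
        for handle in (in1, in2, out1, out2):
            handle.close()
```

Inside the question's loop, the readlines() calls, the index loop and the two writelines() blocks would then collapse to a single call such as filter_gz(fastqfile1, fastqfile2, newfastq1, newfastq2). Memory use stays flat because nothing is accumulated in f1 and f2. This sketch does not address decompression speed, so whether it beats piping through zcat would need to be measured.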