I am trying to make myself better in python. There are some tools to

Asked: June 11, 2026

I am trying to get better at Python. There are tools that already do this kind of thing, but I want to do it myself for two reasons:

To learn some better ways of doing it
To have flexibility in operation

I have two text files of exactly the same size, with the same number of lines. In the first file I need to check the 2nd, 6th, ... line (every 4th line from there), look at the text it begins with, and check whether that matches some predefined text. If it does, I write that line, along with its block of four lines, to a corresponding output file, and write the same lines from the second file to another corresponding output file. (For those to whom this sounds familiar: I am trying to demultiplex barcoded data from Illumina paired-end sequence data.)

I already have working code, but the problem is that it takes days to finish. It took about 10 minutes for 100,000 lines, and I have 200 million.

I am posting the code here along with my thinking.
I have 100 keys, say ATCCGG, ACCTGG, etc. However, I would like to accept a key with one mismatch as correct. For example, DOG could also be AOG, BOG, DIG, DAG, DOF, DOH, ...

def makehamming2(text,dist):
    # Map every sequence one substitution away from text back to text.
    dicthamming=dict()
    rep=["A","T","C","G"]

    if dist==1:
        for i in range(len(text)):
            for j in range(len(rep)):
                chars=list(text)
                if rep[j]!=chars[i]:
                    chars[i]=rep[j]
                    word="".join(chars)
                    dicthamming[word]=text
    return dicthamming

I am using dist=1.

I use this function for 100 barcodes, so I have about 100*18 items in the dictionary.
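The figure of 18 comes from 6 positions times 3 alternative bases. As a quick check, here is a minimal sketch of the same idea. The names `one_mismatch_variants` and `build_search_dict` are hypothetical, and including the exact barcodes in the combined dictionary is an assumption about how the lookup table is assembled.

```python
def one_mismatch_variants(barcode, alphabet="ATCG"):
    """Map every sequence at Hamming distance 1 from barcode back to it."""
    variants = {}
    for i, original in enumerate(barcode):
        for base in alphabet:
            if base != original:
                variants[barcode[:i] + base + barcode[i + 1:]] = barcode
    return variants


def build_search_dict(barcodes):
    """Combine exact barcodes and their one-mismatch variants in one dict."""
    search = {}
    for barcode in barcodes:
        search[barcode] = barcode  # assumption: exact matches are included too
        search.update(one_mismatch_variants(barcode))
    return search


variants = one_mismatch_variants("ATCCGG")
print(len(variants))  # 6 positions * 3 alternative bases = 18

search = build_search_dict(["ATCCGG", "ACCTGG"])
print(search.get("ATCCGA"))  # ATCCGG
```

Note that two barcodes that differ in only two positions share some one-mismatch variants (here ACCCGG and ATCTGG), so in a plain dictionary the barcode added last wins for those keys.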

count=0
eachline=1          # position within the current 4-line record
writeflag=0
seqlen=int(seqlen)
cutlen=len(cutsite)
infile=open(inf, "r")
for line in infile:
    count+=1
    if eachline==1:
        # line 1 of the record: header
        writeflag=0
        header=line
        eachline=2
    elif eachline==2:
        # line 2: sequence, whose first 6 characters are the barcode
        eachline=3
        line=line.strip()
        if line[0:6] in searchdict.keys():
            barcode=searchdict[line[0:6]]
            towritefile=outfile+"/"+barcode+".fastq"
            seq=line[6:seqlen+6]
            qualstart=6
            writeflag=1
            seqeach[barcode]=seqeach.get(barcode,0)+1
    elif eachline==3:
        eachline=4
        third=line
    elif eachline==4:
        # line 4: quality; buffer the record and its mate if the barcode matched
        eachline=1
        line=line.strip()
        if writeflag==1:
            qualline=line[qualstart:qualstart+seqlen]
            addToBuffer=header+seq+"\n"+third+qualline+"\n"
            bufferdict[towritefile]=bufferdict.get(towritefile,"")+addToBuffer

            Fourlinesofpair=getfrompair(inf2,count, seqlen)

            bufferdictpair[towritefile[:-6:]+"_2.fastq"]=\
                bufferdictpair.get(towritefile[:-6:]+"_2.fastq","")+Fourlinesofpair

            # flush the buffers to disk every 10,000 records
            if (count/4)%10000==0:
                print "writing" , str((count/4))
                for key, val in bufferdict.items():
                    writefile1=open(key,"a")
                    writefile1.write(val)
                    bufferdict=dict()
                    writefile1.close()

                for key, val in bufferdictpair.items():
                    writefile1=open(key,"a")
                    writefile1.write(val)
                    bufferdictpair=dict()
                    writefile1.close()

                end=(time.time()-start)/60.0
                print "finished writing", str(end) , "minutes"

# write whatever is still buffered after the last record
print "writing" , str(count/4)
for key, val in bufferdict.items():
    writefile1=open(key,"a")
    writefile1.write(val)
    bufferdict=dict()
    writefile1.close()
for key, val in bufferdictpair.items():
    writefile1=open(key,"a")
    writefile1.write(val)
    bufferdictpair=dict()
    writefile1.close()

end=(time.time()-start)/60.0
print "finished writing", str(end) , "minutes"

getfrompair is defined as follows:

def getfrompair(inf2, linenum, length):
    # Reads inf2 from the top on every call, until it reaches the
    # four lines that end at line number linenum.
    info=open(inf2,"r")
    content=""
    for count, line in enumerate(info):
        #print str(count)
        if count == linenum-4:
            content=line
        if count == linenum-3:
            content=content+line.strip()[:length]+"\n"
        if count == linenum-2:
            content=content+line
        if count == linenum-1:
            content=content+line.strip()[:length]+"\n"
            #print str(count), content
            info.close()
            return content

So, my main question is: how can I optimize this? In most cases I would expect this code to run on a machine with at least 8 GB of memory and more than 4 cores. Can I use multiprocessing?
I buffer the output, following a suggestion in another thread here, because that was faster than writing to disk after each line.

Thank you in advance for teaching me.

Edit 1
After Ignacio's suggestion, I profiled the code, and the getfrompair function takes more than half of the run time. Is there a better way to get a certain line from a file without going through every line before it each time?

Profile result from a fraction (10000 lines, instead of the original 800 million):

     68719 function calls in 2.902 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       66    0.000    0.000    0.000    0.000 :0(append)
       32    0.003    0.000    0.003    0.000 :0(close)
     2199    0.007    0.000    0.007    0.000 :0(get)
        8    0.002    0.000    0.002    0.000 :0(items)
        3    0.000    0.000    0.000    0.000 :0(iteritems)
      750    0.001    0.000    0.001    0.000 :0(join)
     7193    0.349    0.000    0.349    0.000 :0(keys)
    39977    0.028    0.000    0.028    0.000 :0(len)
        1    0.000    0.000    0.000    0.000 :0(mkdir)
      767    0.045    0.000    0.045    0.000 :0(open)
      300    0.000    0.000    0.000    0.000 :0(range)
        1    0.005    0.005    0.005    0.005 :0(setprofile)
       96    0.000    0.000    0.000    0.000 :0(split)
        1    0.000    0.000    0.000    0.000 :0(startswith)
        1    0.000    0.000    0.000    0.000 :0(stat)
     6562    0.016    0.000    0.016    0.000 :0(strip)
        4    0.000    0.000    0.000    0.000 :0(time)
       48    0.000    0.000    0.000    0.000 :0(update)
       46    0.004    0.000    0.004    0.000 :0(write)
      733    1.735    0.002    1.776    0.002 RC14100~.PY:273(getfrompair)
        1    0.653    0.653    2.889    2.889 RC14100~.PY:31(split)
        1    0.000    0.000    0.000    0.000 RC14100~.PY:313(makehamming)
        1    0.000    0.000    0.005    0.005 RC14100~.PY:329(processbc2)
       48    0.003    0.000    0.005    0.000 RC14100~.PY:344(makehamming2)
        1    0.006    0.006    2.896    2.896 RC14100~.PY:4(<module>)
     4553    0.015    0.000    0.025    0.000 RC14100~.PY:74(<genexpr>)
     2659    0.014    0.000    0.023    0.000 RC14100~.PY:75(<genexpr>)
     2659    0.013    0.000    0.023    0.000 RC14100~.PY:76(<genexpr>)
        1    0.001    0.001    2.890    2.890 RC14100~.PY:8(main)
        1    0.000    0.000    0.000    0.000 cProfile.py:5(<module>)
        1    0.000    0.000    0.000    0.000 cProfile.py:66(Profile)
        1    0.000    0.000    0.000    0.000 genericpath.py:15(exists)
        1    0.000    0.000    0.000    0.000 ntpath.py:122(splitdrive)
        1    0.000    0.000    0.000    0.000 ntpath.py:164(split)
        1    0.000    0.000    0.000    0.000 os.py:136(makedirs)
        1    0.000    0.000    2.902    2.902 profile:0(<code object <module> at 000000000211A9B0, file "RC14100~.PY", line 4>)
        0    0.000             0.000          profile:0(profiler)



Process "Profile" terminated, ExitCode: 00000000
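A side note on reading a profile like this one: it is ordered by name, so the expensive entries (getfrompair here) are scattered through the table. Sorting by cumulative time puts them at the top. The following is a minimal sketch, assuming Python 3 and the standard `cProfile` and `pstats` modules; `work` is a hypothetical stand-in for the real entry point.

```python
import cProfile
import io
import pstats


def work(n):
    """Hypothetical stand-in for the real demultiplexing entry point."""
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
work(100000)
profiler.disable()

# Collect the report in memory, sorted so the costliest calls come first.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)  # show the top five entries
print(report.getvalue())
```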

1 Answer

Editorial Team · Answer 1 · 2026-06-11T23:52:41+00:00

Your getfrompair function makes this a classic O(n^2) problem, since you read through the second file from the start each time you get a match. What you want to do instead is read from both files at the same time, so that you only go through each of them once. izip is the way to do that.

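from itertools import izip

infile = open(inf, "r")
infile2 = open(inf2, "r")
for line, line2 in izip(infile, infile2):
    pass  # line and line2 are the same-numbered lines of the two files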
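As a rough sketch of how that lockstep loop could be fleshed out, the code below groups each file into four-line records and yields only the pairs whose first read starts with a known barcode. The helper names `paired_records` and `matched_pairs` are hypothetical, the barcode length of 6 is taken from the question, and the trimming to seqlen and the output buffering are left out. The import fallback is there because in Python 3 the built-in `zip` is already lazy and `itertools.izip` no longer exists.

```python
import io

try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy


def paired_records(handle1, handle2):
    """Yield (record1, record2), each a tuple of four lines, in lockstep."""
    # Passing the same iterator four times makes izip pull four
    # consecutive lines into each tuple.
    groups1 = izip(*[iter(handle1)] * 4)
    groups2 = izip(*[iter(handle2)] * 4)
    return izip(groups1, groups2)


def matched_pairs(handle1, handle2, searchdict, barcode_len=6):
    """Yield (barcode, record1, record2) for reads with a known barcode."""
    for rec1, rec2 in paired_records(handle1, handle2):
        barcode = searchdict.get(rec1[1][:barcode_len])
        if barcode is not None:
            yield barcode, rec1, rec2


# Tiny in-memory stand-ins for the two input files.
reads1 = io.StringIO(u"@r1\nATCCGAAAAA\n+\nIIIIIIIIII\n"
                     u"@r2\nTTTTTTAAAA\n+\nIIIIIIIIII\n")
reads2 = io.StringIO(u"@r1\nCCCCCCCCCC\n+\nIIIIIIIIII\n"
                     u"@r2\nGGGGGGGGGG\n+\nIIIIIIIIII\n")
lookup = {"ATCCGG": "ATCCGG", "ATCCGA": "ATCCGG"}

for barcode, rec1, rec2 in matched_pairs(reads1, reads2, lookup):
    print(barcode + " " + rec1[0].strip() + " " + rec2[1].strip())
```

Each file is read exactly once, so the cost grows linearly with the number of lines. Because `izip` stops at the shorter input, this relies on both files having the same number of lines, as stated in the question.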
