The Archive Base

Editorial Team
Asked: June 11, 2026

I am trying to get better at Python. There are existing tools for this task, but I want to write it myself for two reasons:

  1. To learn better techniques
  2. To have flexibility in how it operates

I have two text files of exactly the same size, with the same number of lines. I need to check the 2nd, 6th, 10th, ... line of the first file (every 4th line, starting from the 2nd), look at the text at its beginning, and check whether it matches some predefined text. If it does, I write that line along with its block of 4 lines to a corresponding output file, and write the same lines from the second file to another corresponding output file. (If this sounds familiar: I am trying to demultiplex barcoded data from Illumina paired-end sequence data.)

I already have working code, but the problem is that it takes days to finish. It took about 10 minutes for 100,000 lines, and I have 200 million.

I am posting the code here along with my reasoning.
I have 100 keys, for example ATCCGG, ACCTGG, etc. However, I would like to accept a key with one mismatch as correct. For example, DOG would also match AOG, BOG, DIG, DAG, DOF, DOH, ...

def makehamming2(text,dist):
    # Map every one-mismatch variant of text back to text itself.
    dicthamming=dict()
    rep=["A","T","C","G"]

    if dist==1:
        for i in range(len(text)):
            for j in range(len(rep)):
                chars=list(text)
                if rep[j]!=chars[i]:
                    chars[i]=rep[j]
                    word="".join(chars)
                    dicthamming[word]=text
    return dicthamming

I am using dist=1.

I call this function for each of the 100 barcodes, so the dictionary has about 100*18 items.
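The dictionary size quoted above can be checked with a small sketch. It is written for Python 3 and uses hypothetical names (`one_mismatch_variants`, `build_lookup`) that are not in the original code. It assumes barcodes contain only A, T, C and G, and, unlike `makehamming2`, it also stores the exact barcode as a key.

```python
def one_mismatch_variants(barcode, alphabet="ATCG"):
    """Return every string that differs from `barcode` at exactly one position."""
    variants = []
    for i, original in enumerate(barcode):
        for letter in alphabet:
            if letter != original:
                variants.append(barcode[:i] + letter + barcode[i + 1:])
    return variants


def build_lookup(barcodes):
    """Map each exact barcode and each one-mismatch variant to its barcode."""
    lookup = {}
    for barcode in barcodes:
        lookup[barcode] = barcode
        for variant in one_mismatch_variants(barcode):
            # Assumption: a variant shared by two barcodes is acceptable to
            # resolve arbitrarily; here the barcode added last wins.
            lookup[variant] = barcode
    return lookup


variants = one_mismatch_variants("ATCCGG")
print(len(variants))  # 6 positions x 3 alternative bases = 18

lookup = build_lookup(["ATCCGG", "ACCTGG"])
print(lookup["ATCCGA"])  # one mismatch in the last position of ATCCGG
```

Barcodes that differ from each other in only two positions, such as the two used here, share some one-mismatch variants, so a real barcode set may need a check for such collisions.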

count=0
eachline=1      # position (1-4) of the current line within its 4-line record
writeflag=0
seqlen=int(seqlen)
cutlen=len(cutsite)
infile=open(inf, "r")
for line in infile:
    count+=1
    if eachline==1:
        # header line of the record
        writeflag=0
        header=line
        eachline=2
    elif eachline==2:
        # sequence line: the first 6 characters are the barcode
        eachline=3
        line=line.strip()
        if line[0:6] in searchdict.keys():
            barcode=searchdict[line[0:6]]
            towritefile=outfile+"/"+barcode+".fastq"
            seq=line[6:seqlen+6]
            qualstart=6
            writeflag=1
            seqeach[barcode]=seqeach.get(barcode,0)+1
    elif eachline==3:
        eachline=4
        third=line
    elif eachline==4:
        # quality line: buffer the trimmed record and its mate from the second file
        eachline=1
        line=line.strip()
        if writeflag==1:
            qualline=line[qualstart:qualstart+seqlen]
            addToBuffer=header+seq+"\n"+third+qualline+"\n"
            bufferdict[towritefile]=bufferdict.get(towritefile,"")+addToBuffer

            Fourlinesofpair=getfrompair(inf2,count, seqlen)
            bufferdictpair[towritefile[:-6:]+"_2.fastq"]=\
                bufferdictpair.get(towritefile[:-6:]+"_2.fastq","")+Fourlinesofpair

            # flush the buffers when the record count is a multiple of 10000
            if (count/4)%10000==0:
                print "writing" , str((count/4))
                for key, val in bufferdict.items():
                    writefile1=open(key,"a")
                    writefile1.write(val)
                    bufferdict=dict()
                    writefile1.close()
                for key, val in bufferdictpair.items():
                    writefile1=open(key,"a")
                    writefile1.write(val)
                    bufferdictpair=dict()
                    writefile1.close()
                end=(time.time()-start)/60.0
                print "finished writing", str(end) , "minutes"

# flush whatever is left in the buffers
print "writing" , str(count/4)
for key, val in bufferdict.items():
    writefile1=open(key,"a")
    writefile1.write(val)
    bufferdict=dict()
    writefile1.close()
for key, val in bufferdictpair.items():
    writefile1=open(key,"a")
    writefile1.write(val)
    bufferdictpair=dict()
    writefile1.close()

end=(time.time()-start)/60.0
print "finished writing", str(end) , "minutes"

getfrompair is the following function:

def getfrompair(inf2, linenum, length):
    # Reopen the second file and scan from the top to the record ending at linenum.
    info=open(inf2,"r")
    content=""
    for count, line in enumerate(info):
        #print str(count)
        if count == linenum-4:
            content=line
        if count == linenum-3:
            content=content+line.strip()[:length]+"\n"
        if count == linenum-2:
            content=content+line
        if count == linenum-1:
            content=content+line.strip()[:length]+"\n"
            #print str(count), content
            return content

So my main question is: how can I optimize this? In most cases I would expect this code to run on a machine with at least 8 GB of memory and a processor with more than 4 cores. Can I use multiprocessing?
I buffer the output, following a suggestion in another thread here, because that was faster than writing to disk after each line.

Thank you in advance for teaching me.

Edit 1
After Ignacio’s suggestion, I profiled the code, and the “getfrompair” function takes more than half of the run time. Is there a better way to get a certain line from a file without going through every line before it each time?

Profile result from a fraction of the data (10000 lines, instead of the original 800 million):

     68719 function calls in 2.902 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       66    0.000    0.000    0.000    0.000 :0(append)
       32    0.003    0.000    0.003    0.000 :0(close)
     2199    0.007    0.000    0.007    0.000 :0(get)
        8    0.002    0.000    0.002    0.000 :0(items)
        3    0.000    0.000    0.000    0.000 :0(iteritems)
      750    0.001    0.000    0.001    0.000 :0(join)
     7193    0.349    0.000    0.349    0.000 :0(keys)
    39977    0.028    0.000    0.028    0.000 :0(len)
        1    0.000    0.000    0.000    0.000 :0(mkdir)
      767    0.045    0.000    0.045    0.000 :0(open)
      300    0.000    0.000    0.000    0.000 :0(range)
        1    0.005    0.005    0.005    0.005 :0(setprofile)
       96    0.000    0.000    0.000    0.000 :0(split)
        1    0.000    0.000    0.000    0.000 :0(startswith)
        1    0.000    0.000    0.000    0.000 :0(stat)
     6562    0.016    0.000    0.016    0.000 :0(strip)
        4    0.000    0.000    0.000    0.000 :0(time)
       48    0.000    0.000    0.000    0.000 :0(update)
       46    0.004    0.000    0.004    0.000 :0(write)
      733    1.735    0.002    1.776    0.002 RC14100~.PY:273(getfrompair)
        1    0.653    0.653    2.889    2.889 RC14100~.PY:31(split)
        1    0.000    0.000    0.000    0.000 RC14100~.PY:313(makehamming)
        1    0.000    0.000    0.005    0.005 RC14100~.PY:329(processbc2)
       48    0.003    0.000    0.005    0.000 RC14100~.PY:344(makehamming2)
        1    0.006    0.006    2.896    2.896 RC14100~.PY:4(<module>)
     4553    0.015    0.000    0.025    0.000 RC14100~.PY:74(<genexpr>)
     2659    0.014    0.000    0.023    0.000 RC14100~.PY:75(<genexpr>)
     2659    0.013    0.000    0.023    0.000 RC14100~.PY:76(<genexpr>)
        1    0.001    0.001    2.890    2.890 RC14100~.PY:8(main)
        1    0.000    0.000    0.000    0.000 cProfile.py:5(<module>)
        1    0.000    0.000    0.000    0.000 cProfile.py:66(Profile)
        1    0.000    0.000    0.000    0.000 genericpath.py:15(exists)
        1    0.000    0.000    0.000    0.000 ntpath.py:122(splitdrive)
        1    0.000    0.000    0.000    0.000 ntpath.py:164(split)
        1    0.000    0.000    0.000    0.000 os.py:136(makedirs)
        1    0.000    0.000    2.902    2.902 profile:0(<code object <module> at 000000000211A9B0, file "RC14100~.PY", line 4>)
        0    0.000             0.000          profile:0(profiler)



Process "Profile" terminated, ExitCode: 00000000
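The listing above is ordered by "standard name", which scatters the expensive entries through the table. As a minimal sketch, assuming Python 3 and the standard `cProfile` and `pstats` modules, the same kind of report can be sorted by cumulative time so the hot spots come first. The functions `rescan` and `run` are hypothetical stand-ins for the real workload, not part of the original script.

```python
import cProfile
import io
import pstats


def rescan(lines, target):
    """Walk `lines` from the start until index `target` (mimics rescanning a file)."""
    for index, line in enumerate(lines):
        if index == target:
            return line
    return None


def run():
    """Hypothetical workload: many lookups, each restarting from the top."""
    lines = ["x"] * 2000
    return [rescan(lines, target) for target in range(0, 2000, 10)]


profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
# Sort by cumulative time and show only the 5 most expensive entries.
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```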
1 Answer

  1. Editorial Team
    Added an answer on June 11, 2026 at 11:52 pm

    Your getfrompair function makes this a classic O(n^2) problem, since you read through the second file from the beginning each time you get a match. What you want to do instead is read from both files at the same time, so that you go through each file only once. izip is the way to do that.

    from itertools import izip
    
    infile2 = open(inf2, "r")
    for line, line2 in izip(infile, infile2):
        # line and line2 are the same-numbered lines of the two files
        pass
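    The loop above can be fleshed out along these lines. This is a sketch for Python 3, where the built-in `zip` is lazy and takes the place of `itertools.izip`. The helper names (`read_records`, `paired_matches`) and the `barcode_len` parameter are assumptions for illustration, not the asker's code, and the sketch assumes both files hold 4-line records in the same order. In-memory `io.StringIO` objects stand in for the two input files.

```python
import io
from itertools import islice


def read_records(handle):
    """Yield successive 4-line records (lists of lines) from an open file."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def paired_matches(handle1, handle2, searchdict, seqlen, barcode_len=6):
    """Walk both files once, in lockstep, yielding (barcode, rec1, rec2).

    rec1 has the barcode removed and is trimmed to seqlen bases;
    rec2 is the same-numbered record of the second file, trimmed to seqlen.
    """
    for rec1, rec2 in zip(read_records(handle1), read_records(handle2)):
        seq = rec1[1].strip()
        barcode = searchdict.get(seq[:barcode_len])  # single dict lookup
        if barcode is None:
            continue
        qual = rec1[3].strip()
        end = barcode_len + seqlen
        trimmed1 = (rec1[0] + seq[barcode_len:end] + "\n"
                    + rec1[2] + qual[barcode_len:end] + "\n")
        trimmed2 = (rec2[0] + rec2[1].strip()[:seqlen] + "\n"
                    + rec2[2] + rec2[3].strip()[:seqlen] + "\n")
        yield barcode, trimmed1, trimmed2


# Tiny made-up inputs: the first record carries a known barcode, the second does not.
reads1 = io.StringIO("@r1\nATCCGGAAAATTTT\n+\nIIIIIIJJJJKKKK\n"
                     "@r2\nGGGGGGCCCCAAAA\n+\nIIIIIIJJJJKKKK\n")
reads2 = io.StringIO("@r1\nTTTTCCCCGGGG\n+\nLLLLMMMMNNNN\n"
                     "@r2\nAAAACCCCGGGG\n+\nLLLLMMMMNNNN\n")
searchdict = {"ATCCGG": "ATCCGG", "ATCCGA": "ATCCGG"}

results = list(paired_matches(reads1, reads2, searchdict, seqlen=4))
for barcode, first, second in results:
    print(barcode)
    print(first, end="")
    print(second, end="")
```

    Because the generator yields one matched pair at a time, the caller can append each pair to its output files or buffers without holding the whole data set in memory.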
    