Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8755263
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 13, 20262026-06-13T13:49:08+00:00 2026-06-13T13:49:08+00:00

I am using Python and a regular expression to find an ORF (open reading

  • 0

I am using Python and a regular expression to find an ORF (open reading frame).

Find a sub-string a string that is composed ONLY of the letters ATGC (no spaces or new lines) that:

Starts with ATG, ends with TAG or TAA or TGA and should consider the sequence from the first character, then second and then third:

Seq= "CCTCAGCGAGGACAGCAAGGGACTAGCCAGGAGGGAGAACAGAAACTCCAGAACATCTTGGAAATAGCTCCCAGAAAAGC
AAGCAGCCAACCAGGCAGGTTCTGTCCCTTTCACTCACTGGCCCAAGGCGCCACATCTCCCTCCAGAAAAGACACCATGA
GCACAGAAAGCATGATCCGCGACGTGGAACTGGCAGAAGAGGCACTCCCCCAAAAGATGGGGGGCTTCCAGAACTCCAGG
CGGTGCCTATGTCTCAGCCTCTTCTCATTCCTGCTTGTGGCAGGGGCCACCACGCTCTTCTGTCTACTGAACTTCGGGGT
GATCGGTCCCCAAAGGGATGAGAAGTTCCCAAATGGCCTCCCTCTCATCAGTTCTATGGCCCAGACCCTCACACTCAGAT
CATCTTCTCAAAATTCGAGTGACAAGCCTGTAGCCCACGTCGTAGCAAACCACCAAGTGGAGGAGCAGCTGGAGTGGCTG
AGCCAGCGCGCCAACGCCCTCCTGGCCAACGGCATGGATCTCAAAGACAACCAACTAGTGGTGCCAGCCGATGGGTTGTA
CCTTGTCTACTCCCAGGTTCTCTTCAAGGGACAAGGCTGCCCCGACTACGTGCTCCTCACCCACACCGTCAGCCGATTTG
CTATCTCATACCAGGAGAAAGTCAACCTCCTCTCTGCCGTCAAGAGCCCCTGCCCCAAGGACACCCCTGAGGGGGCTGAG
CTCAAACCCTGGTATGAGCCCATATACCTGGGAGGAGTCTTCCAGCTGGAGAAGGGGGACCAACTCAGCGCTGAGGTCAA
TCTGCCCAAGTACTTAGACTTTGCGGAGTCCGGGCAGGTCTACTTTGGAGTCATTGCTCTGTGAAGGGAATGGGTGTTCA
TCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTA
TCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAA
GATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGG
AGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT
CTCAGGCCTTCCTACCTTCAGACCTTTCCAGATTCTTCCCTGAGGTGCAATGCACAGCCTTCCTCACAGAGCCAGCCCCC
CTCTATTTATATTTGCACTTATTATTTATTATTTATTTATTATTTATTTATTTGCTTATGAATGTATTTATTTGGAAGGC
CGGGGTGTCCTGGAGGACCCAGTGTGGGAAGCTGTCTTCAGACAGACATGTTTTCTGTGAAAACGGAGCTGAGCTGTCCC
CACCTGGCCTCTCTACCTTGTTGCCTCCTCTTTTGCTTATGTTTAAAACAAAATATTTATCTAACCCAATTGTCTTAATA
ACGCTGATTTGGTGACCAGGCTGTCGCTACATCACTGAACCTCTGCTCCCCACGGGAGCCGTGACTGTAATCGCCCTACG
GGTCATTGAGAGAAATAA"

What I have tried:

# finding the  stop codon here 

def stop_codon(seq_0):

        for i in range(0,len(seq_0),3):
            if (seq_0[i:i+3]== "TAA" and i%3==0) or (seq_0[i:i+3]== "TAG" and i%3==0) or (seq_0[i:i+3]== "TGA" and i%3==0) :
                a =i+3

                break

            else:
                a = None

# finding the start codon here 

startcodon_find =[m.start() for m in re.finditer('ATG', seq_0)]

How can I find a way to check the start codon and then find the first stop codon. Subsequently find the next start codon and the next stop codon.

I wish to run this for three frames. As mentioned earlier the three frames would be considering the first, second and third characters of the sequence as the start.

Also the sequence needs to be divided into small parts of 3. There for it should be some thing like this:

ATG TTT AAA ACA AAA TAT TTA TCT AAC CCA ATT GTC TTA ATA ACG CTG ATT TGA

Any help will be appreciated.

My final answer :

def orf_find(st0):

    seq_0=""
    for i in range(0,len(st0),3):
        if len(st0[i:i+3])==3:
            seq_0 = seq_0 + st0[i:i+3]+ " "

    ms_1 =[m.start() for m in re.finditer('ATG', seq_0)]
    ms_2 =[m.start() for m in re.finditer('(TAA)|(TAG)|(TGA)', seq_0)]

    def get_next(arr,value):
        for a in arr:
            if a > value:
                return a
        return -1




    codons = []
    start_codon=ms_1[0]
    while (True):
        stop_codon = get_next(ms_2,start_codon)
        if stop_codon == -1:
            break
        codons.append((start_codon,stop_codon))
        start_codon = get_next(ms_1,stop_codon)
        if start_codon==-1:
            break

    max_val = 0
    selected_tupple = ()
    for i in codons:
        k=i[1]-i[0]
        if k > max_val:
            max_val = k
            selected_tupple = i

    print "selected tupple is ", selected_tupple

    final_seq=seq_0[selected_tupple[0]:selected_tupple[1]+3]

    print final_seq
    print "The longest orf length is " + str(max_val)



output_file = open('Longorf.txt','w')
output_file.write(str(orf_find(st0)))

output_file.close()

The above write function does not help me in writing the content on to a text file . All i get in there is NONE.. Why this error .. Can anybody Help ?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-13T13:49:09+00:00Added an answer on June 13, 2026 at 1:49 pm

    As you have tagged it Biopython I suppose you know of Biopython. Have you checked out the docu yet? http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc231 might help.

    I adjusted the code from the above link a bit to work on your sequence:

    from Bio.Seq import Seq
    
    seq = Seq("CCTCAGCGAGGACAGCAAGGGACTAGCCAGGAGGGAGAACAGAAACTCCAGAACATCTTGGAAATAGCTCCCAGAAAAGCAAGCAGCCAACCAGGCAGGTTCTGTCCCTTTCACTCACTGGCCCAAGGCGCCACATCTCCCTCCAGAAAAGACACCATGAGCACAGAAAGCATGATCCGCGACGTGGAACTGGCAGAAGAGGCACTCCCCCAAAAGATGGGGGGCTTCCAGAACTCCAGGCGGTGCCTATGTCTCAGCCTCTTCTCATTCCTGCTTGTGGCAGGGGCCACCACGCTCTTCTGTCTACTGAACTTCGGGGTGATCGGTCCCCAAAGGGATGAGAAGTTCCCAAATGGCCTCCCTCTCATCAGTTCTATGGCCCAGACCCTCACACTCAGATCATCTTCTCAAAATTCGAGTGACAAGCCTGTAGCCCACGTCGTAGCAAACCACCAAGTGGAGGAGCAGCTGGAGTGGCTGAGCCAGCGCGCCAACGCCCTCCTGGCCAACGGCATGGATCTCAAAGACAACCAACTAGTGGTGCCAGCCGATGGGTTGTACCTTGTCTACTCCCAGGTTCTCTTCAAGGGACAAGGCTGCCCCGACTACGTGCTCCTCACCCACACCGTCAGCCGATTTGCTATCTCATACCAGGAGAAAGTCAACCTCCTCTCTGCCGTCAAGAGCCCCTGCCCCAAGGACACCCCTGAGGGGGCTGAGCTCAAACCCTGGTATGAGCCCATATACCTGGGAGGAGTCTTCCAGCTGGAGAAGGGGGACCAACTCAGCGCTGAGGTCAATCTGCCCAAGTACTTAGACTTTGCGGAGTCCGGGCAGGTCTACTTTGGAGTCATTGCTCTGTGAAGGGAATGGGTGTTCATCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTATCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAAGATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGGAGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGATCTCAGGCCTTCCTACCTTCAGACCTTTCCAGATTCTTCCCTGAGGTGCAATGCACAGCCTTCCTCACAGAGCCAGCCCCCCTCTATTTATATTTGCACTTATTATTTATTATTTATTTATTATTTATTTATTTGCTTATGAATGTATTTATTTGGAAGGCCGGGGTGTCCTGGAGGACCCAGTGTGGGAAGCTGTCTTCAGACAGACATGTTTTCTGTGAAAACGGAGCTGAGCTGTCCCCACCTGGCCTCTCTACCTTGTTGCCTCCTCTTTTGCTTATGTTTAAAACAAAATATTTATCTAACCCAATTGTCTTAATAACGCTGATTTGGTGACCAGGCTGTCGCTACATCACTGAACCTCTGCTCCCCACGGGAGCCGTGACTGTAATCGCCCTACGGGTCATTGAGAGAAATAA")
    
    
    table = 1
    min_pro_len = 100
    
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            for pro in nuc[frame:].translate(table).split("*"):
                if len(pro) >= min_pro_len:
                    print "%s...%s - length %i, strand %i, frame %i" % (pro[:30], pro[-3:], len(pro), strand, frame)
    

    The ORF is also translated. You can choose a different translation table. Check out http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:translation

    EDIT: Explanation of the code:

    Right at the top I create a sequence object out of your string. Notice the seq = Seq("ACGT").
    The two for-loops create the 6 different frames. The inner for-loop translates each frame according to the chosen translation table and returns an amino acid chain where each stop codon is coded as *. The split function splits this string removing these placeholders resulting in a list of possible protein sequences. By setting min_pro_len you can define the minimum amino acid chain length for a protein to be detected. 1 is the standard table. Check out http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG1 Here you see that the initiation codon is AUG (equals ATG) and the end codons (* in the nucleotide sequence) are TAA, TAG, and TGA, just like you wanted. You could also use a different translation table.

    When you add

    print nuc[frame:].translate(table)
    

    right inside the second for-loop you get something like:

    PQRGQQGTSQEGEQKLQNILEIAPRKASSQPGRFCPFHSLAQGATSPSRKDTMSTESMIRDVELAEEALPQKMGGFQNSRRCLCLSLFSFLLVAGATTLFCLLNFGVIGPQRDEKFPNGLPLISSMAQTLTLRSSSQNSSDKPVAHVVANHQVEEQLEWLSQRANALLANGMDLKDNQLVVPADGLYLVYSQVLFKGQGCPDYVLLTHTVSRFAISYQEKVNLLSAVKSPCPKDTPEGAELKPWYEPIYLGGVFQLEKGDQLSAEVNLPKYLDFAESGQVYFGVIAL*REWVFIHSLPSPHSDPFTLTPLLSTPQSPQSVSF*LRKGIMAQGPTLCSELSTTTQKHKMLGQ*PGLWASHAPPSRTQMGFPNSLEPRMSIPEFCKGRVVRLPLSQNEAG*DLRPSYLQTFPDSSLRCNAQPSSQSQPPSIYICTYYLLFIYYLFICL*MYLFGRPGCPGGPSVGSCLQTDMFSVKTELSCPHLASLPCCLLFCLCLKQNIYLTQLS**R*FGDQAVATSLNLCSPREP*L*SPYGSLREI
    

    (notice the asterisks are at the stop codon positions)

    EDIT: Answer to your second question:

    You must return a string that you want write into a file. Create an Output string and return it at the end of the function:

    output = "selected tupple is " + str(selected_tupple) + "\n"
    output += final_seq + "\n"
    output += "The longest orf length is " + str(max_val) + "\n"
    return output
    
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I'm using Python to write a regular expression for replacing parts of the string
How to parse the string {'result':(Boolean, MessageString)} using Python regular expressions to get Boolean
I'm using python regular expression module, re . I need to match anything inside
I want to write a regular expression that will match the following string a
In response to Python regular expression I tried to implement an HTML parser using
In Python, I can compile a regular expression to be case-insensitive using re.compile :
I'm using Python re to try to make a regular expression which finds all
In my Python application, I need to write a regular expression that matches a
In a translation-testing app (in Python) I want a regular expression that will accept
I was using Python and regular expressions to find things an HTML document and

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.