I am posting a follow-up question to a previous question I had regarding reading frames.
sequence = 'AAATGAAATAAGGATGGGGTAGTATGATGTGTTT'
I am ultimately looking for the pattern ‘ATG’, and I want to scan the input sequence until it is found. Once it is found, I want to proceed in a reading frame of 3 until I reach one of ‘TAA’, ‘TAG’, or ‘TGT’, and then resume scanning for the next ‘ATG’ that has a downstream ‘TAA’, ‘TAG’, or ‘TGT’. The result I am after is:
codon_list = ['ATG','AAA','TAA'],['ATG','GGG','TAG'],['ATG','ATG','TGT']
I was trying this
start_frame = sequence.find('ATG')
but it only gives me the first occurrence of ‘ATG’ (i.e. index 2).
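For reference, `str.find` accepts an optional start index, so calling it repeatedly can walk through every occurrence rather than just the first. A minimal sketch, where the helper name `find_all` is hypothetical and not part of the original code:

```python
def find_all(sequence, pattern):
    """Return the start index of every occurrence of pattern, overlaps included."""
    positions = []
    pos = sequence.find(pattern)
    while pos != -1:
        positions.append(pos)
        # Resume the search one character past the previous hit.
        pos = sequence.find(pattern, pos + 1)
    return positions

sequence = 'AAATGAAATAAGGATGGGGTAGTATGATGTGTTT'
print(find_all(sequence, 'ATG'))
# [2, 13, 23, 26]
```

Note that the hit at index 26 lies inside the frame opened at index 23, so not every ‘ATG’ found this way starts a new reading frame.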
Just for the first list of codons I wrote
codon_list = []
for codon in range(len(sequence)):
    # read the next codon in the frame opened by the first 'ATG'
    next_codon = sequence[start_frame:start_frame + 3]
    codon_list.append(next_codon)
    start_frame = start_frame + 3
    # stop at the first in-frame terminating codon
    if next_codon in ('TAA', 'TAG', 'TGT'):
        break
print(codon_list)
>>> ['ATG','AAA','TAA']
It only works for the first occurrence of ‘ATG’.
The next part is where I want to create a name for each codon (0,1,2,3,…) and I think I figured that part out:
indx = range(0,len(codon_list))
indx_codon = dict(zip(indx, codon_list))
indx_codon = {0:['ATG','AAA','TAA'],1:['ATG','GGG','TAG'],2:['ATG','ATG','TGT']}
codon_start = ['2','13','23']
codon_end = ['8','21','31']
codon_positions = []
for p,q in zip(codon_start,codon_end):
    codon_positions.append(str(p)+':'+str(q))
print(codon_positions)
>>> ['2:8', '13:21', '23:31']
So my biggest problem is that .find() only returns the first occurrence, and my indexing breaks if a ‘TAA’, ‘TAG’, or ‘TGT’ appears before the ‘ATG’ (the ‘ATG’ is what is supposed to start the reading frame of 3).
How can I create a list of multiple sequences that follow these criteria (i.e. turn sequence into codon_list)?
Here is a fairly concise solution using regular expressions:
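A minimal sketch of how a regular-expression approach could look. The pattern, the name `ORF_PATTERN`, and the helper `find_orfs` are assumptions for illustration, and the sequence is assumed to contain only uppercase A, C, G, and T. The pattern matches ‘ATG’, then as few whole triplets as possible, then one of the three terminating codons; `re.finditer` resumes scanning after each match, which gives the "continue to the next ‘ATG’" behaviour.

```python
import re

# Assumed pattern: ATG, a lazy run of in-frame triplets, then TAA, TAG, or TGT.
ORF_PATTERN = re.compile(r'ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGT)')

def find_orfs(sequence):
    """Return one list of codons per non-overlapping ATG...stop match."""
    codon_list = []
    for match in ORF_PATTERN.finditer(sequence):
        orf = match.group()
        # Split the matched stretch into consecutive 3-letter codons.
        codon_list.append([orf[i:i + 3] for i in range(0, len(orf), 3)])
    return codon_list

sequence = 'AAATGAAATAAGGATGGGGTAGTATGATGTGTTT'
print(find_orfs(sequence))
# [['ATG', 'AAA', 'TAA'], ['ATG', 'GGG', 'TAG'], ['ATG', 'ATG', 'TGT']]
```

Because the triplet group is lazy, each match ends at the first in-frame terminating codon. An ‘ATG’ with no in-frame terminating codon downstream is skipped, and the scan moves on to the next candidate.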
Result:
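The index dictionary and the start:end strings from the question can also be derived from regex match objects instead of being typed in by hand. A hedged sketch, assuming a pattern of ‘ATG’, a lazy run of triplets, and a terminating codon; the names `ORF_PATTERN` and `index_orfs` are hypothetical. The end position is taken as the index of the last base of the terminating codon, which agrees with the question's 21 and 31; under that convention the first entry comes out as 2:10 rather than 2:8.

```python
import re

# Assumed pattern: ATG, a lazy run of in-frame triplets, then TAA, TAG, or TGT.
ORF_PATTERN = re.compile(r'ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGT)')

def index_orfs(sequence):
    """Return ({index: codons}, ['start:end', ...]) for each match."""
    indx_codon = {}
    codon_positions = []
    for indx, match in enumerate(ORF_PATTERN.finditer(sequence)):
        orf = match.group()
        indx_codon[indx] = [orf[i:i + 3] for i in range(0, len(orf), 3)]
        # match.end() is exclusive, so subtract 1 to get the last base.
        codon_positions.append('%d:%d' % (match.start(), match.end() - 1))
    return indx_codon, codon_positions

sequence = 'AAATGAAATAAGGATGGGGTAGTATGATGTGTTT'
indx_codon, codon_positions = index_orfs(sequence)
print(indx_codon)
print(codon_positions)
# ['2:10', '13:21', '23:31']
```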