Given a DNA string, for example AGC, I am trying to generate all possible unique strings allowing up to n (a user-defined number) mismatches in the given string.
I am able to do this for one mismatch in the following way, but I have not been able to implement a recursive solution that generates all possible combinations based on the number of mismatches n, the DNA string, and the mutation set (AGCTN):
temp_dict = {}
sequence = 'AGC'
for x in xrange(len(sequence)):
    prefix = sequence[:x]
    suffix = sequence[x+1:]
    temp_dict.update([(prefix + base + suffix, 1) for base in 'ACGTN'])
print temp_dict
An example:
For a given sample string, ACG, the following are the 13 unique sequences allowing up to one mismatch:
{'ACC': 1, 'ATG': 1, 'AAG': 1, 'ANG': 1, 'ACG': 1, 'GCG': 1, 'AGG': 1,
'ACA': 1, 'ACN': 1, 'ACT': 1, 'TCG': 1, 'CCG': 1, 'NCG': 1}
I want to generalize this so that the program can take a 100-character DNA string and return a list/dict of unique strings allowing a user-defined number of mismatches.
Thanks!
-Abhi
Assuming I understand you, I think you can use the itertools module. The basic idea is to choose the locations where there is going to be a mismatch using combinations, and then construct all satisfying lists using product:
For your example case:
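A minimal sketch of this idea, run on the ACG example with up to one mismatch, might look like the following. It is written for Python 3. The function name mismatches and its parameters (alphabet, max_mismatches) are hypothetical names chosen for illustration. Looping over every mismatch count from 0 to n is one assumed way of covering "up to n" mismatches.

```python
from itertools import combinations, product


def mismatches(sequence, alphabet="ACGTN", max_mismatches=1):
    """Return the set of strings that differ from sequence in at most
    max_mismatches positions (hypothetical helper, not a library function)."""
    results = set()
    # Cover 0, 1, ..., max_mismatches substituted positions.
    for k in range(max_mismatches + 1):
        # Choose which k positions will carry a mismatch.
        for positions in combinations(range(len(sequence)), k):
            # At each chosen position, allow every letter except the original.
            choices = [[c for c in alphabet if c != sequence[p]] for p in positions]
            # Build one string per combination of replacement letters.
            for replacement in product(*choices):
                chars = list(sequence)
                for p, c in zip(positions, replacement):
                    chars[p] = c
                results.add("".join(chars))
    return results


variants = mismatches("ACG", max_mismatches=1)
print(len(variants))
print(sorted(variants))
```

For ACG with one mismatch this yields the original string plus 3 positions times 4 alternative letters, which matches the 13 sequences listed in the question. For a 100-character string the number of results grows very quickly with the number of mismatches, so yielding the strings lazily from a generator instead of collecting them in a set may be the more practical choice.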