I’m looking to create a regular expression in Python that matches all DNA sequences starting with T, followed by 18 characters (any characters), and ending in AA, TT, CC or GG. I can manage the first part, but I can’t find a way to write the ending (the doubled characters) without duplicating the regex 4 times.
Here’s what I have for a sequence ending in TT:
import re
seq = 'ATGTGTGGACACAAGTGACAGTTTACGATGAGGTTACAGCCCGCA'
# T, then any 18 characters, then TT
match = re.findall(r'T.{18}TT', seq)
print(match)
Check out a good tutorial.
There is a concept called alternation. It matches any one of the given options, which are separated by |. Here the options are AA|TT|CC|GG, placed in a group after T.{18}.
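A minimal sketch of alternation applied to the sequence from the question. The variable names are only illustrative, and the group is written as non-capturing, (?:...), on the assumption that you want re.findall to return the whole match rather than just the two-letter ending:

```python
import re

seq = 'ATGTGTGGACACAAGTGACAGTTTACGATGAGGTTACAGCCCGCA'

# T, then any 18 characters, then one of the four doubled bases.
# (?:...) groups the alternatives without capturing, so findall
# returns the full matched text instead of only the group contents.
pattern = re.compile(r'T.{18}(?:AA|TT|CC|GG)')

matches = pattern.findall(seq)
print(matches)
```

With a plain capturing group, (AA|TT|CC|GG), re.findall would return only the captured endings. Also note that findall reports non-overlapping matches only.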
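Note that you should use raw strings (the r'...' prefix) to write regexes in Python. Otherwise Python interprets backslash sequences such as \b itself before the regex engine sees them, and you will run into escaping problems later.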
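To illustrate the point about raw strings, here is a small sketch using \b, which the regex engine reads as a word boundary but which Python turns into a backspace character in a normal string. The sample text 'TT AA' is made up for the demonstration:

```python
import re

plain = '\bTT'   # Python converts \b to a backspace character
raw = r'\bTT'    # backslash is kept, so the regex engine sees \b (word boundary)

# The plain string is one character shorter because \b collapsed
# into a single backspace character.
print(len(plain), len(raw))

plain_hits = re.findall(plain, 'TT AA')  # looks for a literal backspace, then TT
raw_hits = re.findall(raw, 'TT AA')      # looks for TT at a word boundary
print(plain_hits, raw_hits)
```

The pattern in the question contains no backslashes, so it works either way, but using raw strings consistently avoids surprises once you add escapes.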