I have a little trouble with a regular expression. I have the following one: (A|C|G|T){3}, which matches every three-letter combination of A, C, G and T, but now I want to exclude three specific patterns: "TAG", "TAA" and "TGA". I tried [^], but it does not yield the expected results. The same goes for look-around (look-ahead and look-behind).
What I am trying to achieve is to find all substrings that start with “ATG”, end with either “TAG”, “TAA” or “TGA”, and have triplets of A, C, G or T in the middle.
Thanks for the help!
Here is what I have done so far:
(ATG)((((A|C|G|T)){3})[^TAG][^TAA][^TGA])*(TAG|TAA|TGA)
(ATG)((?!TAG)(?!TAA)(?!TGA)(((A|C|G|T)){3})*)(TAG|TAA|TGA)
If I understand correctly:
1) Start with ATG
2) A number of triplets, except ‘TAG’, ‘TAA’, and ‘TGA’
3) One of the triplets ‘TAG’, ‘TAA’, or ‘TGA’
This should work:
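A pattern following the three steps above, sketched in Python. The names `ORF` and `find_orfs` and the sample sequence are made up for illustration, and the exact spelling of the pattern is an assumption based on the description here:

```python
import re

# Step 1: the literal start 'ATG'.
# Step 2: any number of triplets; the negative look-ahead is checked
#         before every single triplet, so none of them can be an exception.
# Step 3: one of the three terminating triplets.
ORF = re.compile(r"ATG(?:(?!TAG|TAA|TGA)[ACGT]{3})*(?:TAG|TAA|TGA)")


def find_orfs(sequence):
    """Return all non-overlapping matches, scanning left to right."""
    return [m.group(0) for m in ORF.finditer(sequence)]


print(find_orfs("CCATGAAACCCTAGGGATGTGA"))
# ['ATGAAACCCTAG', 'ATGTGA']
```

Note that `[^TAG]` in the first attempt is a character class: it matches one single character that is not T, A or G, which is why it cannot exclude a whole triplet.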
The difference from your second idea is that the negative look-ahead is moved inside the quantifier, so that the ‘a number of triplets’ step checks every triplet and ensures that none of them is one of the exceptions.
This solution does not assume any commonality between the elements in step 2 and step 3. A simpler, but in your case equivalent, formulation would be:
1) Match ‘ATG’
2) Match a number of triplets
3) … until you match ‘TAG’, ‘TAA’, ‘TGA’.
To do this you just need to make the quantifier in step 2 non-greedy, so that the engine tests whether step 3 matches before it tries to match step 2 again.
Then the solution would look like:
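A sketch of the non-greedy variant in Python. As before, `LAZY_ORF`, `find_orfs_lazy` and the sample sequence are illustrative names, and the pattern is an assumption reconstructed from the description:

```python
import re

# '*?' is the non-greedy form of '*': after 'ATG' the engine first tries
# to match a terminating triplet, and only consumes another ordinary
# triplet when that attempt fails.
LAZY_ORF = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAG|TAA|TGA)")


def find_orfs_lazy(sequence):
    """Return all non-overlapping matches, scanning left to right."""
    return [m.group(0) for m in LAZY_ORF.finditer(sequence)]


print(find_orfs_lazy("CCATGAAACCCTAGGGATGTGA"))
# ['ATGAAACCCTAG', 'ATGTGA']
```

Every triplet consumed by step 2 was first tested as a terminator and rejected, so the match always ends at the first terminating triplet that lines up with the start.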
An alternative interpretation might be:
1) Start with ATG
2) A number of triplets
3) One of the triplets ‘TAG’, ‘TAA’, ‘TGA’
4) The substring found in step 2 must not contain the substrings ‘TAG’, ‘TAA’, ‘TGA’.
In this case I would solve it using two regular expressions: one implementing steps 1-3 and one for the test in step 4:
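One way the two-expression approach could look in Python. This is only a sketch: `CANDIDATE`, `STOP_ANYWHERE` and `find_clean_orfs` are hypothetical names, and the choice of a non-greedy quantifier in the first expression is an assumption:

```python
import re

# Steps 1-3: start, a captured run of triplets, one terminating triplet.
CANDIDATE = re.compile(r"ATG((?:[ACGT]{3})*?)(?:TAG|TAA|TGA)")

# Step 4: any of the three substrings at any offset, not only on
# triplet boundaries.
STOP_ANYWHERE = re.compile(r"TAG|TAA|TGA")


def find_clean_orfs(sequence):
    """Keep only the candidates whose middle part passes the step 4 test."""
    results = []
    for m in CANDIDATE.finditer(sequence):
        middle = m.group(1)  # the substring found in step 2
        if STOP_ANYWHERE.search(middle) is None:
            results.append(m.group(0))
    return results


print(find_clean_orfs("ATGAAACCCTAGATGCTAGAATAA"))
# ['ATGAAACCCTAG']
```

In the sample, the second candidate `ATGCTAGAATAA` is rejected because its middle part `CTAGAA` contains `TAG` across a triplet boundary. Since `finditer` returns non-overlapping matches, a rejected candidate still consumes its text, so a start nested inside it would not be examined by this sketch.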