Consider the following two cases. In both, the string to test contains only the characters ‘a’, ‘t’, ‘g’ and ‘c’ in any combination and can be of arbitrary length. It could consist only of ‘t’, for example.
- Test to see if ‘a’ occurs more than 5 times. If the length of the string is 100 and the character ‘a’ appears five times in the first 10 places, regex should not search the remaining string.
- Test to see if the string contains at most one run of consecutive ‘a’ characters and does not end with ‘a’. Valid examples: ggtcccccccctggtaaaatcg, gctgctcgtccttgcttcg, ag. Invalid examples: a, agatcttgcgt, agtcga.
Now I know how to construct a basic regex to test for both cases but I want to ensure the search is optimized and does not waste unnecessary iterations. In the second point above, agatcttgcgt should terminate as soon as the third character is tested since it breaks the consecutive rule.
Any help with the optimized regex would be appreciated. Also, though it is not the primary question, how can I see the internals of how the search is performed (number of iterations, etc.)?
If performance is critical, you might want to consider non-regex solutions. For example, your first requirement can easily be solved by using `string.Contains`.
A regular expression typically scans its input in a linear fashion from left to right, looking at every character until it finds a match, and possibly looking at characters multiple times if there is backtracking. On the other hand, there exist some advanced string searching algorithms that can determine the presence or absence of a substring without necessarily examining all characters in the string. For example, to search for `aaaaa` you only need to check every fifth character until you find an `a`.
You can use RegexBuddy to debug regular expressions and to see how many steps are needed.
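To make the ideas above concrete, here is a minimal sketch in Python. Python is only my choice for illustration, since the question does not name a language, and all three function names are hypothetical. `has_run_of_a` shows the "check every fifth character" idea for a consecutive run; `more_than_n_a` assumes the first requirement means "more than 5 occurrences anywhere in the string" and stops as soon as the count is exceeded; `has_single_a_run` is one possible reading of the second requirement, written so that the pattern needs no backtracking. Note that this pattern also accepts the empty string, which the question does not specify.

```python
import re


def has_run_of_a(s: str, run: int = 5) -> bool:
    """Return True if s contains `run` consecutive 'a' characters.

    Any run of that length must cover one of the probed positions,
    so most characters are never examined when 'a' is rare.
    """
    n = len(s)
    i = run - 1
    while i < n:
        if s[i] != "a":
            i += run  # no qualifying run can include position i
            continue
        # s[i] is 'a': measure the maximal run of 'a' around position i.
        left = i
        while left > 0 and s[left - 1] == "a":
            left -= 1
        right = i
        while right + 1 < n and s[right + 1] == "a":
            right += 1
        if right - left + 1 >= run:
            return True
        # The run was too short and s[right + 1] is not 'a' (or is the end),
        # so the next possible run ends at right + 1 + run or later.
        i = right + 1 + run
    return False


def more_than_n_a(s: str, limit: int = 5) -> bool:
    """Return True as soon as more than `limit` 'a' characters are seen."""
    count = 0
    for ch in s:
        if ch == "a":
            count += 1
            if count > limit:
                return True  # early exit: the rest of s is not scanned
    return False


# Assumed reading of case 2: at most one run of 'a', never at the end.
# The character classes are disjoint, so the engine never backtracks.
SINGLE_RUN = re.compile(r"[tgc]*(?:a+[tgc]+)?")


def has_single_a_run(s: str) -> bool:
    """Return True if s has at most one run of 'a' and does not end in 'a'."""
    return SINGLE_RUN.fullmatch(s) is not None
```

With the examples from the question, `has_single_a_run("ggtcccccccctggtaaaatcg")` and `has_single_a_run("ag")` return `True`, while `has_single_a_run("agatcttgcgt")` and `has_single_a_run("a")` return `False`. For `agatcttgcgt` the match attempt fails at the third character, which is the early termination the question asks for.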