I have the following regex:
(?i:^TPI$|^TIP$|^IPT$|^ITP$|^PIT$|^PTI$|^IP$|^PI$|^TI$|^IT$|^PT$|^TP$|^T$|^P$|^I$)
How can I simplify it? My regular expression knowledge is rather limited.
My requirements are:
- Acceptable inputs are “T”, “P”, and “I”
- Values may come in any order
- Only one of each value is accepted. “TTI” is invalid, but “TI” is valid
- Case insensitive
I used
^(?i:[TPI]){1,3}$
in the past, and that mostly works. The only problem is that it accepts repeated values (“TTT” is acceptable with that regex, but I need that to fail).
We can try a different way. The attempt you made allows some strings to slip through that you don’t want, namely everything with repetitions. In the following I will experiment a bit with PowerShell to show the solution. First we need all possible strings we can expect as input:
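A minimal sketch of that enumeration, written in Python rather than PowerShell. The helper name `all_candidates` is made up for illustration:

```python
from itertools import product

LETTERS = "TPI"


def all_candidates(letters=LETTERS, max_len=3):
    """Return every string of length 1..max_len over `letters`, repetitions allowed."""
    return [
        "".join(combo)
        for length in range(1, max_len + 1)
        for combo in product(letters, repeat=length)
    ]


if __name__ == "__main__":
    candidates = all_candidates()
    print(len(candidates))       # 3 + 9 + 27 = 39 strings
    print(" ".join(candidates))  # all values on a single line
```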
This yields every string of one to three letters drawn from T, P and I, repetitions included: 39 values in total, one per line.
This is of course also what your regular expression ^(?i:[TPI]){1,3}$ will match, ignoring case.
We can restrict what we want to match by using a so-called negative lookahead assertion, which succeeds only if some text does not follow at that point. It doesn’t consume the text itself, thereby allowing the input to still be matched by the pattern you have above. This can be accomplished with (?!) where you would insert some sub-expression after the !. Let’s try and restrict to input that doesn’t start with two I, two P or two T:
^(?!II|PP|TT)[TPI]{1,3}$
With that, those are gone from the results. We can simplify this if we use a capturing group and a backreference. Parentheses normally (except if they start with (?) capture what is matched inside them, and you can use that after matching to extract parts from the match or for replacements. But you can also use it in the pattern itself in many regex engines (in fact, I think there is no engine that allows negative lookahead but not backreferences in the pattern). So II|PP|TT can be written as (.)\1, which just says “a letter, followed by exactly the same letter”, since \1 is the backreference pointing to whatever was matched by (.).
Now we still have a few values we don’t want, namely everything with the same letter in positions 2 and 3 and everything with the same letter in positions 1 and 3. We can get rid of the former with the following:
^(?!.?(.)\1)[TPI]{1,3}$
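To see what each of these lookaheads removes, here is a small Python sketch. Python's `re` module stands in for PowerShell here, the patterns are spelled out from the description above, and the names `PATTERNS` and `survivors` are illustrative only:

```python
import re
from itertools import product

# All strings of length 1..3 over T, P, I (repetitions allowed).
CANDIDATES = ["".join(c) for n in (1, 2, 3) for c in product("TPI", repeat=n)]

# One pattern per step of the explanation (written out as an assumption).
PATTERNS = {
    "base": r"^[TPI]{1,3}$",
    "explicit pairs at start": r"^(?!II|PP|TT)[TPI]{1,3}$",
    "backreference at start": r"^(?!(.)\1)[TPI]{1,3}$",
    "pair at start or end": r"^(?!.?(.)\1)[TPI]{1,3}$",
}


def survivors(pattern):
    """Return the candidate strings that the given pattern still accepts."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [s for s in CANDIDATES if rx.match(s)]


if __name__ == "__main__":
    for label, pattern in PATTERNS.items():
        kept = survivors(pattern)
        print(f"{label:24} {len(kept):2}  {' '.join(kept)}")
```

Strings such as TIT, where the first and third letters agree, still get through at this stage.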
The .? at the beginning now says “match a character or not”, which extends what we had before to also exclude the matches with repetitions at the end. For the second set we just need to exclude matches that look like (.).\1, i.e. a letter, followed by another and then a repetition of the first. We can extend the regex above by just putting another .?, i.e. an optional letter, between the capturing group and the backreference. What is left is now exactly the set you wanted to represent. The final regex is
^(?!.?(.).?\1)[TPI]{1,3}$
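A quick way to check the pattern built up in these steps is to compare it with the long alternation from the question over a range of inputs. The following Python sketch does that. The short pattern is written out from the description above, case-insensitivity is supplied through `re.IGNORECASE` (an assumption about how you would enable it in your engine), and the names are illustrative:

```python
import re
from itertools import product

# The long alternation from the question, copied as-is.
ORIGINAL = re.compile(
    r"(?i:^TPI$|^TIP$|^IPT$|^ITP$|^PIT$|^PTI$|^IP$|^PI$|^TI$|^IT$|^PT$|^TP$|^T$|^P$|^I$)"
)
# Basic pattern plus a negative lookahead that rejects any repeated letter.
SHORT = re.compile(r"^(?!.?(.).?\1)[TPI]{1,3}$", re.IGNORECASE)


def disagreements(alphabet="TPItpiX", max_len=4):
    """Return every test string on which the two patterns give different answers."""
    differing = []
    for length in range(0, max_len + 1):
        for combo in product(alphabet, repeat=length):
            s = "".join(combo)
            if bool(ORIGINAL.search(s)) != bool(SHORT.search(s)):
                differing.append(s)
    return differing


if __name__ == "__main__":
    print(disagreements())
```

The test alphabet includes lower-case letters and one letter outside the allowed set, and goes up to length 4, so that case handling, foreign characters and over-long input are all exercised.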
It’s shorter than before, that’s for sure. Whether it’s simpler might be up for debate, as it might need some explanation of what it does. This probably is even more the case for the more compressed approach in the other answer. It’s shorter, indeed, but since this is my answer and we are competing for votes I just have to say that I dislike it 😉 … just kidding. But for such things I guess separating the basic pattern from the exclusions does indeed make sense for readability.
Another option might be to validate the basic pattern with a regex, i.e. your initial approach, and then use code to reject duplicates, depending on your language – provided it makes those things easy and readable.
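In Python, for instance, such a two-step check might look like the sketch below. The function name `is_valid` is made up for illustration, and comparing against a set is only one of several ways to detect duplicates:

```python
import re

# Step 1: the basic pattern, i.e. one to three of the allowed letters.
BASIC = re.compile(r"[TPI]{1,3}", re.IGNORECASE)


def is_valid(value: str) -> bool:
    """Accept 1-3 of the letters T, P, I (any case, any order), each at most once."""
    if not BASIC.fullmatch(value):
        return False
    # Step 2: reject duplicates; a set drops repeated letters, so the sizes differ.
    folded = value.upper()
    return len(set(folded)) == len(folded)


if __name__ == "__main__":
    for sample in ["TI", "ipt", "TTI", "Tt", "", "TPIX"]:
        print(repr(sample), is_valid(sample))
```

The regex stays trivial to read this way, and the uniqueness rule lives in one clearly named line of code.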