In a translation-testing app (in Python) I want a regular expression that will accept either of these two strings:
a = "I want the red book"
b = "the book which I want is red"
So far I’m using something like this:
^(the book which )*I want (is |the )red( book)*$
This will accept both string a and string b. But it will also accept a string without either of the two optional sub-strings:
sub1 = (the book which )
sub2 = ( book)
How can I indicate that one of these two substrings must be present, even though they’re not adjacent?
I realize that in this example it would be trivially easy to avoid the problem by just testing for longer alternatives separated by “or” |. This is a simplified example of a problem that is harder to avoid with the actual user input I’m working with.
This looks like a problem that might be better solved with a difflib.SequenceMatcher than with regular expressions.
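As a rough illustration of that idea, here is a minimal sketch that scores a candidate against each accepted answer with difflib.SequenceMatcher. The helper names (similarity, is_close_enough) and the 0.9 threshold are assumptions made for this example, not something taken from the question:

```python
from difflib import SequenceMatcher

# The two accepted answers from the question.
EXPECTED = [
    "I want the red book",
    "the book which I want is red",
]


def similarity(expected: str, candidate: str) -> float:
    """Return a similarity score in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, expected, candidate).ratio()


def is_close_enough(candidate: str, expected_answers=EXPECTED, threshold: float = 0.9) -> bool:
    """Accept the candidate if it is similar enough to any expected answer.

    The threshold is an arbitrary illustrative value and would need tuning
    against real user input.
    """
    return any(similarity(e, candidate) >= threshold for e in expected_answers)


if __name__ == "__main__":
    for text in ["I want the red book", "the book which I want is red", "xyz"]:
        print(repr(text), is_close_enough(text))
```

Unlike a regular expression, this tolerates small typos, but it also means a near miss can be accepted, so the threshold decides how strict the check is.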
However, a regular expression that works for the specific example in the original question is as follows:
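A minimal sketch of such a pattern in Python, assuming the optional prefix is capture group 1 and the trailing “ book” becomes mandatory only when that group did not match. The names PATTERN and accepts are illustrative and not part of the original post:

```python
import re

# Group 1 is the optional prefix "the book which ".
# (?(1)( book)?| book) is a conditional:
#   - if group 1 matched, the trailing " book" is optional;
#   - otherwise the trailing " book" is required.
PATTERN = re.compile(
    r"^(the book which )?I want (is |the )red(?(1)( book)?| book)$"
)


def accepts(text: str) -> bool:
    """Return True if text contains at least one of the two optional substrings."""
    return PATTERN.match(text) is not None


if __name__ == "__main__":
    for text in [
        "I want the red book",
        "the book which I want is red",
        "I want the red",
    ]:
        print(repr(text), accepts(text))
```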
This will fail for the string “I want the red” (which lacks both of the required substrings “the book which ” and “ book”). This uses the (?(id/name)yes-pattern|no-pattern) syntax, which chooses between two alternatives depending on whether a previously defined group has matched.