For testing purposes on a project I’m working on, I need to take a given regular expression and randomly generate a string that FAILS to match it. For instance, if I’m given this regex:
^[abcd]d+
Then I should be able to generate strings such as:
hnbbad
uduebbaef
9f8;djfew
skjcc98332f
…each of which does NOT match the regex, but NOT generate:
addr32
bdfd09usdj
cdddddd-9fdssee
…each of which DOES. In other words, I want something like an anti-Xeger.
Does such a library exist, preferably in Python (if I can understand the theory, I can most likely convert it to Python if need be)? I gave some thought to how I could write this, but given the scope of regular expressions, it seemed that might be a much harder problem than what things like Xeger can tackle. I also looked around for a pre-made library to do this, but either I’m not using the right keywords to search or nobody’s had this problem before.
My initial instinct is, no, such a library does not exist because it’s not possible. You can’t be sure that you can find a valid input for any arbitrary regular expression in a reasonable amount of time.
For example, primality is a non-trivial mathematical property, and a regular expression that uses backreferences can still encode it. The following regular expression matches any string which is at least 10000 characters long and whose total length is a prime number:
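The pattern itself is not shown here, so the following is only a sketch of how such a regex could be written in Python, not necessarily the one originally intended. It assumes the input consists solely of the character x (the backreference trick works on unary strings) and uses a hypothetical helper, prime_length_pattern, with a configurable minimum length (assumed to be at least 2) so it can be tried on short inputs.

```python
import re

def prime_length_pattern(min_len: int) -> re.Pattern:
    """Hypothetical helper: matches strings of 'x' whose length is prime
    and at least min_len (assumed to be >= 2)."""
    # (xx+)\1+$ matches only when the length is composite: a block of two
    # or more x's repeated at least twice. The negative lookahead rejects
    # those lengths, and x{min_len,}$ enforces the minimum length.
    return re.compile(rf"^(?!(xx+)\1+$)x{{{min_len},}}$")

pattern = prime_length_pattern(10)
for n in (7, 11, 12, 13):
    print(n, bool(pattern.match("x" * n)))
```

With min_len set to 10000 the same pattern applies, but the backtracking it triggers makes each match attempt noticeably slower.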
I doubt that any library exists that can find a valid input to this regular expression in reasonable time. And this is a very easy example with a simple solution, e.g. 'x' * 10007 will work. It would be possible to come up with other regular expressions that are much harder to find valid inputs for.
I think the only way you are going to solve this is if you limit yourself to some subset of all possible regular expressions.
Having said that, if you have a magical library that generates text matching any arbitrary regular expression, then all you need to do is write a regular expression that matches all the strings that don’t match your original expression.
Luckily this is possible using a negative lookahead:
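The lookahead expression is not shown here, so here is one possible way to build it in Python, offered as a sketch under a few assumptions: the original pattern is used with search semantics, it contains no numbered backreferences or global inline flags (wrapping it would break those), and the helper name negate is made up for illustration.

```python
import re

def negate(pattern: str) -> re.Pattern:
    """Hypothetical helper: build a regex that matches exactly the strings
    in which `pattern` cannot be found anywhere."""
    # (?s:.*) lets the scan cross newlines without changing how "." behaves
    # inside the wrapped pattern. The negative lookahead fails as soon as
    # the original pattern can be found, so only non-matching strings pass.
    return re.compile(rf"\A(?!(?s:.*)(?:{pattern}))(?s:.*)\Z")

original = r"^[abcd]d+"
negated = negate(original)

for s in ["hnbbad", "uduebbaef", "addr32", "bdfd09usdj"]:
    print(s, bool(negated.match(s)), bool(re.search(original, s)))
```

For every string, the two printed booleans should be opposites: the negated regex accepts precisely what the original rejects.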
If you are willing to change the requirements to only allow a limited subset of regular expressions, then you can negate the regular expression by using boolean logic. For example,
^[abcd]d+
becomes
^[^abcd]|^[abcd][^d]
It is then possible to find a valid input for this regular expression in reasonable time.
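As a rough illustration of how the negated form can drive generation, the sketch below hand-codes the two alternatives of ^[^abcd]|^[abcd][^d] for this one example. The alphabet, the tail length and the function name random_non_match are assumptions made for the demo, and this is not a general regex negator. Note that the negated form does not cover every non-matching string (the empty string and a lone "a" are missed), which is fine when you only need some failing inputs.

```python
import random
import re
import string

ORIGINAL = re.compile(r"^[abcd]d+")
ALPHABET = string.ascii_lowercase + string.digits  # assumed alphabet

def random_non_match(rng: random.Random, max_tail: int = 8) -> str:
    """Generate a string from one branch of ^[^abcd]|^[abcd][^d]."""
    tail = "".join(rng.choice(ALPHABET) for _ in range(rng.randint(0, max_tail)))
    if rng.random() < 0.5:
        # Branch 1: the first character is anything except a, b, c or d.
        head = rng.choice([c for c in ALPHABET if c not in "abcd"])
    else:
        # Branch 2: starts with a, b, c or d, but the next character is not d.
        head = rng.choice("abcd") + rng.choice([c for c in ALPHABET if c != "d"])
    return head + tail

rng = random.Random()
for _ in range(5):
    s = random_non_match(rng)
    print(s, ORIGINAL.match(s) is None)
```

Checking each sample against the original regex, as the loop does, is a cheap safeguard if the negation was derived by hand.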