So i have a regex pattern, and I want to generate all the text permutations that would be allowed from that pattern.
Example:
var pattern = "^My (?:biological|real)? Name is Steve$";
var permutations = getStringPermutations(pattern);
This would return the list of strings below:
My Name is Steve
My real Name is Steve
My biological Name is Steve
Update:
Obviously a regex has an infinate number of matches, so i only want to generate off of optional string literals as in the (?:biological|real)? from my example above. Something like (.)* has too many matches, so I will not be generating them off of that.
If you restrict yourself to the subset of regular expressions that are anchored at both ends, and involve only literal text, single-character wildcards, and alternation, the matching
strings should be pretty easy to enumerate. I’d probably rewrite the regex as a BNF grammar
and use that to generate an exhaustive list of matching strings. For your example:
Start with the productions that have only terminal symbols on the RHS, and enumerate
all the possible values that the nonterminal on the LHS could take. Then work your
way up to the productions with nonterminals on the RHS. For concatenation of nonterminal symbols, form the Cartesian product of the sets represented by each RHS nonterminal.
For alternation, take the union of the sets represented by each option. Continue
until you’ve worked your way up to
<lang>, then you’re done.However, once you include the ‘*’ or ‘+’ operators, you have to contend with infinite
numbers of matching strings. And if you also want to handle advanced features like backreferences…you’re probably well on your way to something that’s isomorphic
to the Halting Problem!