I have a set of strings. I would like to extract a regular expression that matches all of them. Ideally, it should match only these strings and as few others as possible.
Is there an existing Python module that does this? For example, given:
www.google.com
www.googlemail.com/hello/hey
www.google.com/hello/hey
Then, the extracted regex could be www\.google(mail)?\.com(/hello/hey)?
(This also matches www.googlemail.com, which is not in the set, but I guess I need to live with that.)
My motivation for this is in a machine learning setting. I would like to extract a regular expression that “best” represents all these strings.
I understand that regexes like
(www\.google\.com)|(www\.googlemail\.com/hello/hey)|(www\.google\.com/hello/hey) or
www\.google(mail\.com/hello/hey|\.com(/hello/hey)?) would be right given my specification, because they match no URLs other than the given ones. But such a regex will become very large if there is a large number of strings in the set.
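For reference, the exhaustive alternation described above can be built mechanically. This is only a minimal sketch: `exact_alternation` is a hypothetical helper name, and the pattern it returns grows linearly with the input set, which is exactly the size problem mentioned above.

```python
import re

def exact_alternation(strings):
    """Return a pattern that matches exactly the given strings (hypothetical helper)."""
    # re.escape makes '.' and other metacharacters literal.
    escaped = sorted({re.escape(s) for s in strings})
    return "(?:" + "|".join(escaped) + ")"

urls = [
    "www.google.com",
    "www.googlemail.com/hello/hey",
    "www.google.com/hello/hey",
]
pattern = re.compile(exact_alternation(urls))
```

Used with `pattern.fullmatch(...)`, this accepts each of the three URLs and rejects www.googlemail.com, at the cost of listing every string in full.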
There’s a little Perl library that was designed to do this. I know you’re using Python, but if it’s a very large list of strings, you can fork off a Perl subprocess now and then (or copy the algorithm if you’re sufficiently motivated).
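If you would rather stay in Python, one common approach is to put the strings in a trie and emit a regex that shares common prefixes. The sketch below is an assumption about how this could be done, not the Perl library's actual algorithm, and the names `build_trie`, `trie_to_regex` and `strings_to_regex` are made up for illustration. It produces a pattern that matches exactly the input set, so it does not generalize the way the `(mail)?` example in the question does.

```python
import re

END = ""  # marker key: a string in the set ends at this node

def build_trie(strings):
    """Insert every string into a nested-dict trie, one character per level."""
    trie = {}
    for s in strings:
        node = trie
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = {}
    return trie

def trie_to_regex(node):
    """Recursively turn a trie node into a regex for all suffixes below it."""
    ends_here = END in node
    alts = [re.escape(ch) + trie_to_regex(child)
            for ch, child in sorted(node.items()) if ch != END]
    if not alts:
        return ""  # leaf: nothing more to match
    if len(alts) == 1 and not ends_here:
        return alts[0]  # single path, no grouping needed
    body = "(?:" + "|".join(alts) + ")"
    # If a string also ends at this node, the remainder is optional.
    return body + "?" if ends_here else body

def strings_to_regex(strings):
    if not strings:
        raise ValueError("need at least one string")
    return trie_to_regex(build_trie(strings))

urls = [
    "www.google.com",
    "www.googlemail.com/hello/hey",
    "www.google.com/hello/hey",
]
compiled = re.compile(strings_to_regex(urls))
```

For the three URLs in the question, the shared prefix `www.google` is emitted once and only the differing tails are alternated. The recursion depth equals the length of the longest string, so very long inputs would need an iterative version.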