I have a table of strings (about 100,000 rows) in the following format:
pattern , string
e.g. –
*l*ph*nt , elephant
c*mp*t*r , computer
s*v* , save
s*nn] , sunny
]*rr] , worry
To simplify, assume that * denotes a vowel, a consonant stands for itself, and ] denotes either ‘y’ or ‘w’ (for instance, semi-vowels/round vowels in phonology).
Given a pattern, what is the best way to generate the possible sensible strings? A sensible string is one in which every pair of consecutive letters that was not fully specified in the pattern occurs somewhere in the data set.
e.g. –
h*ll* –> hallo, hello, holla …
‘hallo’ is sensible because ‘ha’, ‘al’ and ‘lo’ all occur in the data set, in the words ‘have’, ‘also’ and ‘low’. The pair ‘ll’ is not checked because it was specified in the pattern.
What are some simple and efficient ways to do this?
Are there any libraries/frameworks for achieving this?
I’ve no specific language in mind, but I would prefer to use Java for this program.
This is particularly well suited to Python itertools, set and re operations:
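A minimal sketch of that idea, under some assumptions: the vowels are taken to be a, e, i, o, u, and the data set is passed in as a plain list of words (the string column of the table). The function names (`known_bigrams`, `sensible_strings`, `pattern_to_regex`) are made up for illustration. Every two-letter substring of the data set goes into a set, `itertools.product` enumerates each way of filling the wildcards, and a candidate is kept only if every pair that touches a wildcard is in that set. The regex helper is optional and only shows how a pattern could be matched against existing words.

```python
import itertools
import re

VOWELS = "aeiou"      # assumption: '*' stands for one of these
SEMIVOWELS = "yw"     # ']' stands for 'y' or 'w', as in the question
WILDCARDS = "*]"


def known_bigrams(words):
    """Collect every consecutive two-letter substring seen in the words."""
    return {w[i:i + 2] for w in words for i in range(len(w) - 1)}


def choices_per_position(pattern):
    """For each pattern symbol, return the letters it may stand for."""
    table = {"*": VOWELS, "]": SEMIVOWELS}
    return [table.get(ch, ch) for ch in pattern]


def sensible_strings(pattern, bigrams):
    """Expand the pattern and keep only the 'sensible' expansions."""
    # Only pairs involving at least one wildcard need checking; pairs made
    # of two literal letters (like 'll' in 'h*ll*') were given in the pattern.
    to_check = [i for i in range(len(pattern) - 1)
                if pattern[i] in WILDCARDS or pattern[i + 1] in WILDCARDS]
    results = []
    for letters in itertools.product(*choices_per_position(pattern)):
        word = "".join(letters)
        if all(word[i:i + 2] in bigrams for i in to_check):
            results.append(word)
    return results


def pattern_to_regex(pattern):
    """Optional: compile the pattern to a regex for matching known words."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[" + VOWELS + "]")
        elif ch == "]":
            parts.append("[" + SEMIVOWELS + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


if __name__ == "__main__":
    bigrams = known_bigrams(["have", "also", "low"])
    print(sensible_strings("h*ll*", bigrams))   # ['hallo']
```

The number of candidates grows as 5 to the power of the number of `*` symbols (times 2 per `]`), so for long patterns it may be worth pruning position by position instead of generating every product first. The same structure (a `HashSet` of bigrams plus nested expansion) should carry over to Java.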