I have a table of strings (about 100,000) in the following format: pattern , string

Question


Asked: May 24, 2026


I have a table of strings (about 100,000) in the following format:

pattern , string

e.g. –

*l*ph*nt , elephant
c*mp*t*r , computer
s*v* , save
s*nn] , sunny
]*rr] , worry

To simplify, assume that * denotes a vowel, a consonant stands for itself, and ] denotes either ‘y’ or ‘w’ (for instance, semi-vowels or round vowels in phonology).
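As a minimal sketch of this notation, each pattern can be turned into a regular expression and checked against its string, which is one way to validate the rows of the table. The names CLASSES, pattern_to_regex and matches are hypothetical, and the vowel set is assumed to be just a, e, i, o, u.

```python
import re

# Assumed symbol classes, taken from the simplification stated above.
CLASSES = {'*': '[aeiou]', ']': '[wy]'}

def pattern_to_regex(pattern):
    """Compile a pattern such as 's*nn]' into a regular expression."""
    parts = [CLASSES.get(ch, re.escape(ch)) for ch in pattern]
    return re.compile(''.join(parts))

def matches(pattern, word):
    """Return True if word is one possible expansion of pattern."""
    return pattern_to_regex(pattern).fullmatch(word) is not None

for pattern, word in [('*l*ph*nt', 'elephant'), ('s*nn]', 'sunny'), (']*rr]', 'worry')]:
    print(pattern, word, matches(pattern, word))
```

Using fullmatch rather than search means the whole word has to fit the pattern, so ‘sunny’ does not match ‘s*v*’.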

Given a pattern, what is the best way to generate the possible sensible strings? A string is sensible if every pair of consecutive letters in it that was not fully specified in the pattern also occurs somewhere in the data-set.

e.g. –

h*ll* –> hallo, hello, holla …

‘hallo’ is sensible because ‘ha’, ‘al’ and ‘lo’ all occur in the data-set, in the words ‘have’, ‘also’ and ‘low’. The pair ‘ll’ is not considered because it was specified in the pattern.
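The definition can be sketched as a position-by-position check. This is only an illustration under the assumptions stated in the question: the helper names known_digraphs and is_sensible are hypothetical, and the three-word data-set is a toy stand-in for the real table.

```python
WILDCARDS = '*]'  # symbols from the question: '*' = vowel, ']' = 'y' or 'w'

def known_digraphs(words):
    """Collect every consecutive two-letter substring of the given words."""
    return {w[i:i + 2] for w in words for i in range(len(w) - 1)}

def is_sensible(pattern, candidate, known):
    """Check each adjacent pair unless both of its letters were fixed by the pattern."""
    if len(pattern) != len(candidate):
        return False
    for i in range(len(candidate) - 1):
        fixed = pattern[i] not in WILDCARDS and pattern[i + 1] not in WILDCARDS
        if not fixed and candidate[i:i + 2] not in known:
            return False
    return True

known = known_digraphs(['have', 'also', 'low'])
print(is_sensible('h*ll*', 'hallo', known))  # True
print(is_sensible('h*ll*', 'hello', known))  # False: 'he' is not in the toy data-set
```

Checking by position means a pair is skipped only where the pattern itself fixed both letters, which is slightly stricter than removing that pair from the whole set of digraphs.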

What are the simple and efficient ways to do this?
Are there any libraries/frameworks for achieving this?

I have no specific language in mind, but I would prefer to use Java for this program.


1 Answer

Editorial Team · Answer 1 · 2026-05-24T05:43:11+00:00

This problem is particularly well suited to Python’s itertools and re modules and its built-in set type:

import re
import itertools

VOWELS      = 'aeiou'
SEMI_VOWELS = 'wy'
DATASET     = '/usr/share/dict/words'
SENSIBLES   = set()

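def digraphs(word, digraph=r'..'):
    '''
    Return the set of two-character substrings of word matching digraph.

    >>> sorted(digraphs('bar'))
    ['ar', 'ba']
    '''
    # The lookahead lets matches overlap, so every adjacent pair is seen.
    return set(re.findall('(?=(%s))' % digraph, word))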

def expand(pattern, wildcard, elements):
    '''
    Replace each wildcard in pattern with every choice from elements.

    >>> expand('h?', '?', 'aeiou')
    ['ha', 'he', 'hi', 'ho', 'hu']
    '''
    tokens = pattern.split(wildcard)
    results = set()
    # product (not permutations) so a letter may repeat, as the 'w'/'y' in ']*rr]' require.
    for fill in itertools.product(elements, repeat=len(tokens) - 1):
        results.add(''.join(t + f for t, f in zip(tokens, fill + ('',))))
    return sorted(results)

def enum(pattern):
    # Digraphs made only of fixed letters are given by the pattern, so skip them.
    not_sensible = digraphs(pattern, r'[^*\]]{2}')
    for p in expand(pattern, '*', VOWELS):
        for q in expand(p, ']', SEMI_VOWELS):
            if (digraphs(q) - not_sensible).issubset(SENSIBLES):
                print(q)

## Init the data-set (may be long...)
## you may want to pre-compute this
## and adapt it to your data-set.
with open(DATASET, 'r') as f:
    for word in f:
        SENSIBLES.update(digraphs(word.rstrip()))

enum('*l*ph*nt')
enum('s*nn]')
enum('h*ll*')
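
Since the question also asks about efficiency, one possible variation (not part of the answer above) is to build candidates letter by letter and drop a prefix as soon as it contains an unknown pair, instead of expanding everything and filtering afterwards. The sketch below is self-contained; choices and generate are hypothetical names, and the four-word list is a toy data-set for illustration only.

```python
VOWELS = 'aeiou'
SEMI_VOWELS = 'wy'

def choices(symbol):
    """Letters that a single pattern symbol may stand for."""
    if symbol == '*':
        return VOWELS
    if symbol == ']':
        return SEMI_VOWELS
    return symbol  # a consonant stands for itself

def generate(pattern, known, prefix=''):
    """Yield sensible expansions, abandoning a branch at the first unknown pair."""
    i = len(prefix)
    if i == len(pattern):
        yield prefix
        return
    for letter in choices(pattern[i]):
        if i > 0:
            fixed = pattern[i - 1] not in '*]' and pattern[i] not in '*]'
            if not fixed and prefix[-1] + letter not in known:
                continue  # prune: no completion of this prefix can be sensible
        yield from generate(pattern, known, prefix + letter)

# Toy data-set for illustration only; a real run would load the full word list.
words = ['have', 'also', 'low', 'hello']
known = {w[i:i + 2] for w in words for i in range(len(w) - 1)}
print(sorted(generate('h*ll*', known)))  # ['hallo', 'hello']
```

With many wildcards the number of full expansions grows exponentially, so cutting a branch at the first bad pair can save a lot of work; how much depends on the data-set.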
