I have two sets of strings ( A and B ), and I want

Question

0

Asked: May 27, 20262026-05-27T04:03:29+00:00 2026-05-27T04:03:29+00:00

I have two sets of strings ( A and B ), and I want

0

I have two sets of strings (A and B), and I want to know all pairs of strings a in A and b in B where a is a substring of b.

The first step of coding this was the following:

for a in A:
    for b in B:
        if a in b:
            print (a,b)

However, I wanted to know– is there a more efficient way to do this with regular expressions (e.g. instead of checking if a in b:, check if the regexp '.*' + a + '.*': matches ‘b’. I thought that maybe using something like this would let me cache the Knuth-Morris-Pratt failure function for all a. Also, using a list comprehension for the inner for b in B: loop will likely give a pretty big speedup (and a nested list comprehension may be even better).

I’m not very interested in making a giant leap in the asymptotic runtime of the algorithm (e.g. using a suffix tree or anything else complex and clever). I’m more concerned with the constant (I just need to do this for several pairs of A and B sets, and I don’t want it to run all week).

Do you know any tricks or have any generic advice to do this more quickly? Thanks a lot for any insight you can share!

Edit:

Using the advice of @ninjagecko and @Sven Marnach, I built a quick prefix table of 10-mers:

    import collections
    prefix_table = collections.defaultdict(set)
    for k, b in enumerate(B):
        for i in xrange(len(prot_seq)-10):
            j = i+10+1
            prefix_table[b[i:j]].add(k)

    for a in A:
        if len(a) >= 10:
            for k in prefix_table[a[:10]]:
                # check if a is in b
                # (missing_edges is necessary, but not sufficient)
                if a in B[k]:
                    print (a,b)
        else:
            for k in xrange(len(prots_and_seqs)):
                # a is too small to use the table; check if
                # a is in any b
                if a in B[k]:
                    print (a, b)

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-27T04:03:30+00:00

Of course you can easily write this as a list comprehension:

[(a, b) for a in A for b in B if a in b]

This might slightly speed up the loop, but don’t expect too much. I doubt using regular expressions will help in any way with this one.

Edit: Here are some timings:

import itertools
import timeit
import re
import collections

with open("/usr/share/dict/british-english") as f:
    A = [s.strip() for s in itertools.islice(f, 28000, 30000)]
    B = [s.strip() for s in itertools.islice(f, 23000, 25000)]

def f():
    result = []
    for a in A:
        for b in B:
            if a in b:
                result.append((a, b))
    return result

def g():
    return [(a, b) for a in A for b in B if a in b]

def h():
    res = [re.compile(re.escape(a)) for a in A]
    return [(a, b) for a in res for b in B if a.search(b)]

def ninjagecko():
    d = collections.defaultdict(set)
    for k, b in enumerate(B):
        for i, j in itertools.combinations(range(len(b) + 1), 2):
            d[b[i:j]].add(k)
    return [(a, B[k]) for a in A for k in d[a]]

print "Nested loop", timeit.repeat(f, number=1)
print "List comprehension", timeit.repeat(g, number=1)
print "Regular expressions", timeit.repeat(h, number=1)
print "ninjagecko", timeit.repeat(ninjagecko, number=1)

Results:

Nested loop [0.3641810417175293, 0.36279606819152832, 0.36295199394226074]
List comprehension [0.362030029296875, 0.36148500442504883, 0.36158299446105957]
Regular expressions [1.6498990058898926, 1.6494300365447998, 1.6480278968811035]
ninjagecko [0.06402897834777832, 0.063711881637573242, 0.06389307975769043]

Edit 2: Added a variant of the alogrithm suggested by ninjagecko to the timings. You can see it is much better than all the brute force approaches.

Edit 3: Used sets instead of lists to eliminate the duplicates. (I did not update the timings — they remained essentially unchanged.)

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have two sets of strings ( A and B ), and I want

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply