The Archive Base


Editorial Team
Asked: May 27, 2026

I have two sets of strings ( A and B ), and I want


I have two sets of strings (A and B), and I want to know all pairs of strings a in A and b in B where a is a substring of b.

The first step of coding this was the following:

for a in A:
    for b in B:
        if a in b:
            print (a,b)

However, I wanted to know: is there a more efficient way to do this with regular expressions (e.g. instead of checking if a in b:, checking whether the regexp '.*' + a + '.*' matches b)? I thought that using something like this might let me cache the Knuth-Morris-Pratt failure function for each a. Also, using a list comprehension for the inner for b in B: loop will likely give a pretty big speedup (and a nested list comprehension may be even better).

I’m not very interested in making a giant leap in the asymptotic runtime of the algorithm (e.g. using a suffix tree or anything else complex and clever). I’m more concerned with the constant (I just need to do this for several pairs of A and B sets, and I don’t want it to run all week).

Do you know any tricks or have any generic advice to do this more quickly? Thanks a lot for any insight you can share!


Edit:

Using the advice of @ninjagecko and @Sven Marnach, I built a quick prefix table of 10-mers:

    import collections

    # Map every 10-mer occurring in some b to the indices of the strings containing it.
    prefix_table = collections.defaultdict(set)
    for k, b in enumerate(B):
        for i in range(len(b) - 10 + 1):
            prefix_table[b[i:i + 10]].add(k)

    for a in A:
        if len(a) >= 10:
            # Only strings containing a's first 10 characters can contain a.
            for k in prefix_table.get(a[:10], ()):
                # A table hit is necessary, but not sufficient, so confirm it.
                if a in B[k]:
                    print(a, B[k])
        else:
            # a is too short to use the table; check it against every b.
            for b in B:
                if a in b:
                    print(a, b)
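
The 10-mer table idea can be wrapped in a function and checked against the brute-force loop. This is only a minimal sketch: the function names `build_kmer_table` and `substring_pairs` and the parameter `k` are hypothetical (the code above hard-codes 10), and the result is returned as a set of pairs rather than printed.

```python
import collections


def build_kmer_table(B, k=10):
    """Map each length-k substring of every b in B to the indices of the b's containing it."""
    table = collections.defaultdict(set)
    for idx, b in enumerate(B):
        for i in range(len(b) - k + 1):
            table[b[i:i + k]].add(idx)
    return table


def substring_pairs(A, B, k=10):
    """Return the set of (a, b) pairs with a in A, b in B and a a substring of b."""
    table = build_kmer_table(B, k)
    pairs = set()
    for a in A:
        if len(a) >= k:
            # Only strings containing a's first k characters can contain a.
            candidates = table.get(a[:k], ())
        else:
            # a is shorter than k, so the table cannot help; scan all of B.
            candidates = range(len(B))
        for idx in candidates:
            # The table lookup is only a filter; confirm with a real substring test.
            if a in B[idx]:
                pairs.add((a, B[idx]))
    return pairs
```

Strings shorter than `k` still fall back to the full scan, so the table only pays off when most strings in A are at least `k` characters long.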

1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 4:03 am

    Of course you can easily write this as a list comprehension:

    [(a, b) for a in A for b in B if a in b]
    

    This might slightly speed up the loop, but don’t expect too much. I doubt using regular expressions will help in any way with this one.

    Edit: Here are some timings:

    import itertools
    import timeit
    import re
    import collections
    
    with open("/usr/share/dict/british-english") as f:
        A = [s.strip() for s in itertools.islice(f, 28000, 30000)]
        B = [s.strip() for s in itertools.islice(f, 23000, 25000)]
    
    def f():
        result = []
        for a in A:
            for b in B:
                if a in b:
                    result.append((a, b))
        return result
    
    def g():
        return [(a, b) for a in A for b in B if a in b]
    
    def h():
        res = [re.compile(re.escape(a)) for a in A]
        return [(a, b) for a in res for b in B if a.search(b)]
    
    def ninjagecko():
        d = collections.defaultdict(set)
        for k, b in enumerate(B):
            for i, j in itertools.combinations(range(len(b) + 1), 2):
                d[b[i:j]].add(k)
        return [(a, B[k]) for a in A for k in d[a]]
    
    print("Nested loop", timeit.repeat(f, number=1))
    print("List comprehension", timeit.repeat(g, number=1))
    print("Regular expressions", timeit.repeat(h, number=1))
    print("ninjagecko", timeit.repeat(ninjagecko, number=1))
    

    Results:

    Nested loop [0.3641810417175293, 0.36279606819152832, 0.36295199394226074]
    List comprehension [0.362030029296875, 0.36148500442504883, 0.36158299446105957]
    Regular expressions [1.6498990058898926, 1.6494300365447998, 1.6480278968811035]
    ninjagecko [0.06402897834777832, 0.063711881637573242, 0.06389307975769043]
    

    Edit 2: Added a variant of the algorithm suggested by ninjagecko to the timings. You can see it is much better than all the brute force approaches.
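
To see why the index variant can replace the nested loop, here is a small sketch that builds the all-substrings index on toy data and compares it with brute force. The function names `brute_force` and `all_substrings_index` and the sample strings are made up for illustration; this is not a benchmark.

```python
import collections
import itertools


def brute_force(A, B):
    """Test every pair directly."""
    return {(a, b) for a in A for b in B if a in b}


def all_substrings_index(A, B):
    """Index every non-empty substring of every b, then look each a up once."""
    index = collections.defaultdict(set)
    for k, b in enumerate(B):
        # Each pair of cut points i < j yields the non-empty substring b[i:j].
        for i, j in itertools.combinations(range(len(b) + 1), 2):
            index[b[i:j]].add(k)
    # Note: an empty string in A would match nothing here, unlike `"" in b`.
    return {(a, B[k]) for a in A for k in index.get(a, ())}


A = ["an", "ban", "xyz"]
B = ["banana", "bandana"]
print(all_substrings_index(A, B) == brute_force(A, B))
```

A string of length n has n(n+1)/2 non-empty substrings, so the index grows quadratically with string length. That is cheap for dictionary words but may become too large for long strings, in which case indexing only fixed-length substrings is a likely compromise.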

    Edit 3: Used sets instead of lists to eliminate the duplicates. (I did not update the timings — they remained essentially unchanged.)
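
The duplicates arise whenever a substring occurs more than once in the same string. A minimal sketch with one made-up word, comparing a list-valued index with a set-valued one (the names `as_list` and `as_set` are illustrative only):

```python
import collections
import itertools

B = ["banana"]

as_list = collections.defaultdict(list)
as_set = collections.defaultdict(set)
for k, b in enumerate(B):
    for i, j in itertools.combinations(range(len(b) + 1), 2):
        as_list[b[i:j]].append(k)  # records one entry per occurrence
        as_set[b[i:j]].add(k)      # records each string index at most once

# "an" occurs at two positions in "banana", so the list holds index 0 twice.
print(as_list["an"])  # [0, 0]
print(as_set["an"])   # {0}
```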
