The problem: I have a large static list of strings, A, and one long string, B. The strings in A are all very short (a keyword list). I want to find every string in A that is a substring of B and collect them.
Now I use a simple loop like:
result = []
for word in A:
    if word in B:
        result.append(word)
But it’s crazy slow when A contains ~500,000 or more items.
Is there any library or algorithm that fits this problem? I’ve tried my best to search but no luck.
Thank you!
Your problem is large enough that you probably need to hit it with the algorithm bat.
Take a look into the Aho-Corasick Algorithm. Your problem statement is a paraphrase of the problem that this algorithm tackles.
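To make the idea concrete, here is a minimal pure-Python sketch of an Aho-Corasick automaton. It is only an illustration, not a tuned implementation: the function names `build_automaton` and `find_keywords` are made up for this example, and it assumes the keywords are short enough that copying output lists along failure links is cheap. The point is that B is scanned once, regardless of how many words A contains.

```python
from collections import deque

def build_automaton(words):
    """Build a trie with failure links (hypothetical helper for this sketch)."""
    goto = [{}]   # goto[s]: char -> next state
    fail = [0]    # fail[s]: state for the longest proper suffix that is in the trie
    out = [[]]    # out[s]: keywords that end at state s
    for word in words:
        if not word:
            continue  # assumption: empty keywords are ignored
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(word)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper ones are derived.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the suffix state.
            out[nxt].extend(out[fail[nxt]])
            queue.append(nxt)
    return goto, fail, out

def find_keywords(words, text):
    """Return the words (in their original order) that occur in text."""
    goto, fail, out = build_automaton(words)
    found = set()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found.update(out[state])
    return [w for w in words if w in found]

if __name__ == "__main__":
    A = ["he", "she", "his", "hers"]
    B = "ushers"
    print(find_keywords(A, B))  # ['he', 'she', 'hers']
```

Since A is static, the automaton could be built once and reused for every B; the sketch rebuilds it on each call only to keep the example short.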
Also, look into the work by Nicholas Lehuen with his PyTST package.
There are also references in a related Stack Overflow question, "Algorithm for linear pattern matching?", that mention other algorithms such as Rabin-Karp.
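For comparison, a Rabin-Karp style search for many keywords might look like the sketch below. This is an assumption-laden illustration rather than code from the linked question: `poly_hash` and `rabin_karp_find` are hypothetical names, and the `base` and `mod` values are arbitrary choices. Keywords are grouped by length, and for each length a rolling hash slides over B, so the cost grows with the number of distinct keyword lengths rather than the number of keywords.

```python
def poly_hash(s, base, mod):
    """Polynomial hash of s (hypothetical helper for this sketch)."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h

def rabin_karp_find(words, text, base=257, mod=(1 << 61) - 1):
    """Return the words (in their original order) that occur in text."""
    # Group keywords by length, then by hash value.
    by_length = {}
    for w in words:
        if w:  # assumption: empty keywords are ignored
            table = by_length.setdefault(len(w), {})
            table.setdefault(poly_hash(w, base, mod), set()).add(w)

    found = set()
    for m, table in by_length.items():
        if m > len(text):
            continue
        high = pow(base, m - 1, mod)  # weight of the window's first character
        h = poly_hash(text[:m], base, mod)
        for i in range(len(text) - m + 1):
            if i:
                # Roll the window: drop text[i-1], append text[i+m-1].
                h = ((h - ord(text[i - 1]) * high) * base
                     + ord(text[i + m - 1])) % mod
            candidates = table.get(h)
            if candidates:
                window = text[i:i + m]
                # Compare the actual strings to rule out hash collisions.
                if window in candidates:
                    found.add(window)
    return [w for w in words if w in found]

if __name__ == "__main__":
    print(rabin_karp_find(["he", "she", "his", "hers"], "ushers"))
    # ['he', 'she', 'hers']
```

Because the keywords here are all very short, there are only a few distinct lengths, so this makes only a few passes over B.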