Given a string S and a list L of patterns [L1, …, Ln], how would you find the list of all tokens in S that match a pattern in L, chosen so that the total number of matched letters in S is maximized?
A toy example: S = “thenuke”, L = {“the”, “then”, “nuke”}. We would like to retrieve [“the”, “nuke”], which covers all 7 letters. If we start by greedily matching “then”, only 4 letters are covered, so we do not get the solution that maximizes the total number of matched letters in S.
I have been looking at other SO questions and at string matching algorithms, but found nothing that efficiently solves the maximization part of the problem.
This must have been studied, e.g. in bioinformatics, but I’m not in the field, so any help (including links to academic papers) is deeply appreciated!
This can be solved in O(|S| + |L| + k) time, where |L| is taken to mean the total length of all patterns in L and k is the total number of matches of all strings from L in S. There are two major steps:
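1. Run Aho-Corasick. This gives you every match of any string from L in S, and it runs within the time bound stated above.
2. Initialize an integer array A of length |S| + 1 to all zeros, where A[i] is the maximum number of matched letters within the first i letters of S. Walk through the array from left to right: at each position i ≥ 1, set A[i] to A[i-1] if that is larger; then, for every match M from L that starts at position i in S, set A[i+|M|] to the maximum of A[i+|M|] and A[i] + |M|. When the walk ends, A[|S|] is the maximum total number of matched letters. To recover the tokens themselves, also record for each position which match (if any) produced its value, then trace back from |S|.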
Here is some code, in Go, that does exactly this. It uses a package I wrote that has a convenient wrapper for calling Aho-Corasick.
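Since that code and package are not reproduced here, the following is a minimal, self-contained sketch of the two steps, not the original implementation. It rests on a few assumptions: the names findMatches and maxCover are hypothetical, a naive prefix scan stands in for Aho-Corasick (so it does not reach the time bound above), lengths are counted in bytes, and back-pointers are added so the tokens can be recovered.

```go
package main

import "strings"

// findMatches returns, for each start offset i in s, the patterns occurring at i.
// Assumption: this naive scan is a stand-in for Aho-Corasick. It costs
// O(|S| * total pattern length) rather than O(|S| + |L| + k).
func findMatches(s string, patterns []string) [][]string {
	at := make([][]string, len(s))
	for i := range at {
		for _, p := range patterns {
			if p != "" && strings.HasPrefix(s[i:], p) {
				at[i] = append(at[i], p)
			}
		}
	}
	return at
}

// maxCover returns a list of non-overlapping tokens, in order of appearance,
// that maximizes the number of matched bytes of s, together with that number.
func maxCover(s string, patterns []string) ([]string, int) {
	n := len(s)
	at := findMatches(s, patterns)
	best := make([]int, n+1)   // best[i]: max matched bytes within s[:i]
	prev := make([]int, n+1)   // prev[i]: position the value of best[i] came from
	tok := make([]string, n+1) // tok[i]: token ending at i, or "" if none was used
	for i := 0; i <= n; i++ {
		// Carry the previous value forward if it is larger (s[i-1] stays unmatched).
		if i > 0 && best[i-1] > best[i] {
			best[i], prev[i], tok[i] = best[i-1], i-1, ""
		}
		if i == n {
			break
		}
		// Push every match starting at i forward to its end position.
		for _, p := range at[i] {
			j := i + len(p)
			if best[i]+len(p) > best[j] {
				best[j], prev[j], tok[j] = best[i]+len(p), i, p
			}
		}
	}
	// Trace back from the end of s, collecting the tokens that were used.
	var tokens []string
	for i := n; i > 0; i = prev[i] {
		if tok[i] != "" {
			tokens = append(tokens, tok[i])
		}
	}
	// The trace-back yields tokens right to left, so reverse them.
	for l, r := 0, len(tokens)-1; l < r; l, r = l+1, r-1 {
		tokens[l], tokens[r] = tokens[r], tokens[l]
	}
	return tokens, best[n]
}
```

For the example in the question, maxCover("thenuke", []string{"the", "then", "nuke"}) returns the tokens ["the", "nuke"] and a total of 7. Swapping findMatches for a real Aho-Corasick implementation leaves the array walk unchanged and restores the stated running time.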