In my program I need to search in a quite big string (~1 mb) for a relatively small substring (< 1 kb).
The problem is the string contains simple wildcards in the sense of “a?c” which means I want to search for strings like “abc” or also “apc”,… (I am only interested in the first occurence).
Until now I use the trivial approach (here in pseudocode)
algorithm "search", input: haystack(string), needle(string)
for(i = 0, i < length(haystack), ++i)
if(!CompareMemory(haystack+i,needle,length(needle))
return i;
return -1; (Not found)
Where “CompareMemory” returns 0 iff the first and second argument are identical (also concerning wildcards) only regarding the amount of bytes the third argument gives.
My question is now if there is a fast algorithm for this (you don’t have to give it, but if you do I would prefer c++, c or pseudocode). I started here
but I think most of the fast algorithms don’t allow wildcards (by the way they exploit the nature of strings).
I hope the format of the question is ok because I am new here, thank you in advance!
Among strings over an alphabet of c characters, let S have length s and let T_1 … T_k have average length b. S will be searched for each of the k target strings. (The problem statement doesn’t mention multiple searches of a given string; I mention it below because in that paradigm my program does well.)
The program uses O(s+c) time and space for setup, and (if S and the T_i are random strings) O(k*u*s/c) + O(k*b + k*b*s/c^u) total time for searching, with u=3 in program as shown. For longer targets, u should be increased, and rare, widely-separated key characters chosen.
In step 1, the program creates an array L of s+TsizMax integers (in program, TsizMax = allowed target length) and uses it for c lists of locations of next occurrences of characters, with list heads in H[] and tails in T[]. This is the O(s+c) time and space step.
In step 2, the program repeatedly reads and processes target strings. Step 2A chooses u = 3 different non-wild key characters (in current target). As shown, the program just uses the first three such characters; with a tiny bit more work, it could instead use the rarest characters in the target, to improve performance. Note, it doesn’t cope with targets with fewer than three such characters.
The line “L[T[r]] = L[g+i] = g+i;” within Step 2A sets up a guard cell in L with proper delta offset so that Step 2G will automatically execute at end of search, without needing any extra testing during the search. T[r] indexes the tail cell of the list for character r, so cell L[g+i] becomes a new, self-referencing, end-of-list for character r. (This technique allows the loops to run with a minimum of extraneous condition testing.)
Step 2B sets vars a,b,c to head-of-list locations, and sets deltas dab, dac, and dbc corresponding to distances between the chosen key characters in target.
Step 2C checks if key characters appear in S. This step is necessary because otherwise a while loop in Step 2E will hang. We don’t want more checks within those while loops because they are the inner loops of search.
Step 2D does steps 2E to 2i until var c points to after end of S, at which point it is impossible to make any more matches.
Step 2E consists of u = 3 while loops, that “enforce delta distances”, that is, crawl indexes a,b,c along over each other as long as they are not pattern-compatible. The while loops are fairly fast, each being in essence (with ++si instrumentation removed) “while (v+d < w) v = L[v]” for various v, d, w. Replicating the three while loops a few times may increase performance a little and will not change net results.
In Step 2G, we know that the u key characters match, so we do a complete compare of target to match point, with wild-character handling. Step 2H reports result of compare. Program as given also reports non-matches in this section; remove that in production.
Step 2I advances all the key-character indexes, because none of the currently-indexed characters can be the key part of another match.
You can run the program to see a few operation-count statistics. For example, the output
shows that Step 2G was entered 3 times (ie, the key characters matched 3 times); the full compare succeeded twice; step 2E while loops advanced indexes 6 times; step 2I advanced indexes 9 times; there were 15 advances in all, to search the 42-character string for the de?ga target.
Note, if you like this answer better than current preferred answer, unmark that one and mark this one. 🙂