I’m trying to find a substring of a string text that is an anagram of a string pattern.
My Question:
Could the Rabin-Karp algorithm be adjusted to this purpose? Or are there better algorithms?
I have tried out a brute-force algorithm, which did not work in my case because the text and the pattern can each be up to one million characters.
Update: I’ve heard there is a worst-case O(n^2) algorithm that uses O(1) space. Does anyone know what this algorithm is?
Update 2: For reference, here is pseudocode for the Rabin-Karp algorithm:
function RabinKarp(string s[1..n], string sub[1..m])
    hsub := hash(sub[1..m]); hs := hash(s[1..m])
    for i from 1 to n-m+1
        if hs = hsub
            if s[i..i+m-1] = sub
                return i
        hs := hash(s[i+1..i+m])
    return not found
This uses a rolling hash function so that each new hash can be calculated in O(1). The overall search is O(nm) in the worst case, but with a good hash function it is O(m + n) in the best case.
Is there a rolling hash function that would produce few collisions when searching for anagrams of the string?
Compute a hash of the pattern that doesn’t depend on the order of the letters in the pattern (for example, use the sum of the character codes for each letter). Then apply the same hash function in “rolling” fashion to the text, as in Rabin-Karp. If the hashes match, you still need to perform a full anagram test of the current window against the pattern (for example, by comparing character counts), because windows that are not anagrams can produce the same hash.
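A minimal Python sketch of the sum-based idea above. The function name `find_anagram_sum_hash` and the use of `collections.Counter` for the verification step are my own choices for illustration, not part of the original suggestion.

```python
from collections import Counter


def find_anagram_sum_hash(text, pattern):
    """Return the start index of the first window of `text` that is an
    anagram of `pattern`, or -1 if there is none."""
    n, m = len(text), len(pattern)
    if m > n:
        return -1

    # Order-independent hash: the sum of the character codes.
    target_hash = sum(ord(c) for c in pattern)
    target_counts = Counter(pattern)
    window_hash = sum(ord(c) for c in text[:m])

    for i in range(n - m + 1):
        if window_hash == target_hash:
            # Sums collide easily ("ad" and "bc" have the same sum),
            # so confirm the match by comparing character counts.
            if Counter(text[i:i + m]) == target_counts:
                return i
        if i + m < n:
            # Roll the window: add the entering symbol, drop the leaving one.
            window_hash += ord(text[i + m]) - ord(text[i])
    return -1
```

The hash update is O(1) per step, but because the sum collides often, the verification may run frequently on unlucky inputs.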
By associating each symbol in your alphabet with a prime number, then computing the product of those prime numbers as your hash code, you will have fewer collisions.
There is, however, a bit of mathematical trickery that will assist you if you want to compute a running product like this: each time you step the window, multiply the running hash-code by the multiplicative inverse of the code for the symbol that is leaving the window, then multiply by the code for the symbol that is entering the window.
As an example, suppose you are computing the hash of letters ‘a’–‘z’ as an unsigned, 64-bit value. Use a table like this:
The multiplicative inverse of n is the number that yields 1 when multiplied by n, modulo some number. The modulus here is 2^64, since you are using 64-bit numbers. So,
5 * 14757395258967641293 should be 1 modulo 2^64, for example. This works because you are just multiplying in the ring of integers modulo 2^64, where every odd number has an inverse. Computing a list of the first primes is easy, and your platform should have a library to efficiently compute the multiplicative inverse of these numbers.
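As a rough illustration, such a table can be generated in Python, where the built-in three-argument `pow` accepts an exponent of -1 to compute a modular inverse (Python 3.8 and later). The helper names `first_odd_primes` and `build_table`, and the assignment of consecutive odd primes to ‘a’, ‘b’, ‘c’, … are assumptions made for this sketch.

```python
MODULUS = 2 ** 64  # unsigned 64-bit arithmetic


def first_odd_primes(count):
    """Return the first `count` odd primes (3, 5, 7, ...) by trial division."""
    primes = []
    candidate = 3
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 2
    return primes


def build_table(alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Map each symbol to (prime code, inverse of that code modulo 2**64)."""
    primes = first_odd_primes(len(alphabet))
    # pow(p, -1, MODULUS) raises ValueError if p has no inverse,
    # which cannot happen here because every code is odd.
    return {ch: (p, pow(p, -1, MODULUS)) for ch, p in zip(alphabet, primes)}


table = build_table()
code, inverse = table["b"]            # 'b' gets the second odd prime, 5
print(code, inverse)                  # 5 14757395258967641293
print((code * inverse) % MODULUS)     # 1
```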
Start the codes at 3 rather than 2: the modulus is a power of 2 (matching the integer size on whatever processor you are working on), so 2 is not co-prime with it and cannot be inverted.
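Putting the pieces together, here is a hedged sketch of the rolling prime-product search in Python. The names `find_anagram_prime_hash` and `odd_primes`, the lowercase-only alphabet, and the `Counter`-based verification are assumptions for illustration. Symbols outside the alphabet would raise a `KeyError` in this sketch.

```python
from collections import Counter

MASK = (1 << 64) - 1  # "& MASK" reduces a product modulo 2**64


def odd_primes(count):
    """Return the first `count` odd primes by trial division."""
    primes, candidate = [], 3
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 2
    return primes


def find_anagram_prime_hash(text, pattern,
                            alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Return the start index of the first window of `text` that is an
    anagram of `pattern`, or -1 if there is none."""
    n, m = len(text), len(pattern)
    if m > n:
        return -1

    code = dict(zip(alphabet, odd_primes(len(alphabet))))
    inverse = {ch: pow(p, -1, 1 << 64) for ch, p in code.items()}

    target = 1
    for ch in pattern:
        target = (target * code[ch]) & MASK
    window = 1
    for ch in text[:m]:
        window = (window * code[ch]) & MASK

    target_counts = Counter(pattern)
    for i in range(n - m + 1):
        # Products wrap around modulo 2**64, so a matching hash is
        # still only a candidate and is verified by character counts.
        if window == target and Counter(text[i:i + m]) == target_counts:
            return i
        if i + m < n:
            # Divide out the leaving symbol, multiply in the entering one.
            window = (window * inverse[text[i]]) & MASK
            window = (window * code[text[i + m]]) & MASK
    return -1
```

For example, `find_anagram_prime_hash("xxbcayy", "abc")` returns 2, the start of the window "bca".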