I’m trying to use shingleprinting to measure document similarity. The process involves the following steps:
- Create a 5-shingling of the two documents D1, D2
- Hash each shingle with a 64-bit hash
- Pick a random permutation of the numbers from 0 to 2^64-1 and apply to shingle hashes
- For each document find the smallest of the resulting values
- If they match count it as a positive example, if not count it as a negative example
- Repeat 3. to 5. a few times
- Use
positive_examples / total examplesas the similarity measure
Step 3 involves generating a random permutation of a very long sequence. Using a Knuth-shuffle seems out of the question. Is there some shortcut for this? Note that in the end we need only a single element of the resulting permutation.
Warning: I’m not 100% positive about this, but I’ve read some of the papers and I believe this is how it works. For instance, in “A small approximately min-wise independent family of hash functions” by Piotr Indyk, he writes “In the implementation integrated with Altavista, the set H was chosen to be a pairwise independent family of hash functions.”
In step 3, you don’t actually need a random permutation on [n] (the integers from 1 to n). It turns out that a pairwise-independent hash function works in practice. So what you do is pick a pairwise-independent hash function h. And then apply h to each of the shingle hashes. You can take the min of those values in step 4.
A standard pairwise-independent hash function is h(x) = ax + b (mod p), where a and b are chosen randomly and p is a prime.
References: http://www.cs.princeton.edu/courses/archive/fall08/cos521/hash.pdf and http://people.csail.mit.edu/indyk/minwise99.ps