I’m trying to use shingleprinting to measure document similarity. The process involves the following

Editorial Team

Asked: May 21, 2026 (2026-05-21T22:25:19+00:00)

I’m trying to use shingleprinting to measure document similarity. The process involves the following steps:

1. Create a 5-shingling of the two documents D1, D2
2. Hash each shingle with a 64-bit hash
3. Pick a random permutation of the numbers from 0 to 2^64-1 and apply to shingle hashes
4. For each document find the smallest of the resulting values
5. If they match count it as a positive example, if not count it as a negative example
6. Repeat 3. to 5. a few times
7. Use positive_examples / total examples as the similarity measure
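
For concreteness, the first two steps could be sketched as follows. This is a minimal sketch under assumptions not stated in the question: shingles are taken over whitespace-separated words (character shingles would work the same way), and BLAKE2b truncated to 8 bytes stands in for "a 64-bit hash". The function names are hypothetical.

```python
import hashlib

def shingles(text, k=5):
    """Return the set of k-word shingles of text.

    Word-level shingling is an assumption; the question does not say
    whether shingles are built from words or characters.
    """
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def hash64(shingle):
    """Map a shingle to an integer in [0, 2**64).

    BLAKE2b with an 8-byte digest is an illustrative choice of 64-bit hash.
    """
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def shingle_hashes(text, k=5):
    """Steps 1 and 2 together: the set of 64-bit hashes of all k-shingles."""
    return {hash64(s) for s in shingles(text, k)}
```

A document with fewer than k words yields an empty set here, so callers would need to handle that case before taking a minimum.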

Step 3 involves generating a random permutation of a very long sequence. Using a Knuth-shuffle seems out of the question. Is there some shortcut for this? Note that in the end we need only a single element of the resulting permutation.

1 Answer

Editorial Team · Answer 1 · 2026-05-21T22:25:20+00:00

Warning: I’m not 100% positive about this, but I’ve read some of the papers and I believe this is how it works. For instance, in “A small approximately min-wise independent family of hash functions” by Piotr Indyk, he writes “In the implementation integrated with Altavista, the set H was chosen to be a pairwise independent family of hash functions.”

In step 3, you don’t actually need a random permutation on [n] (the integers from 1 to n). It turns out that a pairwise-independent hash function works in practice. So you pick a pairwise-independent hash function h, apply h to each of the shingle hashes, and take the min of those values in step 4.

A standard pairwise-independent hash function is h(x) = ax + b (mod p), where a and b are chosen uniformly at random from {0, ..., p-1} and p is a prime larger than any value being hashed.
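
Here is a rough sketch of how steps 3 to 7 might look with such a hash function in place of the permutation. The specifics are my assumptions, not something taken from the papers: p = 2^89 - 1 (a Mersenne prime, chosen only because it exceeds every 64-bit value), a restricted to be nonzero so that h is one-to-one on the inputs, and hypothetical function names. Since linear hashes are only approximately min-wise independent, the result should be read as an approximation of the similarity.

```python
import random

# Assumption: a Mersenne prime larger than any 64-bit shingle hash.
P = 2**89 - 1

def random_linear_hash(rng):
    """Draw h(x) = (a*x + b) mod P with random a and b.

    a is kept nonzero (an assumption) so that distinct inputs below P
    always map to distinct outputs.
    """
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: (a * x + b) % P

def estimate_similarity(hashes1, hashes2, trials=200, seed=None):
    """Fraction of trials in which both sets have the same minimum under h.

    hashes1 and hashes2 are non-empty sets of 64-bit shingle hashes.
    """
    if not hashes1 or not hashes2:
        raise ValueError("both sets of shingle hashes must be non-empty")
    rng = random.Random(seed)
    matches = 0
    for _ in range(trials):
        h = random_linear_hash(rng)  # step 3: replaces the permutation
        if min(map(h, hashes1)) == min(map(h, hashes2)):  # steps 4 and 5
            matches += 1
    return matches / trials  # step 7
```

Each trial stores only two numbers (a and b) rather than a permutation of 2^64 elements, which is what makes the repetition in step 6 cheap.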

References: http://www.cs.princeton.edu/courses/archive/fall08/cos521/hash.pdf and http://people.csail.mit.edu/indyk/minwise99.ps
