A project I’m working on requires detection of duplicate files. Under normal circumstances I would simply compare the file bytes in blocks or hash value of the entire file contents. However, the system does not have access to the entire file – only the first 50KB or so. It also knows the total file size of the original file.
I was thinking of implementing the following: each time a file is added, I would look for possible duplicates using both the total file size and a hash calculation of (file-size)+(first-20KB-of-file). The hash algorithm itself is not the issue at this stage, but will likely be MurmurHash2.
Another option is to also store, say, bytes 9000 through 9020 and use that as a third condition when looking up a duplicate copy or alternatively to compare byte-to-byte when the aforementioned lookup method returns possible duplicates in a last attempt to discard false positives.
How naive is my proposed implementation? Is there a reliable way to predict the amount of false positives? What other considerations should I be aware of?
Edit: I forgot to mention that the files are generally going to be compressed archives (ZIP,RAR) and on occasion JPG images.
You can use file size, hashes and partial-contents to quickly detect when two files are different, but you can only determine if they are exactly the same by comparing their contents in full.
It’s up to you to decide whether the failure rate of your partial-file-check will be low enough to be acceptable in your specific circumstances. Bearing in mind that even an “exceedingly unlikely” event will happen frequently if you have enough volume. But if you know the type of data that the files will contain, you can judge the chances of two near-identical files (idenitcal in the first 50kB) cropping up.
I would think that if a partial-file-match is acceptable to you, then a hash of those partial file contents is probably going to be pretty acceptable too.
If you have access to 50kB then I’d use all 50kB rather than only the first 20kB in my hash.
Picking an arbitrary 20 bytes probably won’t help much (your file contents will either be very different in which case hash+size clashes will be unlikely, or they will be very similar in which case the chances of a randomly chosen 20 bytes being different will be quite low)
In your circumstances I would check the size, then a hash of the available daa (50kB), then if this suggests a file match, a brute-force comparison of the available data just to minimise the risks, if you don’t expect to be adding so many duplicates that this would bog the system down.