I have many, many files to upload to the server, and I just want a way to avoid duplicates.
Generating a small, unique key from a big string seemed like exactly what a checksum is for, and hashing seemed like the evolution of that.
So I was going to use MD5 for this. But then I read somewhere that “MD5 hashes are not meant to be unique keys”, which struck me as really weird.
What’s the right way of doing this?
Edit: by the way, I combined two sources to get the following, which is what I’m currently using, and it works just fine with Python 2.5:
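import hashlib

def md5_from_file(fileName, block_size=2**14):
    md5 = hashlib.md5()
    # Open in binary mode so the bytes are hashed exactly as stored on disk
    f = open(fileName, 'rb')
    try:
        # Read in fixed-size blocks so large files never sit fully in memory
        while True:
            data = f.read(block_size)
            if not data:
                break
            md5.update(data)
    finally:
        f.close()
    return md5.hexdigest()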
Sticking with MD5 is a good idea. Just to be safe, I’d also store the file length (or the number of chunks) alongside the hash in your file-hash table.
Yes, it is possible to run into two different files that have the same MD5 hash, but that’s quite unlikely (if your files are decent sized). Storing the file length or number of chunks next to the hash reduces that chance further, since a false match now requires two different files of the same size with the same MD5.
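As a rough illustration of the size-plus-hash idea, here is a minimal sketch. The names `file_key` and `find_duplicates` are made up for this example, and it assumes a Python version newer than 2.5 so that the `with` statement is available without a `__future__` import.

```python
import hashlib
import os


def file_key(path, block_size=2**14):
    """Return (size_in_bytes, md5_hex) for the file at path (hypothetical helper)."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            data = f.read(block_size)
            if not data:
                break
            md5.update(data)
    return (os.path.getsize(path), md5.hexdigest())


def find_duplicates(paths):
    """Group paths by key and keep only the keys shared by more than one path."""
    seen = {}
    for path in paths:
        seen.setdefault(file_key(path), []).append(path)
    return dict((key, group) for key, group in seen.items() if len(group) > 1)
```

Two files are reported as duplicates only when both the size and the digest match. On the server side the same tuple could serve as the lookup key in whatever table tracks uploaded files.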
To reduce that possibility further, switch to SHA-1 and use the same method, or even use both (concatenating the two digests) if performance is not an issue.
Of course, keep in mind that this only detects files that are duplicates at the binary level, not images, sounds, or videos that are “the same” but have different signatures.
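One way the “use both” variant might look is sketched below. The function name `combined_digest` is hypothetical, and the sketch again assumes a Python version where `with` works without a `__future__` import. Both hashes are fed from the same read loop, so the file is only read once.

```python
import hashlib


def combined_digest(path, block_size=2**14):
    """Return the MD5 hex digest followed by the SHA-1 hex digest of a file."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        while True:
            data = f.read(block_size)
            if not data:
                break
            # Update both hash objects with the same block
            md5.update(data)
            sha1.update(data)
    return md5.hexdigest() + sha1.hexdigest()
```

The result is a 72-character string (32 hex characters from MD5 plus 40 from SHA-1). To switch to SHA-1 alone, drop the `md5` object and return only `sha1.hexdigest()`.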