I got many, many files to be uploaded to the server, and I just

Question

0

Asked: May 14, 20262026-05-14T19:53:10+00:00 2026-05-14T19:53:10+00:00

I got many, many files to be uploaded to the server, and I just

0

I got many, many files to be uploaded to the server, and I just want a way to avoid duplicates.

Thus, generating a unique and small key value from a big string seemed something that a checksum was intended to do, and hashing seemed like the evolution of that.

So I was going to use hash md5 to do this. But then I read somewhere that “MD5 are not meant to be unique keys” and I thought that’s really weird.

What’s the right way of doing this?

edit: by the way, I took two sources to get to the following, which is how I’m currently doing it and it’s working just fine, with Python 2.5:

import hashlib

def md5_from_file (fileName, block_size=2**14):
    md5 = hashlib.md5()
    f = open(fileName)
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    f.close()
    return md5.hexdigest()

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-14T19:53:11+00:00

Sticking with MD5 is a good idea. Just to make sure I’d append the file length or number of chunks to your file-hash table.

Yes, there is the possibility that you run into two files that have the same MD5 hash, but that’s quite unlikely (if your files are decent sized). Thus adding the number of chunks to your hash may help you reduce that since now you have to find two files the same size with the same MD5.

# This is the algorithm you described, but also returns the number of chunks.
new_file_hash, nchunks = hash_for_tile(new_file)
store_file(new_file, nchunks, hash)

def store_file(file, nchunks, hash):
  "" Tells you whether there is another file with the same contents already, by 
     making a table lookup ""
  # This can be a DB lookup or some way to obtain your hash map
  big_table = ObtainTable()

  # Two level lookup table might help performance
  # Will vary on the number of entries and nature of big_table
  if nchunks in big_table:
     if hash in big_table[hash]:
       raise DuplicateFileException,\
         'File is dup with %s' big_table[nchunks][lookup_hash]
  else:
    big_table[nchunks] = {}

  big_table[nchunks].update({
    hash: file.filename
  })

  file.save() # or something

To reduce that possibility switch to SHA1 and use the same method. Or even use both(concatenating) if performance is not an issue.

Of course, keep in mind that this will only work with duplicate files at binary level, not images, sounds, video that are “the same” but have different signatures.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I got many, many files to be uploaded to the server, and I just

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply