
Question

Asked: May 11, 2026

Further to this question: Algorithm for determining a file’s identity

Recap: I’m looking for a cheap algorithm for determining a file’s identity that works the vast majority of the time.

I went ahead and implemented an algorithm that gives me a “pretty unique” hash per file.

The way my algorithm works is:

For files smaller than a certain threshold, I use the file’s full content for the identity hash.
For files larger than the threshold, I take N random samples of size X.
I include the file size in the hashed data (meaning that files of different sizes always result in different hashes).
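
As a rough illustration, the three rules above might look like the following Python sketch. It is not the implementation linked under "Relevant information". The names `identity_hash`, `THRESHOLD`, `NUM_SAMPLES` and `SAMPLE_SIZE` are hypothetical, the 64K threshold is an assumed value (the post does not state one), and seeding the random generator with the file size is an assumption made so that the same file always gets the same sample offsets.

```python
import hashlib
import os
import random

THRESHOLD = 64 * 1024    # assumed cutoff; the post does not give a value
NUM_SAMPLES = 4          # N: "4 samples" from the post
SAMPLE_SIZE = 8 * 1024   # X: "8K each" from the post


def identity_hash(path, threshold=THRESHOLD,
                  num_samples=NUM_SAMPLES, sample_size=SAMPLE_SIZE):
    """Return a cheap, 'pretty unique' hex digest for the file at path."""
    size = os.path.getsize(path)
    h = hashlib.sha1()
    # Rule 3: the file size is part of the hashed data.
    h.update(str(size).encode("ascii"))
    with open(path, "rb") as f:
        if size <= threshold:
            # Rule 1: small files are hashed in full.
            h.update(f.read())
        else:
            # Rule 2: N samples of X bytes at pseudo-random offsets.
            # Assumption: seed with the size so the offsets are repeatable
            # for a given file (on the same Python version).
            rng = random.Random(size)
            upper = max(1, size - sample_size + 1)
            offsets = sorted(rng.randrange(upper) for _ in range(num_samples))
            for offset in offsets:
                f.seek(offset)
                h.update(f.read(sample_size))
    return h.hexdigest()
```

Sorting the offsets keeps the seeks moving forward through the file, which may help a little on spinning disks. Without a repeatable seed, two passes over the same large file would sample different regions and produce different hashes.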

Questions:

What values should I choose for N and X (how many random samples should I take, and of what size)? I went with 4 samples of 8K each and have not been able to stump the algorithm. I found that increasing the number of samples quickly slows the algorithm down, because seeks are pretty expensive.
The maths one: how similar do my files need to be for this algorithm to blow up (two different files of the same length ending up with the same hash)?
The optimization one: are there any ways I can optimize my concrete implementation to improve throughput? (I seem to be able to do about 100 files a second on my system.)
Does this implementation look sane? Can you think of any real-world examples where this will fail? (My focus is on media files.)

Relevant information:

The algorithm I implemented

Thanks for your help!

1 Answer

Editorial Team · Answer 1 · 2026-05-11T17:02:29+00:00

Always include the first and last blocks of the file in the hash.

This is because they’re the most likely to differ from file to file. Consider BMP: a file may have a fairly standard header (say, an 800×600 image, 24-bit, with the rest null), so you may want to overshoot the header a bit to reach the differentiating data. The problem is that headers vary wildly in size.

The last block is for file formats that append data to the original.

Read in blocks of a size that is native to the filesystem you use, or at least divisible by 512.
Always read blocks at an offset that is divisible by the block size.
If you get the same hash for files of the same size, do a deep scan (hash all the data) and remember the file path so you don’t scan it again.

Even then, unless you’re lucky, you will misidentify some files as the same (for example, a SQL Server database file and its 1:1 backup copy after only a few insertions; except that SQL Server does write a timestamp...).
