This is an interview question : Given a directory with lots of files, find

Question

0

Asked: May 18, 20262026-05-18T01:07:43+00:00 2026-05-18T01:07:43+00:00

This is an interview question : Given a directory with lots of files, find

0

This is an interview question: “Given a directory with lots of files, find the files that have the same content”. I would propose to use a hash function to generate hash values of the file contents and compare only the files with the same hash values. Does it make sense ?

The next question is how to choose the hash function. Would you use SHA-1 for that purpose ?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-18T01:07:44+00:00

Like most interview questions, it’s more meant to spark a conversation than to have a single answer.

If there are very few files, it may be faster to simply to a byte-by-byte comparison until you reach bytes which do not match (assuming you do). If there are many files, it may be faster to compute hashes, as you won’t have to shift around the disk reading in chunks from multiple files. This process may be sped up by grabbing increasingly large chunks of each file, as you progress through the files eliminating potentials. hIt may also be necessary to distribute the problem among multiple servers, if their are enough files.

I would begin with a much faster and simpler hash function than SHA-1. SHA-1 is cryptographically secure, which is not necessarily required in this case. In my informal tests, Adler 32, for example, is 2-3 times faster. You could also use an even weaker presumptive test, than retest any files which match. This decision also depends on the relation between IO bandwidth and CPU power, if you have a more powerful CPU, use a more specific hash to save having to reread files in subsequent tests, if you have faster IO, the rereads may be cheaper than doing expensive hashes unnecessarily.

Another interesting idea would be to use heuristics on the files as you process them to determine the optimal method, based on the files size, computer’s speed, and the file’s entropy.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

This is an interview question : Given a directory with lots of files, find

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply