I am developing an “open distributed cloud storage system”.
By open I mean that anyone can participate in hosting of files.
My current design uses a SHA-1 hash of the file's content as the global file ID.
Assume that the client already knows this hash value and receives the file from a “bandwidth donor”.
The client now needs to verify that the file is indeed the correct one, by computing the hash of the received data and comparing it to the expected value.
However, my concern is that someone could deliberately modify a file so that it produces the same hash. As far as I know, this is easy to do for hashes of the CRC family. Some “googling” around revealed a lot of claims that the same would be easy for MD5.
Now my question is: Is there a hashing algorithm which satisfies the criteria of being
- fast for large amounts of data
- well distributed over the hashing range (aka “unique”)
- large enough in its target range (“bit length”)
- resistant to deliberate collision attacks
All other ways I can think of to achieve a setup that serves my needs involve a secret component, for example a secret OpenSSL key or a shared secret salt for a hash function.
Unfortunately I cannot work with that.
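Strictly speaking, what you are asking for is a one-way function, and whether such functions exist at all is a major open problem. In practice, you rely on cryptographic hash functions for which no efficient attack is known.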
With cryptographic hash functions, the specific attack you want to avoid is called a “second pre-image attack”: given one specific input, find a different input with the same hash.
That term should help you Google what you want, but as far as I know there is actually no known practical second pre-image attack for MD5.
What you probably found is that, for MD5, it is easy to find two arbitrary files that have the same hash (a collision attack), and to find a different such pair every time you try.
But it is difficult to generate a file that disguises itself as some specific, already existing file. In other words, it is unlikely that one of the aforementioned “two arbitrary files” actually belongs to a non-malicious agent in your storage.
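To make the difference between the two attacks concrete, here is a toy sketch in Python. It assumes a deliberately weak hash, namely SHA-256 truncated to 16 bits, purely so that both brute-force searches finish quickly; `toy_hash`, `find_collision` and `find_second_preimage` are names made up for this illustration, and this is not how real attacks on MD5 work.

```python
import hashlib
from itertools import count

BITS = 16  # toy digest size (an assumption, chosen only so the searches finish fast)


def toy_hash(data: bytes) -> int:
    """First BITS bits of SHA-256, standing in for a deliberately weak hash."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - BITS)


def find_collision():
    """Collision attack: the attacker is free to choose BOTH inputs."""
    seen = {}  # hash value -> first message that produced it
    for i in count():
        msg = f"free-{i}".encode()
        h = toy_hash(msg)
        if h in seen:
            return seen[h], msg, i + 1  # two distinct messages, number of tries
        seen[h] = msg


def find_second_preimage(target: bytes):
    """Second pre-image attack: one input is fixed by somebody else."""
    wanted = toy_hash(target)
    for i in count():
        msg = f"forgery-{i}".encode()
        if toy_hash(msg) == wanted:
            return msg, i + 1  # forged message, number of tries


if __name__ == "__main__":
    a, b, tries = find_collision()
    print("collision after", tries, "tries:", a, b)
    forged, tries = find_second_preimage(b"some honest user's file")
    print("second pre-image after", tries, "tries:", forged)
```

For an n-bit hash, the first search is expected to need on the order of 2^(n/2) tries (the birthday bound), while the second needs on the order of 2^n. Only the second kind of search lets an attacker replace a file that somebody else has already published under a known ID.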
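If you’re still not satisfied, you might want to try something like SHA-2 (for example SHA-256) or GOST. SHA-1, which you already use, is in much the same position as MD5: practical collisions have been demonstrated, but no practical second pre-image attack is known.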
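For reference, the client-side check described in the question could look roughly like the following. This is only a minimal sketch: it assumes SHA-256 as the ID function and a hex-encoded ID, and `file_id` and `verify_download` are hypothetical helper names, not part of any existing system.

```python
import hashlib

CHUNK_SIZE = 1 << 20  # read 1 MiB at a time so large files never sit fully in memory


def file_id(path: str, algorithm: str = "sha256") -> str:
    """Return the hex digest of the file's content, computed in a streaming fashion."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_download(path: str, expected_id: str, algorithm: str = "sha256") -> bool:
    """True if the downloaded file hashes to the ID the client already knows."""
    return file_id(path, algorithm) == expected_id.lower()
```

Since the expected ID is public and only the hash is recomputed, no secret component is involved. Switching to another algorithm supported by `hashlib` is a matter of changing the `algorithm` argument.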