I have some code on my PHP-powered site that creates a random hash (using sha1()), and I use it to match records in the database.
What are the chances of a collision? Should I generate the hash and first check whether it’s already in the database (I’d rather avoid an extra query), or just insert it on the assumption that it almost certainly won’t collide with an existing one?
If you assume that SHA-1 does a good job, you can conclude that there’s a 1 in 2^160 chance that two given messages have the same hash (since SHA-1 produces a 160-bit hash).
2^160 is a ridiculously large number. It’s roughly 10^48. Even if you have a million entries in your database, that’s still a 1 in 10^42 chance that a new entry will share the same hash.
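SHA-1 distributes its output well for ordinary, non-malicious input, so I don’t think you need to worry about accidental collisions at all. (Deliberately constructed collisions are a separate matter, but they don’t apply to random values you generate yourself.)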
As a side note, pass true as the second argument to sha1() (the raw_output parameter) so that you get the 20-byte raw binary hash instead of the 40-character hexadecimal string. The shorter value will make your database operations a bit faster.
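A minimal sketch of the difference between the two output forms, assuming PHP 7 or later for random_bytes(). The variable names are only illustrative and are not taken from the question:

```php
<?php
// Hypothetical input: any random value would do here.
$token = random_bytes(32);

$hex = sha1($token);        // default: 40-character hexadecimal string
$raw = sha1($token, true);  // raw output: 20 bytes of binary data

var_dump(strlen($hex));            // length of the hex form
var_dump(strlen($raw));            // length of the raw form
var_dump(bin2hex($raw) === $hex);  // both encode the same hash
```

If you store the raw form, the column needs to be a binary type rather than a text type; a fixed-width 20-byte binary column is the natural fit, though the exact type name depends on your database.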
EDIT: To address the birthday paradox, a database with 10^18 (a million million million) entries has a collision probability of about 0.0000000000003, or roughly 1 in 3 trillion. Really not worth worrying about.
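If you want to check the birthday figure yourself, here is a rough sketch using the usual approximation p ≈ n^2 / (2 × 2^160), which holds while p is small. The function name collisionProbability is made up for this example:

```php
<?php
// Birthday-bound approximation: p is roughly n^2 / (2 * N),
// where n is the number of entries and N = 2^bits is the number
// of possible hash values. Only accurate while p is small.
function collisionProbability(float $entries, int $hashBits = 160): float
{
    $space = pow(2.0, $hashBits); // number of possible hash values
    return ($entries * $entries) / (2.0 * $space);
}

// Chance of any collision among 10^18 entries: roughly 3.4e-13.
printf("%.2e\n", collisionProbability(1e18));

// Chance of any collision among a million entries: roughly 3.4e-37.
printf("%.2e\n", collisionProbability(1e6));
```

Floating-point arithmetic is more than precise enough here, since only the order of magnitude matters.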