I’m looking for a way to identify data so that I don’t load it into memory more than once. I thought a good way of doing this would be to create a hash of each data buffer and use that as a kind of ID.
The data that I’ll need to hash ranges from 8 kB images, to 40 kB animations, to 3–5 MB music files, to <0.5 MB sound files. What do you think is the best hashing algorithm for my case? For that matter, is hashing the data the way to go, or should I think of some other way to identify data?
There are a number of strong algorithms in wide use:
And a weaker one:
* CRC-32 (4 bytes)
A general rule of thumb:
Hashing is the way to go. Just remember that for very large collections of items, you should consider the probability of getting a collision (i.e. use a hash lookup, then a linear search by contents among the entries that share that hash).
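To make the "hash lookup, then compare contents" idea concrete, here is a minimal sketch in Python. The `BufferCache` class and its `intern` method are names made up for this illustration, and SHA-1 is only an assumed choice of hash; any digest from `hashlib` would slot in the same way.

```python
import hashlib


class BufferCache:
    """Hypothetical cache that keeps one copy of each distinct byte buffer."""

    def __init__(self):
        # digest -> list of buffers sharing that digest (normally just one)
        self._buckets = {}

    def intern(self, data: bytes) -> bytes:
        """Return the stored copy of `data`, storing it first if it is new."""
        key = hashlib.sha1(data).digest()  # assumed hash; swap in another if preferred
        bucket = self._buckets.setdefault(key, [])
        # Linear search by contents guards against a hash collision.
        for stored in bucket:
            if stored == data:
                return stored
        bucket.append(data)
        return data

    def __len__(self):
        # Number of distinct buffers currently held.
        return sum(len(bucket) for bucket in self._buckets.values())
```

Calling `cache.intern(buffer)` on every freshly loaded buffer and keeping only the returned object means two loads of identical data end up sharing one copy. The content comparison only runs against buffers whose digest already matched, so in practice it costs one extra comparison per duplicate.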
Since you seem to be storing files, it is probably worth making a compound key:
This has the following advantages:
$0.02