I’m looking for a way to identify data, such that I prevent it from

Question

Asked: May 23, 2026 (2026-05-23T03:24:29+00:00)

I’m looking for a way to identify data, such that I prevent it from being loaded more than once in memory. I thought a good method of doing this is to create a hash for each data buffer and use that as a sort of an id.

The data that I’ll need to hash ranges from 8 kB images, to 40 kB animations, to 3–5 MB music files, to <0.5 MB sound files. What do you think is the best hashing algorithm for my case? For that matter, is hashing the data the way to go, or should I think of some other way to identify data?

1 Answer

Editorial Team · Answer 1 · 2026-05-23T03:24:30+00:00

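There are a number of well-established hash algorithms in wide use. Digest sizes are given in bytes; the hexadecimal form is twice as many characters:

sha512 (64 bytes)
sha384 (48 bytes)
sha224 (28 bytes)
sha1 and ripemd160 (20 bytes)
md5 (16 bytes) (like the older md4 and md2)

md5 and sha1 are no longer safe against deliberately crafted collisions, but accidental collisions between your own files remain extremely unlikely.

A weaker one:

CRC (CRC-32 is 4 bytes)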
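
A quick way to check digest lengths for the algorithms listed above is to ask the library. This is a minimal sketch using Python's standard `hashlib` and `zlib` modules; the sample buffer is made up, and `ripemd160` is left out because not every build provides it. Note that `digest_size` is reported in bytes, while the hexadecimal string is twice as long.

```python
import hashlib
import zlib

data = b"example buffer"  # hypothetical sample data

for name in ("sha512", "sha384", "sha224", "sha1", "md5"):
    h = hashlib.new(name, data)
    # digest_size is in bytes; hexdigest() has twice as many characters
    print(f"{name:7s} {h.digest_size:3d} bytes  {len(h.hexdigest()):3d} hex chars")

# CRC-32 is a 32-bit checksum, returned as a plain non-negative integer
crc = zlib.crc32(data)
print(f"crc32     4 bytes  0x{crc:08x}")
```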

A general rule of thumb:

collision probability increases with the number of elements in your collections, not with their size
collision probability decreases with the number of bits in the checksum
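
To put rough numbers on this rule of thumb, the usual birthday approximation can be used: for n items and a b-bit hash, p ≈ 1 − exp(−n(n−1) / 2^(b+1)). The sketch below assumes an ideal, uniformly distributed hash, which real checksums such as CRC only approximate; the function name and the item count are made up for illustration.

```python
import math

def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_items distinct inputs share a hash."""
    pairs = num_items * (num_items - 1) / 2
    # 1 - exp(-pairs / 2^bits); expm1 keeps precision when the result is tiny
    return -math.expm1(-pairs / 2.0 ** hash_bits)

for name, bits in [("CRC-32", 32), ("md5", 128), ("sha1", 160)]:
    print(f"{name:7s} {collision_probability(10_000, bits):.3e}")
```

Under these assumptions, 10,000 items already give a 32-bit checksum a collision chance on the order of 1%, while the 128-bit and larger digests stay negligibly small.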

Hashing is the way to go. Just remember that for very large collections of items you should account for the probability of a collision, i.e. use a hash lookup first and then compare the actual contents of the entries that share that hash.
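
The lookup-then-compare idea can be sketched as follows. This is only an illustration: the class and method names are invented here, buffers are assumed to be `bytes` objects already read from disk, and sha1 is an arbitrary choice of hash.

```python
import hashlib

class BufferCache:
    """Keeps one copy of each distinct buffer, keyed by content hash."""

    def __init__(self):
        # digest -> list of buffers with that digest
        # (a list holds more than one entry only if a collision occurred)
        self._buckets = {}

    def intern(self, data: bytes) -> bytes:
        """Return the stored copy of data, storing it first if it is new."""
        digest = hashlib.sha1(data).digest()
        bucket = self._buckets.setdefault(digest, [])
        for existing in bucket:
            if existing == data:  # byte-for-byte check rules out a false match
                return existing
        bucket.append(data)
        return data

cache = BufferCache()
a = cache.intern(b"sprite-bytes")
b = cache.intern(bytes(b"sprite-") + b"bytes")  # equal contents, separate object
print(a is b)
```

The content comparison only runs against entries that already share the digest, so in the common case it costs one comparison per hit and nothing per miss.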

Since you appear to be storing files, it is probably worth using a compound key:

hash of contents
file name
file length

This has the following advantages:

hash collisions are off the table for all practical purposes, because a collision would have to occur between two different data samples of the same length, which is far less likely than a collision between arbitrary random data samples
your collection is immediately content addressable (meaning: you don’t need the file name to look up the contents; you can use the content hash to scan for duplicates stored under another name, because the hash doesn’t depend on the name)
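
One possible way to build such a key and use it to find duplicates stored under different names is sketched below. The function names, the chunk size, and the choice of sha1 are assumptions made for the example; the file name is kept as a value next to the (hash, length) pair rather than as part of the lookup, so the same contents are found regardless of name.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def content_key(path: Path) -> tuple:
    """Return (hex digest, length in bytes) for a file, reading it in chunks."""
    hasher = hashlib.sha1()
    length = 0
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            hasher.update(chunk)
            length += len(chunk)
    return hasher.hexdigest(), length

def find_duplicates(root: Path) -> dict:
    """Group file names under root by content key; keep groups with more than one name."""
    by_content = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_content[content_key(path)].append(path.name)
    return {key: names for key, names in by_content.items() if len(names) > 1}
```

Calling `find_duplicates(Path("assets"))` on a hypothetical `assets` directory would return a dictionary mapping each repeated (hash, length) pair to the file names that share it.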

$0.02
