I’m looking for a Perl string checksum function with the following properties:
- Input: Unicode string of undefined length ($string)
- Output: Unsigned integer ($hash), for which 0 <= $hash <= 2^32-1 holds (0 to 4294967295, matching the size of a 4-byte MySQL unsigned int)
Pseudo-code:
sub checksum {
    my $string = shift;
    my $hash;
    # ... checksum logic goes here ...
    die unless ($hash >= 0);
    die unless ($hash <= 4_294_967_295);
    return $hash;
}
Ideally the checksum function should be quick to run and should generate values somewhat uniformly in the target space (0 .. 2^32-1) to avoid collisions. In this application random collisions are totally non-fatal, but obviously I want to avoid them to the extent that it is possible.
Given these requirements, what is the best way to solve this?
Any good hash function will be sufficient: simply truncate its output to 4 bytes and convert that to a number. Good hash functions produce uniformly distributed output, and that distribution holds no matter which 4 bytes of the digest you keep.
I suggest Digest::MD5 because it is the fastest hash implementation that comes with Perl as standard. String::CRC, as Pim mentions, is also implemented in C and should be faster still.
Here’s how to calculate the hash and convert it to an integer:
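A minimal sketch of that approach, under two illustrative assumptions: the string is encoded to UTF-8 before hashing (Digest::MD5 operates on bytes and croaks on strings containing wide characters), and the first four bytes of the digest are read as a big-endian unsigned integer so the result is the same on every platform. Any other four bytes of the digest, or a native-endian template, would serve equally well as long as the choice stays fixed.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);
use Encode qw(encode_utf8);

sub checksum {
    my ($string) = @_;

    # md5() expects bytes, so encode the Unicode string first (assumption: UTF-8).
    my $digest = md5(encode_utf8($string));

    # Keep the first 4 of the 16 digest bytes and read them as an
    # unsigned 32-bit big-endian integer, i.e. a value in 0 .. 2^32-1.
    my $hash = unpack('N', substr($digest, 0, 4));

    return $hash;
}

print checksum("String-to-hash"), "\n";
```

The result always fits a 4-byte MySQL unsigned int, so the two range checks from the question's pseudo-code can never fail and are left out here.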