I made a hash algorithm that uses MD5 for some low-security key generation. Basically, it takes the characters of a String, sums the product of each character with its (1-based) index, reduces that sum modulo an arbitrarily chosen constant, and then MD5s the result. In Java:
    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public static String hash(String input) {
        // Sum of (1-based index) * (character code) over the input
        BigInteger bi = BigInteger.ZERO;
        char[] array = input.toCharArray();
        for (int i = 0; i < array.length; i++) {
            bi = bi.add(BigInteger.valueOf(i + 1).multiply(
                    BigInteger.valueOf(array[i])));
        }
        final int moduloOperator = 52665; // random constant
        final byte[] moduloResult = bi.remainder(
                BigInteger.valueOf(moduloOperator)).toByteArray();
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException nsae) {
            nsae.printStackTrace();
            return null;
        }
        md.update(moduloResult);
        // First 7 decimal digits of the digest, read as an unsigned integer
        return new BigInteger(1, md.digest()).toString().substring(0, 7);
    }
I have the substring at the end because it needs to be easily readable.
At first glance, it works as intended: different inputs give different outputs, but the result is consistent across runs.
However, when playing with it a bit, I noticed the following:
hash("") = "1963546"
hash("1963546") = "1322048"
hash("1322048") = "2101764"
hash("2101764") = "3234562"
Looks fine so far. Suitably random. But then:
hash("3234562") = "3234562"
hash("3234562") = "3234562" [etc.]
This dumbfounded me. I would guess that there’s about a one in ten million chance that the hash of a 7-digit number is itself. Did this really happen on only the fifth iteration, or is there something wrong with my setup? More importantly, could there be any other similar errors that could have a serious impact on my hash?
Thanks.
The “random” part of your code is doing more harm than good.
First, the code adds together several uncorrelated numbers: each character’s code multiplied by its 1-based position in the string.
Let’s see the result of this for “2101764” and “3234562”. I’ll use Python for brevity.
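A minimal sketch of that check, mirroring the Java loop's 1-based weighting of character codes. The helper name `weighted_sum` is my own, not something from the question:

```python
def weighted_sum(text: str) -> int:
    """Sum of (1-based index) * (character code), as in the Java loop."""
    return sum(i * ord(ch) for i, ch in enumerate(text, start=1))


for s in ("2101764", "3234562"):
    print(s, weighted_sum(s))
# 2101764 1451
# 3234562 1451
```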
Both inputs sum to 1451, so MD5 is fed identical bytes for each and returns the same digest. Well, there’s your problem.
Remember the Central Limit Theorem? The sum of random numbers is much more predictable than the individual numbers themselves. Back of the envelope, for a 7-digit input (weights 1 through 7 applied to the character codes 48 through 57) the sum has a mean of 1470 and a standard deviation of about 34 (variance 1155). Roughly half of all sums fall within a 50-number range around the mean, more than 80% within a 100-number range, and every sum within a 253-number range (1344 to 1596). If anything, I think this is generous about the entropy of the sum.
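To check this kind of estimate exactly rather than on the back of an envelope, here is a rough sketch that counts how many 7-digit strings land on each weighted sum. It assumes uniformly random digit strings and the Java loop's 1-based weights on raw character codes; the function names are my own:

```python
from collections import Counter


def weighted_sum_distribution(length: int = 7,
                              alphabet: str = "0123456789") -> Counter:
    """Map each possible weighted sum to the number of inputs producing it."""
    dist = Counter({0: 1})
    for position in range(1, length + 1):
        nxt = Counter()
        for partial, count in dist.items():
            for ch in alphabet:
                nxt[partial + position * ord(ch)] += count
        dist = nxt
    return dist


dist = weighted_sum_distribution()
total = sum(dist.values())  # number of 7-digit strings: 10**7
mean = sum(s * c for s, c in dist.items()) / total
variance = sum((s - mean) ** 2 * c for s, c in dist.items()) / total


def share_within(half_width: int) -> float:
    """Fraction of inputs whose sum lies within half_width of the mean."""
    return sum(c for s, c in dist.items() if abs(s - mean) <= half_width) / total


print("inputs:", total, "distinct sums:", len(dist))
print("min:", min(dist), "max:", max(dist))
print("mean:", mean, "variance:", variance)
print("share within +/-25:", share_within(25))
print("share within +/-50:", share_within(50))
```

Under these assumptions, ten million inputs collapse onto only 253 distinct sums, so there can be at most 253 distinct hashes for 7-digit inputs no matter what MD5 does afterwards.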
After destroying information through addition, the algorithm takes the sum modulo 52665. There are only 52665 possible values modulo 52665, so this code can only ever produce 52665 distinct hashes in the best of circumstances. (For short inputs like these the sum never even reaches 52665, so the modulo changes nothing.)
And… there’s no reason to do any of this! Random code does not make random numbers. Making a good hash function is hard. You’re not going to improve on a hash by hacking up some code to slice and dice things. On the contrary, you are likely to destroy sources of randomness. If you want a random hash, use one that someone else has written.
Say, for example, MD5!
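As a rough sketch of what that could look like, here is the input fed straight to MD5 with no weighted sum or modulo in front of it, in Python again. The name `short_key`, the UTF-8 encoding, and the choice of the first 7 hex characters (rather than the question's 7 decimal digits) are my assumptions, not requirements from the question:

```python
import hashlib


def short_key(text: str, length: int = 7) -> str:
    """First `length` hex characters of the MD5 digest of the raw input."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return digest[:length]


# The two inputs that collided in the question are now hashed on their
# full content instead of on a shared weighted sum.
for s in ("2101764", "3234562"):
    print(s, short_key(s))
```

Seven hex characters still allow only 16^7 (about 268 million) distinct keys, so collisions remain possible by truncation alone; that is fine only for the low-security use described in the question.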