Okay, the title is really very subjective. But thats just what the problem is to me.
The background is that I want to distribute hits of static web contents evenly about a defined number of caching servers. Also the delivery to clients should speed up because several domains are in use and requests are not blocking each other. I also don’t need a classic load balancer but generate the right links right away in my html code.
I also want to ensure that the same url always gets served by the same server.
So I just defined a little function that returns the host to use by hashing the request url and calculates the modulo by the number of servers in use:
function pseudocode_statify($url) { // $url looks like /folder1/folder2/file.jpg
return 'http://' . md5($url) % $num_of_servers .'.mydomain.com' . $url;
}
I first had something like hex decoding and substring to prevent overflow in place, but found out that it just works fine the way above.
However my problem is that if I run the following test script:
for($i=0;$i<100000;$i++) {
$md5 = md5(uniqid($i).microtime().rand(1,999999999999));
$result[$md5%2]++;
}
I expected an even distribution. meaning that $result[0] would be near the value of $result[1];
This was not the case.
Okay, this is nothing special sofar. I would have just accepted the fact that md5 is not as evenly distributed as i thought and would have gone vor another hashing algorithm like sha1 or something.
But I tried to reproduce the findings and found a pattern that I cannot explain.
The ratio was always about 2/1. In fact it was the ratio was always something like 1/2.16 to 1/2.17
Sample output of some runs of the script above:
output was generated by: echo "ratio: ".$result[0]/$result[1]."\n";
ratio: 2.1757121534504
ratio: 2.1729411578062
ratio: 2.1726559360393
ratio: 2.1676895664225
ratio: 2.1667416128848
ratio: 2.1667115284133
ratio: 2.1677791605385
ratio: 2.1658969579688
ratio: 2.1668508131769
ratio: 2.1689292821741
Now the weird thing was that the ratio of sums % 2 equaling 1 and sums % 2 equaling 0 sometimes alternated!
for($j = 0; $j<100;$j++) {
for($i=0;$i<100000;$i++) {
$md5 = md5(uniqid($i).microtime().rand(1,999999999999));
$result[$md5%2]++;
}
var_dump($result);
}
I ran the script from the command line two sperate times and aborted it after 3 runs and it produced theese two outputs:
joe@joe-laptop:/home/flimmit/httpdocs$ php test.php
PHP Notice: Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6
PHP Notice: Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6
PHP Notice: Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6
array(2) {
[0]=>
int(68223)
[1]=>
int(31777)
}
array(2) {
[0]=>
int(136384)
[1]=>
int(63616)
}
array(2) {
[0]=>
int(204498)
[1]=>
int(95502)
}
^C
joe@joe-laptop:/home/flimmit/httpdocs$ php test.php
PHP Notice: Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6
PHP Notice: Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6
PHP Notice: Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6
array(2) {
[1]=>
int(31612)
[0]=>
int(68388)
}
array(2) {
[1]=>
int(63318)
[0]=>
int(136682)
}
array(2) {
[1]=>
int(94954)
[0]=>
int(205046)
}
^C
joe@joe-laptop:/home/flimmit/httpdocs$
As you can see in the first one the first entry of results is always higher, in the second one its the other way round. same script.
Strange thing is that i can ONLY reproduce this behaviour when i run the script several times.
I wrote this small script to reproduce the “swapping” and generate enough measure data:
for($j = 0; $j<100;$j++) {
for($i=0;$i<rand(1000,10000);$i++) {
$md5 = md5(uniqid($i).microtime().rand(1,99999999));
$result[$md5%2]++;
}
#var_dump($result);
echo "ratio: ".$result[0]/$result[1]." ".(($result[0]<$result[1]) ? "A":"B")."\n";
sleep(rand(2,5));
}
But here It only prints b, never A’s. Which made me think there might be a semantic error in the script, but i didnt find any.
I am really stuck and this bothers me a lot.
So my questions:
-
Can you recommend any literature / weblinks were i could read about md5 a little bit deeper including distributions etc
-
Can you explain / reproduce the behaviour? Do I have an error here? (in fact thats very likely but i cant find it)
-
Can you recommend any other algorithm that would besuitable for my use case? It needs not be cryptographic or strong but fast, deterministic and evenly distributed.
The
md5()function returns a string, not an integer.Which means that this string will be type-casted to an integer to do the modulo ; and as this string will contain characters in the
0-9A-Frange, casted to an integer, you have :For example, this :
Will get you the following output :
I’ll let you guess the possible impact this can have on the result of the modulo operator 😉