I’m writing a small article on human-readable alternatives to GUIDs/UIDs, for example those used on TinyURL for the URL hashes (which are often printed in magazines, so they need to be short).
The simple UID I’m generating is 6 characters long, where each character is either a lowercase letter (a-z) or a digit (0-9).
‘According to my calculations, captain’, that’s 6 mutually exclusive events, although calculating the probability of a clash gets a little harder than P(A or B) = P(A) + P(B), since the ID includes digits as well as letters and, as you can see from the code below, it decides whether to use a digit or a letter on a 50/50 basis.
I’m interested in the clash rate, and in whether the code below is a realistic simulation of the clash rate you’d get from generating a hash. On average I get 40-50 clashes per million; however, bear in mind that the UID wouldn’t be generated a million times at once, but probably only around 10-1000 times a minute.
What is the probability of a clash each time, and can anyone suggest a better way of doing it?
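    using System;
    using System.Collections.Generic;
    using System.Text;

    public static class Program
    {
        static Random _random = new Random();

        public static void Main()
        {
            // Intended size of the key is 6. Note that the inner loop below
            // runs 7 times (i = 0..6), so each generated key is 7 characters long.
            HashSet<string> set = new HashSet<string>();
            int clashes = 0;

            for (int n = 0; n < 1000000; n++)
            {
                StringBuilder builder = new StringBuilder();
                for (int i = 0; i < 7; i++)
                {
                    // 50/50 choice between a letter and a digit
                    if (_random.NextDouble() > 0.5)
                    {
                        // 97..122 are the character codes for 'a'..'z' (upper bound is exclusive)
                        builder.Append((char)_random.Next(97, 123));
                    }
                    else
                    {
                        // Upper bound is exclusive, so this yields 0..8 and never 9
                        builder.Append(_random.Next(0, 9).ToString());
                    }
                }

                string key = builder.ToString();
                if (set.Contains(key))
                {
                    clashes++;
                    Console.WriteLine("clash: (" + n + ")" + key);
                }
                set.Add(key);
                _random.Next();
            }

            Console.WriteLine("Clashes: " + clashes);
            Console.ReadLine();
        }
    }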
UPDATE: Here’s the resulting article from this question
I really asked two questions here so I was cheating. The answer I was after was rcar’s, however Sklivvz’s is also the answer to the 2nd part (an alternative). Is it possible to make a custom unique id generator in a database, or would it be client side (which would be 2 possible reads first)?
The general idea I was after was using IDs in databases or other stores that can be read out over the phone or used in printed material, not a giant 16-byte GUID.
UPDATE 2: I put the formula for two mutually exclusive events above instead of the one for two independent events (as getting an ‘a’ the first time doesn’t mean you can’t get an ‘a’ the second time). It should’ve been P(A and B) = P(A) x P(B).
The probability of a collision against one specific ID is:
p = (0.5 × ((0.5 × 1/10) + (0.5 × 1/26)))^6
which is around 1.7×10^-9.
The probability of a collision after generating n IDs is 1-(1-p)^n, so you’ll have roughly a 0.17% chance of a collision for each new insertion after 1 million IDs have been inserted, around 1.7% after 10 million IDs, and around 16% after 100 million.
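As a rough check of those figures, here is a minimal sketch that evaluates p and 1-(1-p)^n numerically. The class and method names (`CollisionOdds`, `CharMatch`, `IdMatch`, `CollisionAfter`) are made up for illustration, and it assumes 6 characters, 10 digits, 26 letters and a 50/50 digit/letter choice, as described in the question.

```csharp
using System;

public static class CollisionOdds
{
    // Probability that two independently generated characters match,
    // assuming a 50/50 choice between `digits` digits and `letters` letters:
    // both digits and equal (0.25 / digits) or both letters and equal (0.25 / letters).
    public static double CharMatch(int digits, int letters)
    {
        return 0.25 / digits + 0.25 / letters;
    }

    // Probability that two independently generated IDs of `length` characters are identical.
    public static double IdMatch(int length, int digits = 10, int letters = 26)
    {
        return Math.Pow(CharMatch(digits, letters), length);
    }

    // Approximate probability that a new ID collides with at least one of
    // n existing IDs: 1 - (1 - p)^n.
    public static double CollisionAfter(double p, double n)
    {
        return 1.0 - Math.Pow(1.0 - p, n);
    }

    public static void Main()
    {
        double p = IdMatch(6);
        Console.WriteLine("p = {0:E2}", p);
        foreach (double n in new[] { 1e6, 1e7, 1e8 })
        {
            Console.WriteLine("{0:N0} existing IDs: {1:P2}", n, CollisionAfter(p, n));
        }
    }
}
```

Changing the arguments to `IdMatch` shows how quickly the odds improve with a longer key or a larger alphabet.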
1000 IDs/minute works out to about 43 million/month, so as Sklivvz pointed out, using some incrementing ID is probably going to be a better way to go in this case.
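One way the incrementing-ID idea could look, as a sketch only: take a counter (for example a database identity column, which is an assumption here, not something specified above) and encode it in base 36 using the same 0-9 and a-z characters. The `Base36` class and its `Encode` method are hypothetical names for this example.

```csharp
using System;
using System.Text;

public static class Base36
{
    private const string Alphabet = "0123456789abcdefghijklmnopqrstuvwxyz";

    // Encodes a non-negative counter value as a base-36 string,
    // left-padded with '0' to at least minLength characters.
    public static string Encode(long value, int minLength = 6)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));

        var sb = new StringBuilder();
        do
        {
            // Prepend the least significant base-36 digit
            sb.Insert(0, Alphabet[(int)(value % 36)]);
            value /= 36;
        } while (value > 0);

        return sb.ToString().PadLeft(minLength, '0');
    }

    public static void Main()
    {
        Console.WriteLine(Encode(0));   // 000000
        Console.WriteLine(Encode(35));  // 00000z
        Console.WriteLine(Encode(36));  // 000010
    }
}
```

Because every counter value maps to a different string, there are no clashes at all, and 6 characters cover 36^6 = 2,176,782,336 IDs. The trade-off is that consecutive IDs are predictable, which may or may not matter for the use case.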
EDIT:
To explain the math: he’s essentially flipping a coin and then picking a digit or a letter, 6 times over. For two IDs to match at a given position, there’s a 0.5 probability that the coin flips match; given that, half the time both characters are digits, with a 1/10 chance of matching, and half the time both are letters, with a 1/26 chance of matching. That happens 6 times independently, so you multiply those probabilities together.
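The same per-character reasoning can be used to estimate how many clashes the simulation in the question should report. The sketch below uses the usual birthday-style approximation (number of pairs times the probability that one pair matches); `ExpectedClashes` and `Estimate` are illustrative names. Note that the posted loop actually runs 7 times per key and `_random.Next(0, 9)` only produces the digits 0-8, so the first call models the code as written and the second models the 6-character, 0-9 scheme described in the text.

```csharp
using System;

public static class ExpectedClashes
{
    // Birthday-style estimate of the number of clashing pairs among n IDs:
    // n(n-1)/2 pairs, each matching with probability charMatch^length.
    public static double Estimate(double n, int length, int digits, int letters)
    {
        // 50/50 digit/letter choice: both digits and equal, or both letters and equal
        double charMatch = 0.25 / digits + 0.25 / letters;
        double idMatch = Math.Pow(charMatch, length);
        return n * (n - 1) / 2.0 * idMatch;
    }

    public static void Main()
    {
        // The simulation as posted: 7 characters, digits 0-8, letters a-z
        Console.WriteLine(Estimate(1e6, 7, 9, 26));

        // The scheme as described: 6 characters, digits 0-9, letters a-z
        Console.WriteLine(Estimate(1e6, 6, 10, 26));
    }
}
```

Under these assumptions the first estimate comes out at roughly 50 clashes per million, in line with the 40-50 observed, while the second is roughly 860, so a true 6-character key would clash noticeably more often than the simulation suggests.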