I’ve got a very large range/set of numbers, (1..1236401668096), that I would basically like to ‘shuffle’, i.e. randomly traverse without revisiting the same number. I will be running a Web service, and each time a request comes in it will increment a counter and pull the next ‘shuffled’ number from the range. The algorithm will have to accommodate the server going offline and be able to restart traversal from the persisted value of the counter (much like how you can seed a pseudo-random number generator and get the same pseudo-random number given the seed and which iteration you are on).
I’m wondering if such an algorithm exists or is feasible. I’ve seen the Fisher-Yates Shuffle, but the 1st step is to “Write down the numbers from 1 to N”, which would take terabytes of storage for my entire range. Generating a pseudo-random number for each request might work for a while, but as the database/tree gets full, collisions will become more common and could degrade performance (already a 0.08% chance of collision per request after 1 billion hits, according to my calculation). Is there a better solution for my scenario, or is this just a pipe dream?
The reason for the shuffling is that being able to correctly guess the next number in the sequence could lead to a minor DoS vulnerability in my app, but also because the presentation layer will look much nicer with a wider number distribution (I’d rather not go into details about exactly what the app does). At this point I’m considering just using a PRNG and dealing with collisions, or shuffling range slices (starting with (1..10000000).to_a.shuffle, then (10000001..20000000).to_a.shuffle, etc. as each range’s numbers start to run out).
Any mathemagicians out there have any better ideas/suggestions?
Concatenate a PRNG or LFSR sequence with /dev/random bits
There are several algorithms that can generate pseudo-random numbers with arbitrarily large and known periods. The two obvious candidates are the LCPRNG (LCG) and the LFSR, but there are more algorithms such as the Mersenne Twister.
The period of these generators can easily be constructed to fit your requirements. A full-period LCG or a maximal-length LFSR visits each of its states at most once per period, so you simply won’t have collisions.
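As a rough sketch of the LCG variant (not a vetted implementation): pick the modulus m as the smallest power of two that covers the range, use a multiplier with a % 4 == 1 and an odd increment so the generator has full period m, and simply skip any state that falls outside the range. The class name ShuffledRange and the constants a and c below are illustrative assumptions, not something from the question. Because states outside the range are skipped, the value to persist here is the generator state rather than the request counter.

```ruby
# Full-period LCG over 0...m (m a power of two); states >= n are skipped,
# so every value in 1..n comes out exactly once per period.
class ShuffledRange
  attr_reader :state

  # n:     size of the range 1..n
  # state: persisted generator state (save it after each request)
  # a, c:  illustrative constants (assumption); a % 4 == 1 and c odd
  #        give a full period for a power-of-two modulus
  def initialize(n, state: 0, a: 6364136223846793005, c: 1442695040888963407)
    @n = n
    @m = 1 << (n - 1).bit_length # smallest power of two >= n
    @a = a % @m
    @c = c % @m
    @state = state
  end

  # Advance the LCG until the state lands inside 0...n, then map it to 1..n.
  def next_value
    loop do
      @state = (@a * @state + @c) % @m
      return @state + 1 if @state < @n
    end
  end
end

gen = ShuffledRange.new(1_236_401_668_096)
first = gen.next_value
saved = gen.state # persist this; ShuffledRange.new(n, state: saved) resumes
```

For the range in the question m is 2**41, so on average fewer than two LCG steps are needed per request. The output of a plain LCG is still predictable to anyone who sees a few values.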
You could deal with the predictable behavior of PRNGs and LFSRs by adding 10, 20, or 30 bits of cryptographically hashed entropy from an interface like /dev/random.
Because the deterministic part of your number is known to be unique, it makes no difference if you ever repeat the actually random part of it.
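A minimal sketch of that concatenation, assuming 20 random bits and using Ruby's SecureRandom as a stand-in for reading /dev/random directly; the helper names with_random_suffix and unique_part_of are hypothetical. Note that the resulting number is 20 bits wider than the original range.

```ruby
require 'securerandom'

RANDOM_BITS = 20 # assumption: the answer suggests 10, 20, or 30 bits

# Shift the unique deterministic part left and fill the low bits with
# random data. Two different unique parts can never produce the same
# token, even if the random bits happen to repeat.
def with_random_suffix(unique_part, bits = RANDOM_BITS)
  (unique_part << bits) | SecureRandom.random_number(1 << bits)
end

# Recover the deterministic part by dropping the random low bits.
def unique_part_of(token, bits = RANDOM_BITS)
  token >> bits
end

token = with_random_suffix(42)
```

The unique part stays recoverable with a shift, so the service can still map a token back to its position in the sequence if it needs to.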