I have seen this question asked a lot but never seen a true concrete answer to it. So I am going to post one here which will hopefully help people understand why exactly there is “modulo bias” when using a random number generator, like rand() in C++.
So rand() is a pseudo-random number generator which chooses a natural number between 0 and RAND_MAX, which is a constant defined in cstdlib (see this article for a general overview on rand()).

Now what happens if you want to generate a random number between, say, 0 and 2? For the sake of explanation, let’s say RAND_MAX is 10 and I decide to generate a random number between 0 and 2 by calling rand()%3. However, rand()%3 does not produce the numbers between 0 and 2 with equal probability!

When rand() returns 0, 3, 6, or 9, rand()%3 == 0. Therefore, P(0) = 4/11.

When rand() returns 1, 4, 7, or 10, rand()%3 == 1. Therefore, P(1) = 4/11.

When rand() returns 2, 5, or 8, rand()%3 == 2. Therefore, P(2) = 3/11.

This does not generate the numbers between 0 and 2 with equal probability. Of course for small ranges this might not be the biggest issue, but for a larger range this could skew the distribution, biasing the smaller numbers.
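To make the arithmetic above easy to check, here is a minimal sketch that counts how many raw values fall into each residue class. The helper name count_residues and its max_value parameter (standing in for RAND_MAX) are made up for illustration, and the value 10 is the toy RAND_MAX from the example, not what a real implementation defines.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: count how many values in 0..max_value (inclusive)
// land on each residue when reduced modulo n. max_value plays the role of
// RAND_MAX in the example above.
std::vector<int> count_residues(int max_value, int n) {
    assert(n > 0 && max_value >= 0);
    std::vector<int> counts(n, 0);
    for (int value = 0; value <= max_value; ++value) {
        ++counts[value % n];  // tally the residue class of this raw value
    }
    return counts;
}
```

Calling count_residues(10, 3) yields the counts 4, 4 and 3, which correspond to the probabilities 4/11, 4/11 and 3/11 worked out by hand.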
So when does rand()%n return a range of numbers from 0 to n-1 with equal probability? When RAND_MAX%n == n - 1, that is, when RAND_MAX + 1 (the number of distinct values rand() can return) is a multiple of n. In this case, along with our earlier assumption that rand() does return a number between 0 and RAND_MAX with equal probability, the modulo classes of n would also be equally distributed.

So how do we solve this problem? A crude way is to keep generating random numbers until you get a number in your desired range.
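A minimal sketch of that crude retry loop might look like the following. The function name crude_uniform is hypothetical, and the sketch assumes rand() is uniform over 0 to RAND_MAX, as the answer does throughout.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: keep calling rand() until the result is below n.
// No modulus is taken, so no residue class is favoured, but most draws
// are thrown away when n is small compared to RAND_MAX.
int crude_uniform(int n) {
    assert(n > 0 && n - 1 <= RAND_MAX);  // some value below n must be reachable
    int x;
    do {
        x = std::rand();
    } while (x >= n);  // reject anything outside 0..n-1
    return x;
}
```

Be careful when trying this with a small n: on platforms where RAND_MAX is very large, the loop can spin for a long time before a draw is accepted.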
However, the crude approach is inefficient for low values of n, since you only have an n/(RAND_MAX + 1) chance of getting a value in your range, and so you’ll need to perform (RAND_MAX + 1)/n calls to rand() on average.

A more efficient approach would be to take some large range with a length divisible by n, like RAND_MAX - RAND_MAX % n, keep generating random numbers until you get one that lies in the range (that is, one smaller than RAND_MAX - RAND_MAX % n), and then take the modulus. For small values of n, this will rarely require more than one call to rand().

Works cited and further reading:
CPlusPlus Reference
Eternally Confuzzled
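
For completeness, the more efficient approach from this answer, rejecting any draw at or above RAND_MAX - RAND_MAX % n and then taking the modulus, could be sketched as follows. The name unbiased_mod is hypothetical, and the sketch again assumes rand() itself is uniform over 0 to RAND_MAX.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: rejection sampling over a range whose length is a
// multiple of n, followed by a modulus.
int unbiased_mod(int n) {
    assert(n > 0 && n <= RAND_MAX);  // guarantees the accepted range is non-empty
    // Largest multiple of n that does not exceed RAND_MAX.
    const int limit = RAND_MAX - RAND_MAX % n;
    int x;
    do {
        x = std::rand();
    } while (x >= limit);  // discard the few values past the last full block
    return x % n;          // each residue now has limit / n source values
}
```

Because the accepted range 0 to limit - 1 contains exactly limit / n values for every residue, the modulus no longer favours the smaller results.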