(This is a follow-up article to my post on generating random data that conforms to a given distribution; you might want to read it first.)

Here’s an interesting question I was pondering last week. The .NET base class library has a method Random.Next(int) which gives you a pseudo-random integer greater than or equal to zero, and less than the argument. By contrast, the rand() function in the standard C library returns a pseudo-random integer between 0 and RAND_MAX inclusive. The value of RAND_MAX is implementation-defined; a common value is 32767, which gives 32768 possible results. A common technique for generating random numbers in a particular range is to use the remainder operator:

int value = rand() % range;

However, this almost always introduces some bias that causes the distribution to stop being uniform. Do you see why?



Let’s suppose that rand() generates a uniform probability distribution; for any call we have a 1 / 32768 chance of getting any particular number. And let’s suppose that range is 3. What is the distribution of rand() % 3? There are 10923 numbers between 0 and 32767 that produce a remainder of 0, 10923 that produce a remainder of 1 and 10922 that produce a remainder of 2, so 2 is very slightly less likely, and 0 and 1 are very slightly more likely. We’ve introduced a bias towards smaller numbers.
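These counts are easy to check by brute force. The sketch below is only an illustration: tally_remainders is a hypothetical helper, and it assumes a generator that produces each value from 0 to outputs - 1 with equal probability, as in the 32768-value example above.

```c
/* Hypothetical helper: for a generator that yields every value in
   0 .. outputs-1 with equal probability, count how many of those values
   leave each remainder when divided by `range`.
   `counts` must have room for `range` entries. */
void tally_remainders(unsigned long outputs, unsigned long range,
                      unsigned long *counts)
{
    for (unsigned long r = 0; r < range; ++r)
        counts[r] = 0;

    /* Visit every possible generator output exactly once. */
    for (unsigned long v = 0; v < outputs; ++v)
        counts[v % range] += 1;
}
```

Called with outputs = 32768 and range = 3, it fills counts with 10923, 10923 and 10922.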

This difference is tiny; so small that you would not notice it without millions of data points. But now let’s suppose that we want to generate a random number between 0 and 19999 using this technique, and see what the bias is. How many ways are there to get 0? We could generate 0 or 20000, so 2. How many ways are there to get 1? We could generate 1 or 20001. And so on, up to 12767. But there is only one way to get 12768 out of rand() % 20000! Every number between 0 and 12767 is twice as likely as every number from 12768 to 19999; this is a massive bias.
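The number of ways to produce a given remainder does not need enumeration: every remainder gets one way per complete run of 0 to range - 1, and the smallest remainders get one extra from the final partial run. A minimal sketch, where ways_to_get is a hypothetical helper and outputs is assumed to be the number of equally likely generator values:

```c
/* Hypothetical helper: number of generator values in 0 .. outputs-1
   that leave remainder r when divided by range (requires r < range). */
unsigned long ways_to_get(unsigned long outputs, unsigned long range,
                          unsigned long r)
{
    unsigned long full_cycles = outputs / range; /* complete runs of 0 .. range-1 */
    unsigned long leftover = outputs % range;    /* values in the final partial run */
    return full_cycles + (r < leftover ? 1 : 0);
}
```

With outputs = 32768 and range = 20000 there is one complete run and 12768 leftover values, so remainders 0 through 12767 get two ways and the rest get one.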

How can we quantify how massive or tiny the bias is? One technique is to graph the desired probability distribution (shown here as a continuous function, though of course it is a collection of discrete data points) and the actual probability distribution, and then take the area of their difference.

Here the red line is the desired probability distribution: an even 1/20000 for each element. The blue line is the probability of getting each element with rand() % 20000, and the shaded area is the difference between the desired and actual distributions. The area under both distribution lines is 100%, as you would expect for a probability distribution. A few quick calculations show that the shaded area is a massive 28%! By contrast, if we did the same exercise for rand() % 3 the shaded area would be 0.004%. (Of course the shaded area above the red line equals the shaded area below it, since the total area under both curves is the same. We could just compute one of them and double it.)
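Here is one way those quick calculations might be written down. This is a sketch under the same assumption of a generator with outputs equally likely values; bias_area is a hypothetical name, and it sums the gap between the actual and desired probability over every element of the range.

```c
/* Hypothetical helper: total absolute difference between the actual
   distribution of (generator value % range) and the uniform distribution
   over 0 .. range-1, for a generator with `outputs` equally likely values.
   Returns a fraction, so 0.28 means 28%. */
double bias_area(unsigned long outputs, unsigned long range)
{
    double desired = 1.0 / (double)range;
    unsigned long full_cycles = outputs / range;
    unsigned long leftover = outputs % range;

    /* The first `leftover` remainders are over-represented... */
    double p_high = (double)(full_cycles + 1) / (double)outputs;
    /* ...and the remaining ones are under-represented (or exact). */
    double p_low = (double)full_cycles / (double)outputs;

    return (double)leftover * (p_high - desired)
         + (double)(range - leftover) * (desired - p_low);
}
```

For outputs = 32768 this gives about 0.28 for a range of 20000 and about 0.00004 for a range of 3, matching the 28% and 0.004% figures.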

I computed the “bias area” for every range value from 1 to 32768; the result is an interesting shape:

Every power of two gives no bias at all; halfway between each power of two gives an ever-increasing bias area.

The moral of the story is: when using the remainder technique, you’re introducing a bias towards small numbers, and that bias can get quite large if the range is large. Even a range of a few thousand is already introducing a 5% bias towards smaller numbers.

So what then is the better technique? First off, obviously if you have a library method that does the right thing, use it! The .NET implementation of Random.Next(int) has been designed to not have this bias. (It does not use the technique I describe here. Rather, it generates a random double and then multiplies that by the desired range, then rounds to an integer. If the range is very large then special techniques are used to ensure good distribution.) If you must use the remainder technique then my advice is to simply discard the results that would introduce the bias. Something like:

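// rand() can return RAND_MAX + 1 distinct values. Accept only those below the
// largest multiple of range that fits. Assumes 0 < range <= RAND_MAX.
unsigned int count = (unsigned int)RAND_MAX + 1u;
unsigned int limit = count - count % (unsigned int)range;
while (true)
{
    unsigned int value = (unsigned int)rand();
    if (value < limit)
        return (int)(value % (unsigned int)range);
}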
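To convince yourself that discarding really does remove the bias, you can count the ways to reach each remainder using only the values that survive the cut. The following is a sketch, not a proof: accepted_ways is a hypothetical helper, and outputs again stands for the number of equally likely values the generator can produce.

```c
/* Hypothetical check: among generator values 0 .. outputs-1 that survive
   the discard rule (value < limit), count those that map to remainder r. */
unsigned long accepted_ways(unsigned long outputs, unsigned long range,
                            unsigned long r)
{
    /* Largest multiple of range that is not greater than outputs. */
    unsigned long limit = outputs - outputs % range;
    unsigned long count = 0;

    for (unsigned long v = 0; v < limit; ++v)
        if (v % range == r)
            ++count;
    return count;
}
```

With outputs = 32768 and range = 20000, every remainder from 0 to 19999 now has exactly one way to occur. The price is retries: 12768 of the 32768 possible values, about 39%, are thrown away, so the loop runs more than once fairly often for that range.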