The vast majority of my posts about random number generation have focused on looking at the properties of different generation schemes. But, perhaps surprisingly, the performance of your randomized algorithm may hinge not on the generation scheme you chose, but on other factors. In this post (inspired by and building on an excellent recent paper by Daniel Lemire), we'll explore a common source of overhead in random number generation that frequently outweighs PRNG engine performance.

Imagine this scenario:

For their homework, Huan and Sasha both sit down to implement the same randomized algorithm in C++, running it on the same machine at the university on the same data sets. Their code is almost identical except for random number generation. Huan is eager to get to music practice and so just uses the Mersenne Twister. Sasha, on the other hand, spends several extra hours doing lots of investigation. Sasha benchmarks several fast PRNGs that have been touted lately on social media, and picks the fastest. When Huan and Sasha meet up, Huan can't wait to show off, asking Sasha, “What did you use as your PRNG?”

“Oh, I just used the Mersenne Twister—it's built in and it seemed to work well,” says Huan.

“Ha!” says Sasha. “I used jsf32 . It goes much faster than something old and slow like the Mersenne Twister! My program was done in three minutes fifteen seconds!”.

“Hmm, well, mine got done in under a minute,” says Huan and shrugs. “Anyhow, I'm off to a concert. Want to come?”

“No,” says Sasha. “I, uh, want to take another look at my code….”

This embarrassing fictional scenario is not particularly contrived; it is actually based on real-world results. If your randomized algorithm isn't running as fast as you'd like and the bottleneck seems to be generating random numbers, somewhat counterintuitively the problem might not be with your random number generator!

Background: Random Numbers in Practice

Most modern good-quality random number generators produce machine words filled with random bits, thus typically generating numbers in the range [0..232) or [0..264). But in many use cases users need numbers in a particular range—simple examples such as rolling a die or choosing a random playing card require numbers in small fixed ranges, but many algorithms from shuffling to reservoir sampling to randomized BSTs all need numbers drawn from different ranges.

Methods

We'll cover a number of different methods. To simplify the discussion, rather than generate in the range [i..j) or [i..j] we will generate numbers in the range [0..k). With such a scheme in hand, we can, for example, generate numbers in the range [i..j) by setting k = j - i, generating a number in the range [0..k) and then adding i to it.

C++ Built-In Techniques

In many languages there are built-in facilities for getting a random number in a range. For example, to draw a card from a 52-card deck using scripting languages like Perl and Python we can say int(rand(52)) and random.randint(0,52) , respectively. In C++, we can similarly use uniform_int_distribution .

The C++ code to implement this approach is just

uint32_t bounded_rand(rng_t& rng, uint32_t range) { std::uniform_int_distribution<uint32_t> dist(0, range-1); return dist(rng); }

Typically, built-in approaches use one of the other techniques we will describe below, but most users will simply use the facilities provided without thinking too deeply about what is happening behind the scenes, assuming that they are well enough engineered and sufficiently efficient. In C++, the built-in techniques are more complex because they must be able to deal with quite arbitrary generation engines—a generator that produces values between -3 and 17 would be quite legal and can be used with std::uniform_int_distribution to produce numbers in any range, such as [0..1000). C++'s built-in facilities are thus over-engineered for most of the uses to which they are put.

Classic Modulo (Biased)

Let's move from an over-engineered approach to an under-engineered one.

When I was learning to program, we generated numbers in a range (e.g., to draw a card from a 52-card deck) using the modulo operator by writing something like rand() % 52 to get a random number in the range [0..52).

In C++, we could write this approach as

uint32_t bounded_rand(rng_t& rng, uint32_t range) { return rng() % range; }

Although this approach is simple, it also illustrates why getting numbers within a range is typically slow—it requires division (to calculate the remainder produced by the % operator). Division is typically at least an order of magnitude slower than other arithmetical operations, and so this single arithmetic operation takes longer than all the work done by a fast PRNG.

But in addition to being slow, it is also biased. To understand why rand() % 52 produces biased numbers, if we assume that rand() produces numbers in the range [0..232), observe that 52 does not perfectly divide 232, it divides it 82,595,524 times with remainder 48. Meaning that if we use rand() % 52 , there will be 82,595,525 ways to select the first 48 cards from our 52-card deck and only 82,595,524 ways to select the final four cards . In other words, there is a 0.00000121% bias against these last four cards (perhaps these are the kings!). Back when I was a student writing homework assignments rolling dice or drawing cards, no one really worried about these tiny biases, but as the range increases, the bias increases linearly. For a 32-bit PRNG, a bounded range of less than 224 has a bias of less than 0.5% but above 231 the bias is 50%—some numbers will occur half as often as others.

In this article, our primary focus is on techniques that adopt strategies to avoid bias, but it's probably worth realizing that for a 64-bit PRNG the amount of bias in typical use cases is likely to be negligible.

One other concern we might have is that some generators have weak low-order bits. For example, the Xoroshiro+ and Xoshiro+ families of PRNGs have low-order bits that fail statistical tests. When we perform % 52 (because 52 is even) we pass the lowest bit straight through into the output.

FP Multiply (Biased)

Another common technique is to use a PRNG that generates floating point numbers in the unit interval [0..1), and then convert those numbers into the desired range. Perl adopts this approach, recommending using int(rand(10)) to generate an integer in the range [0..10) by generating a floating-point number and then rounding it down.

In C++, we would write this approach as

static uint32_t bounded_rand(rng_t& rng, uint32_t range) { double zeroone = 0x1.0p-32 * rng(); return range * zeroone; }

(Note that 0x1.0p-32 is a binary floating point constant for 2-32 that we use to convert a random integer value in the range [0..232) into a double in the unit interval; we could instead perform this conversion using ldexp(rng(), -32) but when I benchmarked this approach, it proved to be much slower option.)

This approach is just as biased as the classic modulo approach, but the bias manifests itself differently. For example, if we were choosing numbers in the range [0..52), the numbers 0, 13, 26 and 39 would appear once less often than the others.

This version is also more annoying to generalize to 64 bits because it requires a floating point type with at least 64 bits of mantissa. On x86 Linux and macOS we can use long double to access x86 extended precision floating point which has a 64-bit mantissa, but long double is not universally portable—on some systems long double is equivalent to double .

On the positive side, this approach works better than modulo-based approaches on PRNGs that have weak low order bits.

Integer Multiplication (Biased)

The multiplication method can be adjusted to used fixed-point arithmetic rather than floating point. Essentially we just multiply throughout by 232,

uint32_t bounded_rand(rng_t& rng, uint32_t range) { uint32_t x = rng(); uint64_t m = uint64_t(x) * uint64_t(range); return m >> 32; }

It might appear that this version requires 64-bit arithmetic, but on x86 CPUs, a good compiler will compile this code to a 32-bit mult instruction (which yields two 32-bit outputs, one of which is the return value). We should expect this version to be fast, but biased in exactly the same way as the floating-point multiplication method.

Division with Rejection (Unbiased)

We can rearrange the above fixed-point–multiplication scheme into a division-based scheme. Instead of calculating x * range / 2**32 , we calculate x / (2**32 / range) . Because we are doing integer arithmetic, this version has different rounding behavior and will sometimes generate values outside of the desired range. If we reject those values (i.e., throw them away and generate new ones), we end up with an unbiased technique.

For example, in the case of drawing a card using a 32-bit PRNG, we can generate a 32-bit number and divide it by 232/52 = 82,595,524 to choose our card. This method works when the random value from the 32-bit PRNG is less than 52 × 82595524 = 232/32 – 48. If the random value from the PRNG is one of the final 48 values at the top end of the range of the generator, it must be rejected and another sought.

Our code for this version uses a trick to divide 232 by range without using any 64-bit math. Calculating 2**32 / range directly requires us to represent 232, which is too big (by one!) to be represented in a 32-bit integer. Instead, we observe that for unsigned integers, unary negation of range calculates the positive value 232 – range ; dividing this value by range will produce an answer one less than 2**32 / range .

Thus C++ code to generate numbers via division with rejection is

uint32_t bounded_rand(rng_t& rng, uint32_t range) { // calculates divisor = 2**32 / range uint32_t divisor = ((-range) / range) + 1; if (divisor == 0) // overflow, it's really 2**32 return 0; for (;;) { uint32_t val = rng() / divisor; if (val < range) return val; } }

Of course, this approach requires two slow division-based operations which are typically slower than other arithmetic operations, so we should not expect it to be quick.

Debiased Modulo (Twice) — OpenBSD's Method

We can also apply a rejection approach to debias the classic modulo method. In our playing card example, we will again need to reject 48 values. In this version, rather than reject the last 48 values, we (equivalently) reject the first 48 values.

Here's a C++ implementation of this approach:

uint32_t bounded_rand(rng_t& rng, uint32_t range) { // calculates 2**32 % range uint32_t t = (-range) % range; for (;;) { uint32_t r = rng(); if (r >= t) return r % range; } }

This technique eliminates bias, but it requires two time-consuming modulo operations for each output (and may require the underlying generator to produce more than one number). Thus we should expect it to be about half the speed of the classic biased approach.

The code for OpenBSD's arc4random_uniform (which is also used on OS X and iOS) adopts this strategy.

Debiased Modulo (Once) — Java's Method

Java adopts a different approach to generating a number in a range that manages to use only one modulo operation except in the relatively rare occasions when rejection occurs. The code is

static uint32_t bounded_rand(rng_t& rng, uint32_t range) { uint32_t x, r; do { x = rng(); r = x % range; } while (x - r > (-range)); return r; }

It may take a little thought to figure out why this variant works. Unlike the previous version based on remainders, which removes bias by removing some of the lowest values from the underlying generation engine, this version filters out values from the top part of the engine's range.

Debiased Integer Multiplication — Lemire's Method

Much as we removed bias from the modulo technique, we can also remove bias from the integer multiplication method. This method was devised by Lemire.

uint32_t bounded_rand(rng_t& rng, uint32_t range) { uint32_t t = (-range) % range; do { uint32_t x = rng(); uint64_t m = uint64_t(x) * uint64_t(range); uint32_t l = uint32_t(m); } while (l < t); return m >> 32; }

Bitmask with Rejection (Unbiased) — Apple's Method

Our final approach avoids division and remainder operations entirely. Instead it uses a simple masking operation to get a random number in the range [0..2k) where k is the smallest value such that 2k is greater than the range. If the value is too large for our range, we discard and try another. Code below:

uint32_t bounded_rand(rng_t& rng, uint32_t range) { uint32_t mask = ~uint32_t(0); --range; mask >>= __builtin_clz(range|1); uint32_t x; do { x = rng() & mask; } while (x > range); return x; }

This approach was adopted by Apple when (in the macOS Sierra release) they made their own revision to the code for arc4random_uniform .

Benchmarking the Basic Approaches

We now have several approaches to evaluate. Unfortunately, benchmarking is tricky when we're concerned about the cost of a single division operation. No benchmark is going to capture all the factors that might be in play in your application and there's no guarantee that what is best for your application is necessarily best for mine.

We'll use three benchmarks, and test with a plethora of different PRNGs.

Large-Shuffle Benchmark

Perhaps the most obvious benchmark to run is a shuffle benchmark. In this benchmark we will simulate performing a large shuffle. To sort an array of size N, we must generate numbers in the ranges [0..N), [0..(N-1)), … , [0..1). For this benchmark, we will assume N is as large as possible (232-1 for uint32_t ). The code is

for (uint32_t i = 0xffffffff; i > 0; --i) { uint32_t bval = bounded_rand(rng, i); assert(bval < i); sum += bval; }

Note that we “use” each number by adding it to sum (to avoid it being optimized away) but do not actually do any shuffling to keep the focus on number generation.

To test 64-bit generation, we have an analogous test, but it is not practical to run a test corresponding to shuffling an array of size 264 – 1 (because this larger-sized benchmark would take many thousands of years to run). Instead we cross the entire 64-bit range but generate the same number of outputs as the 32-bit test. The code is

for (uint32_t i = 0xffffffff; i > 0; --i) { uint64_t bound = (uint64_t(i)<<32) | i; uint64_t bval = bounded_rand(rng, bound ); assert(bval < bound); sum += bval; }

Mersenne-Twister Results

The results below show the performance of this benchmark for each of the methods we've discussed, using the Mersenne Twister and testing at both 32-bit (with libstdc++ 's std::mt19937 ) as discussed in this article and analogous 64-bit code (with libstdc++ 's std:mt19937_64 ). Results are the geometric mean of 15 runs with different seeds, then normalized to make the classic biased-mod method have unit time.

It might appear that we have some clear answers about performance—we seem to have a clear progression of better techniques and might also perhaps wonder what the designers of libstdc++ were thinking when they wrote such a terrible implementation for 32-bit sizes—but as is so often the case with benchmarking, the situation is more complex than these results might seem to show. First, there is a risk that our results might somehow be specific to the Mersenne Twister, so we will broaden the set of PRNGs tested. Second, there may be a subtle issue with the benchmark itself. Let's address the first problem first.

Multiple-PRNGs Results

For 32-bit PRNGs, we'll test with arc4_rand32 , chacha8r , gjrand32 , jsf32 , mt19937 , pcg32 , pcg32_fast , sfc32 , splitmix32 , xoroshiro64+ , xorshift*64/32 , xoshiro128+ , and xoshiro128** , and for 64-bit PRNGs we'll use gjrand64 , jsf64 , mcg128 , mcg128_fast , mt19937_64 , pcg64 , pcg64_fast , sfc64 , splitmix64 , xoroshiro128+ , xorshift*128/64 , xoshiro256+ , and xoshiro256* . These collections give us some slow PRNGs and a number of very fast ones.

Here are the results:

We can see some key differences from our early Mersenne-Twister-only results. Faster PRNGs will shift the balance towards the bounding code and thus the differences between our various approaches become more pronounced, especially for 64-bit PRNGs. With this broader collection of PRNGs the libstc++ implementation now no longer seems terrible.

Conclusions

In this benchmark, the biased multiplication-based approach is fastest by a significant margin. There are many situations where bounds will be small relative to the size of the PRNG and performance is absolutely critical. In these situations a negligible amount of bias may be unlikely to have any noticeable effect but PRNG speed will. One such example is random-pivot Quicksort. For unbiased approaches, the bitmask technique also seems promising.

But before we draw too many conclusions, we should recognize that a significant issue with this benchmark is that the bulk of the time is spent with very high bounds, which likely overemphasizes large ranges. Thus we shall move on to a second benchmark.

Small-Shuffle Benchmark

This benchmark is similar to the previous one, but performs a much smaller “array shuffle” (executing it multiple times). The code is

for (uint32_t j = 0; j < 0xffff; ++j) { for (uint32_t i = 0xffff; i > 0; --i) { uint32_t bval = bounded_rand(rng, i); assert(bval < i); sum += bval; } }

Mersenne-Twister Results

Multiple PRNGs Results

Conclusions

This benchmark avoids overemphasis on larger bounds and more accurately reflects real use cases, but now omits larger bounds entirely.

All-Ranges Benchmark

This benchmark aims to avoid the deficiencies of the previous two by doing some testing at each power-of-two size, ensuring that each size is represented but not overrepresented.

for (uint32_t bit = 1; bit != 0; bit <<= 1) { for (uint32_t i = 0; i < 0x1000000; ++i) { uint32_t bound = bit | (i & (bit - 1)); uint32_t bval = bounded_rand(rng, bound); assert(bval < bound); sum += bval; } }

Mersenne-Twister Results

Multiple-PRNGs Results

Conclusions

Many of our conclusions remain. The biased-multiply method is fast if we can tolerate bias, and the bitmask scheme seems like a good all-rounder.

We could be done at this point if we weren't willing to go back, look critically at our code, and make some adjustments.

Small-Constant Benchmark

Although the all-ranges benchmark is probably the most representative one, we'll finish with two other benchmarks for the special case where the compiler knows that the range is a constant and can optimize the algorithms accordingly. Typically the compiler can produce code that is specialized for that particular range (e.g., modulo by a constant can avoid using the mod instruction).

For this benchmark, we use the small constant 52 as our range:

for (uint32_t i = 0; i < 0x80000000; ++i) { uint32_t bval = bounded_rand(rng, 52); assert(bval < 52); sum += bval; }

Mersenne-Twister Results

Multiple-PRNGs Results

Conclusions

Even though the compiler can and does optimize the bitmask version (precalculating the necessary mask) this algorithm rejects much more often than the other unbiased methods, spoiling its performance in this context. Versions based on integer multiplication win out, but the margin between them and versions based on division or modulo is much smaller.

Large-Constant Benchmark

This benchmark is similar to the previous one, but uses range closer to the maximum possible value rather than the minimum possible.

for (uint32_t i = 0; i < 0x80000000; ++i) { uint32_t bval = bounded_rand(rng, uint32_t(-52)); assert(bval < uint32_t(-52)); sum += bval; }

Mersenne-Twister Results

Multiple-PRNGs Results

Conclusions

In this case, the bitmask version is not rejecting more often than the other techniques and so has similar performance. The 64-bit long double floating point code is apparently harder to optimize and actually turns in performance more than 8× worse than classic mod.

Making Improvements

Thus far, all of our debiasing approaches require the use of an extra modulo operation, which causes them to run considerably slower than their biased counterparts. It would be helpful if we could find ways to reduce this overhead.

Faster Threshold-Based Discarding

A few of our algorithms have code involving a threshold, such as

uint32_t bounded_rand(rng_t& rng, uint32_t range) { // calculates 2**32 % range uint32_t t = (-range) % range; for (;;) { uint32_t r = rng(); if (r >= t) return r % range; } }

When range is small compared to the output range of the PRNG, most of the time the number will be much greater than the threshold. Thus if we can have an estimate of the threshold that might be a little high, we can save an expensive modulo operation.

The code below achieves this goal:

uint32_t bounded_rand(rng_t& rng, uint32_t range) { uint32_t r = rng(); if (r < range) { uint32_t t = (-range) % range; while (r < t) r = rng(); } return r % range; }

This change can be applied to both the “Debiased Mod (Twice)” method (above) and the “Debiased Integer Multiply” approach. The idea is Lemire's, who applied it to the latter approach (but not the former).

Large-Shuffle Benchmark Results

This optimization results in notable improvements for the 64-bit benchmark (where modulo is even slower), but actually very slightly worsens performance in the 32-bit benchmark. Despite improvements, the bitmask approach still wins.

Small-Shuffle Benchmark Results

On the other hand, this change significantly speeds the small-shuffle benchmark for both the integer-multiply approach and the two-modulo approach, in each case moving their performance close to that of their unbiased variants. For the two-modulo (OpenBSD) approach, its performance is now almost identical to the single-modulo (Java) approach.

All-Ranges Benchmark Results

We see similar improvement in the all-ranges benchmark.

It seems like we can declare a new winner overall: Lemire's optimized debiased–integer-multiplication method.

Constant Benchmark Results

We see little effect on the small-constant and large-constant benchmarks.

Optimizing Modulo

Usually calculating a % b requires division, but in situations where a < b the result is simply a with no division required, and, when a/2 < b , the result is simply a - b . Thus, rather than calculate

a %= b;

we can instead run

if (a >= b) { a -= b; if (a >= b) a %= b; }

Division is sufficiently expensive that the added overhead of this more complex code may pay off by the time saved by avoiding division.

Large-Shuffle Benchmark Results

Adding this optimization noticeably helps the large-shuffle benchmark. Again, it is more noticeable for 64-bit code where modulo is more expensive. For the two-modulo approach (OpenBSD-style), we show versions with the optimization made for just one of its modulo operations and the other applying it to both.

In this benchmark, the bitmask approach is still the winner but the margin between it and Lemire's approach has narrowed significantly.

Small-Shuffle Benchmark Results

Adding this optimization does not improve performance in the small-shuffle benchmark, so the question becomes whether it adds any noticeable cost. In some cases it does not, but in other cases it does add modest overhead.

All-Ranges Benchmark Results

Similarly, there is little change in the all-ranges benchmark.

Constant Benchmark Results

Again, we see little effect on the small-constant and large-constant benchmarks.

Bonus Results, PRNGs Compared

The main reason for using a constellation of multiple PRNGs to test our number-in-range schemes was to avoid unintentionally skewing the results due to the peculiarities of one particular PRNG scheme. But we can also use the same underlying test results to compare the generation schemes themselves.

32-Bit–Output PRNGs

The graph below shows the performance of different 32-bit generation schemes, averaged across all the techniques and fifteen runs, normalized to the performance of the 32-bit Mersenne Twister:

We can see that there is greater improvement relative to the Mersenne Twister for the small-constant and large-constant benchmarks. This effect should not be a surprise—as we lower the cost of the number-in-a-range aspect, the performance of the generation engine becomes a greater factor in performance.

On the one hand, I'm pleased to see that pcg32_fast is indeed fast. But it also shows why I rarely get too wound up about the performance of most modern high-performance general-purpose PRNGs—the differences between different techniques are very modest. In particular, the fastest four schemes differ in performance by less than 5%, which I consider to be “in the noise”.

64-Bit–Output PRNGs

The graph below shows the performance of different 64-bit generation schemes, averaged across all the techniques and fifteen runs, normalized to the performance of the 32-bit Mersenne Twister. It might seem strange to normalize to the 32-bit Mersenne Twister, but doing so helps us see the added cost of using 64-bit generation when 32-bit generation would have been sufficient.

These results confirm that mcg128_fast is blazingly fast, but again the last four techniques differ by only about 5% so there is very little to choose between the fastest methods. pcg64 and pcg64_fast must necessarily be slower than mcg128_fast because they have 128-bit LCGs and 128-bit MCGs as their base generators. Despite not being the fastest techniques in this collection, pcg64 is still from 20% to 33% faster than the 64-bit Mersenne Twister.

But perhaps more importantly, these results also show that if you don't actually need 64-bit output, using a 64-bit PRNG is usually slower than using a 32-bit PRNG.

Conclusions

From our benchmarks, we can see that switching from a commonly-used PRNG (e.g., the 32-bit Mersenne Twister) to a faster PRNG reduced the execution time of our benchmarks by 45%. But switching from a commonly used method for finding a number in a range to our fastest method reduced our benchmark time by about 66%, in other words reducing execution time to one third of the original time.

The fastest (unbiased) method is Lemire's (with an extra tweak from me). Here it is:

uint32_t bounded_rand(rng_t& rng, uint32_t range) { uint32_t x = rng(); uint64_t m = uint64_t(x) * uint64_t(range); uint32_t l = uint32_t(m); if (l < range) { uint32_t t = -range; if (t >= range) { t -= range; if (t >= range) t %= range; } while (l < t) { x = rng(); m = uint64_t(x) * uint64_t(range); l = uint32_t(m); } } return m >> 32; }

Using Lemire's method will make more of a difference to the performance of most randomized algorithms than switching from a fast generation engine to a slightly faster one.

Appendix: Test Notes

Code for all of the tests is on GitHub. Overall, I tested 23 methods for bounded_rand using 26 different PRNGs (13 32-bit PRNGs and 13 64-bit PRNGs), across two compilers, GCC 8 and and LLVM 6, resulting in 26 * 23 * 2 = 1196 executables, each of which was run with the same 15 seeds, resulting in 1196 * 15 = 17,940 distinct test runs, each combining three benchmarks. I ran tests primarily on a 48-core machine with four Xeon E7-4830v3 CPUs running at 2.1 GHz. Running the full set of tests required a little less than a month of CPU time.