The MWC64X Random Number Generator

The MWC64X RNG is a small, fast, high-quality, skippable uniform random number generator designed for use with GPUs via OpenCL. The main features of MWC64X are:

Small State: Each RNG requires just two 32-bit words of register storage, reducing register pressure on the rest of the program.

Speed: Each random number takes five or six instructions (assuming a post-2010 GPU with 32-bit integer multiply, like Fermi or Cayman).

High Quality: MWC64X passes every statistical test I've applied to it, unlike most other small and/or fast generators. Oh, and I'm talking about serious tests here, not just dicking about with Diehard and the NIST tests.

Stream-Splitting: The RNG is skippable (which is not that uncommon), using very little working storage (which is less common), and I've also actually provided the code to do it (which is somewhat rare).

Portability: The instructions used by MWC64X are directly supported on all high-performance GPUs (i.e. AMD and NVidia), and are also supported in both the scalar and SSE2 instruction sets for x86.

A reference implementation in C is:

uint32_t MWC64X(uint64_t *state)
{
    uint32_t c=(*state)>>32, x=(*state)&0xFFFFFFFF;
    *state = x*((uint64_t)4294883355U) + c;
    return x^c;
}

The equivalent OpenCL version is:

uint MWC64X(uint2 *state)
{
    enum { A=4294883355U };
    uint x=(*state).x, c=(*state).y;  // Unpack the state
    uint res=x^c;                     // Calculate the result
    uint hi=mul_hi(x,A);              // Step the RNG
    x=x*A+c;
    c=hi+(x<c);
    *state=(uint2)(x,c);              // Pack the state back up
    return res;                       // Return the next result
}

Note that to achieve a small fast generator there are certain tradeoffs - these won't matter to most people, but if you are doing hard-core multi-GPU Monte-Carlo then bear them in mind:

Period: The period is "only" 2^63. This is reasonable, but in some applications you could exhaust the period by having huge numbers of GPUs in parallel, or by running simulations for days. It also means you can't use random seeding, although because I've provided skipping code that shouldn't matter.

Lack of Theory: The generator is based on a type of generator with known theoretical quality and period (the Multiply-With-Carry), but then modified to drastically improve empirical quality while retaining the period. I can't say anything about its theoretical structure, but I can demonstrate that it is empirically much more random than generators with supposedly good theoretical properties.

I've put together a package containing the OpenCL implementations, plus some OpenCL/C++ test benches which can make sure that your particular OpenCL vendor hasn't got a broken compiler or something. The package provides four variants of the generator, which produce either one, two, four, or eight streams from a single generator structure. The plain one-stream version is most useful on NVidia platforms, while the four-stream vector version is useful on AMD platforms, as it lets the compiler fill the VLIW slots.

Each random number generator is represented by an opaque struct which contains the state of the RNG:

typedef ... mwc64x_state_t;
typedef ... mwc64xvec2_state_t;
typedef ... mwc64xvec4_state_t;
typedef ... mwc64xvec8_state_t;

uint  MWC64X_NextUint(mwc64x_state_t *s);
uint2 MWC64XVEC2_NextUint2(mwc64xvec2_state_t *s);
uint4 MWC64XVEC4_NextUint4(mwc64xvec4_state_t *s);
uint8 MWC64XVEC8_NextUint8(mwc64xvec8_state_t *s);

The package provides a stream splitting mechanism for assigning sub-sequences to a particular stream. To use it you need to choose the maximum possible number of samples you will take from a single stream - for most cases 2^40 is a reasonable choice. The function MWC64X_SeedStreams will then initialise each stream to be offset by that amount from the previous stream:

void MWC64X_SeedStreams(mwc64x_state_t *s, ulong baseOffset, ulong perStreamOffset);
void MWC64XVEC2_SeedStreams(mwc64xvec2_state_t *s, ulong baseOffset, ulong perStreamOffset);
void MWC64XVEC4_SeedStreams(mwc64xvec4_state_t *s, ulong baseOffset, ulong perStreamOffset);
void MWC64XVEC8_SeedStreams(mwc64xvec8_state_t *s, ulong baseOffset, ulong perStreamOffset);

The nice thing about this is that you can create simulations which give exactly the same result, no matter how many work items you launch with, or whether you use the single-stream or multiple-stream RNG. For example, this function estimates pi using n 2D points, and gives the same result no matter how many work items you use (assuming n is divisible by get_global_size(0)):

__kernel void EstimatePi(uint n, ulong baseOffset, __global uint *acc)
{
    mwc64x_state_t rng;
    ulong samplesPerStream=n/get_global_size(0);
    MWC64X_SeedStreams(&rng, baseOffset, 2*samplesPerStream);
    uint count=0;
    for(uint i=0;i<samplesPerStream;i++){
        ulong x=MWC64X_NextUint(&rng);
        ulong y=MWC64X_NextUint(&rng);
        ulong x2=x*x;
        ulong y2=y*y;
        if(x2+y2 >= x2)   // True iff x2+y2 did not wrap, i.e. x*x+y*y < 2^64
            count++;
    }
    atomic_add(acc, count);
}

We could do exactly the same thing with the 4 stream version, and will get exactly the same answer:

__kernel void EstimatePi(uint n, ulong baseOffset, __global uint *acc)
{
    mwc64xvec4_state_t rng;
    ulong samplesPerStream=n/(4*get_global_size(0));
    MWC64XVEC4_SeedStreams(&rng, baseOffset, 2*samplesPerStream);
    uint count=0;
    for(uint i=0;i<samplesPerStream;i++){
        ulong4 x=convert_ulong4(MWC64XVEC4_NextUint4(&rng));
        ulong4 y=convert_ulong4(MWC64XVEC4_NextUint4(&rng));
        ulong4 x2=x*x;
        ulong4 y2=y*y;
        long4 hit=-(x2+y2 >= x2);   // Vector compare gives -1 per hit, so negate
        count+=hit.x+hit.y+hit.z+hit.w;
    }
    atomic_add(acc, count);
}

MWC64X is based on a Multiply-With-Carry (MWC) generator, which is a type of generator invented by George Marsaglia. For our purposes we can describe the MWC generator using some simple pseudo-code and 64-bit integers:

enum{ A = 4294883355UL };   // A "special" constant
ulong x;                    // Holds the current RNG state
x=(x&0xFFFFFFFF)*A+(x>>32); // Advance RNG to the next state

If we only have 32-bit operations available (as is efficient on a GPU), the same update can be performed with the state split into two words:

uint x, c;           // 64-bit state split into two 32-bit words
uint hi=mul_hi(x,A);
x=x*A+c;
c=hi+(x<c);          // The final term deals with overflow of x

Broken down into individual operations, the update is:

uint hi=mul_hi(x,A);  // High half of x*A
uint lo=x*A;          // Low half of x*A
x=lo+c;               // New x, may wrap past 2^32
c=x<c;                // Did it wrap? (the carry bit)
c=c+hi;               // New c

So an MWC allows us to implement a 64-bit RNG that is really fast on a GPU, but there is one problem: the MWC is equivalent to a prime-modulus LCG, but unfortunately you don't have much control over the LCG multiplier. Usually one chooses the LCG multiplier to optimise the spectral distribution in the first ten or so dimensions, but here we just have to deal with what the MWC maths allows, and sadly it only allows multipliers that provide quite a poor distribution of consecutive triples (in other dimensions it is pretty decent). In practice this means that the plain MWC fails tests looking at triples, and is not appropriate for use in its raw form.

However, one of the nice things about the MWC in the GPU-optimised form is that it gives you two components (x and c) to work with. In addition, both components completely change on each update step. So what we can do is combine these two components to effectively destroy the spectral structure of the MWC - we lose the theoretical guarantees that the spectral test gives us, but in exchange we gain extremely good empirical quality.

There are lots of ways we could combine x and c, but the most obvious choice is also extremely effective - we can simply xor them together. This has the nice intuitive argument that we are combining the components using a different "type" of maths (i.e. GF(2)) than was used to generate them, and also behaves very well statistically. So if we hold the state of the RNG in a uint2 (i.e. a two element OpenCL vector), the full MWC64X update step is:

uint MWC64X(uint2 *state)
{
    enum{ A=4294883355U };
    uint x=(*state).x, c=(*state).y;  // Unpack the state
    uint res=x^c;                     // Calculate the result
    uint hi=mul_hi(x,A);              // Step the RNG
    x=x*A+c;
    c=hi+(x<c);
    *state=(uint2)(x,c);              // Pack the state back up
    return res;                       // Return the next result
}

To be clear: I have no theoretical quality measures for this RNG. All I can say is that it has a period of 2^63, and it passes every statistical test that I am aware of, so there are no known empirical flaws. It easily passes the Diehard and NIST tests, but they are pretty out of date and irrelevant in the face of modern simulations consuming trillions of samples.

For a realistic test we turn to TestU01, which provides a large set of statistical tests, and also defines two extremely stringent test batteries, Crush and BigCrush, which consume huge numbers of samples. MWC64X passes both batteries with no suspicious p-values.

One of the problems with many RNGs is that the low order bits are less random than the high-order bits, so to test that I also ran BigCrush on the bit-reversed output of MWC64X. It still passed all the tests, so that means one can get away with doing things like:

uint index=MWC64X(pState) & 0xF;

Finally, one of the things that worried me a little was the straight uniformity of the output (as the range of the c component is slightly deficient): if we histogram the output, is it flat? To test this I bucketed samples according to their top 20 bits, i.e. created a histogram with about one million buckets. For each power-of-two sample count up to 2^40 the histogram was filled with consecutive samples, and for each sample size chi-squared tests were performed at 2^20 buckets, and also after aggregating down to as few as 2^4 buckets. In total this required 2^41 consecutive samples, and none of the p-values was suspicious, so the uniformity seems to be pretty damn good and I was probably worrying about nothing.

Stream skipping and/or stream splitting are ways of carving the output of a random number generator into multiple distinct streams. An RNG can be described in terms of two parts: a state transition function s_i = f(s_{i-1}), and an output function x_i = g(s_i). Given an initial state s_0, we can define the sequence of states as s_1 = f(s_0), s_2 = f(s_1), ..., and the sequence of outputs as x_0 = g(s_0), x_1 = g(s_1), .... So we end up with a sequence of outputs x_0, x_1, x_2, ... Eventually this sequence must repeat; for example, with MWC64X we find that at p ~= 2^63 we have x_p = x_0, x_{p+1} = x_1, etc.

The idea of stream splitting is to take n samples from the large sequence, and to give each of k processors a contiguous sub-sequence of the n samples. For example, assume that we want to run a simulation using n = 2^40 random samples, and wish to process it using k = 2^10 processors (e.g. work items); then we need to give each processor m = n/k = 2^30 consecutive samples.

For processor 0, that is easy: we can assign it the sub-stream x_0 ... x_{m-1}, so the processor just starts in the first state and sequentially generates m random numbers. However, for processor 1 we want it to process samples x_m ... x_{2m-1}, i.e. the samples immediately following those consumed by processor 0 - but to get to sample x_m knowing only the starting state s_0, processor 1 would have to perform 2^30 RNG steps before it could start work. The worst situation is processor k-1, which must generate and throw away almost 2^40 samples before it could start any real work. Such an approach would be pointless, and extremely slow.

Fortunately, we can use the relationship between the MWC and LCG generators to implement a skip function s_{i+d} = skip(s_i, d), which is able to jump d steps forward in far fewer than d steps (approximately log2(d) operations, although each operation is more complicated than the RNG step function). This allows each processor j to jump directly to s_{j*m} = skip(s_0, j*m), without having to generate all the intermediate states.

A particularly nice feature of the MWC64X skipping method for GPUs is that it requires very little storage - around 10 or so 64-bit registers, and no local or global RAM. This makes it much easier to manage skipping/seeding, as all work-items can skip at the same time, and there is no need to allocate or manage additional memory. The tradeoff is that the MWC64X skip function is a bit more computationally intensive than some other methods (for example those for xor-based generators), but is still fast (particularly for small distances), and skipping is rarely a bottleneck.

In the source code, skipping is provided by a skip function specialised for each generator:

void MWC64X_Skip(mwc64x_state_t *s, ulong distance);
void MWC64XVEC2_Skip(mwc64xvec2_state_t *s, ulong distance);
void MWC64XVEC4_Skip(mwc64xvec4_state_t *s, ulong distance);
void MWC64XVEC8_Skip(mwc64xvec8_state_t *s, ulong distance);

More useful for most people are MWC64X_SeedStreams and its variants, which automatically take into account the number of work items and streams per generator. See the source code overview, or look at the estimate_pi example in the source code, for more information.
