Sometimes, when a software package meets a certain level of maturity (or the desire to hack on it fades sufficiently), it's tempting to consider it "done". Here's a little tale of when done isn't really done.

About a week ago, I received a message from Finlay Thompson asking about my Haskell statistics package: he wanted to know how to generate pseudo-random variables using it. I redirected him from that to my mwc-random package, where my pseudo-random number generation code lives.

The mwc-random package currently provides generators for two widely used distributions: uniform and normal. When I was originally writing it, I paid particular attention to making it high quality, fast, and easy to use.

"High quality" sounds a little nebulous, but in the world of pseudo-random number generation, it's actually pretty well defined: a good PRNG should have a large period (the number of samples you need to pull out of it before it repeats itself, assuming a good seed), and the numbers it generates should withstand stringent tests of apparent independence (simply put, given one datum, you shouldn't be able to predict the next).

One algorithm that satisfies these criteria of quality is George Marsaglia's multiply-with-carry algorithm MWC256 (also known as MWC8222). It has a period of about 28222 (huge enough for all conceivable practical purposes), and stands up well to the "testu01", "diehard" and "big crush" statistical tests.

Due to its simplicity, MWC256 is also very fast, and under appropriate circumstances (e.g. on a 64-bit machine) it can be even faster than the well known Mersenne Twister algorithm (which also fails some statistical tests that MWC256 passes).

The Mersenne Twister is itself available for Haskellers to use, in the form of the mersenne-random package. This package is a wrapper around the Mersenne Twister library, and unfortunately it imposes on its users the underlying library's typically horrible constraints borne of too much Fortran programming: you can only have one PRNG per application, and it can only be used from a single thread! The mwc-random package is less restrictive: fire up as many PRNGs in different threads as you like, and they'll all operate independently. You can also use the PRNGs in either the ST or the IO monad, for further convenience.

When generating normally distributed random variables, the mwc-random package uses an algorithm known as the "modified ziggurat". One of the more popular algorithms for generating normally distributed variables is called the ziggurat, but its popularity belies an ill-understood quality problem: the numbers it generates aren't independent enough! It turns out that they are noticeably correlated. The modified ziggurat is almost as fast, and it sacrifices a little speed in the name of improved independence.

The base-level performance of the random number generators looks like this on my Mac using 32-bit GHC 6.12.3, where the time quoted is to generate a single double-precision floating point number:

Uniform: 142.6 nanoseconds

Normal: 15149 nanoseconds

Where does the question of being "done" or not come in? Well, while poking around tonight, I was a little surprised at the large difference in speed between the uniform and normal PRNGs, so I investigated. The ziggurat algorithm gets its name from the precomputed lookup tables it uses to gain its speed. It turns out that GHC's inliner was being too aggressive with the table-related code, causing the ziggurat tables to be regenerated over and over instead of precomputed just once. Ouch!

One small and very quick change, and the performance of the PRNG for normally distributed variables changed dramatically:

Before: 15149 nanoseconds

After: 246.8 nanoseconds

That's a little over 61 times faster. Not bad for a couple of lines of changes!

As a final note, now that GHC can build 64-bit programs on a Mac, you might wonder how it performs. Here's a comparison between 32-bit and 64-bit versions of GHC 7.0.2 (times in nanoseconds):

type 32-bit 64-bit speedup uniform Double 148 28 5.3 uniform Int32 53 16.7 3.2 normal Int32 252 62 4.1

Those are some pretty nice performance improvements! Of course, not all applications come in the form of nice tight numeric kernels, so don't take it as given that you'll see improvements like this in your code.