Update (08/19/15): I bought an Nvidia GPU on eBay, so I’ll run some performance tests on it in the next couple weeks.

Right now, my research is centered on continuous-time Markov chains (CTMCs), in particular using CTMCs to model networks of biochemical reactions. For example, the state of the Markov chain could correspond to the number of mRNA molecules, proteins, ribosomes, etc. I’m not typically concerned about whether that’s in vitro or in vivo; I don’t like to get too far ahead of myself. For the sake of simplicity, just imagine the biochemistry happening in a well-mixed solution inside a small container.

Each time we simulate a path of the Markov chain, that’s the simulation equivalent of “measuring” what’s happening inside the container and getting a time series. Since the CTMC is stochastic, each time we simulate the time series, we likely get something different. Here’s an example of what I’m talking about:

Time series generated with StochKit and plotted with R

That’s a single time series of a dimerization model. There is one type of molecule, X, which can dimerize and dissociate. Two reactions take place in the model: 2X → X2 and X2 → 2X, where X2 represents the dimer.

At this point, you might be wondering how Poisson random variates come into the mix. Well, simulating each of those time series is typically done by simulating every single reaction that occurs. This is a problem when the model is large and we need thousands or millions of simulations to estimate things such as the expectation or variance of the CTMC’s path over time.

There’s been a lot of work on this problem, but in this post I’m going to focus on one method, namely “tau-leaping”. I’ll gloss over the details, but it suffices to say that we can generate approximate time series by assuming that, for each reaction, the number of times it occurs in a fixed time interval follows a Poisson distribution. The parameter of the distribution is given by the reaction’s “propensity”, which depends on the current number of molecules. Mathematically, that means

K_j ~ Poisson(a_j(x) τ),

where K_j is the number of times reaction j occurs over a time interval of length τ, and a_j is reaction j’s propensity function evaluated at the current state x. What’s important for us to know for this post is that we need a ton of Poisson random variates (RVs), and they are almost always generated with different parameters. Furthermore, generating those variates makes up a huge part of the work in the simulation algorithm.

While writing this post, I wanted to double-check whether Nvidia or AMD have produced code to generate Poisson RVs for CUDA and OpenCL, respectively. I don’t think I found anything at the time I did this initial research (I think it was about a year ago). Right now, AMD links to a GitHub project for generating uniform RVs, which is convenient, but I already knew there was code to do exactly that; in fact, that GitHub library uses it to do the dirty work.

I have an AMD GPU, so I can only run OpenCL code, but I still wanted to see if Nvidia came up with anything. It turns out that they include a Poisson RV generator in their cuRAND library. Naturally, I wanted to see how they did it. It’s no surprise that you can generate Poisson RVs on a GPU; the question is how efficiently. As anyone who has some experience writing code for GPUs knows, porting algorithms written for a CPU is highly non-trivial, and they often need to be entirely reworked in order to run efficiently. Based on the cuRAND documentation, I don’t think their solution is efficient, at least given my requirements:

Can generate Poisson RVs from uniform RVs

Handles any magnitude of parameter

Generates RVs with many different parameters

Does all of the above efficiently on a GPU

cuRAND has two functions for generating Poisson RVs. The first is:

curand_poisson (curandState_t *state, double lambda)

which they state “requires pre-processing on the host”. Upon further reading, the CPU needs to call the cuRAND API function curandCreatePoissonDistribution every time the parameter changes. That’s not going to work for me. But remember when I said cuRAND has two functions for generating Poisson RVs? Here is the documentation for the second one:

curand_poisson4 (curandStatePhilox4_32_10_t *state, double lambda)

This function generates four Poisson random variates at once. The first parameter is just the state of the Random123-based generator, and the single lambda argument tells us that all four variates must share the same parameter. Once again, this function doesn’t meet my requirements: I need to generate Poisson random variates with different parameters. Nonetheless, I’d be really interested to see the source code of those functions.

I downloaded the CUDA toolkit, and I found the relevant file, curand_poisson.h. The first function, curand_poisson, uses Knuth’s algorithm (discussed later) when the parameter, lambda, is less than 64, and uses a normal approximation when lambda is greater than 4000. In between 64 and 4000, it uses a rejection algorithm “based on the gammainc approximation” (the quote is from their code comments). The second function uses the exact same algorithms, but generates four RVs instead of one. I believe they have a specific function for generating four RVs because Random123 generates four uniform RVs every time you use it.

To summarize, Nvidia’s implementation doesn’t really handle RVs with different parameters, and later I’ll explain why I believe their algorithms are likely very inefficient. I would absolutely run some tests on their library, but I don’t have an Nvidia GPU. Maybe I’ll find a good deal on eBay in time for part 2.

Anyways, in terms of solving the problem at hand, it looks like we’re going to need to get creative and use those straps attached to our boots. This is actually good news. It’s research, after all.

Since we need a novel solution, it’s time to educate ourselves and learn about the state-of-the-art. With a simple Google search, we can easily find the canonical means of generating Poisson RVs. The “Poisson Distribution” Wikipedia page shows a simple algorithm taken from Knuth’s Seminumerical Algorithms. The intuition behind the algorithm is that we can think of a Poisson random variable as describing the number of “events” occurring for a Poisson process inside of a fixed time interval. The propensity (i.e. parameter of the distribution) determines how quickly the events arrive. The interarrival times of the events are given by an exponentially-distributed random variable with the same parameter as the Poisson random variable. At the end of the day, the higher the propensity, the shorter the interarrival times, which means more events occur within a time interval.

Of course, research into generating Poisson RVs didn’t stop with the above algorithm. In true 21st century fashion, I used Google to see what algorithms people have developed over the years. This blog was the most useful to me. It turns out that a 1993 paper by Wolfgang Hörmann contains the state-of-the-art algorithm.

The important thing to note about Hörmann’s algorithm and the others listed in that blog is that they rely on rejection, a commonly used method for generating random variates from many different distributions. Rejection methods can be very efficient, but remember that we are trying to generate Poisson RVs on a GPU, not a CPU! That’s an important distinction, because rejection algorithms (and Knuth’s) rely on branch statements, namely while, if, and for. When a GPU is executing a program (a kernel, in OpenCL parlance), all of the threads within a warp (i.e., a group) execute the same instructions in lock-step, just on different pieces of data (see SIMD). Branching is a problem when some threads take the branch and others do not, because many of the threads need to stall while the others execute instructions. That is why I think Nvidia’s implementation is likely very inefficient except when the parameter is larger than 4000, because then they use a normal approximation that doesn’t involve any branching. It is also a fine implementation when the parameter is much smaller than one, because then the Poisson random variate is often just zero, so Knuth’s algorithm terminates immediately for every thread in a warp. As a result, there’s no stalling.

At that point, my gut told me to at least find and test an alternative algorithm for generating Poisson RVs, specifically an algorithm that works well on a GPU. The goal was to find or develop an algorithm that

Can generate Poisson RVs from uniform RVs

Handles parameters of different magnitudes

Generates RVs with many different parameters

Does not use branching

Is preferably exact, but may sacrifice some accuracy (after all, the simulation method itself is only approximate)

After some more research (Googling), I found some really useful resources that led me to a potential solution. The key to a good story is some suspense, so I’ll leave it at that for now. In part 2, I’ll present the solution I came up with, its shortcomings, and also some numerical results.