What I want to describe in this post is how to solve stochastic PDEs in Julia using GPU parallelism. I will go from start to finish, describing how to use the type-genericness of the DifferentialEquations.jl library in order to write a code that uses within-method GPU-parallelism on the system of PDEs. This is mostly a proof of concept: the most efficient integrators for this problem are not compatible with GPU parallelism yet, and the GPU parallelism isn't fully efficient yet. However, I thought it would be nice to show an early progress report showing that it works and what needs to be fixed in Base Julia and various libraries for us to get the full efficiency.

Edit May 2019

As of DifferentialEquations.jl v6.4.0, this is no longer a proof of concept. The whole library, including implicit solvers with GMRES, and likewise the solvers for SDEs, DAEs, DDEs, etc., is GPU-compatible with a fast form of broadcast. This has been optimized and made efficient. The only methods which are not are the non-native Julia ones, like Sundials. Some of this blog post has been edited to match the newer version of the GPU code.

Our Problem: 2-dimensional Reaction-Diffusion Equations

The reaction-diffusion equation is a PDE commonly handled in systems biology which is a diffusion equation plus a nonlinear reaction term. The dynamics are defined as:

$$u_t = D \Delta u + f(u)$$

But this doesn't need to only have a single "reactant" $u$: this can be a vector of reactants, and $f$ is then the nonlinear vector equation describing how these different pieces react together. Let's settle on a specific equation to make this easier to explain. Let's use a simple model of a 3-component system where A can diffuse through space to bind with the non-diffusive B to form the complex C (also non-diffusive, assume B is too big and gets stuck in a cell which causes C=A+B to be stuck as well). Other than the binding, we make each of these undergo a simple birth-death process, and we write down the equations which result from mass-action kinetics. If this all is meaningless to you, just understand that it gives the system of PDEs:

$$\begin{aligned}
A_t &= D \Delta A + \alpha_1(x) - \beta_1 A - r_1 A B + r_2 C \\
B_t &= \alpha_2 - \beta_2 B - r_1 A B + r_2 C \\
C_t &= \alpha_3 - \beta_3 C + r_1 A B - r_2 C
\end{aligned}$$

One addition that was made to the model is that we let $\alpha_1(x)$ be the production of $A$, and we let that be a function of space so that it is only produced on one side of our domain. Let's make it a constant when x>80, and 0 otherwise, and let our spatial domain be $x \in [0,100]$ and $y \in [0,100]$.

This model is spatial: each reactant is defined at each point in space, and all of the reactions are local, meaning that the reaction term at spatial point $(x,y)$ only uses the values $A(x,y)$, $B(x,y)$ and $C(x,y)$ at that same point. This is an important fact which will come up later for parallelization.

Discretizing the PDE into ODEs

In order to solve this via a method of lines (MOL) approach, we need to discretize the PDE into a system of ODEs. Let's do a simple uniformly-spaced grid finite difference discretization. Choose $dx = 1$ and $dy = 1$ so that we have 100*100=10000 points for each reactant. Notice how fast that grows! Put the reactants in a matrix such that A[i,j] = $A(x_j, y_i)$, i.e. the columns of the matrix are the $x$ values and the rows are the $y$ values (this way looking at the matrix is essentially like looking at the discretized space).

So now we have 3 matrices (A, B, and C) for our reactants. How do we discretize the PDE? In this case, the diffusion term simply becomes a tridiagonal matrix $M$ where $[1, -2, 1]$ is the central band. You can notice that $MA$ performs diffusion along the columns of $A$, and so this is diffusion along $y$. Similarly, $AM$ flips the indices and thus does diffusion along the rows of $A$, making this diffusion along $x$. Thus $D(M_y A + A M_x)$ is the discretized Laplacian (we could have separate diffusion constants $D_x$ and $D_y$ if we want by using different constants on the $M$, but let's not do that for this simple example. I'll leave that as an exercise for the reader). I enforced a Neumann boundary condition with zero derivative (also known as a no-flux boundary condition) by reflecting the changes over the boundary. Thus the derivative operator is generated as:

    using LinearAlgebra  # provides Tridiagonal

    const Mx = Tridiagonal([1.0 for i in 1:N-1], [-2.0 for i in 1:N], [1.0 for i in 1:N-1])
    const My = copy(Mx)

    # Do the reflections, different for x and y operators
    Mx[2,1] = 2.0
    Mx[end-1,end] = 2.0
    My[1,2] = 2.0
    My[end,end-1] = 2.0

I also could have done this using the DiffEqOperators.jl library, but I wanted to show what it truly is at its core.
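To sanity-check that these matrices really encode the reflected five-point stencil, here is a minimal sketch on a tiny grid. The grid size `N = 5` and the helper `stencil_laplacian` are my own illustrative choices, not part of the code used later in the post; the sketch only compares `My*A + A*Mx` against a hand-written loop.

```julia
using LinearAlgebra

N = 5  # small grid so the matrices are easy to inspect (assumed size)

# Second-difference matrices with the reflected (no-flux) boundary entries
Mx = Array(Tridiagonal(ones(N - 1), fill(-2.0, N), ones(N - 1)))
My = copy(Mx)
Mx[2, 1] = 2.0
Mx[end-1, end] = 2.0
My[1, 2] = 2.0
My[end, end-1] = 2.0

# Reference: 5-point stencil where an out-of-range neighbour is reflected
# back into the grid (hypothetical helper, only for this check)
function stencil_laplacian(A)
    n, m = size(A)
    L = similar(A)
    reflect(k, len) = k < 1 ? 2 : (k > len ? len - 1 : k)
    for j in 1:m, i in 1:n
        L[i, j] = A[reflect(i - 1, n), j] + A[reflect(i + 1, n), j] +
                  A[i, reflect(j - 1, m)] + A[i, reflect(j + 1, m)] -
                  4 * A[i, j]
    end
    return L
end

A = rand(N, N)
lap_matrix = My * A + A * Mx       # matrix form used in the post
lap_loop = stencil_laplacian(A)    # explicit loop form
flat = My * ones(N, N) + ones(N, N) * Mx   # Laplacian of a constant field
```

If `lap_matrix` and `lap_loop` agree, the matrix form and the loop form are the same discretization. The constant field should come out with a zero Laplacian everywhere, which is what a no-flux boundary ought to give.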

Since all of the reactions are local, we only have each point in space react separately. Thus this represents itself as element-wise equations on the reactants. Thus we can write it out quite simply. The ODE which then represents the PDE is thus in pseudo Julia code:

    DA = D * (M * A + A * M)
    @. DA + α₁ - β₁ * A - r₁ * A * B + r₂ * C
    @. α₂ - β₂ * B - r₁ * A * B + r₂ * C
    @. α₃ - β₃ * C + r₁ * A * B - r₂ * C

Note here that I am using α₁ as a matrix (or row-vector, since that will broadcast just fine) where every point in space with x<80 has a zero entry, and all of the others have the same constant. The other coefficients are all scalars. How do we do this with the ODE solver?
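As a quick illustration of the "row-vector will broadcast just fine" remark, here is a small sketch. The grid size, the threshold of 3, and the names `α_row` and `α_full` are assumptions made up for this example; the point is only that a 1×N row and the full N×N matrix give the same result under `@.`.

```julia
N = 4                  # tiny grid for illustration (assumed)
A = rand(N, N)
xs = 1:N               # x index of each column

# Production term: zero for small x, constant otherwise (threshold assumed)
α_row = reshape(1.0 .* (xs .>= 3), 1, N)   # 1×N row vector, one entry per x
α_full = repeat(α_row, N, 1)               # N×N matrix with identical rows

# Broadcasting stretches the single row across every row of A
out_row = @. α_row - 2.0 * A
out_full = @. α_full - 2.0 * A
```

Both results are N×N and identical, so the cheaper row-vector form is enough whenever the coefficient only depends on x.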

Our Representation via Views of 3-Tensors

We can represent our problem with a 3-dimensional tensor, taking each 2-dimensional slice as our (A,B,C). This means that we can define:

    u0 = zeros(N,N,3)

Now we can decompose it like:

    A = @view u[:,:,1]
    B = @view u[:,:,2]
    C = @view u[:,:,3]
    dA = @view du[:,:,1]
    dB = @view du[:,:,2]
    dC = @view du[:,:,3]

These views will not construct new arrays and will instead just be pointers to the (contiguous) memory pieces, so this is a nice and efficient way to handle this. Together, our ODE using this tensor as its container can be written as follows:

    function f(du,u,p,t)
      # Views into the state and derivative tensors (no copies)
      A = @view u[:,:,1]
      B = @view u[:,:,2]
      C = @view u[:,:,3]
      dA = @view du[:,:,1]
      dB = @view du[:,:,2]
      dC = @view du[:,:,3]
      # Discretized diffusion: My acts on columns (y), Mx on rows (x)
      DA = D*(My*A + A*Mx)
      @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
      @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
      @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
    end

where this is using @. to do inplace updates on our du to say how the full tensor should update in time. Note that we can make this more efficient by adding some cache variables to the diffusion matrix multiplications and using mul!, but let's ignore that for now.
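To see why the views matter for those in-place updates, here is a minimal sketch on a tiny tensor (the sizes and values are arbitrary choices for illustration). Writing through a view changes the parent array, while plain slicing produces an independent copy.

```julia
u = zeros(3, 3, 2)

A = @view u[:, :, 1]   # no copy: A refers to the first slice of u
B = u[:, :, 2]         # plain indexing makes a copy of the second slice

A .= 1.0               # writes through to u[:, :, 1]
B .= 5.0               # only changes the copy, u[:, :, 2] is untouched
```

This is the reason `@. dA = ...` inside `f` is enough to fill in `du`: `dA` is just a window onto `du`, so nothing has to be copied back.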

Together, the ODE which defines our PDE is thus:

    prob = ODEProblem(f,u0,(0.0,100.0))
    sol = solve(prob,ROCK2())

if I want to solve it on $t \in [0,100]$. Done! The solution gives back our tensors (and interpolates to create new ones if you use sol(t)). We can plot it in Plots.jl

and see the pretty gradients. Using this 2nd order stabilized explicit Runge-Kutta method (ROCK2) we solve this equation in about 40 seconds. That's okay.

Some Optimizations

There are some optimizations that can still be done. When we do A*B as matrix multiplication, we create another temporary matrix. These allocations can bog down the system. Instead we can pre-allocate the outputs and use the in-place function mul! to make better use of memory. The easiest way to store these cache arrays is as constant globals, but you can use closures (anonymous functions which capture data, i.e. (x)->f(x,y)) or call-overloaded types to do it without globals. The globals way (the easy way) is simply:

    const MyA = zeros(N,N)
    const AMx = zeros(N,N)
    const DA = zeros(N,N)

    function f(du,u,p,t)
      A = @view u[:,:,1]
      B = @view u[:,:,2]
      C = @view u[:,:,3]
      dA = @view du[:,:,1]
      dB = @view du[:,:,2]
      dC = @view du[:,:,3]
      mul!(MyA,My,A)
      mul!(AMx,A,Mx)
      @. DA = D*(MyA + AMx)
      @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
      @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
      @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
    end
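To make the allocation point concrete, here is a small sketch comparing `M * A` with `mul!` into a pre-allocated cache. The matrix size and the wrapper names `alloc_version` and `inplace_version!` are assumptions for the example, and the exact byte counts will depend on your Julia version, so I only compare them against each other.

```julia
using LinearAlgebra

N = 50                      # assumed size, small enough to run instantly
M = rand(N, N)
A = rand(N, N)
cache = zeros(N, N)         # pre-allocated output

alloc_version(M, A) = M * A                        # allocates a new N×N matrix
inplace_version!(cache, M, A) = mul!(cache, M, A)  # writes M*A into cache

# Call once first so compilation is not counted in the measurement
alloc_version(M, A)
inplace_version!(cache, M, A)

bytes_alloc = @allocated alloc_version(M, A)
bytes_inplace = @allocated inplace_version!(cache, M, A)
```

The in-place call should allocate far less than the out-of-place one, and `cache` holds the same product. Inside an ODE right-hand side that is called thousands of times, that difference adds up.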

For reference, the closure version looks like:

    MyA = zeros(N,N)
    AMx = zeros(N,N)
    DA = zeros(N,N)

    function f_full(du,u,p,t,MyA,AMx,DA)
      A = @view u[:,:,1]
      B = @view u[:,:,2]
      C = @view u[:,:,3]
      dA = @view du[:,:,1]
      dB = @view du[:,:,2]
      dC = @view du[:,:,3]
      mul!(MyA,My,A)
      mul!(AMx,A,Mx)
      @. DA = D*(MyA + AMx)
      @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
      @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
      @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
    end

    f = (du,u,p,t) -> f_full(du,u,p,t,MyA,AMx,DA)

and a call overloaded type looks like:

    struct MyFunction{T} <: Function
      MyA::T
      AMx::T
      DA::T
    end

    # Now define the overload
    function (ff::MyFunction)(du,u,p,t)
      # This is a function which references itself via ff
      A = @view u[:,:,1]
      B = @view u[:,:,2]
      C = @view u[:,:,3]
      dA = @view du[:,:,1]
      dB = @view du[:,:,2]
      dC = @view du[:,:,3]
      mul!(ff.MyA,My,A)
      mul!(ff.AMx,A,Mx)
      @. ff.DA = D*(ff.MyA + ff.AMx)
      @. dA = ff.DA + α₁ - β₁*A - r₁*A*B + r₂*C
      @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
      @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
    end

    MyA = zeros(N,N)
    AMx = zeros(N,N)
    DA = zeros(N,N)

    f = MyFunction(MyA,AMx,DA)
    # Now f(du,u,p,t) is our function!

These last two ways enclose the pointer to our cache arrays locally but still present a function f(du,u,p,t) to the ODE solver.

Now since PDEs are large, many times we don't care about getting the whole timeseries. Using the output controls from DifferentialEquations.jl, we can make it only output the final timepoint.

    sol = solve(prob,ROCK2(),progress=true,save_everystep=false,save_start=false)

Also, if you're using Juno this'll give you a nice progress bar so you can track how it's going.

Quick Note About Performance

We are using the ROCK2 method here because it's a method for stiff equations with eigenvalues that are real-dominated (as opposed to dominated by the imaginary parts). If we wanted to use a more conventional implicit ODE solver, we would need to make use of the sparsity pattern. This Gist shows how to use SparseDiffTools.jl to perform matrix coloring and specialize the ODE solver on the sparsity pattern. It turns out that ROCK2 is more efficient anyways (and doesn't require sparsity handling), so we will keep this setup.
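To give a feel for what "stiff with real-dominated eigenvalues" means here, this is a rough sketch that looks at the spectrum of the 1D diffusion operator `D * M` on a small grid. The grid size `N = 20` is an assumption to keep `eigvals` cheap; the reaction terms are ignored, so this only characterizes the diffusion part.

```julia
using LinearAlgebra

N = 20          # small grid (assumed) so the eigenvalue solve is cheap
D = 100.0       # diffusion constant from the post

# Second-difference matrix with the no-flux reflection at both ends
M = Array(Tridiagonal(ones(N - 1), fill(-2.0, N), ones(N - 1)))
M[1, 2] = 2.0
M[end, end-1] = 2.0

λ = eigvals(D * M)

max_imag = maximum(abs.(imag.(λ)))   # size of the imaginary parts
most_negative = minimum(real.(λ))    # fastest decaying mode
least_negative = maximum(real.(λ))   # slowest mode (the constant field)
```

The imaginary parts come out as (numerically) zero, while the real parts spread from about 0 down to about `-4D`. The 2D operator sums one eigenvalue from each direction, so its spectrum is still real and reaches roughly `-8D`. A wide real negative spectrum like this is the setting stabilized explicit methods such as ROCK2 are designed for.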

The Full ODE Code

As a summary, here's a full PDE code:

    using OrdinaryDiffEq, RecursiveArrayTools, LinearAlgebra

    # Define the constants for the PDE
    const α₂ = 1.0
    const α₃ = 1.0
    const β₁ = 1.0
    const β₂ = 1.0
    const β₃ = 1.0
    const r₁ = 1.0
    const r₂ = 1.0
    const D = 100.0
    const γ₁ = 0.1
    const γ₂ = 0.1
    const γ₃ = 0.1
    const N = 100
    const X = reshape([i for i in 1:100 for j in 1:100],N,N)
    const Y = reshape([j for i in 1:100 for j in 1:100],N,N)
    const α₁ = 1.0.*(X.>=80)

    # Dense copies of the tridiagonal operators (full was replaced by Array)
    const Mx = Array(Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]))
    const My = copy(Mx)
    Mx[2,1] = 2.0
    Mx[end-1,end] = 2.0
    My[1,2] = 2.0
    My[end,end-1] = 2.0

    # Define the initial condition as normal arrays
    u0 = zeros(N,N,3)

    const MyA = zeros(N,N);
    const AMx = zeros(N,N);
    const DA = zeros(N,N)

    # Define the discretized PDE as an ODE function
    function f(du,u,p,t)
      A = @view u[:,:,1]
      B = @view u[:,:,2]
      C = @view u[:,:,3]
      dA = @view du[:,:,1]
      dB = @view du[:,:,2]
      dC = @view du[:,:,3]
      mul!(MyA,My,A)
      mul!(AMx,A,Mx)
      @. DA = D*(MyA + AMx)
      @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
      @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
      @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
    end

    # Solve the ODE
    prob = ODEProblem(f,u0,(0.0,100.0))
    sol = solve(prob,ROCK2(),progress=true,save_everystep=false,save_start=false)

    # u0 is a plain N×N×3 array, so slice the final state along the 3rd dimension
    using Plots; pyplot()
    p1 = surface(X,Y,sol.u[end][:,:,1],title = "[A]")
    p2 = surface(X,Y,sol.u[end][:,:,2],title = "[B]")
    p3 = surface(X,Y,sol.u[end][:,:,3],title = "[C]")
    plot(p1,p2,p3,layout=grid(3,1))

Making Use of GPU Parallelism

That was all using the CPU. How do we turn on GPU parallelism with DifferentialEquations.jl? Well, you don't. DifferentialEquations.jl "doesn't have GPU bits". So wait... can we not do GPU parallelism? We can: this is the glory of type-genericness, especially in broadcasted operations. To make things use the GPU, we simply use a GPUArray: instead of zeros(N,M) we use CuArray(zeros(N,M)). CuArray naturally overrides broadcast such that dotted operations are performed on the GPU. DifferentialEquations.jl uses broadcast internally, and thus just by passing in a GPUArray, the array type will take over how all internal updates are performed and turn this algorithm into a fully GPU-parallelized algorithm that doesn't require copying to the CPU. Wasn't that simple?

From that you can probably also see how to multithread everything, or how to set everything up with distributed parallelism. You can make the ODE solvers do whatever you want by defining an array type where the broadcast does whatever special behavior you want.
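Since not everyone has a GPU at hand, here is a CPU-only sketch of the same idea. The loop `euler!` below is a hand-rolled explicit Euler method I wrote for illustration (it is not how DifferentialEquations.jl is implemented internally), and `decay!` is a toy right-hand side. Because every update goes through broadcast, the same code runs unchanged on arrays of `Float64`, `Float32` or `BigFloat`, and the array decides how the arithmetic is carried out.

```julia
# Generic explicit Euler: nothing here mentions a concrete array type
function euler!(u, f!, dt, nsteps)
    du = similar(u)               # cache with the same array type as u
    for _ in 1:nsteps
        f!(du, u)
        @. u = u + dt * du        # broadcast dispatches on the type of u
    end
    return u
end

# Toy right-hand side du/dt = -u, also written with broadcast
function decay!(du, u)
    @. du = -u
    return nothing
end

u64 = euler!(ones(Float64, 4), decay!, 0.01, 100)
u32 = euler!(ones(Float32, 4), decay!, 0.01f0, 100)
ubig = euler!(ones(BigFloat, 4), decay!, big"0.01", 100)
```

Swapping in a GPU array type for `ones(Float32, 4)` is the same move: the loop stays as it is and the array's broadcast implementation does the work.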

So to recap, the entire difference from above is changing to:

    using CuArrays

    # Move the operators, coefficients and caches to the GPU as Float32
    const gMx = CuArray(Float32.(Mx))
    const gMy = CuArray(Float32.(My))
    const gα₁ = CuArray(Float32.(α₁))
    gu0 = CuArray(Float32.(u0))

    const gMyA = CuArray(zeros(Float32,N,N))
    const gAMx = CuArray(zeros(Float32,N,N))
    const gDA = CuArray(zeros(Float32,N,N))

    function gf(du,u,p,t)
      A = @view u[:,:,1]
      B = @view u[:,:,2]
      C = @view u[:,:,3]
      dA = @view du[:,:,1]
      dB = @view du[:,:,2]
      dC = @view du[:,:,3]
      mul!(gMyA,gMy,A)
      mul!(gAMx,A,gMx)
      @. gDA = D*(gMyA + gAMx)
      @. dA = gDA + gα₁ - β₁*A - r₁*A*B + r₂*C
      @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
      @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
    end

    prob2 = ODEProblem(gf,gu0,(0.0,100.0))
    CuArrays.allowscalar(false) # makes sure none of the slow fallbacks are used
    @time sol = solve(prob2,ROCK2(),progress=true,dt=0.003,save_everystep=false,save_start=false)

Go have fun.

And Stochastic PDEs?

Why not make it an SPDE? All that we need to do is extend each of the PDE equations to have a noise function. In this case, let's use multiplicative noise on each reactant. This means that our noise update equation is:

    function g(du,u,p,t)
      A = @view u[:,:,1]
      B = @view u[:,:,2]
      C = @view u[:,:,3]
      dA = @view du[:,:,1]
      dB = @view du[:,:,2]
      dC = @view du[:,:,3]
      @. dA = γ₁*A
      @. dB = γ₂*A
      @. dC = γ₃*A
    end

Now we just define and solve the system of SDEs:

    prob = SDEProblem(f,g,u0,(0.0,100.0))
    sol = solve(prob,SRIW1())

We can see the cool effect that diffusion dampens the noise in [A] but is unable to dampen the noise in [B] which results in a very noisy [C]. The stiff SPDE takes much longer to solve even using high order plus adaptivity because stochastic problems are just that much more difficult (current research topic is to make new algorithms for this!). It gets GPU'd just by using CuArray like before. But there we go: solving systems of stochastic PDEs using high order adaptive algorithms with within-method GPU parallelism. That's gotta be a first? The cool thing is that nobody ever had to implement the GPU-parallelism either, it just exists by virtue of the Julia type system.

(Note: We can also use one of the SROCK methods for better performance here, but they will require a choice of dt)
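To make the roles of `f` and `g` concrete, here is a rough Euler-Maruyama sketch for du = f dt + g dW with diagonal noise. The helper `em_step!`, the toy drift `toy_f!`, the noise strength `γ` and the step size are all assumptions for illustration. This is the crude fixed-step scheme, not the adaptive high-order methods used above, and it is only stable here because the toy problem is not stiff.

```julia
using Random

# One Euler-Maruyama step: u <- u + f(u) dt + g(u) .* dW, with dW ~ N(0, dt)
function em_step!(u, drift, noise, f!, g!, t, dt, rng)
    f!(drift, u, nothing, t)                  # deterministic part
    g!(noise, u, nothing, t)                  # noise coefficients at the old u
    dW = sqrt(dt) .* randn(rng, size(u)...)   # independent Brownian increments
    @. u = u + dt * drift + noise * dW
    return u
end

const γ = 0.1                                 # assumed noise strength

# Toy birth-death drift and multiplicative noise, same signature as in the post
toy_f!(du, u, p, t) = (@. du = 1.0 - u)
toy_g!(du, u, p, t) = (@. du = γ * u)

rng = MersenneTwister(1)
u = zeros(4, 4)
drift = similar(u)
noise = similar(u)
dt = 0.01
for n in 1:1000
    em_step!(u, drift, noise, toy_f!, toy_g!, n * dt, dt, rng)
end
```

Each entry of `u` ends up fluctuating around the deterministic steady state of 1. Since everything is again written with broadcast, the same step function is GPU-friendly in principle, provided the random numbers are generated on the device as well.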

Side Notes

Warning: This can take a while to solve! An explicit Runge-Kutta algorithm isn't necessarily great here, though using a stiff solver on a problem of this size once again requires smartly choosing sparse linear solvers. The high order adaptive method is pretty much necessary though, since something like Euler-Maruyama is simply not stable enough to solve this at a reasonable dt. Also, the current algorithms are not so great at handling this problem. Good thing there's a publication coming along with some new stuff...

Conclusion

So that's where we're at. GPU parallelism works because of abstract typing. But in some cases we need to help the GPU array libraries get up to snuff to handle all of the operations, and then we'll really be in business! Of course there's more optimizing that needs to be done, and we can do this by specializing code paths on bottlenecks as needed.

I think this is at least a nice proof of concept showing that Julia's generic algorithms allow for one to not only take advantage of things like higher precision, but also take advantage of parallelism and extra hardware without having to re-write the underlying algorithm. There's definitely more work that needs to be done, but I can see this usage of abstract array typing as being one of Julia's "killer features" in the coming years as the GPU community refines its tools. As of May 2019, all of this GPU stuff is compatible with stiff solvers and linear solver choices (so that way it can make use of GPU-based Jacobian factorizations and Krylov methods), and comparable methods for SDEs also are implemented in DifferentialEquations.jl. A follow-up blog post will show how to best use GPUs with implicit methods.

Full Script

Here's the full script for recreating everything:

    #######################################################
    ### Solve the PDE
    #######################################################

    using OrdinaryDiffEq, RecursiveArrayTools, LinearAlgebra

    # Define the constants for the PDE
    const α₂ = 1.0
    const α₃ = 1.0
    const β₁ = 1.0
    const β₂ = 1.0
    const β₃ = 1.0
    const r₁ = 1.0
    const r₂ = 1.0
    const D = 100.0
    const γ₁ = 0.1
    const γ₂ = 0.1
    const γ₃ = 0.1
    const N = 100
    const X = reshape([i for i in 1:100 for j in 1:100],N,N)
    const Y = reshape([j for i in 1:100 for j in 1:100],N,N)
    const α₁ = 1.0.*(X.>=80)

    const Mx = Array(Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]))
    const My = copy(Mx)
    Mx[2,1] = 2.0
    Mx[end-1,end] = 2.0
    My[1,2] = 2.0
    My[end,end-1] = 2.0

    # Define the initial condition as normal arrays
    u0 = zeros(N,N,3)

    const MyA = zeros(N,N);
    const AMx = zeros(N,N);
    const DA = zeros(N,N)

    # Define the discretized PDE as an ODE function
    function f(du,u,p,t)
      A = @view u[:,:,1]
      B = @view u[:,:,2]
      C = @view u[:,:,3]
      dA = @view du[:,:,1]
      dB = @view du[:,:,2]
      dC = @view du[:,:,3]
      mul!(MyA,My,A)
      mul!(AMx,A,Mx)
      @. DA = D*(MyA + AMx)
      @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
      @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
      @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
    end

    # Solve the ODE
    prob = ODEProblem(f,u0,(0.0,100.0))
    @time sol = solve(prob,ROCK2(),progress=true,save_everystep=false,save_start=false)

    # u0 is a plain N×N×3 array, so slice the final state along the 3rd dimension
    using Plots; pyplot()
    p1 = surface(X,Y,sol.u[end][:,:,1],title = "[A]")
    p2 = surface(X,Y,sol.u[end][:,:,2],title = "[B]")
    p3 = surface(X,Y,sol.u[end][:,:,3],title = "[C]")
    plot(p1,p2,p3,layout=grid(3,1))

    #######################################################
    ### Solve the PDE using CuArrays
    #######################################################

    using CuArrays
    gu0 = CuArray(Float32.(u0))
    const gMx = CuArray(Float32.(Mx))
    const gMy = CuArray(Float32.(My))
    const gα₁ = CuArray(Float32.(α₁))

    const gMyA = CuArray(zeros(Float32,N,N))
    const gAMx = CuArray(zeros(Float32,N,N))
    const gDA = CuArray(zeros(Float32,N,N))

    function gf(du,u,p,t)
      A = @view u[:,:,1]
      B = @view u[:,:,2]
      C = @view u[:,:,3]
      dA = @view du[:,:,1]
      dB = @view du[:,:,2]
      dC = @view du[:,:,3]
      mul!(gMyA,gMy,A)
      mul!(gAMx,A,gMx)
      @. gDA = D*(gMyA + gAMx)
      @. dA = gDA + gα₁ - β₁*A - r₁*A*B + r₂*C
      @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
      @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
    end

    prob2 = ODEProblem(gf,gu0,(0.0,100.0))
    CuArrays.allowscalar(false) # makes sure none of the slow fallbacks are used
    @time sol = solve(prob2,ROCK2(),progress=true,save_everystep=false,save_start=false)

    #######################################################
    ### Solve the SPDE
    #######################################################

    using StochasticDiffEq

    function g(du,u,p,t)
      A = @view u[:,:,1]
      B = @view u[:,:,2]
      C = @view u[:,:,3]
      dA = @view du[:,:,1]
      dB = @view du[:,:,2]
      dC = @view du[:,:,3]
      @. dA = γ₁*A
      @. dB = γ₂*A
      @. dC = γ₃*A
    end

    prob3 = SDEProblem(f,g,u0,(0.0,100.0))
    sol = solve(prob3,SOSRI(),progress=true,save_everystep=false,save_start=false)

    p1 = surface(X,Y,sol.u[end][:,:,1],title = "[A]")
    p2 = surface(X,Y,sol.u[end][:,:,2],title = "[B]")
    p3 = surface(X,Y,sol.u[end][:,:,3],title = "[C]")
    plot(p1,p2,p3,layout=grid(3,1))

    # Exercise: Do SPDE + GPU