I was recently playing around with a problem presented to me: using Monte Carlo simulations to calculate the value of π (pi). For anyone not familiar with the Monte Carlo method, it is essentially a probabilistic way to answer a mathematical question by running a large number of simulations, useful when you cannot get (or want to double-check) a closed-form solution. Think about flipping an invisible coin 100 times in a black box and getting only the experimental results out. When you get approximately 50:50 heads/tails, you realize through these simulated flips that there are probably two equally likely outcomes in the black box, without ever actually seeing the coin. We'll build a more useful example shortly.
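To make that concrete, here's a minimal sketch of the coin-flip experiment in Go. The `headsRatio` helper is purely illustrative (it isn't part of the solution we'll build below):

```go
package main

import (
	"fmt"
	"math/rand"
)

// headsRatio flips a fair "invisible coin" n times and returns the
// fraction of heads observed across the simulated flips.
func headsRatio(n int) float64 {
	heads := 0
	for i := 0; i < n; i++ {
		if rand.Float64() < 0.5 { // each outcome equally likely
			heads++
		}
	}
	return float64(heads) / float64(n)
}

func main() {
	// With enough flips, the ratio settles near 0.5, revealing the two
	// equally likely outcomes without our ever seeing the coin.
	fmt.Printf("heads ratio after 100,000 flips: %.3f\n", headsRatio(100000))
}
```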

As I built out the solution, I realized that it was a perfect time to apply Go's concurrency primitives to make things run a little faster on my multicore machine. Let's see if we can use Go's concurrency features to our advantage. We'll benchmark our functions as we go and see that we'll run into some big pitfalls as we try to boost our multicore performance.

Building Monte Carlo simulations

First, our problem to solve with Monte Carlo simulations: calculate the value of π. Wikipedia's page on the Monte Carlo method does a very good job of describing the method to do this:

Draw a square on the ground, then inscribe a circle within it. Uniformly scatter some objects of uniform size (grains of rice or sand) over the square. Count the number of objects inside the circle and the total number of objects. The ratio of the two counts is an estimate of the ratio of the two areas, which is π/4. Multiply the result by 4 to estimate π.

And here's the accompanying and very helpful visual from Wikipedia:

To make the simulations simple, we will just use a unit square with sides of length 1. That means our final ratio of A_circle / A_square will simply be (π·(1)²/4) / (1)² = π/4. We just need to multiply by 4 to get π. With that method in mind, our Go code looks like:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func PI(samples int) float64 {
	var inside int = 0

	for i := 0; i < samples; i++ {
		x := rand.Float64()
		y := rand.Float64()
		if (x*x + y*y) < 1 {
			inside++
		}
	}

	ratio := float64(inside) / float64(samples)

	return ratio * 4
}

func init() {
	rand.Seed(time.Now().UnixNano())
}

func main() {
	fmt.Println("Our value of Pi after 100 runs:\t\t\t", PI(100))
	fmt.Println("Our value of Pi after 1,000 runs:\t\t", PI(1000))
	fmt.Println("Our value of Pi after 10,000 runs:\t\t", PI(10000))
	fmt.Println("Our value of Pi after 100,000 runs:\t\t", PI(100000))
	fmt.Println("Our value of Pi after 1,000,000 runs:\t\t", PI(1000000))
	fmt.Println("Our value of Pi after 10,000,000 runs:\t\t", PI(10000000))
	fmt.Println("Our value of Pi after 100,000,000 runs:\t\t", PI(100000000))
}

Hopefully the above code is fairly straightforward. Essentially, we use the math/rand package to generate random x and y values between 0.0 and 1.0, and then check whether x² + y² < 1, i.e. whether our generated random sample fell inside the unit circle. We do this for n samples, keeping count of how many fall inside the circle. Dividing the count of samples inside the circle by the total number of samples gives us the ratio A_circle / A_square, which should roughly equal π/4. Finally, we multiply by 4 to get our final estimate for π.

In my run, I get the results:

Our value of Pi after 100 runs:			3.32
Our value of Pi after 1,000 runs:		3.14
Our value of Pi after 10,000 runs:		3.1544
Our value of Pi after 100,000 runs:		3.13976
Our value of Pi after 1,000,000 runs:		3.140692
Our value of Pi after 10,000,000 runs:		3.14165
Our value of Pi after 100,000,000 runs:		3.14164864

...Phew. That took a while, maybe ~10 seconds for me (we'll time it shortly). But look at that, π = 3.14164864 after 100,000,000 runs. That's not too far from the real value of π ≈ 3.14159! Nice!
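Before setting up proper benchmarks, you can get a rough wall-clock number with `time.Since`. This is just a sketch using a trimmed-down copy of the estimator (the `estimatePI` name is illustrative, not from the code above):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// estimatePI is a trimmed-down copy of PI, just for timing a single run.
func estimatePI(samples int) float64 {
	inside := 0
	for i := 0; i < samples; i++ {
		x, y := rand.Float64(), rand.Float64()
		if x*x+y*y < 1 {
			inside++
		}
	}
	return 4 * float64(inside) / float64(samples)
}

func main() {
	start := time.Now()
	est := estimatePI(10000000)
	// time.Since gives a quick wall-clock measurement of one run.
	fmt.Printf("pi ≈ %f (took %v)\n", est, time.Since(start))
}
```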

Adding concurrency

Since these runs are completely independent of each other, they are a perfect candidate for parallelizing. There may well be better optimizations than parallelism for this particular problem, but parallelism helps with any independent Monte Carlo sampling, so what we learn here will be broadly applicable to this whole class of problems.

Let's see what our PI() function looks like in its concurrent form, MultiPI():

func MultiPI(samples int, threads int) float64 {
	threadSamples := samples / threads
	results := make(chan float64, threads)

	for j := 0; j < threads; j++ {
		go func() {
			var inside int
			for i := 0; i < threadSamples; i++ {
				x, y := rand.Float64(), rand.Float64()
				if x*x+y*y <= 1 {
					inside++
				}
			}
			results <- float64(inside) / float64(threadSamples) * 4
		}()
	}

	var total float64
	for i := 0; i < threads; i++ {
		total += <-results
	}

	return total / float64(threads)
}

Let's run through our changes from PI() to MultiPI():

Firstly, we create a results channel where each goroutine can send its estimate of π when it is done. Then we use the go keyword to run each function as its own non-blocking goroutine, with each one processing a subset of the total samples. At the end of its life, each goroutine sends its value for π into the channel. Finally, we read one result per goroutine off the channel and average them.
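This fan-out/fan-in shape is worth seeing in isolation. Here's a minimal sketch of the same pattern (the `partialSum` worker is purely illustrative): each goroutine computes a partial result and sends it on a buffered channel, and the main goroutine collects one result per worker:

```go
package main

import "fmt"

// partialSum adds the integers in [lo, hi) and sends the result
// on the results channel.
func partialSum(lo, hi int, results chan<- int) {
	sum := 0
	for i := lo; i < hi; i++ {
		sum += i
	}
	results <- sum
}

func main() {
	const workers = 4
	// Buffered to `workers` so no sender ever blocks.
	results := make(chan int, workers)
	for w := 0; w < workers; w++ {
		go partialSum(w*25, (w+1)*25, results) // fan out
	}
	total := 0
	for w := 0; w < workers; w++ {
		total += <-results // fan in: one result per worker
	}
	fmt.Println(total) // sum of 0..99, i.e. 4950
}
```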

And we're done! Let's go ahead and benchmark this version against our original PI() function. Our benchmark code, set to 10,000,000 samples and using 4 goroutines for MultiPI(), looks like:

package main

import (
	"testing"
)

const samples = 10000000

func BenchmarkPI(b *testing.B) {
	for i := 0; i < b.N; i++ {
		PI(samples)
	}
}

func BenchmarkMultiPI(b *testing.B) {
	for i := 0; i < b.N; i++ {
		MultiPI(samples, 4)
	}
}

For the sake of brevity, I won't cover how to run Go benchmarks here, but you can learn quickly over at Beautifully Simple Benchmarking With Go. Ok, so what do our benchmarks look like?

BenchmarkPI		       2	 657884228 ns/op
BenchmarkMultiPI	       2	 877696394 ns/op

... well, that's horribly disappointing. The concurrent version is actually slower. That's pretty bad, since I'm running this on an Intel Core i7 with 4 physical cores. What's going on here?

Concurrency isn't parallelism

We've just discovered that concurrency isn't parallelism. Rob Pike, one of the creators of the Go language, dedicates an entire talk, "Concurrency is not Parallelism", to this distinction, which I highly recommend along with its accompanying slides. Essentially, in MultiPI(), we've broken up our previously synchronous work into 4 independent functions that run as goroutines. They run concurrently, meaning each goroutine doesn't depend on another (unless explicitly told to) and won't wait synchronously for another to finish before running its own tasks. For example, if a goroutine stopped to wait for I/O, the Go scheduler would automatically hand the reins to another goroutine rather than wait for the blocked one to finish. However, goroutines are not in and of themselves separate processes that necessarily run at the same time, in parallel. You can have concurrency without parallelism, as you always would, for example, on a single-core machine running a Go application with multiple goroutines. In the single-core case, the Go runtime scheduler constantly switches between goroutines, but only one goroutine is being executed by the CPU at any instant. As Effective Go states:

Goroutines are multiplexed onto multiple OS threads so if one should block, such as while waiting for I/O, others continue to run.

Typically, goroutines don't even take up a whole OS thread on their own. The scheduler will only create an extra OS thread if it has to. In addition, Go uses only a single core by default:

The current implementation of the Go runtime will not parallelize this code by default. It dedicates only a single core to user-level processing.
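You can see this distinction directly in code. In this small sketch (the `runWorkers` helper is hypothetical, not from our solution), GOMAXPROCS is pinned to 1, so the goroutines still run concurrently, interleaved by the scheduler, but never in parallel, and they all still complete:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// runWorkers launches `workers` goroutines that each perform `perWorker`
// increments on their own slot, then waits for all of them to finish.
func runWorkers(workers, perWorker int) []int {
	var wg sync.WaitGroup
	counts := make([]int, workers)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				counts[id]++ // safe: each goroutine writes only its own slot
			}
		}(i)
	}
	wg.Wait()
	return counts
}

func main() {
	// Pin the scheduler to a single core: the goroutines below still run
	// concurrently (interleaved), but never in parallel.
	runtime.GOMAXPROCS(1)
	fmt.Println(runWorkers(4, 1000)) // every goroutine still completes
}
```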

So how do we get our MultiPI() to start using our multiple cores? We use Go's runtime package to find out how many cores we have, and tell the runtime to use all of them via Go's GOMAXPROCS setting in our init() function:

func init() {
	runtime.GOMAXPROCS(runtime.NumCPU())
	rand.Seed(time.Now().UnixNano())
}

There's one more fix we need to make, and it took a little tinkering to figure out. Previously we used math/rand's rand.Float64() convenience function to get our x and y values. As per this Stack Overflow discussion, this is problematic because the convenience function uses a global rand object guarded by a mutex. So all of our goroutines were previously serialized on the same lock as they called rand.Float64() on the same underlying object. The fix is to give each goroutine its own *Rand instance. With both of these fixes made, our final MultiPI() looks like:

func MultiPI(samples int) float64 {
	cpus := runtime.NumCPU()

	threadSamples := samples / cpus
	results := make(chan float64, cpus)

	for j := 0; j < cpus; j++ {
		go func() {
			var inside int
			r := rand.New(rand.NewSource(time.Now().UnixNano()))
			for i := 0; i < threadSamples; i++ {
				x, y := r.Float64(), r.Float64()
				if x*x+y*y <= 1 {
					inside++
				}
			}
			results <- float64(inside) / float64(threadSamples) * 4
		}()
	}

	var total float64
	for i := 0; i < cpus; i++ {
		total += <-results
	}

	return total / float64(cpus)
}

Running our benchmarks one more time, fingers crossed ...

BenchmarkPI-8		       2	 660629709 ns/op
BenchmarkMultiPI-8	      20	  63408278 ns/op

And there we go! That's a 10.4x speedup, which is much more like what we'd hope for from our 4 physical core (8 virtual core) processor.

Lessons learned

Concurrency isn't parallelism. -- Goroutines are non-blocking and amazing out of the box for things like web servers, which spend most of their time waiting for I/O. Even concurrent goroutines multiplexed onto a single thread on a single core can bring huge performance benefits over a synchronous implementation. But if we want true parallelism (multiple computations running at the same time), we need to make sure we're actually using all of our cores via GOMAXPROCS.

Beware of more obscure limits to parallelism. -- Our code originally seemed parallelized, but it really wasn't, due to a mutex lock inside a function call to an imported package. The only thing that caught it was good benchmarking. Evidence-based parallelism through good benchmarking beats assuming things are parallel because they look that way at first glance.

That's it for today, hope that was helpful! The full codebase is available on GitHub.

Update: After a great comment by Daniel Heckrath, I changed PI() to use a rand.New() *Rand object instead of the global *Rand object used by rand.Float64(). I had assumed that, without multiple goroutines contending for the mutex as in the multicore code, this wouldn't make a difference, but there was actually a 2x speedup for PI(), because a dedicated *Rand instance avoids the overhead of locking and unlocking the mutex on the global rand object. The actual multicore speedup was therefore 5.2x for me, much closer to what you'd expect from a 4-core machine. Updated code is in the GitHub repo. Great catch, Daniel.

Feel free to send over any questions or mistakes I may have made, you can find me on Twitter @soroushjp :)

-- Soroush