Nobody understands the GIL

Published on June 12, 2013 by Jesse Storimer

Throughout most of my time in the Ruby community, MRI's infamous GIL has been an elusive creature for me. This is a story about thread safety, and finally catching that elusive creature to get a good look at it.

The first time I heard mention of the GIL, it had nothing to do with how it worked, what it did, or why it existed. I only heard that it was silly because it restricted parallelism, or that it was great because it made my code thread-safe. In time, I've gotten more comfortable with multi-threaded programming, and realized that the world is more complicated than that.

I wanted to know, at a deep technical level, how the GIL worked. Only, there's no specification for the GIL, and no documentation. It's essentially unspecified behaviour; an MRI implementation detail. The Ruby core team makes no promises about how it will work or what it guarantees.

But I may be getting ahead of myself.

If you're completely unfamiliar with the GIL, here's my 30 second explanation:

MRI has something called a global interpreter lock (GIL). It's a lock around the execution of Ruby code. This means that in a multi-threaded context, only one thread can execute Ruby code at any one time.

So if you have 8 threads busily working on an 8-core machine, only one thread and one core will be busy at any given time. The GIL exists to protect Ruby internals from race conditions that could corrupt data. There are caveats and optimizations, but this is the gist.
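If you'd like to see this for yourself, here's a rough sketch that times the same CPU-bound work run sequentially and then on threads. The method names, the job count, and the workload size are all arbitrary, made up for illustration. On MRI I'd expect the two timings to come out roughly the same, since only one thread can execute Ruby code at a time. The exact numbers depend on your machine and Ruby implementation, so I'm not showing any output.

```ruby
require 'benchmark'

# CPU-bound busywork: sum the integers below n in pure Ruby.
def busy_sum(n)
  total = 0
  n.times { |i| total += i }
  total
end

JOBS = 4          # arbitrary number of jobs
WORK = 1_000_000  # arbitrary amount of work per job

# Run the jobs one after another on the main thread.
def run_sequential
  Array.new(JOBS) { busy_sum(WORK) }
end

# Run the same jobs, each on its own thread, and collect the results.
def run_threaded
  Array.new(JOBS) { Thread.new { busy_sum(WORK) } }.map(&:value)
end

sequential_time = Benchmark.realtime { run_sequential }
threaded_time   = Benchmark.realtime { run_threaded }

puts format('sequential: %.2fs', sequential_time)
puts format('threaded:   %.2fs', threaded_time)
```

Each thread here only touches its own local variables, so there's nothing shared to corrupt. The only question is whether the threads buy you any speed.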

The problem

Back in 2008, Ilya Grigorik gave me a high-level understanding of the GIL with Parallelism is a Myth in Ruby. Even as I learned more about multi-threaded programming in Ruby, that high-level understanding stuck with me. Heck, I recently wrote a book about multi-threading in Ruby, and this high-level understanding of the GIL is all I had.

The problem with having just this high-level understanding is that I can't answer the deep technical questions. Specifically, I want to know if the GIL provides any kind of guarantee about the thread-safety of certain Ruby operations. Let me demonstrate.

Appending to arrays is not thread-safe

Very few things are implicitly thread-safe in Ruby. For example, appending to arrays:

    array = []

    5.times.map do
      Thread.new do
        1000.times do
          array << nil
        end
      end
    end.each(&:join)

    puts array.size

Here there are 5 threads sharing one Array object. Each thread pushes nil into the Array 1000 times. So in the end, there should be 5000 elements in the Array, right?

    $ ruby pushing_nil.rb
    5000

    $ jruby pushing_nil.rb
    4446

    $ rbx pushing_nil.rb
    3088

:(

Even this trivial example exposes an operation in Ruby that's not implicitly thread-safe. Or is it? What really happened here?

Notice that MRI produced the correct result, 5000. Both JRuby and Rubinius produced an incorrect result. If you were to run it again, you would probably see MRI give the correct result again, with JRuby and Rubinius giving different incorrect results.

These inconsistent results are because of the GIL. Because MRI has a GIL, even when there are 5 threads running at once, only one thread is active at a time. In other words, things aren't truly parallel. JRuby and Rubinius don't have a GIL, so when you have 5 threads running, you really have 5 threads running in parallel across the available cores.

On the parallel Ruby implementations, the 5 threads are stepping through code that's not thread-safe. They end up interrupting each other and, ultimately, corrupting the underlying data.

How multiple threads can corrupt data

How is this possible? I thought Ruby had our back, right? In keeping with preferring technical details over high-level explanations (that seems to be the theme here), I'll show you how this is technically possible.

Whether you're using MRI, JRuby, or Rubinius, the Ruby language is implemented on top of another language. MRI is implemented in C, JRuby in Java, and Rubinius in a mixture of Ruby and C++. So when you have this single operation in Ruby:

array << nil

that actually expands to dozens or hundreds of lines of code in the underlying implementation. For example, here's the implementation of Array#<< from MRI:

    VALUE
    rb_ary_push(VALUE ary, VALUE item)
    {
        long idx = RARRAY_LEN(ary);

        ary_ensure_room_for_push(ary, 1);
        RARRAY_ASET(ary, idx, item);
        ARY_SET_LEN(ary, idx + 1);
        return ary;
    }

Notice at least 4 distinct underlying operations.

1. Get the current length of the Array.
2. Check if there's room for another element in the Array.
3. Append the element to the Array.
4. Set the length property of the Array to the old value + 1.

Each of these operations calls yet other functions or macros. The reason I'm bringing this up is to show you how multiple threads can corrupt data. In a single-threaded context, you can look at this short C function and easily trace the path of the code.

In other words, we're used to stepping through code in a linear fashion and reasoning about the state of 'the world'. That's often how we write code.
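To make that linear trace concrete, here's a rough Ruby rendering of the four steps. It's a toy model only: ToyArray and its methods are made-up names, and this is not how MRI actually stores arrays. It just mirrors the shape of the C function so you can step through it line by line.

```ruby
# Toy model of an array, for illustration only. This is NOT how MRI
# stores arrays; the class and method names are made up.
class ToyArray
  attr_reader :length

  def initialize
    @slots  = Array.new(4) # pre-allocated storage, initially all nil
    @length = 0            # number of slots actually in use
  end

  def push(item)
    idx = @length                  # 1. get the current length
    ensure_room_for_push(idx + 1)  # 2. make sure there's room
    @slots[idx] = item             # 3. store the element
    @length = idx + 1              # 4. set length to the old value + 1
    self
  end

  # Only the slots that are in use.
  def to_a
    @slots.first(@length)
  end

  private

  # Double the storage when the next element wouldn't fit.
  def ensure_room_for_push(needed)
    @slots.concat(Array.new(@slots.size)) if needed > @slots.size
  end
end

list = ToyArray.new
6.times { |i| list.push(i) }
p list.to_a    # => [0, 1, 2, 3, 4, 5]
p list.length  # => 6
```

With one thread, each call to push runs steps 1 through 4 without anything else touching @slots or @length in between, which is exactly the assumption the code relies on.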

When multiple threads are involved, this is no longer possible. It's as if the rules of physics change. When there are two threads, each thread is tracing its own path through the code. Now you have to maintain 2 (or more) 'pointers' to where each thread is currently executing. Since threads share the same memory space, both threads can be mutating the state of 'the world' at the same time.

It's possible for one thread to interrupt another, change the state of things right from underneath its feet, then have the original proceed, completely unaware that the state of things has changed.

This is the reason that some of the Ruby implementations produced incorrect results when simply appending to an Array. A situation like the following happened.

Here's the base state of our little system.

There are two active threads, both about to enter this function. Consider steps 1-4 to be pseudo-code for the implementation of Array#<< from MRI you saw earlier. Once both threads enter this function, here's a possible sequence of events, beginning with Thread A.

This looks more complicated, but just follow the directional arrows to take you through the flow of what happens here. I've added little labels at each step to reflect the state of things from the perspective of each thread.

This is just one possible sequence of events.

So what happens is that Thread A starts down the usual sequential path of this function, but when it gets to step 3, it hits a context switch! This pauses Thread A right where it is. Then Thread B takes over and it runs through the entire function, appending its element and incrementing the length attribute.

Once Thread B is done, Thread A is resumed. It picks up exactly where it left off. Remember, Thread A was paused right before it incremented the length attribute, so it goes ahead and increments the length attribute. Only, it doesn't know that Thread B came and changed the state of things right out from under it.

So Thread B set the length to 1, then Thread A set the length to 1, even though both appended their element to the Array. This data has become corrupted. That's why there's a little lightning strike beside the last step in Thread A.
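You can replay this exact sequence in plain Ruby without any real threads, which makes the outcome the same on every run. This is a sketch under simplifying assumptions: SharedArray, steps_1_to_3, step_4, and replay_lost_update are made-up names, and the room check from step 2 is skipped because a Ruby Array grows on demand.

```ruby
# Replay of the interleaving described above, with no real threads, so
# the outcome is the same on every run. All names here are made up.
SharedArray = Struct.new(:slots, :length)

# Steps 1-3: read the length, (skip the room check), store the item.
# Returns the index this "thread" remembered from step 1.
def steps_1_to_3(shared, item)
  idx = shared.length       # 1. get the current length
                            # 2. room check omitted; slots grow on demand
  shared.slots[idx] = item  # 3. store the element
  idx
end

# Step 4: set the length from the remembered (possibly stale) index.
def step_4(shared, idx)
  shared.length = idx + 1
end

def replay_lost_update
  shared = SharedArray.new([], 0)

  a_idx = steps_1_to_3(shared, :from_a)  # Thread A runs steps 1-3...
  # ...context switch: Thread B runs the whole function.
  b_idx = steps_1_to_3(shared, :from_b)
  step_4(shared, b_idx)
  # Thread A resumes and runs step 4 with its stale index.
  step_4(shared, a_idx)

  shared
end

result = replay_lost_update
p result.slots   # => [:from_b]
p result.length  # => 1
```

Both "threads" stored an element, but they both read a length of 0 in step 1, so they wrote to the same slot and both set the length to 1. Two appends happened and only one element survived.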

But I thought Ruby had our back?

This sequence of events is what could lead to incorrect results as shown with JRuby or Rubinius from the trivial example way back.

Except, in the case of JRuby and Rubinius, things are even more complicated because the threads can actually run in parallel. In this diagram, one thread is paused while the other progresses. With true parallelism, both threads can progress at the same time.

If you actually ran the example from earlier against one of these implementations, you would've seen that the incorrect result is different each time. The context switch that happens here is non-deterministic; it's not predictable. It's possible that it happens earlier in the function, or later, or not at all. The next section will tell you more about this.

So, why doesn't Ruby shield us from this madness? For the same reason that core collections in other programming languages don't offer thread safety guarantees: it's expensive. It's possible for all of the Ruby implementations to provide thread-safe data structures, but that requires extra overhead that would make single-threaded code slower.

Rather than make this tradeoff, the onus is on you, the developer, to provide the thread safety guarantees when you need them.

This leaves two open questions for me, and we still haven't dived into the technical details of the GIL.

1. If this context switch is possible, why does MRI still produce the correct answer?
2. What the heck is this context switch?

Question #1 is the reason that I started writing this article. A high-level understanding of the GIL can't answer this question. That high-level understanding makes it clear that only one thread will be executing Ruby code at one time. But can a context switch still happen in the middle of a function underlying Ruby? What are the GIL semantics?

But first...

It's the scheduler's fault!

The context switch comes from the operating system thread scheduler. In all of the Ruby implementations that I showed output from, one Ruby thread is backed by one native, operating system thread. The operating system has to ensure that no single thread can hog all of the available resources, like CPU time, so it implements scheduling so that each thread has fair access to the resources.

This manifests as a series of pauses and resumes. Each thread gets a turn to consume resources, then it's paused in its tracks so another thread can have a turn. In time, this thread will be resumed, on and on again.

This is efficient for the operating system, but introduces a degree of randomness and a new angle of correctness to your programs. For example, the Array#<< operation now needs to be aware that it may be paused at any time, and another thread may perform the same operation in parallel, changing the state of 'the world' right under its feet.

What to do? Make key operations atomic.

If you want to be sure that this kind of interruption between threads can't happen, you should make the operation atomic. By making it atomic, you ensure that it can't be interrupted until it's finished. This would prevent our example from being interrupted in step #3 and, ultimately, prevent it from corrupting the data when it resumes at step #4.

The simplest way to make an operation like this atomic is to use a lock. This code is guaranteed to produce the correct result every time on MRI, JRuby, and Rubinius.

    array = []
    mutex = Mutex.new

    5.times.map do
      Thread.new do
        mutex.synchronize do
          1000.times do
            array << nil
          end
        end
      end
    end.each(&:join)

    puts array.size

It's guaranteed to be correct because it uses a shared mutex or lock. Once a thread enters the block of code inside the mutex.synchronize block, all other threads must wait until it's finished before they can enter the same code. If you think back to the beginning, I said that many lines of C code may underlie this operation, and the thread scheduler could schedule a context switch between any two such lines.

By making an operation like this atomic, you guarantee that if a context switch occurs inside this block, other threads will not be able to proceed into the same code. The thread scheduler will see this and switch again to another thread. This also guarantees that no other thread can come along and change the state of 'the world' from right underneath its feet. This example is now thread-safe.
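One thing to notice about the example above is that each thread holds the lock for its whole 1000-iteration loop, so the threads effectively take turns running the entire loop. If you'd rather lock only around each individual append, here's one possible sketch. LockedList and append_from_threads are hypothetical names I'm using for illustration, not anything from the standard library.

```ruby
# Hypothetical wrapper: every append takes the lock, so the underlying
# Array#<< is never entered by two threads at once.
class LockedList
  def initialize
    @items = []
    @mutex = Mutex.new
  end

  # Append one item while holding the lock.
  def <<(item)
    @mutex.synchronize { @items << item }
    self
  end

  # Read the size under the same lock.
  def size
    @mutex.synchronize { @items.size }
  end
end

# Start the given number of threads, each appending nil to the list,
# then wait for all of them to finish.
def append_from_threads(list, threads: 5, appends: 1000)
  Array.new(threads) do
    Thread.new { appends.times { list << nil } }
  end.each(&:join)
  list
end

puts append_from_threads(LockedList.new).size  # prints 5000
```

The tradeoff is more lock traffic: 5000 short critical sections instead of 5 long ones. In exchange, threads can interleave between appends rather than waiting for another thread's whole loop to finish.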

The GIL is a lock too

I've just shown how you can use a lock to make something atomic and provide a thread-safety guarantee. The GIL is also a lock, so does it guarantee that all of your Ruby code is thread-safe? Does the GIL make array << nil atomic?

This article has gotten long enough for one sitting. In part 2 I look deeper into MRI's GIL implementation to answer that question.