In the previous article Does Rails Performance Need an Overhaul? we had discussed the fact that proper Ruby threading is hindered by various broken native extensions. Writing a native extension for Ruby is pretty easy, however writing it right can not only be difficult, but can also be an obscure practice that requires l33t sk1llz because of the lack of documentation in this area. We’ve written several native extensions so far and in the process of figuring out how to make threading-friendly native extensions we had to wade through tons of Ruby source code. In this article I want to teach some best practices in the writing of threading-friendly native extensions.

Threading basics

As discussed in the previous article, Ruby 1.8 implements userspace threads, meaning that no matter how many Ruby threads you have, only one can run at a time, and only on a single CPU core. The threads are scheduled by Ruby itself, not by the operating system.

Ruby 1.9 implements native operating system threads. However it has a global interpreter lock which must be locked when the thread is running Ruby code. This effectively makes Ruby 1.9 single threaded most of the time.

With both Ruby 1.8 and 1.9 threads, system calls such as I/O operations can block the thread and preventing Ruby from context switching to another. Thus, system calls require special attention. Expensive calculations that do not involve system calls can also block the thread, but something can be done those as well, as you will read later on.

Handling I/O

Suppose that you have a file descriptor on which you want to perform some potentially blocking I/O. The naive approach is to perform the I/O command anyway and risk blocking the entire Ruby process. This is exactly what makes the mysql extension thread-unfriendly: while waiting on MySQL no other threads can run, grinding your multi-threaded Rails web app to a halt.

However there are a number of functions in your arsenal that you can use to combat this problem. And as a general rule, you should set to file descriptors to non-blocking mode.

rb_thread_wait_fd(fd)

Just before performing a blocking read, you should call rb_thread_wait_fd() on the file descriptor that you’re reading from. On 1.8, this function marks the current thread as waiting for readable data on this file descriptor and then invokes the scheduler. The scheduler uses the select() system call to check which file descriptors are readable and then selects a thread which may continue. If the file descriptor that you were waiting on is not readable, then your thread will be suspended until the next time the scheduler is invoked and selects your thread. But even if the file descriptor is immediately readable, the scheduler does not guarantee that your thread will be selected immediately.

On 1.9, rb_thread_wait_fd() simply unlocks the global interpreter lock, calls the select() system call on the given file descriptor, and re-acquires the global interpreter lock when select() returns. While select() is blocking, other threads can run.

As an optimization, if only the main thread exists then this function does nothing. This applies to both 1.8 and 1.9.

rb_thread_fd_writable(fd)

This works the same as rb_thread_fd(), but waits until the given file descriptor becomes writable. The single-thread optimization applies here too. You should call rb_thread_fd_writable() just before you perform a write I/O operation.

rb_thread_select()

To wait on multiple file descriptors, use this function instead of select() or poll(). Unlike the native system calls, this function will take care of invoking the scheduler or unlocking the global interpreter lock. Unlike rb_thread_wait_fd() and rb_thread_fd_writable(), there is no do-nothing-when-there’s-only-one-thread optimization here so it will always invoke the scheduler and call select().

rb_io_wait_readable()

I/O system calls can return a variety of error codes that indicate that you should restart the system call, such as EINTR (system call interrupted by signal) and EAGAIN (the file descriptor is set to non-blocking mode and the data is not yet available). You should therefore always call I/O system calls in a loop until it returns success or a different error code. You must however not forget to call rb_thread_wait_fd() or rb_thread_select() before you restart the system call, or you will risk blocking the thread again.

Ruby provides a function rb_io_wait_readable() to aid you in writing restart code. This function should be called right after your I/O reading system call has returned. It checks whether the system call should be restarted (returning Qtrue) or whether you should report an error (returning Qfalse). Here’s a code example:

int done = 0; int ret; /* Have the Ruby scheduler suspend this thread until the file descriptor becomes * readable; or if this is the only thread in the system, rb_thread_wait_fd() does * nothing and we immediately continue to the 'do' loop. */ rb_thread_wait_fd(fd); do { /* Actually you should surround your system call with some more code, but * we'll get to this later. This example code is only partial. */ ret = ...your read system call here... if (ret == -1) { if (rb_io_wait_readable(fd) == Qfalse) { ...throw an exception here... } /* else restart loop */ } else { done = 1; } } while (!done);

rb_io_wait_readable() checks whether errno equals EINTR or ERESTART, in which case it will call rb_thread_wait_for() on the file descriptor and return Qtrue. If errno is EAGAIN or EWOULDBLOCK then it calls rb_thread_select() on the file descriptor and returns true. Otherwise it returns false.

The difference between calling rb_thread_wait_for() and rb_thread_select() here is subtle, but important. The former only blocks (calls select() on the file descriptor) when there are multiple Ruby threads in the Ruby process, while the latter always blocks no matter what. This behavior is important because EAGAIN and EWOULDBLOCK occur when a non-blocking file descriptor is not yet readable; if we don’t block here on a select() then the code will enter a 100% CPU busy loop.

rb_io_wait_writable()

Works the same way as rb_io_wait_readable(). Use this for I/O write operations instead.

Sleeping

Use rb_thread_wait_for() instead of sleep() or usleep(). On 1.8 rb_thread_wait_for() marks the current thread as sleeping for a period of time and then invokes the scheduler, which does not select this thread until the period of time has expired. On 1.9 Ruby unlocks the global interpreter lock, calls some sleeping function, and then re-locks it after that function returns.

Other non-I/O blocking system calls

Sometimes you will want to wait on a blocking system call that isn’t related to I/O, such as waitpid(). There are several ways to deal with these kind of system calls.

Blocking outside the global interpreter lock

This method only works on Ruby 1.9. Unlock the global interpreter lock, do your thing, then re-locks it. Dealing with the global interpreter lock will be discussed later.

Non-blocking polling

Some system calls have non-blocking equivalents which return a certain error instead of blocking. For example waitpid() blocks by default, but it can be set to non-blocking by passing the WNOHANG flag, which causes it to return immediately with an error instead of blocking. You must call the non-blocking version in a loop. Upon detecting a blocking error, you must call rb_thread_polling(). On 1.8 this function lets the scheduler put the current thread to sleep for 60 msec, on 1.9 for 100 msec.

For example, Ruby’s Process#waitpid function does not block other threads. On 1.9 it simply unlocks the global interpreter lock while blocking on waitpid(). On 1.8 it is implemented as follows (simplified version):

retry: int result = waitpid(..., WNOHANG); if (result < 0) { if (errno == EINTR) { /* Process isn't ready yet. Tell the scheduler and then restart the call. */ rb_thread_polling(); goto retry; } else { ...throw exception... } }

The actual code is actually more optimized than this. For example if there's only a single thread in the system then it calls waitpid() without WNOHANG and just have it block.

Calling the system call in a native OS thread and use I/O to report results

This is probably the most complex way but on 1.8 sometimes you don't have any choice. On 1.9 you should always prefer unlocking the global interpreter lock over this method.

Create a pipe, then spawn a native OS thread which calls the system call. When the system call is done, have your native thread report the result back via the pipe. On the Ruby side, use rb_thread_wait_fd() and friends to block on the pipe and then receive the results. Be sure to join the thread after you've read the result because rb_thread_wait_fd() does not necessarily block until there is data, so when rb_thread_wait_fd() returns it is not guaranteed that the thread has returned yet.

Another thing to watch out for is that your thread must not refer to data that's on the Ruby thread's stack. This is because Ruby overwrites the main OS thread's C stack upon context switching to another Ruby thread. For example code like this is not OK:

static void thread_main(int *value) { /* 'value' here refers to the 'value' variable on foobar's stack, but * that data is overwritten when Ruby context switches, so we * really can't use 'value' here! */ } /* Native extension Ruby method. */ static void foobar() { int value = 1234; thread_t thread = create_a_thread(thread_main, &value); ...do something which can cause a Ruby thread context switch... join_thread(thread); }

To pass data to the thread, you should put the data on the heap instead of the stack. This is OK:

typedef struct { ... } Data; static void thread_main(Data *data) { /* 'data' is safe to access. */ } /* Native extension Ruby method. */ static void foobar() { Data *data = malloc(sizeof(Data)); thread_t thread = create_a_thread(thread_main, data); ...do something which can cause a Ruby thread context switch... join_thread(thread); free(data); }

Heavy CPU computations

Not only blocking system calls can block other threads, CPU-heavy computation code can also do that. While executing non-Ruby-API C code, context switching to other threads is not possible. Calls to Ruby APIs may sometimes cause context switching. However there are several ways to make context switching possible while running CPU-heavy computations.

Unlocking the global interpreter lock

This only works on 1.9. Unlock the global interpreter lock and then call the computation code, and relock when done. Consider BCrypt-Ruby as an example. BCrypt is a very heavy hashing algorithm used for securely hashing passwords; depending on the configured cost it could need several minutes to calculate a hash. We've recently patched BCrypt-Ruby to unlock the global interpreter lock while running the BCrypt algorithm, so that when you run BCrypt-Ruby in multiple threads the algorithms can be spread across multiple CPU cores.

However, be aware of the fact that unlocking and relocking the global interpreter lock comes with some overhead as well. Unlocking and relocking the global interpreter lock is only worth it if you know that the computation is going to take a while (say, longer than 50 msec). If the computation time is short then you will actually make your code slower because of all the locking overhead. Therefore BCrypt-Ruby only unlocks the global interpreter lock if the BCrypt cost is set to 9 or higher.

Explicit yielding

You can call rb_thread_schedule() once in a while to force context switching to another thread. However this approach does not allow your code to make use of multiple cores even if you're on 1.9.

Running the C code in a native OS thread

This is pretty much the same approach as described by "Calling the system call in a native OS thread and use I/O to report results". In my opinion, unless your computation takes a very long time, implementing this is almost never worth the trouble. For BCrypt-Ruby we didn't bother: if you want multi-core support in BCrypt-Ruby you need to be on 1.9.

TRAP_BEG/TRAP_END and the global interpreter lock

TRAP_BEG and TRAP_END

On 1.8, you should surround system calls with calls to TRAP_BEG and TRAP_END. TRAP_BEG performs some preparation work. TRAP_END performs a variety of things:

It checks whether there are any pending signals, e.g. whether the user pressed Ctrl-C. If so it will raise an appropriate SignalException. It also calls the scheduler if a certain amount of time has been spent on the current thread.

On 1.9 TRAP_BEG and TRAP_END are macros that unlock and lock the global interpreter lock. However these macros are deprecated and are likely to disappear in the future so you should not use them on 1.9. Instead, you should use rb_thread_blocking_region().

On 1.9 TRAP_BEG and TRAP_END are defined in ruby/backward/rubysig.h.

rb_thread_blocking_region()

This is a 1.9-specific function which allows you to call a function outside the global interpreter lock. Its declaration is as follows:

rb_thread_blocking_region(rb_blocking_function_t *func, void *data1, rb_unblock_function_t *ubf, void *data2);

func is a pointer to a function that is to be called outside the global interpreter lock. This function must look similar to:

VALUE foobar(void *data)

The data passed via the data1 parameter is passed to the function.

ubf is either RUBY_UBF_IO (indicating that you're performing some kind of I/O operation) or RUBY_UBF_PROCESS (indicating that you're calling some kind of process management system call). However I'm not sure what this parameter exactly does. data2 is supposedly passed to ubf when it's called.

The return value of this function is the return value of func .

Global interpreter lock caveats

Do not call any Ruby API functions while the global interpreter lock is unlocked! No rb_yield(), rb_str_new(), or anything. The entirety of the Ruby API is only safe to call when the global interpreter lock is obtained.