Why Zapier Doesn't Use Gevent (Yet)

Rob Golding-Day / January 10, 2016

At Zapier, we make a lot of outbound API requests: almost 500 per second. That means we spend a long time waiting for those APIs to get back to us, particularly if a popular service is experiencing issues. In fact, if you were to add up all the time we spend waiting across our fleet, it’s the equivalent of 4 minutes of idle time per second of real time!
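To put those two figures side by side, here is a rough back-of-the-envelope calculation. It assumes that all of the waiting is attributable to outbound API requests and treats "almost 500" as a round 500, so the results are only ballpark estimates.

```python
# Figures quoted above, rounded; both are approximations.
requests_per_second = 500             # "almost 500" outbound requests per second
waiting_seconds_per_second = 4 * 60   # 4 minutes of waiting accrued per second of real time

# Assumption: all of that waiting is spent on those outbound requests.
avg_wait_per_request = waiting_seconds_per_second / requests_per_second

# Accruing N seconds of waiting per second means roughly N requests
# are in flight at any given moment (Little's law).
requests_in_flight = waiting_seconds_per_second

print(f"average wait per request: ~{avg_wait_per_request:.2f}s")
print(f"requests in flight at once: ~{requests_in_flight}")
```

Under those assumptions, the average request keeps a worker waiting for roughly half a second.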

This type of workload is, typically, very well suited to asynchronous event-based implementations such as Python’s Gevent. Our backend is based on Python & Django, so we thought Gevent might work well for us. We did some research into the benefits we’d see by switching our polling system to Gevent. Here's what we discovered—and why we won’t be switching just yet!

Some Background

At its core, Zapier is a fairly standard monolithic Python app, with some work being done in microservices—a setup that surely wouldn’t look out of place in many other Python shops! We use Celery for asynchronous processing, which is obviously the mainstay of our workload. Everything that’s needed to make a Zap run happens inside a Celery task.

The Problem

To ensure that we have enough bandwidth to consume all of our background tasks, we run 92 Celery workers on each server. Each of those processes uses over 300MB of RAM, which puts a hard limit on how many processes we can run across our infrastructure. As we scale up, we have to add more servers with plenty of RAM to house the increasing number of processes that are required—even if we have spare CPU cycles that could do more work.
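As a quick illustration of why RAM becomes the ceiling, the figures above can be multiplied out. Since each worker uses "over 300MB", the result is a lower bound rather than an exact measurement.

```python
workers_per_server = 92
mb_per_worker = 300  # "over 300MB" per process, so treat this as a minimum

min_ram_mb = workers_per_server * mb_per_worker
min_ram_gb = min_ram_mb / 1000

print(f"at least {min_ram_mb} MB (~{min_ram_gb:.1f} GB) of RAM per server for workers alone")
```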

Beyond the RAM issues, it’s not possible to have those processes schedule work cooperatively. By random chance, they might all be idle, waiting for API responses—or they might be simultaneously executing some CPU-intensive code, causing the server to become overloaded.

In an ideal world, we could allow real work to be performed in those 4 minutes of downtime we have every second. That would let us raise the ceiling that RAM usage places on our infrastructure, and give us more confidence that individual servers won’t get overloaded.

In summary, our workload involves a large amount of I/O but we’re constrained by the amount of RAM available on our servers. We also have some spare CPU capacity, but we can’t easily make use of it.

Enter Gevent

Gevent, to vastly oversimplify it, implements cooperative multitasking by letting a single Python process work on multiple tasks concurrently (though not in parallel) via lightweight threads called Greenlets.

The main difference between Greenlets and traditional threads is that they can’t context-switch without explicit permission to do so. With Gevent, this usually happens right after a request is made over the network, since we can expect to wait a little while for the result. You can allow a context switch to happen anywhere, though, by using gevent.sleep(0) at any point in your code.

Greenlets help make efficient use of the time between making a request and receiving the response, by allowing work to continue whilst we wait for the I/O to complete.
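A minimal sketch of that idea follows. The fake_api_call function is hypothetical, and gevent.sleep stands in for the time spent waiting on a real network response; it is not how our tasks are actually written.

```python
import time

import gevent


def fake_api_call(name, wait_seconds):
    # gevent.sleep yields to other Greenlets, just as waiting on a
    # (monkey-patched) socket would during a real request.
    gevent.sleep(wait_seconds)
    return name


def run_concurrently(waits):
    started = time.monotonic()
    jobs = [
        gevent.spawn(fake_api_call, f"api-{i}", wait)
        for i, wait in enumerate(waits)
    ]
    gevent.joinall(jobs)  # block until every Greenlet has finished
    elapsed = time.monotonic() - started
    return [job.value for job in jobs], elapsed


results, elapsed = run_concurrently([0.2, 0.2, 0.2])
print(results, f"finished in {elapsed:.2f}s")
```

Because the three waits overlap inside a single process, the total elapsed time should be close to the longest individual wait rather than the sum of all three.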

For us, the main benefit of Gevent would be that each Python process can do more work than it could before. This means we don’t need as much RAM to scale up, helping increase the utilization of our existing worker servers.

Problems with Gevent

Gevent is extremely powerful, but it's not a panacea. Here are some of the issues we encountered:

Third-Party Libraries

The first problem with using Gevent is third-party library support. To use it, any libraries that make calls over the network must either use pure Python (so that they can use the monkey-patched socket module) or be written specifically with Gevent support.
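For illustration, this is roughly what the monkey-patching step looks like. It is a sketch using only the socket patch; many applications call monkey.patch_all() instead, as early as possible in start-up.

```python
from gevent import monkey

# Replace the blocking standard-library socket with gevent's cooperative version.
monkey.patch_socket()

import socket

import gevent.socket

# Pure-Python libraries that import `socket` now get the cooperative
# implementation. C extensions that open their own sockets do not go
# through this module, so they are unaffected by the patch.
socket_is_patched = monkey.is_module_patched("socket")
socket_is_cooperative = socket.socket is gevent.socket.socket

print(socket_is_patched, socket_is_cooperative)
```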

For us, that would mean switching out our existing MySQL database connection library for one that is less mature and battle-tested. Though not a deal-breaker, it caused us some amount of worry.

Greenlet-Safety

Secondly, all code must be written with Greenlet-safety in mind. Whilst that's easier than making code thread-safe (since we know precisely when the context switches can happen), it still requires a significant amount of work to ensure that no shared resources are accessed in areas of the codebase that may context-switch. For example, accessing a global variable in the same portion of code that makes an HTTP request could cause conflicts.
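The kind of conflict described above can be reproduced in a few lines. In this sketch the global counter and both increment functions are hypothetical, and gevent.sleep(0) stands in for the HTTP request where a context switch can occur.

```python
import gevent
from gevent.lock import BoundedSemaphore

counter = 0
lock = BoundedSemaphore(1)


def unsafe_increment():
    global counter
    current = counter          # read shared state
    gevent.sleep(0)            # stand-in for an HTTP request: other Greenlets run here
    counter = current + 1      # write back a value that may now be stale


def safe_increment():
    global counter
    with lock:                 # only one Greenlet at a time may read-then-write
        current = counter
        gevent.sleep(0)
        counter = current + 1


def run(increment, n=10):
    global counter
    counter = 0
    gevent.joinall([gevent.spawn(increment) for _ in range(n)])
    return counter


unsafe_total = run(unsafe_increment)
safe_total = run(safe_increment)
print(f"unsafe: {unsafe_total}, safe: {safe_total}")
```

The unsafe version loses updates because every Greenlet reads the counter before any of them writes it back. The lock fixes that, at the cost of serializing the Greenlets around the simulated request.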

At Zapier, we're sometimes forced to use a third-party library to communicate with an API (generally, those that use something like Thrift instead of HTTP). In those cases, we can't be 100% sure their code is Greenlet-safe without auditing it ourselves.

Connection Pooling

Thirdly, connections to backend services—such as the database and memcached—must be shared between Greenlets in a process, lest the number of concurrent connections spiral out of control. The way to handle this is with connection pooling, which allows a bounded number of connections to be used by many Greenlets, so long as they are returned to the pool as soon as they are no longer needed.

Whilst this works, it significantly increases code complexity, and requires that developers are aware of the restrictions around certain services. For example, forgetting to return a checked-out connection to MySQL before making an outbound HTTP request could have a knock-on effect that grinds the whole task queue to a halt.
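A bounded pool of this kind can be sketched with gevent.queue.Queue. FakeConnection and ConnectionPool are hypothetical names used for illustration only, not the classes of any real database library.

```python
import contextlib

import gevent
from gevent.queue import Queue


class FakeConnection:
    """Hypothetical stand-in for a real database connection."""

    def __init__(self, conn_id):
        self.conn_id = conn_id

    def query(self, sql):
        gevent.sleep(0.01)  # simulates waiting on the database
        return f"conn {self.conn_id}: {sql}"


class ConnectionPool:
    def __init__(self, size):
        self._free = Queue()
        for conn_id in range(size):
            self._free.put(FakeConnection(conn_id))
        self.in_use = 0
        self.peak_in_use = 0

    @contextlib.contextmanager
    def connection(self):
        conn = self._free.get()  # blocks only this Greenlet until one is free
        self.in_use += 1
        self.peak_in_use = max(self.peak_in_use, self.in_use)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self.in_use -= 1
            self._free.put(conn)


pool = ConnectionPool(size=3)


def task(task_id):
    with pool.connection() as conn:
        return conn.query(f"SELECT {task_id}")


jobs = [gevent.spawn(task, i) for i in range(20)]
gevent.joinall(jobs)
results = [job.value for job in jobs]
print(f"{len(results)} tasks ran, peak connections in use: {pool.peak_in_use}")
```

The context manager is the design choice that matters here: it keeps the checkout as short as the with block, so a slow outbound request made after the block does not hold a connection hostage.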

Blocking Operations

Finally, and most fatally for Zapier, a single Greenlet can block the entire process if it does not yield in a reasonable length of time. On our backend, we sometimes perform a significant amount of CPU intensive data transformation and deduplication. Whilst that work is being carried out, no other Greenlets are able to run, effectively stalling the process.

Though we can work around this issue by explicitly allowing the Greenlet to yield, that means two things for us:

We must be certain our code is Greenlet-safe

We must deal with increased code complexity

Specifically, developers working on code inside the critical path must be aware that they cannot rely on the usual single-threaded model. Access to shared resources must be synchronized, and yielding often (via gevent.sleep(0)) is necessary to prevent locking up the queue.
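The effect of a non-yielding Greenlet, and of the gevent.sleep(0) workaround, can be seen in a small experiment. The crunch function is a hypothetical stand-in for CPU-intensive work, and the heartbeat Greenlet represents everything else that would like a turn.

```python
import gevent


def crunch(n, yield_every=None):
    """CPU-bound loop that optionally yields every `yield_every` iterations."""
    total = 0
    for i in range(n):
        total += i * i
        if yield_every and i % yield_every == 0:
            gevent.sleep(0)  # explicitly let other Greenlets run
    return total


def heartbeats_during_crunch(yield_every=None):
    state = {"ticks": 0, "done": False, "ticks_at_finish": 0}

    def heartbeat():
        while not state["done"]:
            state["ticks"] += 1
            gevent.sleep(0)

    def worker():
        crunch(100_000, yield_every)
        state["done"] = True
        # How many turns the heartbeat got before the CPU work finished.
        state["ticks_at_finish"] = state["ticks"]

    gevent.joinall([gevent.spawn(heartbeat), gevent.spawn(worker)])
    return state["ticks_at_finish"]


blocked = heartbeats_during_crunch()
cooperative = heartbeats_during_crunch(yield_every=1000)
print(f"heartbeats while blocked: {blocked}, while yielding: {cooperative}")
```

Without yields, the heartbeat gets at most its initial turn before the worker monopolizes the process. With periodic yields it keeps running, but every yield point is also a place where shared state can change underneath the worker.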

Our Results

After a couple of months of on-and-off work, we successfully implemented all of the above changes to run our backend infrastructure on Gevent. We were able to increase worker utilization by about 20%, whilst reducing RAM usage. With this setup, we could switch to instance types that favour CPU-heavy workloads and reduce the overall number required to handle our base load, saving us money on our hosting costs!

However, we also suffered an approximate 50% slowdown in task execution times, because so many of our operations block the process.

In the end, we decided that the increase in code complexity and the slowdown in task execution times weren't worth the RAM and hosting cost savings, and we put off switching to Gevent for now.

In the future, it’s highly likely we'll make the transition. We’re starting to move more of our core workload out into microservices, and as we do so, Gevent becomes more and more attractive. Until then, though, your Zaps will continue to be powered by our single-threaded Python processes!