The @Cloudflare team just pushed a change that improves our network's performance significantly, especially for particularly slow outlier requests. How much faster? We estimate we're saving the Internet ~54 years *per day* of time we'd all otherwise be waiting for sites to load. — Matthew Prince (@eastdakota) June 28, 2018

10 million websites, apps and APIs use Cloudflare to give their users a speed boost. At peak we serve more than 10 million requests a second across our 151 data centers. Over the years we’ve made many modifications to our version of NGINX to handle our growth. This is blog post is about one of them.

How NGINX works

NGINX is one of the programs that popularized using event loops to solve the C10K problem. Every time a network event comes in (a new connection, a request, or a notification that we can send more data, etc.) NGINX wakes up, handles the event, and then goes back to do whatever it needs to do (which may be handling other events). When an event arrives, data associated with the event is already ready, which allows NGINX to efficiently handle many requests simultaneously without waiting.

num_events = epoll_wait(epfd, /*returned=*/events, events_len, /*timeout=*/-1); // events is list of active events // handle event[0]: incoming request GET http://example.com/ // handle event[1]: send out response to GET http://cloudflare.com/

For example, here's what a piece of code could look like to read data from a file descriptor:

// we got a read event on fd while (buf_len > 0) { ssize_t n = read(fd, buf, buf_len); if (n < 0) { if (errno == EWOULDBLOCK || errno == EAGAIN) { // try later when we get a read event again } if (errno == EINTR) { continue; } return total; } buf_len -= n; buf += n; total += n; }

When fd is a network socket, this will return the bytes that have already arrived. The final call will return EWOULDBLOCK which means we have drained the local read buffer, so we should not read from the socket again until more data becomes available.

Disk I/O is not like network I/O

When fd is a regular file on Linux, EWOULDBLOCK and EAGAIN never happens, and read always waits to read the entire buffer. This is true even if the file was opened with O_NONBLOCK . Quoting open(2) :

Note that this flag has no effect for regular files and block devices

In other words, the code above basically reduces to:

if (read(fd, buf, buf_len) > 0) { return buf_len; }

Which means that if an event handler needs to read from disk, it will block the event loop until the entire read is finished, and subsequent event handlers are delayed.

This ends up being fine for most workloads, because reading from disk is usually fast enough, and much more predictable compared to waiting for a packet to arrive from network. That's especially true now that everyone has an SSD, and our cache disks are all SSDs. Modern SSDs have very low latency, typically in 10s of µs. On top of that, we can run NGINX with multiple worker processes so that a slow event handler does not block requests in other processes. Most of the time, we can rely on NGINX's event handling to service requests quickly and efficiently.

SSD performance: not always what’s on the label

As you might have guessed, these rosy assumptions aren’t always true. If each read always takes 50µs then it should only take 2ms to read 0.19MB in 4KB blocks (and we read in larger blocks). But our own measurements showed that our time to first byte is sometimes much worse, particularly at 99th and 999th percentile. In other words, the slowest read per 100 (or per 1000) reads often takes much longer.

SSDs are very fast but they are also notoriously complicated. Inside them are computers that queue up and re-order I/O, and also perform various background tasks like garbage collection and defragmentation. Once in a while, a request gets slowed down enough to matter. My colleague Ivan Babrou ran some I/O benchmarks and saw read spikes of up to 1 second. Moreover, some of our SSDs have more performance outliers than others. Going forward we will consider performance consistency in our SSD purchases, but in the meantime we need to have a solution for our existing hardware.

Spreading the load evenly with SO_REUSEPORT

An individual slow response once in a blue moon is difficult to avoid, but what we really don't want is a 1 second I/O blocking 1000 other requests that we receive within the same second. Conceptually NGINX can handle many requests in parallel but it only runs 1 event handler at a time. So I added a metric that measures this:

gettimeofday(&start, NULL); num_events = epoll_wait(epfd, /*returned=*/events, events_len, /*timeout=*/-1); // events is list of active events // handle event[0]: incoming request GET http://example.com/ gettimeofday(&event_start_handle, NULL); // handle event[1]: send out response to GET http://cloudflare.com/ timersub(&event_start_handle, &start, &event_loop_blocked);

p99 of event_loop_blocked turned out to be more than 50% of our TTFB. Which is to say, half of the time it takes to serve a request is a result of the event loop being blocked by other requests. event_loop_blocked only measures about half of the blocking (because delayed calls to epoll_wait() are not measured) so the actual ratio of blocked time is much higher.

Each of our machines run NGINX with 15 worker processes, which means one slow I/O should only block up to 6% of the requests. However, the events are not evenly distributed, with the top worker taking 11% of the requests (or twice as many as expected).

SO_REUSEPORT can solve the uneven distribution problem. Marek Majkowski has previously written about the downside in the context of other NGINX instances, but that downside mostly doesn't apply in our case since upstream connections in our cache process are long-lived, so a slightly higher latency in opening the connection is negligible. This single configuration change to enable SO_REUSEPORT improved peak p99 by 33%.

Moving read() to thread pool: not a silver bullet

A solution to this is to make read() not block. In fact, this is a feature that's implemented in upstream NGINX! When the following configuration is used, read() and write() are done in a thread pool and won't block the event loop:

aio threads; aio_write on;

However when we tested this, instead of 33x response time improvement, we actually saw a slight increase in p99. The difference was within margin of error but we were quite discouraged by the result and stopped pursuing this option for a while.

There are a few reasons why we didn’t see the level of improvements that NGINX saw. In their test, they were using 200 concurrent connections to request files that were 4MB in size, which were residing on spinning disks. Spinning disks increase I/O latency so it makes sense that an optimization that helps latency would have more dramatic effect.

We are also mostly concerned with p99 (and p999) performance. Solutions that help the average performance don't necessarily help with outliers.

Finally, in our environment, typical file sizes are much smaller. 90% of our cache hits are for files smaller than 60KB. Smaller files mean fewer occasions to block (we typically read the entire file in 2 reads).

If we look at the disk I/O that a cache hit has to do:

// we got a request for https://example.com which has cache key 0xCAFEBEEF fd = open("/cache/prefix/dir/EF/BE/CAFEBEEF", O_RDONLY); // read up to 32KB for the metadata as well as the headers // done in thread pool if "aio threads" is on read(fd, buf, 32*1024);

32KB isn't a static number, if the headers are small we need to read just 4KB (we don't use direct IO so kernel will round up to 4KB). The open() seems innocuous but it's actually not free. At a minimum the kernel needs to check if the file exists and if the calling process has permission to open it. For that it would have to find the inode of /cache/prefix/dir/EF/BE/CAFEBEEF , and to do that it would have to look up CAFEBEEF in /cache/prefix/dir/EF/BE/ . Long story short, in the worst case the kernel has to do the following lookups:

/cache /cache/prefix /cache/prefix/dir /cache/prefix/dir/EF /cache/prefix/dir/EF/BE /cache/prefix/dir/EF/BE/CAFEBEEF

That's 6 separate reads done by open() compared to just 1 read done by read() ! Fortunately, most of the time lookups are serviced by the dentry cache and don't require trips to the SSDs. But clearly having read() done in thread pool is only half of the picture.

The coup de grâce: non-blocking open() in thread pools

So I modified NGINX to do most of open() inside the thread pool as well so it won't block the event loop. And the result (both non-blocking open and non-blocking read):

On June 26 we deployed our changes to 5 of our busiest data centers, followed by world wide roll-out the next day. Overall peak p99 TTFB improved by a factor of 6. In fact, adding up all the time from processing 8 million requests per second, we saved the Internet 54 years of wait time every day.

We've submitted our work to upstream. Interested parties can follow along.

Our event loop handling is still not completely non-blocking. In particular, we still block when we are caching a file for the first time (both the open(O_CREAT) and rename() ), or doing revalidation updates. However, those are rare compared to cache hits. In the future we will consider moving those off of the event loop to further improve our p99 latency.

Conclusion

NGINX is a powerful platform, but scaling extremely high I/O loads on linux can be challenging. Upstream NGINX can offload reads in separate threads, but at our scale we often need to go one step further. If working on challenging performance problems sounds exciting to you, apply to join our team in San Francisco, London, Austin or Champaign.