Facebook and the kernel

As one of the plenary sessions on the first day of the Linux Storage, Filesystem, and Memory Management (LSFMM) Summit, Btrfs developer Chris Mason presented on how his new employer, Facebook, uses the Linux kernel. He shared some of the eye-opening numbers that demonstrate just how much processing Facebook does using Linux, along with some of the "pain points" the company has with the kernel. Many of those pain points are shared with the user-space database woes that were presented in an earlier session (and will be discussed further at the Collaboration Summit that follows LSFMM), he said, so he mostly concentrated on other problem areas.

Architecture and statistics

In a brief overview of Facebook's architecture, he noted that there is a web tier that is CPU- and network-bound. It handles requests from users as well as sending replies. Behind that is a memcached tier that caches data, mostly from MySQL queries. Those queries are handled by a storage tier that is a collection of several different database systems: MySQL, Hadoop, RocksDB, and others.

Within Facebook, anyone can look at and change the code in its source repositories. The facebook.com site has its code updated twice daily, he said, so the barrier to getting new code in the hands of users is low. Those changes can be fixes or new features.

As an example, he noted that the "Look Back" videos, which Facebook created for each user by reviewing all of their posts to the service, added a huge amount of data and required a lot more network bandwidth. The process of creating and serving all of those videos was the topic of a Facebook engineering blog post. In all, 720 million videos were created, which required an additional 11 petabytes of storage and consumed 450 Gb/second of peak network bandwidth as people viewed the videos. The Look Back feature was conceived, provisioned, and deployed in only 30 days, he said.
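To put those figures in perspective, a quick calculation gives the average storage per video and the peak bandwidth in bytes. This is only a rough sketch: it assumes decimal units (1 PB = 10^15 bytes, 1 Gb = 10^9 bits) and that the whole 11 petabytes is attributable to the videos, which may well include replicas or multiple encodings.

```python
# Back-of-the-envelope figures from the numbers quoted in the talk.
# Assumption: decimal units (1 PB = 10**15 bytes, 1 Gb = 10**9 bits).
VIDEOS = 720_000_000
STORAGE_BYTES = 11 * 10**15
PEAK_BITS_PER_SECOND = 450 * 10**9

# Average storage consumed per video, in bytes.
bytes_per_video = STORAGE_BYTES / VIDEOS
# Peak network bandwidth converted from bits to bytes per second.
peak_bytes_per_second = PEAK_BITS_PER_SECOND / 8

print(f"storage per video: {bytes_per_video / 10**6:.1f} MB")      # 15.3 MB
print(f"peak bandwidth: {peak_bytes_per_second / 10**9:.2f} GB/s")  # 56.25 GB/s
```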

The code changes quickly, so when performance or other problems crop up, he and other kernel developers can tell others in the company that "you're doing it wrong". In fact, he said, "they love that". It does mean that he has to come up with concrete suggestions on how to do it right, but Facebook is not unwilling to change its code.

Facebook runs a variety of kernel versions. The "most conservative" hosts run a 2.6.38-based kernel. Others run the 3.2 stable series with roughly 250 patches. Other servers run the 3.10 stable series with around 60 patches. Most of the patches are in the networking and tracing subsystems, with a few memory-management patches as well.

One thing that seemed to surprise Mason was the high failure tolerance that the Facebook production system has. He mentioned the 3.10 pipe race condition that Linus Torvalds fixed. It is a "tiny race", he said, but Facebook was hitting it (and recovering from it) 500 times per day. The architecture of the system is such that it could absorb that kind of failure rate without users noticing anything wrong.

Pain points

Mason asked around within Facebook to try to determine what the company's worst problem with the kernel is. In the end, two features were mentioned most frequently: stable pages and the completely fair queueing (CFQ) I/O scheduler. "I hope we never find those guys", he said with a laugh, since Btrfs implements stable pages. In addition, James Bottomley noted that Facebook already employs another CFQ developer (Jens Axboe).

Another area that was problematic for Facebook is surprises with buffered I/O latency, especially for append-only database files. Most of the time, those writes go fast, but sometimes they are quite slow. He would like to see the kernel avoid latency spikes like that.

He would like to see kernel-style spinlocks be available from user space. Rik van Riel suggested that perhaps POSIX locks could use adaptive locking, which would spin for a short time then switch to sleeping if the lock did not become available quickly. The memcached tier has a kind of user-space spinlock, Mason said, but it is "very primitive compared to the kernel".
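The spin-then-sleep idea van Riel described can be sketched in a few lines. The following is an illustration of the control flow only, not what memcached or any POSIX lock implementation actually does; the `AdaptiveLock` class, the `spin_attempts` knob, and the `run_demo()` helper are all hypothetical names invented for this example.

```python
import threading
import time


class AdaptiveLock:
    """Spin briefly for the lock, then fall back to a blocking (sleeping) acquire."""

    def __init__(self, spin_attempts=100):
        self._lock = threading.Lock()
        self._spin_attempts = spin_attempts  # hypothetical tuning knob

    def acquire(self):
        # Phase 1: spin, hoping the current holder releases the lock quickly.
        for _ in range(self._spin_attempts):
            if self._lock.acquire(blocking=False):
                return
            time.sleep(0)  # give other threads a chance to run
        # Phase 2: stop spinning and sleep until the lock becomes available.
        self._lock.acquire()

    def release(self):
        self._lock.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()


def run_demo(workers=4, iterations=1000):
    """Increment a shared counter from several threads under the lock."""
    lock = AdaptiveLock()
    totals = {"count": 0}

    def worker():
        for _ in range(iterations):
            with lock:
                totals["count"] += 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return totals["count"]


print(run_demo())  # 4000
```

In Python the interpreter lock limits any real benefit from spinning; the point here is only the two-phase structure, where a short bounded spin avoids the cost of sleeping when the lock is held very briefly.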

Fine-grained I/O priorities are another wish-list item for Facebook (and for the PostgreSQL developers as well). There are always cleaners and compaction threads that need to do I/O, but they shouldn't hold off the higher-priority "foreground" I/O. Mason was asked how the priorities would be specified (by I/O operation or by file range, for example) and how fine-grained they needed to be. He replied that either way of specifying the priorities would be reasonable, and that Facebook really only needs two (or a few) priority levels: low and high.

The subject of ionice was raised again. One of the problems with that as a solution is that it only works with the (disabled by Facebook) CFQ scheduler. Bottomley suggested making ionice work with all of the schedulers, which Mason said might help. In order to do that, though, Ted Ts'o noted that the writeback daemon will have to understand the ionice settings.

Another problem area is logging. Facebook logs a lot of data and the logging workloads have to use fadvise() and madvise() to tell the kernel that those pages should not be saved in the page cache. "We should do better than that." Van Riel suggested that the page replacement patches in recent kernels may make things better. Mason said that Facebook does not mind explicitly telling the kernel which processes are sequentially accessing the log files, but continually calling *advise() seems excessive.
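For readers unfamiliar with the pattern Mason was describing, a minimal sketch of a log writer that advises the kernel to drop its pages might look like the following. It assumes Linux and uses Python's `os.posix_fadvise()` wrapper; the function name `append_log_and_drop_cache()` is hypothetical, and Facebook's actual logging code was not shown in the talk.

```python
import os


def append_log_and_drop_cache(path, data):
    """Append data to a log file, then hint that its pages need not stay cached."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        written = os.write(fd, data)
        # Dirty pages cannot be dropped, so flush them to storage first.
        os.fsync(fd)
        # offset=0 and length=0 together mean "the whole file".
        if hasattr(os, "posix_fadvise"):  # not available on every platform
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return written
    finally:
        os.close(fd)
```

A logger built this way has to repeat the advice after every batch of writes, which is the overhead Mason called excessive.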

Josef Bacik has also been working on a small change to Btrfs to allow rate limiting of buffered I/O. It was easy to do in Btrfs, Mason said, but, if the idea pans out, it would move elsewhere for more general availability. Jan Kara was concerned that limiting only buffered I/O would be difficult since there are other kinds of I/O bound for the disk at any given time. Mason agreed, saying that the solution would not be perfect but might help.

Bottomley noted that ionice is an existing API that should be reused to help with these kinds of problems. Similar discussions of using other mechanisms in the past have run aground on "painful arguments about which API is right", he said. Just making balance_dirty_pages() aware of the ionice priority may solve 90% of the problem. Other solutions can be added later.

Mason explained that Facebook stores its logs in a large Hadoop database, but that the tools for finding problems in those logs are fairly primitive: essentially grep. He said that he would "channel Lennart [Poettering] and Kay [Sievers]" briefly to wish for a way to tag kernel messages. Bottomley's suggestion that Mason bring it up with Linus Torvalds at the next Kernel Summit was met with widespread chuckling.

Danger tier

While 3.10 is fairly close to the latest kernels, Mason would like to run even more recent kernels. To that end, he is creating something he calls the "danger tier". He ported the 60 patches that Facebook currently adds to 3.10.x to the current mainline Git tree and is carving out roughly 1000 machines to test that kernel in the web tier. He will be able to gather lots of performance metrics from those systems.

As a simple example of the kinds of data he can gather, he put up a graph of request response times (without any units) that was gathered over 3 days. It showed a steady average response time line all the way at the bottom as well as the ten worst systems' response times. Those not only showed large spikes in the response times, but also that the baseline for those systems was roughly twice that of the average. He can determine which systems those are, ssh in, and try to diagnose what is happening with them.
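The kind of summary shown in that graph, a fleet-wide average alongside the worst-performing hosts, could be computed roughly as follows. This is a sketch under stated assumptions: the `summarize()` function, the host names, and the sample values are all made up for illustration, and nothing here reflects Facebook's actual tooling or data.

```python
from statistics import mean


def summarize(response_times, worst_count=10):
    """Return (fleet_average, worst_hosts) from a {host: [samples]} mapping."""
    # Average response time for each individual host.
    per_host = {host: mean(samples) for host, samples in response_times.items()}
    # Fleet-wide average of the per-host averages.
    fleet_average = mean(list(per_host.values()))
    # Hosts with the highest average response times, slowest first.
    worst = sorted(per_host, key=per_host.get, reverse=True)[:worst_count]
    return fleet_average, [(host, per_host[host]) for host in worst]


# Synthetic, unitless samples: two hosts run at twice the usual baseline.
samples = {f"web{i:03d}": [1.0, 1.1, 0.9] for i in range(20)}
samples["web007"] = [2.0, 2.4, 1.9]
samples["web013"] = [2.1, 2.0, 3.5]

average, worst_hosts = summarize(samples, worst_count=2)
print(f"fleet average: {average:.2f}")
for host, value in worst_hosts:
    print(f"{host}: {value:.2f}")
```

The hosts that come out on top are the ones worth logging into for a closer look.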

He said that was just an example. Eventually he will be able to share more detailed information that can be used to try to diagnose problems in newer kernels and get them fixed more quickly. He asked for suggestions of metrics to gather for the future. With that, his session slot expired.

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]

