Solid-state storage devices and the block layer

LWN.net needs you! Without subscribers, LWN would simply not exist. Please consider signing up for a subscription and helping to keep LWN publishing

Over the last few years, it has become clear that one of the most pressing scalability problems faced by Linux is being driven by solid-state storage devices (SSDs). The rapid increase in performance offered by these devices cannot help but reveal any bottlenecks in the Linux filesystem and block layers. What has been less clear, at times, is what we are going to do about this problem. In his LinuxCon Japan talk, block maintainer Jens Axboe described some of the work that has been done to improve block layer scalability and offered a view of where things might go in the future.

While workloads will vary, Jens says, most I/O patterns are dominated by random I/O and relatively small requests. Thus, getting the best results requires being able to perform a large number of I/O operations per second (IOPS). With a high-end rotating drive (running at 15,000 RPM), the maximum rate possible is about 500 IOPS. Most real-world drives, of course, will have significantly slower performance and lower I/O rates.

SSDs, by eliminating seeks and rotational delays, change everything; we have gone from hundreds of IOPS to hundreds of thousands of IOPS in a very short period of time. A number of people have said that the massive increase in IOPS means that the block layer will have to become more like the networking layer, where every bit of per-packet overhead has been squeezed out over time. But, as Jens points out, time is not in great abundance. Networking technology went from 10Mb/s in the 1980's to 10Gb/s now, the better part of 30 years later. SSDs have forced a similar jump (three orders of magnitude) in a much shorter period of time - and every indication suggests that devices with IOPS rates in the millions are not that far away. The result, says Jens, is "a big problem."

This problem pops up in a number of places, but it usually comes down to contention for shared resources. Locking overhead which is tolerable at 500 IOPS is crippling at 500,000. There are also problems with contention at the hardware level too; vendors of storage controllers have been caught by surprise by SSDs and are having to scramble to get their performance up to the required levels. The growth of multicore systems naturally makes things worse; such systems can create contention problems throughout the kernel, and the block layer is no exception. So much of the necessary work comes down to avoiding contention.

Before that, though, some work had to be done just to get the block layer to recognize that it is dealing with an SSD and react accordingly. Traditionally, the block layer has been driven by the need to avoid head seeks; the use of quite a bit of CPU time could be justified if it managed to avoid a single seek. SSDs - at least the good ones - care a lot less about seeks, so expending a bunch of CPU time to avoid them no longer makes sense. There are various ways of detecting SSDs in the hardware, but they don't always work, especially with the lower-quality devices. So the block layer exports a flag under

/sys/block/<device>/queue/rotational

which can be used to override the system's notion of what kind of storage device it is dealing with.

Improving performance with SSDs can be a challenging task. There is no single big bottleneck which is causing performance problems; instead, there are numerous small things to fix. Each fix yields a bit of progress, but it mostly serves to highlight the next problem. Additionally, performance testing is hard; results are often not reproducible and can be perturbed by small changes. This is especially true on larger systems with more CPUs. Power management can also get in the way of the generation of consistent results.

One of the first things to address on an SSD was queue plugging. On a rotating disk, the first I/O operation to show up in the request queue will cause the queue to be "plugged," meaning that no operations will actually be dispatched to the hardware. The idea behind plugging is that, by allowing a little time for additional I/O requests to arrive, the block layer will be able to merge adjacent requests (reducing the operation count) and sort them into an optimal order, increasing performance. Performance on SSDs tends not to benefit from this treatment, though there is still a little value to merging requests. Dropping (or, at least, reducing) plugging not only eliminates a needless delay; it also reduces the need to take the queue lock in the process.

Then, there is the issue of request timeouts. Like most I/O code, the block layer needs to notice when an I/O request is never completed by the device. That detection is done with timeouts. The old implementation involved a separate timeout for each outstanding request, but that clearly does not scale when the number of such requests can be huge. The answer was to go to a per-queue timer, reducing the number of running timers considerably.

Block I/O operations, due to their inherently unpredictable execution times, have traditionally contributed entropy to the kernel's random number pool. There is a problem, though: the necessary call to add_timer_randomness() has to acquire a global lock, causing unpleasant systemwide contention. Some work was done to batch these calls and accumulate randomness on a per-CPU basis, but, even when batching 4K operations at a time, the performance cost was significant. On top of it all, it's not really clear that using an SSD as an entropy source makes a lot of sense. SSDs lack mechanical parts moving around, so their completion times are much more predictable. Still, for the moment, SSDs contribute to the entropy pool by default; administrators who would like to change that behavior can do so by changing the queue/add_random sysfs variable.

There are other locking issues to be dealt with. Over time, the block layer has gone from being protected by the big kernel lock to a block-level lock, then to a per-disk lock, but lock contention is still a problem. The I/O scheduler adds contention of its own, especially if it is performing disk-level accounting. Interestingly, contention for the locks themselves is not usually the problem; it's not that the locks are being held for too long. The big problem is the cache-line bouncing caused by moving the lock between processors. So the traditional technique of dropping and reacquiring locks to reduce lock contention does not help here - indeed, it makes things worse. What's needed is to avoid taking the lock altogether.

Block requests enter the system via __make_request() , which is responsible for getting a request (represented by a BIO structure) onto the queue. Two lock acquisitions are required to do this job - three if the CFQ I/O scheduler is in use. Those two acquisitions are the result of a lock split done to reduce contention in the past; that split, when the system is handling requests at SSD speeds, makes things worse. Eliminating it led to a roughly 3% increase in IOPS with a reduction in CPU time on a 32-core system. It is, Jens says, a "quick hack," but it demonstrates the kind of changes that need to be made.

The next step for this patch is to drop the I/O request allocation batching - a mechanism added to increase throughput on rotating drives by allowing the simultaneous submission of multiple requests. Jens also plans to drop the allocation accounting code, which tracks the number of requests in flight at any given time. Counting outstanding I/O operations requires global counters and the associated contention, but it can be done without most of the time. Some accounting will still be done at the request queue level to ensure that some control is maintained over the number of outstanding requests. Beyond that, there is some per-request accounting which can be cleaned up and, Jens thinks, request completion can be made completely lockless. He hopes that this work will be ready for merging into 2.6.38.

Another important technique for reducing contention is keeping processing on the same CPU as often as possible. In particular, there are a number of costs which are incurred if the CPU which handles the submission of a specific I/O request is not the CPU which handles that request's completion. Locks are bounced between CPUs in an unpleasant way, and the slab allocator tends not to respond well when memory allocated on one processor is freed elsewhere in the system. In the networking layer, this problem has been addressed with techniques like receive packet steering, but, unlike some networking hardware, block I/O controllers are not able to direct specific I/O completion interrupts to specific CPUs. So a different solution was required.

That solution took the form of smp_call_function() , which performs fast cross-CPU calls. Using smp_call_function() , the block I/O completion code can direct the completion of specific requests to the CPU where those requests were initially submitted. The result is a relatively easy performance improvement. A dedicated administrator who is willing to tweak the system manually can do better, but that takes a lot of work and the solution tends to be fragile. This code - which was merged back in 2.6.27 and made the default in 2.6.32 - is an easier way that takes away a fair amount of the pain of cross-CPU contention. Jens noted with pride that the block layer was not chasing the networking code with regard to completion steering - the block code had it first.

On the other hand, the blk-iopoll interrupt mitigation code was not just inspired by the networking layer - some of the code was "shamelessly stolen" from there. The blk-iopoll code turns off completion interrupts when I/O traffic is high and uses polling to pick up completed events instead. On a test system, this code reduced 20,000 interrupts/second to about 1,000. Jens says that the results are less conclusive on real-world systems, though.

An approach which "has more merit" is "context plugging," a rework of the queue plugging code. Currently, queue plugging is done implicitly on I/O submission, with an explicit unplug required at a later time. That has been the source of a lot of bugs; forgetting to unplug queues is a common mistake to make. The plan is to make plugging and unplugging fully implicit, but give I/O submitters a way to inform the block layer that more requests are coming soon. It makes the code more clear and robust; it also gets rid of a lot of expensive per-queue state which must be maintained. There are still some problems to be solved, but the code works, is "tasty on many levels," and yields a net reduction of some 600 lines of code. Expect a merge in 2.6.38 or 2.6.39.

Finally, there is the "weird territory" of a multiqueue block layer - an idea which, once again, came from the networking layer. The creation of multiple I/O queues for a given device will allow multiple processors to handle I/O requests simultaneously with less contention. It's currently hard to do, though, because block I/O controllers do not (yet) have multiqueue support. That problem will be fixed eventually, but there will be some other challenges to overcome: I/O barriers will become significantly more complicated, as will per-device accounting. All told, it will require some major changes to the block layer and a special I/O scheduler. Jens offered no guidance as to when we might see this code merged.

The conclusion which comes from this talk is that the Linux block layer is facing some significant challenges driven by hardware changes. These challenges are being addressed, though, and the code is moving in the necessary direction. By the time most of us can afford a system with one of those massive, 1 MIOPS arrays on it, Linux should be able to use it to its potential.

