Atomic I/O operations

Benefits for LWN subscribers The primary benefit from subscribing to LWN is helping to keep us publishing, but, beyond that, subscribers get immediate access to all site content and access to a number of extra site features. Please sign up today!

According to Btrfs developer Chris Mason, tuning Linux filesystems to work well on solid-state storage devices is a lot like working on an old, clunky car. Lots of work goes into just trying to make the thing run with decent performance. Old cars may have mainly hardware-related problems, but, with Linux, the bottleneck is almost always to be found in the software. It is, he said, hard to give a customer a high-performance device and expect them to actually see that performance in their application. Fixing this problem will require work in a lot of areas. One of those areas, supporting and using atomic I/O operations, shows particular potential.

The problem

To demonstrate the kind of problem that filesystem developers are grappling with, Chris started with a plot from a problematic customer workload on an ext4 filesystem; it showed alternating periods of high and low I/O throughput rates. The source of the problem, in this case, was a combination of (1) overwriting an existing file and (2) a filesystem that had been mounted in the data=ordered mode. That combination causes data blocks to be put into a special list that must get flushed to disk every time that the filesystem commits a transaction. Since the system in question had a fair amount of memory, the normal asynchronous writeback mechanism didn't kick in, so dirty blocks were not being written steadily; instead, they all had to be flushed when the transaction commit happened. The periods of low throughput corresponded to the transaction commits; everything just stopped while the filesystem caught up with its pending work.

In general, a filesystem commit operation involves a number of steps, the first of which is to write all of the relevant file data and wait for that write to complete. Then the critical metadata can be written to the log; once again, the filesystem must wait until that write is done. Finally, the commit block can be written — inevitably followed by a wait. All of those waits are critical for filesystem integrity, but they can be quite hard on performance.

Quite a few workloads — including a lot of database workloads — are especially sensitive to the latency imposed by waits in the filesystem. If the number of waits could be somehow reduced, latency would improve. Fewer waits would also make it possible to send larger I/O operations to the device, with a couple of significant benefits: performance would improve, and, since large chunks are friendlier to a flash-based device's garbage-collection subsystem, the lifetime of the device would also improve. So reducing the number of wait operations executed in a filesystem transaction commit is an important prerequisite for getting the best performance out of contemporary drives.

Atomic I/O operations

One way to reduce waits is with atomic I/O operations — operations that are guaranteed (by the hardware) to either succeed or fail as a whole. If the system performs an atomic write of four blocks to the device, either all four blocks will be successfully written, or none of them will be. In many cases, hardware that supports atomic operations can provide the same integrity guarantees that are provided by waits now, making those waits unnecessary. The T10 (SCSI) standard committee has approved a simple specification for atomic operations; it only supports contiguous I/O operations, so it is "not very exciting." Work is proceeding on vectored atomic operations that would handle writes to multiple discontiguous areas on the disk, but that has not yet been finalized.

As an example of how atomic I/O operations can help performance, Chris looked at the special log used by Btrfs to implement the fsync() system call. The filesystem will respond to an fsync() by writing the important data to a new log block. In the current code, each commit only has to wait twice, thanks to some recent work by Josef Bacik: once for the write of the data and metadata, and once for the superblock write. That work brought a big performance boost, but atomic I/O can push things even further. By using atomic operations to eliminate one more wait, Chris was able to improve performance by 10-15%; he said he thought the improvement should be better than that, but even that level of improvement is the kind of thing database vendors send out press releases for. Getting a 15% improvement without even trying that hard, he said, was a nice thing.

At Fusion-io, work has been done to enable atomic I/O operations in the MariaDB and Percona database management systems. Currently, these operations are only enabled with the Fusion-io software development kit and its "DirectFS" filesystem. Atomic I/O operations allowed the elimination of the MySQL-derived double-buffering mode, resulting in 43% more transactions per second and half the wear on the storage device. Both improvements matter: if you have made a large investment in flash storage, getting twice the life is worth a lot of money.

Getting there

So it's one thing to hack some atomic operations into a database application; making atomic I/O operations more generally available is a larger problem. Chris has developed a set of API changes that will allow user-space programs to make use of atomic I/O operations, but there are some significant limitations, starting with the fact that only direct I/O is supported. With buffered I/O, it just is not possible for the kernel to track the various pieces through the stack and guarantee atomicity. There will also need to be some limitations on the maximum size of any given I/O operation.

An application will request atomic I/O with the new O_ATOMIC flag to the open() system call. That is all that is required; many direct I/O applications, Chris said, can benefit from atomic I/O operations nearly for free. Even at this level, there are benefits. Oracle's database, he said, pretends it has atomic I/O when it doesn't; the result can be "fractured blocks" where a system crash interrupts the writing of a data block that been scattered across a fragmented filesystem, leading to database corruption. With atomic I/O operations, those fragmented blocks will be a thing of the past.

Atomic I/O operation support can be taken a step further, though, by adding asynchronous I/O (AIO) support. The nice thing about the Linux AIO interface (which is not generally acknowledged to have many nice aspects) is that it allows an application to enqueue multiple I/O operations with a single system call. With atomic support, those multiple operations — which need not all involve the same file — can all be done as a single atomic unit. That allows multi-file atomic operations, a feature which can be used to simplify database transaction engines and improve performance. Once this functionality is in place, Chris hopes, the (currently small) number of users of the kernel's direct I/O and AIO capabilities will increase.

Some block-layer changes will clearly be needed to make this all work, of course. Low-level drivers will need to advertise the maximum number of atomic segments any given device will support. The block layer's plugging infrastructure, which temporarily stops the issuing of I/O requests to allow them to accumulate and be coalesced, will need to be extended. Currently, a plugged queue is automatically unplugged when the current kernel thread schedules out of the processor; there will need to be a means to require an explicit unplug operation instead. This, Chris noted, was how plugging used to work, and it caused a lot of problems with lost unplug operations. Explicit unplugging was removed for a reason; it would have to be re-added carefully and used "with restraint." Once that feature is there, the AIO and direct I/O code will need to be modified to hold queue plugs for the creation of atomic writes.

The hard part, though, is, as usual, the error handling. The filesystem must stay consistent even if an atomic operation grows too large to complete. There are a number of tricky cases where this can come about. There are also challenges with deadlocks while waiting for plugged I/O. The hardest problem, though, may be related to the simple fact that the proper functioning of atomic I/O operations will only be tested when things go wrong — a system crash, for example. It is hard to know that rarely-tested code works well. So there needs to be a comprehensive test suite that can verify that the hardware's atomic I/O operations are working properly. Otherwise, it will be hard to have full confidence in the integrity guarantees provided by atomic operations.

Status and future work

The Fusion-io driver has had atomic I/O operation support for some time, but Chris would like to make this support widely available so that developers can count on its presence. Extending it to NVM Express is in progress now; SCSI support will probably wait until the specification for vectored atomic operations is complete. Btrfs can use (vectored) atomic I/O operations in its transaction commits; work on other filesystems is progressing. The changes to the plugging code are done with the small exception of the deadlock handler; that gap needs to be filled before the patches can go out, Chris said.

From here, it will be necessary to finalize the proposed changes to the kernel API and submit them for review. The review process itself could take some time; the AIO and direct I/O interfaces tend to be contentious, with lots of developers arguing about them but few being willing to actually work on that code. So a few iterations on the API can be expected there. The FIO benchmark needs to be extended to test atomic I/O. Then there is the large task of enabling atomic I/O operation in applications.

For the foreseeable future, a number of limitations will apply to atomic I/O operations. The most annoying is likely to be the small maximum I/O size: 64KB for the time being. Someday, hopefully, that maximum will be increased significantly, but for now it applies. The complexity of the AIO and direct I/O code will challenge filesystem implementations; the code is far more complex than one might expect, and each filesystem interfaces with that code in a slightly different way. There are worries about performance variations between vendors; Fusion-io devices can implement atomic I/O operations at very low cost, but that may not be the case for all hardware. Atomic I/O operations also cannot work across multiple devices; that means that the kernel's RAID implementations will require work to be able to support atomic I/O. This work, Chris said, will not be in the initial patches.

There are alternatives to atomic I/O operations, including explicit I/O ordering (used in the kernel for years) or I/O checksumming (to detect incomplete I/O operations after a crash). But, for many situations, atomic I/O operations look like a good way to let the hardware help the software get better performance from solid-state drives. Once this functionality finds its way into the mainline, taking advantage of fast drives might just feel a bit less like coaxing another trip out of that old jalopy.

[Your editor would like to thank the Linux Foundation for supporting his travel to LinuxCon Japan.]

