Solving the ext3 latency problem


One might think that the ext3 filesystem, by virtue of being standard on almost all installed Linux systems for some years now, would be reasonably well tuned for performance. Recent events have shown, though, that some performance problems remain in ext3, especially in places where the fsync() system call is used. It's impressive what can happen when attention is drawn to a problem; the 2.6.30 kernel will contain fixes which seemingly eliminate many of the latencies experienced by ext3 users. This article will look at the changes that were made, including a surprising change to the default journaling mode made just before the 2.6.30-rc1 release.

The problem, in short, is this: the ext3 filesystem, when running in the default data=ordered mode, can exhibit lengthy stalls when some process calls fsync() to flush data to disk. This issue most famously manifested itself as the much-lamented Firefox system-freeze problem, but it goes beyond just Firefox. Anytime there is reasonably heavy I/O going on, an fsync() call can bring everything to a halt for several seconds; some stalls on the order of minutes have been reported. This behavior has tended to discourage the use of fsync() in applications, and it makes the Linux desktop less fun to use. It's clearly worth fixing, but nobody did so for years.

When Ted Ts'o looked into the problem, he found an obvious culprit: data sent to the disk via fsync() is put at the back of the I/O scheduler's queue, behind all other outstanding requests. If processes on the system are writing a lot of data, that queue could be quite long. So it takes a long time for fsync() to complete. While that is happening, other parts of the filesystem can stall, eventually bringing much of the system to a halt.

The first fix was to mark I/O requests generated by fsync() with the WRITE_SYNC operation bit, marking them as synchronous requests. The CFQ I/O scheduler tries to run synchronous requests (which generally have a process waiting for the results) ahead of asynchronous ones (where nobody is waiting). Normally, reads are considered to be synchronous, while writes are not. Once the fsync()-related requests were made synchronous, they were able to jump ahead of normal I/O. That makes fsync() much faster, at the expense of slowing down the I/O-intensive tasks in the system. This is considered to be a good tradeoff by just about everybody involved. (It's amusing to note that this change is conceptually similar to the I/O priority patch posted by Arjan van de Ven some time ago; some ideas take a while to reach acceptance).

Block subsystem maintainer Jens Axboe disliked the change, though, stating that it would cause severe performance regressions for some workloads. Linus made it clear, though, that the patch was probably going to go in, and that, if the CFQ I/O scheduler couldn't handle it, there would soon be a change to a different default scheduler. Jens probably would have looked further in any case, but the extra motivation supplied by Linus is unlikely to have slowed this process down.

The problem, as it turns out, is that WRITE_SYNC actually does two things: putting the request onto the higher-priority synchronous queue, and unplugging the queue. "Plugging" is the technique used by the block layer to issue requests to the underlying disk driver in bursts. Between bursts, the queue is "plugged," causing requests to accumulate there. This accumulation gives the I/O scheduler an opportunity to merge adjacent requests and issue them in some sort of reasonable order. Judicious use of plugging improves block subsystem performance significantly.

Unplugging the queue for a synchronous request can make sense in some situations; if somebody is waiting for the operation, chances are they will not be adding any adjacent requests to the queue, so there is no point in waiting any longer. As it happens, though, fsync() is not one of those situations. Instead, a call to fsync() will usually generate a whole series of synchronous requests, and the chances of those requests being adjacent to each other are fairly good. So unplugging the queue after each synchronous request is likely to make performance worse. Upon identifying this problem, Jens posted a series of patches to fix it. One of them adds a new WRITE_SYNC_PLUG operation which queues a synchronous write without unplugging the queue. This allows operations like fsync() to create a series of operations, then unplug the queue once at the end.

While he was at it, Jens fixed a couple of related issues. One was that the block subsystem could still, in some situations, run synchronous requests behind asynchronous ones. The code here is a bit tricky, since it may be desirable to let a few asynchronous requests through occasionally to prevent them from being starved entirely. Jens changed the balance to ensure that synchronous requests get through in a timely manner.

Beyond that, the CFQ scheduler uses "anticipatory" logic with synchronous requests; upon executing one such request, it will stall the queue to see if an adjacent request shows up. The idea is that the disk head will be ideally positioned to satisfy that request, so the best performance is obtained by not moving it away immediately. This logic can work well for synchronous reads, but it's not helpful when dealing with write operations generated by fsync(). So now there's a new internal flag that prevents anticipation when WRITE_SYNC_PLUG operations are executed.

Linus liked the changes:

Goodie. Let's just do this. After all, right now we would otherwise have to revert the other changes as being a regression, and I absolutely _love_ the fact that we're actually finally getting somewhere on this fsync latency issue that has been with us for so long.

It turns out that there's a little more, though. Linus noticed that he was still getting stalls, even if they were much shorter than before, and he wondered why:

One thing that I find intriguing is how the fsync time seems so _consistent_ across a wild variety of drives. It's interesting how you see delays that are roughly the same order of magnitude, even though you are using an old SATA drive, and I'm using the Intel SSD.

The obvious conclusion is that there was still something else going on. Linus's hypothesis was that the volume of requests pending to the drive was large enough to cause stalls even when the synchronous requests go to the front of the queue. With a default configuration, requests can contain up to 512KB of data; stack up a couple dozen or so of those, and it's going to take the drive a little while to work through them. Linus experimented with setting the maximum size (controlled by /sys/block/drive/queue/max_sectors_kb ) to 64KB, and reports that things worked a lot better. As of this writing, though, the default has not been changed; Linus suggested that it might make sense to cap the maximum amount of outstanding data, rather than the size of any individual request. More experimentation is called for.

There is one other important change needed to get a truly quick fsync() with ext3, though: the filesystem must be mounted in data=writeback mode. This mode eliminates the requirement that data blocks be flushed to disk ahead of metadata; in data=ordered mode, by contrast, the volume of data which must be written out first guarantees that fsync() will always be slower. Switching to data=writeback eliminates those writes, but, in the process, it also turns off the feature which made ext3 seem more robust than ext4. Ted Ts'o has mitigated that problem somewhat, though, by adding the same safeguards he put into ext4: in some situations (such as when a new file is renamed on top of an existing file), data will be forced out ahead of metadata. As a result, data loss resulting from a system crash should be less of a problem.

Sidebar: data=guarded Another alternative to data=ordered may be the data=guarded mode proposed by Chris Mason. This mode would delay on-disk file size updates until the associated data blocks have been written, preventing information disclosure problems. It is a very new patch, though, which won't be ready for 2.6.30.

The other potential problem with data=writeback is that, in some situations, a crash can leave a file with blocks allocated to it which have not yet been written. Those blocks may contain somebody else's old data, which is a potential security problem. Security is a smaller issue than it once was, for the simple reason that multiuser Linux systems are relatively scarce in 2009. In a world where most systems are dedicated to a single user, the potential for information disclosure in the event of a crash seems vanishingly small. In other words, it's not clear that the extra security provided by data=ordered is worth the associated performance costs anymore.

So Ted suggested that, maybe, data=writeback should be made the default. There was some resistance to this idea; not everybody thinks that ext3, at this stage of its life, should see a big option change like that. Linus, however, was unswayed by the arguments. He merged a patch which creates a configuration option for the default ext3 data mode, and set it to "writeback." That will cause ext3 mounts to silently switch to data=writeback mode with 2.6.30 kernels. Says Linus:

I'm expecting that within a few months, most modern distributions will have (almost by mistake) gotten a new set of saner defaults, and anybody who keeps their machine up-to-date will see a smoother experience without ever even realizing _why_.

It's worth noting that this default will not change anything if (1) the data mode is explicitly specified when the filesystem is mounted, or (2) a different mode has been wired into the filesystem with tune2fs. It will also be ineffective if distributors change it back to "ordered" when configuring their kernels. Some distributors, at least, may well decide that they do not wish to push that kind of change to their users. We will not see the answer to that question for some months yet.

