LCA: Why filesystems are hard

Did you know...? LWN.net is a subscriber-supported publication; we rely on subscribers to keep the entire operation going. Please help out by buying a subscription and keeping LWN on the net.

The ext4 filesystem is reaching the culmination of a long development process. It has been marked as stable in the mainline kernel for over a year, distributions are installing it by default, and it may start to see more widespread enterprise-level deployment toward the end of this year. At linux.conf.au 2010, ext4 hacker Ted Ts'o talked about the process of stabilizing ext4 and why filesystems take a long time to become ready for production use.

In general, Ted says, people tend to be overly optimistic about how quickly a filesystem can stabilize. It is not a fast process, for a number of fairly clear reasons. In general, there are some aspects of software which can make it hard to test and debug. These include premature optimization ("the root of all evil"), the presence of large amounts of internal state, and an environment involving a lot of parallelism. Any of these features will make code more difficult to understand and complicate the testing environment.

Filesystems suffer from all of these problems. Users demand that a general-purpose filesystem be heavily optimized for a wide variety of workloads; this optimization work must be done at all levels of the code. The entire job of a filesystem is to store and manage internal state. Among other things, that makes it hard for developers to reproduce problems; specific bugs are quite likely to be associated with the state of a specific filesystem which a user may be unwilling to share even in the absence of the practical difficulties implicit in making hundreds of gigabytes of data available to developers. And parallelism is a core part of the environment for any general-purpose filesystem; there will always be many things going on at once. All of these factors combine to make filesystems difficult to stabilize.

What it comes down to, Ted says, is that filesystems, like fine wines, have to age for a fair period of time before they are ready. But there's an associated problem: the workload-dependent nature of many filesystem problems guarantees that filesystem developers cannot, by themselves, find all of the bugs in their code. There will always be a need for users to test the code and report their experiences. So filesystem developers have a strong incentive to encourage users to use the code, but the more ethical developers (at least) do not want to cause users to lose data. It's a fine line which can be hard to manage.

So what does it take to get a filesystem written and ready for use? As part of the process of seeking funding for Btrfs development, Ted talked to veterans of a number of filesystem development projects over the years. They all estimated that getting a filesystem to a production-ready state would require something between 75 and 100 person-years of effort - or more. That can be a daunting thing to tell corporate executives when one is trying to get a project funded; for Btrfs, Ted settled for suggesting that every company involved should donate two engineers to the cause. Alas, not all of the companies followed through completely; vague problems associated with an economic crisis got in the way.

An associated example: Sun started working on the ZFS filesystem in 2001. The project was only announced in 2005, with the first shipments happening in 2006. But it is really only in the last year or so that system administrators have gained enough confidence in ZFS to start using it in production environments. Over that period of time, the ZFS team - well over a dozen people at its peak - devoted a lot of time to the development of the filesystem.

So where do things stand with ext4? It is, Ted says, an interesting time. It has been shipping in community distributions for a while, with a number of them now installing it by default. With luck, the long term support and enterprise distributions will start shipping it soon; enterprise-level adoption can be expected to follow a year or so after that.

Over the last year or so, There have been something between 60 and 100 ext4 patches in each mainline kernel release. Just under half of those are bug fixes; many of the rest are cleanup patches. There's also a small amount of new feature and performance enhancement work still. Ted noted that the number of bug fixes has not been going down in recent releases. That, he says, is to be expected; the user community for ext4 is growing rapidly, and more users will find (and report) more bugs.

A certain number of those bugs are denial of service problems; many of those are system crashes in response to a corrupted on-disk filesystem image. A larger share of the problems are race conditions and, especially deadlocks. There are a few problems associated with synchronization; one does not normally notice these at all unless the system crashes at the wrong time. And there are a few memory leaks, usually associated with poorly-tested error-handling paths.

The areas where the bulk of these bugs can be found is illuminating. There have been problems in the interaction between the block allocator and the online resize functionality - it turns out that people do not resize filesystems often, so this code is not always all that heavily tested. Other bugs have come up in the interaction between block pre-allocation and out-of-space handling. Online defragmentation has had a number of problems, including one nasty security bug; it turned out that nobody had really been testing that code. The FIEMAP ioctl() command, really only used by one utility, had some problems. There were issues associated with disk quotas; this feature, too, is not often used, especially by filesystem developers. And there have been problems with the no-journal mode contributed by Google; the filesystem had a number of "there is always a journal" assumptions inherited from ext3, but, again, few people have tested this feature.

The common theme here should be clear: a lot of the bugs turning up in this stage of the game are associated with little-used features which have not received as much testing as the core filesystem functions. The good news is that, as a result, most of the bugs have not actually affected that many users.

There was one problem in particular which took six months to find; about once a month, it would corrupt a filesystem belonging to a dedicated and long-suffering tester. It turned out that there was a race condition which could corrupt the disk if two processes were writing the same file at the same time. Samba, as it happens, does exactly that, whereas the applications run by most filesystem developers do not. The moral of the story: just because the filesystem developer has not seen problems does not mean that the code is truly safe.

Another bug would only strike if the system crashed at just the wrong time; it had been there for a long time before anybody noticed it. How long? The bug was present in the ext3 filesystem as well, but nobody ever reported it.

There have also been a number of performance problems which have been found and fixed. Perhaps the most significant one had to do with performance in the writeback path. According to Ted, the core writeback code in the kernel is fairly badly broken at the moment, with the result that it will not tell the filesystem to write back more than 1024 blocks at a time. That is far too small for large, fast devices. So ext4 contains a hack whereby it will write back much more data than the VFS layer has requested; it is justified, he says, because all of the other filesystems do it too. In general, nobody wants to touch the writeback code, partly because they fear breaking all of the workarounds which have found their way into filesystem-specific code over the years.

Ted concluded by noting that, in one way, filesystems are easy: the Linux kernel contains a great deal of generic support code which does much of the work. But the truth of the matter is that they are hard. There are lots of workloads to support, the performance demands are strong, and there tend to be lots of processes running in parallel. The creation of a new filesystem is done as a labor of love; it's generally hard to justify from a business perspective. This reality is reflected in the fact that almost nobody is investing in filesystem work currently, with the one high-profile exception being Sun and its ZFS work. But, Ted noted, that work has cost them a lot, and it's not clear that they have gotten a return which justifies that investment. Hopefully the considerable amount of work which has gone into Linux filesystem development will have a more obvious payback.

