One billion files on Linux

What happens if you try to put one billion files onto a Linux filesystem? One might see this as an academic sort of question; even the most enthusiastic music downloader will have to work a while to collect that much data. It would require over 30,000 (clean) kernel trees to add up to a billion files. Even contemporary desktop systems, which often seem to be quite adept at the creation of vast numbers of small files, would be hard put to make a billion of them. But, Ric Wheeler says, this is a problem we need to be thinking about now, or we will not be able to scale up to tomorrow's storage systems. His LinuxCon talk used the billion-file workload as a way to investigate the scalability of the Linux storage stack.

One's first thought, when faced with the prospect of handling one billion files, might be to look for workarounds. Rather than shoveling all of those files into a single filesystem, why not spread them out across a series of smaller filesystems? The problems with that approach are that (1) it limits the kernel's ability to optimize head seeks and such, reducing performance, and (2) it forces developers (or administrators) to deal with the hassles involved in actually distributing the files. Inevitably, the filesystems will get out of balance, and the files will have to be redistributed at some point.

Another possibility is to use a database rather than the filesystem. But filesystems are familiar to developers and users, and they come with the operating system from the outset. Filesystems are also better at handling partial failure; databases, by contrast, tend to be all-or-nothing affairs.

If one wanted to experiment with a billion-file filesystem, how would one come up with hardware which is up to the task? The most obvious way at the moment is with external disk arrays. These boxes feature non-volatile caching and a hierarchy of storage technologies. They are often quite fast at streaming data, but random access may be fast or slow, depending on where the data of interest is stored. They cost $20,000 and up.

With regard to solid-state storage, Ric noted only that 1TB still costs a good $1000. So rotating media is likely to be with us for a while.

What if you wanted to put together a 100TB array on your own? They did it at Red Hat; the system involved four expansion shelves holding 64 2TB drives. It cost over $30,000, and was, Ric said, a generally bad idea. Anybody wanting a big storage array would be well advised to just go out and buy one.

The filesystem life cycle, according to Ric, starts with a mkfs operation. The filesystem is filled, iterated over in various ways, and an occasional fsck run is required. At some point in the future, the files are removed. Ric put up a series of plots showing how ext3, ext4, XFS, and btrfs performed on each of those operations with a one-million-file filesystem. The results varied, with about the only consistent factor being that ext4 generally performs better than ext3. Ext3/4 are much slower than the others at creating filesystems, due to the need to create the static inode tables. On the other hand, the worst performers when creating 1 million files were ext3 and XFS. Everybody except ext3 performs reasonably well when running fsck - though btrfs shows room for some optimization. The big loser when it comes to removing those million files is XFS.

To see the actual plots, have a look at Ric's slides [PDF].

It's one thing to put one million files into a filesystem, but what about one billion? Ric did this experiment on ext4, using the homebrew array described above. Creating the filesystem in the first place was not an exercise for the impatient; it took about four hours to run. Actually creating those one billion files took far longer: a full four days. Surprisingly, running fsck on this filesystem only took 2.5 hours - a real walk in the park. So, in other words, Linux can handle one billion files now.
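
To put the four-day figure in perspective, the average creation rate it implies can be worked out directly. The following is only back-of-the-envelope arithmetic on the numbers quoted above; the function name is made up for illustration, and the real rate surely varied over the course of the run.

```python
def average_rate(files: int, days: float) -> float:
    """Average files per second over a run lasting `days` days."""
    seconds = days * 24 * 60 * 60
    return files / seconds


# One billion files in four days, as reported in the talk.
rate = average_rate(1_000_000_000, 4)
print(f"{rate:.0f} files/second")
```

That works out to roughly 2,900 file creations per second, sustained for the whole four days.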

That said, there are some lessons that came out of this experience; they indicate where some of the problems are going to be. The first of these is that running fsck on an ext4 filesystem takes a lot of memory: on a 70TB filesystem with one billion files, 10GB of RAM was needed. That number goes up to 30GB when XFS is used, though, so things can get worse. The short conclusion: you can put a huge amount of storage onto a small server, but you'll not be able to run the filesystem checker on it. That is a good limitation to know about ahead of time.

Next lesson: XFS, for all of its strengths, struggles when faced with metadata-intensive workloads. There is work in progress to improve things in this area, but, for now, it will not perform as well as ext4 in such situations.

According to Ric, running ls on a huge filesystem is "a bad idea"; iterating over that many files can generate a lot of I/O activity. When trying to look at that many files, you need to avoid running stat() on every one of them or trying to sort the whole list. Some filesystems can return the file type with the name in readdir() calls, eliminating the need to call stat() in many situations; that can help a lot in this case.
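
As a rough illustration of that advice, here is a minimal Python sketch that counts the entries in one directory by type without sorting and, where the filesystem cooperates, without a stat() call per entry. It relies on os.scandir(), which on Linux can answer type queries from the d_type field that readdir() returns; when the filesystem does not fill in that field, Python quietly falls back to a stat-like call. The function name count_by_type() is invented for this example and did not appear in the talk.

```python
import os
from collections import Counter


def count_by_type(path: str) -> Counter:
    """Count the entries of one directory by type, streaming and unsorted."""
    counts = Counter()
    with os.scandir(path) as entries:
        for entry in entries:
            # With follow_symlinks=False these checks can be answered from
            # the type returned alongside the name by readdir(), so no
            # per-file stat() is needed when the filesystem provides it.
            if entry.is_symlink():
                counts["symlink"] += 1
            elif entry.is_dir(follow_symlinks=False):
                counts["dir"] += 1
            elif entry.is_file(follow_symlinks=False):
                counts["file"] += 1
            else:
                counts["other"] += 1
    return counts
```

The entries are consumed as they arrive, so memory use stays flat no matter how large the directory is; building and sorting a full list, as ls does by default, is exactly what this avoids.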

In general, enumeration of files tends to be slow; we can do, at best, a few thousand files per second. That may seem like a lot of files, but, if the target is one billion files, it will take a very long time to get through the whole list. A related problem is backup and/or replication. That, too, will take a very long time, and it can badly affect the performance of other things running at the same time. That can be a problem because, given that a backup can take days, it really needs to be run on an operating, production system. Control groups and the I/O bandwidth controller can maybe help to preserve system performance in such situations.
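
How long is "a very long time"? A quick calculation gives a feel for it. The three rates below are assumptions chosen to bracket "a few thousand files per second"; they are not measurements from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def enumeration_days(total_files: int, files_per_second: float) -> float:
    """Days needed to visit every file once at a steady rate."""
    return total_files / files_per_second / SECONDS_PER_DAY


# Assumed rates, bracketing "a few thousand files per second".
for rate in (1_000, 3_000, 5_000):
    days = enumeration_days(1_000_000_000, rate)
    print(f"{rate:>5} files/s -> {days:.1f} days")
```

Even at the optimistic end of that range, a single pass over one billion files takes more than two days.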

Finally, application developers must bear in mind that processes which run this long will invariably experience failures, sooner or later. So they will need to be designed with some sort of checkpoint and restart capability. We also need to do better about moving on quickly when I/O operations fail; lengthy retry operations can take a slow process and turn it into an interminable one.
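
What checkpoint and restart might look like in practice is sketched below. This is one simple approach under stated assumptions, not something presented in the talk: the work list must come back in the same order on every run (a sorted manifest file, say), and the per-file operation must be safe to repeat, since up to one checkpoint interval of work is redone after a crash. The names process_with_checkpoint(), handle, and the JSON checkpoint format are all hypothetical.

```python
import json
import os


def _save(checkpoint_path, next_index):
    # Write to a temporary file and rename it into place, so that a crash
    # in the middle of a write never leaves a truncated checkpoint behind.
    tmp = checkpoint_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, checkpoint_path)


def process_with_checkpoint(names, handle, checkpoint_path, every=1000):
    """Apply handle() to each name in order, resuming from the last checkpoint.

    Assumes `names` yields the same sequence on every run, so that a saved
    index is enough to identify where to pick up again.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    done = start
    for index, name in enumerate(names):
        if index < start:
            continue  # already handled by an earlier run
        handle(name)
        done = index + 1
        if done % every == 0:
            _save(checkpoint_path, done)
    _save(checkpoint_path, done)
    return done
```

If handle() raises partway through, the exception propagates and the checkpoint on disk still points at the last saved position; running the same call again picks up from there rather than from the beginning.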

In other words, as things get bigger we will run into some scalability problems. There is nothing new in that revelation. We've always overcome those problems in the past, and should certainly be able to do so in the future. It's always better to think about these things before they become urgent problems, though, so talks like Ric's provide a valuable service to the community.

