KHB: A Filesystems reading list

Did you know...? LWN.net is a subscriber-supported publication; we rely on subscribers to keep the entire operation going. Please help out by buying a subscription and keeping LWN on the net.

We've all been there - you're wandering around a party at some Linux event clutching your drink and looking for someone to talk to, but everyone is having some obscure technical conversation full of unfamiliar jargon. Then, as you slide past a cluster of important-looking people, you overhear the word "superblock" and think, "Superblock, that's a file system thing... I read about file systems in operating systems class once." Gratefully, you join the conversation, only to discover while you know some of the terms - cylinder group, indirect block, inode - you're still unable to come up with stunning ripostes like, "Aha, but that's really just another version of soft updates, and it doesn't solve the nlinks problem." (Admiring silence ensues.) Now what? You want to be able to make witty remarks about the pros and cons of journaling while throwing back the last of your martini, but you don't know where to start.

Fortunately, you can get a decent grasp of modern file systems without reading a whole book on file systems. (I haven't yet read a book on file systems I would recommend, anyway.) After reading these file systems papers (or at least their abstracts), you'll be able to at least fake a working knowledge of file systems - as long as everyone is drinking and it's too loud to hear anyone clearly. Enjoy!

The Basics

These papers are oldies but goodies. While the systems they describe are fairly obsolete and have been heavily improved since these initial descriptions, they make a good introduction to file systems structure and terminology.

A Fast File System for UNIX by Marshall Kirk McKusick, William Joy, Samuel Leffler and Robert Fabry. This paper describes the first version of the original UNIX file system that was suitable for production use. It became known as FFS (Fast File System) or UFS (UNIX File System). The "fast" part of the name comes from the fact that the original UNIX file system maxed out at about 5% of disk bandwidth, whereas the first iteration of FFS could use about 50% - a huge improvement. This paper is absolutely foundational, as the majority of production UNIX file systems are FFS-style file systems. While some parts of this paper are obsolete (check out the section on rotational delay), it's a simple, readable explanation of basic file system architecture that you can refer back to time and again. Also, it's pretty fun to read a paper describing the first implementation of, for example, symbolic links for a UNIX file system.

For extra credit, you can read the original file system checker paper, Fsck - the UNIX file system check program, by Marshall Kirk McKusick and T. J. Kowalski. It describes the major issues in checking and repairing file system metadata consistency. Improving fsck is a hot topic in file systems right now, so reading this paper might be worthwhile.

Vnodes: An Architecture for Multiple File System Types in Sun UNIX by Steve Kleiman. The original UNIX file system interface had been designed to support exactly one kind of file system. With the advent of FFS and other file systems, operating systems now needed to support several different file systems. Several solutions were proposed, but the dominant solution ended up being the VFS (Virtual File System) interface, first proposed and implemented by Sun. This paper explains the rationale behind VFS and vnodes.

Design and Implementation of the Sun Network Filesystem by Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon. Once upon a time (1985, specifically), people weren't really clear on why you would want a network file system (as opposed to, for example, a network disk or copying around files via rcp). This paper explains the needs and requirements that resulted in the invention of NFS, the network file system everyone loves to hate but uses all the time anyway. It also discusses the design of the VFS. A fun quote from the paper: "One of the advantages of the NFS was immediately obvious: as the df output below shows, a diskless workstation can have access to more than a Gigabyte of disk!"

Slaying the fsck dragon

One of the major problems in file systems is keeping the on-disk data consistent in the event that a file system is interrupted in the middle of update (for example, if the system loses power). Original FFS solved this problem by running fsck on the file system after a crash or other unclean unmount, but this took a really long time and could lose data. Many smart people thought about this problem and came up with four major approaches: journaling, log-structured file systems, soft updates, and copy-on-write. Each method provided a way of quickly recovering the file system after a crash. The most popular approach was journaling, since it was both relatively simple and easy to "bolt-on" to existing FFS-style file systems.

Journaling file systems solve the fsck problem by first writing an entry describing an update to the file system to a on-disk journal - a record of file system operations. Once the journal entry is complete, the main file system is updated; if the operation is interrupted, the journal entry is replayed on the next mount, completing any half-finished operations in progress at the time of the crash. Most production file systems (including ext3, XFS, VxFS, logging UFS, and reiserfs) use journaling to avoid fsck after a crash. No canonical journaling paper exists outside the database literature (from whence the idea was lifted wholesale), but Journaling the Linux ext2fs Filesystem by Stephen Tweedie is a good choice for learning both journaling techniques in general and the details of ext3 in particular.

The Design and Implementation of a Log-Structured File System by Mendel Rosenblum and John K. Ousterhout. Journaling file systems have to write each operation to disk twice: once in the log, and once in the final location. What would happen if we only wrote the data to disk once - in the journal? While the log-structured architecture was an electrifying new idea, it ultimately turned out to be impractical for production use, despite the concerted efforts of many computer science researchers. Today, no major production file system is log-structured. (Note that a log-structured file system is not the same as a logging file system - logging is another name for journaling.)

If you're looking for cocktail party gossip, Margot Seltzer and several colleagues published papers critiquing and comparing log-structured file systems to variations of FFS-style file systems, in which LFS usually came out rather the worse for the wear. This led to a semi-famous flame war in the form of web pages, archived here.

Soft Updates: A Technique for Eliminating Most Synchronous Writes in the Fast Filesystem by Marshall Kirk McKusick and Greg Ganger. Soft updates carefully orders writes to a file system such that in the event of a crash, the only inconsistencies are relatively harmless ones - leaked blocks and inodes. After a crash, the file system is mounted immediately and fsck runs in the background. The performance of soft updates is excellent, but the complexity is very high - as in, soft updates has been implemented only once (on BSD) to my knowledge. Personally, it took me about 5 years to thoroughly understand soft updates and I haven't met anyone other than the authors who claimed to understand it well enough to implement it. The paper is pretty understandable up to about page 5, at which point your head will explode. Don't feel bad about this, it happens to everyone.

File System Design for an NFS File Server Appliance by Dave Hitz, James Lau, and Michael Malcom. This paper describes the file system used inside NetApp file servers, Write-Anywhere File Layout (WAFL), as of 1994 (it's been improved in many ways since then). WAFL was the first major use of a copy-on-write file system - one in which "live" (in use) metadata is never overwritten in place but copied elsewhere on disk. Once a consistent set of updates has been written to disk, the "superblock" is re-written to point to the new set of metadata. Copy-on-write has an interesting set of trade-offs all its own, but has been implemented in a production file system twice now; Solaris's ZFS is also a copy-on-write file system.

File system performance

Each of these papers focuses on file system performance, but also introduces more than one interesting idea and makes a good starting point for exploring several areas of file system design and implementation.

Extent-like Performance from a UNIX File System by Larry McVoy and Steve Kleiman. This 1991 paper describes optimizations to FFS that doubled file system bandwidth for sequential I/O workloads. While the optimizations described in this paper are considered old hat these days (ever heard of readahead?), it's a good introduction to file system performance.

Sidebar: Where are they now? You might have recognized some of the names in the author lists of the papers in this article - and chances are, you aren't recognizing their names because of their file system work. What else did these people do? Here's a totally non-scientific selection. Bill Joy - co-founded Sun Microsystems

Larry McVoy - wrote BitKeeper, co-founded BitMover

Steve Kleiman - CTO, Network Appliance

Mendel Rosenblum - co-founder, VMWare

John Ousterhout - wrote Tcl/Tk, co-founded several companies

Margot Seltzer - co-founder, Sleepycat Software

Dave Hitz - co-founder, Network Appliance Obviously, anyone wanting to found a successful company and make millions of dollars should consider writing a file system first.

Scalability in the XFS File System by Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto, and Geoff Peck. This paper describes the motivation and implementation of XFS, a 64-bit file system using extents, B+ trees, dynamically allocated inodes, and journaling. XFS is not by any means an FFS-style file system and reading this paper will give you the basics on most extent-based file systems. It also describes quite a few useful optimizations for avoiding fragmentation, scaling to multiple threads, and the like.

The Utility of File Names by Daniel Ellard, Jonathan Ledlie, and Margot Seltzer. File system performance and on-disk layout can be vastly improved if the file system can predict (with reasonable accuracy) the size and access pattern of a file before it writes it to disk. The obvious solution is to add a new set of file system interfaces allowing the application to give explicit hints about the size and properties of a new file. Unfortunately, the history of file systems is littered with unused per-file interfaces like this (how often do you set the noatime flag on a file?). However, it turns out that applications are already giving these hints - in the form of file names, permissions, and other per-file properties. This paper is the first in a series demonstrating that a file system can make useful predictions about the future of a file based on the file name and other properties.

Further reading and acknowledgments

If you are interested in learning more about file systems, check out the Linux file systems wiki , especially the reading list . If you have a good file systems paper or book, please add it to the list, which is publicly editable (look for the password on the front page of the wiki). Note that I will ignore any comments of the form "You should have included paper XYZ!" unless it is also added to the reading list on the file systems wiki - WITH a short summary of the paper. With any luck, we'll have a fairly complete list of Linux file systems papers in the next few days.

If you are interested in working on file systems, or any other area of systems programming, you should contact the author at val dot henson at gmail dot com.

Thanks to Nikita Danilov, Zach Brown, and Kristen Accardi for paper suggestions and encouragement to write this article. Thanks to Theodore Y. Ts'o for actually saying something very similar to the stunning riposte in the first paragraph (which was, by the way, a completely accurate and very incisive criticism of what I was working on at the moment).

