POSIX v. reality: A position on O_PONIES

LWN.net needs you! Without subscribers, LWN would simply not exist. Please consider signing up for a subscription and helping to keep LWN publishing

fsync()

Sure, programmers (especially operating systems programmers) love their specifications. Clean, well-defined interfaces are a key element of scalable software development. But what is it about file systems, POSIX, and when file data is guaranteed to hit permanent storage that brings out the POSIX fundamentalist in all of us? The recent fsync() / rename() / O_PONIES controversy was the most heated in recent memory but not out of character for-related discussions. In this article, we'll explore the relationship between file systems developers, the POSIX file I/O standard, and people who just want to store their data.

In the beginning, there was creat()

Like many practical interfaces (including HTML and TCP/IP), the POSIX file system interface was implemented first and specified second. UNIX was written beginning in 1969; the first release of the POSIX specification for the UNIX file I/O interface (IEEE Standard 1003.1) was released in 1988. Before UNIX, application access to non-volatile storage (e.g., a spinning drum) was a decidedly application- and hardware-specific affair. Record-based file I/O was a common paradigm, growing naturally out of punch cards, and each kind of file was treated differently. The new interface was designed by a few guys (Ken Thompson, Dennis Ritchie, et alia) screwing around with their new machine, writing an operating system that would make it easier to, well, write more operating systems.

As we know now, the new I/O interface was a hit. It turned out to be a portable, versatile, simple paradigm that made modular software development much easier. It was by no means perfect, of course: a number of warts revealed themselves over time, not all of which were removed before the interface was codified into the POSIX specification. One example is directory hard links, which permit the creation of a directory cycle - a directory that is a descendant of itself - and its subsequent detachment from the file system hierarchy, resulting in allocated but inaccessible directories and files. Recording the time of the last access time - atime - turns every read into a tiny write. And don't forget the apocryphal quote from Ken Thompson when asked if he'd do anything differently if he were designing UNIX today: "If I had to do it over again? Hmm... I guess I'd spell 'creat' with an 'e'". (That's the creat() system call to create a new file.) But overall, the UNIX file system interface is a huge success.

POSIX file I/O today: Ponies and fsync()

Over time, various more-or-less portable additions have accreted around the standard set of POSIX file I/O interfaces; they have been occasionally standardized and added to the canon - revelations from latter-day prophets. Some examples off the top of my head include pread()/pwrite(), direct I/O, file preallocation, extended attributes, access control lists (ACLs) of every stripe and color, and a vast array of mount-time options. While these additions are often debated and implemented in incompatible forms, in most cases no one is trying to oppose them purely on the basis of not being present in a standard written in 1988. Similarly, there is relatively little debate about refusing to conform to some of the more brain-dead POSIX details, such as the aforementioned directory hard link feature.

Why, then, does the topic of when file system data is guaranteed to be "on disk" suddenly turn file systems developers into pedantic POSIX-quoting fundamentalists? Fundamentally (ha), the problem comes down to this: Waiting for data to actually hit disk before returning from a system call is a losing game for file system performance. As the most extreme example, the original synchronous version of the UNIX file system frequently used only 3-5% of the disk throughput. Nearly every file system performance improvement since then has been primarily the result of saving up writes so that we can allocate and write them out as a group. As file systems developers, we are going to look for every loophole in fsync() and squirm our way through it.

[PULL QUOTE: As file systems developers, we are going to look for every loophole in fsync() and squirm our way through it. END QUOTE] Fortunately for the file systems developers, the POSIX specification is so very minimal that it doesn't even mention the topic of file system behavior after a system crash. After all, the original FFS-style file systems (e.g., ext2) can theoretically lose your entire file system after a crash, and are still POSIX-compliant. Ironically, as file systems developers, we spend 90% of our brain power coming up with ways to quickly recover file system consistency after system crash! No wonder file systems users are irked when we define file system metadata as important enough to keep consistent, but not file data - we take care of our own so well. File systems developers have magnanimously conceded, though, that on return from fsync() , and only from fsync() , and only on a file system with the right mount options, the changes to that file will be available if the system crashes after that point.

At the same time, fsync() is often more expensive than it absolutely needs to be. The easiest way to implement fsync() is to force out every outstanding write to the file system, regardless of whether it is a journaling file system, a COW file system, or a file system with no crash recovery mechanism whatsoever. This is because it is very difficult to map backward from a given file to the dirty file system blocks needing to be written to disk in order to create a consistent file system containing those changes. For example, the block containing the bitmap for newly allocated file data blocks may also have been changed by a later allocation for a different file, which then requires that we also write out the indirect blocks pointing to the data for that second file, which changes another bitmap block... When you solve the problem of tracing specific dependencies of any particular write, you end up with the complexity of soft updates. No surprise then, that most file systems take the brute force approach, with the result that fsync() commonly takes time proportional to all outstanding writes to the file system.

So, now we have the following situation: fsync() is required to guarantee that file data is on stable storage, but it may perform arbitrarily poorly, depending on what other activity is going on in the file system. Given this situation, application developers came to rely on what is, on the face of it, a completely reasonable assumption: rename() of one file over another will either result in the contents of the old file, or the contents of the new file as of the time of the rename() . This is a subtle and interesting optimization: rather than asking the file system to synchronously write the data, it is instead a request to order the writes to the file system. Ordering writes is far easier for the file system to do efficiently than synchronous writes.

However, the ordering effect of rename() turns out to be a file system specific implementation side effect. It only works when changes to the file data in the file system are ordered with respect to changes in the file system metadata. In ext3/4, this is only true when the file system is mounted with the data=ordered mount option - a name which hopefully makes more sense now! Up until recently, data=ordered was the default journal mode for ext3, which, in turn, was the default file system for Linux; as a result, ext3 data=ordered was all that many Linux application developers had any experience with. During the Great File System Upheaval of 2.6.30, the default journal mode for ext3 changed to data=writeback , which means that file data will get written to disk when the file system feels like it, very likely after the file's metadata specifying where its contents are located has been written to disk. This not only breaks the rename() ordering assumption, but also means that the newly renamed file may contain arbitrary garbage - or a copy of /etc/shadow , making this a security hole as well as a data corruption problem.

Which brings us to the present day fsync / rename / O_PONIES controversy, in which many file systems developers argue that applications should explicitly call fsync() before renaming a file if they want the file's data to be on disk before the rename takes effect - a position which seems bizarre and random until you understand the individual decisions, each perfectly reasonable, that piled up to create the current situation. Personally, as a file systems developer, I think it is counterproductive to replace a performance-friendly implicit ordering request in the form of a rename() with an impossible to optimize fsync() . It may not be POSIX, but the programmer's intent is clear - no one ever, ever wrote " creat(); write(); close(); rename(); " and hoped they would get an empty file if the system crashed during the next 5 minutes. That's what truncate() is for. A generalized " O_PONIES do-what-I-want" flag is indeed not possible, but in this case, it is to the file systems developers' benefit to extend the semantics of rename() to imply ordering so that we reduce the number of fsync() calls we have to cope with. (And, I have to note, I did have a real, live pony when I was a kid, so I tend to be on the side of giving programmers ponies when they ask for them.)

My opinion is that POSIX and most other useful standards are helpful clarifications of existing practice, but are not sufficient when we encounter surprising new circumstances. We criticize applications developers for using folk-programming practices ("It seems to work!") and coming to rely on file system-specific side effects, but the bare POSIX specification is clearly insufficient to define useful system behavior. In cases where programmer intent is unambiguous, we should do the right thing, and put the new behavior on the list for the next standards session.



