Tue 26 April 2016 tags: storage

"POSIX is obsolete." If you're a filesystem developer, you've probably heard that many times. I certainly have. It doesn't tell me anything I didn't already know about POSIX, but it does tell me two things about whoever says it.

They don't know what POSIX is.

They're lazy.

To the first point, many people seem unaware that POSIX is an actual set of standards - IEEE 1003.1 in several variations, plus descendants. These standards cover a lot more than just operations on files, and technically "POSIX" only refers to systems that have passed a set of conformance tests covering all of those. Nonetheless, people often use "POSIX" to mean only the section dealing with file operations, and only in a loose sense of things that implement something like the standard without having been tested against it. Many systems, notably including Linux, pretty explicitly do not claim to comply with the actual standard.

That brings me to the second point. The "POSIX is obsolete" claim often comes from people who can't tell babies from bathwater. They take a few reasonable concerns about the POSIX standard (which I'll get to in a moment) and use those as an excuse to throw out everything to do with POSIX as it exists in the real world. That's the lazy way out. To pick just one example, if nested directories weren't useful, there wouldn't be about twenty implementations of them on top of various object stores, all of them incompatible with each other and each one subject to numerous race conditions or other bugs that could cost users their data. There's value in the POSIX(ish) feature set. There's even more value in the fact that every programming language and application already knows how to speak that language, without requiring API-specific adapters or shims.

I'm not going to defend the official POSIX standard. It's not obsolete, but it is outdated. Yes, there's a difference. Something that's obsolete can never recover its former usefulness. Something that's out of date can. A standard that described the actual and remarkably uniform behavior of the filesystems that are actually out there today would satisfy the POSIX goal of supporting application portability. That would be a good thing ... but it wouldn't be the best thing, because "POSIX in practice" is almost as outdated as "POSIX as written". Both are based on a computing model that I'd place at the late 80s - before SMP and NUMA and complicated cache/memory hierarchies, but even more importantly before distributed systems became the norm. Things that might have made sense or seemed feasible in the original POSIX context nearly thirty years ago often make absolutely no sense today, or in some cases we've learned since then that they were bad ideas all along. These anachronisms make it very difficult to achieve correctness and/or acceptable performance. They create the breathing room exploited by the peddlers of deficient almost-filesystems like HDFS and object stores like S3, which end up being even worse fits for application developers' needs than a filesystem would have been.

An up-to-date filesystem interface would avoid these ills. My goal in this article is to cast light on some of the problems with the current interface, and in some cases propose solutions. Each section will refer to a specific system call, but those system calls are merely exemplars or archetypes representing more general problems that actually affect multiple calls.

Rename

One example of the "stuck in the 80s" syndrome is rename. In the worst case, this might affect four objects:

The source directory.

A separate destination directory.

The object being renamed (e.g. to update ".." if it's also a directory).

Another object at the destination, which is effectively being unlinked.

Thus, a rename might involve several operations. A failure of any one might require a rollback, which might itself fail, etc. In a local-filesystem context, where there's a single journal and you have options like throwing a lock around everything, it might not seem like an intractable problem. However, in a distributed filesystem where you don't have such luxuries (and the probability of that double failure is much higher), it's a total nightmare. It's a nightmare that we in distributed systems have learned to avoid, but neither the de jure nor de facto standards have kept up because everyone who has any influence there has remained stuck in the 80s.

The simple solution to this problem is to have each operation affect as few objects as possible. A single-directory rename where the target doesn't already exist is an easy case affecting only one object. It's reasonable to expect that filesystems - even if they're distributed - will fully support this case. It's also reasonable for the filesystem to return EEXIST if the destination already exists, or EXDEV if the rename is across directories. Any application that can't handle EXDEV for a cross-directory rename is broken already, because it's always possible that the source and destination are on completely different filesystems. If an application wants to ensure that renames are always atomic, it already needs to deal with that above the filesystem anyway, so why impose burdensome requirements on the filesystem as well?
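As a rough illustration, the restricted semantics described above can be approximated from user space with existing calls. This is only a sketch: restricted_rename is a hypothetical name, it handles regular files only (directories can't be hard-linked), and the link-then-unlink pair is not one atomic step. What it does show is a rename that never silently unlinks an existing destination and that reports EXDEV for anything crossing directories.

```python
import errno
import os


def restricted_rename(src, dst):
    """Rename a regular file within one directory, never replacing dst.

    Hypothetical helper: raises FileExistsError (EEXIST) if dst exists and
    OSError(EXDEV) if src and dst are in different directories.
    """
    src_dir = os.path.dirname(os.path.abspath(src))
    dst_dir = os.path.dirname(os.path.abspath(dst))
    if src_dir != dst_dir:
        # Refuse cross-directory renames, just as a cross-filesystem
        # rename is refused today.
        raise OSError(errno.EXDEV, os.strerror(errno.EXDEV), src)
    # link() fails with EEXIST if dst is already there, so an existing
    # destination is never unlinked behind the caller's back.
    os.link(src, dst)
    # Only the one directory is touched: add the new name, drop the old.
    os.unlink(src)
```

A caller that already copes with EXDEV from cross-filesystem renames (typically by copying and then deleting) would need no new logic to cope with it here.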

Fsync

Everyone seems to know about one problem with fsync - that it can create huge latency bubbles. However, I see that problem as only the visible tip of the crapberg that is fsync, O_SYNC and all of their friends. In a way, it's really a side effect of the most fundamental problem with POSIX - that it's clueless about the relationship between consistency, durability, and ordering.

Let's start with consistency and durability. POSIX is very strict about consistency, requiring full and immediate visibility of any write to any subsequent reader. That probably didn't sound too bad for a single-processor system in the 80s. For a modern distributed filesystem - or even a local filesystem running on a big NUMA machine - it can be quite burdensome. (Yes, this is closely related to the rename-atomicity issue above.) By contrast, POSIX is very loose about durability. Most programmers know that a literal write isn't guaranteed to hit storage unless O_SYNC is set or fsync is issued. What many don't know is that other modifying operations have similar behavior. For directory operations, an fsync is required on the directory being modified, usually requiring that the directory be opened for the sole purpose of issuing that fsync. That's neither convenient nor efficient for anyone, really. The problem of what to fsync for a cross-directory rename is left as an exercise for the reader. ;)
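For readers who haven't seen the directory-fsync dance, here is a minimal sketch of what creating a file durably looks like today. The helper names fsync_dir and durable_create are made up for illustration, and the code assumes a platform (such as Linux) where a directory can be opened read-only and passed to fsync.

```python
import os


def fsync_dir(path):
    """Open a directory for the sole purpose of fsyncing it."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_create(path, data):
    """Create path with data, making both the contents and the name durable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write less than requested; keep going.
            view = view[os.write(fd, view):]
        os.fsync(fd)  # flushes the file's contents
    finally:
        os.close(fd)
    # The new directory entry is a separate modification and needs its
    # own fsync on the parent directory.
    fsync_dir(os.path.dirname(os.path.abspath(path)))
```

That is two fsync calls and an extra open/close for a single logical "create this file" operation.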

What's being missed here is that consistency and durability need to be tunable separately. POSIX requires strong consistency and weak durability, but many applications need the exact opposite. As with the various kinds of barriers and flushes at the CPU level, there should be separate calls to ensure previously-deferred consistency and to ensure previously-deferred durability. Forcing every application toward one corner of the consistency/durability space is a huge part of the reason distributed databases - which allow more flexibility regarding these tradeoffs - have come to be used in so many cases where a distributed filesystem would have made more sense.

Now, what about ordering? The problem here is that there's no way to ensure that the system will respect the ordering of two operations unless you wait for the first to complete before even issuing the second. In yet another echo of the 80s, this might make sense when that just means returning from one syscall and issuing another. However, if the filesystem happens to be distributed, we're talking about putting a network round-trip delay between two things that could have been pipelined. Filesystem developers well understand the importance of pipelining instead of playing ping-pong, because they rely on exactly that model from the block layer below them. Application developers should have access to the same thing, as should distributed filesystems layered on top of local ones. It should be possible at the very least to indicate which writes on a file descriptor are part of a reorderable group and which must retain their order relative to groups before or after. As with the durability/consistency options mentioned above, this gives application developers a powerful and yet portable way to manage tradeoffs that are important to them.
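To make the idea of reorderable groups concrete, here is a small sketch of how such an interface might look to an application, emulated with the calls that exist today. GroupedWriter, its methods, and the barrier notion are all hypothetical; nothing like this is in POSIX. The emulation has to fall back on fsync between groups, which is exactly the ping-pong that a real ordered interface would let the filesystem avoid.

```python
import os


class GroupedWriter:
    """Queue writes in reorderable groups separated by barriers.

    Hypothetical interface, emulated with pwrite and fsync.
    """

    def __init__(self, fd):
        self.fd = fd
        self.groups = [[]]

    def write(self, offset, data):
        # Writes queued between two barriers form one reorderable group.
        self.groups[-1].append((offset, data))

    def barrier(self):
        # Everything queued so far must land before anything queued later.
        if self.groups[-1]:
            self.groups.append([])

    def flush(self):
        for group in self.groups:
            if not group:
                continue
            # Within a group any order is legal; sorting by offset is one
            # example of a reordering the implementation is free to make.
            # (Short writes are ignored here for brevity.)
            for offset, data in sorted(group):
                os.pwrite(self.fd, data, offset)
            # Today the only portable way to order two groups is to wait
            # for the first to complete before issuing the second.
            os.fsync(self.fd)
        self.groups = [[]]
```

With a native version of this interface, a distributed filesystem could ship all the groups in one pipelined stream and enforce the ordering on the server side.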

At this point, we can go back to that above-the-waterline issue of latency bubbles. It's bad that unconstrained buffering can mean that whoever calls fsync might have to wait for gigabytes of pending data to get flushed out. It's far worse that entanglement might mean that they have to wait for gigabytes of completely unrelated data from other users to get flushed out. Worst of all, if fsync can't finish some of those writes it has no way to say which ones. If it fails, you have no idea what data didn't actually make it to disk. Unfortunately, I don't think there's a reasonable way for a standard to address that. Maybe adding some sort of control over per-file-descriptor buffer limits would be feasible. Beyond that, you start getting into multi-tenant issues that tend to exist only in proprietary form bound about with patents, and that's a poor basis for a standard. On the other hand, I think people often only use fsync because it's the only hammer they have. Who really wants an interface that often destroys performance while making no guarantees of correctness? If they had finer-grained control over consistency and durability and ordering, maybe they wouldn't even need to call fsync.

Readdir

This is actually one of the areas where the problem lies not with POSIX the standard but with "POSIX" as it exists in the world. The high cost of readdirp ("readdir plus") comes from NFS. The utterly insane d_off behavior that we Gluster developers and others have had to put up with is actually specific to Linux. These are real pain points, but this is getting long enough already so I'll just leave them alone for now. The problem with readdir, as defined in the standard, is that it's just too limited. Often, users or applications are actually only looking for files that meet certain criteria - most especially a name matching a certain pattern. That's why "find" and other utilities exist. Unfortunately, POSIX offers no option other than listing every single file in a directory (in effectively random order) and filtering out the ones you don't want. In yet another repetition of what should by now be a familiar pattern, this is particularly deadly for a distributed filesystem where all of those entries have to be passed over the network. We in Gluster-land would be glad to do some filtering or pattern matching ourselves, if users had some sort of standards-based way to tell us what they want. POSIX even defines the syntax for name-based matching, in the definition for fnmatch and elsewhere. It's just defined and implemented at the wrong level. This is such a common and severe problem that it seems like about time to combine these already-standardized pieces into something that serves users better.
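The pattern being complained about looks roughly like this in application code. The function name list_matching is made up for the example; the point is that the pattern is applied on the client, after every entry has already been fetched.

```python
import fnmatch
import os


def list_matching(directory, pattern):
    """Return (sorted matching names, number of entries scanned)."""
    matches = []
    scanned = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            # Every entry crosses the filesystem interface (and, on a
            # distributed filesystem, the network) before it is filtered.
            scanned += 1
            if fnmatch.fnmatchcase(entry.name, pattern):
                matches.append(entry.name)
    return sorted(matches), scanned
```

In a directory with a million entries and three matches, scanned is still a million. If the pattern could be handed down with the request, the filesystem could return just the three.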

Chmod

Access control is another area where people often can't distinguish between the POSIX standard and "POSIX" implementations in the real world. The standard only defines the simple user/group/other permissions we're all familiar with. It's a very useful model, and I think failing to support it is one of the most egregious examples of laziness on the part of the object-store folks. However, it's clearly not sufficient for all needs, so there was a later attempt to add more complex access control lists (see "man chacl" if you're not familiar with them). However, despite the fact that some popular platforms did implement the ACL semantics defined in POSIX.1e draft 17 (really), it never actually became a standard. Maybe that's for the best, because these ACLs still rely on the concept of a group, and that has (at least) the following problems in a distributed world:

There has to be at least some agreement between clients and servers about what groups mean, or else comparison between the group(s) being presented and the group(s) allowed to perform an action just makes no sense. I had to deal with exploits based on this when I was at Encore in 1990, and it doesn't seem like things have gotten a lot better since.

Attaching long lists of groups to every request, because you never know which one(s) might confer the needed access for that operation, is inefficient. Also, arbitrary-length lists are a pain from a protocol-definition standpoint, and no finite number ever seems to be enough. From Gluster I know of installations where users literally belong to hundreds of groups.

There are still use cases that groups don't satisfy, such as access via a specific program (MTS had PKEY access for this before UNIX ever existed) or for a limited time.

What would be better? Capabilities. No, not the horrid mess of meaningless flags born of POSIX.1e and adopted by Linux. In the real CS literature, which those people apparently never read, a capability is an unforgeable token that can be communicated to others and which confers access to an object. Modern capabilities use end-to-end cryptography instead of relying on operating systems or other intermediaries to maintain a "chain of custody" between the granter, user, and target of the capability. This means anyone can make one up on the fly, attach it to an object, and then send it to any ad hoc collection of entities that should have access. This collection can include both users and programs, with no requirement for either to be registered as a member of a group. You can do anything with this model that you can do with user/group permissions or POSIX.1e ACLs, plus a whole lot more, with better security and without the implementation problems mentioned above.
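A toy version of such a token can be sketched in a few lines. Everything here is an assumption made for illustration: the function names, the "object|rights|expiry|tag" layout, and the use of a single HMAC key shared between whoever mints the token and whoever checks it. A real design would more likely use public-key signatures so that the checker cannot also mint, but the essential properties are visible even in this simplified form: the token is unforgeable without the key, it can be handed to anyone, and no group registration is involved.

```python
import hashlib
import hmac
import time


def mint_capability(key, object_id, rights, expires_at):
    """Return a token granting `rights` (e.g. "r" or "rw") on object_id."""
    payload = f"{object_id}|{rights}|{int(expires_at)}"
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{tag}"


def check_capability(key, token, object_id, right, now=None):
    """Return True if token is authentic, unexpired, and grants `right`."""
    try:
        obj, rights, expires, tag = token.rsplit("|", 3)
        expires_at = int(expires)
    except ValueError:
        return False  # malformed token
    payload = f"{obj}|{rights}|{expires}"
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison; any tampering with the payload changes
    # the expected tag.
    if not hmac.compare_digest(tag.encode(), expected.encode()):
        return False
    now = time.time() if now is None else now
    return obj == object_id and right in rights and now < expires_at
```

The expiry field covers the limited-time case that groups can't express, and because the bearer can be a program just as easily as a person, access via a specific program falls out of the same mechanism.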

Conclusions

I'm sure there are many more parts of POSIX (both in the standard and in practice) that I could pick at, but hopefully these are enough to get started. The point is not in the specifics but in the fact that (a) there are serious problems with the current standard and (b) solutions to those problems are mostly well known. It's a crying shame that neither the official standard nor the dictators of the unofficial standard (i.e. what popular OSes actually implement) reflect the hard work and ingenuity of so many computer scientists or our fellow practitioners over in database-land. The people who say "POSIX is obsolete" are incorrect today, but if we filesystem developers keep screwing up so badly they might eventually be right.