Historically, I’ve been a bit of a filesystem purist. If somebody called something a filesystem that wasn’t fully integrated into the kernel, so that any application could transparently read, write, and even map files within a familiar hierarchical directory/file structure, or if it didn’t support consistency at least as strong as NFS’s (preferably stronger), I’d be quick to correct them. I’d say that what they had wasn’t a filesystem, with the clear implication that it was something less than a filesystem.

My views have softened a bit over the years, though, and now I see a lot more value than I used to in data stores that aren’t filesystems. I’ve always seen value in databases, of course, but over the last few years more and more people have been putting data into things that are neither filesystems nor databases. In many cases, the value of these other data stores, and/or of what’s stored in them, exceeds that of the filesystems and databases (including the quasi-databases that I’m sure database purists complain about the same way I have about quasi-filesystems).

At first I wondered why people were doing things this way. Many of these folks were doing distributed programming. Were they simply ignorant of distributed filesystems? Were they too lazy to figure out the wheel that already existed before they went and invented their own? Those two explanations seemed most likely for a while, but it turns out that there are better reasons as well, and it’s those reasons I’ll explain here.

The most prominent example of a non-filesystem, non-database data store is the distributed key-value store. Create a key (often some sort of hash), and attach some arbitrary amount of data to it. Provide the key later, and get the data back. Examples include Amazon’s Dynamo and S3, Project Voldemort, and many more. For the most part, these share some characteristics:

Flat namespace, often with not-particularly-human-readable names.

No hierarchical directory/file structure.

Limited file/object metadata.

Limited operations (compared to filesystems), most notably lacking rename.

Little or no support for access at the sub-object level.

Weak consistency – readers might get stale data (or no data) even after a “completed” write.
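The interface these characteristics imply is tiny. Here’s a minimal sketch in Python – an in-memory stand-in with hypothetical names, not any particular store’s API – showing a flat namespace of opaque keys, whole-object put/get, and nothing resembling directories or rename:

```python
import hashlib

class KVStore:
    """In-memory stand-in for a distributed key-value store (hypothetical)."""

    def __init__(self):
        self._data = {}  # flat namespace: no directories, no hierarchy

    def put(self, value: bytes) -> str:
        # Keys are opaque hashes -- not particularly human-readable.
        key = hashlib.sha256(value).hexdigest()
        self._data[key] = value
        return key

    def get(self, key: str) -> bytes:
        # Whole-object access only; there is no rename, no sub-object read.
        return self._data[key]

store = KVStore()
key = store.put(b"some arbitrary blob of data")
assert store.get(key) == b"some arbitrary blob of data"
```

A real store would shard `_data` across many nodes and replicate each value, but the client-visible contract is about this small.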

One of the most fundamental operations in any data store is turning some sort of name into an identifier or handle for a particular unique object. In a filesystem, this means looking up the unique ID (typically an inode number) for each directory component of the path from the root on down. This can be so expensive even for a local filesystem that most operating systems have a special “name cache” just for this purpose. Much of the design effort for any distributed filesystem is oriented around doing the same thing efficiently across a network . . . and that’s where rename operations have earned a position of special notoriety. In addition to the ordering problems at the heart of the great fsync debate, renames make it possible for an already-resolved partial path – which might be the top of an entire large directory subtree – to disappear suddenly. Even more fun can be had when parts of the old path are repopulated with new directories/files, so that two nodes trying to resolve the same path might get different answers depending on their prior name-cache state. Basically, renames (and rewrites of symbolic links) require that name caches be kept consistent, and in a distributed system that can involve quite a bit of complexity and performance cost.
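The resolution-plus-caching machinery above, and the way rename undermines it, can be sketched in a few lines of Python. This is a toy single-process model with hypothetical names – node IDs standing in for inodes, a dict standing in for the authoritative namespace – not how any real filesystem is implemented:

```python
namespace = {  # authoritative mapping: (parent_id, name) -> child_id
    (0, "a"): 1,
    (1, "b"): 2,
    (2, "c"): 3,
}

def resolve(path, cache):
    """Resolve a path one component at a time, consulting a name cache."""
    node = 0  # start at the root
    for name in path.strip("/").split("/"):
        key = (node, name)
        if key not in cache:
            # In a distributed filesystem this lookup is a network RPC --
            # exactly the cost the name cache exists to avoid.
            cache[key] = namespace[key]
        node = cache[key]
    return node

warm = {}
assert resolve("/a/b/c", warm) == 3  # populates the cache

# A rename replaces /a/b with a brand-new directory of the same name.
namespace[(1, "b")] = 9
namespace[(9, "c")] = 10

# A cold cache sees the new tree; the warm cache still sees the old one.
assert resolve("/a/b/c", {}) == 10
assert resolve("/a/b/c", warm) == 3  # stale answer: nothing invalidated it
```

The last two lines are the whole problem in miniature: two resolvers, same path, different answers, and fixing it means invalidating caches on every node that might hold any entry under the renamed directory.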

Consistency problems can be solved, of course. There’s additional complexity involved, and some performance cost, but I’ve long held that those could be kept to a minimum and that the benefits are worth it. The thing that really changed my mind about this was an observation in the Dynamo paper: strong consistency reduces availability. I’ve always thought of data availability in terms of data not being lost or stranded on the other side of a failed network connection. The Dynamo insight is that many applications have to do a lot of work within a small acceptable-response-time window, and to make sure that they fit into that window they have to impose deadlines on all sub-operations, including data access. If consistency issues make data unavailable within that deadline, then they’ve made it unavailable, period, with practically the same effect as if the data were unavailable in any other sense. As someone who has done hard-realtime work I have trouble applying the term “realtime” to a web app that’s responding to a user, but by most definitions it would qualify. In such an environment, which has become extremely common, forgoing (some) consistency to ensure that data are available – meaning available within a given time – makes perfect sense.
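The deadline argument can be made concrete with a small sketch. All the names here are hypothetical – this is the shape of the tradeoff, not any real client library:

```python
def read_with_deadline(key, primary, replica, deadline_s=0.05):
    """Answer within the deadline, even at the cost of staleness."""
    try:
        # Strongly consistent path: ask the authoritative copy, but only
        # wait as long as the response-time budget allows.
        return primary(key, timeout=deadline_s)
    except TimeoutError:
        # A consistent answer that arrives after the deadline is as useless
        # as no answer, so fall back to a possibly stale local replica.
        return replica[key]

def slow_primary(key, timeout):
    raise TimeoutError  # simulate a partitioned or overloaded node

replica = {"user:42": b"stale-but-available"}
assert read_with_deadline("user:42", slow_primary, replica) == b"stale-but-available"
```

A strongly consistent system has no second branch to fall back to: when the primary can’t answer in time, the application simply misses its window.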

This same consistency/availability tradeoff is the reason these kinds of key-value stores usually eschew not only consistency within a file/object but also the whole structure of hierarchical names, renames, etc. They don’t need the naming flexibility. They do need lookups to be consistently fast, which means easily distributing keys among many nodes and not having to worry about them moving around. As the Dynamo paper also explains, many applications are happy with primary-key access alone, and they don’t really need the primary key to be anything human-readable because it’s already stored in a database. If anybody wants more complex naming, they can implement it readily enough on top of the raw key-value functionality. If they don’t, they can take the primary key from a database query result and use it directly to get their data without any further name lookups.
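Layering naming on top is genuinely easy, which is part of why the stores don’t bother building it in. As a sketch (hypothetical helpers, a dict standing in for the store): make each “directory” just another stored object, a mapping from child names to keys, and pay one get per path component only when you actually want hierarchical names.

```python
import hashlib
import json

store = {}  # stand-in for a flat key-value store: key -> bytes

def put(value: bytes) -> str:
    key = hashlib.sha256(value).hexdigest()  # opaque, non-human-readable key
    store[key] = value
    return key

def put_dir(entries: dict) -> str:
    # A "directory" is just another object: names mapped to keys.
    return put(json.dumps(entries).encode())

def lookup(root_key: str, path: str) -> bytes:
    key = root_key
    for name in path.strip("/").split("/"):
        key = json.loads(store[key])[name]  # one get() per path component
    return store[key]

file_key = put(b"hello")
root = put_dir({"docs": put_dir({"readme.txt": file_key})})
assert lookup(root, "/docs/readme.txt") == b"hello"
```

Applications that already hold the key – from a database row, say – skip `lookup` entirely and go straight to the data, which is exactly the fast path these stores are optimized for.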

The conclusion I draw from this exploration is that filesystems are overkill for many people. They provide functionality those people don’t need, and in return exact a cost those people are unwilling to pay, so alternatives have been developed. I think it’s time for more people to realize that there are no longer only two ways to store and access your data. Key-value stores aren’t just something layered on top of filesystems or databases. They – even more than object storage or content-addressable storage – represent a primary data-access paradigm in their own right.