Modern transactional databases overwhelmingly don’t operate under textbook “ACID” isolation, or serializability. Instead, these databases—like Oracle 11g and SAP HANA—offer weaker guarantees, like Read Committed isolation or, if you’re lucky, Snapshot Isolation. There’s a good reason for this phenomenon: weak isolation is faster—often much faster—and incurs fewer aborts than serializability. Unfortunately, the exact behavior of these different isolation levels is difficult to understand and is highly technical. One of 2008 Turing Award winner Barbara Liskov’s Ph.D. students wrote an entire dissertation on the topic, and, even then, the definitions we have still aren’t perfect and can vary between databases.

The core problem: a monstrous abstraction in (most) every database

Despite the ubiquity of weak isolation, I haven’t found a database architect, researcher, or user who’s been able to offer an explanation of when, and, probably more importantly, why isolation models such as Read Committed are sufficient for correct execution. It’s reasonably well known that these weak isolation models represent “ACID in practice,” but I don’t think we have any real understanding of how so many applications are seemingly (!?) okay running under them. (If you haven’t seen these models before, they’re a little weird. For example, Read Committed isolation generally prevents users from reading uncommitted or non-final writes but allows a number of bad things to happen, like lost updates during concurrent read-modify-write operations. Why is this apparently okay for many applications?) In the research community and in our classrooms, we’ve historically contented ourselves with studying serializability and, to a certain extent, Snapshot Isolation. Understanding weak isolation as deployed in real-world databases is a wide open problem with serious consequences.
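The lost-update race mentioned above is easy to see concretely. Below is a minimal, deterministic simulation of the anomaly (plain Python variables stand in for database state; no real database API is used): two transactions each read a committed balance, compute a new value locally, and write it back. Neither ever reads uncommitted data, so Read Committed is technically satisfied, yet one update vanishes.

```python
# Simulated "lost update" under Read Committed: two transactions
# interleave a read-modify-write on the same committed value.
# Illustrative sketch only; names here are not from any real system.

balance = 100  # committed state

# Both transactions read the committed value...
t1_read = balance
t2_read = balance

# ...each computes a deposit locally...
t1_new = t1_read + 20  # T1: deposit $20
t2_new = t2_read + 50  # T2: deposit $50

# ...and both commits succeed. No uncommitted data was ever read,
# so Read Committed is satisfied, but T2's write clobbers T1's.
balance = t1_new
balance = t2_new

print(balance)  # 150, not the expected 170: T1's deposit was lost
```

Under serializability, the two transactions would be forced into some serial order and the final balance would be 170.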

To put this problem in perspective, there’s a flood of interesting new research that attempts to better understand programming models like eventual consistency. And, as you’re probably aware, there’s an ongoing and often lively debate between transactional adherents and more recent “NoSQL” upstarts about related issues of usability, data corruption, and performance. But, in contrast, many of these transactional adherents and the research community as a whole have effectively ignored weak isolation—even in a single server setting and despite the fact that literally millions of businesses today depend on weak isolation and that many of these isolation levels have been around for almost three decades.

Why might weak isolation actually work, and how did we get into this mess?

I can offer a few guesses as to why weak isolation works:

Non-serializable isolation “anomalies” are really just different kinds of race conditions, and to have a race, you need concurrency:

- For low-traffic and low-contention applications, it’s possible that anomalies simply don’t arise (e.g., the read-modify-write race from above might not affect applications because there simply aren’t concurrent transactions).

- When anomalies do arise, it’s possible that the read/write anomalies don’t translate into application data corruption. Just because a read/write race occurs doesn’t mean all outcomes are necessarily wrong (e.g., two writers might perform the exact same read-modify-write operations, with the same outcome regardless of order).

- It’s possible data is actually (occasionally) corrupted, and apps just don’t care. Or, when they do, they manually issue the customer an IOU and/or send them an overdraft notice.

However, these are still conjectures. With a few exceptions, we’re in uncharted territory.

Why haven’t we already solved this problem? One possibility: as a community tradition, database architects have prided themselves on top-down design of systems that fulfill beautiful interfaces. For example, the relational database revolution in the 1970s was fomented by the introduction of an amazing abstraction: relational algebra. Serializability is a similarly powerful abstraction that vastly simplifies end-user programming. In contrast, weak isolation is grotesque. Its development has been overwhelmingly mechanism-driven rather than policy- or application-driven. Do you know how Jim Gray et al. invented Read Committed back in 1975? Realizing that serializability via two-phase locking was expensive, the System R gang changed their long read locks to short read locks. No joke. (Aside: look at that typesetting!) What about Snapshot Isolation? In the early 1990s, companies like Interbase and Microsoft started shipping databases with a fancy new multi-version concurrency control mechanism. It took another paper by Gray and friends to define what these systems had actually implemented.
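The long-vs-short read lock distinction is worth making concrete. In the hand-rolled toy schedule below (not a real lock manager), holding read locks until commit serializes two read-modify-write transactions, while releasing the read lock immediately after the read lets a second reader slip in before the first write:

```python
# Toy illustration of how shortening read-lock duration weakens
# isolation. With long read locks (two-phase locking), a concurrent
# transaction must wait until commit, so the two transactions run
# serially. With short read locks (Read Committed), both transactions
# can read before either writes. Simplified sketch, not a lock manager.

def run(long_read_locks):
    balance = 100
    if long_read_locks:
        # T1 holds its read lock to commit, so T2 waits, then runs.
        balance = balance + 20  # T1: read, add 20, write, commit
        balance = balance + 50  # T2: read, add 50, write, commit
    else:
        # Read locks released right after each read: both reads
        # happen before either write.
        t1_read = balance
        t2_read = balance        # T2's read sneaks in before T1's write
        balance = t1_read + 20   # T1 writes and commits
        balance = t2_read + 50   # T2 overwrites T1's update
    return balance

print(run(True))   # 170: serializable outcome
print(run(False))  # 150: lost update under short read locks
```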

Towards a better understanding and better system design

Understanding weak isolation is not “just” an academic problem—it’s a problem database users face every day. If a user wants to make responsible use of weak isolation, she first needs to learn the particulars of each isolation model (which receive only partial treatment even in the best textbooks). Second, she has to manually translate the set of low-level read/write anomalies that define each model to her application logic. This is hard. With effort and a lot of education, it can be done. But remember, if our user’s database doesn’t support serializability and she cares about correctness, it must be done. Is this good design? Is this the best we can do? Exposing models that benefit the systems builder rather than the end user is, in my opinion, antithetical to the database tradition and to empathetic, user-centric design.

I think there are at least three fronts for making progress on this problem:

1. As a community, I think it’s time to pay attention to and give existing weak isolation models the deep treatment they deserve. We have a few hints already. For example, my dissertation work (and plenty of excellent related work) helps identify and apply conditions under which we can race and preserve correctness (and concurrent execution). But there’s much work to be done in specifically examining existing and widely deployed models. There are numerous Ph.D. dissertations to be written in this space and, more importantly, serious potential for impact on practice.

2. As we develop new systems, we can avoid making the same mistakes as our architectural ancestors by focusing on applications, not mechanisms. I’d personally welcome a moratorium on writing papers on new isolation, data consistency, and read/write transaction models unless there’s a clear and specific set of motivating, application-driven use cases.

3. There’s real promise in moving beyond the read/write interface of traditional isolation models entirely and building concurrency control systems that operate at a higher level of abstraction (holy grail: the application level). This helps both systems and users. Systems can exploit greater concurrency and therefore provide higher performance and availability. Users don’t have to think about data races. This is a topic for another post, but work on Bloom/CALM, CRDTs, and I-confluence-based coordination avoidance hints at what’s achievable by reasoning about applications, not read/write histories.
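As a taste of what "operating at a higher level of abstraction" can buy, here is a minimal grow-only counter in the CRDT style (my own sketch; real CRDT libraries differ in detail). Each replica increments only its own slot, and merge takes an element-wise max, so replicas can accept increments concurrently, with no coordination, and still converge:

```python
# Sketch of a grow-only counter (G-Counter) CRDT. Because merge
# (element-wise max) is commutative, associative, and idempotent,
# replicas converge regardless of merge order or repetition.
# Illustrative only.

class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.counts = [0] * n_replicas

    def increment(self, amount=1):
        # Each replica only ever increments its own slot.
        self.counts[self.replica_id] += amount

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        # Element-wise max: safe to apply in any order, any number
        # of times.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

# Two replicas accept increments concurrently, without coordination...
a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(3)
b.increment(4)

# ...and converge once they exchange state, in either merge order.
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 7
```

Here the application-level operation (increment) is commutative by construction, so there is simply no race for the user to reason about.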

We can do better, and there’s a ton of interesting research and design to be done. Go!

