The Return of ACID in the Design of NoSQL Databases

Stephen Pimentel, FoundationDB, http://www.foundationdb.com/

The CAP ( C onsistency, A vailability, and P artition-tolerance) theorem has been widely interpreted within the NoSQL community as justifying the weakening of consistency in order to support high availability. The eventual consistency model, in particular, has become a design default for many NoSQL systems. A more careful look at the CAP theorem shows that it precludes only small portion of the design space for distributed systems and does not compel the adoption of eventual consistency. Moreover, the weakness of eventual consistency creates a technical debt that is borne by the developers making use of the data store.

Many NoSQL designers are therefore exploring a return to transactions with ACID ( A tomicity, C onsistency, I solation, and D urability) properties as the preferred means of managing concurrency for a broad range of applications. Through the use of tailored optimizations, designers are finding that implementation of ACID transactions need not sacrifice scalability, fault-tolerance, or performance.

Questions about the CAP Theorem

The CAP theorem was inspired by a conjecture of Brewer [1] and proven by Gilbert and Lynch [2]. In its popular formulation, it states that a distributed system cannot simultaneously provide C onsistency, A vailability, and P artition-tolerance. Give the three properties (C, A, and P), a system can at best support two of the three.

Within the NoSQL community, the CAP theorem has been taken to provide persuasive justification for the adoption of weaker models of consistency. In particular, eventual consistency has been widely adopted, becoming something of a default for NoSQL design. Eventual consistency guarantees only that, given a sufficient period of time during which there are no writes, all previous writes will propagate through the system so that all replicated copies of the data will be consistent. This design is often lauded for allowing low latency and high availability in distributed systems on the assumption that stronger consistency is simply not attainable.

However, as the developer community has gained more experience with the use of NoSQL databases, questions have arisen regarding both the technical motivations and consequences of weak consistency.

As Gilbert and Lynch note, their proof relies on an asynchronous model in which "there is no clock, and nodes must make decisions based only on the messages received and local computation." This model is far more restrictive than actual networks, which "are not purely asynchronous. If you allow each node in the network to have a clock, it is possible to build a more powerful service" with stronger consistency.

In a recent retrospective on the CAP theorem and its influence on distributed database design, Brewer [3] notes that "CAP prohibits only a tiny part of the design space: perfect availability and consistency in the presence of partitions, which are rare." In fact, "because partitions are rare, there is little reason to forfeit C or A when the system is not partitioned." Yet systems that adopt eventual consistency forfeit C during all concurrent operations, not just when there are partitions.

The Weakness of Eventual Consistency

Eventually consistent systems can return stale data, i.e., data values that are not the most recently written. Moreover, the degree of staleness is unbounded in the sense that there is no constant time interval within which the system can guarantee that consistency will be restored. Rather, the degree of staleness is a complex probabilistic function based on a number of system parameters.

In an attempt to "quantify the degree to which eventual consistency is both eventual and consistent," Bailis et al. derive formulas for the probability of reading stale or inconsistent data in an eventually consistent system [4]. In particular, they provide probabilistic bounds on staleness as measured by either elapsed time or the number of elapsed versions. Such probabilistic bounds underscore the observation that developers employing such systems must account for and handle the possibility of inconsistent data.

Nevertheless, might there not be use cases for which the benefits of eventual consistency are worth its costs? Indeed, many discussions of eventual consistency motivate its adoption by its compatibility with a maximal degree of availability. For those systems that require such maximal availability, it is interesting to examine more closely the level of consistency that can be achieved.

Mahajan et al. [5] define an "always-available" system as one in which nodes never refuse reads or writes (as a transactional system might in the case of transaction failure). They revisit Gilbert and Lynch�s work and strengthen its result by employing an asynchronous model allowing local clocks, but no clock globally visible to all nodes. In this context, they define a consistency model called Real Time Causal (RTC) consistency that is strictly stronger than eventual consistency. They prove that RTC consistency is a "tight bound" on the consistency achievable in an always-available system, in the sense that RTC consistency can be provided in such a system, and no stronger consistency model can be. Given this result, it is unclear that eventual consistency remains an attractive alternative even for systems that must be always-available.

Google Returns to Transactions

Brewer [3] highlights "the hidden cost of forfeiting consistency, which is the need to know the system's invariants. The subtle beauty of a consistent system is that the invariants tend to hold even when the designer does not know what they are." In other words, when an application requires concurrent operations, weakened consistency creates a "technical debt" that is pushed onto the developers making use of the data store.

Google has experienced this technical debt and responded by returning to transactions in three major systems, Percolator, Megastore, and, most recently, Spanner. This shift to transactions is significant in that Google�s Bigtable [6], which lacks transactions, was the inspiration and model for many other NoSQL databases that adopt eventual consistency.

Percolator performs the core function around which Google�s business was founded, indexing the web in support of search. Peng and Dabek [7] explain that they "built Percolator to create Google's large �base� index, a task previously performed by MapReduce." The decision to employ transactions was directly motivated by the limitations of Bigtable. "Distributed storage systems like Bigtable can scale to the size of our repository but don't provide tools to help programmers maintain data invariants in the face of concurrent updates." In contrast, transactions "make it more tractable for the user to reason about the state of the system and to avoid the introduction of errors into a long-lived repository."

Percolator was built on top of Bigtable. It "stores its locks in special in-memory columns in the same Bigtable that stores data and reads or modifies the locks in a Bigtable row transaction when accessing data in that row." While Percolator was deemed successful in meeting its design goals, Peng and Dabek note that the decision to implement the system on top of Bigtable imposed a performance disadvantage. "The conventional wisdom on implementing databases is to �get close to the iron� and use hardware as directly as possible since even operating system structures like disk caches and schedulers make it hard to implement an efficient database. In Percolator we not only interposed an operating system between our database and the hardware, but also several layers of software and network links. The conventional wisdom is correct: this arrangement has a cost."

Megastore is a NoSQL database that Google developed to support interactive online services [8]. As in Percolator, the designers of Megastore decided to support transactions due to the limitations of eventual consistency. "Despite the operational advantages of eventually consistent systems, it is currently too difficult to give up the read-modify-write idiom in rapid application development."

Megastore provides a data model based on hierarchically structured entity groups, defined in term of tables that are mapped onto Bigtable. Entity groups provide application-defined locality that is used to "partition the database and replicate each partition separately, providing full ACID semantics within partitions, but only limited consistency guarantees across them." Bigtable is used for storage of a replica within a single data center, and Paxos is used for coordination among replicas.

Megastore has an interesting design and is notable for its low-latency implementation of Paxos. Nevertheless, like Percolator, it has significant limitations. It requires developers to structure their data into entity groups, and a single transaction can only access data from a single entity group. Hence, Megastore provides ACID semantics within a replication unit but not across replication units.

Google�s Spanner [9] is a scalable, multiversion database designed to support replication across multiple, widely separated data centers. It is Google�s first major distributed database not built on top of Bigtable. Google "consistently received complaints from users that Bigtable can be difficult to use" for applications "that want strong consistency in the presence of wide-area replication." In particular, "the lack of cross-row transactions in Bigtable led to frequent complaints." As a result, the designers now believe that "it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions."

Spanner�s application-facing data model is similar to Megastore�s, consisting of semi-relational tables implemented on top of a key-value store. However, Spanner�s most notable design detail is its reliance on a unique timekeeping system which acts as a sort of global clock. By combining measurements from GPS and atomic clocks a timestamp�s uncertainty can be strongly bounded. Spanner uses these timestamps to ensure that transactions satisfy external consistency, which is equivalent to linearizability and strictly stronger than serializability.

Reaching for Full ACID

Transactions characterized by the ACID properties of A tomicity, C onsistency, I solation, and D urability are the most tractable and powerful construct for managing concurrency, allowing multiple clients to simultaneously read and write data without fear of conflicts. Transactions serve as a building block for abstractions upon which additional functionality can be constructed.

Motivated by the limitations of less than full consistency, academic researchers have continued to work on distributed database designs that support full ACID transactions. Two recent research systems in this space are CloudTPS and Calvin, which have a common design feature of separating transactional processing into a layer above the distributed storage.

In CloudTPS [10], applications issue transactions to a Transaction Processing System (TPS), which corresponds to the transaction manager of a conventional system. The TPS is composed of a number of Local Transaction Managers (LTMs), each of which is responsible for a subset of data and which thereby distribute the load of transaction processing in a scalable manner. TPS employs the Two-Phase Commit (2PC) protocol. Its designers "observe that CloudTPS is mostly latency-bound." Relating to the use of 2PC, the "main factors influencing performance are the network round-trip times and the queuing delays inside LTMs. CloudTPS is therefore best suited for deployments with a single data center."

Calvin [11] aims to provide a scalable, distributed database with "high availability and full ACID transactions" on top of an underlying non-transactional storage system. A control layer above the storage system handles distributed transaction processing and replication management. Calvin manages distributed transactions using a deterministic locking protocol that eliminates the need for a distributed commit protocol. When multiple nodes need to coordinate on transactions, they do so "before they acquire locks and begin executing the transactions," allowing maximal "concurrency in transaction execution."

The deterministic protocol requires all transactions to be executed in a batch mode managed by sequencers and schedulers distributed across multiple nodes. "Calvin divides time into 10-millisecond epochs" during which transaction requests are collected at each sequencer. "At the end of each epoch, all requests that have arrived at a sequencer node are compiled into a batch." The batches are then replicated across nodes and arranged into a global transaction order by the schedulers. The main limitation of this design is that it gives up the ability to non-deterministically abort or reorder transactions during execution.

For replicating transaction requests, Calvin supports both an asynchronous replication mode and a Paxos-based synchronous replication mode. In the synchronous mode, "all sequencers within a replication group use Paxos to agree on a combined batch of transaction requests for each epoch." The current implementation relies on ZooKeeper for Paxos, although the designers note that latency could be improved by "implementing a more streamlined Paxos agreement protocol."

Achieving High Performance with Full ACID

A recurrent theme in the NoSQL literature is the need to handle partitions and, in particular, to trade off consistency and availability in the presence of partitions. For example, Brewer [3] advocates a design strategy in which systems "detect partitions, enter an explicit partition mode that can limit some operations, and initiate a recovery process to restore consistency."

There is, however, another perspective, one that may prove both simpler and more powerful. According to this perspective, the fundamental problem of distributed systems is not partition but concurrency. Even without partitions, concurrency creates the possibility of conflicting operations and the need to handle them.

Once one decides to handle concurrency through transactions, partition becomes a special case of latency. If one node sends a message to another and receives no response within an acceptable interval, the sender may not know if the connection to the receiver is down or if the receiver�s response is just delayed in processing, and in the end it doesn�t matter. In this light, one can treat partition as an increase in latency beyond a threshold of acceptability as observed by some node.

Hence, a system that supports transactions doesn�t need a special partition mode. In the presence of partitions, transactions may fail, but transactions sometimes fail in any case. The challenge then becomes to always maintain consistency and, subject to that constraint, maximize performance.

Conclusions

There is an emerging understanding within the NoSQL community that the CAP property of consistency is crucial for applications in which developers must maintain data invariants while performing concurrent updates. Giving up that consistency for an eventual consistency model creates a technical debt that is borne by the developers making use of the data store. Many NoSQL designers are therefore exploring a return to transactions with ACID properties as the preferred means of managing concurrency for a broad range of applications. Through the use of tailored optimizations, designers are finding that implementation of ACID transactions need not sacrifice scalability, fault-tolerance, or performance.

References

[1] E. Brewer, "Towards Robust Distributed Systems," Proceedings of the 19th Annual ACM Symposium on Principles of Distributed Computing (PODC '00), pp. 7-10, 2000.

[2] S. Gilbert and N. Lynch, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services," ACM SIGACT News, vol. 33, pp. 51-59, June 2002.

[3] E. Brewer, "CAP Twelve Years Later: How the 'Rules' Have Changed," Computer, pp. 23-29, February 2012.

[4] P. Bailis, S. Venkataraman, M. J. Franklin, J. M. Hellerstein, and I. Stoica, "Probabilistically Bounded Staleness for Practical Partial Quorums," Proceedings of the VLDB Endowment, vol. 5, 2012.

[5] P. Mahajan, L. Alvisi, and M. Dahlin, "Consistency, Availability, and Convergence," Technical Report (UTCS TR-11-22), 2011.

[6] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A Distributed Storage System for Structured Data," Proceedings of the 7th USENIX Symposium on Operating System Design and Implementation (OSDI '06), 2006.

[7] D. Peng and F. Dabek, "Large-Scale Incremental Processing Using Distributed Transactions and Notifications," Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI �10), 2010.

[8] J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh, "Megastore: Providing Scalable, Highly Available Storage for Interactive Services," Proceedings of the 5th Biennial Conference on Innovative Data Systems Research (CIDR '11), 2011.

[9] J. Corbet et al., "Spanner: Google�s Globally-Distributed Database," Proceedings of the 10th USENIX Symposium on Operating System Design and Implementation (OSDI �12), 2012.

[10] Z. Wei, G. Pierre, and C.-H. Chi, "CloudTPS: Scalable Transactions for Web Applications in the Cloud," IEEE Transactions on Services Computing, Special Issue on Cloud Computing 2011, 2011.

[11] A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao, and D. J. Abadi, "Calvin: Fast Distributed Transactions for Partitioned Database Systems," Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12), 2012.

Database Knowledge

Database Tutorials and Videos

Click here to view the complete list of archived articles

This article was originally published in the Spring 2013 issue of Methods & Tools