Jepsen: Crate 0.54.9 version divergence

In the last Jepsen analysis, we saw that RethinkDB 2.2.3 could encounter spectacular failure modes due to cluster reconfiguration during a partition. In this analysis, we’ll talk about Crate, and find out just how many versions a row’s version identifies.

Crate is a shared-nothing, “infinitely scalable”, eventually-consistent SQL database built on Elasticsearch.

Because Elasticsearch has lost and corrupted data in response to network partitions and other faults, and continues to do so, some might question whether Elasticsearch is appropriate for a primary data store. Crate’s co-founders knew about these hazards, and promised to publish fault-tolerance documentation in October 2014.

As of June 28, 2016, I still can’t locate a single mention of these risks on the Crate web site. There are zero results for partition tolerance. One result for safety (about the Python client’s thread-safety). Zero links to the Elasticsearch resiliency page. Two results for fault tolerance, neither of which discuss the known risks in Elasticsearch’s homegrown leader election and replication algorithm. The multi-node best practices and shared-nothing docs only discuss setting minimum_master_nodes to a majority, which is insufficient to prevent split-brain and data loss in Elasticsearch.

Consistency

Crate’s position on consistency is a little complicated. The product overview claims “Crate is eventually consistent but offers transactional semantics”, but the storage and consistency docs say “Crate does not provide transactions”. They go on to explain:

By offering read-after-write consistency we allow synchronous real time access to single records, immediately after they were written. Example: after posting a form and having written its record, the record will be consistent and immediately available when selecting this primary key across the whole cluster. However if you want to calculate a sum (or other aggregation query) this record may not yet be included in the aggregation, but only a few milliseconds later.

This is probably not true: Crate 0.54.9 uses Elasticsearch 1.7, which definitely loses updates. If Crate loses updates, it’s unlikely that you can guarantee reading the latest write, let alone reading that write ever again. Even the unreleased Elasticsearch 5.0.0 still fails its Jepsen tests according to Elastic, so claiming linearizable reads on single keys might be a bit of a stretch.

Let’s look at something easier: row versioning. Like Elasticsearch, Crate claims to offer optimistic concurrency control, in which every row has a _version column which is incremented with each update. You can perform safe conditional updates by ensuring that the _version hasn’t changed since your last read:

cr> update locations set description = 'Updated description'
... where id = 5 and "_version" = 3;
UPDATE OK, 1 row affected (... sec)
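To make the intended protocol concrete, here’s a minimal sketch of a compare-and-set retry loop in Python, against a toy in-memory store standing in for Crate. The `VersionedStore` class and its methods are purely illustrative, not Crate’s client API; they just model the read-version, write-if-version-unchanged contract.

```python
# A toy model of Crate-style optimistic concurrency control. Each row
# carries a version that increments on every update; a conditional
# update succeeds only if the version the client read is still current.
class VersionedStore:
    def __init__(self):
        self.rows = {}  # id -> (value, version)

    def insert(self, rid, value):
        self.rows[rid] = (value, 1)

    def read(self, rid):
        return self.rows[rid]  # returns (value, version)

    def update_if_version(self, rid, value, expected_version):
        current_value, current_version = self.rows[rid]
        if current_version != expected_version:
            return False  # someone else updated the row; caller retries
        self.rows[rid] = (value, current_version + 1)
        return True

def increment(store, rid):
    # Classic CAS retry loop: read, compute, conditionally write.
    while True:
        value, version = store.read(rid)
        if store.update_if_version(rid, value + 1, version):
            return

store = VersionedStore()
store.insert(5, 10)
increment(store, 5)
print(store.read(5))  # prints (11, 2)
```

This scheme is safe only if each version number identifies exactly one value; that invariant is what the test below probes.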

A natural question presents itself: do _version numbers uniquely identify versions of a document? Let’s find out!

Designing a test

We’ll create a simple table mapping integer keys to values:

create table registers (
  id    integer primary key,
  value integer
) with (number_of_replicas = "0-all")

Then we’ll perform reads and writes, by primary key, against that table.

(case (:f op)
  :read  (->> (sql! conn "select value, \"_version\" from registers where id = ?" k)
              :rows
              first
              (independent/tuple k)
              (assoc op :type :ok, :value))
  :write (let [res (sql! conn "insert into registers (id, value) values (?, ?)
                               on duplicate key update value = VALUES(value)"
                         k v)]
           (assoc op :type :ok)))

We’ll perform blind reads, and write a series of unique integer values to the table, to make sure we can tell different writes apart. Then we’ll have five clients (one per node) perform writes, while another five clients (again, one per node) perform reads. We’ll test ten keys concurrently, in case not all keys are affected equally by a given failure mode. Meanwhile, we’ll cause network partitions and heal the network every two minutes. We can use a complex topology, like overlapping majorities, or a simple 2/3 split; almost any partition which isolates a primary node is sufficient.
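The nemesis schedule described above can be sketched as follows, assuming a five-node cluster. The `partition_two_thirds` function and node names are illustrative, not Jepsen’s nemesis API; the real harness drops traffic between the two halves, waits two minutes, heals the network, and repeats.

```python
import random

def partition_two_thirds(nodes):
    # Split the cluster into a random minority and majority, so that
    # whichever node is currently a primary has a chance of being
    # isolated from the rest of the cluster.
    shuffled = random.sample(nodes, len(nodes))
    k = len(nodes) // 2  # 2 of 5 nodes land on the minority side
    return shuffled[:k], shuffled[k:]

minority, majority = partition_two_thirds(["n1", "n2", "n3", "n4", "n5"])
# The nemesis would now cut links between the two halves for two
# minutes, then heal the network and pick a fresh split.
```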

Our check of correctness is simple–we won’t be verifying linearizability, or checking for dirty reads, lost updates, etc. Instead, we just want to ensure that each _version of a given row identifies a single value.

(reify checker/Checker
  (check [_ test model history opts]
    (let [; Find all successful reads, and group them by _version
          reads  (->> history
                      (filter op/ok?)
                      (filter #(= :read (:f %)))
                      (map :value)
                      (group-by :_version))
          ; For each [version, reads] pair, discard those with one value
          multis (remove (fn [[k vs]]
                           (= 1 (count (set (map :value vs)))))
                         reads)]
      ; The history is valid if no versions had multiple values
      {:valid? (empty? multis)
       :multis multis})))
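For readers unfamiliar with Clojure, the same check can be sketched in Python: group successful reads by `_version`, and flag any version that maps to more than one distinct value.

```python
from collections import defaultdict

def check_versions(reads):
    """reads: dicts with 'value' and '_version' keys, one per
    successful read. A history is valid iff every _version maps to
    exactly one value."""
    by_version = defaultdict(set)
    for r in reads:
        by_version[r['_version']].add(r['value'])
    # Keep only versions observed with more than one distinct value.
    multis = {v: vals for v, vals in by_version.items() if len(vals) > 1}
    return {'valid': not multis, 'multis': multis}

# An invalid history: version 67237 is seen with two distinct values.
history = [{'value': 67250, '_version': 67237},
           {'value': 68687, '_version': 67237}]
result = check_versions(history)
print(result['valid'])  # prints False
```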

This test reliably demonstrates that two distinct versions of a document can have the same Crate _version field. For example:

{:multi
 {:valid? false,
  :results
  {0 {:valid? false,
      :multis ([67237 [{:value 67250, :_version 67237}
                       {:value 67250, :_version 67237}
                       {:value 67250, :_version 67237}
                       {:value 67250, :_version 67237}
                       {:value 67250, :_version 67237}
                       {:value 67250, :_version 67237}
                       {:value 68687, :_version 67237}
                       {:value 68687, :_version 67237}]])},
   7 {:valid? true, :multis ()},
   1 {:valid? true, :multis ()},
   4 {:valid? true, :multis ()},
   ...

Keys 7, 1, and 4 passed: each version identified a single write. However, key 0 did not: version 67237 identified two versions, one in which the value was 67250, and another in which it was 68687. I’ve chosen a short run here for clarity, but other runs involve long-lived split-brain lasting the duration of a partition. It’s possible for a Crate cluster to maintain two different copies of a row with the same _version for over 220 seconds, both visible to reads.

This is issue 3711.

Discussion

Crate’s transaction isolation model might best be described as Overly Optimistic Concurrency Control. Not only can a document have multiple _versions visible at any given time, but each _version can have multiple versions.

This invalidates the whole point of MVCC: conditional updates only prevent write loss if the version you read is the same as the version you replace. In the example above, a client performing, say, an increment against row 0 could read 67250 instead of 68687, then successfully commit against version 67237, causing the loss of over a thousand writes. This behavior depends on network and partition topology, which node or nodes clients connect to, and various timescales.
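The mechanism of that loss can be simulated in a few lines of Python. The dictionaries below stand in for two divergent replicas, using the values from the test result above; nothing here is Crate’s API, just the arithmetic of the anomaly.

```python
# Two divergent copies of row 0, both claiming _version 67237 -- the
# anomaly the test observed. Entries are (value, _version) pairs.
replica_a = {0: (67250, 67237)}
replica_b = {0: (68687, 67237)}

# A client performing an increment reads from replica_a...
stale_value, version = replica_a[0]

# ...and its conditional write lands on replica_b. The version check
# passes, because both divergent copies carry version 67237:
current_value, current_version = replica_b[0]
assert current_version == version  # CAS precondition "succeeds"

# The write commits, silently discarding value 68687 and every write
# that produced it:
replica_b[0] = (stale_value + 1, current_version + 1)
print(replica_b[0])  # prints (67251, 67238) -- 68687 is gone
```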

Crate recommends making backups, ensuring you have multiple replicas, and setting minimum_master_nodes “to avoid split brain”. None of these will help. Backups trade one kind of write loss for another: restoring a backup can undo corruption from conflicting version identifiers, or any of Elasticsearch’s other exciting behaviors, only by rolling back every committed operation since the backup was taken. Multiple replicas make the problem worse, by creating more network links that can fail, and more replicas that can diverge. minimum_master_nodes is necessary, but not sufficient, to avoid write loss in Elasticsearch prior to 5.0.0. Crate uses Elasticsearch 1.7 as of 0.54.9, and is upgrading to 2.3.3.

Building a database on Elasticsearch is something of a double-edged sword. Crate has been able to focus on hard problems like query planning, joins, aggregations, and so on–without having to take on the tough work of building a storage layer, cluster membership, replication algorithm, etc. However, Crate is tightly coupled to Elasticsearch, and is dependent on the Elastic team for improvements to that technology. Elasticsearch’s consistency issues have been well-known for years, and the process to fix them is still ongoing. It’s not clear what Crate can do to get out of this situation: a rewrite would be complex and expensive (and introduce new and unknown failure modes), whereas fixing Elasticsearch’s consistency problems could easily consume person-years of engineering time that a small company can ill-afford.

There are good reasons to use Crate: distributed SQL stores, especially with Crate’s capacity for aggregations and joins, are hard to come by. Moreover, Crate introduces several helpful features not present in Elasticsearch. That said, the risk of data loss is real, and is unlikely to be resolved at any point in the near future. I recommend that Crate users avoid using Crate as their system of record–at least, where each record matters. Like Elasticsearch itself, you should use a safer database as your primary store, and continuously backfill data from that primary store into Crate for querying. Crate may also be suitable for cases where occasional data loss or corruption is mostly harmless, e.g. high-volume sensor data, observability, analytics, etc.