The trouble with timestamps

Some folks have asked whether Cassandra or Riak in last-write-wins mode are monotonically consistent, or whether they can guarantee read-your-writes, and so on. This is a fascinating question, and leads to all sorts of interesting properties about clocks and causality.

There are two families of clocks in distributed systems. The first are often termed wall clocks, which correspond roughly to the time obtained by looking at a clock on the wall. Most commonly, a process finds the wall-time clock via gettimeofday(), which is maintained by the operating system using a combination of hardware timers and NTP–a network time synchronization service. On POSIX-compatible systems, this clock returns integers which map to real moments in time via a certain standard, like UTC, POSIX time, or less commonly, TAI or GPS.

The second type are the logical clocks, so named because they measure time associated with the logical operations being performed in the system. Lamport clocks, for instance, are a monotonically increasing integer which are incremented on every operation by a node. Vector clocks are a generalization of Lamport clocks, where each node tracks the maximum Lamport clock from every other node.

Consider a typical TCP service. Requests from a client are distributed through one or more load balancers to a pool of stateless application servers, which run the business logic. Those application servers persist their shared state in a distributed database, like Cassandra or Riak. I’m gonna focus on Cassandra since it doesn’t offer logical clocks, but most of this discussion applies to Riak with Last Write Wins enabled, as well. The question at hand: what safety properties will the service’s operations exhibit?

Cassandra uses wall-clock timestamps provided by the server, or optionally by the client, to order writes. It makes several guarantees about the monotonicity of writes and reads given timestamps. For instance, Cassandra guarantees most of the time that if you write successfully to a quorum of nodes, any subsequent read from a quorum of nodes will see that write or one with a greater timestamp.

How do those ordering properties translate to the application’s operations?

Session consistency

Cassandra provides session consistency for timestamps, which means that a process which accesses the database in the context of a session is guaranteed to always read its latest write or one with a higher timestamp. There is, however, no guarantee about the visibility of writes to other nodes. This suggests an important caveat: clients of the service will not read their own writes unless they too maintain an open session (e.g. over a TCP connection) with a particular app server. The app is responsible for ensuring that its application sessions are disconnected if it ever loses its Cassandra session.

[Update] Peter Bailis points out that jbellis reverted the patch adding read-your-writes consistency a week later, so I guess that no Cassandra release has actually tried to provide read-your-writes. My mistake, I think–the ticket and docs are somewhat unclear on the current state of affairs.

OK, so we’re using TCP or long-lived HTTP connections, instead of making discrete HTTP requests to the service, and we’ve added appropriate lifecycle handlers to the app server. Our session now stretches continuously from a given Cassandra node through the application server to the client. Are operations session-consistent?

Well, not exactly. Cassandra uses the JVM’s System.getCurrentTimeMillis for its time source, which is backed by gettimeofday. Pretty much every Cassandra client out there does something similar. That means that the timestamps for writes made in a session are derived either from a single Cassandra server clock, or a single app server clock. These clocks can flow backwards, for a number of reasons:

Hardware wonkiness can push clocks days or centuries into the future or past.

Virtualization can wreak havoc on kernel timekeeping.

Misconfigured nodes may not have NTP enabled, or may not be able to reach upstream sources.

Upstream NTP servers can lie.

When the problem is identified and fixed, NTP corrects large time differentials by jumping the clock discontinously to the correct time.

Even when perfectly synchronized, POSIX time itself is not monotonic.

That last one might come as a surprise, because we usually think of POSIX time as being “the number of seconds since an epoch”. This isn’t quite true. Because Of Reasons, POSIX days are defined as 86400 seconds in length. However, real days aren’t exactly 86400 seconds. The powers-that-be occasionally schedule leap seconds to correct for the drift. On those occasions, the system clock will either skip a second, or double-count a second–e.g., counting 59:60.7, 59:60.8, 59:60.9, 59:60.0, 59:60.1, and then repeating the previous second’s worth of timestamps before continuing on.

There are therefore some POSIX timestamps which do not refer to any time, and there are some POSIX timestamps which refer to two distinct times. This most recently happened on July 1st, 2012, and again a month later. This causes so many problems that Google actually smears out the leap second over the course of the day, preserving monotonicity.

If the system clock goes backwards for any reason, Cassandra’s session consistency guarantees no longer hold. Say a client writes w1 just prior to a leap second, then writes w2 just after the leap second. Session consistency demands that any subsequent read will see w2–but since w2 has a lower timestamp than w1, Cassandra immediately ignores w2 on any nodes where w1 is visible. In fact, Cassandra’s monotonicity guarantees operate in reverse, doing their best to make sure that subsequent writes are not visible.

How do you fix this? Use monotonic clocks. You can’t use Lamport clocks because they’d lead to all kinds of awkwardness between nodes, but vector clocks or dotted version vectors would be appropriate. You can also enforce that time never goes backwards in the context of a session. Both the database and (if client timestamps are used) all client code should check to make sure that timestamps never go backwards in the context of a session or process, and delay timestamp generation if necessary. Higher latencies or client-side exceptions are almost certainly preferable to silently lost data.

Monotonic reads and writes

Cassandra also claims to offer monotonic read consistency, which means that if a client has seen any particular value for a key, it will never read an older value.

Because system clocks are not monotonic, writes can’t be monotonic either. Consider this same sequence as before:

Process A writes w1 with timestamp t=2 Process A writes w2 with timestamp t=1 Process A reads w1, but expected w2

These writes are clearly not monotonic: w2 should have won. We could also consider multiple clients. This case doesn’t require system clocks to go backwards–it can happen whenever clocks aren’t tightly synchronized between database servers or client nodes:

Process A writes w1 with timestamp t=2 Process B reads w1 Process B writes w2 with timestamp t=1 Process B reads w1, but expected w2

It’s a little tough to work around this one because w2 isn’t just temporarily invisible–it’s gone forever. It might survive on an isolated node for a bit, but eventually the Cassandra or Riak LWW rules will ensure it’s destroyed in favor of the earlier write, because its timestamp is smaller. Depending on whether you consider successfully written values as “seen” by a process, this also violates monotonic reads as well.

Again, this isn’t a bug in the database–as far as LWW is concerned, this is correct behavior. It’s a problem with the coupling between the wall-clock causality model and the application model. If a client considers the data that it writes successfully as “seen”, LWW can’t preserve monotonicity.

Doomstones

Deletes in Cassandra and Riak work by writing a tombstone record, which has a particular timestamp. All objects with a lower timestamp will be silently deleted until GC removes the tombstone record–which means that a rogue client or node can cause the destruction of every write to a record for days to weeks afterwards.

Process A deletes a row with t=100000000 Process B writes w1 with timestamp t=1 Process B reads null, but expected w1 This continues for days

This actually happens all the time in LWW systems, but on short-enough timescales that you might not notice. Every time you delete a cell or row, or empty a CQL collection, all writes to that record are discarded for a short time frame–however many seconds separate A’s clock from the furthest-behind node. This is why it’s so hard to write automated tests for Riak which do rapid create/delete cycles without vclocks: you start dipping below the causality horizon, so to speak.

This behavior violates strong, eventual, causal, read-your-writes, session, and monotonic write consistency, and depending on how you interpret “seen”, violates monotonic read consistency as well.

What does all this mean?

Timestamps, as implemented in Riak, Cassandra, et al, are fundamentally unsafe ordering constructs. In order to guarantee consistency you, the user, must ensure locally monotonic and, to some extent, globally monotonic clocks. This is a hard problem, and NTP does not solve it for you. When wall clocks are not properly coupled to the operations in the system, causal constraints can be violated. To ensure safety properties hold all the time, rather than probabilistically, you need logical clocks.

The safest option I can think of is to use a strong coordinator for your timestamps, like an atomic incrementing counter in Zookeeper. That’s slow and limits your availability, but there are some tricks you can use to slice the problem into more manageable pieces, reducing contention. You probably don’t need coordinated timestamps between Cassandra rows or Riak objects, for example.

A somewhat less safe but reasonable option is to use NTP rigorously on your machines, and sync it to TAI or GPS instead of POSIX time or UTC. Make sure you measure your clock skew: everyone I know who says they run NTP has discovered, at one point or another, that a node was way out of sync. If you want rough correspondence to POSIX time, you can still ensure monotonicity by running your own NTP pool and slurring leap seconds over longer time frames.

Or you could choose a database which supports logical clocks for operations where consistency guarantees matter. Chances are your system has some operations where safety is critical–for those, use logical clocks. When it’s OK to have fuzzy ordering constraints, feel free to use wall clocks. They’re often slightly faster, even if their behavior is harder to reason about than their logical counterparts.

For a good non-mathematical overview of weak consistency properties, see Werner Vogels' Eventually Consistent paper. Google’s Spanner paper explores strong consistency guarantees which are achievable by placing strict bounds on clock skew. To explore consistency coupling in more detail, including how to overlay stronger consistency models onto weaker datastores, try Peter Bailis' Bolt-on Causal Consistency. Happy reading!