-- Scalable Read Atomic Transaction for Partitioned Datastore

A Story

You are building a messaging app. You start with a non-partitioned single database, where you store unseen message count and the actual messages in two different tables. It served you well ... until more and more people are using your app and the data no longer fits in a single host. Then you tried to partition the data; put unseen message count and messages into two different databases. Works great, until you got bug reports from your upset users complaining that they are seeing a big red badge when there's no unseen messages.

The problem

When you partitioned the data, you have to now deal with Multi-Partition Transaction for certain use cases. The badge count here is just an example of that. (For the sake of this post, let's assume the Badge Count here's pulled from the phone instead of pushed like iOS's app badge count. You can think of this badge as a badge inside your app.)

There are other use cases that share the same underlying challenge. E.g. dis-aggregated secondary index, cross shard transaction on a sharded database. To abstract the problem, let's say you have x=0 and y=0 two variables and you update them to x=1 and y=1 . While it's being updated and you perform a read, you don't want to get x==1 and y == 0 or x==0 and y==1 . You want to get either x==0 and y==0 or x==1 and y==1 .

Multi-Partition Transaction

The most important guarantees we care about here are Write Atomicity(all or none updates would be performed) and Read Atomicity(all or none updates from a transaction would be visible).

Atomic Write can be achieved by Two Phase Commit. But two phase locking based Atomic Read would be too expensive for read heavy workload, as it requires two RTTs and no concurrent transactions be performed on overlapping keys. So for read heavy workload from Multi-partitioned datastore, how can we perform Atomic Reads at scale?

RAMP Transaction

Pater Bailis introduced RAMP transaction to solve exactly this problem. I am going to explain the gist of the idea using the example above. The main idea being, with enough metadata about the transaction being stored at write time, client can detect fractured read and fix it up. And what's really cool is that it requires no coordination (unlike two phase commit) and it allows concurrent read/write transactions.

Write

[step-1] client pick a monotonically increasing and unique id ts t . You can imagine it being a tuple of client hostname and timestamp. And two ts s can be compared based on timestamps.

[step-2] set x=1, ts=ts t . It would look something like the following. The write is not yet visible to client by default yet.

on host 0

Key Value TS other_participants x 1 ts t y

[step-3] set y=1, ts=ts t .

on host 1

Key Value TS other_participants y 1 ts t x

Assume it used to be

Key Value TS other_participants y 0 ts p z

[step-4] set x=1 to be visible on host 0 by setting the last_committed_ts to be ts t .

Key last_committed_ts x ts t

[step-5] set y=1 to be visible on host 1

Notice that this assumes ts t > ts p . If it's the other way around, last_committed_ts would stay to be ts p .

Key last_committed_ts y ts t

Read

Let's say a read is literally performed when the write transaction is in progress. And client got something like this:

(key: x, value:1, ts:ts t , participants: y) and (key:y, value:0, ts:ts p , participants: z), when the write transaction is in between step-4 and step-5.

Step-5 above assumes ts p < ts t . And in this case, client noticed that y claims to be in the same transaction as x at ts t . Since ts p < ts t , client would know that it's missing (key: y, ts:ts t ), and can fetch that specific version from host 1 before it's fully committed. This means write can literall "fail" at step-5 and it will be fine. Also notice that if write operation were aborted at step-4, no one would ever read and data at ts t . Hence it's invisible to clients. See? No coordination needed! No rollback. This is effectively the main idea behind RAMP transaction.

Wrinkle in the original RAMP

ts p < ts t .

Well does that always hold, even if ts t ( set y=1 ) happened after ts p set y=0 ?

The answer is NO. Since the timestamp in RAMP is client timestamp, which is subject to various kind of clock-skew or even bogus local timestamp. The original RAMP paper has a policy that highest-timestamped write wins, instead of latest write wins. So it means even if the second write happens later, as long as its timestamp is smaller, it will effectively lose to the first write. The data in some sense is "lost". This is not an inherit limitation of performing RAMP transaction though, you can easily iron out the wrinkle by performing read-modify-write on each key to ensure ts is monotonically increasing. Naive HLC would work pretty well here. But it's not free as it will impact throughput as well as latency.

Reduce Write Amplification

The Write Amplification of RAMP is not negligible, as it needs to store transaction participants as well as multiple versions of each key.

Can we cache recent writes for each key instead?