Preface

At re:Invent 2017, Amazon Web Services (AWS) announced Amazon Time Sync Service, a highly accurate and reliable time reference that is natively accessible from Amazon EC2 instances. It is much like the Google TrueTime published in 2012. Why do Google and AWS both want to make efforts to provide global time service? Is there any inspiration for building distributed database? This topic is important to think about.

Time synchronization remains a hard nut to crack in distributed systems, especially for distributed databases such as TiDB where time is used to confirm the order of the transaction to guarantee the ACID compliance.

In this post, I will introduce the existing solutions to tackle the time synchronization issue in distributed systems, as well as their pros and cons. I will also share why we chose to use the timestamp oracle (TSO) from Google Percolator in TiDB.

Order of the events

Linearizability is important for distributed systems, especially for distributed databases. We can't allow reading stale value after the update. For example, if account A has $100, and transfers $10 to B. After the transaction finishes, we can't read $100 again from account A.

Another simple explanation is: if we write a data at time T1 like set a = 10 , after T1 , we must always read 10 as the value of a , not 11 or any other values.

But how can we ensure we can read the newest data? How can we know the order of two events?

We usually use "happened before"( hb or -> ) to describe the relationship of two causal events. For two events e1 and e2 , e1 -> e2 , we can say that e1 happened before e2 , or e1 causes e2 .

If we have only one process, determining the order is easy, because all the events can only happen in the process in sequence. However, in a distributed system, things become more complex. The events may happen in different places, and it becomes hard to determine the order of all the events.

At first, we may consider using wall time, but the time is not the same in all the processes -- one process may run faster, and the other process may walk slower. So we can't use the time directly to check the order of the events. Luckily, we have other ways to do it.

Logical clock

A simple way is to use logical clock which was proposed by Lamport in 1978 for timestamping and ordering events in a distributed system. The algorithm is:

Every process starts with an initialized clock counter. A process increments the value of its counter before each event in that process. When a process sends a message, it includes the counter value with the message. When a process receives a message, it increases its clock counter value to be bigger than both its current clock counter value and the value in the received message. If two events have the same clock value, we may think they happen simultaneously so we should use process ID or any other unique ID to differentiate them.

Using this can easily determine the order of events. For example, assuming we have processes P1 , P2 , both with an initialized counter value 0 .

P1 increases the counter value to 1 and executes event A . P2 wants to do something but needs to know the result of the event A , so it increases its counter to 1 and sends a message with counter 1 to P1 . P1 receives the message and increases its counter to 2 , then increases the counter to 3 and replies to P2 . P2 receives the message and increases the counter to 4 then executes event B with counter 5 .

We use C(A) and C(B) as the counter value of the events, if events A and B happen in one process, and A happens before B , we can know that C(A) < C(B) . If A and B happen in different processes, we can also know C(A) < C(B) based on the message, so if A happens before B, we can infer C(A) < C(B) . But if C(A) < C(B) , it doesn't necessarily mean that A happens before B.

If the two events are not causally related (no communication between the processes), we can't determine the order of the events. We can use vector clock to fix this. But whether it's logical clock or vector clock, they both have a disadvantage: we can't know what time the event happens because both of the two clocks merely record the order of the events instead of the time.

To get the chronological order of the events, we have to go back to square one and use real time. But we can't depend on Network Time Protocol (NTP) directly because it has some errors and the time is not accurate, so what should we do?

TrueTime

In Google Spanner, it uses TrueTime API to fix the problem of time. Spanner uses GPS and Atomic Clock to correct the time and can guarantee clock uncertainty bound (ε) is very small. The ε value is less than 7ms in 2012; it may be less now.

The TrueTime API is very simple:

Method Return TT.now() TTinterval: [earliest, latest] TT.after(t) true if t has definitely passed TT.before(t) true if t has definitely not arrived

Spanner can't get an accurate time point but only an interval. We can use function now to get an interval. Assuming event A happens at the time point tt.a , and tt.b is for event B . We can know that tt.a.earliest <= tt.a <= tt.a.latest . For event A and B , if A happens before B , we can infer tt.a.latest < tt.b.earliest .

Because TrueTime has the clock uncertainty bound ε , so for every transaction commit, it must wait 2ε time to guarantee linearizability, but ε is so small that the performance is still high.

The biggest hurdle to adopting TrueTime is that it depends on special hardware, such as GPS clocks and atomic clocks, which many companies do not have.

Hybrid Logical Clock

Hybrid logical clock (HLC) is another way for timekeeping and timestamping in distributed systems.

Based on NTP, HLC can only read time from NTP, but it won't change it. HLC contains two parts: physical clock + logical clock. For example, assuming:

pt : the physical time

l : the maximum of pt information learned so far

c : the logical clock To compare the order of two events, we can first check their l time, if equal, we can check c time, for any two events e and f , if e happened before f , we can know (l.e, c.e) < (l.f, c.f) .

The HLC algorithm for node j :

Initialize l.j = 0 and c.j = 0 when node j starts up. Send a message to another node, or a local event happens: l'.j = l.j l.j = max(l'.j, pt.j) if (l.j = l'.j) then c.j = c.j + 1 else c.j = 0 Timestamp with l.j, c.j Receive a message from node m . l'.j = l.j

l.j = max(l'.j, l.m, pt.j) if (l.j = l'.j = l.m) then c.j = max(c.j, c.m) + 1 else if (l.j = l'.j) then c.j = c.j + 1 else if (l.j = l.m) then c.j = c.m + 1 else c.j = 0 Timestamp with l.j, c.j