Recently, two of my coworkers started working on an MMO. Naturally, this led to discussion about networking. The discussion reminded me of a puzzle I came up with a few years ago, that changed how I think about networks and time.

Puzzle: Timely Robots

Suppose there are two robots, each isolated in a separate room. The robots can track time, but they don’t have access to synchronized clocks (their internal clocks differ by an unknown amount). Furthermore, the only way for the robots to communicate is by sending signals down two (very long) cables connecting the two rooms in a cycle.

The cables are one-way-only and may or may not have the same length. As a result, the communication latencies may not be equal: either both latencies will be 2 seconds, or one latency will be 3 seconds and the opposite latency will be 1 second. The puzzle is: create a network protocol that the robots can execute to determine which case they are in (2s:2s or 3s:1s), or prove that no such protocol exists.

If you want to solve the puzzle for yourself, pause and think now. Spoilers follow the puzzle diagram.

Solution: Skewing the Asymmetry

My initial intuition, and the initial intuition of everyone else I’ve told the puzzle to (so far), has been that there should be a protocol to distinguish the two cases. This intuition is wrong. There is no such protocol.

There are three parameters relevant to the puzzle: the delay from the robot in room A (which I will arbitrarily call the ‘client’) to the robot in room B (the ‘server’), the opposite delay from room B to room A, and the unknown skew between their clocks.

Here’s a sequence diagram where each parameter is varied in turn, to give an intuitive idea of what they do:

Notice how, as each parameter changes, the message arrival times (as measured by the receiver) change. Changing one of the parameters creates observable differences, and these differences can be used to determine the value of said parameter when the other two are known.

If two of the parameters are varying/unknown, it’s still possible to solve for them by using the observed differences. For example, given synchronized clocks (a clock skew of 0 seconds), the delay from the server to the client can be determined with a single time-stamped packet.

Unfortunately, solving for two parameters is the limit. There’s not enough measurable information present to solve a third parameter. The effects of changing the clock skew are indistinguishable from the effects of changing the asymmetry in the latencies (while keeping the round trip time constant).

Here’s a sequence diagram varying the parameters in sync, cancelling all observable effects:

In the above diagram, even though the underlying parameters are changing, the times that messages are sent and received is never changing. Even when we bend the rules and make one of the delays negative, sending packets backwards in time, there are no observable differences. (Which suggests an intuitive reason for there to be no solution: imagine talking through a time portal to someone in the year 2000. The conversation works exactly like a normal conversation, despite the one-way latencies being radically different.)

Any protocol that correctly reports “We’re in the 3s:1s case!” or “We’re in the 2s:2s case!” in one situation can be fooled into giving the wrong answer by switching to the other case and adjusting the unknown skew between the two robots’ clocks by one second. Everything about what happens as the protocol runs, that the robots can measure, will remain identical. They will report the same result as before, except now they’re quite wrong.

So there is no protocol that solves the puzzle. We can always find an incorrectly categorized case by picking a clock skew that counters the latency asymmetry.

Consequences

The (lack of) solution to the puzzle suggests a useful trick for thinking about networked code. Does having two different one-way delays make things more confusing? No problem! Go ahead and assume they’re equal, or that one of them is zero! As long as you don’t change the round trip time, or accidentally sneak in some synchronized clocks, the analysis will give exactly identical results.

For example, when thinking about the lock-step protocol that a game like WarCraft 3 uses, I find it easiest to imagine that messages from the server to the client (which determine when actions actually occur) travel instantaneously. This ensures that all clients apply actions at the same time, and thus are seeing the exact same game state at the exact same time. In reality each player will be seeing slightly older or newer game times, but (as I’ve been explaining) that has no observable effect within the game or on the analysis.

Before considering the puzzle I hadn’t truly understood the fact that, in an online game, different players would be seeing slightly different times. Good thing that assuming they’re seeing the same time is a perfectly valid way of modelling what’s happening!

The limitations exemplified in the puzzle also have unfortunate real world consequences. For example, consider the Network Time Protocol, for synchronizing computer clocks. If the NTP protocol works correctly when the delays are even, then it will be off by up to half of the round trip time when the delays are almost all in one direction. Without knowledge about the one way latencies (beyond the fact that they aren’t negative), it’s impossible for the error bars on synchronized times to be reduced below the round trip time.

Luckily, in the real world, there are reasonable assumptions about latencies that can be used to reduce the amount of error. Especially if you’re querying multiple servers multiple times. On the other hand, these assumptions can be broken. For example, a malicious router might add one second of delay only to outgoing packets and thus skew the synchronized time by a half second. On the other other hand, the “malicious” router is technically fixing the clock skew via the same asymmetric delay that causes it, so… not a particularly terrifying scenario.

Bigger Networks

Things get more complicated when there are more participants in the network. In a network with three participants (A, B and C) there are six one-way latencies: A to B, B to A, B to C, C to B, A to C, and C to A. These one way latencies determine the round trip times of the five possible cycles: A to B to A, A to C to A, B to C to B, A to B to C to A, and A to C to B to A. Of these five, only four are actually necessary to measure (the round trip time of the fifth one can be computed in terms of the other round trip times). Other measurable values include the difference-between-two-paths times, such as A to C vs A to B to C, but these can be derived from the round trip times.

So, the space of measurable values has four dimensions, but the underlying space of possible one-way latencies has six dimensions. Six variables and four constraints. That leaves two degrees of freedom. Just what we need to give each node an arbitrary unobservable clock skew:

This ‘extra degrees of freedom’ problem happens in networks of any size. The delays in a network with participants are parametrized by one-way delays. Round trip and path-difference times based on these delays can be observed, but the many different cycles and pairs of paths do not all give orthogonal results. We’re always left with degrees of freedom, corresponding to unknown relative clock skews. The case with 2 participants, with a single clock skew between the two, is just the simplest non-trivial case.

In a sense, the inability to measure clock skews is more a fundamental problem than the inability to measure one-way latencies. Each relative clock skew corresponds exactly to one of the dimensions of the unmeasurable space. The uncertainties that prevent us from measuring the one-way latencies all reduce to uncertainties in how well clocks can be synchronized. Clock synchronization is hard.

Variation

One thing that makes it a bit clearer that it’s the clock skews that are fundamentally unobservable, as opposed to the one-way latencies, is that we can measure variations in one-way latencies.

In practice, latencies are not constant. Different packets may take different routes, or be queued for different amounts of time, creating small variations. Also, these variations may not be symmetric. For example, your roommate might saturate your upload bandwidth, creating intermittent delays on your outgoing packets, without saturating your download bandwidth. These variations have unique observable artifacts, typically in the form of stuttering, that depend on the direction.

When your roommate saturates your upload bandwidth, and you’re making a voip call, then data sent to you will arrive at a steady rate but data sent by you will leave in bursts. You won’t directly notice any problem, but the person you are talking to will hear audio cutting in and out. The variations have observable consequences, depending on their direction.

In the following diagram, two peers (A and B) can determine that it is the A to B path that is varying by measuring the difference in arrival times between periodic ticks:

Given the assumption that the magnitude of the variation was related to the magnitude of the one-way latency, we could estimate the mean of the one-way latency based on the sampled variation. Without this assumption, the variation gives no information about the means of the one-way latencies, which is what we’d need in order to do clock synchronization.

The difficulty of clock synchronization, combined with the fact that latencies can vary, is why distributed algorithms use concepts like Lamport timestamps and vector clocks.

Summary

One-way latencies are inherently difficult to measure, due to the difficulty of synchronizing clocks. This actually makes it easier to think about many networked protocols, because you can imagine the latency wherever it is most convenient (as long as you don’t change any round trip times).

—

Discuss on Hacker News, Reddit

—



Twisted Oak Studios offers consulting and development on high-tech interactive projects. Check out our portfolio, or Give us a shout if you have anything you think some really rad engineers should help you with.



Archive