Hashed and Hierarchical Timing Wheels



Today’s paper was written in 1987 by George Varghese and Tony Lauck from Digital Equipment Corporation (!), and has withstood the test of time. It’s about how to efficiently implement a timer facility that allows you to start a fixed length timer and perform some action once it has expired. The ability to set timeouts on requests or operations is almost a given, so being able to handle these timers efficiently is quite important.

The paper outlines 7 different schemes for implementing timers and provides a thorough analysis of tradeoffs for each one. What’s great about the paper is that it manages to provide its own context by creating a framework for evaluating timers, and concisely summarizing existing timer implementations. Then it continues on to provide descriptions of its own unique schemes, evaluating them within the framework.

Timer Evaluation Framework

The interface provided by the timer facility is as follows:

START_TIMER(Interval, Timer_ID, Expiry_Action) Start a timer lasting Interval, identified by Timer_ID, and when expired, perform Expiry_Action STOP_TIMER(Timer_ID) Stop the timer identified by Timer_ID, and clean it up

Simply we want to be able to start a timer and possibly stop it in the case we’re no longer interested in it. Now for all the schemes we will assume we have some source of clock ticks, probably hardware, but this could also be some other software interface. Whatever resolution these clock ticks are (every second, every millisecond, or every microsecond) are the “atomic”, individisble clock ticks we will construct our timer from. For our calculations we will assume our granularity is T clock ticks.

Then to evaluate our timer implementation we will also consider two more operations:

PER_TICK_BOOKKEEPING For every T clock ticks we will have to check for expired timers. This is how much work we have to do when we check for expired timers. EXPIRY_PROCESSING This is the routine that does the Expiry_Action from the START_TIMER call.

Scheme 1 - Straightforward

The first approach is the most straightforward. START_TIMER allocates a timer struct that stores the expiry action and the interval of the timer. PER_TICK_BOOKKEEPING goes through all the timers and decrements the interval by one tick, and if the timer has counted down to zero, it calls the expiry action.

It’s fast for all operations except PER_TICK_BOOKKEEPING and to quote the paper is only appropriate if

there are only a few outstanding timers.

most timers are stopped within a few ticks of the clock.

PER_TICK_BOOKKEEPING is done with suitable performance by special-purpose hardware.

Scheme 2 - Ordered List / Timer Queues

Instead of having PER_TICK_BOOKKEEPING operate on every timer, this scheme has it operate on just one. Scheme 2 stores the absolute expiry time for timers instead of how much remaining time it has less. It then keeps the timers in a sorted list, ordered by their expiry time, with the lowest in the front, so the head of the list is the timer that will expire first.

Now START_TIMER has to perform possibly O(n) work to insert the timer into the sorted list, but PER_TICK_BOOKKEEPING is vastly improved. On each tick, the timer only increments the current time, and compares it with the head of the list. If the timer is expired, it calls EXPIRY_PROCESSING and removes it from the list, continuing until the head of the list is an unexpired timer. This is the best performance we can get. We only do constant work per tick, and then the necessary work when timers are expired.

Scheme 3 - Tree-based Algorithms

The paper points out that there is a great deal of similarity between trying to manage which timer has the smallest expiration time and sorting. The difference between timers and classical sorting is that we don’t know all of the elements to start with, as more will come at some later point. The authors call this modified version, “dyanmic sorting.”

Using that lens, we can view the ordered list in Scheme 2 as simply insertion sort, which takes O(n) work per item. However this isn’t the best and we can do better, usually via quicksort or merge sort. Quicksort doesn’t necessarily translate well to the problem, but the authors suggest using a balanced binary search tree in place of a list and using that to keep track of the timers. In this case our START_TIMER costs only O(log(n)) and everything else is the same.

Timing Wheels

The paper goes on a bit of a tangent to describe how Discrete Event Simulation handles the need to process scheduled events in order. To quote the paper:

In the simulation of digital circuits, it is often sufficient to consider event scheduling at time instants that are multiples of the clock interval, say c. Then, after the program processes an event, it increments the clock variable by c until it finds any outstanding events at the current time. It then executes the event(s).

The upshot of this, is that researchers had already developed a common way to handle this problem called a timing wheel.

The idea is that given we are at time t we’ll have a timing wheel that consists of an array of lists. The array is of size N and each index i in the array holds a list of timers that will expire at time t + i . This allows us to schedule events up to N clock ticks away. For events beyond that, the timing wheel also has an overflow list of timers that expire at a time later than t + N - 1 .

START_TIMER will add the timer either to the appropriate list in the array, or into the overflow list. In the common case, PER_TICK_BOOKKEEPING increments the current time index, checks the current time index in the array for expired timers and performs the expiry action as necessary. Every N clock ticks, PER_TICK_BOOKKEEPING will need to reset the current time index to 0 and “rotate” the timing wheel, moving items from the overflow list into the appropriate list in the array.

The downside to this approach is that as we approach N clock ticks, right before we’ll need to perform a rotation, it’s more and more likely that timers will be added to the overflow list. One known solution at the time was to rotate the timing wheel half way through, every N/2 ticks. This reduces the severity of the problem, but doesn’t quite do away with it.

Scheme 4 - Basic Scheme for Timer Intervals within a Specified Range

The authors propose Scheme 4, a nice solution to the problem given extra constraints. If we know that we’ll only have to handle timers for periods less than some max interval ( N ), then we can perform the following optimization.

We create a timing wheel consisting of an array of N slots. The array is a circular buffer indexed by the current time i by i mod N . START_TIMER for a timer with interval j gets placed into the list located at index (i + j) mod N . PER_TICK_BOOKKEEPING only has to increment the current time i and inspect the list at index i mod N and perform expiry actions as necessary. This is ideal in that START_TIMER does only O(1) work and PER_TICK_BOOKKEEPING does only O(1) extra work, besides the necessary work of handling expiry actions of expired timers.

The downside to this solution is that we can only provide timers of a max interval N . And if we decide to grow N , then we in turn must grow the size of our timing wheel, meaning that if our clock resolution is one millisecond, and we wish to have a max interval of one hour, we’ll need a 3.6 million element timing wheel. This can quickly become prohibitively expensive in terms of memory usage.

Scheme 5 - Hash Table with Sorted Lists in each Bucket

The authors propose Scheme 5 to help solve both problems. The authors first write:

The previous scheme has an obvious analogy to inserting an element in an array using the element value as an index. If there is insufficient memory, we can hash the element value to yield an index.

This is a little unclear to me, but I think the intuition is: given that we have a fixed amount of memory to store these timers, instead of having a memory slot for each possible timer expiration date, we can have each slot represent many possible expiration dates. Then to figure out which expiration dates correspond to which slots, we use a hash function to hopefully evenly distribute our expiration dates over all slots.

This is in fact what Scheme 5 does. The authors advise using a power of 2 array size, and a bit mask. The lower bits, lb , index into the array at point (i + lb) mod N where i is the current time and N is the array size. Then the higher order bits are stored into the list at that index. Concretely the authors write:

In Figure 9, let the table size be 256 and the timer be a 32 bit timer. The remainder on division is the last 8 bits. Let the value of the last 8 bits be 20. Then the timer index is 10 (Curent Time Pointer) + 20 (remainder) = 30. The 24 high order bits are then inserted into a list that is pointed to by the 30th element.

By storing the high order bits as well, we can now handle timers that are greater than MaxInterval in period as well. The high order bits will tell us on which rotation around does the timer actually expire. So for Scheme 5:

START_TIMER will take the low order bits and use it to index into the array, and put the high order bits into a sorted list at that index, with the lowest value at the head. The worst case running time is O(n) where n is the number of timers, if they all manage to fall into the same slot.

STOP_TIMER can use a pointer to the timer in the list returned from START_TIMER , and if the list is doubly linked, simply delete it in constant O(1) time.

PER_TICK_BOOKKEEPING increments the current time, checks the list at the current index, and sees if it is equal to the high order bits of the current time. If it is, then the timer is expired, and it performs its expiry action, continuing on through the list in this fashion. Here again, we only perform constant work besides the necessary work of expiring timers.

The downside of this scheme of course is that in some cases START_TIMER can be quite slow. This leads us to Scheme 6.

Scheme 6: Hash Table with Unsorted Lists in each Bucket

Scheme 6 is exactly the same as Scheme 5, but with each slot holding an unsorted list. This makes START_TIMER take constant time, but the tradeoff is that now we have to do extra work in PER_TICK_BOOKKEEPING . On each tick we have to go through all the elements in the list and compare them to the current time (or alternatively decrement each item in the list by 1 each iteration, and expire when it reaches 0).

This makes the worst case scenario for PER_TICK_BOOKKEEPING O(n) as well, which is quite bad if that’s actually what occurred each tick. Instead we really amortize the work that PER_TICK_BOOKKEEPING is doing every tick over all TableSize ticks. So on average we’re doing n / TableSize work each tick. Then it becomes a question of how many outstanding timers do we have and what is our table size. If n / TableSize < 1 then we’re doing constant work on average for each tick. If we configure appropriately we should be able to achieve good performance.

A third option that occurs to me is that you could use a balanced binary search tree instead of a list for each bucket. However it’s unclear if there are real benefits there given cache effects, but it is possible you could squeeze more performance if you were already bumping up against the maximum table size you could reasonably allow.

Scheme 7: Exploiting Hierarchy

The final scheme takes advantage of the hierarchical nature of time. When we ourselves keep time, we don’t count in just seconds, we use minutes, hours, days, weeks, etc. The idea here is the same, instead of just having one timer with one resolution, instead we could have multiple timers each with its own resolution. The example the authors give is:

A 100 element array in which each element represents a day

A 24 element array in which each element represents an hour

A 60 element array in which each element represents a minute

A 60 element array in which each element represents a second Thus instead of 100 * 24 * 60 * 60 = 8.64 million locations to store timers up to 100 days, we need only 100 + 24 + 60 + 60 = 244 locations.

The way it works is that every second we process one tick for the second timing wheel. Once we’ve made one full revolution of the second timing wheel, we process one tick of the minute wheel. Once we’ve made one full revolution of the minute timing wheel, we process one tick for the hour timing wheel, and so on. Each timing wheel uses Scheme 4, which was the scheme with a fixed size timing wheel that has some MaxInterval . But by having more of them in this hierarchical layout, we can support a much greater range of timer periods with much less memory than we could with a fixed resolution timer.

How the scheme works is that the lowest resolution timing wheel behaves exactly the same as before, performing the expiry action of the timers at the current time index. However, the timing wheels higher up in the hierarchy behave differently when finding timers at the current index. We cannot simply perform the action, since we might have scheduled a timer for 1 minute and 30 seconds in the future, so expiring it after a minute isn’t quite right.

Instead each timer stores its full resolution, or remaining resolution, as well. So if we went from minute 0 to minute 1, and discovered our timer for 1 minute and 30 seconds, we would schedule the timer in the seconds timing-wheel below us in the hierarchy with its remaining time of 30 seconds.

In general, PER_TICK_BOOKKEEPING increments the time index in the lowest resolution following Scheme 4. If we have performed a full revolution, then we process a tick for the timing wheel one level above us. In the higher level timing wheels, if we encounter a timer at the current index, then we schedule it with the remaining time in the timing wheel one level below us.

START_TIMER might have to inspect all the timing wheels we use to discover where to schedule the timer, but the number of levels m should be small and constant. Hence START_TIMER should take constant time.

PER_TICK_BOOKKEEPING might have to spend a good deal of time migrating between the various hierarchies if our timers are generally for a long period. The authors provide a more rigorous framework for evaluating the performance, but the basic tradeoff is that if you have longer lived timers and wish to use less space, then Scheme 7 performs better than Scheme 6. The intuition being that you don’t have as many unexpired timers clogging up your timing wheels as you do in Scheme 6.

The authors also note another possible optimization discovered by Wick Nichols :

Wick Nichols has pointed out that if the timer precision is allowed to decrease with increasing levels in the hierarchy, then we need not migrate timers between levels.

The idea being that if we’re okay with having a 3 hour and 30 minute timer expiring after 3 hours, we don’t have to do any expensive migrating between timing wheels. Or perhaps we could only migrate a fixed number of times, meaning a 3 hour, 30 minute, and 5 second timer, might go off after 3 hours and 30 minutes.

Conclusion

The paper is a great read and quite easy to understand. I love how it gives such a complete background on the subject and is nicely self contained. Some quick googling doesn’t yield any obvious published improvements on the methods presented here, and it’s still used today for all sorts of useful software like Kafka and Facebook’s folly library.

Sometimes I think of small papers like this as the pinnacle of technical achievement. Tackling a tough, small problem and finding near optimal solutions for future generations to build on. It’s almost like reading a short story, with a satisfying ending where the good guys win and the bad guys of complexity lose. Wish there were more stories like that.