Recently I found some libevent benchmarks. To me they show terrifying results. The blood-freezing fact is that the more connections you have, the more it costs to add a new connection to the asynchronous loop.

It means that if you have one connection registered with the asynchronous loop, the cost of registering a callback is small; but when you have a million idle connections, the cost of registering is big.

It would be horrible if a similar situation appeared at the kernel level. It wouldn't be as bad as in this benchmark, but still, as a programmer I'd expect creating and destroying connections to always take constant time.

How do asynchronous libraries work?

The basic chunk of data passed to the async lib is a tuple of three items:

event = (socket_descriptor, callback_function, timeout)

This structure is often referred to as an event. When a programmer registers such a structure with the asynchronous loop, he expects his callback to be executed when something happens on the descriptor, or when the timeout occurs. That's what asynchronous programming is all about.

An asynchronous library offers a way of adding and removing such events to and from its main data structure: a priority queue. This queue is sorted by timeouts, so the async library always knows which event will time out first.
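For illustration, here's a minimal sketch of such a priority queue in Python, built on the standard heapq module (the Event class and helper names are mine, not any particular library's API):

import heapq
import time

class Event:
    def __init__(self, fd, callback, timeout):
        self.fd = fd                    # socket descriptor
        self.callback = callback        # function to call
        self.deadline = time.monotonic() + timeout

    def __lt__(self, other):            # let heapq order events by deadline
        return self.deadline < other.deadline

events = []                             # the priority queue

def add_event(fd, callback, timeout):
    heapq.heappush(events, Event(fd, callback, timeout))

def find_event_with_smallest_timeout():
    return events[0]                    # the heap root times out first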

The asynchronous loop waits for something to happen on the descriptors, or for a timeout. When something happens, it executes the necessary callbacks. See this pseudocode:

the_main_loop:
    nearest_event = find_event_with_smallest_timeout()
    events_triggered = select(<fds...>, nearest_event.timeout)
    if events_triggered:  # has anything happened on the wire?
        for event in events_triggered:
            remove_from_datastructure(event)
            execute_callback(event, timeouted=False)
    else:  # nothing happened, event timeouted
        remove_from_datastructure(nearest_event)
        execute_callback(nearest_event, timeouted=True)
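Translated into runnable Python on top of the heapq sketch above, the loop might look roughly like this (a simplification; in particular, removing an arbitrary event from the middle of a heap is not really this easy, as discussed below):

import heapq
import select
import time

def the_main_loop():
    while events:
        nearest = find_event_with_smallest_timeout()
        wait = max(0.0, nearest.deadline - time.monotonic())
        readable, _, _ = select.select([e.fd for e in events], [], [], wait)
        if readable:                      # has anything happened on the wire?
            for event in [e for e in events if e.fd in readable]:
                events.remove(event)      # O(N) here; see the heap costs below
                event.callback(event.fd, timeouted=False)
            heapq.heapify(events)         # restore the heap invariant
        else:                             # nothing happened, nearest event timed out
            heapq.heappop(events)
            nearest.callback(nearest.fd, timeouted=True)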

In 95% of cases the body of the programmer's callback function looks like this:

def my_callback(fd):
    <read response from descriptor>
    <write new request to descriptor>
    add_to_event_loop(fd, my_callback)
    return

When a timeout occurs, in most cases the callback just cleans up the memory. But timeouts don't happen often.

This pseudocode means that:

when anything happens on a socket, the event structure is always removed from the priority queue

if a socket becomes readable, there's a high probability that the next event will be registered for that socket

Putting these conclusions back into the pseudocode:

the_main_loop:
    # less often
    nearest_event = find_event_with_smallest_timeout()
    events_triggered = select(<fds...>, nearest_event.timeout)
    if events_triggered:
        for event in events_triggered:
            # very often, we have data on the socket!
            remove_from_datastructure(event)
            execute_callback(event, timeouted=False)
    else:
        # timeout happens rather rarely
        remove_from_datastructure(nearest_event)
        execute_callback(nearest_event, timeouted=True)

The problem

The problem is that asynchronous libraries currently tend to use a binary heap as the representation of the internal priority queue. This means that the cost of registering and removing an item (event) is logarithmic in the number of already registered events.

binary heap            cost        frequency in my use case
add item               O(log(N))   very often
remove item            O(log(N))   very often
event loop iteration   O(1)        less often

We can clearly see that the data structure was optimized for the smallest possible time per event loop iteration, while I think it should be optimized to reduce the cost of adding and removing events.
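In fact, with a plain binary heap "remove item" is even worse than the table suggests: there is no cheap way to delete an arbitrary element, so implementations typically fall back to an O(N) search or to lazy deletion. Here's a sketch of the lazy-deletion workaround (my own illustrative code, not libevent's):

import heapq

heap = []
alive = {}                           # fd -> heap entry

def add_timer(fd, deadline):
    entry = [deadline, fd, True]     # the last field marks the entry as live
    alive[fd] = entry
    heapq.heappush(heap, entry)      # O(log N)

def remove_timer(fd):
    alive.pop(fd)[2] = False         # O(1), but the dead entry stays in the heap

def pop_nearest():
    while heap:                      # dead entries are skipped only once they
        deadline, fd, live = heapq.heappop(heap)   # reach the top of the heap
        if live:
            return deadline, fd
    return None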

Let me put the same thing in different words. In the real world, in most cases something happens on a socket before the timeout occurs. Connections very rarely expire because of a timeout.

Considering this, it's reasonable to use a data structure with O(1) cost for adding and removing items, and to accept a somewhat higher cost inside the loop.

As you can imagine, others have faced this problem before me. The data structure that can manage timers with a better average cost is called Timer Wheels. It is used inside the Linux kernel to manage operating system timers. It's described in this paper from 1996.

The timer wheels algorithm assumes that timers are checked at a fixed interval, on every tick. For the kernel this was obvious: timeouts were checked on every kernel tick, usually every 10ms. On the other hand, waking up an asynchronous library every 10ms would be bad.

I believe it's possible to avoid waking up on every tick. To achieve this, when we look for the smallest timeout we scan through all the buckets and find the first one that's not empty. The cost of this is constant and depends only on the number of buckets.

I suggest using buckets of 1ms. If you need granularity finer than that, no asynchronous library can help you anyway.
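To make the idea concrete, here's a minimal single-level timer wheel sketch with 1ms buckets (my own simplification of the hashed timer wheel from the paper; timeouts longer than the wheel's span would wrap around, which is where the long-term structure discussed next comes in):

import time

MS = 0.001
NUM_BUCKETS = 1024                   # the wheel covers NUM_BUCKETS * 1ms of future

class TimerWheel:
    def __init__(self):
        self.start = time.monotonic()
        self.buckets = [set() for _ in range(NUM_BUCKETS)]

    def _bucket(self, deadline):
        tick = int((deadline - self.start) / MS)
        return self.buckets[tick % NUM_BUCKETS]

    def add(self, event, timeout):   # O(1)
        event.deadline = time.monotonic() + timeout
        self._bucket(event.deadline).add(event)

    def remove(self, event):         # O(1)
        self._bucket(event.deadline).discard(event)

    def nearest_timeout(self):
        # O(K): scan the buckets for the first non-empty one
        # (assumes already-expired buckets have been drained)
        now_tick = int((time.monotonic() - self.start) / MS)
        for i in range(NUM_BUCKETS):
            if self.buckets[(now_tick + i) % NUM_BUCKETS]:
                return i * MS        # how long the loop may sleep
        return None                  # the wheel is empty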

Another issue is how to handle events so far in the future that they don't fit into the timer wheel buckets. A reasonable compromise is to keep them in a binary tree (or heap). In that case, the cost of manipulating those timers would be logarithmic.

                       timer wheels           unlimited time scale
                       (limited time scale)   timeouts fit in        timeouts far in
                                              timer wheels           the future
                                              (optimistic)           (pessimistic)
add item               O(1)                   O(1)                   O(log(N))
remove item            O(1)                   O(1)                   O(log(N))
event loop iteration   O(K)                   O(K)                   O(K)

(K is the number of buckets in all timer wheels. This is a constant factor.)

This diagram shows the structure I’m suggesting.
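In code, the combined structure could look roughly like this (building on the TimerWheel sketch above; HybridTimers and migrate are my names for illustration):

import heapq
import itertools
import time

SPAN = NUM_BUCKETS * MS              # how far into the future the wheel reaches
counter = itertools.count()          # tie-breaker for equal deadlines

class HybridTimers:
    def __init__(self):
        self.wheel = TimerWheel()    # near timeouts: O(1) add/remove
        self.far = []                # far timeouts: a heap, O(log N)

    def add(self, event, timeout):
        if timeout < SPAN:           # the optimistic, common case
            self.wheel.add(event, timeout)
        else:                        # the pessimistic case
            deadline = time.monotonic() + timeout
            heapq.heappush(self.far, (deadline, next(counter), event))

    def migrate(self):
        # run regularly: move events that now fit into the wheel
        now = time.monotonic()
        while self.far and self.far[0][0] - now < SPAN:
            deadline, _, event = heapq.heappop(self.far)
            self.wheel.add(event, deadline - now)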

There are a few disadvantages to this approach:

The library must wake up regularly to check whether any events on the long-term heap should be moved into the wheels. In the case of my diagram that would be every 33.8 seconds, but of course you can add another level of timer wheels to extend this period.

If there's a single event in the third timer wheel, the library will need to wake up three times to dispatch it.

The granularity is fixed at 1ms.

The method I'm suggesting is quite complicated, but in my opinion having a constant O(1) cost of managing events is worth it. Unfortunately, for events far in the future the cost will always be logarithmic; in the end, we must sort the timers somewhere.

You may ask who needs such improvements; not everybody has thousands of connections. I agree that most users don't care. But there are some companies that could save money on this improvement, like Facebook, which apparently uses libevent a lot…