Transcript

Imagine it's the year 2000 and you're trying to build a software load balancer. You want to handle tens of thousands of concurrent connections per second, you want to max out the network interface card, and you want to do all that with minimal latency. The only problem is that you and your users have hardware something like this. Well, I'm exaggerating, but this was the time of Pentium III CPUs, which ran at around 1 GHz and with only a single CPU core. Willy Tarreau is the author of EBtrees, and he ran an even slower machine. He's also the primary author, a prolific contributor, and the maintainer of the HAProxy open source project for the past 18 years.

Willy had a problem. If we imagine this highway intersection to be HAProxy, we can think of the cars on it as connections going through their various stages of processing. Our single CPU means that we only have one engine for all of these vehicles. Our challenge then is to somehow magically share this one single engine amongst all of them in such a way that everybody travels at their preferred speed and that nobody notices that there's only one engine.

One way to do it is to process each connection for a short period of time and then move on to the connection that most urgently needs to be processed next. This is not a new concept; it has since been popularized under terms such as green threading or cooperative threading, and you can find examples of it in Python's gevent or in Go's goroutines. Writing this type of software is interesting in and of itself, but before we can even do that, we need to be able to wake up tasks and find out which one needs to be processed first. We need a timer and a scheduler. A scheduler implementation can be simple, and probably should be, but it has to satisfy a specific set of demands. In this talk, I'll be talking about what makes the HAProxy scheduler so fast, and in turn, what makes HAProxy the fastest software load balancer in the industry.

My name is Andjelko, and I'm a director of engineering at HAProxy. My team and I work on making sure that HAProxy remains the fastest and most widely used software load balancer. We work on integrating HAProxy with current and emerging platforms and technologies in order to fill our users' needs.

Also, in this talk, we'll be showing that if we take a concept, an idea, think outside the box, and implement it with the practical constraints of our software's environment in mind, we can make a big impact with it. In our example, trees are a fundamental and, therefore, ubiquitous data structure. We can find them virtually anywhere, in any software library. They're in file systems: ext4, HFS+, NTFS, virtually any modern file system. We can find them in higher-level languages as a way to implement more complex data structures. They're in databases such as MySQL, and also in the Linux kernel itself, where they power the Completely Fair Scheduler, the kernel's default process scheduler. Even though trees are used everywhere, if we combine algorithmic innovation with paying attention to memory allocation and CPU cache utilization, we can make an implementation that's better suited to our needs than any other.

High performance is also a way to protect our load balancer: since it can be exposed to malicious traffic, a slow load balancer becomes a vector for attack.

The EBtree that we end up with has fast descent and search capability, and it's memory efficient. It allows us, by virtue of its makeup, to look up keys by mask or prefix. It's optimized for inserts and deletes, and it's great with bit-addressable data.

We will go about this talk by first exploring the problem space, taking a look at the scheduling requirements of our use case and the solutions that might be used to meet them. Then we'll take a look at what makes EBtree so special, in two parts: one is the design and the other is the implementation. Finally, we'll take a look at how we make use of EBtrees in the HAProxy scheduler, and then we'll conclude.

Scheduling Requirements

In order to solve the problem of a number of tasks needing to run on a single thread while taking care of network connections at the same time, an event loop seems like an ideal design choice. The HAProxy event loop needs to handle the network connections, and the tasks it handles can be grouped into two categories: the tasks that need to run right now, and the tasks that are suspended, that are chilling out, that we want to look up and check on later.

Conceptually, we can visualize it like this. Our event loop is going to go through all three stages in turn and then loop back around. For the connection part we have a number of polling mechanisms; in Linux, we have select, poll, and epoll, so we're already in a pretty good spot there. However, with our tasks, we're on our own; we need to implement something that can take care of our needs.

What is a HAProxy task? At a basic level, we can define it by only two things: the expiry time, when we need the task to run, and a bunch of code that we want to run at that time. As for our scheduler, what do we want it to do? We need to have two groups of tasks, active ones and suspended ones. The tasks consist of a key with a timestamp and some code attached to it, and we're going to be inserting a lot of them. We also need to allow for duplicates to be inserted, because it's likely that we will end up with tasks that want to be executed at the same moment in time.
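To make this concrete, here is a minimal sketch of what such a task could look like in C. The struct and field names (`sched_task`, `expire`, `process`, `nice`) are illustrative, not HAProxy's actual definitions:

```c
#include <assert.h> /* for the usage assertions */
#include <stdint.h>

/* Hypothetical sketch of a scheduler task: an expiry timestamp (when it
 * should run) plus a function pointer to the code to run at that time. */
struct sched_task {
    uint64_t expire;                       /* expiry time, e.g. ms since start */
    void (*process)(struct sched_task *);  /* code to run at that time */
    int nice;                              /* priority: lower runs sooner */
};

/* Ordering rule the scheduler needs: earlier expiry first, nice breaks ties. */
static int task_before(const struct sched_task *a, const struct sched_task *b)
{
    if (a->expire != b->expire)
        return a->expire < b->expire;
    return a->nice < b->nice;
}
```

Two tasks with the same expiry (duplicates) are resolved by the priority field, matching the requirements above.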

After we've done some inserting, we'll want to read those tasks back, but we want them sorted: we want to know which ones need to run first. As we read them back, we're going to be deleting them from that particular group, because after we unsuspend a task, after we proclaim it active, we no longer need it in that group. Lastly, it would be nice to have priorities, so that if multiple tasks need to run at the same time, some tasks can be prioritized over others and run sooner.

Our handling of tasks is going to look something like this. We're going to have a number of tasks in the run queue, we're going to be processing them. As we process each one of them, we're going to move it to the wait queue. After we process a number of tasks, we can take a look at the wait queue and pull up some tasks from it to be promoted to the run queue. Then rinse and repeat.

What about our scheduling environment? Where will our scheduler run? Since we're building a software load balancer, it's going to run during slow periods of network traffic, but also during very high, probably sustained, periods of network traffic. We can expect a high frequency of events. Then, for each network connection, we might have multiple processing rules or conditions associated with it. This compounds the number of rules that we need to insert and the number of tasks that we need to keep track of, and that number can grow quite huge. We are also going to be doing frequent lookups, because we're going to be checking the list of suspended tasks to see, "Do we need to run something?" and that's going to happen quite a lot.

What are some of the desirable qualities we can think of? We want the scheduler to be fast. Even more than that, we want it to be predictable, because we don't want a sudden slowdown when no external change has happened, just because our scheduler has run up against some internal limitation. If we combine the desire for speed with the desire for simplicity and the desire for predictability, we end up needing something that's simple enough to be well understood.

Candidate Solutions

Let's take a look at some potential solutions to our problem, starting with the basic data structures that are commonly used in programming, since those have the highest chance of satisfying the demands that we just put forth.

One of the most basic data structures is the array, but the only thing it has going for it in our use case is that we can iterate through it quite fast. We can check the list of tasks that need to be woken up quickly, but for inserts, deletes, sorting, and searching, it doesn't really do much for us. Then we have the linked list, which is a good improvement on the array, because not only can we iterate through it quickly, we can also insert into it quickly and delete from it quickly. We should keep that one in mind.

Some of the other basic data structures, like the stack and the queue, are not going to be interesting for us, because we can't reorder them; that's part of their definition. A hash map, another basic data structure, has fast inserts and fast deletes and can search really quickly, but doesn't do anything for sorting. Lastly, and this was a spoiler earlier, trees are a good thing to look at.

We isolated the linked list as a potential solution to our problem. Why? We usually handle a linked list by referencing a pointer at its head. If the entry that we're interested in, the entry that we want to extract and delete, is at the start, this gives us constant-time access and deletion. If we keep another pointer at the end of the list, and we always insert at the end, that gives us constant-time insertion. For our tasks, if we can rely on all the tasks having the same expiry time, then the linked list is a perfect solution. In fact, originally, HAProxy used the linked list as the basis for its task scheduler. It's not the only one to do so: the Linux scheduler used a linked list as the basis of its implementation before the Completely Fair Scheduler was introduced.
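A minimal sketch of that idea, with illustrative names: a linked queue with head and tail pointers gives O(1) insertion at the tail and O(1) extraction at the head, which is all we need while every task shares the same expiry time, because insertion order is then already expiry order:

```c
#include <assert.h> /* for the usage assertions */
#include <stddef.h>

/* Singly linked queue with head and tail pointers. */
struct qnode {
    struct qnode *next;
};

struct queue {
    struct qnode *head, *tail;
};

/* O(1): append at the tail. */
static void q_push_tail(struct queue *q, struct qnode *n)
{
    n->next = NULL;
    if (q->tail)
        q->tail->next = n;
    else
        q->head = n;          /* queue was empty */
    q->tail = n;
}

/* O(1): extract from the head, or NULL if empty. */
static struct qnode *q_pop_head(struct queue *q)
{
    struct qnode *n = q->head;
    if (n) {
        q->head = n->next;
        if (!q->head)
            q->tail = NULL;   /* queue became empty */
    }
    return n;
}
```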

Why are we not currently using the linked list as the basis of the scheduler implementation in HAProxy? The reason is simple: it's not realistic anymore to have all tasks share the same expiry time. As soon as we introduce tasks with differing expiry times, inserting tasks means we will probably need to sort the linked list. Adding to the end is no longer a solution to our problem, because then, after every insert, we need to sort the list, and that takes additional time, so the linked list is not an ideal solution after all.

A tree, on the other hand, is a data structure that, by definition, keeps the data sorted, which is a good thing to start with. It does fast searches as well: if we need to find a single task that, for some reason, needs to be terminated or promoted somewhere else, it's easy to do that. However, depending on the implementation, trees can have more complicated insert and delete procedures. Why is that? If we take a basic binary search tree, with no optimizations for its insert and delete procedures, and we insert data that's coming in pre-sorted, we're going to end up with a linked list. All the data will be inserted into the same branch of the tree, and basically, we gained nothing. To prevent that, over time, a number of tree variants have been devised as a class of self-balancing trees that perform tree rotations or similar operations after inserts or deletes in order to minimize the difference between the longest branch of the tree and the shortest. This is a good thing, but it can get complex.

Here, we have an animation from Wikipedia showing a number of different operations, tree rotations, for one of the self-balancing trees, the AVL tree, and what they might look like in practice. One of these is going to happen on every insert, and there's a number of different ones. After we delete an entry, the balance of the tree changes, so we have to do some rearrangement. Different self-balancing tree implementations, like the red-black tree, have the same problem.

There's one type of tree that we haven't considered yet, and that's the prefix tree. It has a completely different set of properties from the trees we've discussed up to now, because the prefix tree introduces a distinction between the nodes that connect the tree and the last row of nodes, which are the leaves. They are used differently: when searching and traversing the tree, the nodes are used to compare only parts of the key, while the full key is obtained only after reaching a leaf. Here, we have an example with a number of words and the characters they're made of. The common parent in a branch of the tree contains the common prefix, while the differing parts of the keys are stored on either side below it.

This is interesting because this type of tree has its structure defined by the makeup of the data that's stored in it. We can't rotate this tree: if we start rotating stuff, we're going to mess up the ordering of the key components, and we no longer have the same data that we inserted into it. There's a good side to this as well: because the tree is not rearrangeable, if we want to delete an element, we just remove it from the tree, and nothing else should or can change.

While inserts require a few more operations on the tree, deletes are very simple. Since in every step of the descent down the tree we only compare the part of the key corresponding to the position of the node we're looking at, we can compare very long keys much faster than other tree variants can. In other tree variants, every node contains the entire key, while here, we only look at the part of the key relevant to that particular node. It's also easy to do prefix matching, because all the keys with the same prefix are elements of the same subtree; we'll find them under the same node. That's the good part, but there are some downsides as well.

The nodes and the leaves are treated differently, and this makes our life harder, because now we need different algorithms and different code to treat nodes and leaves. It complicates memory management as well, because we need to keep track of which nodes are still used after we delete a leaf, then garbage collect them or do something with them afterward. It makes for a more complicated codebase.

The other downside is that the tree is not balanced; that's part of its definition. If it's not balanced, then we can run into the same problem we talked about earlier: depending on the makeup of the data we're inserting, we can end up with an unbalanced tree, whereas a self-balancing tree would end up just fine, balanced.

EBtree Design

That's where we were, and this is where Willy found himself, so let's see what happened. As I already hinted, the initial implementation of the scheduler in HAProxy was based on linked lists, but that no longer worked after introducing different timeouts for tasks. The community contributed an implementation of a self-balancing tree, in this case a red-black tree, but this just brought the problems we associate with self-balancing trees: more complex insert and delete operations. That means we lost a bunch of speed, because we're doing inserts and deletes quite frequently.

Willy tried to solve this problem with a simple implementation of a prefix tree, only to discover the downsides we just mentioned: more complex memory management and the need to treat nodes and leaves differently, resulting in an increase in memory allocations. At the rates we're talking about, this makes a difference.

What can we do about it? Constant-time deletion is something we can latch onto; it's an inherent quality of prefix trees. If we can just simplify memory management and reduce the impact of potentially having the tree unbalanced, then we might have a solution for our use case. Let's take a look at the structure and makeup of the basic elements of a prefix tree. We have nodes, here in green, which are pointed to by one parent and which in turn have two pointers to their children. Those children can be either leaves or other nodes. Then we have the leaves, which contain the full data of the key and which are pointed to by one node. From this description, we can encode it into a diagram of a node with a left and a right link.

In order to traverse the tree more efficiently, without needing to resort to recursion or remembering where we were while traversing, we can add a node parent pointer. It'll just point us back to where we came from. For the leaf, we can have just the key data, and for convenience's sake, we'll add a parent pointer to the leaf as well, so that we can move upward from it faster. There's one additional element here, which is the bit, and that's the information we'll use to find how long the common prefix for a particular node is, that is, where in the key its two branches start to differ. I'll show that in more detail in just a bit.

We had this as our example of a prefix tree, and the nodes and leaves we just defined would end up in this arrangement. That's the base of the prefix tree that we're working with. How do we go from that to an elastic binary tree?

The elasticity in the elastic binary tree comes from the realization that, for n leaves, if all the nodes have both their children assigned, there will be n minus 1 nodes. In this case, we have five leaves and exactly four nodes, and this relation remains true no matter how we insert or delete data from the tree. If we have 100 leaves, we'll have 99 nodes. That gives us a hint: if we can somehow establish a permanent relationship between a node and a leaf, and if we can somehow cheat on that one-off, we might be able to reduce the number of memory operations by half.

This is a diagram of how we define our EBtree node: a node parent pointer, a bit counter, a left and a right pointer for the children of the node, the key data, and a parent pointer for the leaf. When we tie those two together, continuing with our example, we can combine them something like this. Here, we see that every leaf has an associated node with it; only the first one doesn't, so we're cheating with that one. The takeaway is that by establishing a permanent relationship between the nodes and the leaves, we solve our problem with memory management.

To see how this actually works in practice and how the relationship between the node and the leaf that we just established holds up, let's take a look at the insertion procedure. We start with a tree root, just a pointer pointing to the highest node in the tree; initially, with an empty tree, it's empty. After adding the first element, in our case the element with the key value of 1, we only need to connect the tree root pointer to the leaf. This is the node part we're cheating on; this node part is not going to be used.

We're done, thank you for coming. I'm just joking. If we add the second element, we see that this is where we use both the node and the leaf part of the data structure we just defined. Here, we're just reassigning pointers back and forth. It looks confusing, but if we zoom out, this is the structure we achieved: we have the first element with an unused node part, and we have the second element, with the key value of 2, which has both its node and its leaf parts used.

We can continue and add a third element, with the key value of 3. We're going to use both its node and its leaf part, but interestingly, the node part of the third element is going to insert itself between the node part and the leaf part of the element with the key value of 2. If we zoom out, this is what it looks like, and this is how, if we observe element 2, the parts of its data structure logically drift apart as we insert other elements between them, as indicated by the arrows. The takeaway here is that the node and the leaf part of element number 2 are still the same contiguous memory area, allocated at once; we just keep using it like that. If we can delete it at once and deallocate it at once, then our problem with memory management is completely solved.

If we continue adding more elements, numbers 4 and 5, we end up with something like this. Here, we see that element 4 has had its node and leaf parts drift logically further apart, but the illustration also shows the use of the bit value. The bit value tells us which is the first bit, counting from the end of the key, where the data starts to differ between the left and the right branch. Below the leaves, I've written out the values of the keys in binary, and if we compare the left and the right side of the tree, the third bit from the bottom is where the differences start. All the bits above the third bit are the same in every key, and the third bit has a value of 0 for the left side of the tree and a value of 1 for the right side.
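As a rough illustration of what that bit value captures, the position of the highest differing bit between two keys can be found by XORing them: everything above that position is the common prefix, and that bit alone decides left versus right. This is a naive sketch, not the EBtree code itself:

```c
#include <assert.h> /* for the usage assertions */
#include <stdint.h>

/* Return the position of the highest bit where a and b differ,
 * counting from 0 at the least significant end; -1 if a == b. */
static int highest_differing_bit(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;   /* 1s mark the positions where a and b differ */
    int bit = -1;         /* -1 means the keys are identical */
    while (x) {
        bit++;
        x >>= 1;          /* the last increment lands on the highest set bit */
    }
    return bit;
}
```

In the talk's example, keys 1, 2, 3 (left side) and 4, 5 (right side) first differ at bit position 2, the third bit from the end.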

As we go down, the same holds true for every node. If we go one level down, for elements 1, 2, and 3 the difference starts at the second bit from the right. All the bits leading up to it are the same; element 1 has a 0 there, so it goes to the left, while elements 2 and 3 have a 1 at that position, so they go to the right. Finally, elements 2 and 3 have only one bit of difference, and that one bit is at the end. The same can be said for elements 4 and 5: all the bits leading up to the last one are the same, and just the last bit differs.

Deletion, in our case, is most likely going to start with the first element, the smallest one, because that's the task that expires soonest, so it's the first one we'll want to remove from the scheduler. Remember that this first task also has the unused node part associated with it, and when we delete it, we deallocate both the node and the leaf part at the same time. To delete it, we just take its sibling and connect it to its grandparent. By doing that, the node part of element number 2 is no longer accessible from the root of the tree. It's still allocated, but it's no longer used, and it becomes that minus-1, the cheating element from earlier. We don't have to worry about it, because it will be removed when we eventually extract element number 2 from the tree and deallocate its entire memory structure. Throughout this process, we only had to reassign two pointers, the parent and the left pointer of the grandparent, and we end up with a tree that keeps the property of having n leaves and n minus 1 nodes.

Implementation

We solved the problem of memory allocation, but what can we do about our tree potentially being unbalanced? Compared to a self-balancing tree, we might end up with five or six times more operations to search for a value in the tree. The solution: we just make everything five or six times faster. How do we do that?

Let's take a look at some tools we use in the implementation to make up for that potential loss of speed in some corner cases. The first is pointer tagging, where we take advantage of a property of modern architectures with byte-addressable memory. Pointer addresses are in bytes, but when allocating memory, we only allocate in word sizes and align to word sizes. For 32-bit architectures, the word size is 4 bytes, and for 64-bit architectures, it's 8 bytes, which means that, in the 64-bit case, we're never going to get an address from our memory allocator with any of the last 3 bits nonzero. We have some example pointers here with their bit values, and these last three bits are never going to be set, so they're free for the taking. We can store any three-bit value in them, and as long as we clear them before dereferencing the pointer, everything works fine.
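The technique can be sketched in a few lines of C. The macro and function names here are illustrative; the tag values themselves are whatever the data structure needs (EBtree uses them to mark, for example, node versus leaf):

```c
#include <assert.h> /* for the usage assertions */
#include <stdint.h>
#include <stdlib.h>

/* On 64-bit platforms, malloc'd pointers are at least 8-byte aligned,
 * so the low 3 bits of a valid address are always zero. We can stash a
 * small tag there, as long as we mask it off before dereferencing. */
#define TAG_MASK ((uintptr_t)7)

static void *tag_ptr(void *p, unsigned tag)
{
    return (void *)((uintptr_t)p | (tag & TAG_MASK));
}

static unsigned get_tag(void *p)
{
    return (unsigned)((uintptr_t)p & TAG_MASK);
}

static void *clear_tag(void *p)
{
    return (void *)((uintptr_t)p & ~TAG_MASK);
}
```

Reading the tag costs no memory access at all: the information travels inside the pointer itself.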

Another thing - and this is a humorous observation by Willy that we use quite a lot - is that C is a portable assembler. We can take a look at a couple of examples of why this is the case and how we use it. The first one is historical, a legacy reason, and you'll still find it in the code: the regparm compiler directive. On 32-bit architectures, the CPU has only 8 general-purpose registers, and the compiler was not likely to use them to exchange data between function calls, opting instead to use the stack. Using the stack means a trip to main memory, and a 10-, 20-, or more-nanosecond delay for every operation there. With this directive, we can force 32-bit compilers, where possible at all, to keep using the general-purpose registers of our CPU. Current 64-bit CPUs have plenty of registers, so the compiler should do the right thing even without this.

Then we have forced inlining, a common programming technique, where we direct the compiler to insert the entire body of one function into another, all its assembler instructions, without going through a function call and the stack.

We have the __builtin_expect compiler directive as well, and this one is interesting. It tells the compiler which branch of the code we think is most likely to be taken as our program runs. The compiler then takes the assembler instructions for that most likely path after an if-statement, for example, and lays them out sequentially. We no longer have to rely on processor branch prediction to make the right choice: the instructions that we know are most likely to be executed are already loaded into the CPU instruction cache and fall through without a jump. Lastly, we can always resort to writing assembler code ourselves, and that's what we do in a couple of performance-critical places.
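The usual way to use __builtin_expect (a GCC/Clang built-in) is through the common likely/unlikely macro pair; the `checked_div` function here is just an illustrative example of marking a rare error path:

```c
#include <assert.h> /* for the usage assertions */

/* Tell the compiler which branch to lay out first, so the hot path
 * runs straight through without a taken jump. Falls back to a no-op
 * on compilers without __builtin_expect. */
#if defined(__GNUC__)
#  define likely(x)   __builtin_expect(!!(x), 1)
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define likely(x)   (x)
#  define unlikely(x) (x)
#endif

static int checked_div(int a, int b)
{
    if (unlikely(b == 0))   /* error path, expected to be rare */
        return 0;
    return a / b;           /* hot path, laid out sequentially */
}
```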

Let's take a look at what we ended up with; note how the design translates directly into C code. We have the branches, we have the bit field, we have the node parent pointer and the leaf parent pointer. The thing we don't have here is the data, the key: this is the base struct, which we can use for implementing different EBtree variants that each implement a particular data type. With the base node definition, we can also have base functions.

Here, I've highlighted one place where we use pointer tagging a lot: we can use it to check whether the node we are pointing to is a left or a right child, or whether it is a leaf or a link node. That saves us having to dereference the pointer and fetch the node from main memory, because we already have that information within the pointer itself.

Some of the EBtree data types relying on these base functions and structs are a type that stores 32-bit integers and a type that stores 64-bit integers. We have one that's used for pointers, and we have indirect memory blocks and strings. The variants differ in that one stores pointers to the data, while the other stores memory blocks allocated contiguously, immediately after the data of the node. All of them support the same set of functions.

An eb64 node looks like this; it's very simple, containing the basic node part and then the key following it. The functions that differ between the various tree implementations are the ones for insertion and lookup of keys, because this is where the keys start to matter. All the other operations, finding the next node, deleting a node, traversing the tree, can be performed without ever touching the key part. But because the keys differ between 32-bit keys, 64-bit keys, strings, and memory blocks, each variant has its own functions implementing lookups and inserts. The lookup code is used for inserts as well, because when inserting, we still need to find the location in the tree where we're going to insert the data.
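Sketched in C, following the description above; the field names approximate the real EBtree code and may differ in detail:

```c
#include <assert.h> /* for the usage assertions */
#include <stdint.h>

/* Tagged pointer type: low bits carry node-vs-leaf and side information. */
typedef void eb_troot_t;

struct eb_root {
    eb_troot_t *b[2];           /* left (0) and right (1) branches */
};

struct eb_node {                /* generic part, no key */
    struct eb_root branches;    /* the node part's two children */
    eb_troot_t *node_p;         /* parent of the node part */
    eb_troot_t *leaf_p;         /* parent of the leaf part */
    short bit;                  /* first bit where the branches differ */
};

struct eb64_node {              /* typed variant: generic part + key */
    struct eb_node node;        /* must come first so base functions can
                                   operate on any variant uniformly */
    uint64_t key;
};
```

Because the generic part comes first, a pointer to an eb64_node is also a valid pointer to its eb_node, which is how the type-agnostic base functions (next, delete, traverse) work on every variant.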

Here, the code is also relatively simple. If we find a leaf, then we check that its key is exactly the same as the one we're searching for, and if so we can return the node. If not, we find the difference in bits between the key we're searching for and the one we stumbled upon while traversing. Here's where we use the bit value: it lets us mask out all the bits we're not interested in at this step of the traversal, and this is how we confirm whether everything up to this particular node shares the same prefix or not. If it doesn't, we have to return NULL, because it's not possible for that data to be contained in this tree. If it does, there's only one bit of difference, and that's the bit that tells us whether it is the left side or the right side of the tree that we need to descend into next.
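The descent logic can be sketched on a plain binary prefix tree over 32-bit keys, without the fused node/leaf memory trick; all names here are illustrative, not the real eb32/eb64 lookup code:

```c
#include <assert.h> /* for the usage assertions */
#include <stddef.h>
#include <stdint.h>

struct pnode {
    int bit;               /* -1 for a leaf */
    uint32_t key;          /* full key at leaves; any covered key at nodes */
    struct pnode *kids[2]; /* left (bit = 0) and right (bit = 1) */
};

static struct pnode *prefix_lookup(struct pnode *root, uint32_t key)
{
    struct pnode *n = root;
    while (n) {
        if (n->bit < 0)                  /* reached a leaf: full compare */
            return n->key == key ? n : NULL;
        /* Check that all bits above n->bit match the prefix at this node;
         * if not, the key cannot be anywhere in this subtree. */
        if ((key ^ n->key) >> (n->bit + 1))
            return NULL;
        /* The one differing bit picks the side to descend into. */
        n = n->kids[(key >> n->bit) & 1];
    }
    return NULL;
}
```

Note that at each step only the bits around `n->bit` matter, which is why very long keys (strings, memory blocks) can be compared piecewise instead of in full at every node.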

Production Use

That's how we defined our EBtree code. Let's take a look at what the HAProxy scheduler looks like. We already mentioned the tasks: they can be active or suspended, they have some code, and they have a timestamp. We're going to have a group of suspended tasks; this is an EBtree indexed on the expiration dates of the tasks in it. And we're going to have a group of active HAProxy tasks, a different EBtree, also indexed on expiration dates, but this time taking priority into account as well, so that more important tasks are plucked out sooner and processed sooner.

If we combine all the things we talked about and draw them into a more elaborate and lifelike diagram, we get something like this. We have an I/O scheduler, epoll or whatever, reacting to changes on the active connections, the ones that have some data. Their data gets filled into buffers, and some tasks get created to process those buffers. Those tasks run until they exhaust their buffer data, and then they get put into the wait queue. Occasionally, we check on that wait queue, or, if the connection gets dropped, we can directly look up the task in the wait queue and terminate it there.

That was the diagram of HAProxy's event loop. This is almost exactly the same code that you will find in the source, simplified just for readability. We have the function that runs the active tasks, we have the part that wakes up any tasks from the wait queue that need to run next, and then we have the part that takes care of network connections.
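One tick of that loop, minus the polling, can be sketched like this. Plain arrays stand in for the two EBtrees, and all names are illustrative, not HAProxy's:

```c
#include <assert.h> /* for the usage assertions */
#include <stdint.h>

#define MAXTASKS 16

struct task {
    uint64_t expire;
    void (*process)(struct task *);
};

struct tq { struct task *t[MAXTASKS]; int n; };

static void tq_push(struct tq *q, struct task *t) { q->t[q->n++] = t; }

/* Stage 1: move every wait-queue task whose expiry has passed to the run queue. */
static void wake_expired(struct tq *wait, struct tq *run, uint64_t now)
{
    int i = 0;
    while (i < wait->n) {
        if (wait->t[i]->expire <= now) {
            tq_push(run, wait->t[i]);
            wait->t[i] = wait->t[--wait->n];  /* O(1) unordered remove */
        } else {
            i++;
        }
    }
}

/* Stage 2: run every active task; return how many ran. */
static int run_all(struct tq *run)
{
    int ran = run->n;
    for (int i = 0; i < run->n; i++)
        run->t[i]->process(run->t[i]);
    run->n = 0;
    return ran;
}

static int processed;                      /* visible side effect for the demo */
static void count_task(struct task *t) { (void)t; processed++; }
```

In the real scheduler, both queues are EBtrees, so "find everything expired" is a walk from the smallest key rather than a scan, and tasks that still need to run get reinserted into the wait queue.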

That's it; our tasks look like this. We have the entries that get inserted into either EBtree via the node part we just talked about. We have the function pointer for the processing code of this particular task, and we have the expiry date. We can schedule tasks by inserting them into the wait queue. We can wake them up by looking them up in the wait queue, modifying their expiry time by some nice value, and then inserting them into the run queue. In the run queue, we look at which one is the lowest, the soonest, the earliest one. We access its entire struct so that we can get its processing function pointer, we invoke that, and if the task still needs to run, we queue it back into the waiting tasks list.

That's not all: besides timers and the scheduler, we use EBtrees elsewhere too. We use them for ACLs, where we can rely on prefix matching to match recorded IP addresses against IP address subnets. We use them for stick tables, where we store stats, counters, and values related to processed HTTP requests. We also use them for an LRU cache: in HAProxy, we have a zero-maintenance, small, fast LRU cache, which we used to call the favicon cache, as an additional way to help out backend performance.
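As an illustration of why prefix lookups map so well to ACLs: testing whether an IPv4 address belongs to a CIDR subnet is a comparison of the top bits only, which is exactly the kind of question a prefix tree answers at each node. A naive standalone check, not HAProxy's code:

```c
#include <assert.h> /* for the usage assertions */
#include <stdint.h>

/* True if addr falls inside the subnet net/len (host byte order). */
static int in_subnet(uint32_t addr, uint32_t net, int len)
{
    /* A /len subnet keeps the top len bits; /0 matches everything.
     * The ternary avoids an undefined shift by 32. */
    uint32_t mask = len ? ~0U << (32 - len) : 0;
    return (addr & mask) == (net & mask);
}
```

In an EBtree, all addresses sharing a prefix live under one subtree, so a subnet ACL entry is a single prefixed key rather than an enumeration of addresses.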

On the performance of EBtrees: depending on conditions, the CPU type, and so on, in ideal conditions we can get down to 100 nanoseconds per insert. This allows HAProxy to process more than 200,000 TCP connections per second, or more than 350k HTTP requests per second with connection keep-alive enabled. Out of that, the HAProxy scheduler is using only 3-5% of the CPU; the rest is taken up by the remaining processing tasks. We have a utility in the HAProxy source code that's used to parse default HAProxy logs, index them, and search through them. It runs at up to 4 million log lines per second, and that's, in part, thanks to EBtrees. If we insert 450k BGP routes into an EBtree, we can get more than 2 million lookups per second out of it.

That's where we ended up. We're going to skip the LRU cache implementation walkthrough, because it's just as simple as, even simpler than, the scheduler we talked about. You can find it in the EBtree source code as an example; the code there is almost the same as the code in HAProxy.

Results

What did we end up with? A pretty nice tree implementation, but it's not just that. We didn't take the existing implementations as something that cannot be improved upon. We took a look at what was out there as an existing solution, we took it apart, and we found ways to improve it, both algorithmically and through an implementation strictly tailored to our environment, and we got a good result out of that.

You can check out EBtree at its source repository. You can check out HAProxy at haproxy.org or haproxy.com. You can reach the developers on the HAProxy mailing list or just email me directly.