One of the frustrations of computer programming (almost certainly shared with other engineering disciplines) is that, often, a simple, elegant, and general design doesn't work as well as an ugly hack. Such designs still have value as they are more maintainable and more extensible, so it is not uncommon to need to find a balance between simple elegance and practical efficiency. The story of futex support in Linux could be seen as a story of trying to find just this balance. The latest episode adds a new special case, but provides impressive performance improvements.

Some futex history

The original design of futexes — a kernel interface to support Fast User-space muTEXes — introduced a four-byte memory location that would always be updated atomically. A key aspect of the design was that the kernel needed only the simplest understanding of a futex's contents; a comparison with a value provided by user space was all that was ever needed. All updates were handled by user-space code and, if a program ever found that it needed to wait for the value to change, such as to wait for a lock to be released, it would use the futex() system call to ask the kernel to wait for that change. When some other thread changed the value, it would tell the kernel to wake up some number of sleeping processes, and they could examine the new value and act accordingly.
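The protocol just described can be sketched in a few lines of C. This is a toy illustration, not glibc's implementation: the toy_lock()/toy_unlock() names are invented here, and a production mutex would also track a "contended" state so that the uncontended unlock path can skip the system call entirely.

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc exposes no futex() wrapper, so the raw system call is used. */
static long futex(uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* A toy lock: 0 means unlocked, 1 means locked.  All updates are atomic
 * operations performed in user space; the kernel is only asked to sleep
 * and to wake sleepers. */
static void toy_lock(_Atomic uint32_t *f)
{
    uint32_t expected = 0;
    while (!atomic_compare_exchange_strong(f, &expected, 1)) {
        /* Sleep, but only if the word still holds 1; if it changed in
         * the meantime, the kernel returns immediately. */
        futex((uint32_t *)f, FUTEX_WAIT, 1);
        expected = 0;
    }
}

static void toy_unlock(_Atomic uint32_t *f)
{
    atomic_store(f, 0);
    futex((uint32_t *)f, FUTEX_WAKE, 1);   /* wake at most one waiter */
}
```

Because this version cannot tell whether anyone is actually waiting, every toy_unlock() enters the kernel; avoiding needless transitions like that is exactly what the extensions described below keep trading simplicity for.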

This design is simple and elegant, but imperfect. The kernel has minimal knowledge of what user space is doing, and user space has little access to relevant information that the kernel maintains, so the two are limited in the extent to which they can work together. This disconnect has required a number of extensions over the years: three that are in the mainline now, and one that is on the horizon.

The first extension was needed to optimize the implementation of pthread_cond_signal(), which sometimes needs to send a wakeup on one futex (a condition variable), unlock a mutex (represented by another futex), then possibly send a wakeup on that second futex. At the same time, another thread might be waiting on the first futex and will immediately try to lock the mutex, which could require waiting on the second futex. Having this second thread wake up and go straight back to sleep leads to measurably poor performance. Neither the documentation nor the changelogs make it clear why the wakeups cannot both be performed after the mutex is unlocked, but they do assert that this is racy, and so a new futex operation was created.

FUTEX_WAKE_OP is given two futexes and instructions on how to unlock the second. It will perform that unlocking, wake up waiters on the first futex, then conditionally wake up waiters on the second as well. It does all this atomically with respect to any other operations on either futex. This was the first time that the kernel needed to modify the value of the futex itself; Jakub Jelinek came up with a fairly generic mechanism to describe the unlock operation. Five different operations are provided to combine an operand with the current value of the futex, and then one of six different comparisons can be performed against a second operand to decide if the second wakeup should happen. This seems fairly powerful and suitably general, but was probably wasted effort. Only a single operation (set to zero) is ever used by glibc, and only a single comparison (greater than one). It is unlikely that the other options will ever be used, in part because of the subsequent extensions that impose structure on the value.
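As an illustration, here is what that encoding looks like in practice. The FUTEX_OP() macro and its constants come from <linux/futex.h>; the wake_op() wrapper name is invented here. The encoded operation shown is the one combination glibc actually uses: set the second futex to zero and wake its waiters only if the old value was greater than one.

```c
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The one encoding glibc uses: new value = 0, wake if old value > 1. */
#define UNLOCK_AND_WAKE_IF_CONTENDED \
    FUTEX_OP(FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1)

/* Atomically: apply the encoded operation to *uaddr2, wake up to nr1
 * waiters on uaddr1 and, if the comparison against the old value of
 * *uaddr2 succeeds, up to nr2 waiters on uaddr2.  nr2 is passed in the
 * argument slot normally used for the timeout pointer. */
static long wake_op(uint32_t *uaddr1, uint32_t *uaddr2, long nr1, long nr2)
{
    return syscall(SYS_futex, uaddr1, FUTEX_WAKE_OP, nr1,
                   nr2, uaddr2, UNLOCK_AND_WAKE_IF_CONTENDED);
}
```

Note that the operation on the second futex is always applied, even when there are no waiters to wake; only the second wakeup is conditional on the comparison.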

It is possible for a thread to be killed before it makes an expected change to a futex that other threads are waiting for. Only the user-space code knows which thread "owns" the lock (or even what sort of locking is being used) and only the kernel knows when a process dies unexpectedly. To allow those waiting threads to discover that something is wrong, some extra communication was needed. This gave rise to "robust futexes" that allow a thread to register a linked list of futexes whose waiting threads need to be woken up if the thread ever dies.

The use of robust futexes significantly reduced the flexibility of futexes. The four-byte memory location now has a fully defined meaning: 30 bits provide the thread ID of the owner of the futex, one bit records whether any other threads are waiting on the futex, and one bit indicates whether the previous owner died. This means that robust futexes cannot be used to create counting semaphores or reader/writer locks; they can only be used for binary mutual-exclusion locks. It also means that most of the operations provided for FUTEX_WAKE_OP are of no value for robust futexes.
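The kernel exports this layout in <linux/futex.h>, so the state of a robust futex can be decoded directly; the decode_futex() helper below is only an illustration:

```c
#include <linux/futex.h>   /* FUTEX_WAITERS, FUTEX_OWNER_DIED, FUTEX_TID_MASK */
#include <stdint.h>

struct futex_state {
    uint32_t owner_tid;   /* low 30 bits: TID of the owner, 0 if unlocked */
    int has_waiters;      /* top bit: other threads are waiting */
    int owner_died;       /* next bit: previous owner exited while holding */
};

static struct futex_state decode_futex(uint32_t word)
{
    struct futex_state s = {
        .owner_tid   = word & FUTEX_TID_MASK,
        .has_waiters = !!(word & FUTEX_WAITERS),
        .owner_died  = !!(word & FUTEX_OWNER_DIED),
    };
    return s;
}
```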

The more flexible, non-robust futexes could still be used as private futexes, shared between the threads of a single process but never between processes. In that context, an unexpected failure will kill the whole process rather than a single thread, so no recovery handling is needed.

The third extension of interest was support for priority inheritance. If threads of different priorities can claim a lock then, when a low-priority process holds the lock and a high-priority process is waiting for it, any medium-priority process that prevents the low-priority process from running will indirectly interfere with the high-priority process, which is not desirable. This "priority inversion" is usually addressed using priority inheritance, which causes the low-priority process to run with the priority of the highest-priority process that is waiting for it. Linux has priority-inheritance mutex locks internally, so the priority-inheritance (PI) extension to futexes allocates one of those whenever a PI futex is contended, and uses it to manage priority.
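Because PI futexes reuse the TID-based value layout, the uncontended paths stay entirely in user space; only contended operations enter the kernel, which is when the rt_mutex comes into play. A minimal sketch (error handling omitted; the pi_lock()/pi_unlock() names are invented here):

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

static void pi_lock(_Atomic uint32_t *f)
{
    uint32_t expected = 0;
    uint32_t tid = (uint32_t)syscall(SYS_gettid);

    /* Fast path: change 0 -> TID in user space; the kernel never sees it. */
    if (atomic_compare_exchange_strong(f, &expected, tid))
        return;

    /* Contended: the kernel sets up an rt_mutex, queues this thread on
     * it, and boosts the owner's priority as needed. */
    syscall(SYS_futex, f, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

static void pi_unlock(_Atomic uint32_t *f)
{
    uint32_t expected = (uint32_t)syscall(SYS_gettid);

    /* Fast path: only valid while the FUTEX_WAITERS bit is clear. */
    if (!atomic_compare_exchange_strong(f, &expected, 0))
        syscall(SYS_futex, f, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
```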

This extension was the first to introduce the verbs "lock" and "unlock" (and "trylock") into the futex interface. Previously, the interfaces only talked about "waking" and "waiting", with the implication that a variety of different services could be built on that. For priority-inheritance locks, at least, that pretense is now gone. It really is just a lock.

What's next for futexes

In the kernel, one of the improvements that has been made to mutexes in recent years is the addition of adaptive spinning. The theory is that sometimes a mutex is only held for a short period of time and, in those cases, it is more efficient to busy-wait for the lock to become free than to go to sleep and later be woken up. If the busy-wait doesn't look like it will be successful, only then is the thread put to sleep. Making the choice between spinning and sleeping is the adaptation referred to in the name.

It should be no surprise that this optimization would be useful for user-space locking with futexes as well. Waiman Long has found that, for a particular micro-benchmark, standard (wait/wake) futexes can achieve a mere 35 million operations in ten seconds, while adaptive spinning increases that to over 54 million. This technique had been tried before by Darren Hart, though his reported results weren't as impressive, probably because modern processors have many more cores, and high core counts tip the balance toward spinning over sleeping. While micro-benchmark results should be treated with caution, a roughly 50% improvement deserves some attention.

The user-space code could, of course, simply spin for, say, 20 microseconds before giving up and asking the kernel to put it to sleep. While simple, this approach is far from ideal. If the process holding the lock is sleeping, busy-waiting for it is a waste of power and could possibly increase the total wait time. It only makes sense to busy-wait if the process owning the lock is itself running on a CPU.
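A fixed spin-then-sleep fallback of that kind might look like the sketch below; SPIN_BUDGET is an arbitrary stand-in for "about 20 microseconds", and wait_for_zero() is a name invented here. The weakness is visible in the code: nothing tells the spinner whether the lock owner is actually running.

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_BUDGET 4000   /* arbitrary tuning knob, "roughly 20us" */

/* Wait for a 0/1 futex word to become 0 (unlocked): spin for a while,
 * then fall back to sleeping in the kernel. */
static void wait_for_zero(_Atomic uint32_t *f)
{
    for (int i = 0; i < SPIN_BUDGET; i++) {
        if (atomic_load_explicit(f, memory_order_acquire) == 0)
            return;              /* the lock was freed while spinning */
#if defined(__x86_64__) || defined(__i386__)
        __builtin_ia32_pause();  /* ease off the core while busy-waiting */
#endif
    }
    /* Spinning didn't pay off: sleep until a FUTEX_WAKE arrives.  The
     * kernel returns immediately if the word no longer holds 1. */
    while (atomic_load_explicit(f, memory_order_acquire) != 0)
        syscall(SYS_futex, f, FUTEX_WAIT, 1, NULL, NULL, 0);
}
```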

Here again, the separation between user space and kernel space is a problem. Only the kernel knows which processes are busy or sleeping. Either we need to tell user space when the owner of a futex is sleeping, or tell the kernel that it should spin for a while before taking the lock. Both of these are probably possible, but moving the whole locking operation into the kernel is probably easiest, matches the approach that PI futexes use, and is the approach that Long is exploring.

The patchset

Long's latest patchset adds the FUTEX_LOCK and FUTEX_UNLOCK operations to the futex() system call; they work in a similar fashion to FUTEX_LOCK_PI and FUTEX_UNLOCK_PI, but without the priority inheritance. They use a regular mutex to help implement the locking, but in a slightly different way than the PI extension's use of an rt_mutex.

By the time the futex() system call is first made, there are already two processes interested in the lock: one holds it, while the other wants to acquire it. The PI extension initializes a new rt_mutex in a locked state and makes it appear that the thread that owns the futex also owns this rt_mutex. There is substantial complexity in making this work in a race-free way but, in essence, that is what happens. The second process then waits for the mutex in a fairly normal way.

The new adaptive spinning extension (called "throughput optimized", which describes the goal rather than the implementation) uses a mutex only to arbitrate between the different threads that might be waiting, not to arbitrate between them and the thread that owns the lock. Whichever thread manages to claim this mutex is the "top waiter" and gets to decide whether to busy-wait for the current owner to release the lock, or to go to sleep to be woken by the usual futex wakeup mechanism.

Long did originally try to follow the same model as PI futexes, but found that the performance wasn't close to what he wanted, because too many unlock requests still went through the kernel. Futexes only need the kernel to be involved when there is contention: when there is one lock owner and at least one lock waiter. Additionally, if there is only one waiter, then once the owner releases the lock and that waiter becomes the owner, the kernel no longer needs to be involved; when that new owner drops the lock, it should be able to do so without entering the kernel. With PI futexes, however, the rt_mutex stays allocated until the lock is completely released (i.e. until there are no more owners), which means that the last unlock goes through the kernel. Long claims that this extra kernel involvement reduces throughput significantly.

Another benefit of maintaining control of the busy-waiting separately from the kernel mutex is that lock stealing becomes possible; this sacrifices some fairness for performance. There is a small window between the moment when the owning thread unlocks a futex and the moment when the top waiter locks it. If another thread tries to claim the lock during that window, it can successfully steal the lock. This is seen as a good thing, presumably because that new thread has its working set in cache and will likely make progress quickly. Long's patches explicitly allow this stealing, but also put a limit on it: if the top waiter is woken up after sleeping and fails to get the lock, it sets a flag asking that the next thread to unlock the lock perform an explicit handoff to the top waiter instead, thus avoiding further theft. Long provides numbers suggesting that this improves throughput (the main goal) and also improves fairness.

Responses

The reception of the patch set so far has been cautious. The results appear encouraging, but there are some questions, including whether the code might be driven too much by that one benchmark. Two comments from Thomas Gleixner reveal his concerns; first:

I'm not really happy about these heuristics. The chosen value fits a particular machine and scenario. Now try to do the same on some slow ARM SoC and the 1024 loops are going to hurt badly.

and later:

So the benefit of these new fangled futexes is only there for extreme short critical sections and a gazillion of threads fighting for the same futex, right? I really wonder how the average programmer should pick the right flavour, not to talk about any useful decision for something like glibc to pick the proper one.

Given that adaptive spinning has made its way into the in-kernel mutexes, it would be surprising if a way could not be found to make it work well for user space too. Of course, it would also be surprising if the first attempt at providing such a feature had found the right balance between the various competing needs. We don't have efficient adaptive spinning for futexes yet but, if it really brings value, it shouldn't be too far away.


The Turris Omnia router is not the first FLOSS router out there, but it could well be one of the first open hardware routers to be available. As the crowdfunding campaign is coming to a close, it is worth reflecting on the place of the project in the ecosystem. Beyond that, I got my hardware recently, so I was able to give it a try.

A short introduction to the Omnia project

The Omnia router is a follow-up to CZ.NIC's original research project, the Turris. The goal of that project was to identify hostile traffic on end-user networks and develop global responses to those attacks across every monitored device. The Omnia is an extension of the original project: more features were added and data collection is now opt-in. Whereas the original Turris was simply a home router, the new Omnia router includes:

- 1.6GHz ARM CPU
- 1-2GB RAM
- 8GB flash storage
- 6 Gbit Ethernet ports
- SFP fiber port
- 2 Mini-PCI Express ports
- mSATA port
- 3 MIMO 802.11ac and 2 MIMO 802.11bgn radios and antennas
- SIM card support for backup connectivity

Some models sold had a larger case to accommodate extra hard drives, turning the Omnia router into a NAS device that could actually serve as a multi-purpose home server. Indeed, it is one of the objectives of the project to make "more than just a router". The NAS model is no longer on sale, but there are plans to bring it back along with LTE modem options and new accessories "to expand Omnia towards home automation".

Omnia runs a fork of the OpenWRT distribution called TurrisOS that has been customized to support automated live updates, a simpler web interface, and other extra features. The fork also carries patches to the Linux kernel, which is based on Linux 4.4.13 (according to uname -a). It is unclear why those patches are necessary, since the ARMv7 Armada 385 CPU has been supported in Linux since at least 4.2-rc1, but it is common for OpenWRT ports to ship kernel patches, either to backport missing functionality or to perform some optimization.

There has been some pressure from backers petitioning Turris to "speedup the process of upstreaming Omnia support to OpenWrt". It could be that the team is too busy delivering the devices already ordered to complete that process at this point. The software is available on the CZ-NIC GitHub repository and the actual Linux patches can be found here and here. CZ.NIC also operates a private GitLab instance where more software is available. There is no technical reason you couldn't run your own distribution on the Omnia router: OpenWRT development snapshots should be able to run on the Omnia hardware, and some people have installed Debian on it. Doing so may require some customization (e.g. of the kernel) to make sure the Omnia hardware is correctly supported. Most people seem to prefer to run TurrisOS because of the extra features.

The hardware itself is also free and open for the most part. There is a binary blob needed for the 5GHz wireless card, which seems to be the only proprietary component on the board. The schematics of the device are available through the Omnia wiki, but oddly not in the GitHub repository like the rest of the software.

Hands on

I received my own router last week, about six months after the original April 2016 delivery date, which allowed me to do some hands-on testing of the device. The first thing I noticed was a known problem with the antenna connectors: I had to open up the case and screw the fittings down tight, otherwise the antennas wouldn't screw in correctly.

Once that was done, I simply had to go through the usual process of setting up the router: connecting the Omnia to my laptop with an Ethernet cable, connecting the Omnia to an uplink (I hooked it into my existing network), and going through a web wizard. I was pleasantly surprised by the interface: it was smooth and easy to use, but at the same time imposed good security practices on the user.

For example, the wizard, once connected to the network, goes through a full system upgrade and will, by default, automatically upgrade itself (including reboots) when new updates become available. Users can opt out of the automatic updates, or choose to automate only the downloading and installation of the updates without having the device reboot on its own. Reboots are performed during user-specified time frames (by default, Omnia applies kernel updates during the night). I also liked the "skip" button that allowed me to completely bypass the wizard and configure the device myself, through the regular OpenWRT systems (like LuCI or SSH), if I needed to.

Apart from the antenna connectors, the hardware is nice. I ordered the black metal case, and I must admit I love the many LED lights on the front. The color changes are especially useful during the reset procedure: no more guessing what state the device is in or whether I pressed the reset button long enough. The LEDs can also be dimmed to reduce the glare our electronic devices produce.

All this comes at a price, however: at $250, the Omnia carries a much higher price tag than common home routers, which typically go for around $50. Furthermore, it may be difficult to actually get the device, because no orders are being accepted on the Indiegogo site after October 31. The Turris team doesn't want to deal with retail sales and has delegated them to other stores, which are currently limited to European deliveries.

A nice device to help fight off the IoT apocalypse

It seems there isn't a week that goes by these days without a record-breaking distributed denial-of-service (DDoS) attack. Those attacks are increasingly mounted from home routers, webcams, and "Internet of Things" (IoT) devices. In that context, the Omnia sets a high bar for how devices should be built, but also for how they should be operated. Omnia routers are automatically upgraded on a nightly basis and, by default, do not expose telnet or SSH ports that could be used to run arbitrary code. There is the password-less wizard that starts up on install, but it forces the user to choose a password in order to complete the configuration.

Both the hardware and software of the Omnia are free and open. The automatic updates' EULA explicitly states that the software provided by CZ.NIC "will be released under a free software licence" (and it has been, as mentioned earlier). This makes the machine much easier to audit by someone looking for possible flaws, say, a customs official looking to approve an import should IoT devices ever end up being regulated. But it also makes the device itself more secure. One of the problems with these kinds of devices is "bit rot": they have known vulnerabilities that are not fixed in a timely manner, if at all. While it would be trivial for an attacker to disable the Omnia's auto-update mechanisms, the point is not to resist an attacker who has already compromised the device, but to prevent attacks on known vulnerabilities.

The CZ.NIC folks take it a step further and encourage users to actively participate in a monitoring effort to document such attacks. For example, the Omnia can run a honeypot to lure attackers into divulging their presence. The Omnia also runs an elaborate data-collection program, where routers report malicious activity to a central server that collects information about traffic flows, blocked packets, bandwidth usage, and activity from a predefined list of malicious addresses. The exact data collected is specified in another EULA that is currently only available to users logged in at the Turris web site. That data can then be turned into tweaked firewall rules to protect the overall network, which the Turris project calls a distributed adaptive firewall. Users need to explicitly opt in to the monitoring system by registering on a portal using their email address.

Turris devices also feature the Majordomo software (not to be confused with the venerable mailing-list software), which can monitor devices in your home and identify hostile traffic, potentially leading users to take responsibility for the actions of their own devices. This, in turn, could lead users to trickle complaints back up to manufacturers, which could change their behavior. It turns out that some companies do care about their reputations and will issue recalls if their devices have significant enough issues.

It remains to be seen how effective the latter approach will be. In the meantime, the Omnia seems to be an excellent all-around server and router for even the most demanding home or small-office environments, and a great example for future competitors to follow.
