So I want a hybrid mutex, but C++11 only provides blocking mutexes. Can I make this horse into a zebra? It’s worth a try. Fortunately std::mutex provides try_lock(), a nonblocking routine that returns a boolean indicating whether the mutex was acquired. We can use try_lock() to drive the spin phase of a hybrid mutex, then fall back to lock() when needed:

```cpp
class spin_mutex_t {
    std::mutex _m;

public:
    void lock();
    void unlock();
};
```

The only other member variable required for our class is an adjustable high-water mark that determines when the mutex should block:

```cpp
class spin_mutex_t {
    std::mutex _m;
    std::atomic<long long> _p{1};

public:
    void lock();
    void unlock();
};
```

The most interesting elements of the code are in the lock() routine, which I’ll detail here:

```cpp
void spin_mutex_t::lock() {
    using clock_t = std::chrono::high_resolution_clock;
    auto before = clock_t::now();
    long long measured{0};

    // Spin phase: keep trying until we either acquire the lock or
    // have spun for twice the predicted wait time.
    while (!_m.try_lock()) {
        measured = (clock_t::now() - before).count();
        if (measured >= _p * 2) {
            _m.lock();   // blocking phase
            break;
        }
    }

    // Fold this measurement into the predictor.
    _p += (measured - _p) / 8;
}
```

The routine starts by timestamping when it begins, which it will use in deciding if/when to block as well as adjusting the predictor.

The spin loop repeatedly calls try_lock(). If it succeeds, we have locked during the spin phase and move on. If not, we determine the amount of time elapsed while spinning. Once that elapsed time reaches twice the predictor, we fall through to the blocking phase of the mutex by calling lock().

At this point in the routine the mutex is locked, regardless of the phase in which it was acquired. The only step that remains is to adjust the predictor based on what we learned this last time through the lock routine. The technique is to derive the next predictor value with a proportioned combination of the previous predictor and the measured value:

Adjusting the predictor: _p += (measured - _p) / 8; In other words, the new predictor is 7/8 of the previous value plus 1/8 of the latest measurement, an exponential moving average.

Results

With such a basic implementation, I decided to see how it behaved. The first thing I added was a probe callback that reports the mutex’s behavior after every lock acquisition. I then ran a battery of tests.

Test 1: “0 slow”

For this test, I spun up five threads that each locked the spin_mutex_t 100 times as quickly as they could, and charted the results.

0 slow acquisition by phase

This bears some explanation. The probe callback was told in which phase the lock was acquired: 1 for blocking, 0 for spinning. The red dots represent each of the data points, while the black line represents a rolling average of the last 50 lock attempts. The trend starts very high, which is to be expected as the predictor adjusts to how long it takes to lock and unlock immediately. About 70 locks in, the predictor is accurate enough that locks are acquired in the spin phase, and the trendline starts to dip down. Towards the end of the test, it is plain to see that more locks are acquired during the spin phase than by blocking.

Test 2: “1 slow”

The next test is designed to imitate the scenario described above: most threads have a short critical section time, but some might be in there a while. I spun up the same five threads, each immediately unlocking save for one which has a 3ms delay:

1 slow acquisition by phase

The behavior is similar, but the trendline settles in around 50%, higher than in the 0-slow test.

Test 3: “5 slow”

This is the worst-case scenario: every thread is given the 3ms delay. Although such a workload would be better served by a plain blocking mutex, I wanted to make sure the behavior was at least tolerable, even though using the hybrid mutex here is a pessimization:

5 slow acquisition by phase

It is interesting to note that as more threads are given the delay, the variation of the trendline goes down. I suspect this is due to the side effects of the operating system (e.g., yielding) becoming more muted as the predictor values become larger.

Moving Forward

There are several questions and next steps I am interested to pursue with the implementation:

Are the slowdown times appropriate? What would be better numbers to model real-life use cases?

Auto-detecting when the spin phase would exceed the time required to block, and blocking from the start when doing so performs better.

Fine-tuning the constants of the lock. Could adjusting the time-to-block multiplier make the mutex more performant overall? What about the multiplier used to adjust the predictor?

We know the critical section ends with a call to unlock(). Might that be used to adjust the predictor, or to improve performance in lock() by anticipating how long the next critical section might be? For example, if the critical section is found to take more time than blocking the thread, maybe it is worthwhile to simply block at the start of the next call to lock()?

Open Source

I have made the code available on GitHub. Please let me know what you think, including snafus or ways in which I might improve things.

Footnotes

† For the uninitiated, a critical section is a block of code wherein no more than one thread should be running at a time. This is solved by a thread first acquiring (or locking) a mutex that guards the critical section. Mutexes are built such that locking is an atomic operation: it will only complete on one thread at a time, regardless of how many threads are attempting to lock. All competing threads will be unable to continue until the mutex is unlocked. It’s not unlike the conch in Lord of the Flies, but with less bloodshed.

†† It should be noted here that both types of threads are subject to the OS, which can yield them preemptively at any time. This is similar to a block but is imposed by the OS, not the mutex architecture. I won’t be covering OS thread scheduling in any more detail.