DConf 2013 was a huge success! I’ve finally recovered enough from my red-eye flight back to the East Coast to blog about a topic from it that really stuck out in my mind.

At DConf, I gave a talk on D-specific design patterns. (The video will be up on dconf.org shortly.) One pattern in particular attracted much more audience attention than I thought it would, so I decided to blog about it: The Low-Lock Singleton Pattern.

Most people who program for a living are probably familiar with the traditional Singleton Pattern, which is used to create a single, globally available instance of a class. A canonical implementation in D is shown below:

class MySingleton {
    static MySingleton get() {
        if (instance_ is null) {
            instance_ = new MySingleton;
        }
        return instance_;
    }

private:
    this() {}

    __gshared MySingleton instance_;
}

Most of these people also probably know that this naive implementation is a concurrency nightmare. You can’t safely call get() from multiple threads simultaneously. The most obvious reason is that two instances of MySingleton will be created if a second thread calls get() before the first thread’s call has finished, i.e. before the first call has made instance_ non-null.
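Written out step by step, the interleaving that yields two instances looks like this (a schematic, not actual program output):

```
Thread 1                             Thread 2
--------                             --------
get()
  instance_ is null  -> true
                                     get()
                                       instance_ is null -> true  (Thread 1 hasn't stored yet)
  instance_ = new MySingleton
  return instance_
                                       instance_ = new MySingleton  (a second instance!)
                                       return instance_
```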

If you want thread safety, the most obvious solution is to wrap everything in a mutex, represented in D by a synchronized block. Omitting all the boilerplate except the get() method, this would look something like:

static MySingleton get() {
    synchronized {
        if (instance_ is null) {
            instance_ = new MySingleton;
        }
    }
    return instance_;
}

Now, if Thread 1 calls get(), it has a chance to finish instantiating instance_ and storing a reference to it before Thread 2 checks whether instance_ is null. There’s only one problem with this: It comes at a huge performance cost (benchmarks forthcoming). Entering a synchronized block is expensive.

One solution that looks correct to the naive eye but is subtly broken is double-checked locking. The idea here is to check whether instance_ is null outside a synchronized block, and then if so, check again inside one:

static MySingleton get() {
    if (instance_ is null) {
        synchronized {
            if (instance_ is null) {
                instance_ = new MySingleton;
            }
        }
    }
    return instance_;
}

Intuitively this seems reasonable. instance_ can’t go from non-null to null, only from null to non-null. Therefore, if it’s already non-null by the time we reach the first check (the one outside the synchronized block), it would seem we know everything we need to know and can simply return instance_. Alas, the real world is not that simple. The details are complicated and best explained by others, but the executive summary is this: the sequentiality of program execution is an illusion maintained within a single thread. Compilers and CPUs reorder instructions, and the illusion breaks down when dealing with multiple threads. For example, instance_ can be assigned a non-null value before the object it refers to is fully constructed.
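Concretely, the single source line instance_ = new MySingleton; is really several steps, and the compiler or CPU is free to reorder them as long as the current thread can’t tell the difference. A schematic (the step names are illustrative, not real compiler output):

```
// Source:          instance_ = new MySingleton;
// May execute as:
//   1. mem = allocate storage
//   2. instance_ = mem                <- pointer published, now non-null!
//   3. run MySingleton's constructor at mem
//
// A thread that reads instance_ between steps 2 and 3 sees a non-null
// reference to a not-yet-constructed object and happily returns it.
```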

In D, though, thread-local storage (TLS) is a first class member of the language. This leads to an interesting workaround that’s similar in spirit to double-checked locking but actually works. It’s a previously obscure pattern that was independently invented by at least myself and Alexander Terekhov, an IBM researcher working in Java. It never caught on for some reason, probably because of the abysmal performance of TLS in Java back when it was discovered, as illustrated by Doug Lea’s benchmarks from that era. Anyhow, the pattern is implemented in D below:

class MySingleton {
    static MySingleton get() {
        if (!instantiated_) {
            synchronized {
                if (instance_ is null) {
                    instance_ = new MySingleton;
                }
                instantiated_ = true;
            }
        }
        return instance_;
    }

private:
    this() {}

    static bool instantiated_;  // Thread local
    __gshared MySingleton instance_;
}

The idea is simple: We maintain a thread-local flag to track whether we’ve already entered the synchronized block from the current thread and ensured that instance_ is instantiated. The synchronized block is entered once per thread, amortized over a whole run of the program.
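To see the pattern in action, here is a minimal, self-contained sketch (the MySingleton class is repeated from above so the example compiles on its own; the makeWorker helper and thread count are my own additions) that spawns several threads and checks that they all observe the same instance:

```d
import core.thread : Thread;

// The TLS-flag singleton from the article, repeated so this example
// stands on its own.
class MySingleton {
    static MySingleton get() {
        if (!instantiated_) {
            synchronized {
                if (instance_ is null) {
                    instance_ = new MySingleton;
                }
                instantiated_ = true;
            }
        }
        return instance_;
    }

private:
    this() {}

    static bool instantiated_;  // Thread local
    __gshared MySingleton instance_;
}

// Helper so each thread's delegate captures its own slot index.
Thread makeWorker(MySingleton[] slots, size_t idx) {
    return new Thread({ slots[idx] = MySingleton.get(); });
}

void main() {
    auto seen = new MySingleton[4];
    Thread[] threads;
    foreach (i; 0 .. seen.length)
        threads ~= makeWorker(seen, i);
    foreach (t; threads) t.start();
    foreach (t; threads) t.join();

    // Every thread must have received the same instance.
    foreach (s; seen)
        assert(s is seen[0]);
}
```

Each thread pays for the synchronized block at most once; after its thread-local instantiated_ flag is set, every subsequent get() in that thread is just a flag check and a load.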

Benchmarks

Ideally, we’d like thread-safe Singletons that are as fast as thread-unsafe Singletons. How well does our low-lock TLS-based Singleton perform in practice? I created a micro-benchmark calling get() in a loop 1 billion times and ran it for naive, thread-unsafe Singletons (unsafe), Singletons that enter a synchronized block on every call to get() (synchronized), and the low-lock TLS-based Singletons (TLS). The results are shown in the graph below:
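The timing loop itself is straightforward. A sketch using modern Phobos (std.datetime.stopwatch; the stand-in unsafeGet accessor and the reduced iteration count are my own, not part of the original benchmark, which used one billion iterations of the real get()):

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Times `iters` calls to any no-argument accessor passed by alias,
// e.g. benchmark!(MySingleton.get)("TLS", 1_000_000_000).
void benchmark(alias get)(string label, size_t iters)
{
    size_t sink;  // keep the results live so the loop isn't optimized away
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. iters)
        sink += cast(size_t) cast(void*) get();
    sw.stop();
    writefln("%s: %s ms (sink=%s)", label, sw.peek.total!"msecs", sink);
}

// Stand-in for a thread-unsafe singleton accessor.
__gshared Object obj;
Object unsafeGet() {
    if (obj is null) obj = new Object;
    return obj;
}

void main()
{
    benchmark!unsafeGet("unsafe", 10_000_000);
}
```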

On GDC the overhead of TLS vs. unsafe is virtually undetectable. On DMD it’s noticeable but small. In both cases it pales in comparison to the overhead of synchronizing on every call to get(). Note that this benchmark was run in a single thread, so the mutex used to implement the synchronized statement was never contended. If it were, the results when synchronizing on every call to get() would be far worse.

There you have it: An obscure Java pattern that was ahead of its time, resurrected in the presence of more modern thread-local storage implementations in D, lets us have our cake and eat it too: Fast Singletons that are thread safe without any dependence on obscure memory model details.