Double-checked locking is notoriously evil. While it can be a boon to efficiency, it's tricky to do correctly. Lurking at its core are two issues fundamental to multi-threaded programming. Even if you never need this pattern, understanding atomicity and visibility is a must for any modern programmer.

The Pattern

Let’s look at what a double-check lock is. Rather than start with anything questionable, we’ll start with a relatively good form. Most programmers likely can’t explain why this code doesn’t work, yet understanding the problem is necessary if you ever intend to write safe multi-threaded code. We’ll use pseudo-code for now to avoid any language-specific issues.

SomeType globalObject = NULL;

SomeType getSomeType() {
	if( globalObject != NULL ) //1
		return globalObject;
	locked_section {
		if( globalObject ) //2
			return globalObject;
		globalObject = new SomeType; //3
	}
	return globalObject;
}

This pattern is used when you have an object that is initialized the first time it is required. In the above code you see that we start with a null object. At “1” we check whether it is initialized; if so, we can quickly return that value and avoid any mutex call. If not, we enter the locked section. At “2” we check again to avoid a race condition with another thread; inside the mutex this check is safe, with no race. If the object doesn’t exist, we create it and assign it to the global reference. We then leave the locked section and return the object.

locked_section: For most languages this will probably be a “mutex.lock” to enter the section and a “mutex.unlock” to leave the section. You might also have a “synchronized” keyword that would work.

The Atomic Problem

The simplest problem is one that most programmers will never actually experience. The reference we have to “SomeType” is some kind of value, most likely a pointer to the object in memory. These values are not a single byte in size. When you do your first check on “globalObject” perhaps some parts of the address have been set, but not all of them. The object isn’t null, but it isn’t the correct value either, thus you’ll be returning a broken reference.

//Thread A sets the value in four steps (four bytes)
globalObject.byte[0] = 0x12; //1
globalObject.byte[1] = 0x23; //2
globalObject.byte[2] = 0x45; //3
globalObject.byte[3] = 0x67; //4

//Thread B reaches the first check just after point "2" in the above,
//thus "globalObject == 0x00002312"
if( globalObject != NULL ) //this passes, since it isn't null
	return globalObject; //this returns a junk reference of 0x00002312

The reason most of us never have this issue is because most current hardware will copy the entire four bytes atomically. You’ll never see this tearing of the value. Even if the hardware isn’t doing this atomically, the race window is so small you might still never see it, except as random crash reports from your users.

To correct this you have to explicitly use atomic operations. Most languages don’t offer atomics and expect you to always use locked sections. Some languages, like Java, provide guarantees for basic atomic copying of object references. What each language offers varies greatly. You cannot assume the constructs are the same and therefore must take the time to carefully understand each language on its own.

The Visibility Issue

Once you add threads, the precise flow of a program becomes vitally important. For many people new to concurrency it may come as a surprise that what your code does isn’t the same as what you wrote. The compiler may optimize statements and reorder them as it sees fit. The CPU also tends to do reordering and out-of-order execution. For a surface-level example, consider how the following statement can be rewritten (in pseudo-code).

//what you wrote...
globalObject = new SomeType;

//...becomes
globalObject = allocate( sizeof SomeType );
SomeType.construct( globalObject );

The problem here is that “globalObject” gets a value prior to the object actually being created. This has a serious consequence for the statement “if( globalObject != NULL )” since that condition will now pass even though the object has yet to be created. You will return a pointer to uninitialized memory!

It bears repeating that regardless of how you order your assignment, the compiler and CPU can conspire together to produce this version of the code. On the CPU side, why this happens relates both to memory reordering and visibility. For this article it suffices to understand that concurrently running threads can have slightly different views of memory.

Refer also to “CPU Reordering” and “CPU Memory“.

The Solution

So we have two issues that need to be fixed. First, we need to ensure assignment of the pointer is atomic, to avoid any partial value. Second, we need a way to ensure our object is fully created before its pointer becomes visible to any other thread. Let’s expand the code to correct it.

atomic SomeType globalObject = NULL;

SomeType getSomeType() {
	if( globalObject != NULL ) {
		load-fence;
		return globalObject;
	}
	locked_section {
		if( globalObject )
			return globalObject;
		SomeType temp = new SomeType;
		store-fence;
		globalObject = temp;
	}
	return globalObject;
}

Again, the easier part to explain is the “atomic” label. Declaring a variable as atomic ensures that you will never have partial values contained in it. The value you read will either be the new value (after an assignment) or the old value (prior to the assignment); it will never be somewhere in between. This solves the problem of producing junk pointers.

The second fix is the pair of “fence” operations. We assign to a temporary variable prior to the “store-fence” and to the real variable after it. Then, before we read from the variable, we issue a “load-fence”. What this combination does is ensure that everything which happens before the store-fence is visible in its entirety after the load-fence. We have effectively said that the construction of the object must be entirely completed prior to setting the “globalObject” pointer, and when we read that pointer we ensure we can see the construction in its entirety prior to actually using it.

Updated: See the comment thread started by FerdinandB to understand why the `load-fence` is inside the if block.

Note that a fence is actually a fairly low-level CPU instruction. Most languages won’t provide direct access to this function. If you don’t have access to such functions, or an equivalent store/load facility you simply cannot implement a safe double-check locking. Note that some languages expose this functionality with the “volatile” keyword: be very careful as this meaning of volatile is distinct from its meaning in C and C++.

If a language does offer atomic variables it probably adds the fences implicitly in the reading and writing of those atomic variables. For example, the new C++ standard offers atomic variables which also have memory ordering semantics. Instead of using an explicit fence you can use a combination of “acquire” and “release” semantics on the atomic pointer.

Recap

For reasons of atomicity and visibility, best practices dictate that all shared memory be accessed within a locked section. This is not an unreasonable standard. Once you wish to break free of this pattern you will encounter a few major issues. Understanding them alone is non-trivial; consider that in the above we’ve really just glossed over what “visibility” is.

Double-checked locking is a relatively mild introduction to problems of advanced concurrency. This example alone is very useful in production code and can avoid the cost of a mutex, which can be costly in some circumstances. Just make sure you understand exactly what you are doing before implementing anything with atomic variables. If in doubt, stick to the normal locked section approach.

Classic Approach

I was asked to provide an example of the classic approach, so I’ll add that here. Note that there is actually no classic “double-check” pattern, since without the above fixes the pattern cannot work. This is why double-checked locking is considered evil: you simply can’t implement it correctly without atomics and fences. Instead you’re left using just a mutex-locked section like the one below.

SomeType globalObject = NULL;

SomeType getSomeType() {
	locked_section {
		if( globalObject )
			return globalObject;
		globalObject = new SomeType;
		return globalObject;
	}
}

This approach completely avoids the issues of atomicity and visibility. While easier to understand and program, it can be a bit less efficient. The added cost comes from the need to use a mutex on every call to the function. Exactly how much this costs depends greatly on the programming language, the operating system, and the underlying hardware. Refer to “How does a mutex work?“.