1. “before every read we have to access our write set…” This is in case we have already written in this location during the current transaction. The write hasn’t been reflected in shared memory yet — it’s still sitting in our write set. We want to read this new value, not the original. So it’s not a conflict, but rather a consistency requirement.

2. “The lock and version number are sampled in one load.” Version number is combined with the lock bit in a single memory word, so it can be loaded in one instruction. Now we have a copy of version/lock locally and we can check it. If the test fails, we abort. If it succeeds, Traditional lock-based approach to concurrent programming is error prone, hard to debug, and almost impossible to scale.

In my previous post I gave a preview of how concurrent programming might evolve in the future, and the topic of Software Transactional Memory (STM) came up quite naturally. STM has the potential to replace locking with an easy to use, virtually fool-proof, scalable paradigm for concurrent access.

It’s true that STM has its problems; performance and integration with existing languages being the serious ones. The hope is that performance can be improved, especially if there’s some support in hardware, and that a new batch of concurrent languages will incorporate STM more smoothly.

In this post I won’t be elaborating on the superiority of STM over other paradigms but rather concentrate on how it works. I hope my explanation will be more approachable than academic papers.

STM in a Nutshell

Let’s analyze a very simple example, a singleton pattern in C++ (I chose C++ because it exposes the use of pointers):

Foo * getFoo() { static Foo * pFoo = 0; if (pFoo == 0) // <- read pFoo = new Foo(); // <- write return pFoo; }

As you might know, this code doesn’t work correctly in a multithreaded environment. Let’s have a look at why, but from a slightly unusual angle. The statement:

if (pFoo == 0)

executes a memory read of the variable pFoo . The value that’s been read is usually stored in a register–in other words, is cached–at least until it’s compared with zero. Suppose that we read zero (a null pointer). We then proceed with the allocation and construction of the Foo object. We update the variable pFoo (a memory write) and return it to the caller.

The important thing is that the write to pFoo is only correct if the condition pFoo == 0 is maintained up to the moment when the new value is stored (we don’t even think of such subtleties in sequential programming).

But that condition may be broken by another thread storing its own creation in pFoo . It’s a classic data race, and its consequences may vary from a simple memory leak to a deadly memory fault (that might happen on some processors if the object Foo manages its own internal pointers). In general, it’s the gaps between logically related reads and writes that leave room for other threads to break our assumptions. If we could only concentrate those reads and writes into one atomic action, we’d be safe.

Let me propose a crazy solution: Before performing the write to pFoo , why don’t we re-check, or validate, our original read (assume that it’s still in the register). If the value in memory is still equal to the value in the register, we proceed with the write; otherwise we back off and try again.

Wait a moment! Aren’t we just postponing the problem? What about another thread messing with pFoo right between the read validation and the write? True, that is a problem, and we have to figure out a way to make the sequence read-validation/final-write atomic. But the beauty of this crazy approach is that it can be made universal. The code that does generic read verification may be compiled into the language runtime. It may be implemented once and for all, thoroughly tested, and used whenever concurrent access is needed.

All the programmer has to do is to inform the compiler that a logically related group of reads and writes must be done atomically; for instance, by enclosing the relevant code in the atomic block (not actual C++!):

Foo * getFoo() { static Foo * pFoo = 0; atomic { if (pFoo != 0) // read pFoo = new Foo(); // write(s), possibly reads } return pFoo; }

Here’s what happens inside the atomic block (there are many variations of the algorithm but the idea is the same).

First, a transaction is started–the runtime allocates a data structure containing two logs.

is started–the runtime allocates a data structure containing two logs. Every memory read inside the atomic block is logged in a log called the read set .

. Every memory write, instead of going directly to memory, is written into the write set (there are also versions of STM that let the writes go through, to be undone if the transaction aborts).

(there are also versions of STM that let the writes go through, to be undone if the transaction aborts). At the end of the atomic block the system attempts to commit the transaction. It does that by first verifying the read log and then performing the writes. (There is a very fine-grained system of locking in play, which I’ll describe shortly.)

block the system attempts to the transaction. It does that by first verifying the read log and then performing the writes. (There is a very fine-grained system of locking in play, which I’ll describe shortly.) If the read verification discovers a mismatch, the transaction is aborted and repeated from scratch, until it succeeds.

Some Details

There is one particular implementation of STM that’s been most widely accepted: Transactional Locking II (TL2), which I will summarize here.

Memory Layout

Imagine that for every single word of main memory you also have a separate data structure that stores the version number and a write lock for that word. How this can be done efficiently is an important implementation detail that I will explain later. For each memory word, its version number is initially set to zero and its lock is open.

Transaction

Now let’s start a transaction. We allocate a thread-local transaction object that has a read set and a write set, initially empty. We assign a version number to the transaction by (atomically) sampling the global “version clock” (it’s called a clock because it can only move forward). This will be our transaction’s read version number.

The compiler has conveniently instrumented all reads and writes inside the scope of the transaction (that is, in the atomic block). Each read adds an entry to our read set, storing the address of the word being read. Each write adds an entry to our write set, storing the address and the value to be written. The value in main memory is not changed at this point–the write is virtual.

Actually, before every read, we have to access our write set in case we have already written to that location. Moreover, we need to check that the location is not locked by another committing transaction, and that its version is less or equal to our transaction’s read version. This is why we need all those per-location locks and version numbers. If the location is locked or if its version number is larger than our current version, we abort the transaction and repeat the whole process again.

These checks are costly, and they are optimized to just a few instructions in the common case. So, for instance, the write-set check is implemented using a Bloom filter. Before every read we sample the corresponding lock and the version number (in one load, since it’s one word) and keep it for later. Then we check our write set using the Bloom filter. If we don’t find the corresponding entry, we just read the value directly from memory, otherwise we read it from our write log. Then we store the read address in our read set, and finally check the version number/lock that we have previously sampled. (On some processors, read barriers are required when accessing the locks.)

Going back to our singleton example, in Fig 1 you see the transaction corresponding right before committing. We have read the value of pFoo (directly from memory, since there was no entry in our write set yet) and logged the address of pFoo in our read set. We have checked that the version of pFoo , 6, was less than the version of our transaction, 8. We have also allocated a new Foo in memory. During the construction of Foo some two fields of Foo were (virtually) written and they have been saved in the write set together with the new value of pFoo (the pointer to the new Foo ).

At commit time, we lock all our write locations (there are three of those) using the per-word locks. Since we might be racing with other transactions, we should be careful not to cause deadlocks. The simplest precaution is to use bounded spinlocks. If we fail to acquire any lock within a set timeout, we abort the transaction.

Committing the Transaction

Sequencing and Validation During read set validation, we check the lock and the version number of each location that’s been read by our transaction. The problem is that this operation is not atomic (we don’t lock read locations). While we are validating, other transactions might modify the already validated locations. I’ll show you that it really doesn’t matter in the big scheme of things. In the big scheme of things, there is a sequence of transactions, each making an atomic update. So our program’s memory switches from one consistent state to another. When, precisely, this jump occurs is called the “sequence point.” As I mentioned, the sequence point for each transaction is reached after the write set is locked but before the read set is validated. Now imagine that we have two read locations, r1 and r2. We have validated r1 and, while we’re validating r2, a stray transaction commits a write to r1. What can be said about the sequence points of the two transactions? Our sequence point was set before we stated read validation. Since we have successfully validated r1, it hadn’t been locked by the stray transaction. It means that the other transaction’s sequence point must have been set after ours; and we don’t care about “future” transactions. All we are concerned about at this point is whether an earlier transaction, with an earlier sequence point (which is, however, greater than our read version), hasn’t overwritten those locations.

At this point we atomically increment and fetch the global version clock–this will be our transaction’s write version, not to be confused with the earlier read version. We haven’t committed the transaction yet but, if we commit it, this will be its official commit point, a.k.a. sequence point.

Now we can validate the read set at our leisure. We make sure that the read locations are not locked by another thread, and that their version numbers are still less or equal to our transaction’s read version. This is a slightly tricky point–see the sidebar.

In our example, the locks for pFoo and for the two fields inside the newly allocated Foo will be taken, the clock incremented to 9, and the read set validated.

The read set is small–just the location of the variable pFoo . It is not locked by another thread (although it’s locked by us, because it is also present in our write set), and the version number is still less than 8.

The final step is to commit the write set. Each location is updated and then unlocked. We also set the version number for each modified location to the write version of our transaction (the value of the version clock at our sequence point). In our case, we change the version number of pFoo from 6 to 9.

You might be wondering why we verify each read at least twice–once while executing the instrumented client code, and once during validation. After all, if we read the wrong version of the variable, the transaction will fail anyway during verification. The problem is that, after reading a bunch of inconsistent values, the client code might get into all kinds of trouble, including infinite loops or access violations. This used to be a major concern before TL2.

Reasons to Abort

Let’s consider what can go wrong in our example that would force us to abort (and retry).

During the first read of pFoo : We find it locked. That could happen if another transaction is just committing its write to pFoo .

. It’s version number is greater than our transaction’s read version number. That could happen if another transaction had just committed its own write. During pre-commit locking, we find pFoo locked by another transaction. (We let that transaction proceed.) During read validation of pFoo : The location is locked by another transaction. (This can’t happen in our case because we have locked it ourselves.)

Its version number is greater than our read version number. This could happen if another transaction committed its write to pFoo after we have read it last.

You can easily convince yourself that in each case it really makes sense to abort the transaction. Moreover, when we retry the transaction, pFoo will have already been initialized by another thread. In that case the transaction will just breeze through, almost as fast as the lock-based implementation.

Optimizations

Note that during our transaction other non-conflicting transactions could have committed successfully. Each such transaction increments the version lock. This is why our write version could be arbitrarily larger than our read version. On the other hand, if the write version is just one unit larger that the read version, we know that we were alone, and we can skip the read validation altogether. In fact this is what would have happened in our example where the read version was 8 and the write version was 9.

Another great optimization is read-only transactions. Transactions that don’t perform any writes don’t have to log the reads. It’s enough that they perform the lock/version check at the point of each read. If all reads are successful, the transaction is done. It has seen a consistent state of memory.

It’s worth noting that atomic operations in TL2 are non-blocking. They are usually implemented using the CAS instruction (Compare and Swap). The locking of individual memory words is done using bounded spinlocks.

Conserving Memory

Let’s see how we can implement all the locks and version numbers without doubling or tripling our computer’s memory. To begin with, we can combine the lock with the version number in one word–a versioned lock. We only need one bit for a spinlock and we can stick it at the bottom of the version number, as long as we only use even version numbers.

The second trick is to use a reasonably sized table of versioned locks and hash all memory locations into that table. Since there will be many locations that map into the same lock, spurious aborts may happen. Notice however that spurious aborts have no impact on the semantics of the program. Aborted transactions will be retried until they commit. In practice such spurious conflicts happen reasonably rarely.

Performance

With the usual caveat that there are lies, damned lies, and benchmarks; the performance of STM clocks at about half the performance of hand-crafted locks (this is more or less the consensus number–see the interview with Peyton-Jones and Harris for confirmation). Which is not bad, considering all the instrumentation that goes into it. But if you have a program that spends 90% of its time concurrently inserting and removing items from a red-black tree, you might consider hiring a team of highly experienced programmers and testers to create a faster, lock-based, or even lock-free, solution. But take into account an important fact: STM is very fine grained. For instance, when you’re inserting an item into a tree, the STM transaction will only lock the nodes that you are actually modifying. STM will easily beat a solution that uses one global lock per whole tree. Per-node manual locking is hard to implement correctly because of the risk of deadlocks.

By its optimistic nature, STM performs best under low contention scenarios where the vast majority of transactions commit on the first try.

What’s Next?

Will performance problems interfere with a wider acceptance of STM? There is an interesting analogy between garbage collection (GC) and STM put forward by Dan Grossman. Historically, performance concerns were a major obstacle in the acceptance of GC until a general-purpose language, Java, adopted it as the memory management tool. If the analogy holds, the same will happen to STM and shared-memory concurrency.

STM has a permanent spot, and probably the best implementation, in Haskell (doesn’t it remind you of the relationship between Lisp and GC?). There’s been some STM activity in C++, Java, and other languages (see Bibliography) but, without language support, STM doesn’t offer enough bang for the buck in terms of safety and ease of use. That’s why I’m really excited about the use of STM in a batch of new high-performance languages, in particular in Chapel. Will Chapel become to STM what Java became to GC? Only time will tell.

I’ve been following the progress of the Chapel’s implementation of STM and talked to some of the developers. In distributed environments, STM can perform outstandingly. That’s because the overhead of STM read/write logging (which is done locally) is dwarfed by the usual distributed communication costs. The trick is to piggyback STM communications on top of remote reads and writes, which have to happen anyway. Chapel also has a chance to implement a type system that makes STM safer to use. I hope to discuss those options in a future post.

Bibliography