The BlueGene/Q processors that will power the 20 petaflops Sequoia supercomputer being built by IBM for Lawrence Livermore National Laboratory will be the first commercial processors to include hardware support for transactional memory. Transactional memory could prove to be a versatile solution to many of the issues that currently make highly scalable parallel programming a difficult task. Most research so far has been done on software-based transactional memory implementations. The BlueGene/Q-powered supercomputer will allow much more extensive real-world testing of the technology and concepts. The inclusion of the feature was revealed at Hot Chips last week.

BlueGene/Q itself is a multicore 64-bit PowerPC-based system-on-chip based on IBM's multicore-oriented, 4-way multithreaded PowerPC A2 design. Each 1.47 billion transistor chip includes 18 cores. Sixteen will be used for running actual computations, one will be used for running the operating system, and the final core will be used to improve chip reliability. For BlueGene/Q, a quad floating point unit, capable of up to four double-precision floating point operations at a time, has been added to every A2 core. At the intended 1.6GHz clock speed, each chip will be capable of a total of 204.8 GFLOPS within a 55 W power envelope. The chips also include memory controllers and I/O connectivity.
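The 204.8 GFLOPS figure can be reproduced with some simple arithmetic. The sketch below uses the core count, clock speed, and quad-wide floating point unit given above; the factor of two per lane is an assumption (a fused multiply-add counted as two operations), not something IBM's presentation is quoted as saying here.

```python
# Back-of-the-envelope check of the per-chip peak quoted in the article.
COMPUTE_CORES = 16   # from the article: 16 of the 18 cores run computations
CLOCK_HZ = 1.6e9     # from the article: intended 1.6GHz clock speed
SIMD_LANES = 4       # from the article: quad floating point unit
OPS_PER_LANE = 2     # ASSUMPTION: one fused multiply-add counted as 2 operations
POWER_WATTS = 55     # from the article: 55 W power envelope

peak_gflops = COMPUTE_CORES * CLOCK_HZ * SIMD_LANES * OPS_PER_LANE / 1e9
gflops_per_watt = peak_gflops / POWER_WATTS

print(f"{peak_gflops:.1f} GFLOPS peak, {gflops_per_watt:.2f} GFLOPS per watt")
```

Under that assumption the numbers line up exactly with the quoted peak, and work out to a little under 4 GFLOPS per watt.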

The 18th core is a redundant spare. If a fault is detected in one of the chip's cores, the core can be disabled and transparently mapped to the redundant spare. Detection and remapping of faulty cores can be done at any stage of the system's manufacture: not just when the chip wafer is being tested, but also after the chip has been installed in Sequoia. Sequoia will use about 100,000 of the chips in total to reach its 20 petaflops target. Sequoia's huge scale makes the ability to remap faulty cores important: IBM estimates that given the number of chips in the supercomputer, one chip will fail every three weeks, on average.
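The "about 100,000" figure follows from the per-chip peak. As a rough illustration, the sketch below divides the 20 petaflops target by 204.8 GFLOPS, then works out what one failure every three weeks would imply for an individual chip. That second step assumes failures are independent and evenly spread across chips, which is a simplification of ours rather than anything IBM stated.

```python
import math

TARGET_FLOPS = 20e15        # from the article: 20 petaflops
CHIP_PEAK_FLOPS = 204.8e9   # from the article: 204.8 GFLOPS per chip
WEEKS_PER_FAILURE = 3       # from the article: one chip failure every three weeks

# Minimum number of chips needed to reach the peak target.
chips_needed = math.ceil(TARGET_FLOPS / CHIP_PEAK_FLOPS)

# ASSUMPTION: independent failures, evenly spread over all chips.
# Then each individual chip fails once per (chips * 3 weeks) on average.
years_per_chip_failure = chips_needed * WEEKS_PER_FAILURE / 52

print(chips_needed, round(years_per_chip_failure))
```

In other words, each individual chip can be extremely reliable and the machine as a whole will still see regular failures, simply because there are so many chips.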

Traditional multithreading: locks and serialization

Transactional memory is an approach to parallel programming that has the potential to make efficient parallel programming a great deal easier than it is currently. Parallel programming is easy when a task can be broken up into many independent threads that don't share any data; each part can run on a processor core, and no coordination between cores is necessary. Things get more difficult when the different parts of the task aren't completely independent—for example, if different threads need to update a single value that they share.

The traditional solution is to use locks. Every time a thread needs to alter the shared value, it acquires the lock. No other thread can acquire the lock while one thread holds it; they just have to wait. The thread with the lock can then modify the shared value (which may require a complex computation, and hence can take a long time), and then release the lock. The release of the lock in turn allows the waiting threads to continue. This system works, but it has a number of problems in practice. If updates to the shared value occur only infrequently—and hence, it's rare for a thread to ever have to wait—the lock-based system can be very efficient. However, that efficiency tends to rapidly diminish whenever updates to the shared value are frequent: threads spend a lot of their time waiting for the lock to become available, and can't do any useful work while they're waiting.
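To make the lock-based pattern concrete, here is a minimal sketch in Python using the standard `threading` module. The `LockedCounter` class and `run_workers` helper are names invented for this illustration. Every thread must take the same lock before touching the shared value, so the updates are correct but happen strictly one at a time.

```python
import threading

class LockedCounter:
    """A shared value protected by a single lock (illustrative name)."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time gets past this line; the others wait here.
        with self._lock:
            self.value += 1

def run_workers(counter, num_threads, increments_per_thread):
    """Start several threads that all update the same counter, then wait for them."""
    def worker():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

if __name__ == "__main__":
    print(run_workers(LockedCounter(), num_threads=4, increments_per_thread=10_000))
```

The more often `increment` is called, the more time the threads spend queued up behind the lock rather than doing useful work.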

Locks also prove quite difficult for programmers to use correctly. Though the case of a single shared value is easy to handle, real programs are rarely so simple. A program with two locks, A and B, is susceptible to a problem called deadlock. If two threads need both locks, they have a choice; they can either acquire lock A followed by lock B, or they can acquire lock B followed by lock A. As long as every thread acquires the locks in the same order, there's no problem. However, if one thread acquires lock A first, and the other acquires lock B first, then the two threads can get stuck—the first waits for lock B to become free, the second waits for lock A to become free, and neither can ever succeed. This is a deadlock.

This problem might seem easy to avoid, and indeed when a program only has two locks, it normally is—but it becomes harder to ensure that every part of the program does the right thing as the program becomes more complex. Add more locks, for other bits of shared data, and it becomes harder still.
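The usual discipline for avoiding deadlock, acquiring locks in one agreed order everywhere, can be sketched as follows. The `Account` class, its `rank` field, and the `transfer` function are hypothetical names used only for illustration. Each object gets a fixed position in a global ordering, and any code that needs two locks always takes the lower-ranked one first.

```python
import itertools
import threading

_ranks = itertools.count()  # hands out a fixed position in the global lock order

class Account:
    """A piece of shared data with its own lock (illustrative name)."""

    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()
        self.rank = next(_ranks)

def transfer(src, dst, amount):
    """Move `amount` from src to dst, always locking in rank order."""
    if src is dst:
        return  # nothing to do, and taking the same lock twice would hang
    # Sorting by rank means two opposite transfers still lock in the same order.
    first, second = sorted((src, dst), key=lambda account: account.rank)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount
```

If `transfer` simply locked `src` and then `dst`, a transfer from A to B running alongside one from B to A could hit exactly the deadlock described above. The catch is that every piece of code touching these locks has to follow the same rule, which is what becomes hard to guarantee as a program grows.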

Transactional memory: the end of locks

Transactional memory is designed to solve this kind of problem. With transactional memory, developers mark the portions of their programs that modify the shared data as being "atomic." Each atomic block is executed within a transaction: either the whole block executes, or none of it does. Within the atomic block, the program can read the shared value without locking it, perform all the computations it needs to perform, and then write the value back. At the end, it commits the transaction. The clever part happens with the commit operation: the transactional memory system checks to see if the shared data has been modified since the atomic operation was started. If it hasn't, the commit just makes the update and the thread can carry on with its work. If the shared value has changed, the transaction is aborted, and the work the thread did is rolled back. Typically when this happens, the program will simply retry the operation.
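The read, compute, commit, and retry cycle can be sketched in software. This is a toy model for a single shared value, not a description of any real transactional memory system: `VersionedCell` and `atomic_update` are invented names, and the small internal lock exists only to emulate the atomic commit step that a real implementation would provide.

```python
import threading

class VersionedCell:
    """A shared value plus a version number that changes on every commit."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        # ASSUMPTION: this guard only stands in for an atomic commit primitive.
        self._guard = threading.Lock()

    def begin(self):
        """Start a transaction: take a snapshot of the value and its version."""
        with self._guard:
            return self.value, self.version

    def commit(self, new_value, seen_version):
        """Apply the update only if nobody else has committed since begin()."""
        with self._guard:
            if self.version != seen_version:
                return False  # conflict: the transaction is aborted
            self.value = new_value
            self.version += 1
            return True

def atomic_update(cell, compute):
    """Run `compute` on the shared value as a transaction, retrying on conflict."""
    while True:
        value, version = cell.begin()
        new_value = compute(value)  # potentially long work, done without a lock
        if cell.commit(new_value, version):
            return new_value
        # Commit failed: the work is discarded and the loop tries again.
```

Note that `compute` runs without holding anything: threads only pay a price when two of them really do collide.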

Transactional memory potentially offers a number of advantages over the lock-based scheme. First, it's optimistic: instead of each thread needing to acquire a lock just in case another thread tries to perform a concurrent operation, the threads assume that they'll succeed. It's only in the case of actual concurrent modifications that one thread will be forced to retry its work. Second, there's no deadlock scenario, since there are no locks. Third, the programming model is, broadly speaking, one that developers are quite familiar with; the notion of transactions and roll-back is familiar to most developers who've used relational databases, as they offer a somewhat similar set of features. Fourth, atomic blocks arguably make it a lot easier to construct large, correct programs: an atomic block with nested atomic blocks will do the right thing, but the same isn't necessarily true of lock-based programs.

(It's worth pointing out that transactional memory has a number of complexities of its own: for example, what if a transaction needs to do something that can't be rolled back, like sending data over a network or drawing on the screen? The best way to approach this kind of issue, and many others, is still an area of active research.)

The hardware advantage

Up until now, transactional memory research has mostly focused on software-based implementations. Existing commercial processors don't support transactional memory in hardware, so it has to be emulated in some way. Some schemes make use of virtual machines to do this (there are transactional memory modifications for the .NET and Java virtual machines, for example); others use native code, and require programmers to use special functions for accessing shared data, so that the transactional memory software can ensure the right things happen in the background. A consistent feature of all of these implementations is that they tend to be slow, sometimes very slow. Although transactional memory makes it easier to produce bug-free programs, careful use of locks (or other multithreading techniques) can yield much greater performance.

BlueGene/Q moves transactional memory into the processor itself. It's the first commercial processor to do so, though Sun's Rock processor, cancelled when the company was purchased by Oracle, would also have included a transactional memory capability. The transactional memory implementation is predominantly found in the chip's 32MB level 2 cache. IBM did not go into great detail about the system at Hot Chips, but it did outline how it works. Data in cache has a version tag, and the cache can store multiple versions of the same data. Software tells the processor to begin a transaction, does the work it needs to do, and then tells the processor to commit the work. If other threads have modified the data, creating multiple versions, the cache rejects the transaction and the software must try again. If other versions weren't created, the data is committed.

The same versioning facility can also be used for speculative execution. Instead of having to wait for up-to-date versions of all the data it needs—which might require, for example, waiting for another core to finish a computation—a thread can begin executing with the data it has, speculatively performing useful work. If it turns out that the data was up-to-date, it can commit that work, giving a performance boost: the work was done before the final value was delivered. If it turns out that the data was stale, the speculative work can be abandoned, and re-executed with the correct value.
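The idea of speculating on possibly stale data can be illustrated in software, though this is only an analogy for what the hardware does. In the sketch below, `speculate` is a hypothetical helper: it starts fetching the authoritative input in the background, does the work immediately with the value it already has, and keeps the result only if the input turns out to have been current.

```python
from concurrent.futures import ThreadPoolExecutor

def speculate(cached_input, fetch_authoritative, work):
    """Return (result, speculation_paid_off).

    cached_input:        the possibly stale value available right now
    fetch_authoritative: slow call that returns the up-to-date value
    work:                the computation to run on the input
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_authoritative)   # slow path runs in the background
        speculative_result = work(cached_input)      # useful work done in the meantime
        actual_input = pending.result()

    if actual_input == cached_input:
        return speculative_result, True   # guess was right: commit the early work
    return work(actual_input), False      # guess was stale: discard and redo
```

When the guess is right, the result is ready as soon as the authoritative value arrives. When it is wrong, the only cost is the wasted speculative work.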

A logical evolution

The transactional memory support is in some ways a logical extension of a feature that has long been a part of the PowerPC processor, "load-link/store-conditional," or LL/SC. LL/SC is a primitive operation that can be used as a building block for all kinds of thread-safe constructs. This includes both well-known mechanisms, like locks, and more exotic data structures, such as lists that can be modified by multiple threads simultaneously without any locking at all. Software transactional memory can also be created using LL/SC.

LL/SC has two parts. The first is the load-link. The program uses load-link to retrieve a value from memory. It can then perform the work it needs to do on that value. When it's finished, and needs to write a new value back to memory, it uses store-conditional. Store-conditional will only succeed if the memory value has not been modified since the load-link. If the value has been modified, the program has to go back to the beginning and start again.
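Real LL/SC is a pair of processor instructions, but its behavior can be modeled in a few lines. The following is an emulation for illustration only: the `LLSCWord` class and its method names are invented, and an ordinary lock stands in for the atomicity that the hardware provides.

```python
import threading

class LLSCWord:
    """Toy model of one memory word supporting load-link/store-conditional."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()   # ASSUMPTION: emulates hardware atomicity
        self._reservations = set()       # threads whose load-link is still valid

    def load(self):
        with self._guard:
            return self._value

    def load_link(self):
        """Read the value and remember that this thread is watching it."""
        with self._guard:
            self._reservations.add(threading.get_ident())
            return self._value

    def store(self, new_value):
        """An ordinary write: invalidates every outstanding reservation."""
        with self._guard:
            self._value = new_value
            self._reservations.clear()

    def store_conditional(self, new_value):
        """Write only if nothing has been written since this thread's load-link."""
        with self._guard:
            if threading.get_ident() not in self._reservations:
                return False
            self._value = new_value
            self._reservations.clear()   # other watchers must now start over
            return True

def atomic_increment(word):
    """The classic LL/SC retry loop."""
    while True:
        old = word.load_link()
        if word.store_conditional(old + 1):
            return old + 1
```

The retry loop in `atomic_increment` has the same shape as a transaction: read, compute, attempt to publish, and start over if somebody else got there first.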

LL/SC is in fact found on many processors: PowerPC, MIPS, ARM, and Alpha all use it. x86 doesn't; it has an alternative mechanism called "compare and swap." Most LL/SC systems are quite restrictive. For example, they may not be able to track writes to individual bytes of memory, but only entire cache lines, meaning that the SC operation can fail even if the monitored value wasn't actually modified. SC will also typically fail if, for example, a context switch (which generally clears the processor's reservation) occurs between the LL and the SC. Some implementations will even make the SC fail if any value gets written to memory between the LL and the SC.
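The cache-line restriction is easy to demonstrate with a toy model. In this sketch, `GranuleMemory` is an invented class in which a reservation covers a whole group of neighboring words, and the group size of four is an arbitrary choice. A write to a neighbor in the same group makes the store-conditional fail, whereas a compare-and-swap, which looks only at the value itself, still succeeds.

```python
class GranuleMemory:
    """Toy memory whose LL/SC reservation covers a whole granule of words."""

    def __init__(self, num_words, words_per_granule=4):
        self.words = [0] * num_words
        self.words_per_granule = words_per_granule
        self.reserved_granule = None   # models a single hardware thread's reservation

    def _granule(self, address):
        return address // self.words_per_granule

    def load_link(self, address):
        self.reserved_granule = self._granule(address)
        return self.words[address]

    def store(self, address, value):
        # Any write landing in the reserved granule kills the reservation,
        # even if it targets a different word.
        if self.reserved_granule == self._granule(address):
            self.reserved_granule = None
        self.words[address] = value

    def store_conditional(self, address, value):
        succeeded = self.reserved_granule == self._granule(address)
        self.reserved_granule = None
        if succeeded:
            self.words[address] = value
        return succeeded

    def compare_and_swap(self, address, expected, value):
        # Compare-and-swap checks only the current value of the word itself.
        if self.words[address] != expected:
            return False
        self.store(address, value)
        return True
```

For example, after `v = mem.load_link(0)`, a call to `mem.store(1, 42)` touches only a neighboring word, yet `mem.store_conditional(0, v + 1)` now returns `False`, while `mem.compare_and_swap(0, v, v + 1)` would succeed.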

Transactional memory is a kind of LL/SC on steroids: each thread in a transaction can, in effect, perform an LL on many different memory locations, and the commit operation performs a kind of SC that takes effect on those multiple locations simultaneously, with the stores either all succeeding or all failing together.
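Extending the idea from one location to many might look something like the following software sketch. `TransactionalStore` and `transfer` are hypothetical names, and the internal lock again only emulates an atomic commit. The transaction records the version of every location it read, and the commit applies all of its writes only if none of those versions has changed.

```python
import threading

class TransactionalStore:
    """Toy multi-location store: each key holds a (value, version) pair."""

    def __init__(self, initial):
        self._data = {key: (value, 0) for key, value in initial.items()}
        self._guard = threading.Lock()   # ASSUMPTION: emulates an atomic commit

    def read(self, key):
        """Return (value, version); acts like an LL on this location."""
        with self._guard:
            return self._data[key]

    def commit(self, read_versions, writes):
        """Apply every write, or none of them if any location read has changed."""
        with self._guard:
            for key, seen_version in read_versions.items():
                if self._data[key][1] != seen_version:
                    return False   # conflict: nothing is written
            for key, value in writes.items():
                self._data[key] = (value, self._data[key][1] + 1)
            return True

def transfer(store, src, dst, amount):
    """Move `amount` between two distinct keys as one all-or-nothing transaction."""
    while True:
        src_value, src_version = store.read(src)
        dst_value, dst_version = store.read(dst)
        writes = {src: src_value - amount, dst: dst_value + amount}
        if store.commit({src: src_version, dst: dst_version}, writes):
            return
```

No other thread can ever observe the money having left one account without arriving in the other, and no ordering rules are needed to get that guarantee.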

Will it deliver?

The implementation of the transactional memory itself is complex. Ruud Haring, who presented IBM's work at Hot Chips, claimed that "a lot of neat trickery" was required to make it work, and that it was a work of "sheer genius." After careful design work, the system was first built using FPGAs (chips that can be reconfigured in software) and, remarkably, it worked correctly the first time. As complex as it is, the implementation still has its restrictions: notably, it doesn't offer any kind of multiprocessor transactional support. This isn't an issue for the specialized Sequoia, but it would be a problem for conventional multiprocessor machines: threads running on different CPUs could make concurrent modifications to shared data, and the transactional memory system won't detect that.

BlueGene/Q's hardware support allows use of transactional memory with little or no performance penalty. This lack of performance penalty in turn means that transactional memory can be used in real-world software, to see if it really is as useful in practice as it appears to be in theory. Haring said that the transactional memory "feels good," but the team was still tuning compilers and software support, so it did not yet have any real-world data.

As specialized as Sequoia is, the insight it will give into the utility of transactional memory will be invaluable. The combination of ease-of-use advantages for programmers and the performance potential (both of transactional memory and speculative execution) makes transactional memory very appealing. Software implementations, however, have for the most part reached a performance impasse: so severe are the performance issues that they put the entire approach in jeopardy. If this hardware implementation proves successful, it could be the first of many. But if it doesn't work out, failing to deliver the performance and reliability that transactional memory is assumed to provide, it could sound the death knell for a once promising solution to the multicore conundrum.