In a recent post , I called out some of the past year's activity in the area of transactional memory (TM) . This post takes a closer look at hardware transactional memory (HTM) , with an eye to where it might fit into the parallel programmer's toolbox.

But first, what is HTM?



Current HTM implementations make use of processor caches and speculative execution to ensure that a designated group of statements (called a “transaction”) executes atomically from the viewpoint of any other transaction. The beginning and end of a transaction is marked by begin-transaction and commit-transaction machine instructions, respectively, and might also include abort-transaction instructions. The transaction is executed within the confines of its CPU, so that its work is not visible to other CPUs until the commit-transaction instruction is executed. An abort-transaction instruction squashes the speculative execution, discarding any changes that the transaction might have carried out, while also branching to a failure handler. The location of the failure handler is normally specified by the begin-transaction instruction, either as an explicit address or by means of condition codes. This failure handler might then either: (1) retry the transaction, (2) execute a fallback code sequence, for example, using locking, or (3) otherwise handle the transaction's failure. In this way, each transaction either executes atomically with respect to all other transactions, or is aborted with no changes to shared state.



So where does HTM fit into the parallel programmer's toolbox? Ideally, we would like a classification of the applicability of HTM similar to that shown for RCU in the diagram below:







On the other hand, I have been developing, using, and maintaining RCU in production for almost two decades. Because the oldest commercially available HTM implementation is still quite young by comparison, any attempt to similarly classify its use must necessarily rely on educated guesses, extrapolation, and speculation. But I am most certainly not going to let that stop me from making a first attempt! :-)



The remainder of this posting will look at a number of questions that need to be asked of current commercially available HTM implementations, but without focusing on any particular implementation. The answers will be primarily about HTM in the publicly documented here-and-now, though some additional speculation about possible future implementations will inevitably leak in. This additional speculation will draw on selected academic research projects.





Does HTM Belong In The Parallel Programmer's Toolbox?

The first question we need to ask is whether HTM has any place at all in the parallel programmer's toolbox. After all, as the new entrant, it must prove its worth compared to any number of concurrency-control mechanisms that have been in production use for decades. In addition, there has been no shortage of technologies that have roused great excitement but eventually failed of their promise, a historical tendency that is greatly illuminated by Ioannidis's Sixth Corrollary.



Ioannidis notwithstanding, it seems quite likely that the answer to this first question is in the affirmative. To quote Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory, “An important TM near-term opportunity is thus update-heavy workloads using large non-partitionable data structures such as high-diameter unstructured graphs.” (Disclaimer: I co-authored this paper, which is a revision of the 2007 PLOS paper by the same name with revised presentation here.) Although there are other mechanisms that can be used to handle large update-heavy non-partitionable data structures, these other methods suffer from cache-locality problems. There is thus hope that HTM can provide performance advantages in this area. In addition, use of HTM for transactional lock elision promises to greatly reduce the number of cache misses associated with lock acquisition and release, which is likely to provide large performance advantages to HTM for short lock-based critical sections—at least in cases where none of the oft-written variables protected by the elided lock share a cache line with that lock.





Should HTM Be Used Universally?

Given that HTM very likely has a place in the parallel programmer's toolbox, the logical next question to ask is whether parallel programmers can simplify their lives by just using HTM for everything.



The answer to this question is an emphatic “no” for the following reasons:



We can see that HTM is useful on the one hand, but has substantial limitations on the other. Therefore, the next section looks at where HTM is most likely to be helpful.





Where is HTM Most Likely to be Helpful?

HTM is likely to be at its best for large in-memory data structures that are difficult to statically partition but that are dynamically partitionable, in other words, the conflict probability is reasonably low. There must be a reasonable non-TM fallback algorithm for every transaction. The workload should ideally be update-heavy with small accesses and updates, and not subject to aggressive real-time constraints. Finally, if HTM is used for transactional lock elision, any empty critical sections must continue to use explicit locking.



Why is this situation best for HTM?



A thought experiment involving a red-black tree might be helpful.



First, consider a red-black tree that supports insertions, deletions, and lookups of single elements, but where all of these APIs return the exact number of elements in the tree. In this case, the resulting transactions will be very prone to conflicts on the variable in which the element count is maintained, resulting in poor performance and scalability, especially for update-heavy workloads.



This should not be surprising, because returning the size causes insertion and deletion to be strongly non-commutative. Maintaining and returning a consistent count is simply bad for parallelism.



Therefore, let's next consider a red-black tree with insertion, deletion, and lookup, but without the exact count of the number of elements in the tree. In this case, if the tree is large, the conflict probabilities between random insertion, deletion, and lookup operations is extremely low. In this case, HTM is likely to perform and scale quite well.



Finally, let's add an API that enumerates all the the nodes in the red-black tree. A transaction implementing this API will conflict with all concurrent insertion or deletion operations, severely limiting performance and scalability. In many cases, this situation can be alleviated by using hazard pointers or RCU to protect these large read-only accesses. Although the RCU/hazard-pointer reader can still conflict with updates, the conflict window is now limited to each single read-side load, instead of extending across the entire read-side operation, as is the case for transactions. (RCU and hazard pointers could help even more if their loads didn't generate conflicts, but instead returned the old value of any variable written by an in-flight transaction, but current HTM implementations do not appear to do this.)



This last example illustrates the importance of using combinations of synchronization mechanisms in ways that play to each mechanism's strengths.





Summary

In summary, although it appears that HTM will be able to earn a place in the parallel programmer's toolbox, it is not a magic wand with which to wish away all concurrency problems. Like every other tool in the toolbox, it will be necessary to carefully consider whether HTM is the right tool for a particular job. Furthermore, like every other tool in the toolbox, it may in some cases be necessary to tune, restructure, or even redesign your code in order to obtain the full benefits of HTM.



HTM should therefore be an interesting and exciting learning experience for all concerned. ;-)



(Thanks to a great number of people, especially Jon Walpole, Josh Triplett, and Andi Kleen for a number of illuminating discussions on TM in general and HTM in particular.)