Most multi-threaded software uses locking. Lock optimization traditionally has aimed to reduce lock contention, that is make the critical regions smaller. In optimized software, this often results in a lot of very small critical regions, protected by many locks. Each critical region does only a little work, before releasing the lock and potentially letting some other CPU access the same data.

This is fine as long as these locks are completely uncontended. But when the lock was owned by another CPU before, even if it is currently uncontended, it will take quite some time to transfer the cache line containing the lock from the previous owner to the next. Essentially, each lock acquire -- even when not actually blocking -- has to wait for the latency of communicating between different CPUs.

It's useful to think of a modern multi-core system as a network, where each CPU core is a network node. Transferring data between them can take 40-50ns and more just inside a socket, and more if it is transferred between sockets. On larger systems the latencies generally increase. The CPU can execute a large number of instructions in the same time as it takes to transfer information. This means communication is very expensive.

Now, as in any distributed system, optimization often becomes an exercise in avoiding communication latencies. When the time taken to do the work in the critical section is much smaller than the communication latency, and the system executes a lot of critical sections that need to fetch the lock each, then it will spend far more time communicating than doing actual work. Increasing the amount of work per critical section also typically lowers the total number of times locks need to be acquired (assuming constant work), which can lower the locking overhead significantly.

A typical critical section processes one object. An example would be network packet processing applications, where a lock may need to be taken for every incoming packet to look up the state of the connection in a global table and do some checks. As the hash lookup and other checks are fast, this often results in a very short critical section. An obvious way to improve the overall performance is to take the global lock only once for multiple packets and amortize the time it takes to transfer with more work. This is called "lock batching" or "lock coarsening". The critical sections are explicitly tuned to be longer. Similar techniques can be used in many other situations dealing with locks.

Clearly there should be a balance between amortizing transfer latencies, and not increasing the critical section time too much to cause other CPUs to block for significant times. This can be tuned by adjusting the number of objects "batched" inside the lock. The more objects are processed per lock acquire, less time will be wasted in communication. But the value should be also not so long that blocking time starts to show up significantly. The best value typically depends on the transfer latencies in the system. A system with a single socket needs less, a system with many sockets has longer latencies and needs more.

Another factor to consider would bethat if the critical section time is shorter than the transfer latency, the performance can become non deterministic. Relatively small changes in the latency (e.g. from a different version of the micro architecture or frequency changes) can cause large swings in performance.

On a modern 2-socket server, at least several hundred nanoseconds duration are recommended for critical sections, to get both good and deterministic performance.

It can also help significantly to partition locks per socket. The transfer time between sockets is far longer than between the cores inside a single die. Limiting the lock usage to a single socket lowers the overall communication overhead significantly. As core counts increase it may even help to partition by smaller groups of cores.

One technique to do that is to split up reader locks into "big reader locks": use multiple reader locks partitioned by a group of cores. A read thread only takes its assigned reader lock. A writer takes all reader locks. This improves performance of readers (less and cheaper communication) at the cost of more expensive writers.

Together with some colleagues (Jeff Chamberlain and others) we recently published http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf that explores these issues in more detail.

Glossary: