An alternative to locks is CAS (Compare And Swap) operations. A CAS operation is a machine instruction that allows a word to be conditionally set as an atomic operation (all or nothing).

In a nutshell, the expected (old) and new values of a variable are provided as parameters. If the current value matches the expected one, the variable is atomically updated to the new value. Otherwise, the operation fails and it is up to the thread to retry with another CAS operation.

In Java, Atomic* classes like AtomicInteger are based on this operation.
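To illustrate, here is a hedged sketch of a CAS retry loop built directly on AtomicInteger.compareAndSet (in practice you would simply call incrementAndGet, which does this internally; class and method names are illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {

    // Retry loop: read the current value, compute the new one, and only
    // commit if no other thread changed the variable in between.
    static int incrementWithCas(AtomicInteger value) {
        while (true) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next; // the CAS succeeded
            }
            // Another thread won the race: loop and retry with a fresh read.
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger(0);
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                incrementWithCas(counter);
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("counter=" + counter.get()); // prints counter=200000
    }
}
```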

Let’s compare the performance results:

Last but not least, it is worth mentioning that if the critical section (the protected part) does not fit into a single atomic operation, it will require orchestrating multiple CAS operations and will be far more complex and expensive to handle.

Visibility of Change

Visibility of change can be achieved with a memory barrier (also called memory fence).

A memory barrier causes a CPU to enforce an ordering constraint on memory operations issued before and after the barrier instruction. Source: Wikipedia

Why do we need such mechanisms? For performance reasons, most modern CPUs employ optimizations that may result in out-of-order execution.

Let’s go back to our first example:

In this example, it does not matter when the loop counter is updated, as no operation within the loop uses it. The CPU is free to reorder instructions to optimize execution performance.
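The first example is not reproduced in this excerpt; a minimal single-threaded version might look like this (names assumed):

```java
public class SingleThreaded {
    public static void main(String[] args) {
        final int MAX = 1_000_000;
        int counter = 0;
        // Nothing inside the loop body reads the loop counter, so the CPU
        // (and the JIT) are free to reorder or batch its updates.
        for (int i = 0; i < MAX; i++) {
            counter++;
        }
        System.out.println("counter=" + counter); // prints counter=1000000
    }
}
```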

This reordering is not a problem in case of a single threaded execution but can become unpredictable in the context of concurrent execution.

This is the goal of memory barriers:

Ensuring that all instructions on either side of the barrier appear in the correct order when observed from another CPU core.

Making the memory visible by propagating data to the underlying caches.

There are different types of memory barriers:

Read: gives a consistent view of the world for write operations ordered before the barrier.

Write: gives an ordered view to the world of the store operations before the barrier.

Full: a composition of the read and write barriers.

In Java, using a volatile field inserts a write barrier instruction after we write to it, and a read barrier instruction before we read from it. Meanwhile, a final field of a class is made visible using a write barrier once the constructor completes.

It is also possible to access such instructions via the sun.misc.Unsafe class.

Let’s modify our multi-threaded implementation by using a read memory barrier before accessing sharedCounter and a write memory barrier after updating it:
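The original snippet is not shown here; the same idea can be sketched with the portable VarHandle fences available since Java 9, instead of Unsafe (class and field names assumed):

```java
import java.lang.invoke.VarHandle;

public class FencedCounter {
    static int sharedCounter = 0;
    static final int MAX = 1_000_000;

    static void increment() {
        VarHandle.acquireFence();        // read barrier before reading sharedCounter
        int value = sharedCounter + 1;   // still a non-atomic read-modify-write
        sharedCounter = value;
        VarHandle.releaseFence();        // write barrier after writing sharedCounter
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < MAX / 2; i++) {
                increment();
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Usually close to 1000000, but not deterministic: the fences fix
        // visibility, not the atomicity of the read-modify-write.
        System.out.println("counter=" + sharedCounter);
    }
}
```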

Let’s run this implementation again with MAX equal to 1 million. Bear in mind, we still expect counter=1000000 to be printed at the end of the execution:

-- First multi-threaded implementation without a memory barrier

counter=522388

counter=733903

counter=532331

-- New implementation with memory barriers

counter=999890

counter=999868

counter=999770

As we can see, there’s an undeniable impact here as we are getting closer to the expected result due to the memory barriers. Yet, the execution is not deterministic as a memory barrier is still not enough to prevent race conditions with non-atomic operations.

Traditional Queues

One alternative to sharing memory between threads is the message passing paradigm: sharing memory by communicating.

It means we need something in between threads to handle communications. One of the solutions is to use traditional queues like LinkedBlockingQueue or ArrayBlockingQueue in Java.

Yet, it does not solve concurrency problems as even a queue must ensure mutual exclusion and visibility of change.

If we take a look at the put method of ArrayBlockingQueue, we can verify that both aspects are handled:

The accesses to the queue are locked and once an element has been added a signal is sent to an awaiting thread.
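The JDK source is not reproduced here, but the pattern can be sketched with a minimal, hypothetical bounded queue: a ReentrantLock provides mutual exclusion, and Condition signals provide the visibility and wake-ups:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class TinyBlockingQueue<E> {
    private final Object[] items;
    private int putIndex, takeIndex, count;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition();
    private final Condition notFull = lock.newCondition();

    TinyBlockingQueue(int capacity) {
        items = new Object[capacity];
    }

    void put(E e) throws InterruptedException {
        lock.lockInterruptibly();              // mutual exclusion
        try {
            while (count == items.length) {
                notFull.await();               // block while the queue is full
            }
            items[putIndex] = e;
            putIndex = (putIndex + 1) % items.length;
            count++;
            notEmpty.signal();                 // wake up an awaiting consumer
        } finally {
            lock.unlock();
        }
    }

    @SuppressWarnings("unchecked")
    E take() throws InterruptedException {
        lock.lockInterruptibly();
        try {
            while (count == 0) {
                notEmpty.await();              // block while the queue is empty
            }
            E e = (E) items[takeIndex];
            takeIndex = (takeIndex + 1) % items.length;
            count--;
            notFull.signal();                  // wake up an awaiting producer
            return e;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TinyBlockingQueue<Integer> queue = new TinyBlockingQueue<>(2);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++) queue.put(i);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        for (int i = 0; i < 5; i++) sum += queue.take();
        producer.join();
        System.out.println("sum=" + sum); // prints sum=10
    }
}
```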

Interestingly, the LMAX team noticed that queues tend to be either close to full or close to empty most of the time, due to the difference in pace between consumers and producers. This observation results in a high level of contention and/or expensive cache management.

If the queue is close to full, it will result in contention between the producers, leading to context switches and perhaps losing cached data and instructions.

Moreover, in a traditional queuing system, the producers claim the head of the queue while the consumers claim the tail. If the queue is close to empty, it is very likely that the head, the tail and the queue size will all belong to the same cache line, which may lead to the false sharing problem described above.

LMAX Disruptor

The creators of the LMAX Disruptor are also famous for having invented the concept of Mechanical Sympathy. In a nutshell, understanding the underlying hardware should make you a better developer when it comes to designing algorithms, data structures, etc. Based on this philosophy, the team has been able to produce this great library:

The Disruptor has significantly less write contention, a lower concurrency overhead and is more cache friendly than comparable approaches, all of which results in greater throughput with less jitter at lower latency. Source: Disruptor technical paper

Let’s try to analyze the reasons.

First, the Disruptor is based on a ring buffer structure (also called a circular buffer), which is simply a single, fixed-size buffer whose ends are logically connected:

Ring buffer structure

On startup, the memory is allocated according to the provided size (which must be a power of 2) and a factory to initialize events:
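The construction code is not shown in this excerpt; assuming the com.lmax:disruptor dependency (3.x), a sketch might look like this (MyEvent and the variable names are illustrative):

```java
import com.lmax.disruptor.BlockingWaitStrategy;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;
import java.util.concurrent.ThreadFactory;

public class DisruptorSetup {
    // Mutable event holder: instances are preallocated and reused
    public static class MyEvent {
        public long value;
    }

    public static void main(String[] args) {
        int ringBufferSize = 1024;                 // must be a power of 2
        ThreadFactory threadFactory = Thread::new; // threads for event processors

        Disruptor<MyEvent> disruptor = new Disruptor<>(
                MyEvent::new,       // factory: invoked ringBufferSize times on startup
                ringBufferSize,
                threadFactory,
                ProducerType.MULTI, // single vs. multiple producers
                new BlockingWaitStrategy());
    }
}
```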

Here, we allocated ringBufferSize instances of the MyEvent class on startup.

Meanwhile, we provided a factory to create threads for event processors and a wait strategy defining how to handle slow subscribers (those strategies are described here). We’ll discuss the producer type a bit later.

The event instances will be reused and live for the duration of the Disruptor instance to eliminate garbage collection issues; in traditional queues, events may survive longer than they should.

Internally, the ring buffer is backed by an array of objects. There is a good reason for that according to the creators. At the hardware level, an array has a predictable pattern of access. The entries can be pre-loaded, so the processor is not constantly going back to the main memory to load the next item in the ring. This way, we can iterate way more efficiently than with a queue backed by a linked list for example.

Configuring a consumer can be done like this:
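As a sketch, assuming a Disruptor<MyEvent> instance named disruptor built as described above, wiring a handler might look like this (the lambda matches the onEvent method of the EventHandler interface):

```java
// The three parameters are: the event, its sequence identifier,
// and the end-of-batch flag.
disruptor.handleEventsWith((event, sequence, endOfBatch) ->
        System.out.println("value=" + event.value
                + " sequence=" + sequence
                + " endOfBatch=" + endOfBatch));
```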

We provided a lambda expression to handleEventsWith(). This lambda has three inputs:

The event itself

A sequence identifier

A boolean for batch management

In the case of multiple consumers, they are all going to receive every single event. If we want to distribute the load, we can implement a sharding strategy based on the sequence identifier. Something like this:
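A hedged sketch of such a sharding strategy, again assuming a Disruptor<MyEvent> named disruptor (process is a hypothetical helper):

```java
int consumers = 4;
for (int i = 0; i < consumers; i++) {
    final int shard = i;
    // Every handler receives every event, but each one only
    // processes the events belonging to its own shard.
    disruptor.handleEventsWith((event, sequence, endOfBatch) -> {
        if (sequence % consumers == shard) {
            process(event); // hypothetical processing method
        }
    });
}
```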

Then we can start the Disruptor which returns a RingBuffer instance:
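A sketch, assuming the Disruptor<MyEvent> instance from the setup described above:

```java
// start() launches the event processor threads and returns
// the RingBuffer<MyEvent> used for publishing
RingBuffer<MyEvent> ringBuffer = disruptor.start();
```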

Once the Disruptor instance is started, we can publish an event the following way:
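A sketch of the publishing sequence, assuming the ringBuffer instance and event type introduced above (the try/finally ensures the slot is published even if filling the event fails):

```java
// Claim the next slot, mutate the preallocated event, then publish it.
// publish() emits the memory barrier that makes the write visible.
long sequence = ringBuffer.next();
try {
    MyEvent event = ringBuffer.get(sequence);
    event.value = 42L;
} finally {
    ringBuffer.publish(sequence);
}
```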

The sequence identifier is the position of the event in the ring buffer structure.

When we created the Disruptor instance, we also passed a producerType variable, which can be set to either ProducerType.SINGLE or ProducerType.MULTI. This indicates to the Disruptor whether we will have a single producer or multiple producers.

In the case of a single producer, the ringBuffer.next() call is completely lock-free. On the other hand, if we have multiple producers, this function relies on a CAS operation to provide the next available sequence identifier in the ring buffer.

The sequence identifier itself is managed by adding left and right padding to make sure it is never in a cache line with anything else:
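The layout can be sketched as follows (a simplified view of the padding pattern; field names are illustrative):

```java
// 7 longs (56 bytes) on each side keep `value` alone on a
// typical 64-byte cache line, preventing false sharing.
class LhsPadding {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

class Value extends LhsPadding {
    protected volatile long value; // the sequence counter itself
}

class RhsPadding extends Value {
    protected long p9, p10, p11, p12, p13, p14, p15;
}
```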

Furthermore, publishing an event creates a memory barrier to make sure the caches are up to date with this event. This allows adding an event to the ring buffer structure without any locking, which gives us a huge performance improvement.

One last point to mention: we said that the ring buffer was backed by a simple array. That means it is up to the developer to prevent potential false sharing problems between events that belong to the same cache line.