FreeBSD lockless algorithm - seq

Those days locking algorithms are critical for operating systems, especially in a multi-threaded world. With time it turns out that classical locks, like mutex, are performance costly. Even when we are using techniques like having reads and writes mutex, synchronizing the state between CPUs can be costly. To optimize some cases, the lockless algorithm started to be used.

There are multiple places in the kernel where we need to read some variables very often, but there is a small number of cases when we write to them. For such cases, the seq interface was created. First, let's see how the writers look:

lock_exclusive(&obj->lock);

seq_write_begin(&obj->seq);

/* Change the state of obj. */

seq_write_end(&obj->seq);

unlock_exclusive(&obj->lock);

As you can see, we are still using some kind of barrier before editing the object. This protects users from having multiple writes at once. But remember that in our case we have an object which doesn't change often, so we don't need to optimize that part. After locking the object, we use a special seq_write_begin function which will notify all readers that the object will be changed. Next, we are doing all the modifications on the object. The seq_write_end function is used to notify all readers that the object was changed. In the end, we just unlock our locks and allow other writers to modify the object. This is a performance costly operation because we need to lock and unlock the lock.

int s1, s2;

set_t seq;

do {

seq = seq_read(&obj->seq);

s1 = obj->state1;

s2 = obj->state2;

} while(!seq_consistent(&obj->seq, seq));

The code above showed how the reader is looking. First, we are fetching a state of the sequence using the seq_read function. Then we are getting our variables that we are interested to work with (s1 and s2). In the end, we are checking if the state of our sequence was changed using the seq_consistent function. If the sequance was not changed we fetched values correctly. If the function returned false, that means that there was some writer that changed the state of our objects, and we need to start fetching one more time. This loop at first look can look weird but in the case when we are fetching only a few variables and there aren't so many writes to it, the loop will be used once or twice.

Because our obj variable doesn't change very often, this algorithm is much more efficient than standards mutex - the only locking which we need to do is only on the writer side. How is this possible? Let's look how it coult be implemented.

static __inline void

seq_write_begin(seq_t *seqp)

{

atomic_add_rel_int(seqp, 1);

}



static __inline void

seq_write_end(seq_t *seqp)

{

atomic_add_rel_int(seqp, 1);

}



static __inline seq_t

seq_read(const seq_t *seqp)

{

set_t s;



while ((s = atomic_load_acq_int(__DECONST(seq_t *, seqp))) % 2 == 1)

;

return (s);

}



static __inline seq_t

seq_consistent(const seq_t *seqp, seq_t oldseq)

{



return (*seqp == oldseq);

}

The seq_t is defined as unsigned 32bits integer. The seq_write_begin and seq_write_end adds one to our sequence. Those functions are used to create a transaction for writer. The seq_read function returns us the current state of the sequence. In case when sequence is an odd number - this is case when there is ongoing write to the structure - the seq_read function will spin for a while and read the sequence one more time. If it's a even number then there is no ongoing write so it returns the current sequance. The seq_consistent is just comparing those two numbers. If the numbers are equals it means that we succesfully fetched variables. In case when the numbers are different it means that there is ongoing write to the structure, or there was some write to it – in both case we need to start the whole process from beginning. All those situations are illustrated below:





In the seq_read function FreeBSD is using a special function for spin – cpu_spinwait. I simplified that for the purpose of example.



Thereare two theoretical problems with this algorithm. If there is a lot of writers, there is no guarantee that thereader will finally get the data - this is why we should use this algorithm only when we are sure that there will not be many writes to the structure.

The second problem is if the read loop was to take a very, very long time and at the same time would be 232 writes, the sequence compassion because of integer overflow and the state from the structure would be fetched with invalid state – however, this is very unlikely to ever happen.

In this post, we discuss a lockless algorithm `seq` from FreeBSD which can be used for the optimization of reads from structures. The only one requirement before using this algorithm is to be sure that there are more readers then writers :) You can find the whole implementation in the FreeBSD source code.