Spinlocks and Read-Write Locks

Most parallel programming will in some way involve locking at the lowest levels. Locks are primitives that provide mutual exclusion, allowing data structures to remain in consistent states. Without locking, multiple threads of execution may simultaneously modify a data structure. Unless a carefully thought out (and usually complex) lock-free algorithm is used, the result is usually a crash or hang as unintended program states are entered. Since creating a lock-free algorithm is extremely difficult, most programs use locks.

If updating a data structure is slow, the lock of choice is a mutex of some kind. These transfer control to the operating system when they block. This allows another thread to run, and perhaps make progress, whilst the first thread sleeps. This transfer of control consists of a pair of context switches, which are quite slow. Thus, if the lock-hold time is expected to be short, a mutex may not be the fastest method.

Spinlocks

Instead of context switching, a spinlock will "spin", repeatedly checking to see if the lock is unlocked. Spinning is very fast, so the latency between an unlock-lock pair is small. However, spinning doesn't accomplish any work, so it may not be as efficient as a sleeping mutex if the time spent spinning becomes significant.

Before we describe the implementation of spinlocks, we first need a set of atomic primitives.
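To make the idea of an atomic primitive concrete, here is a minimal sketch of an atomic exchange written with C11 `<stdatomic.h>` rather than with compiler-specific facilities. The names `flag_try_acquire` and `flag_release` are hypothetical and exist only for this illustration. An atomic exchange stores a new value and returns the old one as a single indivisible step, so only one caller can ever observe the transition from 0 to 1.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if this caller changed the flag from 0 to 1 */
static int flag_try_acquire(atomic_uint *flag)
{
	assert(flag != NULL);

	/* Atomically store 1 and fetch whatever value was there before */
	return atomic_exchange(flag, 1u) == 0;
}

/* Hypothetical helper: reset the flag so that someone else can take it */
static void flag_release(atomic_uint *flag)
{
	assert(flag != NULL);
	atomic_store(flag, 0u);
}
```

A second call to `flag_try_acquire` before `flag_release` fails, because the exchange sees the 1 left behind by the first caller.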
Fortunately, gcc provides some of these as built-in functions:

```c
#include <stddef.h>	/* NULL */
#include <stdint.h>	/* uintptr_t, used by later lock code */

#define atomic_xadd(P, V) __sync_fetch_and_add((P), (V))
#define cmpxchg(P, O, N) __sync_val_compare_and_swap((P), (O), (N))
#define atomic_inc(P) __sync_add_and_fetch((P), 1)
#define atomic_dec(P) __sync_add_and_fetch((P), -1)
#define atomic_add(P, V) __sync_add_and_fetch((P), (V))
#define atomic_set_bit(P, V) __sync_or_and_fetch((P), 1<<(V))
#define atomic_clear_bit(P, V) __sync_and_and_fetch((P), ~(1<<(V)))
```

Unfortunately, we will require a few others that are not, and so must be implemented in assembly:

```c
/* Compile read-write barrier */
#define barrier() asm volatile("": : :"memory")

/* Pause instruction to prevent excess processor bus usage */
#define cpu_relax() asm volatile("pause\n": : :"memory")

/* Atomic exchange (of various sizes) */
static inline void *xchg_64(void *ptr, void *x)
{
	__asm__ __volatile__("xchgq %0,%1"
				:"=r" (x)
				:"m" (*(volatile long long *)ptr), "0" (x)
				:"memory");

	return x;
}

static inline unsigned xchg_32(void *ptr, unsigned x)
{
	__asm__ __volatile__("xchgl %0,%1"
				:"=r" (x)
				:"m" (*(volatile unsigned *)ptr), "0" (x)
				:"memory");

	return x;
}

static inline unsigned short xchg_16(void *ptr, unsigned short x)
{
	__asm__ __volatile__("xchgw %0,%1"
				:"=r" (x)
				:"m" (*(volatile unsigned short *)ptr), "0" (x)
				:"memory");

	return x;
}

/* Test and set a bit */
static inline char atomic_bitsetandtest(void *ptr, int x)
{
	char out;
	__asm__ __volatile__("lock; bts %2,%1\n"
				"sbb %0,%0\n"
				:"=r" (out), "=m" (*(volatile long long *)ptr)
				:"Ir" (x)
				:"memory");

	return out;
}
```

A spinlock can be implemented in an obvious way, using the atomic exchange primitive.

```c
#define EBUSY 1
typedef unsigned spinlock;

static void spin_lock(spinlock *lock)
{
	while (1) {
		/* Try to take the lock */
		if (!xchg_32(lock, EBUSY)) return;

		/* Spin with plain reads until it looks free again */
		while (*lock) cpu_relax();
	}
}

static void spin_unlock(spinlock *lock)
{
	barrier();
	*lock = 0;
}

static int spin_trylock(spinlock *lock)
{
	return xchg_32(lock, EBUSY);
}
```

So how fast is the above code? A simple benchmark to test the overhead of a lock is to have a given number of threads attempt to lock and unlock it, doing a fixed amount of work each time. If the total number of lock-unlock pairs is kept constant as the number of threads is increased, it is possible to measure the effect of contention on performance. A good spinlock implementation will be as fast as possible for any given number of threads attempting to use that lock simultaneously.

The results for the above spinlock implementation are:

| Threads  | 1   | 2   | 3   | 4   | 5   |
|----------|-----|-----|-----|-----|-----|
| Time (s) | 5.5 | 5.6 | 5.7 | 5.7 | 5.7 |

These results are pretty good, but can be improved. The problem is that if there are multiple threads contending, then they all attempt to take the lock at the same time once it is released. This results in a huge amount of processor bus traffic, which is a huge performance killer. Thus, if we somehow order the lock-takers so that they know who is next in line for the resource, we can vastly reduce the amount of bus traffic.

One spinlock algorithm that does this is called the MCS lock. This uses a list to maintain the order of acquirers.

```c
typedef struct mcs_lock_t mcs_lock_t;
struct mcs_lock_t {
	mcs_lock_t *next;
	int spin;
};
typedef struct mcs_lock_t *mcs_lock;

static void lock_mcs(mcs_lock *m, mcs_lock_t *me)
{
	mcs_lock_t *tail;

	me->next = NULL;
	me->spin = 0;

	tail = xchg_64(m, me);

	/* No one there? */
	if (!tail) return;

	/* Someone there, need to link in */
	tail->next = me;

	/* Make sure we do the above setting of next. */
	barrier();

	/* Spin on my spin variable */
	while (!me->spin) cpu_relax();

	return;
}

static void unlock_mcs(mcs_lock *m, mcs_lock_t *me)
{
	/* No successor yet? */
	if (!me->next) {
		/* Try to atomically unlock */
		if (cmpxchg(m, me, NULL) == me) return;

		/* Wait for successor to appear */
		while (!me->next) cpu_relax();
	}

	/* Unlock next one */
	me->next->spin = 1;
}

static int trylock_mcs(mcs_lock *m, mcs_lock_t *me)
{
	mcs_lock_t *tail;

	me->next = NULL;
	me->spin = 0;

	/* Try to lock: install our node only if the queue is empty */
	tail = cmpxchg(m, NULL, me);

	/* No one was there - can quickly return */
	if (!tail) return 0;

	return EBUSY;
}
```

This has quite different timings:

| Threads  | 1   | 2   | 3   | 4   | 5     |
|----------|-----|-----|-----|-----|-------|
| Time (s) | 3.6 | 4.4 | 4.5 | 4.8 | >1min |

The MCS lock takes hugely longer when the number of threads is greater than the number of processors (four in this case). This is because if the next thread in the queue isn't running when the lock is unlocked, then everyone must wait until the operating system scheduler decides to run it. Every "fair" lock algorithm has this problem. Thus, the simple unfair spinlock can still be quite useful when you don't know that the number of threads is bounded by the number of CPUs.

A bigger problem with the MCS lock is its API. It requires a second structure to be passed in addition to the address of the lock. The algorithm uses this second structure to store the information that describes the queue of threads waiting for the lock. Unfortunately, most code written using spinlocks doesn't have this extra information, so the fact that the MCS algorithm isn't a drop-in replacement for a standard spinlock is a problem.

An IBM working group found a way to improve the MCS algorithm to remove the need to pass the extra structure as a parameter. On-stack information was used instead.
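The API cost of a queue-based lock is easiest to see at a call site. The following is a rough sketch, restated in C11 atomics so that it is self-contained; the names `qnode`, `qlock`, `q_acquire`, `q_release` and `counter_add` are hypothetical and not part of the code in this article. It follows the same queueing idea as the MCS lock, and shows that every critical section has to supply a queue node of its own, here placed on the caller's stack.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical C11 restatement of an MCS-style queue node */
typedef struct qnode {
	_Atomic(struct qnode *) next;
	atomic_int go;
} qnode;

/* The lock itself is just a pointer to the tail of the queue */
typedef _Atomic(qnode *) qlock;

static void q_acquire(qlock *l, qnode *me)
{
	qnode *pred;

	atomic_store(&me->next, NULL);
	atomic_store(&me->go, 0);

	/* Append ourselves to the queue */
	pred = atomic_exchange(l, me);
	if (!pred) return;

	/* Link in behind the predecessor, then spin on our own flag */
	atomic_store(&pred->next, me);
	while (!atomic_load(&me->go)) { }
}

static void q_release(qlock *l, qnode *me)
{
	qnode *succ = atomic_load(&me->next);

	if (!succ) {
		qnode *expected = me;

		/* No successor visible: try to mark the queue empty */
		if (atomic_compare_exchange_strong(l, &expected, NULL)) return;

		/* A successor is part-way through enqueueing: wait for its link */
		while (!(succ = atomic_load(&me->next))) { }
	}

	/* Hand the lock to the successor */
	atomic_store(&succ->go, 1);
}

/* Call site: the caller must provide the extra queue node */
static long counter_add(qlock *l, long *counter, long n)
{
	qnode node;	/* not needed by a caller of a plain spinlock */
	long result;

	assert(l != NULL && counter != NULL);

	q_acquire(l, &node);
	result = (*counter += n);
	q_release(l, &node);

	return result;
}
```

Code that was written against a one-argument `lock(l)` / `unlock(l)` interface has no natural place to keep `node`, which is why such a lock cannot simply be swapped in.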
The result is the K42 lock algorithm:

```c
typedef struct k42lock k42lock;
struct k42lock {
	k42lock *next;
	k42lock *tail;
};

static void k42_lock(k42lock *l)
{
	k42lock me;
	k42lock *pred, *succ;
	me.next = NULL;

	barrier();

	pred = xchg_64(&l->tail, &me);
	if (pred) {
		me.tail = (void *) 1;

		barrier();
		pred->next = &me;
		barrier();

		while (me.tail) cpu_relax();
	}

	succ = me.next;

	if (!succ) {
		barrier();
		l->next = NULL;

		if (cmpxchg(&l->tail, &me, &l->next) != &me) {
			while (!me.next) cpu_relax();

			l->next = me.next;
		}
	} else {
		l->next = succ;
	}
}

static void k42_unlock(k42lock *l)
{
	k42lock *succ = l->next;

	barrier();

	if (!succ) {
		if (cmpxchg(&l->tail, &l->next, NULL) == (void *) &l->next) return;

		while (!l->next) cpu_relax();
		succ = l->next;
	}

	succ->tail = NULL;
}

static int k42_trylock(k42lock *l)
{
	if (!cmpxchg(&l->tail, NULL, &l->next)) return 0;

	return EBUSY;
}
```

The timings of the K42 algorithm are close to those of the MCS lock:

| Threads  | 1   | 2   | 3   | 4   | 5     |
|----------|-----|-----|-----|-----|-------|
| Time (s) | 3.7 | 4.8 | 4.5 | 4.9 | >1min |

Unfortunately, the K42 algorithm has another problem. It appears that it may be patented by IBM, so it cannot be used either (without perhaps paying royalties to IBM).

One way around this is to use a different type of list. The K42 and MCS locks use lists ordered so that finding the next thread to run is easy, and adding to the end is hard. What about flipping the direction of the pointers so that finding the end is easy, and finding who's next is hard?
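The trade-off of a reversed list can be sketched with a small single-threaded model. This is only an illustration of the data structure, not a lock: it has no atomics, and the names `waiter`, `push_waiter` and `pop_oldest` are hypothetical. Each waiter points at the one that arrived before it, so adding a waiter is a constant-time push, while finding the longest-waiting thread means walking the whole list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical waiter record: each waiter points at the previous arrival */
typedef struct waiter {
	struct waiter *older;
	int id;
} waiter;

/* Adding is O(1): the list head always refers to the newest arrival */
static void push_waiter(waiter **newest, waiter *w)
{
	w->older = *newest;
	*newest = w;
}

/* Finding who runs next is O(n): walk to the far end of the list */
static waiter *pop_oldest(waiter **newest, int *steps)
{
	waiter *prev = NULL;
	waiter *cur = *newest;

	assert(cur != NULL);

	*steps = 0;
	while (cur->older) {
		prev = cur;
		cur = cur->older;
		(*steps)++;
	}

	/* Detach the oldest waiter from the list */
	if (prev) {
		prev->older = NULL;
	} else {
		*newest = NULL;
	}

	return cur;
}
```

With three waiters queued, the first `pop_oldest` call takes two steps to reach the end, which hints at why the unlock path of such a lock grows more expensive as contention rises.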
The result is the following algorithm:

```c
typedef struct listlock_t listlock_t;
struct listlock_t {
	listlock_t *next;
	int spin;
};
typedef struct listlock_t *listlock;

#define LLOCK_FLAG ((void *)1)

static void listlock_lock(listlock *l)
{
	listlock_t me;
	listlock_t *tail;

	/* Fast path - no users */
	if (!cmpxchg(l, NULL, LLOCK_FLAG)) return;

	me.next = LLOCK_FLAG;
	me.spin = 0;

	/* Convert into a wait list */
	tail = xchg_64(l, &me);

	if (tail) {
		/* Add myself to the list of waiters */
		if (tail == LLOCK_FLAG) tail = NULL;
		me.next = tail;

		/* Wait for being able to go */
		while (!me.spin) cpu_relax();

		return;
	}

	/* Try to convert to an exclusive lock */
	if (cmpxchg(l, &me, LLOCK_FLAG) == &me) return;

	/* Failed - there is now a wait list */
	tail = *l;

	/* Scan to find who is after me */
	while (1) {
		/* Wait for them to enter their next link */
		while (tail->next == LLOCK_FLAG) cpu_relax();

		if (tail->next == &me) {
			/* Fix their next pointer */
			tail->next = NULL;

			return;
		}

		tail = tail->next;
	}
}

static void listlock_unlock(listlock *l)
{
	listlock_t *tail;
	listlock_t *tp;

	while (1) {
		tail = *l;

		barrier();

		/* Fast path */
		if (tail == LLOCK_FLAG) {
			if (cmpxchg(l, LLOCK_FLAG, NULL) == LLOCK_FLAG) return;

			continue;
		}

		tp = NULL;

		/* Wait for partially added waiter */
		while (tail->next == LLOCK_FLAG) cpu_relax();

		/* There is a wait list */
		if (tail->next) break;

		/* Try to convert to a single-waiter lock */
		if (cmpxchg(l, tail, LLOCK_FLAG) == tail) {
			/* Unlock */
			tail->spin = 1;

			return;
		}

		cpu_relax();
	}

	/* A long list */
	tp = tail;
	tail = tail->next;

	/* Scan wait list */
	while (1) {
		/* Wait for partially added waiter */
		while (tail->next == LLOCK_FLAG) cpu_relax();

		if (!tail->next) break;

		tp = tail;
		tail = tail->next;
	}

	tp->next = NULL;

	barrier();

	/* Unlock */
	tail->spin = 1;
}

static int listlock_trylock(listlock *l)
{
	/* Simple part of a spin-lock */
	if (!cmpxchg(l, NULL, LLOCK_FLAG)) return 0;

	/* Failure! */
	return EBUSY;
}
```

This unfortunately is extremely complex, and doesn't perform well either:

| Threads  | 1   | 2   | 3   | 4   | 5     |
|----------|-----|-----|-----|-----|-------|
| Time (s) | 3.6 | 5.1 | 5.8 | 6.3 | >1min |

It is still faster than the standard spinlock when contention is low, but once more than two threads are attempting to lock at the same time it is worse, and gets slower from there on.

Another possible trick is to use a spinlock within a spinlock. The first lock can be very lightweight since we know it will only be held for a short time. It can then control the locking for the wait list describing the acquirers of the real spinlock. If done right, the number of waiters on the sub-lock can be kept low, thus minimizing bus traffic. The result is:

```c
typedef struct bitlistlock_t bitlistlock_t;
struct bitlistlock_t {
	bitlistlock_t *next;
	int spin;
};

typedef bitlistlock_t *bitlistlock;

#define BLL_USED ((bitlistlock_t *) -2LL)

static void bitlistlock_lock(bitlistlock *l)
{
	bitlistlock_t me;
	bitlistlock_t *tail;

	/* Grab control of list */
	while (atomic_bitsetandtest(l, 0)) cpu_relax();

	/* Remove locked bit */
	tail = (bitlistlock_t *) ((uintptr_t) *l & ~1LL);

	/* Fast path, no waiters */
	if (!tail) {
		/* Set to be a flag value */
		*l = BLL_USED;
		return;
	}

	if (tail == BLL_USED) tail = NULL;
	me.next = tail;
	me.spin = 0;

	barrier();

	/* Unlock, and add myself to the wait list */
	*l = &me;

	/* Wait for the go-ahead */
	while (!me.spin) cpu_relax();
}

static void bitlistlock_unlock(bitlistlock *l)
{
	bitlistlock_t *tail;
	bitlistlock_t *tp;

	/* Fast path - no wait list */
	if (cmpxchg(l, BLL_USED, NULL) == BLL_USED) return;

	/* Grab control of list */
	while (atomic_bitsetandtest(l, 0)) cpu_relax();

	tp = *l;

	barrier();

	/* Get end of list */
	tail = (bitlistlock_t *) ((uintptr_t) tp & ~1LL);

	/* Actually no users? */
	if (tail == BLL_USED) {
		barrier();
		*l = NULL;
		return;
	}

	/* Only one entry on wait list? */
	if (!tail->next) {
		barrier();

		/* Unlock bitlock */
		*l = BLL_USED;

		barrier();

		/* Unlock lock */
		tail->spin = 1;

		return;
	}

	barrier();

	/* Unlock bitlock */
	*l = tail;

	barrier();

	/* Scan wait list for start */
	do {
		tp = tail;
		tail = tail->next;
	} while (tail->next);

	tp->next = NULL;

	barrier();

	/* Unlock */
	tail->spin = 1;
}

static int bitlistlock_trylock(bitlistlock *l)
{
	if (!*l && (cmpxchg(l, NULL, BLL_USED) == NULL)) return 0;

	return EBUSY;
}
```

Unfortunately, this is even worse than the previous listlock algorithm. It is only good for the uncontended case.

| Threads  | 1   | 2   | 3   | 4   | 5     |
|----------|-----|-----|-----|-----|-------|
| Time (s) | 3.6 | 5.3 | 6.3 | 6.8 | >1min |

Another possibility is to modify some other type of locking algorithm to be a spinlock. The read-write locks from ReactOS are designed to scale extremely well. If the "read" part of them is removed, then the mutual exclusion between the writers will act just like a spinlock. Doing this yields:

```c
/* Bit-lock for editing the wait block */
#define SLOCK_LOCK		1
#define SLOCK_LOCK_BIT		0

/* Has an active user */
#define SLOCK_USED		2

#define SLOCK_BITS		3

typedef struct slock slock;
struct slock {
	uintptr_t p;
};

typedef struct slock_wb slock_wb;
struct slock_wb {
	/*
	 * last points to the last wait block in the chain.
	 * The value is only valid when read from the first wait block.
	 */
	slock_wb *last;

	/* next points to the next wait block in the chain. */
	slock_wb *next;

	/* Wake up? */
	int wake;
};

/* Wait for control of wait block */
static slock_wb *slockwb(slock *s)
{
	uintptr_t p;

	/* Spin on the wait block bit lock */
	while (atomic_bitsetandtest(&s->p, SLOCK_LOCK_BIT)) {
		cpu_relax();
	}

	p = s->p;

	if (p <= SLOCK_BITS) {
		/* Oops, looks like the wait block was removed. */
		atomic_dec(&s->p);
		return NULL;
	}

	return (slock_wb *)(p - SLOCK_LOCK);
}

static void slock_lock(slock *s)
{
	slock_wb swblock;

	/* Fastpath - no other readers or writers */
	if (!s->p && (cmpxchg(&s->p, 0, SLOCK_USED) == 0)) return;

	/* Initialize wait block */
	swblock.next = NULL;
	swblock.last = &swblock;
	swblock.wake = 0;

	while (1) {
		uintptr_t p = s->p;

		cpu_relax();

		/* Fastpath - no other readers or writers */
		if (!p) {
			if (cmpxchg(&s->p, 0, SLOCK_USED) == 0) return;
			continue;
		}

		if (p > SLOCK_BITS) {
			slock_wb *first_wb, *last;

			first_wb = slockwb(s);
			if (!first_wb) continue;

			last = first_wb->last;
			last->next = &swblock;
			first_wb->last = &swblock;

			/* Unlock */
			barrier();
			s->p &= ~SLOCK_LOCK;

			break;
		}

		/* Try to add the first wait block */
		if (cmpxchg(&s->p, p, (uintptr_t)&swblock) == p) break;
	}

	/* Wait to acquire exclusive lock */
	while (!swblock.wake) cpu_relax();
}

static void slock_unlock(slock *s)
{
	slock_wb *next;
	slock_wb *wb;
	uintptr_t np;

	while (1) {
		uintptr_t p = s->p;

		/* This is the fast path, we can simply clear the SLOCK_USED bit. */
		if (p == SLOCK_USED) {
			if (cmpxchg(&s->p, SLOCK_USED, 0) == SLOCK_USED) return;
			continue;
		}

		/* There's a wait block, we need to wake the next pending user */
		wb = slockwb(s);
		if (wb) break;

		cpu_relax();
	}

	next = wb->next;
	if (next) {
		/*
		 * There's more blocks chained, we need to update the pointers
		 * in the next wait block and update the wait block pointer.
		 */
		np = (uintptr_t) next;

		next->last = wb->last;
	} else {
		/* Convert the lock to a simple lock. */
		np = SLOCK_USED;
	}

	barrier();
	/* Also unlocks lock bit */
	s->p = np;
	barrier();

	/* Notify the next waiter */
	wb->wake = 1;

	/* We released the lock */
}

static int slock_trylock(slock *s)
{
	/* No other readers or writers? */
	if (!s->p && (cmpxchg(&s->p, 0, SLOCK_USED) == 0)) return 0;

	return EBUSY;
}
```

Again, this algorithm disappoints. The results are similar to the bitlistlock algorithm.
This isn't surprising, as the wait block that controls the waiter list is synchronized by a bit lock.

| Threads  | 1   | 2   | 3   | 4   | 5     |
|----------|-----|-----|-----|-----|-------|
| Time (s) | 3.7 | 5.1 | 5.8 | 6.5 | >1min |

Time to think laterally. One of the problems with the above algorithms is synchronization of the wait list. The core issue is that we need some way to recognize the head and tail of that list. The head of the list is needed to add a new waiter. The tail is needed to decide who is to go next. The MCS lock used the extra structure information so that the list tail could be quickly found. The K42 lock used the patented method of storing the tail in a second list pointer within the lock itself.

There is another trick we can do though. If the extra information is allocated on the stack, then it may be possible to recognize that a pointer is pointing within our own stack frame. If so, then we can use that information within the algorithm to decide where the wait list ends. The result is the stack-lock algorithm:

```c
typedef struct stlock_t stlock_t;
struct stlock_t {
	stlock_t *next;
};

typedef struct stlock_t *stlock;

static __attribute__((noinline)) void stlock_lock(stlock *l)
{
	stlock_t *me = NULL;

	barrier();
	me = xchg_64(l, &me);

	/* Wait until we get the lock */
	while (me) cpu_relax();
}

#define MAX_STACK_SIZE	(1<<12)

static __attribute__((noinline)) int on_stack(void *p)
{
	int x;

	uintptr_t u = (uintptr_t) &x;

	return ((u - (uintptr_t)p + MAX_STACK_SIZE) < MAX_STACK_SIZE * 2);
}

static __attribute__((noinline)) void stlock_unlock(stlock *l)
{
	stlock_t *tail = *l;
	barrier();

	/* Fast case */
	if (on_stack(tail)) {
		/* Try to remove the wait list */
		if (cmpxchg(l, tail, NULL) == tail) return;

		tail = *l;
	}

	/* Scan wait list */
	while (1) {
		/* Wait for partially added waiter */
		while (!tail->next) cpu_relax();

		if (on_stack(tail->next)) break;

		tail = tail->next;
	}

	barrier();

	/* Unlock */
	tail->next = NULL;
}

static int stlock_trylock(stlock *l)
{
	stlock_t me;

	if (!cmpxchg(l, NULL, &me)) return 0;

	return EBUSY;
}
```

This algorithm is quite a bit simpler if you know that a thread's stack is aligned a certain way. (Then the stack check turns into an XOR and a mask operation.) Unfortunately, it is still quite slow.

| Threads  | 1   | 2   | 3   | 4   | 5     |
|----------|-----|-----|-----|-----|-------|
| Time (s) | 3.6 | 5.3 | 5.7 | 6.2 | >1min |

The lock operation above looks to be fairly efficient; it is the unlock routine that is slow and complex. Perhaps if we save a little more information within the lock itself, then the unlock operation can be made faster. Since quite a bit of time seems to be spent finding the node previous to ourselves (which is the one to wake up), it might be better to do that while we are spinning, waiting for our turn to take the lock. If we save this previous pointer within the lock, we will not need to calculate it within the unlock routine.

```c
typedef struct plock_t plock_t;
struct plock_t {
	plock_t *next;
};

typedef struct plock plock;
struct plock {
	plock_t *next;
	plock_t *prev;
	plock_t *last;
};

static void plock_lock(plock *l)
{
	plock_t *me = NULL;
	plock_t *prev;

	barrier();
	me = xchg_64(l, &me);

	prev = NULL;

	/* Wait until we get the lock */
	while (me) {
		/* Scan wait list for my previous */
		if (l->next != (plock_t *) &me) {
			plock_t *t = l->next;

			while (me) {
				if (t->next == (plock_t *) &me) {
					prev = t;

					while (me) cpu_relax();

					goto done;
				}

				if (t->next) t = t->next;
				cpu_relax();
			}
		}
		cpu_relax();
	}

done:
	l->prev = prev;
	l->last = (plock_t *) &me;
}

static void plock_unlock(plock *l)
{
	plock_t *tail;

	/* Do I know my previous? */
	if (l->prev) {
		/* Unlock */
		l->prev->next = NULL;
		return;
	}

	tail = l->next;
	barrier();

	/* Fast case */
	if (tail == l->last) {
		/* Try to remove the wait list */
		if (cmpxchg(&l->next, tail, NULL) == tail) return;

		tail = l->next;
	}

	/* Scan wait list */
	while (1) {
		/* Wait for partially added waiter */
		while (!tail->next) cpu_relax();

		if (tail->next == l->last) break;

		tail = tail->next;
	}

	barrier();

	/* Unlock */
	tail->next = NULL;
}

static int plock_trylock(plock *l)
{
	plock_t me;

	if (!cmpxchg(&l->next, NULL, &me)) {
		l->last = &me;
		return 0;
	}

	return EBUSY;
}
```

This starts regaining some of the speed we have lost, but still isn't quite as good as the K42 algorithm. (It is, however, always faster than the original naive spinlock, provided that the number of threads does not exceed the number of processors.)

| Threads  | 1   | 2   | 3   | 4   | 5     |
|----------|-----|-----|-----|-----|-------|
| Time (s) | 3.7 | 5.1 | 5.3 | 5.4 | >1min |

A careful reading of the plock algorithm shows that it can be improved even more. We don't actually need to know the pointer value of the next waiter. Some other unique value will do instead. Instead of saving a pointer, we can use a counter that we increment. If a waiter knows which counter value corresponds to its turn, then it just needs to wait until that value appears.
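The counter idea can be modelled in a few lines before any atomics are involved. This is a single-threaded sketch of the bookkeeping only, with hypothetical names (`ticket_model`, `take_ticket`, `is_my_turn`, `finish_turn`); it is not a usable lock. It also shows that 16-bit counters may wrap around harmlessly, because a waiter only ever tests its ticket for equality with the counter being served.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-threaded model of the two counters (no atomics) */
typedef struct {
	uint16_t next_ticket;	/* handed to the next arrival */
	uint16_t now_serving;	/* ticket currently allowed to hold the lock */
} ticket_model;

/* An arriving waiter takes the next ticket number */
static uint16_t take_ticket(ticket_model *m)
{
	assert(m != NULL);
	return m->next_ticket++;
}

/* A waiter may proceed only when its own number comes up */
static int is_my_turn(const ticket_model *m, uint16_t ticket)
{
	return m->now_serving == ticket;
}

/* Releasing the lock passes it to the holder of the following ticket */
static void finish_turn(ticket_model *m)
{
	m->now_serving++;
}
```

Starting both counters at 65535, the first arrival gets ticket 65535 and the second gets ticket 0; after the first finishes, the serving counter wraps to 0 and the second waiter proceeds. The ordering only breaks down if 65536 or more waiters hold tickets at once.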
The result is called the ticket lock algorithm:

```c
typedef union ticketlock ticketlock;

union ticketlock {
	unsigned u;
	struct {
		unsigned short ticket;
		unsigned short users;
	} s;
};

static void ticket_lock(ticketlock *t)
{
	/* Take the next ticket number */
	unsigned short me = atomic_xadd(&t->s.users, 1);

	/* Wait until it is being served */
	while (t->s.ticket != me) cpu_relax();
}

static void ticket_unlock(ticketlock *t)
{
	barrier();
	t->s.ticket++;
}

static int ticket_trylock(ticketlock *t)
{
	unsigned short me = t->s.users;
	unsigned short menew = me + 1;
	/* Lock is free only when ticket == users; users lives in the high half */
	unsigned cmp = ((unsigned) me << 16) + me;
	unsigned cmpnew = ((unsigned) menew << 16) + me;

	if (cmpxchg(&t->u, cmp, cmpnew) == cmp) return 0;

	return EBUSY;
}

static int ticket_lockable(ticketlock *t)
{
	ticketlock u = *t;
	barrier();
	return (u.s.ticket == u.s.users);
}
```

The above algorithm is extremely fast, and matches or beats all the other fair locks described.

| Threads  | 1   | 2   | 3   | 4   | 5     |
|----------|-----|-----|-----|-----|-------|
| Time (s) | 3.6 | 4.4 | 4.5 | 4.8 | >1min |

In fact, this is the spinlock algorithm that was used in the Linux kernel for many years, although for extra speed the kernel version was written in assembly language rather than the semi-portable C shown above. Also note that the above code depends on the endianness of the computer architecture. It is designed for little-endian machines. Big-endian processors will require a swap of the two fields within the structure in the union.

The ticket lock shows that an oft-repeated claim is untrue. Many of the above fair-lock algorithms are meant to scale well because the waiters spin on different memory locations. This is meant to reduce bus traffic and thus increase performance. However, it appears that this effect is small. The more important thing is to make sure that the waiters are ordered by who gets to take the lock next. This is what the ticket lock does admirably. The fact that multiple waiters are spinning on the same ticket lock location does not seem to be a performance drain.

Read Write Locks

Quite often, some users of a data structure will make no modifications to it.
They just require read access to its fields to do their work. If multiple threads require read access to the same data, there is no reason why they should not be able to execute simultaneously. Spinlocks don't differentiate between read and read/write access, and thus do not exploit this potential parallelism. To do so, read-write locks are required.

The simplest read-write lock uses a spinlock to control write access, and a counter field for the readers.

```c
typedef struct dumbrwlock dumbrwlock;
struct dumbrwlock {
	spinlock lock;
	unsigned readers;
};

static void dumb_wrlock(dumbrwlock *l)
{
	/* Get write lock */
	spin_lock(&l->lock);

	/* Wait for readers to finish */
	while (l->readers) cpu_relax();
}

static void dumb_wrunlock(dumbrwlock *l)
{
	spin_unlock(&l->lock);
}

static int dumb_wrtrylock(dumbrwlock *l)
{
	/* Want no readers */
	if (l->readers) return EBUSY;

	/* Try to get write lock */
	if (spin_trylock(&l->lock)) return EBUSY;

	if (l->readers) {
		/* Oops, a reader started */
		spin_unlock(&l->lock);
		return EBUSY;
	}

	/* Success! */
	return 0;
}

static void dumb_rdlock(dumbrwlock *l)
{
	while (1) {
		/* Speculatively take read lock */
		atomic_inc(&l->readers);

		/* Success? */
		if (!l->lock) return;

		/* Failure - undo, and wait until we can try again */
		atomic_dec(&l->readers);
		while (l->lock) cpu_relax();
	}
}

static void dumb_rdunlock(dumbrwlock *l)
{
	atomic_dec(&l->readers);
}

static int dumb_rdtrylock(dumbrwlock *l)
{
	/* Speculatively take read lock */
	atomic_inc(&l->readers);

	/* Success? */
	if (!l->lock) return 0;

	/* Failure - undo */
	atomic_dec(&l->readers);

	return EBUSY;
}

static int dumb_rdupgradelock(dumbrwlock *l)
{
	/* Try to convert into a write lock */
	if (spin_trylock(&l->lock)) return EBUSY;

	/* I'm no longer a reader */
	atomic_dec(&l->readers);

	/* Wait for all other readers to finish */
	while (l->readers) cpu_relax();

	return 0;
}
```

To benchmark the above code, we need a little more information than in the spinlock case. The fraction of readers is important.
The more readers, the more parallelism we should get, and the faster the code should run. It is also important to have a random distribution of readers and writers, just like real-world situations. Thus a parallel random number generator is used. By selecting a random byte, and choosing 1, 25, 128, or 250 out of the 256 possibilities to be a writer, we can explore everything from the mostly-reader case through to where most users of the lock are writers.

Finally, it is important to find out the effects of contention. In general, read-write locks tend to be used where contention is high, so we will mostly look at the case where the number of threads is equal to the number of processors.

The dumb lock above performs fairly poorly when there is no contention. If one thread is used we get:

| Writers per 256 | 1   | 25  | 128 | 250 |
|-----------------|-----|-----|-----|-----|
| Time (s)        | 3.7 | 3.8 | 4.6 | 5.4 |

As expected, we asymptote to the relatively slow timings of the standard spinlock algorithm as the write fraction increases. If there is contention, however, the dumb lock actually performs quite well. Using four threads:

| Writers per 256 | 1   | 25  | 128 | 250 |
|-----------------|-----|-----|-----|-----|
| Time (s)        | 1.1 | 1.9 | 4.4 | 5.7 |

The obvious thing to do to try to gain speed would be to replace the slow spinlock with a ticket lock. If this is done, we have:

```c
typedef struct dumbtrwlock dumbtrwlock;
struct dumbtrwlock {
	ticketlock lock;
	unsigned readers;
};

static void dumbt_wrlock(dumbtrwlock *l)
{
	/* Get lock */
	ticket_lock(&l->lock);

	/* Wait for readers to finish */
	while (l->readers) cpu_relax();
}

static void dumbt_wrunlock(dumbtrwlock *l)
{
	ticket_unlock(&l->lock);
}

static int dumbt_wrtrylock(dumbtrwlock *l)
{
	/* Want no readers */
	if (l->readers) return EBUSY;

	/* Try to get write lock */
	if (ticket_trylock(&l->lock)) return EBUSY;

	if (l->readers) {
		/* Oops, a reader started */
		ticket_unlock(&l->lock);
		return EBUSY;
	}

	/* Success! */
	return 0;
}

static void dumbt_rdlock(dumbtrwlock *l)
{
	while (1) {
		/* Success? */
		if (ticket_lockable(&l->lock)) {
			/* Speculatively take read lock */
			atomic_inc(&l->readers);

			/* Success? */
			if (ticket_lockable(&l->lock)) return;

			/* Failure - undo, and wait until we can try again */
			atomic_dec(&l->readers);
		}

		while (!ticket_lockable(&l->lock)) cpu_relax();
	}
}

static void dumbt_rdunlock(dumbtrwlock *l)
{
	atomic_dec(&l->readers);
}

static int dumbt_rdtrylock(dumbtrwlock *l)
{
	/* Speculatively take read lock */
	atomic_inc(&l->readers);

	/* Success? */
	if (ticket_lockable(&l->lock)) return 0;

	/* Failure - undo */
	atomic_dec(&l->readers);

	return EBUSY;
}

static int dumbt_rdupgradelock(dumbtrwlock *l)
{
	/* Try to convert into a write lock */
	if (ticket_trylock(&l->lock)) return EBUSY;

	/* I'm no longer a reader */
	atomic_dec(&l->readers);

	/* Wait for all other readers to finish */
	while (l->readers) cpu_relax();

	return 0;
}
```

This performs much better in the uncontended case, taking 3.7 seconds for all write fractions. It is surprising that it doesn't win in the contended case, though:

| Writers per 256 | 1   | 25  | 128 | 250 |
|-----------------|-----|-----|-----|-----|
| Time (s)        | 2.0 | 2.5 | 3.7 | 4.5 |

This is slower for low write fractions, and faster for large write fractions. Since most of the time we use a read-write lock when the write fraction is low, this is really bad for this algorithm, which can be nearly twice as slow as its competitor.

To try to reduce contention, and to gain speed, let's explore the rather complex algorithm used in ReactOS to emulate Microsoft Windows' slim read-write (SRW) locks. This uses a wait list, with a bit lock to control access to the wait list data structure. It is designed so that waiters will spin on separate memory locations for extra scalability.
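The SRW code keeps its whole state in one pointer-sized word: a pointer to the first wait block shares that word with a few flag bits, which are stripped with a mask before the pointer is used. The following is a minimal sketch of that packing trick on its own, assuming the pointed-to objects are at least 8-byte aligned so that the low three bits of their addresses are always zero. The names `TAG_MASK`, `tag_pack`, `tag_ptr` and `tag_flags` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Low three bits are free for flags, assuming 8-byte aligned objects */
#define TAG_MASK ((uintptr_t)7)

/* Combine an aligned pointer and up to three flag bits into one word */
static uintptr_t tag_pack(void *ptr, unsigned flags)
{
	assert(((uintptr_t)ptr & TAG_MASK) == 0);
	assert((flags & ~(unsigned)TAG_MASK) == 0);

	return (uintptr_t)ptr | flags;
}

/* Recover the pointer by masking off the flag bits */
static void *tag_ptr(uintptr_t word)
{
	return (void *)(word & ~TAG_MASK);
}

/* Recover the flag bits */
static unsigned tag_flags(uintptr_t word)
{
	return (unsigned)(word & TAG_MASK);
}
```

Because pointer and flags live in a single word, both can be examined and replaced together with one compare-and-swap, which is what makes this layout attractive for lock words.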
/* Have a wait block */ #define SRWLOCK_WAIT 1 /* Users are readers */ #define SRWLOCK_SHARED 2 /* Bit-lock for editing the wait block */ #define SRWLOCK_LOCK 4 #define SRWLOCK_LOCK_BIT 2 /* Mask for the above bits */ #define SRWLOCK_MASK 7 /* Number of current users * 8 */ #define SRWLOCK_USERS 8 typedef struct srwlock srwlock; struct srwlock { uintptr_t p; }; typedef struct srw_sw srw_sw; struct srw_sw { uintptr_t spin; srw_sw *next; }; typedef struct srw_wb srw_wb; struct srw_wb { /* s_count is the number of shared acquirers * SRWLOCK_USERS. */ uintptr_t s_count; /* Last points to the last wait block in the chain. The value is only valid when read from the first wait block. */ srw_wb *last; /* Next points to the next wait block in the chain. */ srw_wb *next; /* The wake chain is only valid for shared wait blocks */ srw_sw *wake; srw_sw *last_shared; int ex; }; /* Wait for control of wait block */ static srw_wb *lock_wb(srwlock *l) { uintptr_t p; /* Spin on the wait block bit lock */ while (atomic_bitsetandtest(&l->p, SRWLOCK_LOCK_BIT)) cpu_relax(); p = l->p; barrier(); if (!(p & SRWLOCK_WAIT)) { /* Oops, looks like the wait block was removed. 
*/ atomic_clear_bit(&l->p, SRWLOCK_LOCK_BIT); return NULL; } return (srw_wb *)(p & ~SRWLOCK_MASK); } static void srwlock_init(srwlock *l) { l->p = 0; } static void srwlock_rdlock(srwlock *l) { srw_wb swblock; srw_sw sw; uintptr_t p; srw_wb *wb, *shared; while (1) { barrier(); p = l->p; cpu_relax(); if (!p) { /* This is a fast path, we can simply try to set the shared count to 1 */ if (!cmpxchg(&l->p, 0, SRWLOCK_USERS | SRWLOCK_SHARED)) return; continue; } /* Don't interfere with locking */ if (p & SRWLOCK_LOCK) continue; if (p & SRWLOCK_SHARED) { if (!(p & SRWLOCK_WAIT)) { /* This is a fast path, just increment the number of current shared locks */ if (cmpxchg(&l->p, p, p + SRWLOCK_USERS) == p) return; } else { /* There's other waiters already, lock the wait blocks and increment the shared count */ wb = lock_wb(l); if (wb) break; } continue; } /* Initialize wait block */ swblock.ex = FALSE; swblock.next = NULL; swblock.last = &swblock; swblock.wake = &sw; sw.next = NULL; sw.spin = 0; if (!(p & SRWLOCK_WAIT)) { /* * We need to setup the first wait block. * Currently an exclusive lock is held, change the lock to contended mode. */ swblock.s_count = SRWLOCK_USERS; swblock.last_shared = &sw; if (cmpxchg(&l->p, p, (uintptr_t)&swblock | SRWLOCK_WAIT) == p) { while (!sw.spin) cpu_relax(); return; } continue; } /* Handle the contended but not shared case */ /* * There's other waiters already, lock the wait blocks and increment the shared count. * If the last block in the chain is an exclusive lock, add another block. 
*/ swblock.s_count = 0; wb = lock_wb(l); if (!wb) continue; shared = wb->last; if (shared->ex) { shared->next = &swblock; wb->last = &swblock; shared = &swblock; } else { shared->last_shared->next = &sw; } shared->s_count += SRWLOCK_USERS; shared->last_shared = &sw; /* Unlock */ barrier(); l->p &= ~SRWLOCK_LOCK; /* Wait to be woken */ while (!sw.spin) cpu_relax(); return; } /* The contended and shared case */ sw.next = NULL; sw.spin = 0; if (wb->ex) { /* * We need to setup a new wait block. * Although we're currently in a shared lock and we're acquiring * a shared lock, there are exclusive locks queued in between. * We need to wait until those are released. */ shared = wb->last; if (shared->ex) { swblock.ex = FALSE; swblock.s_count = SRWLOCK_USERS; swblock.next = NULL; swblock.last = &swblock; swblock.wake = &sw; swblock.last_shared = &sw; shared->next = &swblock; wb->last = &swblock; } else { shared->last_shared->next = &sw; shared->s_count += SRWLOCK_USERS; shared->last_shared = &sw; } } else { wb->last_shared->next = &sw; wb->s_count += SRWLOCK_USERS; wb->last_shared = &sw; } /* Unlock */ barrier(); l->p &= ~SRWLOCK_LOCK; /* Wait to be woken */ while (!sw.spin) cpu_relax(); } static void srwlock_rdunlock(srwlock *l) { uintptr_t p, np; srw_wb *wb; srw_wb *next; while (1) { barrier(); p = l->p; cpu_relax(); if (p & SRWLOCK_WAIT) { /* * There's a wait block, we need to wake a pending exclusive acquirer, * if this is the last shared release. 
*/ wb = lock_wb(l); if (wb) break; continue; } /* Don't interfere with locking */ if (p & SRWLOCK_LOCK) continue; /* * This is a fast path, we can simply decrement the shared * count and store the pointer */ np = p - SRWLOCK_USERS; /* If we are the last reader, then the lock is unused */ if (np == SRWLOCK_SHARED) np = 0; /* Try to release the lock */ if (cmpxchg(&l->p, p, np) == p) return; } wb->s_count -= SRWLOCK_USERS; if (wb->s_count) { /* Unlock */ barrier(); l->p &= ~SRWLOCK_LOCK; return; } next = wb->next; if (next) { /* * There's more blocks chained, we need to update the pointers * in the next wait block and update the wait block pointer. */ np = (uintptr_t)next | SRWLOCK_WAIT; next->last = wb->last; } else { /* Convert the lock to a simple exclusive lock. */ np = SRWLOCK_USERS; } barrier(); /* This also unlocks wb lock bit */ l->p = np; barrier(); wb->wake = (void *) 1; barrier(); /* We released the lock */ } static int srwlock_rdtrylock(srwlock *s) { uintptr_t p = s->p; barrier(); /* This is a fast path, we can simply try to set the shared count to 1 */ if (!p && (cmpxchg(&s->p, 0, SRWLOCK_USERS | SRWLOCK_SHARED) == 0)) return 0; if ((p & (SRWLOCK_SHARED | SRWLOCK_WAIT)) == SRWLOCK_SHARED) { /* This is a fast path, just increment the number of current shared locks */ if (cmpxchg(&s->p, p, p + SRWLOCK_USERS) == p) return 0; } return EBUSY; } static void srwlock_wrlock(srwlock *l) { srw_wb swblock; uintptr_t p, np; /* Fastpath - no other readers or writers */ if (!l->p && (!cmpxchg(&l->p, 0, SRWLOCK_USERS))) return; /* Initialize wait block */ swblock.ex = TRUE; swblock.next = NULL; swblock.last = &swblock; swblock.wake = NULL; while (1) { barrier(); p = l->p; cpu_relax(); if (p & SRWLOCK_WAIT) { srw_wb *wb = lock_wb(l); if (!wb) continue; /* Complete Initialization of block */ swblock.s_count = 0; wb->last->next = &swblock; wb->last = &swblock; /* Unlock */ barrier(); l->p &= ~SRWLOCK_LOCK; /* Has our wait block became the first one in the chain? 
                 */
                while (!swblock.wake) cpu_relax();
                return;
            }

            /* Fastpath - no other readers or writers */
            if (!p) {
                if (!cmpxchg(&l->p, 0, SRWLOCK_USERS)) return;
                continue;
            }

            /* Don't interfere with locking */
            if (p & SRWLOCK_LOCK) continue;

            /* There are no wait blocks so far, we need to add ourselves as the first wait block. */
            if (p & SRWLOCK_SHARED) {
                swblock.s_count = p & ~SRWLOCK_MASK;
                np = (uintptr_t)&swblock | SRWLOCK_SHARED | SRWLOCK_WAIT;
            } else {
                swblock.s_count = 0;
                np = (uintptr_t)&swblock | SRWLOCK_WAIT;
            }

            /* Try to make change */
            if (cmpxchg(&l->p, p, np) == p) break;
        }

        /* Has our wait block become the first one in the chain? */
        while (!swblock.wake) cpu_relax();
    }

    static void srwlock_wrunlock(srwlock *l)
    {
        uintptr_t p, np;
        srw_wb *wb;
        srw_wb *next;
        srw_sw *wake, *wake_next;

        while (1) {
            barrier();
            p = l->p;
            cpu_relax();

            if (p == SRWLOCK_USERS) {
                /*
                 * This is the fast path, we can simply clear the SRWLOCK_USERS bit.
                 * All other bits should be 0 now because this is a simple exclusive lock,
                 * and no one else is waiting.
                 */
                if (cmpxchg(&l->p, SRWLOCK_USERS, 0) == SRWLOCK_USERS) return;

                continue;
            }

            /* There's a wait block, we need to wake the next pending acquirer */
            wb = lock_wb(l);
            if (wb) break;
        }

        next = wb->next;
        if (next) {
            /*
             * There are more blocks chained, we need to update the pointers
             * in the next wait block and update the wait block pointer.
             */
            np = (uintptr_t)next | SRWLOCK_WAIT;
            if (!wb->ex) {
                /* Save the shared count */
                next->s_count = wb->s_count;

                np |= SRWLOCK_SHARED;
            }

            next->last = wb->last;
        } else {
            /* Convert the lock to a simple lock. */
            if (wb->ex) {
                np = SRWLOCK_USERS;
            } else {
                np = wb->s_count | SRWLOCK_SHARED;
            }
        }

        barrier();
        /* Also unlocks lock bit */
        l->p = np;
        barrier();

        if (wb->ex) {
            barrier();
            /* Notify the next waiter */
            wb->wake = (void *) 1;
            barrier();
            return;
        }

        /* We now need to wake all others required.
         */
        for (wake = wb->wake; wake; wake = wake_next) {
            barrier();
            wake_next = wake->next;
            barrier();
            wake->spin = 1;
            barrier();
        }
    }

    static int srwlock_wrtrylock(srwlock *s)
    {
        /* No other readers or writers? */
        if (!s->p && (cmpxchg(&s->p, 0, SRWLOCK_USERS) == 0)) return 0;

        return EBUSY;
    }

The code above is not exactly the code in ReactOS. It has been simplified and cleaned up somewhat. One of the controlling bit flags has been removed and replaced with altered control flow. So how does it perform? In the uncontended case, it is just like the dumb ticket-based read-write lock, and takes 3.7 seconds for all cases. For the contended case with four threads:

    Writers per 256    1      25     128    250
    Time (s)           2.2    3.2    5.7    6.4

This is quite bad: slower than the dumb lock in all contended cases. The extra complexity buys no performance gain.

Another possibility is to combine the reader count with some bits describing the state of the writers. A similar technique is used by the Linux kernel to describe its (reader-preferring) read-write locks. Making the lock starvation-proof for writers instead yields something like the following:

    #define RW_WAIT_BIT   0
    #define RW_WRITE_BIT  1
    #define RW_READ_BIT   2

    #define RW_WAIT   1
    #define RW_WRITE  2
    #define RW_READ   4

    typedef unsigned rwlock;

    static void wrlock(rwlock *l)
    {
        while (1) {
            unsigned state = *l;

            /* No readers or writers? */
            if (state < RW_WRITE) {
                /* Turn off RW_WAIT, and turn on RW_WRITE */
                if (cmpxchg(l, state, RW_WRITE) == state) return;

                /* Someone else got there...
                   time to wait */
                state = *l;
            }

            /* Turn on writer wait bit */
            if (!(state & RW_WAIT)) atomic_set_bit(l, RW_WAIT_BIT);

            /* Wait until we can try to take the lock */
            while (*l > RW_WAIT) cpu_relax();
        }
    }

    static void wrunlock(rwlock *l)
    {
        atomic_add(l, -RW_WRITE);
    }

    static int wrtrylock(rwlock *l)
    {
        unsigned state = *l;

        if ((state < RW_WRITE) && (cmpxchg(l, state, state + RW_WRITE) == state)) return 0;

        return EBUSY;
    }

    static void rdlock(rwlock *l)
    {
        while (1) {
            /* A writer exists? */
            while (*l & (RW_WAIT | RW_WRITE)) cpu_relax();

            /* Try to get read lock */
            if (!(atomic_xadd(l, RW_READ) & (RW_WAIT | RW_WRITE))) return;

            /* Undo */
            atomic_add(l, -RW_READ);
        }
    }

    static void rdunlock(rwlock *l)
    {
        atomic_add(l, -RW_READ);
    }

    static int rdtrylock(rwlock *l)
    {
        /* Try to get read lock */
        unsigned state = atomic_xadd(l, RW_READ);

        if (!(state & (RW_WAIT | RW_WRITE))) return 0;

        /* Undo */
        atomic_add(l, -RW_READ);

        return EBUSY;
    }

    /* Get a read lock, even if a writer is waiting */
    static int rdforcelock(rwlock *l)
    {
        /* Try to get read lock */
        unsigned state = atomic_xadd(l, RW_READ);

        /* We succeed even if a writer is waiting */
        if (!(state & RW_WRITE)) return 0;

        /* Undo */
        atomic_add(l, -RW_READ);

        return EBUSY;
    }

    /* Try to upgrade from a read to a write lock atomically */
    static int rdtryupgradelock(rwlock *l)
    {
        /* Someone else is trying (and will succeed) to upgrade to a write lock? */
        if (atomic_bitsetandtest(l, RW_WRITE_BIT)) return EBUSY;

        /* Don't count myself any more */
        atomic_add(l, -RW_READ);

        /* Wait until there are no more readers */
        while (*l > (RW_WAIT | RW_WRITE)) cpu_relax();

        return 0;
    }

Unfortunately, this lock has similar performance to the dumb lock that uses a ticket lock as its spinlock.

    Writers per 256    1      25     128    250
    Time (s)           2.0    3.4    3.9    4.6

The version in the Linux kernel is written in assembler, so it may be a fair bit faster. It uses the fact that the atomic add instruction can set the zero flag.
This means that the slower add-and-test method isn't needed, and a two-instruction fast path is used instead.

Sticking to semi-portable C code, we can still do a little better. There exists a form of the ticket lock that is designed for read-write locks. An example written in assembly was posted to the Linux kernel mailing list in 2002 by David Howells from Red Hat. This was a highly optimized version of a read-write ticket lock developed at IBM in the early 1990s by Joseph Seigh. Note that a similar (but not identical) algorithm was published by John Mellor-Crummey and Michael Scott in their landmark paper "Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors". Converting the algorithm from assembly language to C yields:

    typedef union rwticket rwticket;

    union rwticket {
        unsigned u;
        unsigned short us;
        __extension__ struct {
            unsigned char write;
            unsigned char read;
            unsigned char users;
        } s;
    };

    static void rwticket_wrlock(rwticket *l)
    {
        unsigned me = atomic_xadd(&l->u, (1<<16));
        unsigned char val = me >> 16;

        while (val != l->s.write) cpu_relax();
    }

    static void rwticket_wrunlock(rwticket *l)
    {
        rwticket t = *l;

        barrier();

        t.s.write++;
        t.s.read++;

        *(unsigned short *) l = t.us;
    }

    static int rwticket_wrtrylock(rwticket *l)
    {
        unsigned me = l->s.users;
        unsigned char menew = me + 1;
        unsigned read = l->s.read << 8;
        unsigned cmp = (me << 16) + read + me;
        unsigned cmpnew = (menew << 16) + read + me;

        if (cmpxchg(&l->u, cmp, cmpnew) == cmp) return 0;

        return EBUSY;
    }

    static void rwticket_rdlock(rwticket *l)
    {
        unsigned me = atomic_xadd(&l->u, (1<<16));
        unsigned char val = me >> 16;

        while (val != l->s.read) cpu_relax();
        l->s.read++;
    }

    static void rwticket_rdunlock(rwticket *l)
    {
        atomic_inc(&l->s.write);
    }

    static int rwticket_rdtrylock(rwticket *l)
    {
        unsigned me = l->s.users;
        unsigned write = l->s.write;
        unsigned char menew = me + 1;
        unsigned cmp = (me << 16) + (me << 8) + write;
        unsigned cmpnew = ((unsigned) menew << 16) + (menew << 8) + write;

        if (cmpxchg(&l->u,
                    cmp, cmpnew) == cmp) return 0;

        return EBUSY;
    }

This read-write lock performs extremely well. It is as fast as the dumb spinlock rwlock for a low writer fraction, and nearly as fast as the dumb ticketlock rwlock for a large number of writers. It also doesn't suffer any slowdown when there is no contention, taking 3.7 seconds for all cases. With contention:

    Writers per 256    1      25     128    250
    Time (s)           1.1    1.8    3.9    4.7

This algorithm is five times faster than using a simple spin lock for the reader-dominated case. Its only drawback is that it is difficult to upgrade read locks into write locks atomically. (It can be done, but then rwticket_wrunlock() needs to use an atomic instruction, and the resulting code becomes quite a bit slower.)

This drawback is the reason why this algorithm is not used within the Linux kernel. Some parts depend on the fact that if you have a read lock, then acquiring a new read lock recursively will always succeed. With a ticket lock, the recursive acquire would queue behind any writer that is already waiting, while that writer is itself waiting for the first read lock to be released. However, if that requirement were to be removed, this algorithm would probably be a good replacement.

One final thing to note is that the read-write ticket lock is not optimal. The problem is the situation where readers and writers alternate in the wait queue: writer (executing), reader 1, writer, reader 2. The two reader threads could be shuffled so that they execute in parallel; that is, the second reader should probably not have to wait until the second writer finishes before it executes. Fortunately, this situation is encountered rarely when the thread count is low. For four processors and threads, this happens one time in 16 if readers and writers are equally likely, and less often otherwise. Unfortunately, as the number of threads increases, this will lead to an asymptotic factor-of-two slowdown compared with the optimal ordering.

The obvious fix is to add a test to see if readers should be reordered in wait-order.
However, since the effect is so rare with four concurrent threads, it is extremely hard (if not impossible) to add the check with a low enough overhead that the result is a performance win. Thus it seems that the problem of exactly which algorithm is best will need to be revisited when larger multicore machines become common.
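
To make the bookkeeping of the read-write ticket lock concrete, here is a minimal single-threaded sketch that steps the three counters through a reader phase followed by a writer phase. It reuses the lock routines from this article, with a few assumptions of its own: the byte layout presumes a little-endian machine, cpu_relax() falls back to a plain compiler barrier on non-x86 targets, the unlock path stores through the union's `us` member instead of a pointer cast, and `rwticket_walkthrough()` is a hypothetical helper written only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* gcc builtins, named as in the rest of the article */
#define atomic_xadd(P, V) __sync_fetch_and_add((P), (V))
#define cmpxchg(P, O, N)  __sync_val_compare_and_swap((P), (O), (N))
#define atomic_inc(P)     __sync_add_and_fetch((P), 1)
#define barrier()         __asm__ __volatile__("" : : : "memory")

/* Assumption: use a plain compiler barrier where "pause" does not exist */
#if defined(__x86_64__) || defined(__i386__)
#define cpu_relax() __asm__ __volatile__("pause\n" : : : "memory")
#else
#define cpu_relax() barrier()
#endif

/* Assumes little-endian: write is the lowest byte of u, users the third */
typedef union rwticket rwticket;
union rwticket {
    unsigned u;
    unsigned short us;
    __extension__ struct {
        unsigned char write;  /* ticket currently allowed to write */
        unsigned char read;   /* ticket currently allowed to read */
        unsigned char users;  /* next ticket to hand out */
    } s;
};

static void rwticket_rdlock(rwticket *l)
{
    unsigned me = atomic_xadd(&l->u, (1 << 16));
    unsigned char val = me >> 16;

    while (val != l->s.read) cpu_relax();
    l->s.read++;  /* lets a reader holding the next ticket in at once */
}

static void rwticket_rdunlock(rwticket *l)
{
    atomic_inc(&l->s.write);
}

static int rwticket_wrtrylock(rwticket *l)
{
    unsigned me = l->s.users;
    unsigned char menew = me + 1;
    unsigned read = l->s.read << 8;
    unsigned cmp = (me << 16) + read + me;
    unsigned cmpnew = (menew << 16) + read + me;

    if (cmpxchg(&l->u, cmp, cmpnew) == cmp) return 0;
    return EBUSY;
}

static void rwticket_wrunlock(rwticket *l)
{
    rwticket t = *l;

    barrier();
    t.s.write++;
    t.s.read++;
    l->us = t.us;  /* stores write and read together; users is untouched */
}

/* Hypothetical helper: checks the counters after each step, returns l.u */
unsigned rwticket_walkthrough(void)
{
    rwticket l;
    int rc;

    memset(&l, 0, sizeof(l));

    /* Two readers take tickets 0 and 1 and hold the lock together */
    rwticket_rdlock(&l);
    rwticket_rdlock(&l);
    assert(l.s.users == 2 && l.s.read == 2 && l.s.write == 0);

    /* A writer cannot get in while the readers are active */
    rc = rwticket_wrtrylock(&l);
    assert(rc == EBUSY);

    /* Each reader release advances the write counter */
    rwticket_rdunlock(&l);
    rwticket_rdunlock(&l);
    assert(l.s.write == 2);

    /* write now equals users, so a writer can take ticket 2 */
    rc = rwticket_wrtrylock(&l);
    assert(rc == 0);
    assert(l.s.users == 3 && l.s.read == 2 && l.s.write == 2);

    /* Releasing the write lock advances read and write together */
    rwticket_wrunlock(&l);
    assert(l.s.users == 3 && l.s.read == 3 && l.s.write == 3);

    return l.u;
}
```

Calling rwticket_walkthrough() from a test program should pass every assertion. The point to take away is that the lock is free exactly when all three counters agree, that readers advance `read` on entry and `write` on exit, and that a writer advances both on exit.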
