A mutex is a common type of lock used to serialize concurrent access by multiple

threads to shared resources. While support for POSIX mutexes in the QNX Neutrino

Realtime OS dates back to the early days of the system, this area of the code

has seen considerable changes in the last couple of years.

This article is written by Elad Lahav, Software Developer, QNX Software Systems Limited.

Introduction

Multi-threaded programming allows software developers to break up a process into

multiple, concurrent streams of execution. There are several good (and various

bad) reasons for using multiple threads within a program:

- Achieving true parallelism by executing different threads on different processor cores.

- Adhering to real-time constraints with priority-based pre-emption.

- Simulating asynchronous I/O in the presence of blocking calls.

Almost every multi-threaded program must deal with the problem of sharing data

among the threads. Access to data by different threads needs to be serialized in

order to preserve data integrity. Even the most trivial of data

manipulations, such as incrementing a variable by one and reading its value, is

prone to race conditions in the presence of multiple threads that can access

that data. These race conditions can, and often will, lead to incorrect

execution.

There are various techniques to ensure serialization, which is the act of

governing access to data by multiple threads. Most modern processors provide

atomic operations that allow for integrity-preserving manipulation of data, such

as test-and-set and compare-and-exchange. These atomic operations, however, are

limited to well-defined data types, typically those that can fit within a single

register. Nevertheless, such operations are useful as building blocks

for other serialization mechanisms, most notably for various lock implementations.
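
As an illustration of compare-and-exchange used as a building block, the following is a minimal sketch that uses C11 atomics (an assumption; the article does not name a specific API). The names cas_increment() and run_cas_demo() are hypothetical. Several threads increment a shared counter, each retrying its compare-and-exchange until no other thread has changed the value in between.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 4
#define INCREMENTS  100000

static atomic_int counter;

/* Increment *value using an explicit compare-and-exchange loop. */
static void cas_increment(atomic_int *value)
{
    int expected = atomic_load(value);

    /* On failure, 'expected' is refreshed with the current value, so the
       loop retries with up-to-date data until the exchange succeeds. */
    while (!atomic_compare_exchange_weak(value, &expected, expected + 1)) {
    }
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        cas_increment(&counter);
    }
    return NULL;
}

/* Runs the worker threads and returns the final counter value. */
int run_cas_demo(void)
{
    pthread_t threads[NUM_THREADS];

    atomic_store(&counter, 0);
    for (int i = 0; i < NUM_THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    return atomic_load(&counter);
}
```

Because every increment is applied atomically, no update is lost and run_cas_demo() returns NUM_THREADS * INCREMENTS. The same loop on a plain int would be a data race.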

A spin-lock is the simplest lock, requiring only one shared bit. A thread

that wishes to access a piece of shared data can check whether the bit is set,

and, if not, acquire the lock by setting it. This is done in a loop that spins as

long as the bit is set, i.e., as long as the lock is held. The operation of

testing whether the bit is set and setting it if not itself needs to be atomic,

in order to prevent race conditions among threads contending for the lock.
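
The spin-lock described above can be sketched with the C11 atomic_flag type, whose test-and-set operation atomically sets the bit and reports whether it was already set. This is only an illustration under that assumption; struct demo_spinlock and the function names are hypothetical and not part of any QNX API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 4
#define INCREMENTS  50000

struct demo_spinlock {
    atomic_flag held;   /* the single shared bit */
};

static struct demo_spinlock lock = { ATOMIC_FLAG_INIT };
static long shared_total;   /* protected by 'lock' */

static void demo_spin_lock(struct demo_spinlock *l)
{
    /* Test-and-set returns the previous value of the flag: keep spinning
       for as long as the flag was already set, i.e., the lock is held. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
    }
}

static void demo_spin_unlock(struct demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        demo_spin_lock(&lock);
        shared_total++;
        demo_spin_unlock(&lock);
    }
    return NULL;
}

/* Runs the worker threads and returns the final total. */
long run_spin_demo(void)
{
    pthread_t threads[NUM_THREADS];

    shared_total = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    return shared_total;
}
```

Note that a waiting thread in demo_spin_lock() consumes processor time for the whole wait, since nothing tells the scheduler that the thread cannot make progress.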

While spin-locks are indispensable in certain scenarios (mostly on SMP systems

in contexts that cannot block, such as interrupt service routines), these locks

are rarely used in (correct) multi-threaded programs. Since the OS scheduler is

unaware that the thread is waiting for a lock to be released, it may take time

before the spinning thread is pre-empted and the thread currently holding the

lock is scheduled. On single-processor systems, if the spinning thread has a

higher priority than the current lock owner, the former can end up spinning

forever. Spinning is also bad for power consumption.

Most operating systems provide a locking primitive that allows a thread to block

while waiting for the lock to become available. Such a lock is commonly referred

to as a mutex. Blocking a thread requires that the OS

scheduler be aware of the lock, avoid scheduling a waiting thread while the lock

is held by another thread, and re-schedule the waiter once the lock can be acquired. POSIX

defines the pthread_mutex_t type and a set of functions to initialize,

lock, unlock and destroy an object of this type. The implementation of the mutex

data type and functions is the subject of the following sections.
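
For reference, a minimal sketch of the four POSIX operations in use. The struct stats type and the function names are hypothetical; the example protects two fields that must change together, which is more than a single atomic operation can cover.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4
#define UPDATES     50000

/* Two fields that must always be updated together. */
struct stats {
    pthread_mutex_t lock;
    long count;
    long sum;       /* invariant: sum == 2 * count */
};

static void *updater(void *arg)
{
    struct stats *s = arg;

    for (int i = 0; i < UPDATES; i++) {
        pthread_mutex_lock(&s->lock);
        s->count += 1;
        s->sum += 2;
        pthread_mutex_unlock(&s->lock);
    }
    return NULL;
}

/* Returns the final count, or -1 if the invariant was broken. */
long run_mutex_demo(void)
{
    struct stats s = { .count = 0, .sum = 0 };
    pthread_t threads[NUM_THREADS];

    /* A NULL attribute pointer selects the default mutex attributes. */
    int rc = pthread_mutex_init(&s.lock, NULL);
    assert(rc == 0);

    for (int i = 0; i < NUM_THREADS; i++) {
        rc = pthread_create(&threads[i], NULL, updater, &s);
        assert(rc == 0);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    (void)rc;

    pthread_mutex_destroy(&s.lock);
    return (s.sum == 2 * s.count) ? s.count : -1;
}
```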

Mutexes in the QNX Neutrino OS

The QNX Realtime OS is a POSIX-certified operating system and, as

such, provides a complete implementation of the POSIX Threads (pthread)

API, including mutexes. The default mutex object in the QNX Neutrino OS (i.e., one that

is statically initialized with PTHREAD_MUTEX_INITIALIZER or via a call

to pthread_mutex_init() with a NULL mutex attribute pointer) is:

Fast: Operations on uncontested locks do not require any kernel calls.

Lightweight: The cost of declaring a mutex is 8 bytes in the declaring process’ address space. No kernel resources are required unless a thread is blocked on a locked mutex.

Priority inheriting: The default mutex implements the PTHREAD_PRIO_INHERIT protocol for priority inheritance. Other protocols (PTHREAD_PRIO_PROTECT and PTHREAD_PRIO_NONE) are also supported.

Non-recursive: An attempt to lock a mutex more than once by the same thread results in an error.

Shared: A mutex placed in memory that is visible to multiple processes (e.g., memory obtained by mapping a shared-memory object) can be operated on by these processes as a single, common, lock.
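
The non-recursive behaviour can be observed with a short sketch. POSIX leaves relocking a default mutex undefined, so to keep the example portable beyond the QNX Neutrino OS it requests the standard PTHREAD_MUTEX_ERRORCHECK and PTHREAD_MUTEX_RECURSIVE types explicitly, rather than relying on the default described above. The helper relock_result() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Locks a mutex of the given type twice from the same thread and returns
   the result of the second pthread_mutex_lock() call. */
int relock_result(int type)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, type);
    assert(rc == 0);
    rc = pthread_mutex_init(&mutex, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&mutex);
    assert(rc == 0);
    (void)rc;

    int second = pthread_mutex_lock(&mutex);

    /* A recursive mutex must be unlocked once per successful lock. */
    if (second == 0) {
        pthread_mutex_unlock(&mutex);
    }
    pthread_mutex_unlock(&mutex);
    pthread_mutex_destroy(&mutex);
    return second;
}
```

With an error-checking mutex the second lock attempt returns EDEADLK instead of blocking, while a recursive mutex accepts it and returns 0.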

The implementation of mutexes is split between the C library and the

QNX Neutrino micro-kernel.

C Library Implementation

The C library provides the POSIX-defined functions for handling mutexes

(pthread_mutex_init(), pthread_mutex_lock(), etc.), as well as

wrappers for the relevant kernel calls.

User code creates a mutex by declaring a variable of type pthread_mutex_t. As mentioned above, this type is an 8-byte structure,

common to mutexes, semaphores and condition variables. The structure holds the

current owner of the mutex (a system-wide identifier of the thread that had last

locked the mutex successfully), a counter for recursive mutexes and various flag

bits. One bit in the owner field is used to indicate whether other threads are

currently waiting for the mutex. The variable can be initialized statically, or

with a call to pthread_mutex_init().

A call to pthread_mutex_lock() will attempt to lock the mutex by

performing an atomic compare-and-exchange operation on the variable’s owner

field, looking for a current value of 0 (no owner). If the operation succeeds,

then the owner field is updated with the calling thread’s system-unique ID, and

the mutex is now considered locked. No kernel intervention is required in this

case. The atomic operation will fail if the owner field has a value other than

0, which can happen in the following cases:

1. The mutex is locked by this thread.

2. The mutex is locked by another thread.

3. There are other threads waiting for the mutex.

The first case is handled by the C library code (recursive mutexes increment

their lock count, non-recursive mutexes return an error). The other two are handled by

invoking the SyncMutexLock() kernel call.

Unlocking the mutex with a call to pthread_mutex_unlock() is again

handled by a compare-and-exchange operation, which attempts to replace the

calling thread’s ID with the value 0. The operation will fail if:

1. The mutex is locked by another thread.

2. There are other threads blocked on the mutex.

The first case returns an error, while the second invokes the SyncMutexUnlock() kernel call.
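
The user-mode lock and unlock paths described above can be sketched as follows for a non-recursive mutex. This is not the actual C library code: the struct layout, the WAITERS_BIT value, the SKETCH_NEED_KERNEL return code and the way thread IDs are passed in are all assumptions made for illustration. Where the sketch returns SKETCH_NEED_KERNEL, the real library would invoke SyncMutexLock() or SyncMutexUnlock().

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

#define WAITERS_BIT        0x80000000u  /* hypothetical "threads are waiting" flag */
#define SKETCH_NEED_KERNEL (-1)         /* hypothetical: a kernel call is required */

struct sketch_mutex {
    _Atomic uint32_t owner;   /* 0 when unlocked, else owner ID plus flag bit */
};

/* 'self' stands in for the caller's system-unique thread ID. */
int sketch_lock(struct sketch_mutex *m, uint32_t self)
{
    uint32_t expected = 0;

    assert(self != 0 && (self & WAITERS_BIT) == 0);

    /* Fast path: the owner field is 0, so claim the mutex in user mode. */
    if (atomic_compare_exchange_strong_explicit(&m->owner, &expected, self,
            memory_order_acquire, memory_order_relaxed)) {
        return 0;
    }
    /* 'expected' now holds the value that was found in the owner field. */
    if ((expected & ~WAITERS_BIT) == self) {
        return EDEADLK;             /* already locked by this thread */
    }
    return SKETCH_NEED_KERNEL;      /* locked by another thread, or waiters exist */
}

int sketch_unlock(struct sketch_mutex *m, uint32_t self)
{
    uint32_t expected = self;

    /* Fast path: owner is exactly 'self' with no waiters bit set. */
    if (atomic_compare_exchange_strong_explicit(&m->owner, &expected, 0,
            memory_order_release, memory_order_relaxed)) {
        return 0;
    }
    if ((expected & ~WAITERS_BIT) != self) {
        return EPERM;               /* not locked by the calling thread */
    }
    return SKETCH_NEED_KERNEL;      /* waiters bit is set: a waiter must be woken */
}
```

The acquire ordering on a successful lock and the release ordering on a successful unlock play the role of the memory barriers that any such fast path needs around the critical section.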

It can be seen from the description above that an uncontested mutex, i.e., a

mutex that is locked and unlocked by the same thread without any other thread

trying to acquire it at the same time, is handled completely in user mode,

without any kernel intervention. The only overhead is that of a function call

and an atomic operation. While not free (the atomic operation impacts the bus

and memory barriers are required after acquiring and before releasing the

mutex), this operation is orders of magnitude cheaper than a call that requires

a trap into the kernel. Nevertheless, since the information stored in the pthread_mutex_t structure is essential to the correct operation of the

mutex, care must be taken within the kernel when the values of the structure are

read and used. For example, the owner field may be in a state of flux until the

bit indicating waiting threads is set, and all processors are aware of

that. Moreover, bad values in this structure written by a faulty or malicious

process should be handled properly. Such values should cause the kernel calls to

return an error or, in the worst case scenario, cause the process that wrote

them to malfunction, without affecting the kernel or other processes.

Kernel Implementation

The two most important kernel calls dealing with mutexes are SyncMutexLock() and SyncMutexUnlock(). As described in the previous

section, these are invoked when the C library is unable to deal with the mutex

by itself. There are other calls to initialize a mutex to non-default

attributes, assign a priority to a priority-ceiling mutex, associate an event

with a locked mutex whose owner dies unexpectedly, and more.

Since an attempt to lock a mutex may lead to the calling thread blocking, the

kernel needs to maintain a list of threads waiting on the mutex. The list is

sorted first by priority and then on a first-come-first-served basis. When a

mutex that is waited on by other threads is unlocked (as indicated by a special

bit in the user-mode owner field) the kernel will choose the next thread to wake

up and attempt to lock the mutex.
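
The ordering of the waiting list can be illustrated with a short sketch: a new waiter is placed after every waiter of equal or higher priority, which gives priority order first and arrival order within a priority. The struct waiter type and the function names are hypothetical, and a singly-linked list is assumed only for brevity.

```c
#include <assert.h>
#include <stddef.h>

struct waiter {
    int priority;          /* higher value means higher priority */
    int id;                /* identifies the thread in this sketch */
    struct waiter *next;
};

/* Inserts 'w' behind all waiters whose priority is greater than or equal
   to its own, preserving first-come-first-served order among equals. */
void waitlist_insert(struct waiter **head, struct waiter *w)
{
    struct waiter **link = head;

    assert(w != NULL);
    while (*link != NULL && (*link)->priority >= w->priority) {
        link = &(*link)->next;
    }
    w->next = *link;
    *link = w;
}

/* Removes and returns the next waiter to wake up, or NULL if none. */
struct waiter *waitlist_pop(struct waiter **head)
{
    struct waiter *first = *head;

    if (first != NULL) {
        *head = first->next;
        first->next = NULL;
    }
    return first;
}
```

For example, inserting waiters with priorities 10, 20, 10 and 30, in that order, yields a wake-up order of 30, 20, then the two priority 10 waiters in their order of arrival.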

This design has two important implications:

1. The list of waiting threads requires a kernel object to serve as its head, as a user-mode object cannot be used to store kernel pointers.

2. The kernel needs to be able to identify this object from the user-mode mutex.

The kernel object is a sync_entry structure which serves as the head of

the waiting threads list. The association between the user-mode pthread_mutex_t object and the kernel’s sync_entry object is

accomplished by a hash table. However, since mutexes in QNX are shared by

default, we cannot use the user-mode object’s virtual address as the key to the

hash table. Instead, the kernel consults the memory manager, which provides a

globally-unique handle for the object. This handle is used as a key to the hash

function.
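
A rough sketch of such a lookup is shown below. It assumes the handle is a 64-bit integer already obtained from the memory manager, and uses a fixed-size chained hash table; the struct sync_entry_sketch type, the bucket count and the function names are hypothetical and do not reflect the kernel's actual data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 64

struct sync_entry_sketch {
    uint64_t handle;                       /* globally-unique key for the mutex */
    struct sync_entry_sketch *hash_next;   /* next entry in the same bucket */
};

static struct sync_entry_sketch *buckets[NUM_BUCKETS];

static size_t bucket_of(uint64_t handle)
{
    return (size_t)(handle % NUM_BUCKETS);
}

/* Returns the entry associated with 'handle', or NULL if none exists. */
struct sync_entry_sketch *sync_lookup(uint64_t handle)
{
    struct sync_entry_sketch *e = buckets[bucket_of(handle)];

    while (e != NULL && e->handle != handle) {
        e = e->hash_next;
    }
    return e;
}

/* Links 'e' at the front of its bucket. */
void sync_insert(struct sync_entry_sketch *e)
{
    size_t b;

    assert(e != NULL);
    b = bucket_of(e->handle);
    e->hash_next = buckets[b];
    buckets[b] = e;
}

/* Unlinks 'e' from its bucket, if present. */
void sync_remove(struct sync_entry_sketch *e)
{
    struct sync_entry_sketch **link;

    assert(e != NULL);
    link = &buckets[bucket_of(e->handle)];
    while (*link != NULL && *link != e) {
        link = &(*link)->hash_next;
    }
    if (*link == e) {
        *link = e->hash_next;
    }
}
```

Because the key is the handle rather than a virtual address, two processes that map the same mutex at different addresses would reach the same entry.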

Other than holding the head of the waiters list, the sync_entry structure

contains pointers that allow it to be linked on two other lists: the hash table

bucket and a list of mutexes locked by a thread.

Traditionally, a sync_entry object was allocated for every mutex upon a

call to pthread_mutex_init() (or the first call to pthread_mutex_lock() for statically-initialized mutexes), and persisted

until the mutex was deleted with a call to pthread_mutex_destroy(). While this approach is usually acceptable, we

have run into cases where too much kernel memory was tied into these

objects (consider a database holding millions of records with a mutex

associated with each record). To overcome the problem, we made the observation

that, in most cases, the kernel object is required only when threads are

actually blocked waiting for the mutex to be released. The implementation was

therefore changed such that a kernel object is allocated when a thread blocks on

it, and is freed as soon as the last waiter is woken up. Such a strategy,

however, opens up the possibility that kernel memory will not be available when

a thread needs to block, which means that a call to lock a mutex can fail through no

fault of the caller. To avoid this, a pool of sync_entry objects is used,

with a new object reserved each time a thread is created (and unreserved when it

is destroyed). Since a thread cannot block on more than one mutex at a time,

this pool guarantees that an object is available whenever it is required. The

exceptions to this use of dynamic objects are robust mutexes (those whose owner

value needs to be updated if the locking process dies unexpectedly) and

priority-ceiling mutexes, where the priority is held by the sync_entry

structure. Nevertheless, such objects are considerably less common than the

default non-robust, priority-inheriting mutexes, and dynamic allocation works

well in practice.
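
The reservation scheme can be sketched as a simple free list: one object is added when a thread is created and one is removed when a thread is destroyed, so that taking an object when a thread blocks never needs to allocate memory. The types and function names below are hypothetical, and malloc() stands in for the kernel's allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct pool_entry {
    struct pool_entry *next;
};

struct entry_pool {
    struct pool_entry *free_list;
};

/* Called at thread creation. A failure here fails the creation of the
   thread, rather than a later attempt to lock a mutex. */
int pool_reserve(struct entry_pool *p)
{
    struct pool_entry *e = malloc(sizeof *e);

    if (e == NULL) {
        return -1;
    }
    e->next = p->free_list;
    p->free_list = e;
    return 0;
}

/* Called at thread destruction. A thread that is being destroyed is not
   blocked on a mutex, so at least one object must be free. */
void pool_unreserve(struct entry_pool *p)
{
    struct pool_entry *e = p->free_list;

    assert(e != NULL);
    p->free_list = e->next;
    free(e);
}

/* Called when the first waiter blocks on a mutex. Performs no allocation. */
struct pool_entry *pool_get(struct entry_pool *p)
{
    struct pool_entry *e = p->free_list;

    if (e != NULL) {
        p->free_list = e->next;
        e->next = NULL;
    }
    return e;
}

/* Called when the last waiter on a mutex is woken up. */
void pool_put(struct entry_pool *p, struct pool_entry *e)
{
    e->next = p->free_list;
    p->free_list = e;
}
```

With two threads reserved, two concurrent pool_get() calls succeed; a third returns NULL in this sketch, a situation that cannot arise when each thread blocks on at most one mutex.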

The implementation of SyncMutexLock() goes through the following steps:

1. Set the waiters bit in the owner field of the user-mode variable, to force other threads into the kernel.

2. Check whether the mutex is currently locked by inspecting the owner field (it could have been unlocked since the time the C library code decided to invoke the kernel call). If not locked, and if higher-priority threads are not on the mutex waiting list, the owner field is set and the kernel call returns.

3. Look up a sync_entry object in the hash table for the user-mode variable. If one does not yet exist, a new object is allocated and put in the hash table and on the list of mutexes held by the current owner thread.

4. Add the calling thread to the sync_entry waiting list and mark it as blocked on a mutex.

5. Adjust the priority of the current owner, if needed.

SyncMutexUnlock() looks up the sync_entry object for the user-mode

mutex in the hash table, removes the first waiter from the list and moves it to

a ready state. Information stored in the thread structure will cause the thread

to try and acquire the mutex when it is next scheduled to run. If that thread is

the last one waiting on the sync_entry object, that object is freed back

to the pool.

Priority Considerations

The correct assignment and handling of thread priorities is essential for

achieving real-time behaviour in an operating system. Unfortunately, the use of

mutexes in multi-threaded code can easily lead to the break-down of an otherwise

properly-designed priority-based system: when a high-priority thread blocks on a

mutex held by a lower-priority thread, it cannot make any forward progress until

the low-priority activity reaches the point of releasing the mutex. Thus, the

priority of the waiter is effectively lowered to that of the owner, a situation

known as priority inversion.

POSIX mutexes provide three protocols for dealing with priority inversion:

Priority inheritance: The priority of the owner thread is at least that of the highest-priority thread waiting on the mutex.

Priority ceiling: Each mutex employing this protocol is associated with a fixed priority. The priority of the owner thread is at least that associated with the mutex.

None: The mutex has no impact on the owner’s priority, and the user accepts the possibility of priority inversion.
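
In application code, the protocol is selected through the mutex attribute object, using the standard pthread_mutexattr_setprotocol() and pthread_mutexattr_setprioceiling() functions. The following is a minimal sketch; the helper names configure_protocol() and init_mutex_with_protocol() are hypothetical, and the range of valid ceiling values depends on the system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

/* Sets the protocol (PTHREAD_PRIO_INHERIT, PTHREAD_PRIO_PROTECT or
   PTHREAD_PRIO_NONE) on an initialized attribute object. The 'ceiling'
   argument is used only for PTHREAD_PRIO_PROTECT. Returns 0 or an
   error number. */
int configure_protocol(pthread_mutexattr_t *attr, int protocol, int ceiling)
{
    int rc;

    assert(attr != NULL);
    rc = pthread_mutexattr_setprotocol(attr, protocol);
    if (rc == 0 && protocol == PTHREAD_PRIO_PROTECT) {
        rc = pthread_mutexattr_setprioceiling(attr, ceiling);
    }
    return rc;
}

/* Initializes 'mutex' with the requested protocol. Returns 0 or an
   error number. */
int init_mutex_with_protocol(pthread_mutex_t *mutex, int protocol, int ceiling)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0) {
        return rc;
    }
    rc = configure_protocol(&attr, protocol, ceiling);
    if (rc == 0) {
        rc = pthread_mutex_init(mutex, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

A caller would, for instance, pass PTHREAD_PRIO_PROTECT and a ceiling at least as high as the priority of every thread that locks the mutex, and should check the returned error number, since not every system supports every protocol.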

To facilitate both the design and implementation of these protocols, it is

easier to associate a priority with each mutex: the priority of a

priority-inheritance mutex is that of the highest-priority waiter, that of a

priority-ceiling mutex is the fixed ceiling value and that of a “none” mutex

is 0. We can now define the mutex-induced priority of a thread as the

maximal priority of all mutexes the thread currently holds. In the QNX Realtime OS, a

thread’s effective priority is the maximum of its base priority (the one set

when the thread is created), its client-inherited priority (if receiving a

message from another thread) and its mutex-induced priority.

Figure 1: The in-kernel representation of a list of mutexes held by a thread (T1),

each of which is waited on by one or more threads. The mutexes are sorted in

priority order on thread T1's list, and threads are sorted in priority order on

the waiting list of each mutex.

Figure 1 depicts four mutexes of different protocols held by a

single thread. The

mutexes are sorted by their priorities: M1 is a priority-inheritance mutex

with a priority 20 waiter, M2 is a priority-ceiling mutex with a ceiling

value of 11 (note that the priority 30 waiter has no impact on a

priority-ceiling mutex), M3 is a priority-inheritance mutex with a priority

10 waiter and M4 is a “none” mutex, so is associated with a priority of

0. Since the base priority of T1 is 10, and since its mutex-induced priority

is 20, the effective priority of this thread is 20.
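
The calculation can be written out as a short sketch. The types and function names are hypothetical and the kernel's own representation is certainly different; the point is only the rule of taking the maximum over the base priority, the client-inherited priority and the priorities of all held mutexes.

```c
#include <assert.h>
#include <stddef.h>

enum protocol { PROTO_NONE, PROTO_INHERIT, PROTO_CEILING };

struct held_mutex {
    enum protocol protocol;
    int ceiling;       /* used only by PROTO_CEILING */
    int top_waiter;    /* priority of the highest-priority waiter, 0 if none */
};

/* The priority associated with a mutex, as defined for each protocol. */
int mutex_priority(const struct held_mutex *m)
{
    switch (m->protocol) {
    case PROTO_INHERIT:
        return m->top_waiter;
    case PROTO_CEILING:
        return m->ceiling;
    default:
        return 0;
    }
}

/* Maximum of the base priority, the client-inherited priority and the
   mutex-induced priority over the 'n' mutexes currently held. */
int effective_priority(int base, int client_inherited,
                       const struct held_mutex *held, size_t n)
{
    int prio = base > client_inherited ? base : client_inherited;

    assert(held != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int p = mutex_priority(&held[i]);
        if (p > prio) {
            prio = p;
        }
    }
    return prio;
}
```

With the values of Figure 1 (an inheritance mutex with a priority 20 waiter, a ceiling mutex with ceiling 11 and a priority 30 waiter, an inheritance mutex with a priority 10 waiter and a “none” mutex), a base priority of 10 and, as an assumption, no client-inherited priority, effective_priority() returns 20.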

The mutex-induced priority of T1 can change every time a thread blocks or

unblocks on one of the mutexes it holds. If a priority 30 thread blocks on

M3, the priority of that mutex becomes 30, requiring the effective priority of

T1 to become 30. Conversely, if T2 stops waiting for M1 (e.g., because

of a timeout on a pthread_mutex_timedlock() call or because its process

exited abnormally), then the new highest-priority mutex held by T1 becomes

M2, and the priority of T1 is lowered to 11. Furthermore, any change to the

priority of T1 can have an impact on the priority of other threads: if T1

is currently waiting on another thread to reply to a message, that thread (the

server) needs to inherit the newly adjusted priority of T1. If T1 is

currently blocked on another priority-inheriting mutex, the priority of that

mutex needs to be adjusted, with potential implications for the priority of its

owner. The kernel is responsible for transitively adjusting all of these

priorities.
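
The transitive part of this adjustment can be sketched for the simplest case, in which only priority-inheritance mutexes are involved and priorities are only raised. The sk_thread and sk_mutex types and the function names are hypothetical; lowering priorities, message passing and the other protocols are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

struct sk_thread;

struct sk_mutex {
    struct sk_thread *owner;       /* NULL when unlocked */
};

struct sk_thread {
    int base;                      /* priority assigned to the thread */
    int inherited;                 /* priority inherited through mutexes, 0 if none */
    struct sk_mutex *blocked_on;   /* mutex the thread waits for, or NULL */
};

static int effective_of(const struct sk_thread *t)
{
    return t->base > t->inherited ? t->base : t->inherited;
}

/* Called after 'waiter' has blocked on waiter->blocked_on. Walks the chain
   of owners, raising each one to the waiter's effective priority, and stops
   at the first owner that already runs at that priority or higher. */
void propagate_priority(struct sk_thread *waiter)
{
    assert(waiter != NULL);
    while (waiter->blocked_on != NULL && waiter->blocked_on->owner != NULL) {
        struct sk_thread *owner = waiter->blocked_on->owner;
        int prio = effective_of(waiter);

        if (effective_of(owner) >= prio) {
            break;
        }
        owner->inherited = prio;
        waiter = owner;            /* the owner may itself be blocked */
    }
}
```

For example, if a priority 30 thread blocks on a mutex owned by a priority 10 thread, which is in turn blocked on a mutex owned by a priority 5 thread, both owners end up with an effective priority of 30.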

Conclusion

While a basic mutex lock is fairly easy to implement in a kernel, performance

considerations and support for a wide variety of features require a robust

design and a careful implementation. In particular, since mutexes are the

governors of concurrent access to shared resources, care must be taken to

account for potential race conditions within the code.

For a real-time operating system, it is imperative that the kernel provide a

solution for the priority inversion problem associated with mutexes. The

solution needs to deal with threads holding multiple, potentially heterogeneous,

mutexes, with threads of different priorities blocking on mutexes, and with

threads leaving the wait queues unexpectedly. The concept of mutexes associated

with priorities facilitates both the design and the implementation of the mutex

code in the QNX Neutrino microkernel.