Common threading mistakes

I though I would put together a list of things I have seen over the years of some obvious and some not so obvious issues when using threads on Linux and c or c++. At lot of these issues will also apply to systems based around POSIX as well. Some are really obvious and well know some other can be rather more concealed issues. How many have to encountered? If I have missed any more obvious ones please let me know and i will add them to the list.

Fork

Fork has a few issues with threads. When fork is called it splits all the global state of a process it can cause some serious issues in both the child and in the parent process.

The main issue that most people should be familiar with is that when the heap is copied so is the state of all the current locks from all threads. Typically in a multi threaded program it is not possible to know the state of these locks. The extremely common case would be after the fork calling malloc can hang for example. Many people have tried different work arounds by implementing things like a mutex buster for something like malloc in order to reset the locks. However since the locks are normally held for a good reason. Then using what they were protecting are highly likely to cause a crash anyway.

A slightly more interesting issue happens with file descriptors. Since fork is normally used to start a new process in a combination of fork + exec. Many file descriptors during the exec call may be closed because they have the close on exec flag set. This can be important because if you have a socket in the parent process. If you don't mark the socket as close on execute then the child process can hold the client connected even if the parent process has called close on the fd. This of course is great in theory until you realise that almost all the calls that create file descriptors like (open, accept, socket, etc..) cannot set the close on execute flag it has to be set by fnctl after you get the file descriptor. This of course creates a race window. This is often somewhat of a minor problem and normally goes silently undetected in a threaded program using fork and the solution is of course to close any file descriptors immediately after the fork.

pthread_cond_signal / pthread_cond_broadcast

It can be common that people attempt to use this function without taking the associated lock that was used on the call to pthread_cond_wait. Don't do it the sleeping thread won't always wake up. The lock needs to be held because there are subtle races in pthread_cond_wait if the lock isn't held. When calling without the lock held the wake up condition will be ignored and missed. I have previously worked in a development team that the tech lead did not understand this and that the pthread_cond_wait function was actually buggy. So it became the normal standard in the team to never use pthread_cond_wait and use pthread_cond_timedwait with a reasonably small timeout for the particular cases at hand. Simply because they did not understand this issue. The same misuse also occurs with pthread_cond_broadcast. The problem can also be detected by using valgrind with the hellgrind plugin.

pthread_cond_broadcast

This isn't so much the case of an undefined behaviour problem but more of a performance problem. When using pthread_cond_broadcast rather than pthread_cond_wait. When you have a lot of sleeping threads on a pthread_cond this function is design to wake all of them. This has sometimes has the unfortunate issue of cause a thundering herd effect with large numbers of threads waking up checking their conditions and going back to sleep again.

pthread_cond_timedwait

API bugs are not great fun when they make it into the posix spec and this is a classic case which several other languages / environments also implemented the exact same bug. Its really is a subtle issue which can cause chaos. The issue being that the timeout value is specified as an absolute value of the systems time. Apparently this is to make it easier or to prevent recalculating the timeout value during a false positive wake up. However by using the systems time somebody must have forgotten that the time on the system can change and in some systems this can be quite significant. An example of this would be adding one minute to 12:00 and having a one minute timeout and the system clock being changed backwards by one hour leaving an actual timeout value of sixty one minutes. The reserve of course happens when the clock goes forward which can give an effective timeout value of 0.

The actual workaround here is to use "pthread_condattr_setclock(&cattr, CLOCK_MONOTONIC)" to switch the cond timeout value to use a monotonic time. Which means every times becomes a relative time since system boot.

dup2

This is a fun function as people think its thread safe which in fairness it actually is to quote the man page "The steps of closing and reusing the file descriptor newfd are performed atomically". The problem occurs because all the other file descriptor creation functions are not thread safe with it. The goal of dup2 is to have a file descriptor end up on a certain number. Its often used around the time of a fork / exec call in order to communicate between the parent and child processes. Though I have seen it misused for other reasons in the past. What really happens with this function is that it does an atomic close + dup of an fd. So if it is called with something like dup(open(), 1); You can re-open your stdout file descriptor. So this becomes really fun when another thread (call it thread1) also creates a file descriptor at the same time using open, socket, accept etc... Of course these will return the lowest numbered file descriptor eg "1" then the dup2 happens on "thread2" this will then atomically close and open the file descriptor 1. Of course thread1 isn't any the wiser. So at this point both thread are now unexpectedly sharing the same file descriptor. Often silent data corruption follows.

Return from main

There really isn't anything special about this. Other than when you return from main and you have not cleaned up other threads in the program expect random crashes. This is kind of obviously because libc will run its "atexit()" handlers which will remove a significant part of the running environment from under the running threads. Something that this really indicated is a serious mis management of threads in your program if it is not able to clean up the threads that it created.

Signals

Signals are bad enough for most people without threads. But when you introduce threads things really get crazy. They are typically best avoided since they will blow your program up randomly in all sorts of ways unless handled in a very specific way. The major problem here start to occur because you get some signals delivered to the program or some signals delivered to a specific thread. When a signal is delivered to a program it can actually arrive on any thread. The fun part here is that if you use sigprocmask to block the signals on a specific thread to protect a critical section the signal can simply arrive on another thread which doesn't have the signal blocked is which is effectively by passing the signal block of critical section of code your trying to protect.

The only thing that really works for signals that I have found out is at the very start of the program block all the signals you want to handle. Then any threads created after this will inherit the same signal mask. Then create a thread and use a sigwaitinfo or a signalfd to read the pending signals from the kernel. This is also great for other reasons since the signals handlers are now not restricted to using signal safe functions. However even this can cause issues such as blocking SIGTRAP prevents debuggers like strace / gdb and such things working on your program. So your pritty much limited to SIGHUP, SIGTERM, SIGUSR1, SIGUSR2, SIGPIPE, SIGCHILD, SIGALARM

Real world bugs I have seen include such things as

Calling pthread_cond_signal from a signal handler

Trying to take a lock in a signal handler.

Taking a recursive lock in a signal handler (the bug fix for the prior issue!)

Legacy libc

Using legacy non reentrant libc functions is unfortunately still a real issue. Particularly when using 3rd party libs or if you have older developers. When they tend to use functions like strtok, gethostbyname instead of the _R version like strtok_r which are not thread safe. These of course can cause really strange behaviour and if your not particularly looking for this problem it can be somewhat hard to track them done since they will often create a silent heap corruption so the crash has happen after a period of time after the real issue has occurred. Where I used to work this problem was so bad I built an LD_PRELOAD style debugging lib for purposely testing for these cases. It is available on github at https://github.com/mistralol/libbreakr

getenv

So getenv is fine to use in a multi threaded program. Just don't use setenv / putenv / unsetenv at all ever. As these can invalidate the pointer returned from getenv at any time. Even the man page quotes "The implementation of getenv() is not required to be reentrant". It is possible to work around it by using locks on all calls. But can you really also get all your 3rd party libs that use getenv to also follow the same locks? No. So its best not to try to change your environment from within the environment.

Queues

Over the years I have seen all sorts of queues implemented and used for cross communication between threads. Large numbers of these have always has the same problems. Which is never to have any limits and to never have a flush function in order to clear / empty the queue. When no limits are used its common to have something on the other side of the queue block then the program expands in memory indefinitely. These implementations even exist in standard frameworks like C# / Java where the typical thread pool models don't have settable limits. Often the current size of the queue cannot even be queried in these ThreadPools so when another programmer is using them it can be impossible to actually write a error free program. So next time you use a queue in a threaded program think about what happens if you feed the queue faster than the items being processed and what happens to the item is there is an error when processing the item on the other side of the queue.

The best part I have often seen used is that queues are often fire and forget usage. This can make error handling or error reporting either non-existent, extremely convoluted or made completely impossible in certain situations.