tl;dr

If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

This blog post will show how a fix for XFree86 and linuxthreads ended up causing a major threading regression about 7 years after the fix was created.

The regression was in pthread_create . Thread creation performed very slowly as the number of threads in a process increased. This bug is present on CentOS 5.3 (and earlier) and other linux distros as well.

It is also very possible that this bug impacted research done before August 15, 2008 (in the best case because Linux distro releases are slow) on building high performance threaded applications.

Digging this thing out was definitely one of the more interesting bug hunts in recent memory.

Hopefully, my long (and insane) story will encourage you to thoroughly investigate suspicious sources of performance degradation before applying the “[thing] is slow” label.

[thing] might be slow, but it may be useful to understand why.

Versions

This blog post will be talking about:

glibc 2.7 . Earlier versions are probably affected, but you will need to go look and see for sure.

. Earlier versions are probably affected, but you will need to go look and see for sure. Linux kernels earlier than 2.6.27 . Some linux distributions will backport fixes, so it is possible you could be running an earlier kernel version, but without the bug described below. You will need to go look and see for sure.

Linux’s mmap and MAP_32BIT

mmap is a system call that a program can use to map regions of memory into its address space. mmap can be used to map files into RAM and to share these mappings with other processes. mmap is also used by memory allocators to obtain large regions of memory that can be carved up and handed out to a program.

On June 29, 2001, Andi Kleen added a flag to Linux’s mmap called MAP_32BIT . This commit message reads in part:

This adds a new mmap flag to force mappings into the low 32bit address space.

Useful e.g. for XFree86′s ELF loader or linuxthreads’ thread local

data structures.

As Kleen mentions, XFree86 has its own ELF loader that appears to have been released as part of the 4.0.1 release back in 2000. The purpose of this ELF loader is to allow loadable module support for XFree86 even on systems that don’t necessarily have support for loadable modules. Another interesting side effect of the decision to include an ELF loader is that loadable modules can be built once and then reused on any system that XFree86 supports without recompiling the module source.

It appears that Kleen added MAP_32BIT to allow programs (like XFree86) which assumed mmap would always return 32-bit addresses to continue to work properly as 64-bit processors were beginning to enter the market.

Then, on November 11, 2002, Egbert Eich added some code to XFree86 to actually use the MAP_32BIT flag, the commit message says:

532. Fixed module loader to map memory in the low 32bit address space on

x86-64 (Egbert Eich).

Thus, 64-bit XFree86 builds would now have a working ELF loader since those builds would be able to get memory with 32-bit addresses.

I will touch on the threading implications mentioned in Kleen’s commit message a bit later.

ELF small code execution model and a tweak to MAP_32BIT

The AMD64 ABI lists several different code models which differ in addressing, code size, data size, and address range.

Specifically, the spec defines something called the small code model.

The small code model is defined such that that all symbols are known to be located in the range from 0 to 0x7EFFFFFF (among other things that are way beyond the scope of this blogpost).

In order to support this code execution model, Kleen added a small tweak to MAP_32BIT to limit the range of addresses that mmap would return in order to support the small code execution model.

Unfortunately, I have not been able to track down the exact commit with Kleen’s commit message (if there was one), but it occurred sometime between November 28, 2002 (kernel 2.4.20) and June 13, 2003 (kernel 2.4.21).

I did find what looks like a merge commit or something. It shows the code Kleen added and a useful comment explaining why the address range was being limited:

+ } else if (flags & MAP_32BIT) { + /* This is usually used needed to map code in small + model: it needs to be in the first 31bit. Limit it + to that. This means we need to move the unmapped + base down for this case. This may give conflicts + with the heap, but we assume that malloc falls back + to mmap. Give it 1GB of playground for now. -AK */ + if (!addr) + addr = 0x40000000; + end = 0x80000000;

Unfortunately, now the flag’s name MAP_32BIT is inaccurate.

The range has been limited to a single gigabyte from 0x40000000 (1gb) – 0x80000000 (2gb). This is good enough for the ELF small code model mentioned above, but this means any memory successfully mapped with MAP_32BIT is actually mapped within the first 31 bits and thus this flag should probably be called MAP_31BIT or something else that more accurately describes its behavior.

Oops.

pthread_create and thread stacks

When you create a thread using pthread_create there are two ways to allocate a region of memory for that thread to use as its stack:

Allow libpthread to allocate the stack itself. You do this by simply calling pthread_create . This is the common case for most programs. Use pthread_attr_getstacksize and pthread_attr_setstacksize to get and set the stack size.

or

A three step process: Allocate a region of memory possibly with mmap , malloc (which may just call mmap for large allocations), or statically. Use pthread_attr_setstack to set the address and size of the stack in a pthread attribute object. Pass said attribute object along to pthread_create and the thread which is created will have your memory region set as its stack.



Slow context switches, glibc, thread local storage, … wow …

A lot of really crazy shit happened in 2003, so I will try my best to split into digestible chunks.

Slow context switching on AMD k8

On February 12, 2003, it was reported that early AMD P4 CPUs were very slow when executing the wrmsr instruction. This instruction is used to write to model specific registers (MSRs). This instruction was used a few times in context switch code and removing it would help speed up context switch time. This code was refactored, but the data gathered here would be used as a justification for using MAP_32BIT in glibc a few months later.

MAP_32BIT being introduced to glibc

On March 4, 2003, it appears that Ulrich Drepper added code to glibc to use the MAP_32BIT flag in glibc . As far as I can tell, this was the first time MAP_32BIT was introduced to glibc .

An interesting comment is presented with a small piece of the patch:

+/* For Linux/x86-64 we have one extra requirement: the stack must be + in the first 4GB. Otherwise the segment register base address is + not wide enough. */ +#define EXTRA_PARAM_CHECKS \ + if ((uintptr_t) stackaddr > 0x100000000ul \ + || (uintptr_t) stackaddr + stacksize > 0x100000000ul) \ + /* We cannot handle that stack address. */ \ + return EINVAL

To understand this comment it is important to understand how Linux deals with thread local storage.

Briefly,

Each thread has a thread control block (TCB) that contains various internal information that nptl needs, including some data that can be used to access thread local storage.

needs, including some data that can be used to access thread local storage. The TCB is written to the start of the thread stack by nptl .

. The address of the TCB (and thus the thread stack) needs to be stored in such a way that nptl , code generated by gcc , and the kernel can access the thread control block of the currently running thread. In other words, it needs to be context switch friendly.

, code generated by , and the kernel can access the thread control block of the currently running thread. In other words, it needs to be context switch friendly. The register set of Intel processors is saved and restored each time a context switch occurs.

Saving the address of the TCB in a register would be ideal.

x86 and x86_64 processors are notorious for not having many registers available, however Linux does not use the FS and GS segment selectors for segmentation. So, the address of the TCB can be stored in FS or GS if it will fit.

Unfortunately, the segment selectors FS and GS can only store 32-bit addresses and this is why Ulrich added the above code. Addresses above 4gb could not be stored in FS or GS .

It appears that this comment is correct for all of the Linux 2.4 series kernels and all Linux 2.5 kernels less than 2.5.65. On these kernel versions, only the segment selector is used for storing the thread stack address and as a result no thread stack above 4gb can be stored.

32-bit and 64-bit thread stacks

Starting with Linux 2.5.65 (and more practically, Linux 2.6.0), support for both 32-bit and 64-bit thread stacks had made its way into the kernel. 32-bit thread stacks would be stored in a segment selector while 64-bit thread stack addresses would be stored in a model specific register (MSR).

Unfortunately, as was reported back in February 12, 2003, writing to MSRs is painfully slow on early AMD K8 processors. To avoid writing to an MSR, you would need to supply a 32-bit thread stack address, and thus Ulrich added the following code to glibc on May 9, 2003:

+/* We prefer to have the stack allocated in the low 4GB since this + allows faster context switches. */ +#define ARCH_MAP_FLAGS MAP_32BIT + +/* If it is not possible to allocate memory there retry without that + flag. */ +#define ARCH_RETRY_MMAP(size) \ + mmap (NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC, \ + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) + +

This code is interesting for two reasons:

The justification for using MAP_32BIT has changed from the kernel not supporting addresses above 4gb to decreasing context switch cost. A retry mechanism is added so that if no memory is available when using MAP_32BIT , a thread stack will be allocated somewhere else.

At some unknown (by me) point in 2003, MAP_32BIT had been sterilized as explained earlier to deal with the ELF small code model in the AMD64 ABI.

The end result being that user programs have only 1gb of space with which to allocate all their thread stacks (or other low memory requested with MAP_32BIT ).

This seems bad.

Fast forward to 2008: an bug

On August 13, 2008 an individual named “Pardo” from Google posted a message to the Linux kernel mailing list about a regression in pthread_create :

mmap() is slow on MAP_32BIT allocation failure, sometimes causing

NPTL’s pthread_create() to run about three orders of magnitude slower.

As example, in one case creating new threads goes from about 35,000

cycles up to about 25,000,000 cycles — which is under 100 threads per

second.

Pardo had filled the 1gb region that MAP_32BIT tries to use for thread stacks causing glibc to fallback to the retry mechanism that Drepper added back in 2003.

Unfortunately, the failing mmap call with MAP_32BIT was doing a linear fucking search of all the “low address space” memory regions trying to find a fit before falling back and calling mmap a second time without MAP_32BIT .

And so after a few thousand threads, every new pthread_create call would trigger two system calls, the first of which would do a linear search of low memory before failing. The linear search and retry dramatically increased the time to create new threads.

This is pretty bad.

bugfixes for the kernel and glibc

So, how does glibc and the kernel fix this problem?

Ingo Molnar convinced everyone that the best solution was to add a new flag to the Linux kernel called MAP_STACK . This flag would be defined as “give out an address that is best suited for process/thread stacks”. This flag would actually be ignored by the kernel. This change appeared in Linux kernel 2.6.27 and was added on August 13, 2008.

Ulrich Drepper updated glibc to use MAP_STACK instead of MAP_32BIT and he removed the retry mechanism he added in 2003 since MAP_STACK should always succeed if there is any memory available. This change was added on August 15, 2008.

MAP_32BIT cannot be removed from the kernel, unfortunately because there are many programs out in the wild (older versions of glibc, Xfree86, older versions of ocamlc) that rely on this flag existing to actually work.

And so, MAP_32BIT will remain. A misnamed, strange looking wart that will probably be around forever to remind us that computers are hard.

Bad research (or not)?

I recall reading a paper that benchmarked thread creation time from back in the 2005-2008 time period which claimed that as the number of threads increased the time to create new threads also increased and thus, threads were bad.

I can’t seem to find that paper and it is currently 3:51 AM PST, so, who knows I could be misremembering things. If some one knows what paper I am talking about, please let me know.

If such a paper exists (and I think it does), this blog post explains why thread creation benchmarks would have resulted in really bad looking graphs.

Conclusion

I really don’t even know what to say, but:

Computers are hard

Ripping the face off of bugs like this is actually pretty fun

Be careful when adding linear time searches to potential hot paths

Make sure to carefully read and consider the effects of low level code

Git is your friend and you should use it to find data about things from years long gone

Any complicated application, kernel, or whatever will have a large number of ugly scars and warts that cannot be changed to maintain backward compatibility. Such is life.

If you enjoyed this article, subscribe (via RSS or e-mail) and follow me on twitter.

References