Split Stacks in GCC

Ian Lance Taylor

The goal of split stacks is to permit a discontiguous stack which is grown automatically as needed. This means that you can run multiple threads, each starting with a small stack, and have the stack grow and shrink as required by the program. It is then no longer necessary to think about stack requirements when writing a multi-threaded program. The memory usage of a typical multi-threaded program can decrease significantly, as each thread does not require a worst-case stack size. It becomes possible to run millions of threads (either full NPTL threads or co-routines) in a 32-bit address space.

This is currently implemented for 32-bit and 64-bit x86 targets running GNU/Linux in gcc 4.6.0 and later. For full functionality you must be using the gold linker, which you can get by building binutils 2.21 or later with --enable-gold.

Basic implementation

The stack will have a guaranteed zone which is always available. The size of the guard area will be target specific. It will include enough stack space to actually allocate more stack space. Each function will have to verify that it has enough space in the current stack to execute.

The basic verification will be a comparison between the stack pointer and the current bottom of the stack plus the guaranteed zone size. This will have to be the first operation in the function, and will also be target specific. It must be fast, as it will be executed by each called function.

There are two cases to consider. For functions which require a stack frame less than the size of the guaranteed guard area, we can do a simple comparison between the stack pointer and the stack limit. For functions which require a larger stack frame, we must do a comparison including the size of the stack frame.

Possible strategies:

Reserve a register to hold the bottom of the stack plus the guaranteed size. This will have to be a callee-saved register. Each function with a small stack frame then starts with a two instruction sequence: cmp %sp,%reg bg 1f expand stack 1: Use a TLS variable. In the general case, in a shared library, this will require calling the __tls_get_addr function. That means that that function will have to work without requiring any additional stack space. This is infeasible unless the whole system is compiled with split stacks. It would require LD_BIND_NOW to be set, so that the __tls_get_addr function is resolved at program startup time. Even that is probably insufficient unless we can ensure that the space for the variable is fully allocated. In general I don't think we can ensure this, because dlopen can cause a thread to require more space for TLS variables, and that space will be allocated on the first call to __tls_get_addr . So I don't think this approach can work in general. Have the stack always end at a N-bit boundary. E.g., if we always allocate stack segments as a multiple of 4K, then align each one so that the stack always ends at a 12-bit boundary. Then the amount of space remaining on the stack is SP & 0xfff . We then have a four instruction sequence: mov %sp,%junkreg and $0xfff,%junkreg cmp $framesize plus slop,%junkreg bg 1f expand stack 1: Introduce a new function call which handles the comparison of the stack pointer and the stack expansion. This function call would be part of the threading library and can use special knowledge--e.g, the ability to quickly locate information about the stack of the current frame. This function call would have to use a special calling sequence, with a single argument which is the size of the stack frame. Then each function would start with a call to this function. On some targets this function call could perhaps be combined with a function call required by PIC code--e.g., __i686.get_pc_thunk.bx . Reuse the stack protector support field. When using glibc each thread descriptor has a field used by the stack protector. We can reuse that field to hold the stack limit. Of course it is then not possible to use split stacks in conjunction with stack protector. At least on x86, arrange to allocate a new field in the TCB header accessible via %fs or %gs. This is probably the best solution, and it is the one implemented for i386 and x86_64.

Expanding the stack

Expanding the stack requires allocating additional memory. This additional memory will have to be allocated using only the stack space slop. All of the functions used to allocate additional stack space must be compiled to not use a split stack. A new function attribute, no_split_stack will be introduced to mean that the stack should not be split. It would also work to ensure that the stack is large enough that they do not need to split the stack during the allocation call.

After expanding the stack, the function will copy any stack based parameters from the old stack to the new stack. Fortunately, all C++ objects which require a copy or move constructor are implicitly passed by reference, so copying the parameters on the stack is OK. For varargs functions, this is impossible in general, so we will compile varargs functions differently: they will use an argument pointer which is not necessarily based on the frame pointer. For functions which return objects on the stack, the objects will be returned on the old stack. This should normally happen automatically, as the initial hidden parameter will naturally point to the old stack.

When expanding the stack, the return address of the function will be munged to point to a function which will release the allocated stack block and reset the stack pointer to the caller. The address of the old stack block, and the old stack pointer, will have been saved somewhere in the new stack block.

Backward compatibility

We want to be able to use split stack programs on systems with prebuilt libraries compiled without split stacks. This means that we need to ensure that there is sufficient stack space before calling any such function.

Each object file compiled in split stack mode will be annotated to indicate that the functions use split stacks. This should probably be annotated with a note but there is no general support for creating arbitrary notes in GNU as. Therefore, each object file compiled in split stack mode will have an empty section with a special name: .note.GNU-split-stack . If an object file compiled in split stack mode includes some functions with the no_split_stack attribute, then the object file will also have a .note.GNU-no-split-stack section. This will tell the linker that some functions may not have the expected split stack prologue.

When the linker links an executable or shared library, it will look for calls from split-stack code to non-split-stack code. This will include calls to non-split-stack shared libraries (thus, a program linked against a split-stack shared library may fail if at runtime the dynamic linker finds a non-split-stack shared library; it might be desirable to use a new segment type to detect this situation).

For calls from split-stack code to non-split-stack code, the linker will change the initial instructions in the split-stack (caller) function. This means that the linker will have to have special knowledge of the instructions that the compiler emits. The effect of the changes will be to increase the required framesize by a number large enough to reasonably work for a non-split-stack. This will be a target dependent number; the default will be something like 64K. Note that this large stack will be released when the split-stack function returns. Note that I'm disregarding the case of split-stack code in a shared library calling non-split-stack code in the main executable; that seems like an unlikely problem.

Function pointers are a tricky case. In general we don't know whether a function pointer points to split-stack code. Therefore, all calls through a function pointer will be modified to call (or jump to) a special function __fnptr_morestack . This will use a target specific function calling sequence, and will be implemented as though it were itself a function call instruction. That is, all the parameters will be set up, and then the code will jump to __fnptr_morestack . The __fnptr_morestack function takes two parameters: the function pointer to call, and the number of bytes of arguments pushed on the stack. (This is not yet implemented.)

Debugging

The debugger will have to learn about split stack mode. There are two major issues: unwinding the stack, and finding the arguments to a varargs function.

Fortunately DWARF is complex enough to represent the unusual return sequence used by a split stack function. Therefore, the only major issue is that gdb expects all stack address to monotonically decrease. Some marker will be needed to tell gdb to disable this sanity check.

For a varargs function, the varargs will be accessed via an argument pointer rather than on the stack. gdb may need to be adjusted to understand this.

Implementation steps: