Windows, like practically any other mainstream multithreading operating system, provides a mechanism to allow programmers to efficiently store state on a per-thread basis. This capability is typically known as Thread Local Storage, and it’s quite handy in a number of circumstances where global variables might need to be instanced on a per-thread basis.

Although the usage of TLS on Windows is fairly well documented, the implementation details of it are not so much (though there are a smattering of pieces of third party documentation floating out there).

Conceptually, TLS is in principal not all that complicated (famous last words), at least from a high level. The general design is that all TLS accesses go through either a pointer or array that is present on the TEB, which is a system-defined data structure that is already instanced per thread.

The “per-thread” resolution of the TEB is fairly well documented, but for the benefit of those that are unaware, the general idea is that one of the segment registers (fs on x86, gs on x64) is repurposed by the OS to point to the base address of the TEB for the current thread. This allows, say, an access to fs:[0x0] (or gs:[0x0] on x64) to always access the TEB allocated for the current thread, regardless of other threads in the address space. The TEB does really exist in the flat address space of the process (and indeed there is a field in the TEB that contains the flat virtual address of it), but the segmentation mechanism is simply used to provide a convenient way to access the TEB quickly without having to search through a list of thread IDs and TEB pointers (or other relatively slow mechanisms).

On non-x86 and non-x64 architectures, the underlying mechanism by which the TEB is accessed varies, but the general theme is that there is a register of some sort which is always set to the base address of the current thread’s TEB for easy access.

The TEB itself is probably one of the best-documented undocumented Windows structures, primarily because there is type information included for the debugger’s benefit in all recent ntdll and ntoskrnl.exe builds. With this information and a little disassembly work, it is not that hard to understand the implementation behind TLS.

Before we can look at the implementation of how TLS works on Windows, however, it is necessary to know the documented mechanisms to use it. There are two ways to accomplish this task on Windows. The first mechanism is a set of kernel32 APIs (comprising TlsGetValue, TlsSetValue, TlsAlloc, and TlsFree that allows explicit access to TLS. The usage of the functions is fairly straightforward; TlsAlloc reserves space on all threads for a pointer-sized variable, and TlsGetValue can be used to read this per-thread storage on any thread (TlsSetValue and TlsFree are conceptually similar).

The second mechanism by which TLS can be accessed on Windows is through some special support from the loader (residing ntdll) and the compiler and linker, which allow “seamless”, implicit usage of thread local variables, just as one would use any global variable, provided that the variables are tagged with __declspec(thread) (when using the Microsoft build utilities). This is more convenient than using the TLS APIs as one doesn’t need to go and call a function every time you want to use a per-thread variable. It also relieves the programmer of having to explicitly remember to call TlsAlloc and TlsFree at initialization time and deinitialization time, and it implies an efficient usage of per-thread storage space (implicit TLS operates by allocating a single large chunk of memory, the size of which is defined by the sum of all per-thread variables, for each thread so that only one index into the implicit TLS array is used for all variables in a module).

With the advantages of implicit TLS, why would anyone use the explicit TLS API? Well, it turns out that prior to Windows Vista, there are some rather annoying limitations baked into the loader’s implicit TLS support. Specifically, implicit TLS does not operate when a module using it is not being loaded at process initialization time (during static import resolution). In practice, this means that it is typically not usable except by the main process image (.exe) of a process, and any DLL(s) that are guaranteed to be loaded at initialization time (such as DLL(s) that the main process image static links to).

Next time: Taking a closer look at explicit TLS and how it operates under the hood.

Tags: Internals, TLS