Optimizing global constant data structures using relative references

The hidden costs of global pointers

```c
const struct object_vtable {
    const type_info *typeinfo;
    void (*method1)(object *self);
    void (*method2)(object *self, int argument);
    /* etc. */
} object_vtable = {
    .typeinfo = &object_typeinfo,
    .method1 = &object_method1,
    .method2 = &object_method2,
    /* etc. */
};
```

A structure like object_vtable is full of pointers to other global objects, and those pointers carry hidden costs that are paid before the program's main() function even begins to run.

When a process is formed, the contents of the executable file and all of the dynamic libraries it uses get memory-mapped into the new process's address space by the dynamic linker. The dynamic linker uses the kernel's memory mapping feature to do this, associating regions of memory with the contents of the binary files on disk. As long as this memory isn't changed by the running program, the kernel can consider it to be "clean", so that if the system needs to free up memory for other purposes, it can discard these clean pages and reload them from the original binary later, since they haven't been changed. Furthermore, if multiple processes launch using the same executable or dynamic libraries, the exact same clean memory pages can be shared across the address spaces of all of those different processes, significantly reducing the amount of memory needed by the entire system.

However, for a number of reasons, the code and data in a binary on disk can't know for certain what memory address it will end up mapped to in a running process. Every executable can link against any set of dynamic libraries, so a dynamic library may have to rebase, sliding to a new base address to make room for other libraries in the process. Additionally, as a layer of security, operating systems use address space layout randomization, or ASLR, so that if a program gets exploited, attacker code can't make static assumptions about the location of other exploitable resources in the program's memory, making exploits harder to write. Because the binary on disk doesn't know what address it will be mapped to, when pointers appear in its global data, the dynamic linker has to slide all of those pointers to their correct values for the process. This sliding has to happen when the program is loaded, before entering main, delaying the launch of the program. It also causes the pages those pointers are on to become "dirty" from the kernel's perspective, causing the system to use more memory. Dirty pages can't be shared among different processes, since the pointer values may be different in each process, and since they no longer match the contents of the binary on disk, they can no longer simply be discarded if the system needs to free up memory for other uses; dirty pages instead have to be written out to swap and reloaded from there.

I wrote a small C program for macOS to observe the impact of global pointers on memory usage and program launch time. This program measures the current time and then spawns another copy of itself, which measures the time immediately after entering main and reports the time difference in nanoseconds. The program also accepts a -stop argument, which causes the process to suspend itself immediately after reporting its launch time, allowing us to inspect its memory usage after entering main. We can build it in three variants:

One without any global data structures,

One with 256 kilobytes of non-pointer data structures,

One with 256 kilobytes of pointers, resembling a C++-like set of class vtables.

```
$ xcrun clang -O3 -fpie lots-of-global-pointers.c -DVARIATION=0 -o no_class_records
$ xcrun clang -O3 -fpie lots-of-global-pointers.c -DVARIATION=1 -o non_pointer_class_records
$ xcrun clang -O3 -fpie lots-of-global-pointers.c -DVARIATION=2 -o pointer_class_records
```

```
$ average() { awk '{ sum += $0 } END { printf "%u\n", sum / NR }'; }
$ (for i in $(seq 1 1000); do ./no_class_records; done) | average
1513058
$ (for i in $(seq 1 1000); do ./non_pointer_class_records; done) | average
1517911
$ (for i in $(seq 1 1000); do ./pointer_class_records; done) | average
1712659
```

The pointer_class_records variant consistently takes noticeably longer to reach main() than the other two. We can confirm that the extra launch time is spent in dyld performing rebase fixups by setting the DYLD_PRINT_STATISTICS_DETAILS environment variable and comparing the statistics reported for no_class_records and pointer_class_records:

```
$ ./no_class_records
...
total rebase fixups:  14
total rebase fixups time:   0.48 milliseconds (23.4%)
...
$ ./pointer_class_records
...
total rebase fixups:  32,783
total rebase fixups time:   1.62 milliseconds (50.8%)
...
```

dyld performs over 32,000 more rebase fixups for pointer_class_records than for no_class_records, and those fixups account for most of the difference in launch time between the two variants.

Now let's look at the memory impact of these rebases. If I run each variant with the -stop flag and inspect the stopped processes' memory usage in top, I see something like this:

```
$ ./no_class_records -stop
2034244 nanoseconds from spawn to main() entry
pid 73439
zsh: suspended (signal)  ./no_class_records -stop
$ ./non_pointer_class_records -stop
2147981 nanoseconds from spawn to main() entry
pid 73519
zsh: suspended (signal)  ./non_pointer_class_records -stop
$ ./pointer_class_records -stop
2221860 nanoseconds from spawn to main() entry
pid 73528
zsh: suspended (signal)  ./pointer_class_records -stop
$ top -pid 73439 -pid 73519 -pid 73528
PID    COMMAND      ... MEM  ...
73528  pointer_clas ... 572K ...
73519  non_pointer_ ... 316K ...
73439  no_class_rec ... 316K ...
```

The pointer_class_records process uses 256 kilobytes more memory than either no_class_records or non_pointer_class_records: every page of its global pointers was dirtied by rebasing, while the same amount of pointer-free data in non_pointer_class_records stays clean and doesn't count against the process's memory footprint.

These costs may seem small, but in an operating system with frequently-used dynamic libraries that get loaded into almost every process, and executables that have many instances of the same binary running at the same time, the launch time and dirty page costs add up. On mobile platforms, where memory is constrained and swap is usually unavailable, dirty pages directly affect how many processes and how much user data can be kept in memory, and small-seeming changes in load times can have a big impact on the perceived responsiveness and performance of the platform.

Using relative references to build position-independent data structures

We can make data structures position-independent the same way code does, by having structures reference other structures by their distance from each other within the binary instead of by their absolute addresses. For example, the object_vtable from the beginning of the article could look something like this instead:

```c
#define OFFSET(target, source) \
    ((intptr_t)&target - (intptr_t)&source)

const struct object_vtable {
    int typeinfo_offset;
    int method1_offset;
    int method2_offset;
    /* etc. */
} object_vtable = {
    .typeinfo_offset = OFFSET(object_typeinfo, object_vtable.typeinfo_offset),
    .method1_offset = OFFSET(object_method1, object_vtable.method1_offset),
    .method2_offset = OFFSET(object_method2, object_vtable.method2_offset),
    /* etc. */
};
```

Unfortunately, C and C++ don't consider expressions involving subtracting global addresses to be constant expressions. If we try to compile something like the offset-based object_vtable implementation above, we'll get an error:

```
$ xcrun clang foo.c
foo.c:16:22: error: initializer element is not a compile-time constant
  .typeinfo_offset = OFFSET(object_typeinfo, object_vtable.typeinfo_offset),
                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Assembly language, however, can express these relative references directly:

```asm
.section __TEXT, __const     ; Put the following into the constants section of the binary
.global _object_vtable       ; Export the C symbol 'object_vtable'
                             ; (C symbols begin with underscores in macOS assembly)
_object_vtable:              ; Define object_vtable
.long _object_typeinfo - .   ; Emit the distance between object_typeinfo and here
                             ; (A '.' represents the current address in assembly)
.long _object_method1 - .
.long _object_method2 - .
```

The same structure can also be expressed in LLVM IR, which compiler writers can target directly:

```llvm
%object_vtable = type { i32, i32, i32 }
%object_typeinfo = type opaque
%object_method = type opaque

@object_typeinfo = external constant %object_typeinfo
@object_method1 = external constant %object_method
@object_method2 = external constant %object_method

@object_vtable = constant %object_vtable {
  i32 trunc (i64 sub (i64 ptrtoint (%object_typeinfo* @object_typeinfo to i64),
                      i64 ptrtoint (i32* getelementptr (%object_vtable, %object_vtable* @object_vtable, i32 0, i32 0) to i64)) to i32),
  i32 trunc (i64 sub (i64 ptrtoint (%object_method* @object_method1 to i64),
                      i64 ptrtoint (i32* getelementptr (%object_vtable, %object_vtable* @object_vtable, i32 0, i32 1) to i64)) to i32),
  i32 trunc (i64 sub (i64 ptrtoint (%object_method* @object_method2 to i64),
                      i64 ptrtoint (i32* getelementptr (%object_vtable, %object_vtable* @object_vtable, i32 0, i32 2) to i64)) to i32)
}
```

One tradeoff to using relative references is that they do require slightly more generated code on average to dereference than absolute pointers, leading to small performance and code size costs. To explore these costs, here's another small macOS C program that microbenchmarks absolute and relative references by creating a large number of vtables using either relative or absolute method pointers and measuring the time taken to call through them all. We can build both variations like this:

```
$ xcrun clang -O3 -fpie invoking-relative-references.c -DVARIATION=0 -o relative
$ xcrun clang -O3 -fpie invoking-relative-references.c -DVARIATION=1 -o absolute
```

And then run them:

```
$ ./absolute
850393727 nanoseconds to invoke methods
$ ./relative
867976645 nanoseconds to invoke methods
```

The relative variant is about 2% slower in this microbenchmark, but its binary is roughly half the size of the absolute one, largely because four-byte offsets take half the space of eight-byte pointers and require no rebase fixup metadata:

```
$ ls -l absolute relative
-rwxr-xr-x 1 joe staff 270872 Feb 14 20:23 absolute*
-rwxr-xr-x 1 joe staff 139792 Feb 14 20:45 relative*
```

The code size tradeoff shows up in the __text section, which is slightly larger in the relative variant, since following a relative reference takes an extra instruction or two:

```
$ objdump -section-headers absolute | grep __text
  0 __text 000000e7 0000000100000e50 TEXT
$ objdump -section-headers relative | grep __text
  0 __text 0000010d 0000000100000e20 TEXT
```

Leveraging the dynamic linker's data structures for references across libraries

A relative reference only works when the distance between the reference and its target is fixed by link time. That's true for symbols within a single binary, but a symbol in another dynamic library can end up at an arbitrary distance away, and the compiler producing a .o file has no way of knowing what that distance will be. Compilers already solve this problem for position-independent code by going through the dynamic linker's global offset table, or GOT. Consider a C function that reads a global variable defined in another library:

```c
extern int some_variable;

int get_value_of_some_variable(void) {
    return some_variable;
}
```

On x86-64, the compiler generates code like this (Intel syntax):

```asm
_get_value_of_some_variable:
    mov rax, some_variable@GOTPCREL[rip] ; load the address of some_variable from the GOT
    mov eax, [rax]                       ; load the value of some_variable
    ret                                  ; return it
```

The @GOTPCREL relocation refers to the symbol's entry in the GOT, a table of pointers that the dynamic linker fills in with the final addresses of external symbols at load time. The GOT entry itself always lives in the referencing binary, so its distance from our data structures is fixed at link time. A relative reference can therefore point at a symbol's GOT entry even when the symbol itself is external.

The same @GOTPCREL syntax works for data as well as code in assembly language:

```asm
.global _external_relative_reference
_external_relative_reference:
    ; Note that GOTPCREL measures the distance from the address at the end of the
    ; four-byte value rather than the beginning, since x86-64 does
    ; PC-relative addressing relative to the address of the following instruction.
    ; Adding 4 compensates for this offset.
    .long _external_global@GOTPCREL+4
```

LLVM IR has no direct spelling for @GOTPCREL, but it recognizes a "GOT equivalent" pattern: a private unnamed_addr constant global that holds the address of an external symbol. When possible, LLVM replaces references to such a global with GOT relocations for the underlying symbol:

```llvm
@external_global = external constant i32

; LLVM treats this global variable as a "GOT equivalent", and will replace
; references to @got.external_global with GOT relocations for external_global
; when possible.
@got.external_global = private unnamed_addr constant i32* @external_global

@external_relative_reference = constant i32 trunc (i64 sub (
    i64 ptrtoint (i32** @got.external_global to i64),
    i64 ptrtoint (i32* @external_relative_reference to i64)) to i32)
```

First of all, we could uniformly reference all symbols by GOT entry, local and external. The GOT doesn't have to be exclusively for external references; if the linker sees a @GOTPCREL reference to a local variable, it will obligingly create a GOT entry for it. Uniformly referencing all symbols by GOT allows code that follows the reference to remain branchless, since it can always load the offset, then load the GOT entry to get the desired address. It also preserves the benefits that a data structure's inline storage can remain in clean memory and use four-byte offsets instead of potentially eight-byte pointers. On the other hand, it forces GOT entries to be created for local symbols that wouldn't otherwise need them, and makes following every reference require a two-load chain, whether local or external. The additional launch time and dirty memory cost of an extra GOT entry are negligible; however, the added GOT entries incur an indirect size cost for the data structures that demand them: we save four bytes of inline storage by using a relative reference instead of a pointer, but that saving is offset by the eight bytes of the otherwise unnecessary GOT entry. Furthermore, long load chains increase the latency of an operation, and going through the GOT adds a load to the chain needed to follow a reference: the CPU has to load the offset, wait for the value to come back from memory, then load the final address from the GOT. Since the GOT load depends on the offset load, the CPU can't do anything to parallelize the two.

Contemporary CPUs have good branch predictors and depend on instruction-level parallelism for maximum performance, so if we can branch to avoid extending a load chain, it's often worth it. That suggests an alternative approach: distinguish local from external references so that local references can skip the trip through the GOT. If our data structures are aligned to 2-, 4-, or 8-byte boundaries, we can encode whether a reference is external by setting the low bit of the offset, which would otherwise always be zero. We can then check this bit and branch to do the extra dereference needed to load the GOT entry only for an external symbol, something like this:

```c
#include <stdint.h>

const void *resolve_indirectable_relative_offset(const int *offset_ptr) {
    int offset = *offset_ptr;
    intptr_t resolved_addr = (intptr_t)offset_ptr + (offset & ~1);
    // If the low bit is set, then the offset refers to a GOT entry, and we need to
    // load from there to get the resolved address.
    if (offset & 1) {
        return *(const void *const *)resolved_addr;
    } else {
        return (const void *)resolved_addr;
    }
}
```

It's also possible to avoid the need for external references in some situations by generating an equivalent local definition and referencing that instead. Two kinds of object in particular, strings and functions, are prime candidates for this. It's uncommon to link strings from other binaries, since C compilers typically give every binary its own string table. For functions, instead of referencing an external function directly, it's possible to reference a local thunk that jumps to the external implementation. This duplication simplifies the reference mechanism, which now only has to handle local relative references, but eats into the size savings we get from relative referencing.

When does using this technique make sense?

For examples of this technique in practice, look at the Swift runtime, which uses relative references extensively in its reflection data structures, allowing a Swift program to include a lot of runtime metadata without that metadata impacting the launch time or memory usage of the program. The technique has also been implemented in Clang, which can emit C++ vtables using relative references, an option Chromium uses to reduce the size, launch time, and memory cost of the vtables in their gigantic C++ codebase. If you're working on a language runtime in a similar vein, it's a good technique to consider adopting for your compiler-generated data structures as well.