How programs get run: ELF binaries

This article brought to you by LWN subscribers Subscribers to LWN.net made this article — and everything that surrounds it — possible. If you appreciate our content, please buy a subscription and make the next set of articles possible.

The previous article in this series described the general mechanisms that the Linux kernel has for executing programs as a result of a user-space call to execve() . However, the particular format handlers described in that article each deferred the process of execution to an inner call to search_binary_handler() . That recursion almost always ends with the invocation of an ELF binary program, which is the subject of this article.

The ELF format

The ELF (Executable and Linkable Format) format is the main binary format in use on modern Linux systems, and support for it is implemented in the file fs/binfmt_elf.c . It's also a slightly complicated format for the kernel to handle; the main load_elf_binary() function spans over 400 lines, and the ELF support code is more than four times as big as the code that supports the old a.out format.

An ELF file for an executable program (rather than a shared library or an object file) must always contain a program header table near the start of the file, after the ELF header; each entry in this table provides information that is needed to run the program.

The kernel only really cares about three types of program header entries. The first type is the PT_LOAD segment, which describes areas of the new program's running memory. This includes code and data sections that come from the executable file, together with the size of a BSS section. The BSS will be filled with zeroes (thus only its length needs to be stored in the executable file). The second entry of interest is a PT_INTERP entry, which identifies the run-time linker needed to assemble the complete program; for the time being, we'll assume a statically linked ELF binary and return to dynamic linking later. Finally, the kernel also gets a single bit of information from a PT_GNU_STACK entry, if present, which indicates whether the program's stack should be made executable or not.

(This article only focuses on what's needed to load an ELF program, rather than exploring all of the details of the format. The interested reader can find much more information via the references linked from Wikipedia's ELF article or by exploring real binaries with the objdump tool.)

Processing ELF binaries

Loading an ELF binary is handled by the load_elf_binary() function, which starts by examining the ELF header to check that the file in question does indeed look like a supported ELF format. The handler needs the whole of the ELF program header, whether it is within the first 128 bytes read into buf in linux_binprm or not, so it needs to read it into some scratch space.

The code now loops over the program header entries, checking for an interpreter ( PT_INTERP ) and whether the program's stack should be executable (from the PT_GNU_STACK entry). With this preparation done, the code needs to initialize those attributes of the new program that are not inherited from the old program; the Single UNIX Specification version 3 (SUSv3) exec specification describes most of the required behavior (and table 28-4 of The Linux Programming Interface gives an excellent summary of the attributes involved).

The process of setting up the new program starts with a call to flush_old_exec() , which clears up state in the kernel that refers to the previous program. Any other threads of the old program are killed so the new program starts with a single thread, and the signal-handling information for the process is unshared so that it can be safely altered later. Any pending POSIX timers for the old program are cleared, and the location of the executable file for the program (visible at /proc/pid/exe ) is updated. The virtual memory mappings for the old program are released, which also kills any pending asynchronous I/O operations and frees any uprobes. Finally, the personality of the process is updated to remove any features that could affect security, as previously recorded in the per_clear field in linux_binprm . The main handler code also calls the SET_PERSONALITY() macro to set the thread flags appropriately for a new 64-bit program.

A corresponding call to setup_new_exec() now sets up the kernel's internal state for the new program. This function starts by determining whether the new program can generate a core dump (or have ptrace() attach to it); this is disabled by default for setuid or setgid programs. Dumping is also disabled when the program file isn't readable under the current credentials. A call to __set_task_comm() sets the current task's comm field to the basename of the originally invoked filename; this value is used as a thread name, and is accessible to user space via the PR_GET_NAME and PR_SET_NAME prctl() operations. A call to flush_signal_handlers() sets up the signal handlers for the new program; any signal handler that's not SIG_IGN gets set to the default SIG_DFL value (so any ignored signals are inherited by the new program). Finally, a call to do_close_on_exec() closes all of the old program's file descriptors that have the O_CLOEXEC flag set; other file descriptors will be inherited by the new program.

The virtual memory for the new program also needs to be set up. To improve security (by helping protect against stack overflow attacks), the highest address for the stack is typically moved downward by a random offset. An initial call to setup_arg_pages() then sets up the kernel's memory tracking structures, and adjusts for the new location of the stack. The code loops through all of the PT_LOAD segments in the program file and maps them into the process's address space, setting up the new program's memory layout. It then sets up zero-filled pages that correspond to the program's BSS segment. Also, additional special pages — such as the virtual dynamic shared object (vDSO) pages — need to be mapped, which is taken care of by a call to arch_setup_additional_pages() . An empty page may also be mapped at the zero address in the program's address space for backward-compatibility reasons (old SVr4 programs apparently assume that reading from a NULL pointer would return zeros rather than SIGSEGV ).

Next, the credentials for the new program are set up via a call to install_exec_creds() . This function lets any active Linux Security Module (LSM) know about the change in credentials (through the bprm_committing_creds and bprm_committed_creds LSM hooks), and the inner commit_creds() function performs the assignment.

The final preparation for running the new program is to set up the rest of its stack (in its new randomized location), by calling the create_elf_tables() function; this is described in a separate section below.

All of the preparation has now been done, and the new program can be launched. An earlier article explained how the kernel's system_call entry point pushes the user-space CPU registers to the kernel stack before entering the main kernel code, and these registers are correspondingly restored when the system call completes. The area of the stack that holds the saved registers is cast to a pt_regs structure, and the saved user-space CPU registers can thus be overwritten with suitable values (zeroes) for the start of the new program. The call to the start_thread() function also sets the saved instruction pointer to the entry point of the program (or the dynamic linker), and the saved stack pointer to the current top of the stack (from the p field in linux_binprm ). The zero return code from the handler indicates success, and the execve() syscall returns to user space — but to a completely different user space, where the process's memory has been remapped, and the restored registers have values that start the execution of the new program.

Populating the stack: the auxiliary vector, environment and arguments

The create_elf_tables() function adds more information to the new program's stack, below the argument and environment information added by the generic code, as two distinct chunks. An initial call to arch_align_stack() rounds down the existing stack position to a 16-byte boundary, and may also further randomize the stack position downward slightly.

The first collection of information forms the ELF auxiliary vector, a collection of (id, value) pairs that describe useful information about the program being run and the environment it is running in, communicated from the kernel to user space. To build this vector, the handler code first needs to push onto the stack any information that doesn't fit within a 64-bit value; for x86_64 this is a platform capability description (the string "x86_64" ) and 16 bytes of random data (to help seed user-space random number generators).

Next, the code assembles the (id, value) pairs for the auxiliary vector in the saved_auxv space within the mm_struct . An LWN article from Michael Kerrisk describes the contents of this vector, so here we just mention a few interesting entries:

The (architecture-specific) first entry in the vector is the AT_SYSINFO_EHDR value for x86_64; this indicates the location of the vDSO page, as referenced in an earlier article.

value for x86_64; this indicates the location of the vDSO page, as referenced in an earlier article. The AT_PLATFORM value is the location of the "x86_64" platform capability description pushed earlier.

value is the location of the platform capability description pushed earlier. The AT_RANDOM value is the location of the random data pushed earlier.

value is the location of the random data pushed earlier. The AT_EXECFN value holds the location of the program filename that was pushed as the very first thing on the stack (and whose location was stored in the exec field of linux_binprm ), above the arguments and environment values.

value holds the location of the program filename that was pushed as the very first thing on the stack (and whose location was stored in the field of ), above the arguments and environment values. The AT_ENTRY value holds the entry point for the text segment, i.e. where program execution should start.

Once this auxiliary vector is created, the code now assembles the rest of the new program's stack. The required space is calculated, and then the entries are inserted from low addresses to higher ones:

The argc argument count is inserted first.

argument count is inserted first. An array of argument pointers is inserted next, ending with a NULL pointer. This is where main() 's argv will eventually point.

's will eventually point. An array of environment pointers is inserted next, ending with a NULL pointer. This is where environ will point.

will point. The auxiliary vector is put at the highest address, just below the additional values it references.

Taken together, the top of the new program's address space will have contents like the following example (this page has a similar example):

------------------------------------------------------------- 0x7fff6c845000 0x7fff6c844ff8: 0x0000000000000000 _ 4fec: './stackdump\0' <------+ env / 4fe2: 'ENVVAR2=2\0' | <----+ \_ 4fd8: 'ENVVAR1=1\0' | <---+ | / 4fd4: 'two\0' | | | <----+ args | 4fd0: 'one\0' | | | <---+ | \_ 4fcb: 'zero\0' | | | <--+ | | 3020: random gap padded to 16B boundary | | | | | | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -| | | | | | 3019: 'x86_64\0' <-+ | | | | | | auxv 3009: random data: ed99b6...2adcc7 | <-+ | | | | | | data 3000: zero padding to align stack | | | | | | | | . . . . . . . . . . . . . . . . . . . . . . . . . . .|. .|. .| | | | | | 2ff0: AT_NULL(0)=0 | | | | | | | | 2fe0: AT_PLATFORM(15)=0x7fff6c843019 --+ | | | | | | | 2fd0: AT_EXECFN(31)=0x7fff6c844fec ------|---+ | | | | | 2fc0: AT_RANDOM(25)=0x7fff6c843009 ------+ | | | | | ELF 2fb0: AT_SECURE(23)=0 | | | | | auxiliary 2fa0: AT_EGID(14)=1000 | | | | | vector: 2f90: AT_GID(13)=1000 | | | | | (id,val) 2f80: AT_EUID(12)=1000 | | | | | pairs 2f70: AT_UID(11)=1000 | | | | | 2f60: AT_ENTRY(9)=0x4010c0 | | | | | 2f50: AT_FLAGS(8)=0 | | | | | 2f40: AT_BASE(7)=0x7ff6c1122000 | | | | | 2f30: AT_PHNUM(5)=9 | | | | | 2f20: AT_PHENT(4)=56 | | | | | 2f10: AT_PHDR(3)=0x400040 | | | | | 2f00: AT_CLKTCK(17)=100 | | | | | 2ef0: AT_PAGESZ(6)=4096 | | | | | 2ee0: AT_HWCAP(16)=0xbfebfbff | | | | | 2ed0: AT_SYSINFO_EHDR(33)=0x7fff6c86b000 | | | | | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . | | | | | 2ec8: environ[2]=(nil) | | | | | 2ec0: environ[1]=0x7fff6c844fe2 ------------------|-+ | | | 2eb8: environ[0]=0x7fff6c844fd8 ------------------+ | | | 2eb0: argv[3]=(nil) | | | 2ea8: argv[2]=0x7fff6c844fd4 ---------------------------|-|-+ 2ea0: argv[1]=0x7fff6c844fd0 ---------------------------|-+ 2e98: argv[0]=0x7fff6c844fcb ---------------------------+ 0x7fff6c842e90: argc=3

Note that although there are two randomizations in the stack layout (the position of the top of memory and the size of the gap between the argument values and the auxiliary vector), the newly running program can still figure out where all of the information on the stack is. The SP register tells the program where the top of the stack is (i.e. the lowest address), and the command-line arguments are arranged upwards in memory from there, with a NULL pointer to mark where they end. The environment values are found next, again with a NULL pointer to terminate, and the auxiliary vector is found at the next consecutive addresses, closing with an AT_NULL ID. The values found within all of this information give the addresses of the argument strings, environment strings, and auxiliary data values, so no explicit information about the size of the random gap is needed.

Dynamically linked programs

So far we've assumed the program being executed is statically linked and skipped over steps that would be triggered by the presence of a PT_INTERP entry in the ELF program header. However, most programs are dynamically linked, meaning that required shared libraries have to be located and linked at run-time. This is performed by the runtime linker (typically something like /lib64/ld-linux-x86-64.so.2 ), and the identity of this linker is specified by the PT_INTERP program header entry.

To cope with a runtime linker, the ELF handler first reads the ELF interpreter file name into scratch space, then opens the executable file with open_exec() . The first 128 bytes of the file are read into the bprm->buf scratch area, replacing the contents of the original program file and allowing access to the ELF header of the interpreter program — which must therefore be an ELF binary itself, rather than any other format.

After the program code has been loaded into memory as described previously, the ELF handler also loads the ELF interpreter program into memory with load_elf_interp() . This process is similar to the process of loading the original program: the code checks the format information in the ELF header, reads in the ELF program header, maps all of the PT_LOAD segments from the file into the new program's memory, and leaves room for the interpreter's BSS segment.

The execution start address for the program is also set to be the entry point of the interpreter, rather than that of the program itself. When the execve() system call completes, execution then begins with the ELF interpreter, which takes care of satisfying the linkage requirements of the program from user space — finding and loading the shared libraries that the program depends on, and resolving the program's undefined symbols to the correct definitions in those libraries. Once this linkage process is done (which relies on a much deeper understanding of the ELF format than the kernel has), the interpreter can start the execution of the new program itself, at the address previously recorded in the AT_ENTRY auxiliary value.

Compatibility with other architectures

As described previously, a modern 64-bit (x86_64) Linux system can also support running 32-bit binaries of two types: normal 32-bit binaries (x86_32), and x32 ABI programs (which can make use of additional x86_64 registers). So how does the kernel support these binaries?

The key file that provides support for these formats is compat_binfmt_elf.c , which is included in the kernel when the CONFIG_COMPAT_BINFMT_ELF config option is set. This file didn't appear in our earlier list of places that register binary handlers, because the file contains almost no code of its own. Instead, it includes the main binfmt_elf.c ELF handler code (using #include ), and uses the preprocessor to redirect various internal functions and values to 32-bit compatibility versions. Other than these changes, the format handler therefore behaves the same as the normal ELF handler described above.

One set of changes uses 32-bit versions of the structures describing the layout of the ELF file; similarly, the appropriate constant values for 32-bit binaries are used, which ensures that the compatibility handler only claims support for the relevant ELF binary types. In particular, the elf_check_arch() call is replaced with a compat_elf_check_arch() version that checks for either x86_32 or (if configured) x32.

The preprocessor changes also redirect some of the inner functionality of the ELF handler code. The invocation of the SET_PERSONALITY() macro is redirected to set_personality_ia32() so that the relevant thread flags for the 32-bit architecture are set and, similarly, the arch_setup_additional_pages() function is replaced with a version that sets up a 32-bit vDSO. More significantly, the start_thread() function is replaced with compat_start_thread() , which maps to start_thread_ia32() . This alters the arguments to the inner start_thread_common() function so that the saved segment registers are initialized differently than for x86_64 binaries (and the ELF_PLAT_INIT() macro is also adjusted to match).

Epilogue