What happens when a file gets executed in Linux? What does it mean that a file is executable? Can we only execute compiled binaries? What about shell scripts then? If I can execute shell scripts, what else can I execute? In this article we will try to answer those questions.

What executing a file involves

Starting from the basics, let’s try to understand what happens when we type the following in our terminal

$ /usr/bin/sleep 30

The way our programs in the userspace interact with the kernel is through system calls. In practice we need to interact with the kernel to do almost anything that is interesting, such as printing output, reading input, reading files and so on.

The syscall that is in charge of executing files is the execve() system call. When we are coding, we normally access it through the exec family of functions present in the standard library, or even more commonly through higher level abstractions such as popen() or system().

It is important to note that when we execute a file through execve(), no new process is created. Instead, the calling process turns into an instance of the new executable. The PID does not change, but the kernel replaces the machine code, data, heap and stack of the process with those of the new program.

This is different to the way we are used to launching executables from the terminal. When we type sleep 30 in the terminal, we get a child process of bash, and the latter does not disappear.

$ sleep 30 &
$ ps -HF
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
nacho    21628 13409  1  4123  4312   3 17:59 pts/1    00:00:00 bash
nacho    21916 21628  0  1489   712   2 18:00 pts/1    00:00:00   sleep 30

Here another system call is coming into play, the fork() syscall. bash will first create a copy of itself in another child process, and then this child process will call execve() in order to transform itself into sleep. This way bash doesn’t disappear, and will be there to take over control when sleep dies after 30 seconds.
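The fork-then-exec dance that bash performs can be reproduced in a few lines. What follows is a minimal sketch in Python, not what bash actually runs: spawn() is a name made up for this example, and os.fork(), os.execv() and os.waitpid() are the standard library wrappers around the corresponding C functions.

```python
import os
import sys

def spawn(argv):
    """Fork, exec argv in the child, and wait for it in the parent."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this copy of the parent with the new program.
        try:
            os.execv(argv[0], argv)
        except OSError:
            pass
        # Only reached if execv() failed.
        os._exit(127)
    # Parent: still alive, same PID as before, waiting for the child.
    _, status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return pid, exit_code

if __name__ == "__main__":
    child_pid, code = spawn([sys.executable, "-c", "print('hello from the child')"])
    print("parent", os.getpid(), "child", child_pid, "exit code", code)
```

The parent keeps its PID and gets control back once the child exits, which is the behaviour we observed with bash and sleep.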

We can skip the forking step through the exec bash builtin

$ echo $$ # this is the PID of bash
23714
$ exec sleep 30

In another terminal we can see that sleep takes over the PID

$ ps -e | grep sleep
23714 pts/7    00:00:00 sleep

And sure enough, after 30 seconds when sleep exits there is no bash session, so the terminal window will close.
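The same PID-preserving behaviour can be checked without sacrificing a terminal. This is a small sketch under a few assumptions: it uses Python, a throwaway child interpreter prints its PID and then calls os.execv() to replace itself with a fresh interpreter that prints its PID again, and pids_before_and_after_exec() is a name invented for the example.

```python
import subprocess
import sys

# Runs in a throwaway child: print our PID, then exec a new interpreter
# that prints its own PID. flush=True matters because exec discards buffers.
PROGRAM = """
import os, sys
print(os.getpid(), flush=True)
os.execv(sys.executable, [sys.executable, "-c", "import os; print(os.getpid())"])
"""

def pids_before_and_after_exec():
    result = subprocess.run([sys.executable, "-c", PROGRAM],
                            capture_output=True, text=True, check=True)
    before, after = result.stdout.split()
    return int(before), int(after)

if __name__ == "__main__":
    print(pids_before_and_after_exec())
```

Both numbers should be identical, since exec swaps the program but keeps the process.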

The kernel side of things

So far we have seen userspace interacting with the kernel through syscalls. Let’s look at what happens on the other side.

The implementation of those syscalls lives in the kernel. Broadly speaking, the execve() syscall asks the kernel to execute a certain file on disk, and the kernel needs to load that file into memory where the CPU can access it.

This is the entry point of the system call, at fs/exec.c

SYSCALL_DEFINE3(execve,
        const char __user *, filename,
        const char __user *const __user *, argv,
        const char __user *const __user *, envp)
{
    return do_execve(getname(filename), argv, envp);
}

From here, the kernel first performs all required preparations in order to start executing a binary, such as setting up the virtual memory for the process

static int __bprm_mm_init(struct linux_binprm *bprm)
{
    ...
    vma->vm_end = STACK_TOP_MAX;
    vma->vm_start = vma->vm_end - PAGE_SIZE;
    vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP;
    vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
    INIT_LIST_HEAD(&vma->anon_vma_chain);
    err = insert_vm_struct(mm, vma);
    ...
}

The kernel also copies the filename, the command line arguments and the inherited environment into the new process. Yes, environment variables are handled by the Linux kernel itself during execve().

/*
 * sys_execve() executes a new program.
 */
static int do_execveat_common(int fd, struct filename *filename,
                              struct user_arg_ptr argv,
                              struct user_arg_ptr envp,
                              int flags)
{
    ...
    retval = copy_strings_kernel(1, &bprm->filename, bprm);
    if (retval < 0)
        goto out;
    bprm->exec = bprm->p;
    retval = copy_strings(bprm->envc, envp, bprm);
    if (retval < 0)
        goto out;
    retval = copy_strings(bprm->argc, argv, bprm);
    ...
}

Finally, the file “will be executed”.
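From userspace, the effect of that copying is that the new program sees exactly the arguments and environment handed over at exec time. The following is a hedged sketch in Python: run_with() and the GREETING variable are made up for the example, and subprocess.run() is used as a convenient front end that forks and execs for us.

```python
import os
import subprocess
import sys

# The child just reports what it received.
CHILD = "import os, sys; print(sys.argv[1:], os.environ.get('GREETING'))"

def run_with(args, extra_env):
    """Exec a child interpreter with the given arguments and extra environment."""
    env = dict(os.environ, **extra_env)  # inherited environment plus our additions
    result = subprocess.run([sys.executable, "-c", CHILD, *args],
                            env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print(run_with(["30"], {"GREETING": "hi"}))
```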

ELF binaries

Binaries are not just a big blob of machine code. Modern binaries use the ELF format, short for Executable and Linkable Format. In simple terms, this is a way of packaging the different code and data sections together with some attributes, such as write protection for the .text code section, so that they get mapped into virtual memory for execution. In addition, there are machine independent headers that provide basic information about the executable, such as whether it is statically or dynamically linked, the target architecture and so on.
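To get a feel for those headers, here is a minimal sketch in Python that decodes a few fields from the first bytes of a file. describe_elf() is a hypothetical helper written for this article; the offsets follow the ELF header layout (e_ident in the first 16 bytes, then e_type and e_machine), and only a handful of e_type values are named.

```python
import struct

ELF_MAGIC = b"\x7fELF"
E_TYPES = {1: "REL", 2: "EXEC", 3: "DYN", 4: "CORE"}

def describe_elf(header: bytes) -> dict:
    """Decode a few basic fields from the first 20 bytes of an ELF file."""
    if header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    ei_class, ei_data = header[4], header[5]
    endian = "<" if ei_data == 1 else ">"
    # e_type and e_machine come right after the 16-byte e_ident array.
    e_type, e_machine = struct.unpack_from(endian + "HH", header, 16)
    return {
        "bits": 64 if ei_class == 2 else 32,
        "endianness": "little" if ei_data == 1 else "big",
        "type": E_TYPES.get(e_type, hex(e_type)),
        "machine": e_machine,
    }
```

It could be tried on a real file with describe_elf(open("/usr/bin/sleep", "rb").read(20)).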

The kernel code responsible for parsing the ELF format lives in fs/binfmt_elf.c. Here, the ELF headers are read and analyzed

/**
 * load_elf_phdrs() - load ELF program headers
 * @elf_ex:   ELF header of the binary whose program headers should be loaded
 * @elf_file: the opened ELF binary file
 *
 * Loads ELF program headers from the binary file elf_file, which has the ELF
 * header pointed to by elf_ex, into a newly allocated array. The caller is
 * responsible for freeing the allocated data. Returns an ERR_PTR upon failure.
 */
static struct elf_phdr *load_elf_phdrs(struct elfhdr *elf_ex,
                                       struct file *elf_file)
{
    ...
}

Then the PT_LOAD segments are mapped into virtual memory

static int load_elf_binary(struct linux_binprm *bprm)
{
    ...
    /* Now we do a little grungy work by mmapping the ELF image into
       the correct location in memory. */
    for(i = 0, elf_ppnt = elf_phdata;
        i < loc->elf_ex.e_phnum; i++, elf_ppnt++) {
        int elf_prot = 0, elf_flags, elf_fixed = MAP_FIXED_NOREPLACE;
        unsigned long k, vaddr;
        unsigned long total_size = 0;
        if (elf_ppnt->p_type != PT_LOAD)
            continue;
        ...
    }
    ...
}

This is just slightly more convoluted for dynamically linked programs. The kernel recognizes a dynamically linked program by the presence of the PT_INTERP header.

$ readelf -l /usr/bin/sleep

Elf file type is DYN (Shared object file)
Entry point 0x1710
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000001f8 0x00000000000001f8  R E    0x8
  INTERP         0x0000000000000238 0x0000000000000238 0x0000000000000238
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000006a38 0x0000000000006a38  R E    0x200000
  LOAD           0x0000000000006bb0 0x0000000000206bb0 0x0000000000206bb0
                 0x00000000000004d0 0x0000000000000690  RW     0x200000
  DYNAMIC        0x0000000000006c78 0x0000000000206c78 0x0000000000206c78
                 0x00000000000001b0 0x00000000000001b0  RW     0x8
  NOTE           0x0000000000000254 0x0000000000000254 0x0000000000000254
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_EH_FRAME   0x0000000000005bf0 0x0000000000005bf0 0x0000000000005bf0
                 0x000000000000025c 0x000000000000025c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x0000000000006bb0 0x0000000000206bb0 0x0000000000206bb0
                 0x0000000000000450 0x0000000000000450  R      0x1

 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.ABI-tag .note.gnu.build-id .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .init .text .fini .rodata .eh_frame_hdr .eh_frame
   03     .init_array .fini_array .data.rel.ro .dynamic .got .data .bss
   04     .dynamic
   05     .note.ABI-tag .note.gnu.build-id
   06     .eh_frame_hdr
   07
   08     .init_array .fini_array .data.rel.ro .dynamic .got
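The lookup that readelf performs for the INTERP entry is simple enough to sketch by hand. The code below is an illustration in Python, not the kernel's implementation: elf_interpreter() is a made-up helper, it only handles 64-bit little-endian files, and it assumes the whole file has been read into memory.

```python
import struct

PT_INTERP = 3  # program header type that names the ELF interpreter

def elf_interpreter(data: bytes):
    """Return the PT_INTERP path of a 64-bit little-endian ELF image, or None."""
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("expected a 64-bit little-endian ELF file")
    # ELF64 header: e_phoff at byte 32, e_phentsize at 54, e_phnum at 56.
    (e_phoff,) = struct.unpack_from("<Q", data, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)
    for i in range(e_phnum):
        base = e_phoff + i * e_phentsize
        # ELF64 program header: p_type at 0, p_offset at 8, p_filesz at 32.
        p_type, _p_flags, p_offset = struct.unpack_from("<IIQ", data, base)
        (p_filesz,) = struct.unpack_from("<Q", data, base + 32)
        if p_type == PT_INTERP:
            return data[p_offset:p_offset + p_filesz].rstrip(b"\0").decode()
    return None  # no PT_INTERP: treated here as statically linked
```

On a system like the one above, elf_interpreter(open("/usr/bin/sleep", "rb").read()) would be expected to return the path shown in the readelf output.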

The path of the runtime linker that must be used to run the binary, here /lib64/ld-linux-x86-64.so.2, is written into this header when the binary is linked. The runtime linker finds in the filesystem the .so libraries that the binary needs, and loads them into memory.

$ ldd /usr/bin/sleep
        linux-vdso.so.1 (0x00007ffc9dbfe000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f71d401d000)
        /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f71d45e1000)

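In this simple case only the standard C library needs to be dynamically linked. The vDSO entry is not a file on disk but a small virtual library that the kernel maps into every process, so that some frequently used syscalls, such as reading the clock, can run without switching to kernel mode.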

The kernel will recognize the PT_INTERP header, and will also load the runtime linker (a.k.a ELF interpreter) ld.so and execute it.

static int load_elf_binary(struct linux_binprm *bprm)
{
    ...
    if (elf_interpreter) {
        unsigned long interp_map_addr = 0;
        elf_entry = load_elf_interp(&loc->interp_elf_ex,
                                    interpreter,
                                    &interp_map_addr,
                                    load_bias, interp_elf_phdata);
        ...
    }
    ...
}

Then ld.so will find and load the .so dynamic libraries, or fail if any undefined symbol remains unresolved, and finally will jump execution to the beginning (AT_ENTRY) of the original binary code.

We get the idea that the kernel is the one that inspects the binary and handles it, but in reality the kernel does not start our binary directly: it starts the ELF interpreter ld.so instead. Our binary’s machine code still ends up being executed, but we now have the concept of an interpreter: the file that is actually executable, and that “interprets” our file.

At the end of the day, we are doing this

$ /lib64/ld-linux-x86-64.so.2 /bin/sleep 30

What about scripts?

I have been trying hard to avoid the word binary because you can actually “execute” files that are not binary. Technically these files are not executed themselves, since they don’t necessarily contain machine code. They are run by an interpreter, as we just saw.

Now, let’s look at how an executable script is run. I used to imagine that the bash process (or the file manager) inspects the first bytes of the file and, if it finds a shebang such as #!/bin/python2, calls the appropriate interpreter. It turns out this is not how it works. The shebang detection actually happens in the Linux kernel itself.
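One way to convince ourselves that no shell is involved is to execute a script directly from a program that never consults one. This is a rough sketch in Python, assuming /bin/sh exists and the temporary directory allows execution; run_script_without_shell() is a name invented here.

```python
import os
import stat
import subprocess
import tempfile

def run_script_without_shell(body: str) -> str:
    """Write body to an executable temp file and execute it directly."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(body)
        path = f.name
    try:
        os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
        # shell=False is the default: the path is handed straight to exec,
        # so only the kernel gets a chance to look at the shebang.
        result = subprocess.run([path], capture_output=True, text=True, check=True)
        return result.stdout
    finally:
        os.unlink(path)

if __name__ == "__main__":
    print(run_script_without_shell("#!/bin/sh\necho from-script\n"))
```

If the shebang line is left out, the same call should fail with an "Exec format error" (ENOEXEC), because no handler in the kernel recognizes the file.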

This means there is more to execution than just ELF: Linux supports a bunch of binary formats, ELF being just one of them. Inside the kernel, each binary format is run by a handler that knows how to deal with that kind of file. Some handlers come with the standard kernel, and others can be added through loadable modules.

Whenever a file is to be executed through execve(), its first 128 bytes are read and offered to each handler in turn (128 is BINPRM_BUF_SIZE in the kernel version quoted here; newer kernels read 256 bytes). This occurs at fs/exec.c

/*
 * cycle the list of binary formats handler, until one recognizes the image
 */
int search_binary_handler(struct linux_binprm *bprm)
{
    ...
    list_for_each_entry(fmt, &formats, lh) {
        ...
        retval = fmt->load_binary(bprm); // OYB: the first 128B are in bprm->buf[128]
        ...
    }
    ...
}

Each handler can then accept it or ignore it, usually depending on some magic in the first bytes of the binary. This way, the appropriate handler takes care of the execution of that binary, or passes on the chance of doing so to another handler.
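The accept-or-pass-on logic can be modelled in userspace with a toy dispatcher. The sketch below is Python and only mimics the shape of the kernel loop: the function names echo the kernel ones, but the handlers simply return a label instead of loading anything, and None stands in for -ENOEXEC.

```python
def load_script(buf: bytes):
    """Toy script handler: accepts anything that starts with a shebang."""
    return "script" if buf[:2] == b"#!" else None

def load_elf_binary(buf: bytes):
    """Toy ELF handler: accepts anything that starts with the ELF magic."""
    return "elf" if buf[:4] == b"\x7fELF" else None

# Stand-in for the kernel's list of registered formats.
FORMATS = [load_script, load_elf_binary]

def search_binary_handler(buf: bytes):
    """Offer the first bytes to each handler until one recognizes them."""
    for load_binary in FORMATS:
        result = load_binary(buf)
        if result is not None:
            return result
    return None  # nobody claimed the file: the real execve() fails with ENOEXEC

if __name__ == "__main__":
    for sample in (b"\x7fELF\x02\x01\x01", b"#!/bin/sh\n", b"%PDF-1.4\n"):
        print(sample[:4], "->", search_binary_handler(sample))
```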

In the case of the ELF format, the magic is 0x7F ‘ELF’ in the field e_ident

#define EI_NIDENT 16
typedef struct {
    unsigned char e_ident[EI_NIDENT]; /* 0x7F 'ELF' four byte ELF magic number */
    uint16_t      e_type;
    uint16_t      e_machine;
    uint32_t      e_version;
    ElfN_Addr     e_entry;
    ElfN_Off      e_phoff;
    ElfN_Off      e_shoff;
    uint32_t      e_flags;
    uint16_t      e_ehsize;
    uint16_t      e_phentsize;
    uint16_t      e_phnum;
    uint16_t      e_shentsize;
    uint16_t      e_shnum;
    uint16_t      e_shstrndx;
} ElfN_Ehdr;

This is checked by the ELF handler at binfmt_elf.c in order to accept the binary.

static int load_elf_binary(struct linux_binprm *bprm)
{
    ...
    loc->elf_ex = *((struct elfhdr *)bprm->buf);
    if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
        goto out;
    ...
}

So what happens with scripts? Well, it turns out that there is a handler for that in the kernel, which can be found at binfmt_script.c.

All binary format handlers expose the same interface to the execve() code. For instance, this is the one for the ELF format

static struct linux_binfmt elf_format = {
    .module       = THIS_MODULE,
    .load_binary  = load_elf_binary,
    .load_shlib   = load_elf_library,
    .core_dump    = elf_core_dump,
    .min_coredump = ELF_EXEC_PAGESIZE,
};

The main hook here is the load_binary() function that will be different for each handler.

In the case of the script handler, its load_binary() hook is at binfmt_script.c, and starts like this.

static int load_script(struct linux_binprm *bprm)
{
    const char *i_arg, *i_name;
    char *cp;
    struct file *file;
    int retval;
    if ((bprm->buf[0] != '#') || (bprm->buf[1] != '!'))
        return -ENOEXEC;
    ...
}

So it is the kernel that actually parses the first line of the script, and passes on execution to the interpreter with the script path as an argument. As long as the file starts with the shebang #!, it will be interpreted as a script, be it python, awk, sed, perl, bash, ash, sh, zsh or any other interpreter.
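The rewriting of the command line can be sketched as follows. This is a simplified model in Python rather than a port of load_script(): shebang_argv() is a hypothetical helper, and it reflects the Linux behaviour of passing everything after the interpreter path as one single optional argument.

```python
def shebang_argv(first_line: bytes, script_path: str, args: list):
    """Build the argv the interpreter would receive, or None if no shebang."""
    if not first_line.startswith(b"#!"):
        return None
    body = first_line[2:].split(b"\n", 1)[0].strip(b" \t")
    if not body:
        return None
    parts = body.split(None, 1)
    argv = [parts[0].decode()]           # the interpreter path
    if len(parts) == 2:
        argv.append(parts[1].decode())   # the rest stays ONE argument
    argv.append(script_path)             # then the script itself
    argv.extend(args)                    # then the original arguments
    return argv

if __name__ == "__main__":
    print(shebang_argv(b"#!/bin/bash\nsleep 30\n", "./sleep30.sh", []))
```

The single-argument rule is why a line like #!/bin/sed -n -f hands "-n -f" to sed as one string.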

Once more, we are not actually executing our script, but we are doing something like

$ /bin/bash ./sleep30.sh

We are starting to see that the name binary format is a bit misleading, as we can execute things that are not binaries. There are other handlers for exotic or old binary formats, such as the flat format, or the old a.out format, and in particular there is a very powerful and versatile handler, the binfmt_misc handler.

Execute anything with binfmt_misc

Now we know what a binary handler is, and we can understand binfmt_misc. This is a flexible format handler that allows us to specify what userland interpreter should run for a specific file type. It doesn’t just look at a hardcoded magic at the beginning of the file, but also supports detecting the binary by extension, using masks, and offers a /proc interface to the system administrator. Remember that all this is happening in kernel space. The loader for this handler is load_misc_binary() at fs/binfmt_misc.c.

If the /proc interface is not already mounted for us, we can do so with

# mount binfmt_misc -t binfmt_misc /proc/sys/fs/binfmt_misc

Let’s have a look at it

$ ls -l /proc/sys/fs/binfmt_misc
total 0
-rw-r--r-- 1 root root 0 May 16 10:38 CLR
-rw-r--r-- 1 root root 0 May 16 11:15 python2.7
-rw-r--r-- 1 root root 0 May 16 11:15 python3.5
--w------- 1 root root 0 May 16 11:15 register
-rw-r--r-- 1 root root 0 May 16 11:15 status

We can see that we already have it populated with some python and other entries.

$ cat /proc/sys/fs/binfmt_misc/python3.5
enabled
interpreter /usr/bin/python3.5
flags:
offset 0
magic 160d0d0a
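These entry files are plain text, so they are easy to read from a program. The following is a small sketch in Python, assuming the layout shown above (one "key value" pair per line); parse_binfmt_entry() is a name invented for this example and it only understands the fields seen in this article.

```python
def parse_binfmt_entry(text: str) -> dict:
    """Parse the text of a /proc/sys/fs/binfmt_misc/<name> file into a dict."""
    entry = {"enabled": False}
    for line in text.splitlines():
        key, _, value = line.strip().partition(" ")
        if key in ("enabled", "disabled"):
            entry["enabled"] = key == "enabled"
        elif key == "flags:":
            entry["flags"] = value.strip()
        elif key in ("interpreter", "extension"):
            entry[key] = value
        elif key == "offset":
            entry["offset"] = int(value)
        elif key in ("magic", "mask"):
            entry[key] = bytes.fromhex(value)  # hex string -> raw bytes
    return entry

if __name__ == "__main__":
    sample = "enabled\ninterpreter /usr/bin/python3.5\nflags: \noffset 0\nmagic 160d0d0a\n"
    print(parse_binfmt_entry(sample))
```

On a machine with the entry registered, the text could come from open("/proc/sys/fs/binfmt_misc/python3.5").read() instead of the sample string.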

We can remove, enable or disable these entries by writing to the entry file:

echo 1 to enable the entry

echo 0 to disable the entry

echo -1 to remove the entry

What is really cool is that we can easily add custom entries through binfmt_misc. In order to add an entry, you echo a format string of the form :name:type:offset:magic:mask:interpreter:flags to register. Details on how to configure all these flags, masks and magic values can be found in the kernel documentation for binfmt_misc.

As an example, let’s create a handler for the JPG image format to be opened by the feh image viewer. In this case we are matching by extension (therefore the E)

# echo ':fehjpg:E::jpg::/usr/bin/feh:' > /proc/sys/fs/binfmt_misc/register

The handler is registered now. We can inspect it

$ cat /proc/sys/fs/binfmt_misc/fehjpg
enabled
interpreter /usr/bin/feh
flags:
extension .jpg
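The registration string we just echoed is a list of colon-separated fields: name, type, offset, magic, mask, interpreter and flags. As a sketch, a tiny Python helper can assemble it for us; binfmt_rule() is made up for this article and only builds the text, it does not write to the register file.

```python
def binfmt_rule(name, kind, magic, interpreter, offset="", mask="", flags=""):
    """Build a ':name:type:offset:magic:mask:interpreter:flags' string."""
    if kind not in ("M", "E"):
        raise ValueError("type must be M (magic) or E (extension)")
    fields = [name, kind, str(offset), magic, mask, interpreter, flags]
    if any(":" in field for field in fields):
        raise ValueError("fields must not contain the ':' separator")
    return ":" + ":".join(fields)

if __name__ == "__main__":
    print(binfmt_rule("fehjpg", "E", "jpg", "/usr/bin/feh"))
```

For extension rules the magic field holds the extension, which is why jpg sits in the fourth position.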

Now let’s take a picture, make it executable, et voilà.

$ chmod +x ncpotato.jpg
$ ./ncpotato.jpg

Let’s now create an executable TODO list, based on magic number detection (M).

# echo ':TODOlist:M::#~[TODO]::/usr/bin/vi:' > /proc/sys/fs/binfmt_misc/register
$ cat /proc/sys/fs/binfmt_misc/TODOlist
enabled
interpreter /usr/bin/vi
flags:
offset 0
magic 237e5b544f444f5d
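The magic shown by the kernel is just our #~[TODO] marker in hexadecimal. The sketch below, in Python, checks that and models how a magic with an optional offset and mask is compared against the start of a file; magic_matches() is a hypothetical helper, and the mask rule (bits cleared in the mask are ignored) is my reading of how binfmt_misc compares bytes.

```python
def magic_matches(data: bytes, magic: bytes, offset: int = 0, mask: bytes = b"") -> bool:
    """Return True if data carries magic at offset, honouring an optional mask."""
    if mask and len(mask) != len(magic):
        raise ValueError("mask must be as long as magic")
    window = data[offset:offset + len(magic)]
    if len(window) < len(magic):
        return False  # file too short to contain the magic
    if not mask:
        return window == magic
    # Only the bits set in the mask take part in the comparison.
    return all(((d ^ m) & k) == 0 for d, m, k in zip(window, magic, mask))

if __name__ == "__main__":
    print(b"#~[TODO]".hex())
    print(magic_matches(b"#~[TODO]\n- buy milk\n", b"#~[TODO]"))
```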

PDFs start with the text %PDF, as seen in the specification.

$ head test.pdf
%PDF-1.4
%íì¦"
% Created by calibre 2.57.1 [http://calibre-ebook.com]
4 0 obj
<< /Type /XObject /ColorSpace /DeviceRGB /BitsPerComponent 8 /Subtype /Image /Filter [/DCTDecode] /Length 206115 /Height 2000 /DL 206115 /Width 1525 >>
stream
JFIFddC

so we just

# echo ':PDF:M::%PDF::/usr/bin/evince:' > /proc/sys/fs/binfmt_misc/register
$ chmod +x test.pdf
$ ./test.pdf

Another example: LibreOffice files by extension

# echo ':ODT:E::odt::/usr/bin/soffice:' > /proc/sys/fs/binfmt_misc/register
$ chmod +x test.odt
$ ./test.odt

We have all the new entries in proc

# ls -l /proc/sys/fs/binfmt_misc
total 0
-rw-r--r-- 1 root root 0 May 22 20:54 ODT
-rw-r--r-- 1 root root 0 May 23 14:14 PDF
-rw-r--r-- 1 root root 0 May 19 14:07 TODOlist
-rw-r--r-- 1 root root 0 May 16 14:09 fehjpg
-rw-r--r-- 1 root root 0 May 18 09:40 python2.7
-rw-r--r-- 1 root root 0 May 18 09:40 python3.5
--w------- 1 root root 0 May 23 14:14 register
-rw-r--r-- 1 root root 0 May 18 09:40 status

With this technique we can transparently run Java applications (based on the 0xCAFEBABE magic)

# binfmt_misc support for Java applications:
# echo ':Java:M::\xca\xfe\xba\xbe::/usr/local/bin/javawrapper:' > /proc/sys/fs/binfmt_misc/register
# binfmt_misc support for executable Jar files:
# echo ':ExecutableJAR:E::jar::/usr/local/bin/jarwrapper:' > /proc/sys/fs/binfmt_misc/register
# binfmt_misc support for Java Applets:
# echo ':Applet:E::html::/opt/java/bin/appletviewer:' > /proc/sys/fs/binfmt_misc/register

This requires the use of a wrapper, that you can get from the Arch Wiki.

This also works for Mono, and even DOS! In order to run good old Civilization transparently, install dosbox, configure binfmt_misc

# echo ':DOSEXE:M::MZ::/usr/bin/dosbox:' > /proc/sys/fs/binfmt_misc/register

, and now we can

$ ./CIV.EXE

The same goes for emulated Windows binaries

# echo ':DOSWin:M::MZ::/usr/local/bin/wine:' > /proc/sys/fs/binfmt_misc/register
$ ./winemine.exe

The problem is that all DOS, Windows and Mono binaries share the same MZ magic, so in order to combine them we would need a special wrapper that is able to detect the differences deeper in the file, such as start.exe.
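To give an idea of what "deeper in the file" can mean, here is a rough sketch in Python of how a wrapper could tell a plain DOS executable from a Windows PE file. It is not how start.exe works: classify_mz() is a made-up helper, it relies on the PE convention of a 4-byte offset at 0x3C pointing to a PE\0\0 signature, and it makes no attempt to recognize Mono assemblies.

```python
import struct

def classify_mz(data: bytes):
    """Return 'pe', 'dos', or None for data that does not start with MZ."""
    if data[:2] != b"MZ":
        return None
    if len(data) >= 0x40:
        # In PE files, offset 0x3C holds the position of the PE signature.
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
        if data[pe_offset:pe_offset + 4] == b"PE\0\0":
            return "pe"
    return "dos"

if __name__ == "__main__":
    fake_pe = b"MZ" + bytes(0x3A) + struct.pack("<I", 0x40) + b"PE\0\0"
    print(classify_mz(fake_pe), classify_mz(b"MZ" + bytes(100)))
```

A wrapper registered for the MZ magic could use a check like this to pick between dosbox and wine.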

If we want to make these settings permanent, we can set up this configuration at boot time in /etc/binfmt.d, which systemd-binfmt reads on systemd-based distributions. For instance, if we wanted to set up the PDF handler at boot time, we just add a new file to the folder

# echo ':PDF:M::%PDF::/usr/bin/evince:' > /etc/binfmt.d/pdf.conf

We can see that this approach is very flexible and powerful. One problem with it is that it is a bit unconventional to have “regular files” other than scripts as executable. The good thing about it is that it is truly a system wide setting, so once you set it up in the kernel, it will work from all your shells, file managers and any other user of the execve() system call.

We will see some more useful things we can do with binfmt_misc in the following post.

References

This article aims at being a gentle overview of binfmt_misc and the execution process with some practical examples. If you want the gory details, consider reading the following references.

How programs get run

How programs get run: ELF binaries

Anatomy of a system call, part 2

System calls in the Linux kernel