I’ve wanted to understand more about the process of how source code gets compiled and packaged with its dependencies into a deployable artifact. I’m starting with C, since most things either follow the C way of doing things or get compared to it.

I’d like to start filling in some gaps in my knowledge like:

What are the steps of building a C program? A compiler? Linker? What else?

What are those .o files that come out?

How does source code depend on other files? How does the compiler package dependencies?

What are libraries? How are they different, how do I make them? Dynamic vs static?

File types

When dealing with C, we have four different types of files:

Source code (*.c files): Source files contain function definitions.

Header files (*.h files): The compiler complains if we call a function before it has been declared. Header files contain function declarations, and we include them when we want a source file to reference functions defined elsewhere.

Object files (*.o files): Object files are the output of the compiler. They contain function definitions in binary form (machine code), but they haven’t been linked into an executable yet and may contain unresolved references to symbols defined in other files.

Binary executables: Executables are the output of the linker, which links a number of object files together to form a file that can be directly executed. Sometimes just called binaries.

But wait, here’s another file type for free!

Libraries (*.a for static libraries, *.so for dynamic libraries): Libraries are just object files joined together into one file. Conceptually, they do the same thing as object files: they contain binary forms of function definitions. They can be linked with other object files and libraries to form a binary. Static libraries are packaged into the executable at link time like other object files. Dynamic libraries let us defer loading until runtime.

Building our source code

Building our code is the process of taking our source code to an executable. Without digging into the internals of a compiler, for C this involves:

Preprocessor (source/header files -> expanded source): The preprocessor transforms source code as indicated by the preprocessor directives. For example, the preprocessor replaces the line #include "header.h" with the entire contents of header.h. #define is another common directive, used for macros and constants, where the preprocessor replaces every instance of the defined name with its replacement text. The compiler invokes the preprocessor automatically before it runs, so all it sees are the processed source files.

Compiler (expanded source -> object files): The compiler turns the processed source code into binary versions of that code, the object files. Object files can be packaged together into a library by a separate tool.

Linker (object files -> executable): The linker takes object files and libraries and combines them into an executable, resolving any external symbols in the process.

The C build process is actually even simpler logically than the file types we mentioned.

Header files are just source files that get preprocessed into other files, not a separate concept. By convention, header files contain just function declarations, but you can include anything a source file can, and people commonly do (like with single header file libraries).

Libraries, again, are just object files packaged together. A library is like an uncompressed zip or tar archive, and I like to think of it as a bunch of object files cat'd together with an index at the top.

So really what we have are source files (source and header files), intermediates (object files and libraries) and the final target (binaries). A library can also be the final target of your build, depending on whether you’re building an executable or not.

Building in action

Let’s check out how this maps to the simplest of examples:

    // main.c
    int main() {
        return 0;
    }

    # compile main.c into an object file, main.o
    gcc -c main.c
    # link main.o into an executable
    gcc main.o -o main

Cool! We’ve compiled a source file into an object file ( main.o ) and then linked it into an executable ( main ).

Single source dependency

Okay now let’s add a source file dependency:

    // add.h
    int add(int a, int b);

    // add.c
    int add(int a, int b) {
        return a + b;
    }

    // main.c
    #include "add.h"
    int main() {
        return add(0, 1);
    }

We can use gcc -E to see the output of the preprocessor:

    > gcc -E main.c
    # 1 "main.c"
    # 1 "<built-in>"
    # 1 "<command-line>"
    # 1 "/usr/include/stdc-predef.h" 1 3 4
    # 1 "<command-line>" 2
    # 1 "main.c"
    # 1 "add.h" 1
    int add(int a, int b);
    # 3 "main.c" 2
    int main() { return add(0, 1); }

I haven’t dug into what all the output is (the lines starting with # are linemarkers that record which file and line each chunk came from), but we can see that the preprocessor copies add.h into main.c as we thought.

However, using the same compile commands fails:

    > gcc -c main.c
    > gcc main.o -o main
    main.o: In function `main':
    main.c:(.text+0xf): undefined reference to `add'
    collect2: error: ld returned 1 exit status

Let’s run nm on main.o to see what symbols are used.

    # nm shows the symbols in an object file
    # man nm lists all the symbol types
    # briefly: T = symbol is in the code (text) section, U = undefined
    > nm main.o
                     U add
    0000000000000000 T main

Here we see add is undefined, which makes sense since we never compiled the add function to binary. We need to go through the same process to compile add.c into an object file and then link it with main.o .

    # compile object files
    gcc -c main.c
    gcc -c add.c
    # link
    gcc main.o add.o -o main

Building our own static library

Now let’s add mult.h/c and build our own static library.

    // mult.h
    int mult(int a, int b);

    // mult.c
    int mult(int a, int b) {
        return a * b;
    }

Before, we would have had to do something like:

    # compile object files
    gcc -c main.c
    gcc -c add.c
    gcc -c mult.c
    # link
    gcc main.o add.o mult.o -o main

But now we will package add.o and mult.o into a single library:

    gcc -c main.c
    gcc -c add.c
    gcc -c mult.c
    # create library
    ar rcs libmath.a add.o mult.o
    # link
    gcc main.o libmath.a -o main

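ar creates an archive from our object files. The flags are r (insert the files, replacing any existing members), c (create the archive if it doesn’t exist) and s (write a symbol index). Let’s run nm on it: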

    # -s also prints the archive index
    > nm -s libmath.a

    Archive index:
    add in add.o
    mult in mult.o

    add.o:
    0000000000000000 T add

    mult.o:
    0000000000000000 T mult

So it looks like what we expected: it includes an index from symbol to object file, followed by the contents of each object file. When linking, we use it just like an object file (the linker only pulls in the members it actually needs).

Building our own dynamic library

Dynamic libraries (aka shared libraries) do change things a little: they let us defer symbol resolution until runtime. This lets us do cool stuff like hot reloading code, and letting multiple binaries load the same shared library.

Continuing from the same example before, our compiling now looks like:

    # compile object files
    # -fPIC makes it position independent
    # positions are relative, so it can be relocated in memory when loaded
    gcc -c main.c
    gcc -c -fPIC add.c
    gcc -c -fPIC mult.c
    # create library
    gcc -shared -o libmath.so.1 add.o mult.o
    # link
    # -L. adds the current dir to the library search path
    # you can also use -lmath to link libmath.so
    gcc main.o -o main -L. -l:libmath.so.1

When running we need to also specify the library search path (where the loader looks for dynamic libraries):

    # Run
    > LD_LIBRARY_PATH=. ./main
    # Show dynamic library dependency resolution
    > LD_LIBRARY_PATH=. ldd main
    ...
    libmath.so.1 => ./libmath.so.1 (0x00007fa023369000)
    ...

Dynamic loading is a pretty big topic of its own, but it still serves the same purpose of resolving symbols like an object file, just with some magic so we can do that when the program is loaded instead of at link time. Unfortunately, this complicates deploying build artifacts, since the library has to be in place alongside the final binary.

Printing and libc

We’re going to get a little crazy here and actually output text. This time we’re just going to have main.c but include stdio.h .

    // main.c
    #include <stdio.h> // puts
    int main() {
        puts("Hello");
    }

    # compile main.c into an object file, main.o
    gcc -c main.c
    # link main.o into an executable
    gcc main.o -o main

    ./main
    # outputs: Hello

We never defined stdio.h or puts but everything works fine. Running gcc -E main.c produces an enormous output, but it looks like stdio.h is coming from somewhere. Let’s run nm on the object file and the binary to see the symbols in each.

    > nm main.o
    0000000000000000 T main
                     U puts
    > nm main
    ...
                     U __libc_start_main@@GLIBC_2.2.5
    0000000000400526 T main
                     U puts@@GLIBC_2.2.5
    00000000004004a0 t register_tm_clones
    ...

Looks like puts is referenced but not defined in main.o, and nm main points us to GLIBC. Libc is the standard library for C, and glibc is the GNU implementation that gcc links against on most Linux systems. It turns out this gets implicitly dynamically linked on every build.

Running ldd on the gcc output shows us dynamic library dependencies (also called shared objects), confirming gcc is linking more than just our main.o.

    # ldd prints "shared object dependencies" (dynamic libraries)
    > ldd main
    linux-vdso.so.1 => (0x00007fff945b0000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0c8d0f7000)
    /lib64/ld-linux-x86-64.so.2 (0x0000556bcd7fb000)

GCC is doing a lot more than just calling the linker ld with ld main.o -o main. We can see it all with gcc -v main.o -o main … it’s a lot. It seems hard to make ld work directly because of all the libraries and startup files we need to link against to actually make a C executable.

So even for 4 lines of code, we’ve got a lot going on. We found out GCC does a lot implicitly to build an executable, which we glossed over before. Apparently we need vdso (a small shared object the kernel maps into every process so that some system calls, like reading the clock, can run without switching into the kernel), libc (the standard C library) and ld-linux (the dynamic linker/loader).

BUT, it does still fall under our mental model. We build main.c into main.o , which has some undefined references. In order to make an executable, we combine our intermediate object files with libc dynamically (and a loader) and if every symbol is resolved, it works! It’s the same as the dynamic library example, with just implicit stuff happening that probably makes reliable building a headache.

Wrapping up

I’ve learned a lot about C builds from this, and I’m curious to see what other languages do. Thankfully, the mental model of source to object to binary (or library) target is pretty straightforward, even if we ended up doing a lot of work digging into some really simple builds.