Let’s take a look at a hello-world application.

main.c :

#include <stdio.h>

int main(void) {
    puts("Hello World!");
    return 0;
}

This is a trivial working program in C. If you’re coming from a higher-level language background (say, JavaScript), this code seems to make sense. We define a function that in turn calls another function and returns a value.

But C and JavaScript are very different languages. The former is compiled, relatively low-level, and its primitives map quite well to the way computers work on a hardware level. The latter deals with higher-level constructs and requires a complex engine that runs the code and abstracts away a lot of machine-level details (how memory is managed, how certain data structures are stored, etc.).

The upcoming series of posts means to explain the areas of C that might not be obvious if you only have experience with high-level, interpreted languages.

Here’s the first question: what does it take to make this code run (how is it compiled)? And where exactly does the puts function come from?

Note: I’m running all the commands below on Ubuntu Linux. On a Mac, you should be able to run them unchanged. On Windows, see the documentation for your compiler, or try Cygwin. The ideas presented here are OS-independent.

Compilation Process

To understand the example at the beginning of the article, let’s see how a C program is compiled (and what it’s compiled to).

First of all, what’s the main difference between interpreted and compiled languages? The former rely on a language engine — a separate piece of software that takes your JavaScript (or Python, or whatever) code as input and runs it. The latter use a compiler to generate machine instructions that the processor later executes directly, without the assistance of an extra software layer. So in both cases there’s a process of translating human-readable code into machine code, and the difference is whether it happens at compile time or at runtime.

What do you need in order to compile C code? Beginners might think that to do anything with C you need a large and complex IDE, like MS Visual Studio. In fact, the only thing you need is a compiler. Granted, modern compilers also tend to be large and complex, but they can run on the command line, and the only thing they need is an input file, which can be written in any text editor. Popular compilers include gcc and clang for Mac and Linux, or Visual C++ for Windows (yes, it is also available on the command line).

So if we have the main.c source file from the beginning of the article, compiling it is as simple as invoking something like the following:

> gcc main.c -o out

-o out stands for “write the output to the file out ” (which is going to be the executable). When you run the executable, you see the expected output:

> ./out

Hello World!

Actually, the term “compilation” is overloaded. It can either mean the whole process of transforming a source file (or several source files) into an executable, or just a part of this process, namely creating a corresponding assembly file for each source file (this is also called “compilation proper”). So here’s the whole process:

1. Preprocessing. Before compilation starts, preprocessing directives (#include, #define, #if and the like) are executed. At the end of this step, we have files with no preprocessing directives. These files (which are inputs to the compiler) are called translation units.

2. Compilation (proper) and assembly. Each translation unit is separately transformed into an object file, which contains machine code and linker directives. These files cannot yet be executed.

3. Linking. All object files are finally combined to produce an executable.

Dependency Resolution

Where does the puts function come from? If you think that it’s somehow being imported by including stdio.h , you’re only partially right.

Let’s remove the #include and see what happens.

main.c :

int main(void) {
    puts("Hello World!");
    return 0;
}

> gcc main.c -o out

main.c: In function ‘main’:

main.c:2:3: warning: implicit declaration of function ‘puts’ [-Wimplicit-function-declaration]

puts("Hello World!");

^

We got a warning, but no errors. Let’s try running out :

> ./out

Hello World!

Success! So what’s going on here? We did not import or include anything in main.c that mentions the name puts, so how does the compiler know what function to call?

To understand how that works, let’s look at the assembly code that main.c gets compiled to. With gcc , you can pass a -S option to make it stop after preprocessing and compilation proper.

> gcc -S main.c -o main.s -masm=intel -fno-asynchronous-unwind-tables

main.c: In function ‘main’:

main.c:2:3: warning: implicit declaration of function ‘puts’ [-Wimplicit-function-declaration]

puts("Hello World!");

^

We got that same warning here (which was to be expected), but let’s look at the resultant assembly file.

main.s :

    .file "main.c"
    .intel_syntax noprefix
    .section .rodata
.LC0:
    .string "Hello World!"
    .text
    .globl main
    .type main, @function
main:
    push rbp
    mov rbp, rsp
    mov edi, OFFSET FLAT:.LC0
    call puts
    mov eax, 0
    pop rbp
    ret
    .size main, .-main
    .ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609"
    .section .note.GNU-stack,"",@progbits

The exact output depends on the compiler, the processor architecture, and the options you’re compiling with (omitting -masm=intel -fno-asynchronous-unwind-tables in the above example changes the output a little, but the relevant bits stay similar). We should be looking for some sort of call or jump command.

The instruction we’re after is call puts: push the return address onto the stack and set the instruction pointer to the address of the label puts. Or, more plainly, “make it so that the next instruction the processor executes is the one marked by puts, and remember where to come back to”. The instruction before it (mov edi, OFFSET FLAT:.LC0) places the address of the "Hello World!" string in the edi register (the lower half of rdi), which is where a function on x86-64 Linux expects its first integer or pointer argument. Those two instructions are all we get for the puts function call.

To be fair, modern versions of the C standard (starting with C99) demand that if a function is to be called, it has to be previously declared (hence the warning). If we want to fix the issue, we don’t actually need to include stdio.h . All we need to do is add a declaration for puts .

main.c :

int puts(const char *);

int main(void) {
    puts("Hello World!");
    return 0;
}

If we try to compile the above, it will go smoothly.

> gcc main.c -o out

> ./out

Hello World!

Now we’re making the signature of the function known (and the compiler acknowledges it by not issuing the warning), but there’s still no indication where the function body is to be found.

In fact, the compiler and the assembler don’t care about the function’s location. All the compiler does when it sees the function call is add assembly instructions for placing the arguments in their respective locations (specified by the relevant application binary interface) and for calling the function.

Being able to call a function without declaring it is a remnant of older, less strictly typed versions of C. Older gcc releases, such as the one used here, allow such code to be compiled for compatibility reasons: the compiler applies the default promotions to whatever arguments you pass and assumes that the function returns int. We’re not using the return value of puts (and even if we did, its actual return type happens to be int, just like the compiler’s default), and we’re passing it a string, which is what the actual function expects. That’s why everything works even without a declaration. Don’t rely on this, though: newer compilers (GCC 14 and Clang 16 onwards) report an implicit declaration as an error by default.

So at what point do we finally resolve the connection between the function call and its definition? And where is that definition?

That’s one of the things the linker is responsible for. As we discussed previously, the linker deals with non-executable object files: it takes one or more of them as input and produces an executable. In the process, it makes sure that if something is being called somewhere, a corresponding label exists somewhere else. If it does exist, the label is converted into a concrete memory address (for functions that live in shared libraries, this last step is deferred until the program is loaded). If not, the linker produces an error.

Let’s see how the linker is being called (the -v flag below is for verbose output).

> gcc -v main.c -o out

... (output omitted)

/usr/lib/gcc/x86_64-linux-gnu/5/collect2 -plugin /usr/lib/gcc/x86_64-linux-gnu/5/liblto_plugin.so -plugin-opt=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper -plugin-opt=-fresolution=/tmp/ccz0r2qe.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --sysroot=/ --build-id --eh-frame-hdr -m elf_x86_64 --hash-style=gnu --as-needed -dynamic-linker /lib64/ld-linux-x86-64.so.2 -z relro -o out /usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crt1.o /usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/5/crtbegin.o -L/usr/lib/gcc/x86_64-linux-gnu/5 -L/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/5/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/5/../../.. /tmp/ccAjWoyV.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /usr/lib/gcc/x86_64-linux-gnu/5/crtend.o /usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crtn.o

The last command here, the one we’re interested in, is presented in full. It was autogenerated and was not intended to be particularly human-readable, but we can still see what’s going on. The collect2 utility (a gcc helper that in turn invokes the system linker, ld) is called with a few options, and a bunch of object files (*.o) are passed to it, including /tmp/ccAjWoyV.o, which is the temporary object file generated from main.c. The option of interest to us is -lc, which tells the linker to link with the shared library libc.so, found in a predefined system location. That is an actual file on your system where the implementations of the C standard library functions, including puts, reside.

If you’re curious to see how the function might be implemented in C, you can find examples online. The source doesn’t have to be installed on your machine though — a compiled library version is good enough.

Header files

Let’s go back to the initial example.

main.c :

#include <stdio.h>

int main(void) {
    puts("Hello World!");
    return 0;
}

We’ve shown that the #include directive above is not necessary for compilation and can be replaced with a single declaration. But why do we generally include header files, and what happens when we do so? And how is a header file intrinsically different from a source file?

Let’s create another file in the same directory.

misc.c :

#include <stdio.h>

void say_hi(void) {
    puts("Hi!");
}

void say_thanks(void) {
    puts("Thanks!");
}

void say_bye(void) {
    puts("Bye!");
}

Now we’d like to use the functions from that source file in main.c :

int main(void) {
    say_hi();
    say_thanks();
    say_bye();
    return 0;
}

Let’s try compiling it. Multiple source files can be compiled into a single executable, and one of the conditions for successful compilation is that among all those sources there’s one and only one function called main , which is the entry point into the application (I will expand on the semantics of main and its usage in different environments in a later post).

> gcc main.c misc.c -o out

main.c: In function ‘main’:

main.c:2:3: warning: implicit declaration of function ‘say_hi’ [-Wimplicit-function-declaration]

say_hi();

^

main.c:3:3: warning: implicit declaration of function ‘say_thanks’ [-Wimplicit-function-declaration]

say_thanks();

^

main.c:4:3: warning: implicit declaration of function ‘say_bye’ [-Wimplicit-function-declaration]

say_bye();

^

> ./out

Hi!

Thanks!

Bye!

We got the expected warnings, but the program did compile and worked as planned. We can make the warnings go away, like in the previous example, by declaring the functions in main.c before using them.

main.c :

void say_hi(void);
void say_thanks(void);
void say_bye(void);

int main(void) {
    say_hi();
    say_thanks();
    say_bye();
    return 0;
}

If we try to run the same compilation command, there’ll be no warnings.

In fact, it’s common practice to declare all misc.c functions at the top of misc.c as well, before they’re defined.

misc.c :

#include <stdio.h>

void say_hi(void);
void say_thanks(void);
void say_bye(void);

void say_hi(void) {
    puts("Hi!");
}

void say_thanks(void) {
    puts("Thanks!");
}

void say_bye(void) {
    puts("Bye!");
}

Header files are a way to avoid this repetition.

misc.h :

void say_hi(void);

void say_thanks(void);

void say_bye(void);

misc.c :

#include <stdio.h>
#include "misc.h"

void say_hi(void) {
    puts("Hi!");
}

void say_thanks(void) {
    puts("Thanks!");
}

void say_bye(void) {
    puts("Bye!");
}

main.c :

#include "misc.h"

int main(void) {
    say_hi();
    say_thanks();
    say_bye();
    return 0;
}

What happens is we simply extract the repeating bit of code to a dedicated file and replace it with an #include directive.

Then during preprocessing the opposite happens — the directive is replaced with the full contents of the header file. We can try it like so (the -E option is for stopping compilation after preprocessing, and the *.i file extension is gcc convention for sources that need not be preprocessed).

> gcc -E main.c -o main.i

main.i :

# 1 "main.c"

# 1 "<built-in>"

# 1 "<command-line>"

# 1 "/usr/include/stdc-predef.h" 1 3 4

# 1 "<command-line>" 2

# 1 "main.c"

# 1 "misc.h" 1

void say_hi(void);

void say_thanks(void);

void say_bye(void);

# 2 "main.c" 2

int main(void) {
    say_hi();
    say_thanks();
    say_bye();
    return 0;
}

You may ignore the linemarkers (lines starting with #). The rest is just the contents of the original main.c, with the #include directive replaced by the full contents of misc.h.

That’s it. Header files are not indicative of a module system: the preprocessor is just doing text substitution. A header file may contain anything, not only function declarations (as long as it’s meaningful when compiled). What makes it a header file is not its .h extension or its contents, but the fact that we don’t pass it to the compiler as part of the source file list and instead paste its contents into other source files via the #include directive. It still gets compiled in the end, possibly multiple times: once for each #include.

As an aside, headers may have #include directives too, so the preprocessor has to perform the replacement recursively. It may so happen that two different header files independently include stdio.h, and they both get included in a source file. What we end up with is a header file included twice, since nothing prevents us from doing so. Repeated function declarations are permitted, so that alone doesn’t break anything, but it slows down compilation, and a header that defines a type (a struct, for example) would cause a redefinition error. A common workaround, known as an include guard, is to check manually whether a certain header has already been included.

misc.h :

#ifndef _MISC_H_
#define _MISC_H_

void say_hi(void);
void say_thanks(void);
void say_bye(void);

#endif

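The code above defines a preprocessing macro called _MISC_H_ the first time the header is processed. If the preprocessor encounters the same block again, it skips it, since _MISC_H_ is already defined. This way we ensure that there’s only one copy of the above header in any given file after preprocessing, no matter how many times it’s directly or indirectly included. (Strictly speaking, identifiers that start with an underscore followed by an uppercase letter are reserved for the implementation, so a name like MISC_H is the safer choice.)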

As another aside, the quotation marks ( "" ) versus angle brackets ( <> ) notation in the #include directive specifies which directories will be searched for the file. The standard doesn’t specify this, but all major compilers behave in the same fashion: the angle-bracket version just searches in a predefined set of system directories, and in the case of quotation marks, the search starts in the directory where the source file is located, and if it’s not found, it falls back to the system directories. To make it simpler, angle brackets are for system header files, and quotes are for custom headers.

You can now try running gcc with the -E option on the example from the very beginning of the article — the output should be a few hundred lines long, but the puts declaration should be there.

Conclusion

C is not a language that can replace JavaScript or Python — it’s much less pleasant to write high-level software in it. But it still makes a lot of sense to learn it because it gives you a much better understanding of how computers work. And understanding how programs are compiled and how they run is essential to writing anything more serious than a hello-world application.

The Full Series