How the LLVM Compiler Infrastructure Works

By David Chisnall

Date: May 23, 2008

Article is provided courtesy of Prentice Hall Professional.


LLVM has been creating waves recently as the compiler for the iPhone and in a number of other places. David Chisnall takes a look at what makes this project so interesting.

I talked a bit about the low-level virtual machine (LLVM) when comparing open source compilers. Since then, I’ve become involved with the LLVM project, working on code generation for the Objective-C language. In this article, I’ll give you a more in-depth overview of how LLVM works.

What Is LLVM?

LLVM is a virtual machine infrastructure that doesn’t provide any of the high-level features you’d find in something like the Java or .NET virtual machines, such as garbage collection or an object model.

The basic design of LLVM is an unlimited register machine (URM), familiar to most computer scientists as a universal model of computation. It differs from most URMs in two ways:

Registers are single-assignment. Once a value for a register has been set, it can’t be modified. This representation, known as static single assignment (SSA) form, is common in a lot of compilers, and has been since it was developed by researchers at IBM in the 1980s.

Each register has a type associated with it.
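Both properties are easy to see in even a tiny fragment of IR. Here’s a minimal sketch (the function and register names are my own, not from any real program):

define i64 @example(i32 %a, i32 %b) {
entry:
    %sum = add i32 %a, %b         ; %sum has type i32 and is assigned exactly once
    %wide = zext i32 %sum to i64  ; changing the value or type requires a new register
    ret i64 %wide
}

A second definition of %sum would simply be rejected when the IR is assembled or verified.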

LLVM programs are assembled from basic blocks. A basic block is a sequence of instructions with no branches into or out of the middle; every block ends with a single flow-control instruction, and conditional execution is created by branching between blocks. The phi instruction is used where control flow merges. The name comes from the original work on static single assignment, so the semantics will be familiar to anyone who has worked on a compiler that uses this form. It allows the value of an LLVM register to be set to one of a group of values, depending on which basic block control entered the current one from. Consider the following snippet of C:

if(condition) a = 1; else a = 2;

In LLVM, or any other compiler with an SSA intermediate representation, a basic block would be constructed for each of the assignments. A phi instruction would then be used in the following code to select the correct value for a.
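Here’s a rough sketch of how that snippet might come out in LLVM IR (the register and label names are invented for illustration):

define i32 @select(i1 %condition) {
entry:
    br i1 %condition, label %then, label %else
then:
    br label %done
else:
    br label %done
done:
    %a = phi i32 [ 1, %then ], [ 2, %else ]
    ret i32 %a
}

Read the phi as: %a is 1 if control arrived from %then, and 2 if it arrived from %else.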

In LLVM, there are two sorts of registers:

Global registers have names that are valid in the entire module (or possibly the entire program).

Local registers have names that are valid only in the current function.
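The two sorts are distinguished by their prefixes: global names start with @, local names with %. A small hypothetical fragment shows both:

@counter = global i32 0           ; @counter is visible throughout the module

define i32 @get() {
entry:
    %val = load i32* @counter     ; %val exists only inside @get
    ret i32 %val
}

Outside @get, the name %val has no meaning, while @counter can be referenced from any function in the module.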

The Intermediate Representation

The core of LLVM is the intermediate representation (IR). Front ends compile code from a source language to the IR, optimization passes transform the IR, and code generators turn the IR into native code.

LLVM provides three isomorphic representations of the IR. The most common one used in examples is the assembly format, which looks roughly like an assembly language for a real machine (although with a few significant differences). A "Hello, world" program might look something like this:

@.str = internal constant [12 x i8] c"hello world\00"

; printf is defined elsewhere, so only its signature is declared here
declare i32 @printf(i8*, ...)

define i32 @main() nounwind {
entry:
    %tmp1 = getelementptr [12 x i8]* @.str, i32 0, i32 0
    %tmp2 = call i32 (i8*, ...)* @printf( i8* %tmp1 ) nounwind
    ret i32 0
}

First is a constant string, @.str. This has two qualifiers, internal and constant, which are the equivalent of static const in C. It then has a type. The square brackets signal that it’s an array; in this case, an array of 12 8-bit integers.

The main function doesn’t contain any branches, so it’s a single basic block. The label entry: indicates the start of the basic block, and the final instruction, ret, indicates the end. Every basic block is terminated with some kind of flow-control instruction. The ret instruction means return; in this case, returning 0 as a 32-bit integer. The type specified by the ret instruction and the return type specified in the function definition must match, or the IR will fail to validate.

Above the return instruction is a call to printf. Again, note the type signatures everywhere. The printf function’s return and argument types are given explicitly, and the types of the arguments at the call site are also listed. The nounwind on the end indicates that this call is guaranteed not to throw an exception, which the optimizers can take advantage of later.

I’ve waited until now to describe the first instruction in this basic block because it’s the one that most commonly causes confusion. Most programming languages (certainly, all Algol-family languages) contain some data structures that are accessed via offsets from their starts. A lot of CPUs include complex addressing modes for dealing with them. The getelementptr instruction (often referred to as GEP) provides something that can easily map to both.

The first argument is a complex type, in this case our global string variable. Note that, although the string is declared as an array type, when you reference it you actually get a pointer to that array. Our call to printf wants a pointer to an i8, but we have a pointer to an array of i8s. The remaining arguments to our GEP instruction are element offsets. The first dereferences the pointer, to give an array. The second then gets a pointer to the 0th element in the array. This instruction can get pointers to any element in an arbitrarily complex data structure.

I’ve said that it dereferences the pointer, but actually that’s not quite true. The GEP instruction just calculates offsets. When given all zero arguments, as in this example, all it’s really doing is casting a pointer to another type, which will emit no code in the final code-generation phase. This instruction could be replaced by a cast instruction that would simply change the pointer types. Both are semantically valid in this instance, but the GEP instruction is safer because it will validate the types.
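To see why GEP is more than a glorified cast, consider a hypothetical nested type (the structure, names, and indexes here are my own illustration):

%struct.pair = type { i32, [4 x i8] }

define i8* @third_byte(%struct.pair* %pairs) {
entry:
    ; Step to pair 1, select field 1 (the [4 x i8] array), then byte 2
    %byte = getelementptr %struct.pair* %pairs, i32 1, i32 1, i32 2
    ret i8* %byte
}

The first index steps to element 1 of the array that %pairs points into, the second selects the struct’s [4 x i8] field, and the third picks byte 2 of that field; %byte ends up with type i8*, and no memory is read in the process.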

While this representation of the IR is the one you’re likely to see most often in examples, it’s not the most commonly used format. When generating IR, it’s common to use a set of C++ classes that represent it and provide convenience methods for constructing it. Intermediate values are then referenced simply as pointers to llvm::Value objects, rather than by name. Whenever the IR is being generated, transformed, or emitted, it’s this C++ representation that’s actually in use; the assembly format is mainly for human consumption.
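As a rough sketch of what using those classes looks like, here’s a fragment built around the llvm::IRBuilder helper; note that the header location and exact signatures have varied between LLVM versions:

// Older LLVM releases shipped this as llvm/Support/IRBuilder.h
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Emit "%sum = add i32 %a, %b" at the builder's insertion point.
// 'a' and 'b' are the Value pointers produced by earlier instructions;
// the result is a new Value, referenced by pointer rather than by name.
Value *emitSum(IRBuilder<> &builder, Value *a, Value *b) {
    return builder.CreateAdd(a, b, "sum");
}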

The final representation is the bitcode, a very dense binary format used to transfer LLVM IR between components in different address spaces. When using LLVM tools connected by pipes, the bitcode is sent between them. It can also be serialized to disk and loaded later.

Optimizers Everywhere

The LLVM infrastructure is designed to be modular. Each of the optimization passes is a self-contained transform that takes LLVM IR as input and produces it as output. Any combination of the optimizations can be run in any order. (Sometimes you might even want to run one more than once.)
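The command-line tools expose this modularity directly. In this example pipeline, bitcode flows between the tools, and opt runs exactly the passes requested, in the order given (mem2reg and gvn are two real passes among many):

llvm-as < hello.ll | opt -mem2reg -gvn | llc -o hello.s

Each tool reads and writes the bitcode format, so passes can be combined, reordered, or rerun freely.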

LLVM aims to allow optimizations to be run at any time. When a module is compiled to the IR, the first set of optimizations runs. Then, when it’s linked with other modules, it can be optimized again. This functionality is used by the OpenGL Shading Language (GLSL) implementation on newer versions of Mac OS X.

GLSL is a language for writing shaders for OpenGL programs, and features a lot of vector operations. When an OpenGL program runs, the shader program is sent to the driver, which just-in-time (JIT) compiles it to the GPU’s instruction set and loads it. For GPUs that can’t run the program, the driver needs to provide fallback code to run it on the CPU. Before adopting LLVM, Apple had two GLSL implementations. One was a simple interpreter, in which every GLSL operation was a simple C function call. The other was a hand-coded JIT that emitted AltiVec instructions.

The new version unifies these implementations. The JIT emits LLVM code that simply calls the functions that the interpreter uses. However, these functions are compiled to LLVM IR, not to native code. At runtime, the LLVM link-time optimization passes run, inlining the operations and performing a number of other optimizations. The final code takes advantage of whatever vector unit the target CPU has (SSE or AltiVec), running about 10% faster than the original hand-coded JIT. Since the same code is used in the interpreter as in the JIT, it’s also much easier to debug.

The IR doesn’t have to be compiled to native code; it can also run in an interpreter. This approach allows runtime optimizations to be performed, transforming the program at runtime into a better-optimized version based on profiling information. Alternatively, the profiling information can be collected at runtime and the optimizations applied between program runs, in what LLVM calls idle-time optimization.

Once the optimizations have run, the IR is exported. Usually the exported format is machine code for some architecture, but a few other back ends exist, including one that produces C code and one that produces MSIL for the .NET runtime (still under development). The mechanism is quite simple. Writing a back end just requires you to map each LLVM instruction to a native instruction (or sequence of instructions). This is great for RISC architectures, but for something like x86 it’s not ideal. In addition to the simple mappings, it’s also possible to define more complex mappings that translate a sequence of LLVM IR instructions into a single native instruction. These are tried first and then lowered to the simpler mappings on architectures that lack the corresponding instructions. This technique is used in particular for emitting vector instructions. LLVM supports vector types in the IR. Vector operations are emitted directly as vector instructions on architectures that support them, or lowered to sequences of scalar operations on architectures that don’t.
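A sketch of what a vector operation looks like in the IR (the function itself is hypothetical):

define <4 x i32> @vadd(<4 x i32> %a, <4 x i32> %b) {
entry:
    ; A single IR instruction: on SSE or AltiVec this becomes one vector
    ; instruction; on targets without vector units, four scalar adds
    %sum = add <4 x i32> %a, %b
    ret <4 x i32> %sum
}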

Clang!

The part of LLVM that’s been getting a lot of attention recently is the C language family front end, known as "clang." Currently, for compiling C, C++, Objective-C, and a few other languages, LLVM uses code taken from GCC—parsing the code with GCC and then converting GIMPLE (the GCC intermediate representation) into LLVM IR, which is fed into the optimization stages. This strategy is not ideal, for two reasons:

GIMPLE throws away some of the semantic information that could be useful to optimizers.

The GCC front end is GPL’d (since GCC is GPL’d), and the rest of the LLVM code is under a BSD-style license.

Yet another problem: Apple seems to have a corporate allergy to GPLv3, and since GCC is now developed under this license, Apple is forced to maintain its fork completely independently of the main version. Even in version 2, the GPL presents other problems. Apple wants to integrate the compiler’s parser closely with its (proprietary) IDE, so that syntax highlighting is done by something that’s capable of understanding macros and has exactly the same behavior as the compiler. The idea is that warnings can be displayed without needing to go through the whole compile process. But the parser from GCC can’t be used for this without making the IDE GPL’d as well.

Last June, Apple began the clang project, a C-family (C, Objective-C, and C++) front end for LLVM. Like the rest of LLVM, this is highly modular, allowing individual parts to be used easily in other projects. Somewhat unusually for Apple, clang is being developed out in the open, on a University of Illinois at Urbana-Champaign (UIUC) Subversion server, with public mailing lists for developers (also hosted by UIUC).

In many ways, the clang front end can be seen as a simple compiler in its own right. It takes C source code and compiles it into LLVM "machine code." Unlike most compilers, it performs no optimizations (LLVM does those for clang). This approach makes LLVM very interesting for developers who want to implement their own languages. Writing a compiler that targets LLVM is much easier than producing one that targets a real architecture. You don’t have to worry about register allocation at all, and you can produce very inefficient code that still will run fast.
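For example, a naive front end can give every variable a stack slot and emit a load or store for each use, never building SSA form itself. Here’s a sketch of what such a front end might produce for the earlier if/else (names invented):

define i32 @naive(i1 %condition) {
entry:
    %a = alloca i32                ; every variable lives in stack memory
    br i1 %condition, label %then, label %else
then:
    store i32 1, i32* %a
    br label %done
else:
    store i32 2, i32* %a
    br label %done
done:
    %val = load i32* %a            ; reload on every use
    ret i32 %val
}

LLVM’s mem2reg pass promotes those stack slots to registers and inserts the phi instructions automatically, so the simple front end still ends up with fast code.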

The Code and the People

LLVM is written in C++. Regular readers will know that I consider C++ to have almost no redeeming features, and a load of hacks piled on to make up for fundamentally flawed underlying semantics. That said, LLVM is not bad by C++ standards. The developers claim that it’s written in a "tasteful subset" of C++, which is fairly accurate.

In spite of my dislike for the language, and the fact that it’s the embodiment of Greenspun’s Tenth Rule of Programming ("Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp"), it’s very easy to use. It took me four days between looking at the code for the first time and getting my first patch accepted, and for someone who hasn’t spent the last five years pretending C++ was just a bad dream it would probably take even less time. In contrast, every time I look at the GCC code, it takes two people to prevent me from clawing my eyeballs out.

The real make-or-break metric for any open source project is the community. LLVM’s community is very friendly, and it’s a lot of fun to participate in. Even looking in the bug database, particularly at PR1000, gives you some idea of the kind of people you’ll find on the project. LLVM is generating interest from a number of companies, including Apple and Adobe, but is still officially an academic research project, maintaining a nice balance between corporate and academic interests and the greater open source community.