Introduction

After studying compilers and programming languages, I felt like internet tutorials and guides are way too complex for beginners or are missing some important parts about these topics.

My goal with this post is to help people that are seeking a way to start developing their first programming language/compiler.

Want to read this story later? Save it in Journal.

Requirements

On this guide, I’ll be using PLY as lexer and parser, and LLVMlite as low level intermediate language to do code generation with optimizations (if you don’t know what I’m talking about, don’t worry, I’ll explain it later).

So, the requirements for this project are:

Anaconda (way simpler to install LLVMlite through conda than pip)

LLVMlite

$ conda install --channel=numba llvmlite

RPLY (same as PLY but with a better API)

$ conda install -c conda-forge rply

LLC (LLVM static compiler)

GCC (or other linking tool)

Getting Started

“Where to begin?”. This is the most common question when trying to create your programming language.

I’ll start by defining my own language. Let’s call it TOY, here’s a simple example of a TOY program:

var x;

x := 4 + 4 * 2;

if (x > 3) {

x := 2;

} else {

x := 5;

}

print(x);

Although this example is really simple, it is not so easy to be implemented as a programming language. So let’s start with a simpler example:

print(4 + 4 - 2);

But how do you formally describe a language grammar? It is way to hard to create all possible examples of a language to show everything it can do.

To do this, we do what is called an EBNF. It is a metalanguage to define with one document, all possible grammar structures of a language. You can find most programming languages EBNFs easily.

To understand better how a EBNF grammar works, I recommend reading this post.

EBNF

Let’s create a EBNF that describes the minimal possible functionality of TOY, only a sum operation. It will describe the following example:

4 + 2;

It’s EBNF can be described as:

expression = number, "+", number, ";";

number = digit+;

digit = [0-9];

This example is way too simple to be useful as a programming language, so let’s add some functionalities. The first one, is to be able to add as many numbers as you want, and the second one, is to to be able to subtract numbers as well.

Here’s an example of our new programming language:

4 + 4 - 2;

And it’s EBNF can be described as:

expression = number, { ("+"|"-"), number }, ";";

number = digit+;

digit = [0-9];

And finally, adding print to our programming language:

print(4 + 4 - 2);

We get to this EBNF:

program = "print", "(", expression, ")", ";";

expression = number, { ("+"|"-"), number } ;

number = digit+

digit = [0-9];

Now that we’ve defined our grammar, how do we translate it to code? So we can validate and understand a program? And after that, how can we compile it to a binary executable?

Compiler

A compiler is a program that turns a programming language into machine language or other languages. In this guide, I’m going to compile our programming language into LLVM IR and then into machine language.

Diagram showing the process of compiling some programming languages, passing through an IR (Intermediate Representation).

Using LLVM, it is possible to optimize your compilation without learning compiling optimization, and LLVM has a really good library to work with compilers.

Our compiler can be divided into three components:

Lexer

Parser

Code Generator

For the Lexer and Parser we’ll be using RPLY, really similar to PLY: a Python library with lexical and parsing tools, but with a better API. And for the Code Generator, we’ll use LLVMlite, a Python library for binding LLVM components.

Lexer

The first component of our compiler is the Lexer. It’s role is to take the program as input and divide it into Tokens.

Lexer process of transforming a string into a list of tokens.

We use the minimal structures from our EBNF to define our tokens. For example, with the following input:

print(4 + 4 - 2);

Our Lexer would divide this string into this list of tokens:

Token('PRINT', 'print')

Token('OPEN_PAREN', '(')

Token('NUMBER', '4')

Token('SUM', '+')

Token('NUMBER', '4')

Token('SUB', '-')

Token('NUMBER', '2')

Token('CLOSE_PAREN', ')')

Token('SEMI_COLON', ';')

So, let’s start coding our compiler. First, create a file named lexer.py . We’ll define our tokens on this file. We’ll only use LexerGenerator class from RPLY to create our Lexer.

After this, create your main file named main.py . We’ll combine all three compiler components on this file.

If you run $ python main.py , the output of tokens will be the same as described above. You can change the name of your tokens if you want, but I recommend keeping the same to keep consistency with the Parser.

Parser

The second component in our compiler is the Parser. It’s role is to do a syntax check of the program. It takes the list of tokens as input and create an AST as output. This concept is more complex than a list of tokens, so I highly recommend a little bit of research about Parsers and ASTs.

Lexical analysis followed by parsing to create the AST.

To implement our parser, we’ll use the structure created with out EBNF as model. Luckly, RPLY’s parser uses a format really similar to the EBNF to create it’s parser, so it is really straightforward.

The most challenging is to attach the Parser with the AST, but when you get the idea, it becomes really mechanical.

First, create a new file named ast.py . It will contain all classes that are going to be called on the parser and create the AST.

Second, we need to create the parser. For that, we’ll use ParserGenerator from RPLY. Create a file name parser.py:

And finally, we’ll update our file main.py to combine Parser with Lexer.

Now, if you run $ python main.py , you’ll see the output being the result of print(4 + 4 — 2) , which is equal to printing 6.

With these two components, we have a functional compiler that interprets TOY language with Python. However, it still doesn’t create a machine language code and is not well optimized. To do this, we’ll enter the most complex part of the guide, code generation with LLVM.

Code Generator

The third and last component of out compiler is the Code Generator. It’s role is to transform the AST created from the parser into machine language or an IR. In this case, it’s going to transform the AST into LLVM IR.

This component is the main reason why I’m writing this post. There aren’t good guides on how to implement code generation with LLVM on Python.

LLVM can be really complex to understand, so if you wish to fully understand what is going on, I recommend reading LLVMlite docs.

LLVMlite doesn’t have a implementation to a print function, so you have to define your own.

So, to start, let’s create a file named codegen.py that will contain the class CodeGen . This class is responsible to configure LLVM and create and save the IR code. We also declare the Print function on it.

After this, let’s update our main.py file to call CodeGen methods:

As you can see, I removed the input program from this file and created a new file called input.toy to simulate a external program. It’s content is the same as the input described.

print(4 + 4 - 2);

Another change that was made is passing module , builder and printf objects to the Parser. This was made so we could pass this objects to the AST, where the LLVM AST is created. So, we change parser.py to receive these objects and pass them to the AST.

And finally, and most important, we change the ast.py to receive these objects and create the LLVM AST using LLVMlite methods.

With these changes, our compiler is ready to transform a TOY program into an LLVM IR file output.ll . To compile this .ll file into a executable, we’ll use LLC to create a object file output.o , and finally, GCC (you could use other linking programs) to create the final executable file.

$ llc -filetype=obj output.ll

$ gcc output.o -o output

And you can finally run your executable file compiled from the initial program.

$ ./output

Next Steps

After this guide, I hope you can understand an EBNF and the three basic concepts of a compiler. With this knowledge, you now can create your own programming language and write a optimized compiler to it with Python. I encourage you to go further and add new elements to your language and compiler, here are some ideas:

Statements

Variables

New Binary Operators ( Multiplication, Division)

Unary Operators

If Statement

While Stametement

Feel free to send me any compiler projects. I’ll be happy to help you with anything.

You can contact me at marceloga1@al.insper.edu.br

Hope you all enjoyed this post and got some love to programming languages and compilers!

You can see the final code on GitHub.