How Difficult is it to Write a Compiler? Recently I was discussing Converge with someone, and mentioned how little time the core compiler had taken to implement (no compile-time meta-programming, limited error checking, but a functioning compiler nonetheless) - only a few days. The chap I was talking to looked at me and told me that he didn't believe me. I laughed. Actually, he really didn't believe me. Initially that surprised me, and I wondered why he might think that I was pulling his leg. But then I thought back, and only a few years ago I would probably have had the same reaction. This article is my attempt to explain his reaction, and why it's no longer the case. When I was younger, there were three things in computing that seemed so complex that I had no idea how anyone had ever created them; certainly, I didn't think I would ever be capable of working on such things. The imposing triumvirate were hardware, operating systems, and programming languages. Although electricity has always baffled me, the hardware people have done an amazing job on interoperability of components in recent years, such that I've never felt the need to gain a low-level understanding of how electrons whizz around my systems. Operating systems were partially demystified for me as I moved into the world of open source UNIX-like operating systems (for the record, I've been an OpenBSD user for 8 years or so), where one could poke and prod virtually any aspect - although kernels retain an aura of mystery for most of us. Despite the fact that I was familiar with several languages, and had even implemented a primitive assembler language (which, to my surprise, found its way into a real system - my first large-ish programming success), real programming languages remained a mystery to me for much longer. The obvious question is: why? Programming languages are, for most of us, an integral part of the computing infrastructure. Only a small handful of languages have ever had a wide impact, and by the time a programming language becomes popular it already probably has a decade or more of history behind it. This means that when the average programmer first comes across a new language, it has had substantial development behind it, libraries and documentation, developed its own culture, and undoubtedly had several flaws uncovered; not to mention lots of little platform portability hacks, and often complex optimisations. Faced with this wodge of sheer stuff, most peoples reaction is either to shrug their shoulders and accept it at face value (the majority), try to understand it but not know where to start (a minority), or spot how similar it is to every other language (a tiny handful of people). Because the first two categories massively outnumber the third category, a general feeling - which we're not necessarily consciously aware of - has developed that programming languages are special, and therefore that only special people can create them (a veiled insult, if ever I saw one). Thus if I suggest to people that, despite having created a fairly fully featured programming language, I'm no more capable a programmer than the next man, they dismiss it as false modesty. So let's start looking at things in more detail. If you break a programming language down, it only has a handful of things to it. What's more, the variance between different languages is - despite the unfortunate zealotry which often clouds debate - exceedingly small. In fact, at a high level there are really only two things which any language must have: a definition and an implementation. Sometimes a paper definition is produced up front, and only when this is complete is an implementation created. Sometimes the latter defines the former; but conceptually these are two separate things. Basically the definition says things like statement S does X , and the implementation actually takes in a source file and compiles S such that it really does do X. Defining a language is very simple: any fool can do it (the skill comes in defining a good language, but that's a different story). We can then break the implementation down into three components: a compiler, libraries, and a run-time system. Again, these are conceptually separate things, although sometimes they are combined (e.g. most Python programmers do not distinguish the compiler from the run-time system). Libraries are not difficult to create, although they are time consuming. Sometimes the run-time system will be a VM, sometimes it will be bundled up with the executable created by a compiler; regardless, we'll assume for the purposes of this article that a reasonable VM is already available. Most people don't have much trouble understanding the high level view, but it's when they start trying to understand the compiler - the thing that ultimately they interact with - that things start getting more complex. The problem is that looking at implementations like Python (quite complex), Java (very complex), or GCC (something beyond merely very complex ), reveals too much detail to easily uncover the big picture. The Converge compiler, at the moment at least, is rather simpler (although, like any other program, it is growing slowly but surely as it accretes features) and is a nice snapshot of a compiler. It has three basic stages: Read in a source file, and create a parse tree. Turn the parse tree into an abstract syntax tree. Turn the abstract syntax tree into object code. Although the terminology might be a little different from language to language, these three stages are fairly common (although they might be augmented by extra stages e.g. for optimisation). Unfortunately my experience is that even at this stage, people look at those three codes and think magic . Let's break them down a little more. Parsing Parsing is the process of reading in a big wodge of text, discovering its underlying structure, and then creating a parse tree which is easy to operate on. For English, this is a bit like taking in a sentence and discovering what its subject, its verb, and so on is. Parsing doesn't really uncover meaning as such: that's for later stages. If we have an input file such as the following: import Sys func main(): Sys::println(2 + 3) What is its parse tree? Well, in order to uncover that, we use a parsing system, which given a description of a language structure - a grammar - can automatically create a parse tree. Parsing has a bad reputation, often deservedly so: commonly used parsing technologies are often horrible to use, although it is perfectly possible to make much more pleasant parsing technologies. Regardless of that, this is the only thing in a compiler where you will need to check an external source to understand a bit more about grammars (Wikipedia's page on formal grammars is a decent start). An indicative chunk of the Converge grammar is as follows: top_level ::= definition ( "NEWLINE" definition )* ::= definition ::= class_def ::= func_def ::= import ::= var_def ( "," var_def )* ":=" expr ::= splice ::= insert import ::= "IMPORT" import_name import_as ( "," import_name import_as )* import_name ::= "ID" ( "::" "ID" )* import_as ::= "AS" "ID" ::= In essence this says: a program consists of zero or more definitions; a definition is a class, a function etc.; an import is one or more module name imports, each of which can assign to a different variable name than the module name (using as ). Given that and our input program, the following parse tree is produced (using only a couple of lines of code to call a library function or two): top_level -> definition -> import -> <IMPORT import> -> import_name -> <ID Sys> -> import_as -> <NEWLINE> -> definition -> func_def -> func_type -> <FUNC func> -> func_name -> <ID main> -> <(> -> func_params -> <)> -> <:> -> <INDENT> -> func_decls -> expr_body -> expr -> application -> expr -> module_lookup -> expr -> var_lookup -> <ID Sys> -> <::> -> <ID println> -> <(> -> expr -> binary -> expr -> number -> <INT 2> -> binary_op -> <+> -> expr -> number -> <INT 3> -> <)> -> <DEDENT> This might look a bit imposing at first, but it's trivial to convert this back into the original input (although information about blank lines and comments would disappear reversing the transformation). If you want to see what the parse tree is for other Converge inputs, Converge helpfully includes a little program called convergep which given a source file input automatically prints out its parse tree (the above output is cut 'n' paste straight from convergep ). If you've understood this part, congratulations - you've got over the main stumbling block to creating a compiler. From parse tree to abstract syntax tree As stated earlier, a parse tree captures structure, but it doesn't capture meaning as such, beyond that which is captured in the structure. This can be seen by the fact that the parse tree still contains things like the : token, despite the fact this is for the users' visual benefit, rather than effecting the meaning of the program. What the second stage of a compiler does is to understand the meaning of the program (in some sense), strip out all the stuff which doesn't effect that, and convert it into another, very similar, data structure - the Abstract Syntax Tree (AST) - which captures this. For example the text 2 + 3 is converted from its parse tree equivalent: -> expr -> binary -> expr -> number -> <INT 2> -> binary_op -> <+> -> expr -> number -> <INT 3> into an AST along the lines of Add(Int(2), Int(3)) . Notice that typically the parse tree and the AST have almost identical structure. Because of this, the conversion from parse tree to AST is generally very simple. For Converge, this conversion is captured in the compiler/Compiler/IModule_Generator.cv file. As the language grows, clearly this conversion gets larger, but because it is largely mechanical (and therefore often a simple cut 'n' paste job), it is not a difficult task. For example in Converge's early days, when it had approximately the same expressivity as Python, this conversion took around 2-3 days to code from scratch. From abstract syntax tree to object code Object code is an intentionally vague term - it might be machine code, VM bytecode, or even an intermediary programming language. Basically here we take an AST like Add(Int(2), Int(3)) and turn it into instructions such as: INT 2 INT 3 ADD In a language such as Converge, where the VM is designed explicitly to be used with the language, then there is a strong relationship between the AST and the object code. In other words, this means that the conversion to object code is about the same size and complexity as the conversion from parse tree to AST. If you're converting to, say, assembly code then this will be a trickier task, but even then there will still be a largely mechanical element to the conversion. You can see this in Converge's compiler/Compiler/Bytecode_Generator.cv file. Conclusion That's it. You don't need to know, or do, any more to create a compiler. When it's broken down in this simple way, I hope that it partly demystifies what a compiler is. There's nothing particularly magical about a compiler, and with the exception of parsing (which regretfully is often more involved than it should be), nothing particularly complex. I hope you get some sense of how little work there is in creating a compiler for a simple language. Of course, if you want to do something a little more complex, then things become rapidly more work, but you'd be amazed how few languages require such complexity. In conclusion, the main difficulty for most of us in creating a compiler is overcoming the cultural absurdity which tells us that mere mortals can't create compilers. They can. I did. You can too. Follow me on Twitter