Next in compiler basics:

2. user-defined functions and variables

3. LLVM

4. LLVM conditionals and compiling fibonacci

5. LLVM system calls

6. an x86 upgrade

In this post we'll write a simple compiler in Javascript (on Node) without any third-party libraries. Our goal is to take an input program like (+ 1 (+ 2 3)) and produce an output assembly program that does these operations to produce 6 as the exit code. The resulting compiler can be found here.

We'll cover:

Parsing

Code generation

Assembly basics

Syscalls

And for now we'll omit:

Programmable function definitions

Non-symbol/-numeric data types

More than 3 function arguments

Lots of safety

Lots of error messsages

Parsing

We pick the S-expression syntax mentioned earlier because it's very easy to parse. Furthermore, our input language is so limited that we won't even break our parser into separate lexing/parsing stages.

Once you need to support string literals, comments, decimal literals, and other more complex literals it becomes easier to use separate stages.



If you're curious about these separate stages of parsing, you may be interested in my post on writing a JSON parser.



Or, check out my BSDScheme project for a fully-featured lexer and parser for Scheme.

The parser should produce an Abstract Syntax Tree (AST), a data structure representing the input program. Specifically, we want (+ 1 (+ 2 3)) to produce ['+', 1, ['+', 2, 3]] in Javascript.

There are many different ways to go about parsing but the most intuitive to me is to have a function that accepts a program (a string) and returns a tuple containing the program parsed so far (an AST) and the rest of the program (a string) that hasn't been parsed.

That leaves us with a function skeleton that looks like this:

module.exports.parse = function parse(program) { const tokens = []; ... logic to be added ... return [tokens, '']; };

The code that initially calls parse will thus have to deal with unwrapping the outermost tuple to get to the AST. For a more helpful compiler we could check that the entire program was actually parsed by failing if the second element of the return result is not the empty string.

Within the function we will iterate over each character and accumulate until we hit space, left or right parenthesis:

module.exports.parse = function parse(program) { const tokens = []; let currentToken = ''; for (let i = 0; i < program.length; i++) { const char = program.charAt(i); switch (char) { case '(': // TODO break; case ')': // TODO break; case ' ': tokens.push(+currentToken || currentToken); currentToken = ''; break; default: currentToken += char; break; } } return [tokens, '']; };

The recursive parts are always the most challenging. The right paren is easiest. We must push the current token and return all tokens with the rest of the program:

module.exports.parse = function parse(program) { const tokens = []; let currentToken = ''; for (let i = 0; i < program.length; i++) { const char = program.charAt(i); switch (char) { case '(': // TODO break; case ')': tokens.push(+currentToken || currentToken); return [tokens, program.substring(i + 1)]; case ' ': tokens.push(+currentToken || currentToken); currentToken = ''; break; default: currentToken += char; break; } } return [tokens, '']; };

Finally the left paren should recurse, add the parsed tokens to the list of sibling tokens, and force the loop to start at the new unparsed point.

module.exports.parse = function parse(program) { const tokens = []; let currentToken = ''; for (let i = 0; i < program.length; i++) { const char = program.charAt(i); switch (char) { case '(': { const [parsed, rest] = parse(program.substring(i + 1)); tokens.push(parsed); program = rest; i = 0; break; } case ')': tokens.push(+currentToken || currentToken); return [tokens, program.substring(i + 1)]; case ' ': tokens.push(+currentToken || currentToken); currentToken = ''; break; default: currentToken += char; break; } } return [tokens, '']; };

Assuming this is all in parser.js , let's try it out in the REPL:

$ node > const { parse } = require('./parser'); undefined > console.log(JSON.stringify(parse('(+ 3 (+ 1 2)'))); [[["+",3,["+",1,2]]],""]

Solid. We move on.

Assembly 101

Assembly is essentially the lowest-level programming language we can use. It is a human readable, 1:1 representation of the binary instructions the CPU can interpret. Conversion from assembly to binary is done with an assembler; the reverse step is done with a disassembler. We'll use gcc for assembling since it deals with some oddities of assembly programming on macOS.

The primary data structures in assembly are registers (temporary variables stored by the CPU) and the program stack. Every function in a program has access to the same registers, but convention cordons off sections of the stack for each function so it ends up being a slightly more durable store than registers. RAX , RDI , RDX , and RSI are a few registers available to us.

Now we only need to know a few instructions to compile our program (the rest of programming assembly is convention):

MOV : store one register's content into another, or store a literal number into a register

: store one register's content into another, or store a literal number into a register ADD : store the sum of two register's contents in the first register

: store the sum of two register's contents in the first register PUSH : store a register's content on the stack

: store a register's content on the stack POP : remove the top-most value from the stack and store in a register

: remove the top-most value from the stack and store in a register CALL : enter a new section of the stack and start running the function

: enter a new section of the stack and start running the function RET : enter the calling functions stack and return to evaluating from the next instruction after the call

: enter the calling functions stack and return to evaluating from the next instruction after the call SYSCALL : like CALL but where the function is handled by the kernel

Function calling convention

Assembly instructions are flexible enough that there is no language-defined way to make function calls. Therefore it is important to answer (at least) the following few questions:

Where are parameters stored by the caller so that the callee has access to them?

Where is the return value stored by the callee so the caller has access to it?

What registers are saved by whom?

Without getting too far into the specifics, we'll assume the following answers for development on x86_64 macOS and Linux systems:

Parameters are stored (in order) in the RDI , RSI , and RDX registers We won't support passing more than three arguments

, , and registers The return value is stored in the RAX register

register RDI , RSI , and RDX registers are stored by the caller

Code generation

With assembly basics and the function call convention in mind, we've got enough to generate code from the parsed program's AST.

The skeleton of our compile code will look like this:

function emit(depth, code) { const indent = new Array(depth + 1).map(() => '').join(' '); console.log(indent + code); } function compile_argument(arg, destination) { // If arg AST is a list, call compile_call on it // Else must be a literal number, store in destination register } function compile_call(fun, args, destination) { // Save param registers to the stack // Compile arguments and store in param registers // Call function // Restore param registers from the stack // Move result into destination if provided } function emit_prefix() { // Assembly prefix } function emit_postfix() { // Assembly postfix } module.exports.compile = function parse(ast) { emit_prefix(); compile_call(ast[0], ast.slice(1)); emit_postfix(); };

From our pseudo-code in comments it is simple enough to fill in. Let's fill in everything but the prefix and postfix code.

function compile_argument(arg, destination) { // If arg AST is a list, call compile_call on it if (Array.isArray(arg)) { compile_call(arg[0], arg.slice(1), destination); return; } // Else must be a literal number, store in destination register emit(1, `MOV ${destination}, ${arg}`); } const BUILTIN_FUNCTIONS = { '+': 'plus' }; const PARAM_REGISTERS = ['RDI', 'RSI', 'RDX']; function compile_call(fun, args, destination) { // Save param registers to the stack args.forEach((_, i) => emit(1, `PUSH ${PARAM_REGISTERS[i]}`)); // Compile arguments and store in param registers args.forEach((arg, i) => compile_argument(arg, PARAM_REGISTERS[i])); // Call function emit(1, `CALL ${BUILTIN_FUNCTIONS[fun] || fun}`); // Restore param registers from the stack args.forEach((_, i) => emit(1, `POP ${PARAM_REGISTERS[args.length - i - 1]}`)); // Move result into destination if provided if (destination) { emit(1, `MOV ${destination}, RAX`); } emit(0, ''); // For nice formatting }

In a better compiler, we would not make plus a built-in function. We'd emit code for the assembly instruction ADD . However, making plus a function makes code generation simpler and also allows us to see what function calls look like.

We'll define the plus built-in function in the prefix code.

The prefix

Assembly programs consist of a few "sections" in memory. The most important of which are the text and data sections. text is a read-only section where the program instructions themselves are stored. The CPU is instructed to start interpreting from some location in this text section and it will increment through instructions, evaluating each instruction until it reaches an instruction that tells it to jump to a different location to evaluate instructions (e.g. with CALL, RET, or JMP).

To denote the text section we emit .text in our prefix before we emit our generated code.

The data section is for statically initialized values (e.g. global variables). We don't have any need for that right now so we'll ignore it.



Here is a good read with more detail on these (and other) sections.

Additionally, we need to emit an entrypoint (we'll use _main ) and add a notice ( .global _main ) so that the location of this entrypoint is visible externally. This is important because we let gcc handle the hairier parts of generating an executable file and it needs access to the entrypoint.

So far, our prefix looks like this:

function emit_prefix() { emit(1, '.global _main

'); emit(1, '.text

'); // TODO: add built-in functions emit(0, '_main:'); }

The last part of our prefix needs to include the plus built-in function. For this, we add the first two parameter registers we agreed on ( RDI and RSI ) and store the result in RAX .

function emit_prefix() { emit(1, '.global _main

'); emit(1, '.text

'); emit(0, 'plus:'); emit(1, 'ADD RDI, RSI'); emit(1, 'MOV RAX, RDI'); emit(1, 'RET

'); emit(0, '_main:'); }

And we're golden.

The postfix

The job of the postfix will be simple, call exit with the value of RAX since this will be the result of the last function called by the program.

exit is a syscall, so we'll use the SYSCALL instruction to call it. The x86_64 calling convention on macOS and Linux for SYSCALL defines parameters the same way CALL does. But we also need to tell SYSCALL what syscall to call. The convention is to set RAX to the integer representing the syscall on the current system. On Linux it will be 60 ; on macOS it is 0x2000001 .

When I say "convention", I don't mean that you really have a choice as a programmer. It was arbitrary when the operating system and standard libraries chose it. But if you want to write a working program that uses syscalls or calls out to (say) glibc, you'll need to follow these conventions.

The postfix then looks like this:

const os = require('os'); const SYSCALL_MAP = os.platform() === 'darwin' ? { 'exit': '0x2000001', } : { 'exit': 60, }; function emit_postfix() { emit(1, 'MOV RDI, RAX'); // Set exit arg emit(1, `MOV RAX, ${SYSCALL_MAP['exit']}`); // Set syscall number emit(1, 'SYSCALL'); }

And we're set here too.

Putting it all together

We can finally write our Javascript entrypoint and run our compiler against a sample program.

The entrypoint might look like this:

const { parse } = require('./parser'); const { compile } = require('./compiler'); function main(args) { const script = args[2]; const [ast] = parse(script); compile(ast[0]); } main(process.argv);

And we can call it like so:

$ node ulisp.js '(+ 3 (+ 2 1))' .global _main .text plus: ADD RDI, RSI MOV RAX, RDI RET _main: PUSH RDI PUSH RSI MOV RDI, 3 PUSH RDI PUSH RSI MOV RDI, 2 MOV RSI, 1 CALL plus POP RSI POP RDI MOV RSI, RAX CALL plus POP RSI POP RDI MOV RDI, RAX MOV RAX, 0x2000001 SYSCALL

Generating an executable file from the output

If we redirect the previous output to an assembly file and call gcc on it, we can generate a program we can run. Then we can echo the $? variable to see the exit code of the previous process.

$ node ulisp.js '(+ 3 (+ 2 1))' > program.S $ gcc -mstackrealign -masm=intel -o program program.s $ ./program $ echo $? 6

And we've got a working compiler! The full source of the compiler is available here.

Further reading

Please reply on Twitter with questions or comments.