Writing a BF Compiler for .net (Part 6: The actual compiler)

In the last 5 parts we looked at each of the 8 instructions, how it works in BF, how it would work in C# and how it looks in IL. Now, what is a compiler actually?

Quick Compiler Theory

Basically our end result has to be a series of bytes. For example, the IL Instruction ldc.i4.0 is compiled as 0x16. Some instructions take parameters. For example, the br instruction is compiled as 0x38 0x?? 0x?? 0x?? 0x??. That's 5 bytes: 0x38 for the actual br and then 4 bytes for the address to jump to.

However, there is more. .net assemblies are usually .exe or .dll files and pure IL may only help you on mono (not sure if that's true and if you can feed mono with a pure IL file). If you're running on Windows, you need some additional "stuff" to make a proper .exe - see this Article.

Now, how do we write a compiler? In what language do we write compilers? If you looked into Compiler development you may have heard the term "bootstrapping" which means "Write a compiler for language X in language X through an intermediary compiler written in language Y". So if you want to write the very first C# compiler you can't write it in C# as there are no C# compilers to compile your C# code with, so you use another language. You could use C, C++, Delphi, Assembler or any other language.

Needless to say, this whole process is very complicated and involves a lot of plumbing (many lines of code for simple things because you have to manage everything). Luckily, Microsoft gave us a fantastic built-in functionality to compile .net Assemblies: The System.Reflection.Emit Namespace.

Reflection.Emit and how a Console Application works

What does Reflection.Emit allow us to do? It allows us to emit IL instructions into an assembly - we just feed it the IL instructions and it takes care of everything else. As a side effect, this allows us to write our compiler in any .net language! In our case, it means we will write our Compiler in C#.

But, what will we actually need to emit? Can we really just take the BF code and emit IL from it? No, not really. Each .net Assembly needs an Entry Point: A method. Create a new C# Console Application and notice that you start with a method "Main":

static void Main(string[] args) { }

Try renaming the method to something else and compile your Application. You will get this error message:

Program 'test.exe' does not contain a static 'Main' method suitable for an entry point

Each C# Application needs a method called Main. This is the entry point of the Application, this is the very first method that .net calls. Does it have to be called "Main"?

Not really. In .net, any method can be an entry point. It's just that the C# guys at Microsoft decided that for C# it shall be called Main, possibly because that's how it's usually called in C/C++.

Additionally, you may notice that the Main method is part of a class called Program. Do methods have to be part of a Type? In C# yes, but in C++/CLI you can have global methods. So does .net as a whole require a Type to hold methods? Yes, it does. Let me quote from Partition I, 9.8: "The CTS does not have the notion of global statics: all statics are associated with a particular class.". There is an annotation to that: "[...] the CLI supports languages that supports global statics by creating a special class names <module>".

So to conclude, we need a class which contains a method. So, what will we actually compile now?

The BF Compiler we are going to write will create a Console Application. Specifically, we will create this class (shown as a C# class for illustration purposes):

public static class Program { private static byte[] memory = new byte[short.MaxValue]; // 32767 bytes private static short pointer = 0; public static void Execute() { // Our BF Code here } }

As you see, we have a byte-array that serves as our memory. We offer BF a generous 32 kilobyte of memory. Additionally, we have a pointer into the memory so that we know where we are in the memory. Our containing class is called Program and our entry point is called Execute (just to show that we are so cool, we don't need a method called Main). This method will then contain our compiled BF Code.

Creating the bare minimum compiler skeleton

Let's start simple. Create a new C# Console Application and use this code for your main method:

static void Main(string[] args) { string sourceFileName = args[0]; string outputName = args[1]; var sourceCode = File.ReadAllText(sourceFileName); BFCompiler.Compile(outputName, outputName + ".exe", sourceCode); Console.WriteLine("{0} compiled to {1}.exe",sourceFileName,outputName); }

Then, create a new class BFCompiler:

public static class BFCompiler { public static void Compile(string assemblyName, string outputFileName, string sourceCode) { } }

Create a new text file and call it "hello.bf":

++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.

In Visual Studio, set "Copy to Output Directory" to "Copy Always" and compile. You can now invoke your compiler using

bfnetc.exe hello.bf hello

and it should print "hello.bf compiled to hello.exe" on the command line.

Okay, that was our skeleton: Read the .bf source code, call the compiler with the desired assembly name, output filename and source code.

Creating an Assembly

Now we need to actually create the assembly. Change BFCompiler.Compile to contain this code:

public static void Compile(string assemblyName, string outputFileName, string sourceCode) { var asm = AppDomain.CurrentDomain.DefineDynamicAssembly(new AssemblyName(assemblyName), AssemblyBuilderAccess.Save); var mod = asm.DefineDynamicModule(assemblyName, outputFileName); var mainClassTypeName = assemblyName + ".Program"; // e.g., hello.Program var type = mod.DefineType(mainClassTypeName, TypeAttributes.Class | TypeAttributes.Sealed | TypeAttributes.Abstract | TypeAttributes.Public); var pointerField = type.DefineField("pointer", typeof (Int16), FieldAttributes.Static | FieldAttributes.Private); var memoryField = type.DefineField("memory", typeof(Byte[]), FieldAttributes.Static | FieldAttributes.Private); var constructor = type.DefineConstructor(MethodAttributes.Static, CallingConventions.Standard, null); var cctorIlGen = constructor.GetILGenerator(); GenerateConstructorBody(cctorIlGen,pointerField,memoryField); var mainMethod = type.DefineMethod("Execute", MethodAttributes.Public | MethodAttributes.Static); var ilGen = mainMethod.GetILGenerator(); ilGen.Emit(OpCodes.Ret); type.CreateType(); asm.SetEntryPoint(mainMethod); asm.Save(outputFileName); }

Whoa, that looks complicated! (And also it doesn't compile as GenerateConstructorBody doesn't exist). Let us walk through the code step by step:

var asm = AppDomain.CurrentDomain.DefineDynamicAssembly(new AssemblyName(assemblyName), AssemblyBuilderAccess.Save); var mod = asm.DefineDynamicModule(assemblyName, outputFileName); var mainClassTypeName = assemblyName + ".Program"; // e.g., hello.Program var type = mod.DefineType(mainClassTypeName, TypeAttributes.Class | TypeAttributes.Sealed | TypeAttributes.Abstract | TypeAttributes.Public);

We start by creating a new Assembly that has a certain name (in our example that would be "Hello"). This Assembly exists in the current Application Domain (This could be a security risk - please read up about application domains in .net if you want to know more). The second parameter specifies what can be done with this assembly. This is another security feature, as we specified that "The dynamic assembly can be saved, but not executed.". The "but not executed" part does not mean that we generate an assembly that can't be run (that would be stupid, right?). It merely means that the Assembly (which only exists in memory for now) cannot be executed in the context of our running compiler. We can save it to disk though.

We then specify a module. Assemblies are collections of modules, and each assembly needs to contain at least one module. Usually, it contains exactly one module as Visual Studio doesn't support us in packaging multiple modules into one assembly. If you come from a C/C++ background, think of modules as static compilations. If you don't come from a C/C++ background, read up on ilmerge and netmodules.

So our module is called "hello" and has a filename of "hello.exe".

We then create a new class (or Type) called "Program". This is a public sealed abstract class. Uhm... "sealed abstract"? Well, .net does not support static classes. But C# does? How? By cheating: A static class in C# is a class that is both abstract and sealed (something that's legal in .net but illegal in C#). By declaring it abstract it is ensured that nobody can instantiate it and by making it sealed it is ensured that nobody can subclass it. (For some trivia around C# static classes, see this article).

The other thing: What is the mainClassTypeName doing? It generates a namespace for us. Our typename in this example isn't Program - it's "hello.Program". During compilation, AssemblyBuilder will create a namespace "hello" and a Type "Program" automatically.

So now we have the "outer shell" of our compiled Application: An assembly called "hello" containing a module "hello.exe" containing a public "static" (abstract sealed) class "Program" in a namespace "hello".

Next, we define two fields in the Program Class:

var pointerField = type.DefineField("pointer", typeof (Int16), FieldAttributes.Static | FieldAttributes.Private); var memoryField = type.DefineField("memory", typeof(Byte[]), FieldAttributes.Static | FieldAttributes.Private);

These fields are private and static and of type short (=Int16) and Byte-Array respectively. But wait, don't we specify the size of them?

No, we don't. See, the C# way of writing

private static byte[] memory = new byte[short.MaxValue];

is just a convenience feature. If you compile this assembly and look at the IL, you will notice that the assignment is done in a static constructor. This is why we define a constructor:

var constructor = type.DefineConstructor(MethodAttributes.Static, CallingConventions.Standard, null);

The first parameter declares that we want to create a static constructor (.cctor). The second parameter means that the constructor is a normal method (as opposed to a virtual method or one that takes variable arguments). The third parameter defines any arguments the method takes - static constructors are always parameterless, hence we pass null.

We have our constructor method now, next we need to put some code into it:

var cctorIlGen = constructor.GetILGenerator(); GenerateConstructorBody(cctorIlGen,pointerField,memoryField);

GetILGenerator is the magic that makes compilers happen. This thing allows you to write code into the method. Add a new method to the BFCompiler class:

private static void GenerateConstructorBody(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField) { // construct the memory as byte[short.MaxValue] ilGen.Emit(OpCodes.Ldc_I4,0x7fff); ilGen.Emit(OpCodes.Newarr,typeof(Byte)); ilGen.Emit(OpCodes.Stsfld,memoryField); // Construct the pointer as short = 0 ilGen.Emit(OpCodes.Ldc_I4_0); ilGen.Emit(OpCodes.Stsfld, pointerField); ilGen.Emit(OpCodes.Ret); }

Whoa, another highly complex method. Well, actually it's rather simple, just many lines of code. Remember, you are now directly generating IL, so no convenience features. You have to do all the plumbing yourself. What are we doing? First, we call ldc.i4 0x7fff which pushes the 32-Bit value 0x7fff (=32767 or short.MaxValue) to the stack. Then we call newarr typeof(byte) which creates a new Byte Array. Important: It's typeof(Byte), not typeof(Byte[])! newarr pops the length from the stack, so this now created a new Byte[32767] for us.

stsfld puts it into the memoryField.

So the static constructor now looks like this:

static Program(){ memory = new Byte[0x7fff]; }

The next two instructions of the constructor initialize the pointer field: Push 0 to the stack, replace pointer with the value on the stack.

You may wonder about the last instruction:

ilGen.Emit(OpCodes.Ret);

This generates a return; statement. Why a return statement on a void method? Is this required? Technically, it's not. But in order to create verifiable assemblies, you have to emit it. Veri-what? Verifiable Code is a security feature in .net. It essentially allows tools (like PEVerify) to determine that no security boundaries are violated (that is: no unsafe pointer manipulation and a well defined "flow" through the application). In a nutshell: By emitting the ret statement at the end of a method, any static analysis tool knows for sure that at this point, control is returned to the calling method. So always generate ret at the end of each method, even for void methods.

So that's our constructor. We now have a class Program with two fields (pointer and memory) that are properly initialized. Great! Now we create the main method:

var mainMethod = type.DefineMethod("Execute", MethodAttributes.Public | MethodAttributes.Static); var ilGen = mainMethod.GetILGenerator(); ilGen.Emit(OpCodes.Ret);

This is almost the same as above, except we call DefineMethod. The method is called "Execute" and is public static. We then get the ILGenerator for the method and emit a single instruction: Ret. Ignoring the issue with verifiable code outlined above, do we really need the ret statement? Yes, we do. No, actually we don't. We just need any statement. Why?

Because each method must have a method body. If you generate a method without any instructions in the body, you will get an exception:

System.InvalidOperationException: Method 'Execute' does not have a method body.

So we have to emit something, and we just return. We could also have chosen nop (no-operation), but since we want to create verifiable code, we emit ret.

The next three instructions then actually create the Type and Assembly:

type.CreateType(); asm.SetEntryPoint(mainMethod); asm.Save(outputFileName);

CreateType essentially "compiles" the type - until this moment, the Type only exists as some description, but we can't use it from other code.

SetEntryPoint then points to the Execute method. If you don't set an entry point then the Assembly is still generated, but you can't run it ("hello.exe is not a valid Win32 application").

Save then finally writes the Assembly to disk.

Run your compiler again and you should get an hello.exe which does nothing, but at least it runs! Look at it in ildasm or Reflector and you should see this code:

class public abstract auto ansi sealed Program extends [mscorlib]System.Object { .method privatescope specialname rtspecialname static void .cctor() cil managed { .maxstack 1 L_0000: ldc.i4 0x7fff L_0005: newarr uint8 L_000a: stsfld uint8[] Program::memory L_000f: ldc.i4.0 L_0010: stsfld int16 Program::pointer L_0015: ret } .method public static void Execute() cil managed { .entrypoint .maxstack 0 L_0000: ret } .field private static uint8[] memory .field private static int16 pointer }

Whoa, nice! Notice that Program derives from System.Object - we don't have to do that manually, Reflection.Emit does that for us automatically. Also, we don't have to worry about all this other stuff that is required to create a Windows Executable - AssemblyBuilder does that as well.

Compiling the BF Code

We now have a working .net assembly that sets up some fields, but it doesn't do much. So let's do something with the BF Source Code! Add the following code to the BFCompiler.Compile method:

var ilGen = mainMethod.GetILGenerator(); // New Code foreach(char c in sourceCode) { switch (c) { case '>': GenerateMovePointerForwardInstruction(ilGen, pointerField, memoryField); break; case '<': GenerateMovePointerBackwardsInstruction(ilGen, pointerField, memoryField); break; case '+': GenerateIncrementInstruction(ilGen, pointerField, memoryField); break; case '-': GenerateDecrementInstruction(ilGen, pointerField, memoryField); break; case '.': GenerateWriteInstruction(ilGen, pointerField, memoryField); break; case ',': GenerateReadInstruction(ilGen, pointerField, memoryField); break; case '[': GenerateOpenBracketInstruction(ilGen, pointerField, memoryField); break; case ']': GenerateCloseBracketInstruction(ilGen, pointerField, memoryField); break; } } // New Code ilGen.Emit(OpCodes.Ret); type.CreateType();

We are walking to the sourceCode, character by character. For each of the 8 known BF instructions we call another function that generates the proper IL. Let's quickly look at each of them:

< and > instructions

private static void GenerateMovePointerForwardInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField) { ilGen.Emit(OpCodes.Ldsfld, pointerField); ilGen.Emit(OpCodes.Ldc_I4_1); ilGen.Emit(OpCodes.Add); ilGen.Emit(OpCodes.Conv_I2); ilGen.Emit(OpCodes.Stsfld, pointerField); } private static void GenerateMovePointerBackwardsInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField) { ilGen.Emit(OpCodes.Ldsfld, pointerField); ilGen.Emit(OpCodes.Ldc_I4_1); ilGen.Emit(OpCodes.Sub); ilGen.Emit(OpCodes.Conv_I2); ilGen.Emit(OpCodes.Stsfld, pointerField); }

We've been through them in Part 3: Load the pointer value, add/subtract 1 from it, convert the result to an Int16, store it in the pointer field again.

+ and - instructions

private static void GenerateIncrementInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField) { ilGen.Emit(OpCodes.Ldsfld, memoryField); ilGen.Emit(OpCodes.Ldsfld, pointerField); ilGen.Emit(OpCodes.Ldelema, typeof(Byte)); ilGen.Emit(OpCodes.Dup); ilGen.Emit(OpCodes.Ldobj, typeof(Byte)); ilGen.Emit(OpCodes.Ldc_I4_1); ilGen.Emit(OpCodes.Add); ilGen.Emit(OpCodes.Conv_U1); ilGen.Emit(OpCodes.Stobj, typeof(Byte)); } private static void GenerateDecrementInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField) { ilGen.Emit(OpCodes.Ldsfld, memoryField); ilGen.Emit(OpCodes.Ldsfld, pointerField); ilGen.Emit(OpCodes.Ldelema, typeof(Byte)); ilGen.Emit(OpCodes.Dup); ilGen.Emit(OpCodes.Ldobj, typeof(Byte)); ilGen.Emit(OpCodes.Ldc_I4_1); ilGen.Emit(OpCodes.Sub); ilGen.Emit(OpCodes.Conv_U1); ilGen.Emit(OpCodes.Stobj, typeof(Byte)); }

These were covered in part 2. Load the memory and the pointer, get the memory-byte at the pointer address, add/subtract 1 from it, convert it to a byte (UInt8), replace it in the byte array.

. and , instructions

private static void GenerateWriteInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField) { ilGen.Emit(OpCodes.Ldsfld, memoryField); ilGen.Emit(OpCodes.Ldsfld, pointerField); ilGen.Emit(OpCodes.Ldelem_U1); ilGen.Emit(OpCodes.Call, typeof(Console).GetMethod("Write", BindingFlags.Public | BindingFlags.Static, null, new Type[] { typeof(Char) }, null)); } private static void GenerateReadInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField) { ilGen.Emit(OpCodes.Ldsfld, memoryField); ilGen.Emit(OpCodes.Ldsfld, pointerField); ilGen.Emit(OpCodes.Ldelem_U1); ilGen.Emit(OpCodes.Call, typeof(Console).GetMethod("Read", BindingFlags.Public | BindingFlags.Static)); ilGen.Emit(OpCodes.Conv_U1); ilGen.Emit(OpCodes.Stelem_I1); }

Part 4 covered these two, but now it gets a bit interesting. We need to call a method, Console.Read and Console.Write.

OpCodes.Call expects a MethodInfo and optionally an array of Types.

For Console. Read, We get the Console Type through typeof. Then we get a Method called "Read" which is defined as Public and Static. We pass this to the Call opcode. Console.Read writes it's return value to the stack, so we convert it to a Byte again and store it in memory[pointer].

With Console.Write it is a little bit more complicated because there are a lot of overloads to this methods. Remember that overload resolution is a convenience function of a compiler - as we are generating raw IL we need to precisely define which function we want to have. So we start again from the Console type, getting a Method called "Write". The fourth parameter is an array of types: The Argument signature. As we want the Console.Write(char) overload, we create a new Array containing the Char type. How do we pass parameters to Console.Write? By putting memory[pointer] onto the stack before.

[ and ] instructions

Part 5 covered the while loop. As you know, we need to create br instructions to jump into some parts of the method. Do we have to dynamically calculate the locations? No, we don't.

Add a new field to the BFCompiler class:

private readonly static Stack<System.Reflection.Emit.Label> _bracketStack = new Stack<System.Reflection.Emit.Label>();

Then add this instruction to the Compile method:

public static void Compile(string assemblyName, string outputFileName, string sourceCode) { _bracketStack.Clear(); // Ensure that the BracketStack is clear (if Compile is called multiple times)

Then add these two methods to implement the [ and ] commands:

private static void GenerateOpenBracketInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField) { var firstLabel = ilGen.DefineLabel(); var secondLabel = ilGen.DefineLabel(); ilGen.Emit(OpCodes.Br, secondLabel); ilGen.MarkLabel(firstLabel); _bracketStack.Push(firstLabel); _bracketStack.Push(secondLabel); } private static void GenerateCloseBracketInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField) { var secondLabel = _bracketStack.Pop(); var firstLabel = _bracketStack.Pop(); ilGen.MarkLabel(secondLabel); ilGen.Emit(OpCodes.Ldsfld, memoryField); ilGen.Emit(OpCodes.Ldsfld, pointerField); ilGen.Emit(OpCodes.Ldelem_U1); ilGen.Emit(OpCodes.Ldc_I4_0); ilGen.Emit(OpCodes.Bgt, firstLabel); }

This looks terribly complicated, but it's relatively simple. We are essentially implementing a GOTO. But we are not jumping to an address, we are jumping to a Label. The [ instruction defines two labels, firstLabel and secondLabel. These are essentially just uninitialized variables. We then emit a jump to the secondLabel - this is not defined yet, but it doesn't have to be.

MarkLabel then defines the Label firstLabel at the current location in the method. But wait, something is missing: We need to define secondLabel and we need to jump back to the first Label. We do this on the ] instruction. But we somehow need to pass the two labels from the [ to the ] function, and we need to support nested [ ] loops. This is why we defined a Stack<Label> in our Compiler: Whenever [ is called, we push the two labels onto it. Whenever ] is called, we pop the last two labels out of it.

So with the two labels available to the ] instruction, we mark the location of the secondLabel, then generate the comparison of memory[pointer] with 0. What we do was outlined in Part 5: If memory[pointer] is greater than 0, we jump to firstLabel.

Compile your Application again and run the compiler on hello.bf again. If everything went fine, you should now be able to run the generated hello.exe and see a satisfying "Hello World!" on the console.

Congratulations - you now have a BF compiler that generates a .net Console Application!

You can find the source code as a Visual Studio 2010 solution on http://github.com/mstum/bfnetc

This could conclude this series, but I'm going to write a Part 7 containing a few more considerations: EXE vs. DLL, Release vs. Debug builds and some more.