Statically Recompiling NES Games into Native Executables with LLVM and Go

2013 June 7

I have always wanted to write an emulator. I made a half-hearted attempt in college but it never made it to a demo-able state. I also didn't want to write Yet Another Emulator. It should at least bring something new to the table. When I discovered LLVM, I felt like I finally had worthwhile idea for a project.

This article presents original research regarding the possibility of statically disassembling and recompiling Nintendo Entertainment System games into native executables. I attempt to bring the reader along from beginning to end in a sort of "let's build it together" fashion. I assume as little as possible about what the reader knows about the subjects at hand, without compromising the depth of the article.

See the project on GitHub

Table of Contents

Discovering LLVM

I became giddy with excitement when I discovered LLVM. Generate code in LLVM IR format, and LLVM can optimize and generate native code for any of its supported backends. Here's an incomplete list:

ARM

STI CBEA Cell SPU [experimental]

C++ backend

Hexagon

MBlaze

Mips

Mips64 [experimental]

Mips64el [experimental]

Mipsel

MSP430 [experimental]

NVIDIA PTX 32-bit

NVIDIA PTX 64-bit

PowerPC 32

PowerPC 64

Sparc

Sparc V9

Thumb

32-bit X86: Pentium-Pro and above

64-bit X86: EM64T and AMD64

XCore

Furthermore, you can write additional backends to support additional targets. For example, emscripten provides a JavaScript backend.

To seal the deal, the LLVM project offers clang, a C, C++, Objective C, and Objective C++ front-end for LLVM. This means that because of the way LLVM is designed, emscripten allows you to compile C, C++, Objective C, and Objective C++ code for the browser. Wow!

Just to show you how refreshing this technology is, have a look at the following C code:

#include <stdio.h> #include <stdint.h> int main() { uint8_t foo = 0xf0; if (foo & 0x80 == 0x80) { for (int i = 1; i <= 10; ++i) { printf("%d

", i); } } }

Simple enough. Assign 11110000 to foo . If the highest bit is 1 (which it is), print the integers from 1 to 10.

Compiling with gcc with default settings, we get:

$ gcc test.c test.c: In function ‘main’: test.c:7:5: error: ‘for’ loop initial declarations are only allowed in C99 mode test.c:7:5: note: use option -std=c99 or -std=gnu99 to compile your code $

Right, so default gcc doesn't have the nice things that C99 brings. Let's try that again with gcc -std=c99 :

$ c99 test.c $ ./a.out $

Okay, so now it compiled and ran, but why didn't it print the integers from 1 to 10? Let's try compiling with clang and see what happens:

$ clang test.c test.c:6:13: warning: & has lower precedence than ==; == will be evaluated first [-Wparentheses] if (foo & 0x80 == 0x80) { ^~~~~~~~~~~~~~ test.c:6:13: note: place parentheses around the '==' expression to silence this warning if (foo & 0x80 == 0x80) { ^ ( ) test.c:6:13: note: place parentheses around the & expression to evaluate it first if (foo & 0x80 == 0x80) { ^ ( ) 1 warning generated. $

Aha! In C, & has lower precedence than == . In addition to outputting with fancy terminal colors, clang found the bug in our program and issued a warning. Once we fix that up and run the code again, we get the expected behavior.

Note: It has been shown to me that gcc -Wall produces a similar warning for the same code example here. I apologize for the bogus example. However, my point still stands. LLVM has a much better explanation of the philosophy and differences in error messages between clang and gcc.

There are many instances where cryptic gcc errors in your code are made clear by clang. In the words of my friend Josh Wolfe,

Clang makes gcc look like a rusty old grandpa.

This is exciting because clang is merely a front-end to LLVM. What this means is that if you generate LLVM IR code, you share the same code generation backend as clang. With this I felt that powerful optimization and wide target platform support were within my grasp.

What if we could translate NES assembly code into LLVM IR code? We could completely, statically recompile NES games into native binaries!

NES Crash Course

Take a break from LLVM for a moment and consider the Nintendo Entertainment System.

The NES consists of 4 systems working together in parallel:

CPU - 8-bit processor. Basically a MOS 6502.

Picture Processing Unit

Audio Processing Unit

Usually a mapper: custom hardware used to provide extra ROM (and sometimes other stuff)

The NES uses memory-mapped I/O, which means that, to interact with hardware, you read from and write to special memory addresses. Take a look at a simplified version of the memory layout of the NES at runtime:

Start End Description $0000 $07ff RAM. Unless they use a mapper, NES games only get 2KB of RAM! $2000 $2007 These memory addresses are hooked up to the PPU and hence are used to control what is being displayed on the screen. $4000 $4015 These are hooked up to the APU and are used to control the sound that plays. $4016 $4017 These are used to read the state of the game controllers. $8000 $ffff This is Read Only Memory. The game code itself is loaded here, so games can embed data in their program code and read it in this address range.

Let's break this down. First, note that the entire addressable range of the NES is 64KB. Thus all addresses can be represented by 2 bytes, or 4 hex digits. Next, note that in 6502 assembly, $ is standard for hexadecimal notation. So the address range is $0000 to $ffff . Of the 64KB addressable range, 29KB is (usually) completely unused!

Note the $8000 - $ffff range. When you create an NES game, you write 6502 assembly code, which is assembled into a 32KB binary, which is loaded into the $8000 - $ffff range during bootup.

Let's write some example 6502 code to illustrate.

Creating a Custom ABI for Testing Purposes

One of the challenges of NES development is that a basic "Hello, World" program is pretty complicated. This is why I created a custom Application Binary Interface to make it a little bit easier. Refer back to the memory layout above, and recall that many addresses are unused. I decided to add a couple things:

Address Description $2008 Write a byte to this address and the byte will go to standard out. $2009 Write a byte to this address and the program will exit with that byte as the return code.

With this add-on ABI, I was able to create a simple 6502 application that prints "Hello, World!

" to stdout and then exits. In 6502 land, that looks like:

; Remember that this code gets loaded at memory address $8000. ; This instruction tells the assembler that the code and data ; that follow are starting at address $8000. ; For example, the 'W' character in the "Hello, World!

\0" text ; below is located at memory address $8007. .org $8000 ; This is a label. The assembler will replace references to ; this label with the actual address that it represents, which ; in this case is $8000. msg: ; 'db' stands for "Data Bytes". This line puts the string ; "Hello, World!

\0" in bytes $8000 - $800e when this ; code is assembled. .db "Hello, World!", 10, 0 Reset_Routine: LDX #$00 ; put the starting index, 0, in register X loop: ; This instruction does 3 things: ; 1. Take the address of msg and add the value of register X to get a pointer. ; 2. Read the byte at the pointer and put it into register A ; 3. Set the zero flag to 1 if A is zero; or 0 if A is nonzero. LDA msg, X ; read 1 char ; If the zero flag is 1, goto loopend. Otherwise, go to the next instruction. BEQ loopend ; end loop if we hit the \0 ; Put the value of register A into memory address $2008. But remember, ; with our custom ABI, this means write the value of register A to stdout. STA $2008 ; putchar ; Register X = Register X + 1 INX JMP loop ; goto loop loopend: LDA #$00 ; put the return code, 0, in register A STA $2009 ; exit with a return code of register A (0 in this case) ; These are interrupt handlers. Let's not think about interrupts yet. IRQ_Routine: RTI ; do nothing NMI_Routine: RTI ; do nothing ; Another org statement. Remember that NES programs are 32KB. This tells the assembler ; to fill the unused space up until $fffa with zeroes and that we are now ; specifying the data at $fffa. .org $fffa ; 'dw' stands for "Data Words". These lines put the address of the NMI_Routine ; label at $fffa, the address of Reset_Routine at $fffc, and the address ; of IRQ_Routine at $fffe. .dw NMI_Routine .dw Reset_Routine .dw IRQ_Routine

There are some things to point out here. First, note the entry points into the program. Whereas in most programming environments, there is a main function, in NES games, the data at $fffa - $ffff has hardcoded special meaning. When a game starts, the NES reads the 2 bytes at $fffc and $fffd and uses that as the memory address of the first instruction to run.

Next let's talk about the CPU. In the code snippet above I used register A and register X. There are 6 registers total:

Name Bits Description A 8-bit The "main" register. Most of the arithmetic operations are performed on this register. X 8-bit Another general purpose register. Fewer instructions support X than A. Often used as an index, as in the above code. Y 8-bit Pretty much the same as X. Fewer instructions support Y than X. Also often used as an index. P 8-bit The "status" register. When the assembly program mentions the "zero flag" it is actually referring to bit 1 (the second smallest bit) of the status register. The other bits mean other things, which we will get into later. SP 8-bit The "stack pointer". The stack is located at $100 - $1ff . The stack pointer is initialized to $1ff . When you push a byte on the stack, the stack pointer is decremented, and when you pull a byte from the stack, the stack pointer is incremented. PC 16-bit The program counter. You can't directly modify this register but you can indirectly modify it using the stack.

After coming up with this well-defined task - get "Hello World" running natively - it is time to write some code. First things first - this project should be able to assemble our source code into binary machine code format.

Assembly

One tried and true method to parsing source code is to use a lexer to tokenize the source, and then a parser to turn the tokens into an Abstract Syntax Tree. From there we can process the AST and turn it into the 32KB binary payload.

In Go land, this is straightforward thanks to nex, a lexer, and go tool yacc, a parser that is bundled with Go. Here is the parser code that can build an AST for our "Hello, World" example:

%{ package jamulator import ( "fmt" "strconv" "container/list" ) // Define the structures that make up the abstract syntax tree. // Example: // Label1234: type LabelStatement struct { LabelName string Line int } // Example: // Label1234: BRK type LabeledStatement struct { Label *LabelStatement Stmt interface{} } // Example: // .org $8000 type OrgPseudoOp struct { Value int Fill byte Line int } type InstructionType int const ( // Example: // ADC #$44 ImmediateInstruction InstructionType = iota // Example: // INX ImpliedInstruction // Example: // ADC Label1234, X DirectWithLabelIndexedInstruction // Example: // ADC $44, X DirectIndexedInstruction // Example: // ADC Label1234 DirectWithLabelInstruction // Example: // ADC $44 DirectInstruction // Example: // ADC ($44, X) IndirectXInstruction // Example: // ADC ($44), Y IndirectYInstruction // Example: // JMP ($81cc) IndirectInstruction ) type Instruction struct { Type InstructionType OpName string Line int // not all fields are used by all instruction types. Value int LabelName string RegisterName string // filled in later OpCode byte Offset int Payload []byte } type DataStmtType int const ( // Example: // .db $44 ByteDataStmt DataStmtType = iota // Example: // .dw Label1234 WordDataStmt ) type DataStatement struct { Type DataStmtType dataList *list.List Line int // filled in later Offset int Payload []byte } type IntegerDataItem int type StringDataItem string type LabelCall struct { LabelName string } type ProgramAst struct { List *list.List } // This is the root node of the abstract syntax tree. var programAst ProgramAst %} // This instructs yacc how to generate a structure that will fit // any data type that we need to use for a node. %union { integer int str string list *list.List orgPsuedoOp *OrgPseudoOp node interface{} } // These are the nodes that we build based on other nodes and tokens. %type <list> statementList %type <node> statement %type <node> instructionStatement %type <node> dataStatement %type <list> dataList %type <list> wordList %type <node> dataItem %type <orgPsuedoOp> orgPsuedoOp %type <node> numberExpr %type <node> numberExprOptionalPound // These are tokens that we will encounter when reading the output from // the lexer. %token <str> tokIdentifier %token <integer> tokInteger %token <str> tokQuotedString %token tokEqual %token tokPound %token tokDot %token tokComma %token tokNewline %token tokDataByte %token tokDataWord %token tokProcessor %token tokLParen %token tokRParen %token tokDot %token tokColon %token tokOrg %% // An assembly program is defined as a statementList. programAst : statementList { programAst = ProgramAst{$1} } // A statementList is defined as another statementList, plus a single // statement, or simply as a statement. This creates a linked list // of statements. statementList : statementList tokNewline statement { if $3 == nil { $$ = $1 } else { $$ = $1 $$.PushBack($3) } } | statement { if $1 == nil { $$ = list.New() } else { $$ = list.New() $$.PushBack($1) } } // This defines a statement, which can be many things. statement : tokDot tokIdentifier instructionStatement { $$ = &LabeledStatement{ &LabelStatement{"." + $2, parseLineNumber}, $3, } } | tokIdentifier tokColon instructionStatement { $$ = &LabeledStatement{ &LabelStatement{$1, parseLineNumber}, $3, } } | orgPsuedoOp { $$ = $1 } | instructionStatement { $$ = $1 } | tokDot tokIdentifier dataStatement { $$ = &LabeledStatement{ &LabelStatement{"." + $2, parseLineNumber}, $3, } } | tokIdentifier tokColon dataStatement { $$ = &LabeledStatement{ &LabelStatement{$1, parseLineNumber}, $3, } } | dataStatement { $$ = $1 } | tokIdentifier tokColon { $$ = &LabelStatement{$1, parseLineNumber} } | { // empty statement $$ = nil } dataStatement : tokDataByte dataList { $$ = &DataStatement{ Type: ByteDataStmt, dataList: $2, Line: parseLineNumber, } } | tokDataWord wordList { $$ = &DataStatement{ Type: WordDataStmt, dataList: $2, Line: parseLineNumber, } } wordList : wordList tokComma numberExprOptionalPound { $$ = $1 $$.PushBack($3) } | numberExprOptionalPound { $$ = list.New() $$.PushBack($1) } dataList : dataList tokComma dataItem { $$ = $1 $$.PushBack($3) } | dataItem { $$ = list.New() $$.PushBack($1) } numberExpr : tokPound tokInteger { tmp := IntegerDataItem($2) $$ = &tmp } | tokIdentifier { $$ = &LabelCall{$1} } numberExprOptionalPound : numberExpr { $$ = $1 } | tokInteger { tmp := IntegerDataItem($1) $$ = &tmp } dataItem : tokQuotedString { tmp := StringDataItem($1) $$ = &tmp } | numberExprOptionalPound { $$ = $1 } orgPsuedoOp : tokOrg tokInteger { $$ = &OrgPseudoOp{$2, 0xff, parseLineNumber} } | tokOrg tokInteger tokComma tokInteger { if $4 > 0xff { yylex.Error("ORG directive fill parameter must be a single byte.") } $$ = &OrgPseudoOp{$2, byte($4), parseLineNumber} } instructionStatement : tokIdentifier tokPound tokInteger { $$ = &Instruction{ Type: ImmediateInstruction, OpName: $1, Value: $3, Line: parseLineNumber, } } | tokIdentifier { $$ = &Instruction{ Type: ImpliedInstruction, OpName: $1, Line: parseLineNumber, } } | tokIdentifier tokIdentifier tokComma tokIdentifier { $$ = &Instruction{ Type: DirectWithLabelIndexedInstruction, OpName: $1, LabelName: $2, RegisterName: $4, Line: parseLineNumber, } } | tokIdentifier tokInteger tokComma tokIdentifier { $$ = &Instruction{ Type: DirectIndexedInstruction, OpName: $1, Value: $2, RegisterName: $4, Line: parseLineNumber, } } | tokIdentifier tokIdentifier { $$ = &Instruction{ Type: DirectWithLabelInstruction, OpName: $1, LabelName: $2, Line: parseLineNumber, } } | tokIdentifier tokInteger { $$ = &Instruction{ Type: DirectInstruction, OpName: $1, Value: $2, Line: parseLineNumber, } } | tokIdentifier tokLParen tokInteger tokComma tokIdentifier tokRParen { if $5 != "x" && $5 != "X" { yylex.Error("Register argument must be X.") } $$ = &Instruction{ Type: IndirectXInstruction, OpName: $1, Value: $3, Line: parseLineNumber, } } | tokIdentifier tokLParen tokInteger tokRParen tokComma tokIdentifier { if $6 != "y" && $6 != "Y" { yylex.Error("Register argument must be Y.") } $$ = &Instruction{ Type: IndirectYInstruction, OpName: $1, Value: $3, Line: parseLineNumber, } } | tokIdentifier tokLParen tokInteger tokRParen { $$ = &Instruction{ Type: IndirectInstruction, OpName: $1, Value: $3, Line: parseLineNumber, } } %%

And here is the lexer code that generates the tokens listed above:

/\.[dD][bB]/ { return tokDataByte } /\.[dD][wW]/ { return tokDataWord } /\.[oO][rR][gG]/ { return tokOrg } /"[^"

]*"/ { t := yylex.Text() lval.str = t[1:len(t)-1] return tokQuotedString } /[a-zA-Z][a-zA-Z_.0-9]*/ { lval.str = yylex.Text() return tokIdentifier } /%[01]+/ { binPart := yylex.Text()[1:] n, err := strconv.ParseUint(binPart, 2, 16) if err != nil { yylex.Error("Invalid binary integer: " + binPart) } lval.integer = int(n) return tokInteger } /\$[0-9a-fA-F]+/ { hexPart := yylex.Text()[1:] n, err := strconv.ParseUint(hexPart, 16, 16) if err != nil { yylex.Error("Invalid hexademical integer: " + hexPart) } lval.integer = int(n) return tokInteger } /[0-9]+/ { n, err := strconv.ParseUint(yylex.Text(), 10, 16) if err != nil { yylex.Error("Invalid decimal integer: " + yylex.Text()) } lval.integer = int(n) return tokInteger } /=/ { return tokEqual } /:/ { return tokColon } /#/ { return tokPound } /\./ { return tokDot } /,/ { return tokComma } /\(/ { return tokLParen } /\)/ { return tokRParen } /[ \t\r]/ { // ignore whitespace } /;[^

]*

/ { // ignore comments parseLineNumber += 1 return tokNewline } /

+/ { parseLineNumber += len(yylex.Text()) return tokNewline } /./ { yylex.Error(fmt.Sprintf("Unexpected character: %q", yylex.Text())) } // package jamulator import ( "strconv" "os" "fmt" ) var parseLineNumber int var parseFilename string var parseErrors ParseErrors type ParseErrors []string func (errs ParseErrors) Error() string { return strings.Join(errs, "

") } func Parse(reader io.Reader) (ProgramAst, error) { parseLineNumber = 1 lexer := NewLexer(reader) yyParse(lexer) if len(parseErrors) > 0 { return ProgramAst{}, parseErrors } return programAst, nil } func ParseFile(filename string) (ProgramAst, error) { parseFilename = filename fd, err := os.Open(filename) if err != nil { return ProgramAst{}, err } programAst, err := Parse(fd) err2 := fd.Close() if err != nil { return ProgramAst{}, err } if err2 != nil { return ProgramAst{}, err2 } return programAst, nil } func (yylex Lexer) Error(e string) { s := fmt.Sprintf("%s line %d %s", parseFilename, parseLineNumber, e) parseErrors = append(parseErrors, s) }

Let's demonstrate our ability to parse the source code by printing and inspecting the abstract syntax tree.

import ( "fmt" "reflect" "os" ) func astPrint(indent int, n interface{}) { for i := 0; i < indent; i++ { fmt.Print(" ") } fmt.Println(reflect.TypeOf(n)) } func (ast ProgramAst) Print() { for e := ast.List.Front(); e != nil; e = e.Next() { astPrint(0, e.Value) switch t := e.Value.(type) { case *LabeledStatement: astPrint(2, t.Label) astPrint(2, t.Stmt) case *DataStatement: for de := t.dataList.Front(); de != nil; de = de.Next() { astPrint(2, de.Value) } } } } func main() { programAst, err := jamulator.ParseFile("hello.asm") if err != nil { fmt.Fprintf(os.Stderr, "%s

", err.Error()) os.Exit(1) } programAst.Print() }

Now we need to compile this code. At this point, we can no longer rely on Go to build everything because we now have 2 generated Go files:

Source File Generated File asm6502.nex asm6502.nn.go asm6502.y y.go

The most straightforward way to build the project at this point is to use a Makefile. Simple enough:

build: jamulator/y.go jamulator/asm6502.nn.go go build -o jamulate main.go jamulator/y.go: jamulator/asm6502.y go tool yacc -o jamulator/y.go -v /dev/null jamulator/asm6502.y jamulator/asm6502.nn.go: jamulator/asm6502.nex ${GOPATH}/bin/nex -e jamulator/asm6502.nex clean: rm -f jamulator/asm6502.nn.go rm -f jamulator/y.go rm -f jamulate .PHONY: build clean

Now we can build the project with one command:

$ make go tool yacc -o jamulator/y.go -v /dev/null jamulator/asm6502.y /home/andy/golang/bin/nex -e jamulator/asm6502.nex go build -o jamulate main.go

And, sure enough, when we run this code, we get a nice breakdown of the source:

$ ./jamulate *jamulator.OrgPseudoOp *jamulator.LabelStatement *jamulator.DataStatement *jamulator.StringDataItem *jamulator.IntegerDataItem *jamulator.IntegerDataItem *jamulator.LabelStatement *jamulator.Instruction *jamulator.LabelStatement *jamulator.Instruction *jamulator.Instruction *jamulator.Instruction *jamulator.Instruction *jamulator.Instruction *jamulator.LabelStatement *jamulator.Instruction *jamulator.Instruction *jamulator.LabeledStatement *jamulator.LabelStatement *jamulator.Instruction *jamulator.LabeledStatement *jamulator.LabelStatement *jamulator.Instruction *jamulator.OrgPseudoOp *jamulator.DataStatement *jamulator.LabelCall *jamulator.DataStatement *jamulator.LabelCall *jamulator.DataStatement *jamulator.LabelCall

Once we have this AST, building a binary is a cinch:

Loop over the AST and compute the byte offset for each instruction. Use the computed byte offsets to resolve the instructions with labels. Final pass to write the payload to the disk.

Disassembly

Now we need to work backwards - we have the binary payload and we want to get the source. There is a challenge here. Remember that in assembly programs, we can insert arbitrary data with .db or .dw statements alongside instructions. In order to disassemble effectively, we have to be able to figure out what is "data" and what is "code".

One possible technique is to emulate the assembly program, and then record the ways in which memory addresses are accessed. After playing a game for a while, you would have a pretty good record of exactly which sections are data and which are code. I decided not to use this technique, however, since the goal of this project is static recompilation. I want to explore just how much is possible to do at compile-time.

So what can we do?

First, recall that the last 6 bytes in NES programs are 3 memory addresses which are the 3 entry points into the program:

.org $fffa .dw NMI_Routine .dw Reset_Routine .dw IRQ_Routine

Given this, a workable strategy becomes clear:

Create an AST where every single byte is a single .db statement. Replace the .db statements at $fffa and $fffb with a .dw statement which references an NMI_Routine label. Calculate the address that the .db statements at $fffa and $fffb were referring to, and insert a LabelStatement with the NMI_Routine label before the .db statement at that address. Mark the .db statement at that address as an instruction.

When I say "mark as an instruction", what I mean is that we should interpret the .db statement at that location as an op code, and then use that to replace the following .db statements as part of the instruction as necessary. Then, based on the instruction, we want to recursively mark other locations as instructions:

Instruction What to Do BPL, BMI, BVC, BVS, BCC, BCS, BNE, BEQ, JSR Mark the jump target address and the next address as an instruction. JMP absolute Mark the jump target address as an instruction. RTI, RTS, BRK, JMP indirect Do nothing. everything else Mark the next address as an instruction.

The instructions that start with "B" are all conditional branch instructions. This means that they test some condition, and then either transfer control flow to the next instruction, or to a different label. This means that we can mark the possible branch address and the next address as instructions.

JSR stands for "Jump to SubRoutine". This will transfer control to a target address and then later when the RTS ("ReTurn from Subroutine") instruction is reached, continue execution where the JSR instruction left off.

It is possible for assembly programmers to use JSR and then inside the subroutine, do tricks with the stack to return to a different location. This is a problem that will be tackled later. It's not an issue with our "Hello World" example.

RTI , RTS , and BRK modify control flow, but the destination address is not constant, so these instructions do not help us know what else to mark as instructions.

As seen in this table, there are 2 types of JMP instructions: absolute and indirect:

JMP Type Example Assembly Description Absolute JMP Label_80a2 This version is used in the "Hello World" program. It sends control flow to the operand address - which in this example is a label. This version is convenient for disassembly because the destination address is known statically - the address is hard-coded. Indirect JMP ($0123) Uses the operand address as a pointer, sending control flow to the address at the pointer. This will prove to be one of the big challenges of this project. More on this later.

The instructions in the earlier table are the only ones that modify control flow. All other instructions execute serially. Thus if we encounter one of these, we can reliably decode the next byte as an instruction.

Here's what it looks like when we apply this algorithm to our Hello World binary code:

.org $8000 Label_8000: .db $48 .db $65 .db $6c .db $6c .db $6f .db $2c .db $20 .db $57 .db $6f .db $72 .db $6c .db $64 .db $21 .db $0a .db $00 Reset_Routine: LDX #$00 Label_8011: LDA Label_8000, X BEQ Label_801d STA $2008 INX JMP Label_8011 Label_801d: LDA #$00 STA $2009 IRQ_Routine: RTI NMI_Routine: RTI .db $ff ... (this is repeated about 30,000 times) ... .db $ff .dw NMI_Routine .dw Reset_Routine .dw IRQ_Routine

This technique was able to decode all of the instructions, but we can't read the text, and it's pretty annoying having $ff - the filler byte value - repeated so many times. Let's add a pass to detect ASCII strings. We can do this by counting how many characters in a .db statement in a row are considered "ASCII" and when a threshold of 4 is reached, replace the .db statements with a quoted string:

.org $8000 Label_8000: .db "Hello, World!" .db $0a .db $00 Reset_Routine: LDX #$00 Label_8011: LDA Label_8000, X BEQ Label_801d STA $2008 INX JMP Label_8011 Label_801d: LDA #$00 STA $2009 IRQ_Routine: RTI NMI_Routine: RTI .db $ff .db $ff ... (repeated about 30,000 times) ... .db $ff .db $ff .dw NMI_Routine .dw Reset_Routine .dw IRQ_Routine

Better! Let's solve the problem of the repeating $ff by adding a pass to detect where it would make sense to place .org statements. We can do this in much the same way as ASCII detection, but in this case we will look out for repeating bytes rather than bytes which fit in the ASCII range. 64 seems like a good threshold. If a byte repeats 64 times, replace all the repeated occurences with a .org statement:

.org $8000 Label_8000: .db "Hello, World!" .db $0a .db $00 Reset_Routine: LDX #$00 Label_8011: LDA Label_8000, X BEQ Label_801d STA $2008 INX JMP Label_8011 Label_801d: LDA #$00 STA $2009 IRQ_Routine: RTI NMI_Routine: RTI .org $fffa .dw NMI_Routine .dw Reset_Routine .dw IRQ_Routine

And finally, one minor detail. Let's do a final pass to collapse .db statements together:

.org $8000 Label_8000: .db "Hello, World!", $0a, $00 Reset_Routine: LDX #$00 Label_8011: LDA Label_8000, X BEQ Label_801d STA $2008 INX JMP Label_8011 Label_801d: LDA #$00 STA $2009 IRQ_Routine: RTI NMI_Routine: RTI .org $fffa .dw NMI_Routine .dw Reset_Routine .dw IRQ_Routine

Not bad. This is as close as we can get to the original source. It's impossible to know what label names were used, but we can give good names to the interrupt vectors. We just turned a binary machine code program into human-readable assembly.

Now that we can figure out the assembly source code from 6502 machine code, we can start the fun part - converting the assembly program into native machine code.

Code Generation

Our code generation code will generate an LLVM bitcode module. We can then use llc to compile the bitcode into an object file, the same as if we used gcc -c module.c and looked at the resulting module.o.

LLVM is written in C++, but it also exposes a C interface. This means we can integrate cleanly using cgo. In fact, Andrew Wilkins maintains a convenient Go module called gollvm which gives us seamless integration.

At any time we can debug the LLVM module we are generating by calling module.Dump() which prints the LLVM IR code for the module to stderr. Let's start by manually creating the IR code that we want to generate for Hello World, so we know what to work toward. We can get a head start by writing it in C and using clang to generate the IR code for us:

#include <stdio.h> char * msg = "Hello, World!

"; int main() { char * ptr = msg; while (*ptr) { putchar(*ptr); ++ptr; } }

$ clang -emit-llvm -S test.c $

Now looking at test.s:

; ModuleID = 'test.c' target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128" target triple = "x86_64-pc-linux-gnu" @.str = private unnamed_addr constant [15 x i8] c"Hello, World!\0A\00", align 1 @msg = global i8* getelementptr inbounds ([15 x i8]* @.str, i32 0, i32 0), align 8 define i32 @main() nounwind uwtable { %1 = alloca i32, align 4 %ptr = alloca i8*, align 8 store i32 0, i32* %1 %2 = load i8** @msg, align 8 store i8* %2, i8** %ptr, align 8 br label %3 ; <label>:3 ; preds = %7, %0 %4 = load i8** %ptr, align 8 %5 = load i8* %4, align 1 %6 = icmp ne i8 %5, 0 br i1 %6, label %7, label %14 ; <label>:7 ; preds = %3 %8 = load i8** %ptr, align 8 %9 = load i8* %8, align 1 %10 = sext i8 %9 to i32 %11 = call i32 @putchar(i32 %10) %12 = load i8** %ptr, align 8 %13 = getelementptr inbounds i8* %12, i32 1 store i8* %13, i8** %ptr, align 8 br label %3 ; <label>:14 ; preds = %3 %15 = load i32* %1 ret i32 %15 } declare i32 @putchar(i32)

Alright - that looks a bit different than the code we're going to generate, but it's a good start. If that looks complicated, don't worry - we're going to do 3 things to make it less so:

Read up on the language reference. Delete the stuff that seems unnecessary and see if it still works. Modify the code to look like what we want to generate.

Here's an updated version with inline comments breaking it down:

; Here we declare the text that we will print. ; "private" means that only this module can see it - we do not export this ; symbol. Always declare private when possible. There are optimizations to be ; had when a symbol is not exported. ; "constant" means that this data is read-only. Again use constant when ; possible so that optimization passes can take advantage of this fact. ; [15 x i8] is the type of this data. i8 means an 8-bit integer. @msg = private constant [15 x i8] c"Hello, World!\0a\00" ; Here we declare a dependency on the `putchar` symbol. ; When this module is linked, `putchar` must be defined somewhere, and with ; this signature. declare i32 @putchar(i32) ; Same thing for `exit`. ; `noreturn` indicates that we do not expect this function to return. It will ; end the process, after all. ; `nounwind` has to do with LLVM's error handling model. We use `nounwind` ; because we know that `exit` will not throw an exception. declare void @exit(i32) noreturn nounwind ; Note that we will be performing the final link step with gcc, which will ; automatically statically link against libc. This will provide the `putchar` ; and `exit` symbols, as well as set up the executable entry point to call `main`. define i32 @main() { ; This label statement indicates the start of a basic block. Entry: ; Here we allocate some variables on the stack. These are X, Y, and A, ; 3 of the 6502's 8-bit registers. %X = alloca i8 %Y = alloca i8 %A = alloca i8 ; Note that here we are allocating variables which are single bits. ; These represent 2 of the bits from the status register. ; After this source listing there is a table explaining each bit ; of the status register. %S_neg = alloca i1 %S_zero = alloca i1 ; Send control flow to the Reset_Routine basic block. br label %Reset_Routine Reset_Routine: ; This is the code to generate for ; LDX #$00 ; Store 0 in the X register. store i8 0, i8* %X ; Clear the negative status bit, because we just stored 0 in X, ; and 0 is not negative. store i1 0, i1* %S_neg ; Set the zero status bit, because we just stored 0 in X. store i1 1, i1* %S_zero br label %Label_loop Label_loop: ; This is the code to generate for ; LDA msg, X ; Load the value of X into %0. %0 = load i8* %X ; Get a pointer to the character in msg indexed by %0, which contains the ; value of X. %1 = getelementptr [15 x i8]* @msg, i64 0, i8 %0 ; Read a byte of memory located at the pointer we just computed into %2. %2 = load i8* %1 ; Store the byte we just loaded into %A, which is the variable we have ; allocated for A. store i8 %2, i8* %A ; Now we need to set the negative status bit correctly. ; The byte of memory we just loaded into %A is negative if ; and only if the highest bit is set. ; Perform a bitwise AND with 1000 0000. %3 = and i8 128, %2 ; Test if the result is equal to 1000 0000. %4 = icmp eq i8 128, %3 ; Save the answer to the negative status bit. store i1 %4, i1* %S_neg ; Now we need to set the zero status bit correctly. ; Test if the byte is equal to zero. %5 = icmp eq i8 0, %2 ; Store the answer to the zero status bit. store i1 %5, i1* %S_zero ; This is the code to generate for ; BEQ loopend %6 = load i1* %S_zero ; If zero bit is set, go to Label_loopend. Otherwise, go to AutoLabel_0 br i1 %6, label %Label_loopend, label %AutoLabel_0 AutoLabel_0: ; This is the code to generate for ; STA $2008 %7 = load i8* %A ; Convert the 8-bit integer that we just loaded from A into ; a 32-bit integer to match the signature of `putchar`. %8 = zext i8 %7 to i32 %9 = call i32 @putchar(i32 %8) br label %AutoLabel_1 AutoLabel_1: ; This is the code to generate for ; INX %10 = load i8* %X %11 = add i8 %10, 1 store i8 %11, i8* %X ; Set the negative status bit correctly. %12 = and i8 128, %11 %13 = icmp eq i8 128, %12 store i1 %13, i1* %S_neg ;Set the zero status bit correctly. %14 = icmp eq i8 0, %11 store i1 %14, i1* %S_zero ; This is the code to generate for ; JMP loop br label %Label_loop Label_loopend: ; This is the code to generate for ; LDA #$00 store i8 0, i8* %A ; Clear the negative status bit. store i1 0, i1* %S_neg ; Set the zero status bit. store i1 1, i1* %S_zero ; This is the code to generate for ; STA $2009 %15 = load i8* %A %16 = zext i8 %15 to i32 call void @exit(i32 %16) noreturn nounwind ; Terminate this basic block with `unreachable` because ; exit never returns. unreachable ; Generate dummy basic blocks for the ; interrupt vectors, because we don't support them yet. IRQ_Routine: unreachable NMI_Routine: unreachable }

In this code we use S_neg and S_zero , 2 of the status register bits. These bits, along with the other status bits that we did not mention yet, are updated after certain instructions and used for things such as branching. Here is a full description of all the status bits:

Bit Mask Variable Name Description 0000 0001 S_carry Used for arithmetic and bitwise instructions, typically to make it easier to deal with integers that are larger than 8 bits. We don't need to deal with this yet. BCC and BCS use this status bit to decide whether to branch. 0000 0010 S_zero When a computation results in a value that is equal to zero, this status bit is set. Otherwise, it is cleared. BEQ and BNE use this status bit to decide whether to branch. 0000 0100 S_int This bit indicates whether interrupts are disabled. You can use SEI to set this bit, and CLI to clear this bit. More on interrupts later. 0000 1000 S_dec Normally, this bit would toggle decimal mode on and off. However, the NES disables this feature of the CPU, so it effectively does nothing. You can use SED to set this bit, and CLD to clear this bit. 0001 0000 S_brk This bit is set when a BRK instruction has been executed and an interupt has been generated to process it. We'll ignore this bit for now. 0010 0000 - This bit is unused. It remains 0 at all times. 0100 0000 S_over When a computation results in an invalid two's complement value, this bit is set. Otherwise, it is cleared. BVC and BVS use this to decide whether to branch. 1000 0000 S_neg When a computation results in a negative two's complement value, this bit is set. Otherwise, it is cleared. BPL and BMI use this status bit to decide whether to branch.

Let's make sure our goal LLVM code does what we expect:

$ llc -filetype=obj hello.llvm $ gcc hello.llvm.o $ ./a.out Hello, World! $

llc converts the LLVM IR code into a native object file, and then gcc does the final link step, statically linking against libc to hook up main , putchar , and exit . By default, gcc creates an executable named a.out , which we run, and viola!

The next step is to generate this code from our disassembly. With the help of gollvm we will:

Create a LLVM module. Declare our dependency on putchar and exit . Create the main function and allocate stack variables for the registers. Perform one pass over the AST to identify labeled data and create global variables. We save the index of label name to global variable in a map. Perform a second pass over the AST to generate the basic blocks and save an index of label name to basic block in a map. Final pass over the AST to generate code inside of the basic blocks.

Here is a structure that contains the state we need while compiling:

type Compilation struct { // Keep track of the warnings and errors // that occur while compiling. Warnings []string Errors []string // program is our abstract syntax tree - // we will make several passes over it during // the compilation process. program *Program // LLVM variable which represents the module we // are creating. mod llvm.Module // This is the object that we will use to do all the // code generation. It's used to create every kind // of statement. builder llvm.Builder // Reference to our main function. mainFn llvm.Value // References to the functions we declared, so we // can call them. putCharFn llvm.Value exitFn llvm.Value // References to the variables we allocate on the stack // so we can use them in instructions. rX llvm.Value rY llvm.Value rA llvm.Value rSNeg llvm.Value rSZero llvm.Value // Maps label name to the global variables that we add, // for when the code loads data from a label. labeledData map[string]llvm.Value // Maps label name to basic block, so that when // code branches to another label, we can branch to // the relevant basic block. labeledBlocks map[string]llvm.BasicBlock // Keeps track of the basic block we are currently // generating code for, if any. currentBlock *llvm.BasicBlock // Keeps a reference to the reset routine basic block // so we know where to first jump to from the entry point. resetBlock *llvm.BasicBlock }

With this structure, we can execute our plan:

func (p *Program) Compile(filename string) (c *Compilation) { llvm.InitializeNativeTarget() c = new(Compilation) c.program = p c.Warnings = []string{} c.Errors = []string{} c.mod = llvm.NewModule("asm_module") c.builder = llvm.NewBuilder() defer c.builder.Dispose() c.labeledData = map[string]llvm.Value{} c.labeledBlocks = map[string]llvm.BasicBlock{} // First pass to identify labeled data and create // global variables, saving the indexes in `c.labeledData`. c.identifyLabeledDataPass() // declare i32 @putchar(i32) i32Type := llvm.Int32Type() putCharType := llvm.FunctionType(i32Type, []llvm.Type{i32Type}, false) c.putCharFn = llvm.AddFunction(c.mod, "putchar", putCharType) c.putCharFn.SetLinkage(llvm.ExternalLinkage) // declare void @exit(i32) noreturn nounwind exitType := llvm.FunctionType(llvm.VoidType(), []llvm.Type{i32Type}, false) c.exitFn = llvm.AddFunction(c.mod, "exit", exitType) c.exitFn.AddFunctionAttr(llvm.NoReturnAttribute | llvm.NoUnwindAttribute) c.exitFn.SetLinkage(llvm.ExternalLinkage) // main function / entry point mainType := llvm.FunctionType(i32Type, []llvm.Type{}, false) c.mainFn = llvm.AddFunction(c.mod, "main", mainType) c.mainFn.SetFunctionCallConv(llvm.CCallConv) entry := llvm.AddBasicBlock(c.mainFn, "Entry") c.builder.SetInsertPointAtEnd(entry) c.rX = c.builder.CreateAlloca(llvm.Int8Type(), "X") c.rY = c.builder.CreateAlloca(llvm.Int8Type(), "Y") c.rA = c.builder.CreateAlloca(llvm.Int8Type(), "A") c.rSNeg = c.builder.CreateAlloca(llvm.Int1Type(), "S_neg") c.rSZero = c.builder.CreateAlloca(llvm.Int1Type(), "S_zero") // Second pass to build basic blocks. c.buildBasicBlocksPass() // Finally, one last pass for codegen. c.codeGenPass() // hook up the first entry block to the reset block c.builder.SetInsertPointAtEnd(entry) c.builder.CreateBr(*c.resetBlock) err := llvm.VerifyModule(c.mod, llvm.ReturnStatusAction) if err != nil { c.Errors = append(c.Errors, err.Error()) return } // Uncomment this to print the LLVM IR code we just generated // to stderr. //c.mod.Dump() fd, err := os.Create(filename) if err != nil { c.Errors = append(c.Errors, err.Error()) return } err = llvm.WriteBitcodeToFile(c.mod, fd) if err != nil { c.Errors = append(c.Errors, err.Error()) return } err = fd.Close() if err != nil { c.Errors = append(c.Errors, err.Error()) return } }

Here's what it the module looks like after only the first pass which identifies labeled data:

; ModuleID = 'asm_module' @Label_c000 = private global [15 x i8] c"Hello, World!\0A\00"

That's it. All we've done is created a global variable for msg (which is known as Label_c000 in our disassembly) and saved it in the c.labeledData map.

Now add in the declares and main function definition, and we get:

; ModuleID = 'asm_module' @Label_c000 = private global [15 x i8] c"Hello, World!\0A\00" declare i32 @putchar(i32) declare void @exit(i32) noreturn nounwind define i32 @main() { Entry: %X = alloca i8 %Y = alloca i8 %A = alloca i8 %S_neg = alloca i1 %S_zero = alloca i1 }

Nothing fancy here. Next let's add in the second pass to identify basic blocks:

; ModuleID = 'asm_module' @Label_c000 = private global [15 x i8] c"Hello, World!\0A\00" declare i32 @putchar(i32) declare void @exit(i32) noreturn nounwind define i32 @main() { Entry: %X = alloca i8 %Y = alloca i8 %A = alloca i8 %S_neg = alloca i1 %S_zero = alloca i1 Reset_Routine: ; No predecessors! Label_c011: ; No predecessors! Label_c01d: ; No predecessors! IRQ_Routine: ; No predecessors! NMI_Routine: ; No predecessors! }

At this point we've created basic blocks for each of our labels and saved them in our c.labeledBlocks map.

And finally, we do the actual code generation pass and add the jump from the entry block to the reset block.

The actual code generation code is surprisingly simple. It consists of a switch case, and a few calls into the llvm builder object. For example, here's the code generation code for LDX #$00:

func (i *Instruction) Compile(c *Compilation) { v := llvm.ConstInt(llvm.Int8Type(), uint64(i.Value), false) switch i.OpCode { case 0xa2: // ldx c.builder.CreateStore(v, c.rX) c.testAndSetZero(i.Value) c.testAndSetNeg(i.Value) // more cases for other immediate instructions } } func (c *Compilation) testAndSetZero(v int) { if v == 0 { c.setZero() return } c.clearZero() } func (c *Compilation) setZero() { c.builder.CreateStore(llvm.ConstInt(llvm.Int1Type(), 1, false), c.rSZero) } func (c *Compilation) clearZero() { c.builder.CreateStore(llvm.ConstInt(llvm.Int1Type(), 0, false), c.rSZero) } func (c *Compilation) testAndSetNeg(v int) { if v&0x80 == 0x80 { c.setNeg() return } c.clearNeg() } func (c *Compilation) setNeg() { c.builder.CreateStore(llvm.ConstInt(llvm.Int1Type(), 1, false), c.rSNeg) } func (c *Compilation) clearNeg() { c.builder.CreateStore(llvm.ConstInt(llvm.Int1Type(), 0, false), c.rSNeg) }

For completeness's sake, let's look at one more code gen example. Here's the code generation code for LDA msg, X:

func (i *Instruction) Compile(c *Compilation) { switch i.OpCode { case 0xbd: // LDA label, X // Look up the global module variable based on the label name. dataPtr := c.labeledData[i.LabelName] index := c.builder.CreateLoad(c.rX, "") // This is what we index into the global variable with. indexes := []llvm.Value{ llvm.ConstInt(llvm.Int8Type(), 0, false), index, } // Obtain a pointer to the element that we want to load. ptr := c.builder.CreateGEP(dataPtr, indexes, "") v := c.builder.CreateLoad(ptr, "") c.builder.CreateStore(v, c.rA) c.dynTestAndSetNeg(v) c.dynTestAndSetZero(v) // more cases for other direct-with-label-indexed instructions } } func (c *Compilation) dynTestAndSetNeg(v llvm.Value) { x80 := llvm.ConstInt(llvm.Int8Type(), uint64(0x80), false) masked := c.builder.CreateAnd(v, x80, "") isNeg := c.builder.CreateICmp(llvm.IntEQ, masked, x80, "") c.builder.CreateStore(isNeg, c.rSNeg) } func (c *Compilation) dynTestAndSetZero(v llvm.Value) { zeroConst := llvm.ConstInt(llvm.Int8Type(), uint64(0), false) isZero := c.builder.CreateICmp(llvm.IntEQ, v, zeroConst, "") c.builder.CreateStore(isZero, c.rSZero) }

After implementing codegen for the rest of the instructions, here is the module dump after the code generation pass:

; ModuleID = 'asm_module' @Label_c000 = private global [15 x i8] c"Hello, World!\0A\00" declare i32 @putchar(i32) declare void @exit(i32) noreturn nounwind define i32 @main() { Entry: %X = alloca i8 %Y = alloca i8 %A = alloca i8 %S_neg = alloca i1 %S_zero = alloca i1 br label %Reset_Routine Reset_Routine: ; preds = %Entry store i8 0, i8* %X store i1 true, i1* %S_zero store i1 false, i1* %S_neg br label %Label_c011 Label_c011: ; preds = %else, %Reset_Routine %0 = load i8* %X %1 = getelementptr [15 x i8]* @Label_c000, i8 0, i8 %0 %2 = load i8* %1 store i8 %2, i8* %A %3 = and i8 %2, -128 %4 = icmp eq i8 %3, -128 store i1 %4, i1* %S_neg %5 = icmp eq i8 %2, 0 store i1 %5, i1* %S_zero %6 = load i1* %S_zero br i1 %6, label %Label_c01d, label %else else: ; preds = %Label_c011 %7 = load i8* %A %8 = zext i8 %7 to i32 %9 = call i32 @putchar(i32 %8) %10 = load i8* %X %11 = add i8 %10, 1 store i8 %11, i8* %X %12 = and i8 %11, -128 %13 = icmp eq i8 %12, -128 store i1 %13, i1* %S_neg %14 = icmp eq i8 %11, 0 store i1 %14, i1* %S_zero br label %Label_c011 Label_c01d: ; preds = %Label_c011 store i8 0, i8* %A store i1 true, i1* %S_zero store i1 false, i1* %S_neg %15 = load i8* %A %16 = zext i8 %15 to i32 call void @exit(i32 %16) unreachable IRQ_Routine: ; preds = %Label_c01d unreachable NMI_Routine: ; No predecessors! unreachable }

Great! This looks just like the code we wanted to get, minus the comments and with a few things renamed. Let's make sure it runs as expected:

$ llc -filetype=obj hello.bc $ gcc hello.bc.o $ ./a.out Hello, World! $

At this point in the project we are able to recompile a simple "Hello, World" 6502 program with a small custom ABI into native machine code and then execute it.

Optimization

Did you notice that we never used the value of S_neg ? We only ever stored it. This is a waste of CPU cycles. We can do better. However, we don't want to completely remove the ability to compute S_neg - although this "Hello World" example does not use the value, other code might.

Optimization is an enormously complicated topic, with its own well-deserved field. Luckly, we won't have to wrap our heads around it in order to benefit. LLVM IR code is designed to be optimized. LLVM comes with state of the art optimization techniques, in the form of passes that you run on a module.

Let's run several optimization passes on the module we generate before rendering bitcode:

Optimization Name Description Constant Propagation Looks for instructions involving only constants and replaces them with a constant value instead of an instruction. Combine Redundant Instructions Combines instructions to form fewer, simple instructions. For example if you add 1 twice, it will instead add 2. Promote Memory to Register This pass allows us to load every "register" variable (X, Y, A, etc) before performing an instruction, and then store the register variable back after performing the instruction. This optimization pass will convert our allocated variables into registers, eliminating all the redundancy. Global Value Numbering Eliminates partially and fully redundant instructions, and delete redundant load instructions. Control Flow Graph Simplification Removes unnecessary code and merges basic blocks together when possible.

Here's what it looks like to add these optimization passes to our Compile code:

// ... err := llvm.VerifyModule(c.mod, llvm.ReturnStatusAction) if err != nil { c.Errors = append(c.Errors, err.Error()) return } // This creates an engine object which has useful information // about our machine as a target. engine, err := llvm.NewJITCompiler(c.mod, 3) if err != nil { c.Errors = append(c.Errors, err.Error()) return } defer engine.Dispose() pass := llvm.NewPassManager() defer pass.Dispose() pass.Add(engine.TargetData()) pass.AddConstantPropagationPass() pass.AddInstructionCombiningPass() pass.AddPromoteMemoryToRegisterPass() pass.AddGVNPass() pass.AddCFGSimplificationPass() pass.Run(c.mod) // Uncomment this to print the LLVM IR code we just generated // to stderr. //c.mod.Dump() // ...

LLVM offers many more available passes, and it's up to the user to choose which ones, and in which order, to run. Let's see how the ones chosen in the table above perform on our "Hello World" code:

; ModuleID = 'asm_module' @Label_c000 = private global [15 x i8] c"Hello, World!\0A\00" declare i32 @putchar(i32) declare void @exit(i32) noreturn nounwind define i32 @main() { Entry: br label %Label_c011 Label_c011: ; preds = %else, %Entry %storemerge = phi i8 [ 0, %Entry ], [ %6, %else ] %0 = sext i8 %storemerge to i64 %1 = getelementptr [15 x i8]* @Label_c000, i64 0, i64 %0 %2 = load i8* %1, align 1 %3 = icmp eq i8 %2, 0 br i1 %3, label %Label_c01d, label %else else: ; preds = %Label_c011 %4 = zext i8 %2 to i32 %5 = call i32 @putchar(i32 %4) %6 = add i8 %storemerge, 1 br label %Label_c011 Label_c01d: ; preds = %Label_c011 call void @exit(i32 0) unreachable }

Wow! The code looks completely different, and much simpler. But does it still run?

$ llc -filetype=obj hello.bc $ gcc hello.bc.o $ ./a.out Hello, World! $

Sure does!

Now, not only are we recompiling a simple 6502 program for native machine code, but we're actually generating highly optimized code. But it's not time to congratulate ourselves yet. This is a contrived case. Let's see if these techniques can work on an actual NES game.

Layout of an NES ROM File

The generally accepted standard for distributing NES games is the .nes format, originally started by the iNES emulator. A .nes file looks like:

16 byte header with metadata, such as: What mapper, if any, the game uses.

Whether the PPU uses vertical or horizontal mirroring. For now, don't worry so much about exactly what this means, as much as that it is a setting that the PPU needs to know about at bootup. The 32 KB of assembled code which gets loaded into $8000 - $ffff on bootup. If there is a mapper, there might be more of this program data, but we'll talk about that later. 8KB of graphics data, known as CHR-ROM. This gets loaded into the PPU on bootup. Again if there is a mapper, there might be more of this graphics data.

If you're looking for a reference for the gory details of this format, see the Nesdev Wiki or the official specification.

The first "real" ROM I tried to recompile is this demo ROM made by Chris Covell in 1998 called Zelda Title Screen Simulator. If you run this ROM in an emulator, you get a title screen that looks nearly identical to Zelda 1:

Now that we're recompiling a real ROM, we'll have to do several more things to make it work:

Include the graphics data in the binary.

Include the mirroring setting (vertical or horizontal) in the binary.

Support code gen for more instructions.

Include a runtime to create a window and render the video to the screen.

Including the graphics data and mirroring setting in the binary is trivial. We get all the information we need when we read the ROM file; all we need to do is convert it to global variables in our LLVM module:

// In this snippet, `program` is an object which contains // the information we loaded from the ROM file. // chrRom is [][]byte - n banks of 8KB. At this point, // we only support 1 bank. if len(program.chrRom) != 1 { panic("Only 1 bank of CHR ROM is supported.") } dataLen := 0x2000 * len(chrRom) chrDataValues := make([]llvm.Value, 0, dataLen) int8type := llvm.Int8Type() for _, bank := range chrRom { for _, b := range bank { chrDataValues = append(chrDataValues, llvm.ConstInt(int8type, uint64(b), false)) } } chrDataConst := llvm.ConstArray(llvm.ArrayType(llvm.Int8Type(), dataLen), chrDataValues) // `c.mod` is our `llvm.Module` object. `c` is an instance of our // `Compilation` struct from a previous code listing. chrDataGlobal := llvm.AddGlobal(c.mod, chrDataConst.Type(), "rom_chr_data") // Setting the linkage to private here. We'll figure out what to do // with this information soon. chrDataGlobal.SetLinkage(llvm.PrivateLinkage) chrDataGlobal.SetInitializer(chrDataConst) chrDataGlobal.SetGlobalConstant(true) // Now we add the global variable specifying the mirroring setting. mirroringValue := llvm.ConstInt(llvm.Int8Type(), uint64(program.Mirroring), false) mirroringGlobal := llvm.AddGlobal(c.mod, mirroringValue.Type(), "rom_mirroring") mirroringGlobal.SetLinkage(llvm.PrivateLinkage) mirroringGlobal.SetInitializer(mirroringValue) mirroringGlobal.SetGlobalConstant(true)

The next step to getting our selected ROM to work is to support the additional instructions that this program uses. Let's unpack that rom and see what we're up against:

$ ./jamulate -unrom roms/Zelda.NES loading roms/Zelda.NES disassembling to roms/Zelda $ ls -l roms/Zelda -rw-rw-r-- 1 andy andy 8192 May 24 23:53 chr0.chr -rw-rw-r-- 1 andy andy 7977 May 24 23:53 prg.asm -rw-rw-r-- 1 andy andy 485 May 24 23:53 Zelda.jam

The code to unpack a .nes file into CHR data, PRG data, and metadata is left as an exercise to the reader.

chr0.chr is binary data; it is the 8KB of graphics data mentioned before which is loaded into the PPU on bootup.

Zelda.jam is where I decided to put the program metadata in a simple key-value text format:

# output file name when this rom is assembled filename=Zelda.NES # see http://wiki.nesdev.com/w/index.php/Mapper mapper=0 # 'Horizontal', 'Vertical', or 'FourScreenVRAM' # see http://wiki.nesdev.com/w/index.php/Mirroring mirroring=Horizontal # whether SRAM in CPU $6000-$7FFF is present sram=true # whether the SRAM in CPU $6000-$7FFF, if present, is battery backed battery=false # 'NTSC', 'PAL', or 'DualCompatible' tvsystem=NTSC # assembly code prg=prg.asm # video data chr=chr0.chr

In retrospect, I think it would have been simpler to have the assembler support special declarations for this metadata rather than having a separate file. But hey, it works.

You may notice there is some metadata there which is never addressed in this article. For the purposes of this project, we don't need to think about SRAM, battery, or tvsystem. None of the examples we look at use them.

Without further ado, let's see how the disassembler fared:

.org $c000 .db "Zelda Simulator, ", $a9, "1998 Chris Covell (ccovell@direct.ca)" Reset_Routine: CLD SEI Label_c039: LDA $2002 BPL Label_c039 LDX #$00 STX $2000 STX $2001 DEX TXS LDY #$06 STY $01 LDY #$00 STY $00 LDA #$00 Label_c052: STA ($00), Y DEY BNE Label_c052 DEC $01 BPL Label_c052 LDX #$3f STX $2006 LDX #$00 STX $2006 LDX #$27 LDY #$20 Label_c069: STX $2007 DEY BNE Label_c069 LDX #$3f STX $2006 LDX #$00 STX $2006 LDX #$00 LDY #$20 Label_c07d: LDA Label_c101, X STA $2007 INX DEY BNE Label_c07d LDX #$23 STX $2006 LDX #$c0 STX $2006 LDX #$00 LDY #$40 Label_c095: LDA Label_c121, X STA $2007 INX DEY BNE Label_c095 LDX #$20 STX $2006 LDX #$00 STX $2006 LDX #$00 LDY #$00 Label_c0ad: LDA Label_c261, X STA $2007 INX DEY BNE Label_c0ad LDX #$00 LDY #$00 Label_c0bb: LDA Label_c361, X STA $2007 INX DEY BNE Label_c0bb LDX #$00 LDY #$00 Label_c0c9: LDA Label_c461, X STA $2007 INX DEY BNE Label_c0c9 LDX #$00 LDY #$c0 Label_c0d7: LDA Label_c561, X STA $2007 INX DEY BNE Label_c0d7 LDX #$00 STX $2003 LDX #$00 LDY #$00 Label_c0ea: LDA Label_c161, X STA $2004 INX DEY BNE Label_c0ea LDA #$90 STA $2000 LDA #$1e STA $2001 Label_c0fe: JMP Label_c0fe Label_c101: .db $36, $0d, $00, $10, $36, $17, $27, $0d .db $36, $08, $1a, $28, $36, $30, $31, $22 .db $36, $30, $31, $11, $36, $15, $15, $15 .db $36, $08, $1a, $28, $36, $30, $31, $22 Label_c121: .db $05, $05, $05, $05, $05, $05, $05, $05 .db $08, "jZZZZ", $9a, $22, $00, "fUUUU", $99, $00 .db $00, $6e, $5f, $55, $5d, $df, $bb, $00 .db $00, $0a, $0a, $0a, $0a, $0a, $0a, $00 .db $00, $00, $c0, $30, $00, $00, $00, $00 .db $00, $00, $cc, $33, $00, $00, $00, $00 .db $00, $20, $fc, $f3, $00, $00, $f0, $f0 Label_c161: .db $27, $ca, $02, $28, $2f, $cb, $02, $28 .db $27, $cc, $02, $30, $2f, $cd, $02, $30 .db $27, $d6, $02, $50, $2c, $d6, $02, $a0 .db $27, $cc, $42, $c8, $2f, $cd, $42, $c8 .db $27, $ca, $42, $d0, $2f, $cb, $42, $d0 .db $31, $d2, $02, $57, $31, $d4, $02, $5f .db $3f, $d4, $02, $24, $41, $d4, $02, $63 .db $4f, $d6, $02, $2c, $4f, $d2, $02, $cb .db $57, $ce, $02, $73, $5f, $cf, $02, $73 .db $57, $d0, $02, $7b, $5f, $d1, $02, $7b .db $67, $d2, $02, $7a, $7b, $d4, $02, $90 .db $7b, $d6, $02, $bc, $82, $d2, $02, $50 .db $77, $cb, $82, $28, $7f, $ca, $82, $28 .db $77, $cd, $82, $30, $7f, $cc, $82, $30 .db $77, $cd, $c2, $c8, $7f, $cc, $c2, $c8 .db $77, $cb, $c2, $d0, $7f, $ca, $c2, $d0 .db $af, $a3, $00, $50, $af, $a5, $00, $58 .db $af, $a7, $00, $60, $af, $a9, $00, $68 .db $b7, $b2, $00, $50, $b7, $b4, $00, $58 .db $b7, $b6, $00, $60, $b7, $b8, $00, $68 .db $c9, $c2, $00, $50, $d1, $c3, $00, $50 .db $c9, $c4, $00, $58, $d1, $c5, $00, $58 .db $c9, $c6, $00, $60, $d1, $c7, $00, $60 .db $c9, $c8, $00, $68, $d1, $c9, $00, $68 .db $d9, $c2, $00, $50, $e1, $c3, $00, $50 .db $d9, $c4, $00, $58, $e1, $c5, $00, $58 .db $d9, $c6, $00, $60, $e1, $c7, $00, $60 .db $d9, $c8, $00, $68, $e1, $c9, $00, $68 .db $67, $a0, $03, $58, $67, $a0, $03, $60 .db $67, $a0, $03, $68, $67, $a0, $03, $70 .db $67, $a0, $03, $78, $67, $a0, $03, $80 .db $67, $a0, $03, $88, $00, $b0, $00, $00 Label_c261: .db "$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$", $e0, $d5, "$$$$$$$$$$$$$$$$$$$$$$$$$$$$", $d4, $e0, $dc, $d7 .db "$$$$$$$$$$$$$$$$$$$$$$$$$$$$", $d6, $dd, $dc, $ee, "$$$$$$$$$$$$$$$$$$$$$$$$$$$$", $d6, $db .db $de, $d7, $24, $24, $24, $e6, $e4, $e5 .db $e4, $e5, $e4, $e5, $e4, $e5, $e4, $e5 .db $e4, $e5, $e4, $e5, $e4, $e5, $e4, $e5 .db $e4, $e5, $e6, $24, $24, $24, $d6, $db .db $dc, $d7, $24, $24, $24, $e2, "$$$$$$$$$$$$$$$$$$$$", $e3 .db $24, $24, $24, $d6, $df, $de, $ee, $24 .db $24, $24, $e3, "$$qrstuvwxyyyz{$$$$$", $e2, $24, $24, $24 .db $d6, $db Label_c361: .db $de, $d8, $ef, $24, $24, $e2, $24, $7c .db $7d, $7e, $7f, $80, $81, $82, $83, $84 .db $85, $86, $87, $88, $89, $8a, $8b, $24 .db $24, $24, $e3, $24, $24, $24, $d6, $df .db $dc, $da, $d7, $24, $24, $e3, $24, $8c .db $8d, $8e, $8f, $90, $91, $92, $93, $94 .db $95, $96, $97, $98, $99, $9a, $9b, $9c .db $24, $24, $e2, $24, $24, $d4, $d9, $db .db $dc, $d9, $ee, $24, $24, $e2, $24, $9d .db $9e, $9f, $a0, $a1, $a2, $a3, $a4, $a5 .db $a6, $a7, $a8, $a9, $aa, $ab, $ac, $ad .db $ae, $24, $e3, $24, $24, $d6, $db, $df .db $de, $db, $d7, $24, $24, $e3, $70, $af .db $b0, $b1, $b2, $b3, $b4, $b5, $b6, $b7 .db $b8, $b9, $ba, $bb, $bc, $bd, $be, $bf .db $c0, $24, $e2, $24, $24, $d6, $db, $db .db $dc, $dd, $d7, $24, $24, $e2, "$$$$$$$", $c1 .db $c2, $c3, $c4, $c5, "$$$$$$$$", $e3, $24, $24 .db $d6, $db, $df, $de, $db, $ee, $24, $24 .db $e3, $24, $c6, $c7, $c8, $c8, $c8, $24 .db $c9, $ca, $cb, $cc, $cd, $c8, $c8, $c8 .db $c8, $e8, $e9, $d3, $24, $e2, $24, $24 .db $d6, $db, $db, $dc, $db, $d7, $24, $24 .db $e2, "$$$$$$$$$", $ce, $cf, "$$$$$", $ea, $eb, $ec .db $24, $e3, $24, $24, $d6, $db, $df, $dc .db $db, $d7, $24, $24, $e3, "$$$$$$$$$", $d1, $d2 .db "$$$$$$$$$", $e2, $24, $24, $d6, $db, $db Label_c461: .db $dc, $d8, $e1, $d5, $24, $e6, $e4, $e5 .db $e4, $e5, $e4, $e5, $e4, $e5, $e4, $e5 .db $e4, $e5, $e4, $e5, $e4, $e5, $e4, $e5 .db $e4, $e5, $e6, $24, $d4, $e1, $d9, $dd .db $dc, $da, $dc, $d7, "$$$$$$$$$", $f0, $01, $09 .db $08, $06, $24, $17, $12, $17, $1d, $0e .db $17, $0d, $18, $24, $d6, $dc, $db, $df .db $dc, $da, $dc, $ee, "$$$$$$$$$$$$$$$$$$$$$$$$", $d6, $de, $ed .db $dd, $dc, $da, $de, $d7, "$$$$$$$$$$$$$$$$$$$$$$$$", $d6, $de .db $db, $dd, $e1, $d9, $dc, $ed, $e0, $ef .db $24, $24, $19, $1e, $1c, $11, $24, $1c .db $1d, $0a, $1b, $1d, $24, $0b, $1e, $1d .db $1d, $18, $17, $24, $24, $24, $d6, $d8 .db $e1, $d9, $dd, $ed, $de, $d8, $e1, $e1 .db $d5, "$$$$$$$$$$$$$$$$$$$$$", $d6, $da, $dd, $ed, $dd, $db .db $de, $da, $dc, $dd, $d8, $e0, $e0, $ef .db $24, $24, $24, $24, $d4, $e0, $e0, $d5 .db "$$$$$$$$", $d4, $ef, $da, $da, $df, $db, $df .db $db, $dc, $da, $de, $df, $da, $dc, $dd .db $db, $26, $26, $26, $26, $da, $dc, $dd .db $ed, $e0, $e0, $ef, $24, $d4, $e0, $e0 .db $e0, $d9, $db, $da, $da, $df, $db Label_c561: .db $ed, $d8, $e1, $d9, $de, $d8, $e1, $d9 .db $df, $db, $26, $26, $26, $26, $da, $dc .db $d8, $e1, $d9, $dc, $d8, $e0, $d9, $dc .db $dd, $dd, $d8, $e1, $e1, $d9, $dd, $ed .db $ed, $da, $dd, $ed, $de, $da, $dc, $db .db $dd, $db, $26, $26, $26, $26, $da, $de .db $da, $dc, $db, $de, $da, $dd, $ed, $dc .db $dc, $dd, $da, $dc, $dc, $db, $dd, $ed .db $ed, $da, $dd, $d8, $e1, $da, $dc, $db .db $df, $db, $26, $26, $26, $26, $da, $d8 .db $d9, $dc, $db, $dc, $da, $d8, $e1, $d9 .db $de, $df, $da, $d8, $e1, $e1, $d9, $e1 .db $ed, $d9, $df, $da, $dc, $da, $de, $db .db $dd, $db, $26, $26, $26, $26, $da, $da .db $ed, $de, $d8, $e1, $e1, $d9, $dc, $db .db $de, $d8, $e1, $d9, $dd, $dd, $db, $dc .db $df, $db, $df, $da, $dc, $db, $de, $db .db $dd, $db, $26, $26, $26, $26, $da, $da .db $db, $dd, $da, $de, $d8, $e1, $d9, $db .db $de, $da, $dd, $db, $de, $df, $d8, $e1 .db $df, $db, $df, $da, $dc, $db, $de, $db .db $dd, $db, $26, $26, $26, $26, $da, $da .db $db, $dd, $da, $de, $da, $dd, $db, $db .db $de, $da, $dd, $db, $de, $df, $dc, $e1 NMI_Routine: LDX #$00 STX $2005 STX $2005 RTI IRQ_Routine: RTI .org $fffa, $00 .dw NMI_Routine .dw Reset_Routine .dw IRQ_Routine

Looking at this disassembly, several things come to mind:

Our ASCII detection did a good job with that string at the beginning. Nice.

This code introduces only a few new instructions. It won't be too cumbersome to add code gen to support this.

It looks like the disassembler was able to disassemble the entire program. You can see at the end there is an infinite loop and then data follows.

ASCII detection found some false positives. Oh well. It makes no difference to the execution of the program.

The IRQ interrupt is not used, but it looks like the NMI interrupt might be used.

Let's see if we can once again delay dealing with interrupts at all. Modify the NMI_Routine block so that it looks like this:

NMI_Routine: RTI

Good thing we bothered to make an assembler. Now we can put the ROM back together, run it in an emulator, and see if it still works:

$ ./jamulate -rom roms/Zelda/Zelda.jam building rom from roms/Zelda/Zelda.jam saving rom Zelda.NES

Again, writing the code to pack the rom back together into an .nes file is left as an exercise to the reader.

Now we observe the ROM running in an emulator and see if it still works:

Ha! It works but the sword is bent. That's actually pretty convenient - now when we want to make interrupts work, we know how to test them. But like true software developers, we're going to ignore any and all problems as long as possible.

The only thing left to do then, is to add code gen support for the new instructions. This leads us to our first challenge.

Challenge: Runtime Address Checks

Looking at the disassembled code for Zelda Title Screen Simulator, there is one instruction that stands out as a bit more tricky to recompile than the others:

STA ($00), Y

This is an indirect instruction. It tells the CPU to:

Interpret the values in $0000 and $0001 as a memory address. Add the value of register Y to the address to compute a new address. Store the value of register A to the new address.

Quite a useful instruction, but it poses a challenge for recompilation. Because values in memory addresses $0000 and $0001 could be anything, only at runtime will we learn which memory address is about to be updated.

Recall that the NES uses memory mapped I/O. This means that this instruction could be saving a value in memory, but it could also be talking to the PPU, the APU, the game pad controller, or even a mapper. Since this one instruction could be doing any of these things depending on runtime state, we will have to add a runtime address check.

Here's what the code generation looks like for STA-indirect-Y:

func (i *Instruction) Compile(c *Compilation) { switch i.OpCode { // ... more cases for other indirect instructions case 0x91: // sta indirect y // load the base address at the address indicated by the instruction baseAddr := c.loadWord(i.Value) // load the 8-bit value of register Y rY := c.builder.CreateLoad(c.rY, "") // zero extend the 8-bit value of Y to 16 bits // so that we can add it to baseAddr rYw := c.builder.CreateZExt(rY, llvm.Int16Type(), "") // compute the pointer to the address to store a byte addr := c.builder.CreateAdd(baseAddr, rYw, "") // load the value of register A rA := c.builder.CreateLoad(c.rA, "") // call dynStore with our computed address and value of A. // dynStore will figure out what to do with the value and address. c.dynStore(addr, rA) } } func (c *Compilation) loadWord(addr int) llvm.Value { // Load a little endian word. // Load the low byte. ptrByte1 := c.load(addr) // Load the high byte. ptrByte2 := c.load(addr + 1) // Zero extend the 8-bit values to 16-bit. ptrByte1w := c.builder.CreateZExt(ptrByte1, llvm.Int16Type(), "") ptrByte2w := c.builder.CreateZExt(ptrByte2, llvm.Int16Type(), "") // Shift the high byte left by 8. shiftAmt := llvm.ConstInt(llvm.Int16Type(), 8, false) word := c.builder.CreateShl(ptrByte2w, shiftAmt, "") // Bitwise OR the high and low bytes together and return that. return c.builder.CreateOr(word, ptrByte1w, "") } func (c *Compilation) load(addr int) llvm.Value { // This function is used to load a byte from an address that we know // at compile-time. switch { default: c.Errors = append(c.Errors, fmt.Sprintf("reading from $%04x not implemented", addr)) return llvm.ConstNull(llvm.Int8Type()) case 0x0000 <= addr && addr < 0x2000: // 2KB working RAM. mask because mirrored maskedAddr := addr & (0x800 - 1) indexes := []llvm.Value{ llvm.ConstInt(llvm.Int16Type(), 0, false), llvm.ConstInt(llvm.Int16Type(), uint64(maskedAddr), false), } ptr := c.builder.CreateGEP(c.wram, indexes, "") v := c.builder.CreateLoad(ptr, "") return v case 0x2000 <= addr && addr < 0x4000: // PPU registers. mask because mirrored switch addr & (0x8 - 1) { case 2: return c.builder.CreateCall(c.ppuReadStatusFn, []llvm.Value{}, "") case 4: return c.builder.CreateCall(c.ppuReadOamDataFn, []llvm.Value{}, "") case 7: return c.builder.CreateCall(c.ppuReadDataFn, []llvm.Value{}, "") default: c.Errors = append(c.Errors, fmt.Sprintf("reading from $%04x not implemented", addr)) return llvm.ConstNull(llvm.Int8Type()) } // ... There are more cases to handle; here we only include WRAM and the PPU. } panic("unreachable") } func (c *Compilation) dynStore(addr llvm.Value, val llvm.Value) { // runtime memory check storeDoneBlock := c.createBlock("StoreDone") x2000 := llvm.ConstInt(llvm.Int16Type(), 0x2000, false) inWRam := c.builder.CreateICmp(llvm.IntULT, addr, x2000, "") notInWRamBlock := c.createIf(inWRam) // this generated code runs if the write is happening in the WRAM range maskedAddr := c.builder.CreateAnd(addr, llvm.ConstInt(llvm.Int16Type(), 0x800-1, false), "") indexes := []llvm.Value{ llvm.ConstInt(llvm.Int16Type(), 0, false), maskedAddr, } ptr := c.builder.CreateGEP(c.wram, indexes, "") c.builder.CreateStore(val, ptr) c.builder.CreateBr(storeDoneBlock) // this generated code runs if the write is > WRAM range c.selectBlock(notInWRamBlock) x4000 := llvm.ConstInt(llvm.Int16Type(), 0x4000, false) inPpuRam := c.builder.CreateICmp(llvm.IntULT, addr, x4000, "") notInPpuRamBlock := c.createIf(inPpuRam) // this generated code runs if the write is in the PPU RAM range maskedAddr = c.builder.CreateAnd(addr, llvm.ConstInt(llvm.Int16Type(), 0x8-1, false), "") badPpuAddrBlock := c.createBlock("BadPPUAddr") sw := c.builder.CreateSwitch(maskedAddr, badPpuAddrBlock, 7) // this generated code runs if the write is in a bad PPU RAM addr c.selectBlock(badPpuAddrBlock) c.createPanic("invalid store address: $%04x

", []llvm.Value{addr}) ppuCtrlBlock := c.createBlock("ppuctrl") sw.AddCase(llvm.ConstInt(llvm.Int16Type(), 0, false), ppuCtrlBlock) c.selectBlock(ppuCtrlBlock) c.builder.CreateCall(c.ppuCtrlFn, []llvm.Value{val}, "") c.builder.CreateBr(storeDoneBlock) ppuMaskBlock := c.createBlock("ppumask") sw.AddCase(llvm.ConstInt(llvm.Int16Type(), 1, false), ppuMaskBlock) c.selectBlock(ppuMaskBlock) c.builder.CreateCall(c.ppuMaskFn, []llvm.Value{val}, "") c.builder.CreateBr(storeDoneBlock) oamAddrBlock := c.createBlock("oamaddr") sw.AddCase(llvm.ConstInt(llvm.Int16Type(), 3, false), oamAddrBlock) c.selectBlock(oamAddrBlock) c.builder.CreateCall(c.oamAddrFn, []llvm.Value{val}, "") c.builder.CreateBr(storeDoneBlock) oamDataBlock := c.createBlock("oamdata") sw.AddCase(llvm.ConstInt(llvm.Int16Type(), 4, false), oamDataBlock) c.selectBlock(oamDataBlock) c.builder.CreateCall(c.setOamDataFn, []llvm.Value{val}, "") c.builder.CreateBr(storeDoneBlock) ppuScrollBlock := c.createBlock("ppuscroll") sw.AddCase(llvm.ConstInt(llvm.Int16Type(), 5, false), ppuScrollBlock) c.selectBlock(ppuScrollBlock) c.builder.CreateCall(c.setPpuScrollFn, []llvm.Value{val}, "") c.builder.CreateBr(storeDoneBlock) ppuAddrBlock := c.createBlock("ppuaddr") sw.AddCase(llvm.ConstInt(llvm.Int16Type(), 6, false), ppuAddrBlock) c.selectBlock(ppuAddrBlock) c.builder.CreateCall(c.ppuAddrFn, []llvm.Value{val}, "") c.builder.CreateBr(storeDoneBlock) ppuDataBlock := c.createBlock("ppudata") sw.AddCase(llvm.ConstInt(llvm.Int16Type(), 7, false), ppuDataBlock) c.selectBlock(ppuDataBlock) c.builder.CreateCall(c.setPpuDataFn, []llvm.Value{val}, "") c.builder.CreateBr(storeDoneBlock) // this generated code runs if the write is >= 0x4000 // There are more cases to handle; here we only show // handling WRAM and the PPU. c.selectBlock(notInPpuRamBlock) c.createPanic() // done. X_X c.selectBlock(storeDoneBlock) } func (c *Compilation) createIf(cond llvm.Value) llvm.BasicBlock { // Create a conditional branch along with the 2 target blocks. // Returns the else block and set the current block to the then block. elseBlock := c.createBlock("else") thenBlock := c.createBlock("then") c.builder.CreateCondBr(cond, thenBlock, elseBlock) c.selectBlock(thenBlock) return elseBlock } func (c *Compilation) createBlock(name string) llvm.BasicBlock { bb := llvm.InsertBasicBlock(*c.currentBlock, name) bb.MoveAfter(*c.currentBlock) return bb } func (c *Compilation) selectBlock(bb llvm.BasicBlock) { c.builder.SetInsertPointAtEnd(bb) c.currentBlock = &bb } func (c *Compilation) createPanic() { bytePointerType := llvm.PointerType(llvm.Int8Type(), 0) ptr := c.builder.CreatePointerCast(c.runtimePanicMsg, bytePointerType, "") c.builder.CreateCall(c.putsFn, []llvm.Value{ptr}, "") exitCode := llvm.ConstInt(llvm.Int32Type(), 1, false) c.builder.CreateCall(c.exitFn, []llvm.Value{exitCode}, "") c.builder.CreateUnreachable() }

Let's take a look at what the IR code might look like for this instruction. First, a simplified assembly source file:

.org $c000 Reset_Routine: STA ($00), Y STA $2009 ; exit NMI_Routine: RTI IRQ_Routine: RTI .org $fffa .dw NMI_Routine .dw Reset_Routine .dw IRQ_Routine

And then the LLVM module that would be generated, before optimizations:

; ModuleID = 'asm_module' @wram = private global [2048 x i8] zeroinitializer @panicMsg = private global [51 x i8] c"panic: attempted to write to invalid memory address" declare i32 @putchar(i32) declare i32 @puts(i8*) declare void @exit(i32) noreturn nounwind declare i8 @rom_ppu_read_status() declare void @rom_ppu_write_control(i8) declare void @rom_ppu_write_mask(i8) declare void @rom_ppu_write_address(i8) declare void @rom_ppu_write_data(i8) declare void @rom_ppu_write_oamaddress(i8) declare void @rom_ppu_write_oamdata(i8) declare void @rom_ppu_write_scroll(i8) define i32 @main() { Entry: %X = alloca i8 %Y = alloca i8 %A = alloca i8 %SP = alloca i8 %S_neg = alloca i1 %S_zero = alloca i1 %S_dec = alloca i1 %S_int = alloca i1 br label %Reset_Routine Reset_Routine: ; preds = %Entry %0 = load i8* getelementptr inbounds ([2048 x i8]* @wram, i8 0, i8 0) %1 = load i8* getelementptr inbounds ([2048 x i8]* @wram, i8 0, i8 1) %2 = zext i8 %0 to i16 %3 = zext i8 %1 to i16 %4 = shl i16 %3, 8 %5 = or i16 %4, %2 %6 = load i8* %Y %7 = zext i8 %6 to i16 %8 = add i16 %5, %7 %9 = load i8* %A %10 = icmp ult i16 %8, 8192 br i1 %10, label %then, label %else then: ; preds = %Reset_Routine %11 = and i16 %8, 2047 %12 = getelementptr [2048 x i8]* @wram, i16 0, i16 %11 store i8 %9, i8* %12 br label %STA_done else: ; preds = %Reset_Routine %13 = icmp ult i16 %8, 16384 br i1 %13, label %then2, label %else1 then2: ; preds = %else %14 = and i16 %8, 7 switch i16 %14, label %BadPPUAddr [ i16 0, label %ppuctrl i16 1, label %ppumask i16 3, label %oamaddr i16 4, label %oamdata i16 5, label %ppuscroll i16 6, label %ppuaddr i16 7, label %ppudata ] BadPPUAddr: ; preds = %then2 %15 = call i32 @puts(i8* getelementptr inbounds ([51 x i8]* @panicMsg, i32 0, i32 0)) call void @exit(i32 1) unreachable ppuctrl: ; preds = %then2 call void @rom_ppu_write_control(i8 %9) br label %STA_done ppumask: ; preds = %then2 call void @rom_ppu_write_mask(i8 %9) br label %STA_done oamaddr: ; preds = %then2 call void @rom_ppu_write_oamaddress(i8 %9) br label %STA_done oamdata: ; preds = %then2 call void @rom_ppu_write_oamdata(i8 %9) br label %STA_done ppuscroll: ; preds = %then2 call void @rom_ppu_write_scroll(i8 %9) br label %STA_done ppuaddr: ; preds = %then2 call void @rom_ppu_write_address(i8 %9) br label %STA_done ppudata: ; preds = %then2 call void @rom_ppu_write_data(i8 %9) br label %STA_done else1: ; preds = %else %16 = call i32 @puts(i8* getelementptr inbounds ([51 x i8]* @panicMsg, i32 0, i32 0)) call void @exit(i32 1) unreachable STA_done: ; preds = %ppudata, %ppuaddr, %ppuscroll, %oamdata, %oamaddr, %ppumask, %ppuctrl, %then %17 = load i8* %A %18 = zext i8 %17 to i32 call void @exit(i32 %18) br label %NMI_Routine NMI_Routine: ; preds = %STA_done unreachable IRQ_Routine: ; No predecessors! unreachable }

With this framework in place, we can recompile instructions that store to memory addresses only known at runtime. Code generation for the other instructions in this assembly program is straightforward at this point.

So now we can generate a new, native program to run. But how do we get the video to actually display on the screen?

This question brings us to the next challenge.

Challenge: Parallel Systems Running Simultaneously

To find out how to get something rendering on the screen, I looked at Fergulator - an already-working NES emulator written in Go.

Fergulator correctly emulates Super Mario Brothers 1 as well as Zelda Title Screen Simulator, so it will certainly work for our purpose - understanding how the video gets onto the screen.

The answer is easy to find in this well factored codebase. Looking at the main loop in machine.go , the core logic is revealed:

cycles = cpu.Step() for i := 0; i < 3*cycles; i++ { ppu.Step() } for i := 0; i < cycles; i++ { apu.Step() }

The CPU runs one instruction and returns how many cycles the instruction took to run, and then the PPU is stepped for 3 times as many cycles, and the APU is stepped the same number of cycles as the CPU.

A problem presents itself here. Our recompiled code replaces the CPU stepping, but we still have the PPU and the APU code to reckon with. We can start by eliminating audio from the equation and dealing with sound later. But no matter how you spin it, the fact remains that the PPU and the CPU run independently of one another, and at differing speeds.

You can choose to recompile program code and run that as the main loop, but then after every instruction you must emulate the PPU for 3 times as many cycles as the instruction took. Or you can choose to run the PPU as the main loop, but then after the appropriate amount of cycles you must emulate the CPU for one instruction.

One of the systems must be emulated.

This is how I solved the problem:

Have code generation call rom_cycle , an external function, after every instruction completes, passing the number of cycles the instruction took as a parameter.

, an external function, after every instruction completes, passing the number of cycles the instruction took as a parameter. Bundle in a runtime with the generated executable which implements rom_cycle and emulates the PPU for the appropriate amount of steps.

It is a shame that we have to do some amount of emulation in this project, where the goal is to statically recompile as much as possible. The solution to this challenge represents a slight compromise to the project's integrity. Yet we press on.

Given this solution, I ported the PPU code as well as the SDL and OpenGL front-end code from Fergulator to a small C runtime, which is compiled with clang. Here is a snippet explaining the rom_cycle and main functions:

#include "rom.h" #include "assert.h" #include "ppu.h" #include "SDL/SDL.h" #include "GL/glew.h" Ppu* p; // This function is called by the generated module after every instruction. void rom_cycle(uint8_t cycles) { // Check the SDL event loop and quit if the Close event occurs. flush_events(); // Step the PPU for 3 times number of cycles that just finished. for (int i = 0; i < 3 * cycles; ++i) { Ppu_step(p); } } // This function is our new main entry point. We rename the // main rom entry point to `rom_start` so that we can call it // from this function. int main(int argc, char* argv[]) { // Create a new instance of the PPU emulator core. p = Ppu_new(); // The PPU code will call `render` when there is a frame ready to display // on the screen. The `render` function performs the SDL and OpenGL // calls to render the frame to the window. p->render = &render; // Remember that in the generated rom module, we export the ROM mirroring // setting after reading it from the .nes file. Here we use it to configure // the nametable code. Nametable_setMirroring(&p->nametables, rom_mirroring); // We currently only support ROMs with 1 CHR bank. assert(rom_chr_bank_count == 1); // In the generated rom module, we have the CHR ROM data in memory, as well // as `rom_read_chr`, an exported function which will copy the data to a // pointer. We use that to initialize the video RAM. rom_read_chr(p->vram); // This does the SDL and OpenGL setup such as creating the display window. init_video(); // Here we call into the main entry point to the recompiled ROM code. rom_start(); // Free up memory associated with the PPU emulator instance. Ppu_dispose(p); } uint8_t rom_ppu_read_status() { return Ppu_readStatus(p); } void rom_ppu_write_control(uint8_t b) { Ppu_writeControl(p, b); } void rom_ppu_write_mask(uint8_t b) { Ppu_writeMask(p, b); } void rom_ppu_write_oamaddress(uint8_t b) { Ppu_writeOamAddress(p, b); } void rom_ppu_write_address(uint8_t b) { Ppu_writeAddress(p, b); } void rom_ppu_write_data(uint8_t b) { Ppu_writeData(p, b); } void rom_ppu_write_oamdata(uint8_t b) { Ppu_writeOamData(p, b); } void rom_ppu_write_scroll(uint8_t b) { Ppu_writeScroll(p, b); }

Notice that we include rom.h . This is a header file that defines the contract that the generated rom module will fulfill:

#include "stdint.h" enum { ROM_MIRRORING_VERTICAL, ROM_MIRRORING_HORIZONTAL, ROM_MIRRORING_SINGLE_UPPER, ROM_MIRRORING_SINGLE_LOWER, }; uint8_t rom_mirroring; uint8_t rom_chr_bank_count; // write the chr rom into dest void rom_read_chr(uint8_t* dest); // starts executing the PRG ROM. // this function returns when the the program exits. void rom_start(); // called after every instruction with the number of // cpu cycles that have passed. void rom_cycle(uint8_t); // PPU hooks uint8_t rom_ppu_read_status(); void rom_ppu_write_control(uint8_t); void rom_ppu_write_mask(uint8_t); void rom_ppu_write_oamaddress(uint8_t); void rom_ppu_write_oamdata(uint8_t); void rom_ppu_write_scroll(uint8_t); void rom_ppu_write_address(uint8_t); void rom_ppu_write_data(uint8_t);

When we build the final executable file, we can link against this runtime to get a fully operational executable with a video display.

After adding runtime compilation instructions, make does this:

$ make go tool yacc -o jamulator/y.go -v /dev/null jamulator/asm6502.y /home/andy/gocode/bin/nex -e jamulator/asm6502.nex clang -o runtime/main.o -c runtime/main.c clang -o runtime/ppu.o -c runtime/ppu.c clang -o runtime/nametable.o -c runtime/nametable.c ar rcs runtime/runtime.a runtime/main.o runtime/ppu.o runtime/nametable.o go build -o jamulate main.go $

Next we can implement a command which will read a .nes ROM file and perform the recompilation process:

$ ./jamulate -recompile roms/Zelda.NES loading roms/Zelda.NES Disassembling... Decompiling to /tmp/302858262/prg.bc... llc -o /tmp/302858262/prg.o -filetype=obj /tmp/302858262/prg.bc gcc /tmp/302858262/prg.o runtime/runtime.a -lGLEW -lGL -lSDL -lSDL_gfx -o roms/Zelda Done: roms/Zelda $ file ./roms/Zelda ./roms/Zelda: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.24, BuildID[sha1]=0xa79ee93a78a4745b60ba9d10a58b209d756e8ffa, not stripped $

And so here we have it: an executable binary that depends on SDL and OpenGL, which supposedly will execute the Zelda Title Screen Simulator when run. Does it work?

Note: There is an obvious discrepancy between the screenshot and the output listed above. All output is real; the difference is that the article is organized, ordered, and simplified for clarity and understanding. I did not necessarily write the code in the same order that it is listed in this article.

So we made a small concession to solve this challenge, and we have a simple demo NES ROM running natively. Next let's see if we can fix the dent in that sword.

Challenge: Handling Interrupts

In NES programming, after any instruction, it is possible that the program counter is yanked away from the next expected instruction and instead sent to a predefined location. This is called an interrupt.

There are 3 kinds of interrupts in NES programming:

Reset Occurs when the user presses the reset button. This is also where the program counter starts when the NES powers on. IRQ Stands for Interrupt ReQuest. This interrupt can only be fired if a game uses a mapper or uses the BRK instruction. This request can be enabled and disabled with the CLI and SEI instructions. NMI Stands for Non-Maskable Interrupt because there are no instructions to enable and disable this interrupt. It occurs when the vertical blank begins.

When an interrupt occurs, the program counter and status register are pushed onto the stack, and the program counter is set to the location defined by the interrupt vector table, as seen in code listings above:

.org $fffa .dw NMI_Routine .dw Reset_Routine .dw IRQ_Routine

When a program has finished processing an interrupt, it typically executes the RTI instruction to return control flow back to where it was before the interrupt occurred.

We don't yet have to deal with handling the reset button being pressed, mappers, or games which execute the BRK instruction, so let's work on solving this problem for the NMI interrupt only.

The PPU performs the vertical blanking which signals the NMI interrupt, and the PPU emulation code is in the runtime. In order to handle NMI interrupts correctly, we have to be ready to jump to the interrupt routine after every single instruction.

We already call into the runtime by calling rom_cycle after every instruction. This is where the PPU emulation is performed, which is where the NMI interrupt is generated. Given this, a natural solution might be something like this:

Have rom_cycle return a value indicating which interrupt, if any, has occurred.

return a value indicating which interrupt, if any, has occurred. After every instruction, check if an interrupt has occurred, and if so, jump to the interrupt routine. Otherwise, continue.

This solution is rather ugly, however. Branching after every instruction decreases execution speed and executable leanness. Further, it leaves a critical problem unsolved: how to jump back to where the program counter was before the interrupt occurred. The solution I came up with instead feels better, but it comes with a caveat. Here's the idea:

Switch the register variables from being allocated on the stack in rom_start , to global variables in the generated module.

, to global variables in the generated module. Update rom_start , the main entry point, to take a parameter indicating which interrupt vector to execute.

, the main entry point, to take a parameter indicating which interrupt vector to execute. In the runtime, when rom_start is called for the first time, pass Reset as the interrupt vector to execute.

is called for the first time, pass Reset as the interrupt vector to execute. In the generated rom_start code, insert code at the beginning of the NMI interrupt routine block to push the program counter and the processor status to the stack.

code, insert code at the beginning of the NMI interrupt routine block to push the program counter and the processor status to the stack. In rom_cycle , if there is no interrupt, return as normal. However, if there is an interrupt, call rom_start , passing the correct interrupt vector to execute.

, if there is no interrupt, return as normal. However, if there is an interrupt, call , passing the correct interrupt vector to execute. Generate a return statement for the RTI instruction.

The beauty of this solution is that it uses the real native stack for interrupts, in what is probably the most efficient and elegant way to get the desired behavior.

The caveat is that if the game uses RTI for its side-effects instead of the usual "return from interrupt" behavior, the executable will unexpectedly exit. Further, if the game simulates returning from an interrupt without using RTI , the game will crash due to a stack overflow.

Acknowledging these weaknesses, let's plow ahead until we are forced to solve this problem a different way.

Let's see what it looks like to implement the plan.

Update rom_start to take a parameter indicating which interrupt vector to execute:

func (p *Program) Compile(filename string) (c *Compilation) { // ... mainType := llvm.FunctionType(llvm.VoidType(), []llvm.Type{llvm.Int8Type()}, false) c.mainFn = llvm.AddFunction(c.mod, "rom_start", mainType) c.mainFn.SetFunctionCallConv(llvm.CCallConv) // ... }

Add a reset vector enum to rom.h:

enum { ROM_INTERRUPT_NONE, ROM_INTERRUPT_NMI, ROM_INTERRUPT_RESET, ROM_INTERRUPT_IRQ, };

In the runtime, when rom_start is called for the first time, pass Reset as the interrupt vector to execute. Also, in rom_cycle , if there is no interrupt, return as normal. However, if there is an interrupt, call rom_start , passing the correct interrupt vector to execute:

int interruptRequested = ROM_INTERRUPT_NONE; void rom_cycle(uint8_t cycles) { flush_events(); for (int i = 0; i < 3 * cycles; ++i) { Ppu_step(p); } int req = interruptRequested; if (req != ROM_INTERRUPT_NONE) { interruptRequested = ROM_INTERRUPT_NONE; rom_start(req); } } void vblankInterrupt() { interruptRequested = ROM_INTERRUPT_NMI; } int main(int argc, char* argv[]) { p = Ppu_new(); p->render = &render; // The PPU code will call this function when an NMI interrupt occurs. p->vblankInterrupt = &vblankInterrupt; Nametable_setMirroring(&p->nametables, rom_mirroring); assert(rom_chr_bank_count == 1); rom_read_chr(p->vram); init_video(); // Start the rom executing at the Reset interrupt routine. rom_start(ROM_INTERRUPT_RESET); Ppu_dispose(p); }

In the generated rom_start code, insert code at the beginning of the NMI interrupt routine block to push the program counter and the processor status to the stack:

func (c *Compilation) addNmiInterruptCode() { c.builder.SetInsertPointBefore(c.nmiBlock.FirstInstruction()) // * push PC high onto stack // * push PC low onto stack c.pushWordToStack(c.builder.CreateLoad(c.rPC, "")) // * push processor status onto stack c.pushToStack(c.getStatusByte()) } func (c *Compilation) pushWordToStack(word llvm.Value) { high16 := c.builder.CreateLShr(word, llvm.ConstInt(llvm.Int16Type(), 8, false), "") high := c.builder.CreateTrunc(high16, llvm.Int8Type(), "") c.pushToStack(high) low16 := c.builder.CreateAnd(word, llvm.ConstInt(llvm.Int16Type(), 0xff, false), "") low := c.builder.CreateTrunc(low16, llvm.Int8Type(), "") c.pushToStack(low) } func (c *Compilation) pushToStack(v llvm.Value) { // write the value to the address at current stack pointer sp := c.builder.CreateLoad(c.rSP, "") spZExt := c.builder.CreateZExt(sp, llvm.Int16Type(), "") addr := c.builder.CreateAdd(spZExt, llvm.ConstInt(llvm.Int16Type(), 0x100, false), "") c.dynStore(addr, v) // stack pointer = stack pointer - 1 spMinusOne := c.builder.CreateSub(sp, llvm.ConstInt(llvm.Int8Type(), 1, false), "") c.builder.CreateStore(spMinusOne, c.rSP) } func (c *Compilation) getStatusByte() llvm.Value { // zextend s7z := c.builder.CreateZExt(c.builder.CreateLoad(c.rSNeg, ""), llvm.Int8Type(), "") s6z := c.builder.CreateZExt(c.builder.CreateLoad(c.rSOver, ""), llvm.Int8Type(), "") s4z := c.builder.CreateZExt(c.builder.CreateLoad(c.rSBrk, ""), llvm.Int8Type(), "") s3z := c.builder.CreateZExt(c.builder.CreateLoad(c.rSDec, ""), llvm.Int8Type(), "") s2z := c.builder.CreateZExt(c.builder.CreateLoad(c.rSInt, ""), llvm.Int8Type(), "") s1z := c.builder.CreateZExt(c.builder.CreateLoad(c.rSZero, ""), llvm.Int8Type(), "") s0z := c.builder.CreateZExt(c.builder.CreateLoad(c.rSCarry, ""), llvm.Int8Type(), "") // shift s7z = c.builder.CreateShl(s7z, llvm.ConstInt(llvm.Int8Type(), 7, false), "") s6z = c.builder.CreateShl(s6z, llvm.ConstInt(llvm.Int8Type(), 6, false), "") s4z = c.builder.CreateShl(s4z, llvm.ConstInt(llvm.Int8Type(), 4, false), "") s3z = c.builder.CreateShl(s3z, llvm.ConstInt(llvm.Int8Type(), 3, false), "") s2z = c.builder.CreateShl(s2z, llvm.ConstInt(llvm.Int8Type(), 2, false), "") s1z = c.builder.CreateShl(s1z, llvm.ConstInt(llvm.Int8Type(), 1, false), "") // or s0z = c.builder.CreateOr(s0z, s1z, "") s0z = c.builder.CreateOr(s0z, s2z, "") s0z = c.builder.CreateOr(s0z, s3z, "") s0z = c.builder.CreateOr(s0z, s4z, "") s0z = c.builder.CreateOr(s0z, s6z, "") s0z = c.builder.CreateOr(s0z, s7z, "") return s0z }

Generate a return statement for the RTI instruction:

func (i *Instruction) Compile(c *Compilation) { switch i.OpCode { // ... other cases case 0x40: // RTI implied // Restore the old status register values from the stack. c.pullStatusReg() // We get the PC from the stack, but since we rely on the native // stack, we trash the value. _ = c.pullWordFromStack() // This will call the `rom_cycle` function. RTI always takes // 6 cycles. c.cycle(6) // Generate a return statement. c.builder.CreateRetVoid() // So that when another block is encountered, we do not // create an unconditional branch to it. c.currentBlock = nil // ... other cases } } func (c *Compilation) cycle(count int) { v := llvm.ConstInt(llvm.Int8Type(), uint64(count), false) c.builder.CreateCall(c.cycleFn, []llvm.Value{v}, "") }

And so here we have it: an elegant but imperfect solution to the interrupt problem. Let's see if it fixes the bent sword:

Looks like progress to me! At this point the project can recompile a small, simple title screen demo NES program. The real challenge awaits: can it be made to work for a real NES game?

There are many games to choose from. Ideally we would pick one that poses fewer additional challenges. One filter we can apply is to eliminate games that use mappers, since we have hitherto ignored mapper support entirely.

This limits the choices significantly. The only games worth noting that do not use a mapper are:

Of these, there is an obvious answer to which game we should support first, which, of course, is Super Mario Brothers 1.

Our next challenge is revealed when we crack open the disassembly for SMB.

Challenge: Detecting a Jump Table

Here's a small section of Super Mario Brothers 1 disassembly:

Label_8212: LDA $0770 JSR Label_8e04 AND ($82), Y .db $dc, $ae, $8b, $83, $18, $92

Notice anything peculiar?

After the value at $0770 is loaded into register A, and we return from the Label_8e04 subroutine, there is an uncommon indirect AND instruction, followed by data. What could possibly be happening here?

Super Mario Brothers 1 is using a common assembly programming technique called a dynamic jump table. Take a look at the Label_8e04 subroutine:

Label_8e04: ASL TAY PLA STA $04 PLA STA $05 INY LDA ($04), Y STA $06 INY LDA ($04), Y STA $07 JMP ($0006)

Notice that although this label is jumped to with JSR , it never uses the RTS instruction. Let's break it down further into readable pseudocode:

; Dynamic Jump Table. Call this label with JSR ; so that the old PC is on the stack. ; Immediately following the JSR statement should be ; .dw statements indicating the labels to jump to ; depending on the value of register A. Label_8e04: ; Register A holds the index of the label that we wish to jump to. ; Multiply A by 2 because each table entry is 2 bytes. ASL ; A = A * 2 ; The useful indirect instructions use Y as the index, and we need ; to repurpose A. TAY ; Y = A ; Since this label was called with JSR, the old PC is on the top ; of the stack. Here we get the lower byte since this is a little ; endian system. PLA ; A = Stack.Pop() ; Save the lower byte of the old PC into memory. STA $04 ; Memory[$04] = A ; Get the higher byte of the old PC off the stack. PLA ; A = Stack.Pop() ; Save the higher byte of the old PC into memory. STA $05 ; Memory[$05] = A ; JSR pushes the address - 1 of the next instruction to the stack. ; So we add 1 to Y to get the index of the first byte of the jump ; destination. INY ; Y = Y + 1 ; Get the first byte of the jump destination. LDA ($04), Y ; A = Memory[Memory[$04] + Y] ; Save the first byte of the jump destination. STA $06 ; Memory[$06] = A ; Increment Y to get the index of the 2nd byte of the jump destination. INY ; Y = Y + 1 ; Get the 2nd byte of the jump instruction. LDA ($04), Y ; A = Memory[Memory[$04] + Y] ; Save the 2nd byte of the jump instruction. STA $07 ; Memory[$07] = A ; Jump to the location that was just constructed. JMP ($0006) ; Jump to address at $0006 - $0007

If we know that Label_8e04 is a jump table, we can mark the bytes following the JSR as .dw labels which enables us to further disassemble the program. The disassembled snippet from earlier would look like this:

Label_8212: LDA $0770 JSR Label_8e04 .dw Label_8231 .dw Label_aedc .dw Label_838b .dw Label_9218

Without this jump table detection, the bytes at each of those labels remain .db statements, unable to be decoded. This is problematic because our strategy currently depends on all instructions being completely disassembled so that they can be decompiled and recompiled.

This is not a mere corner-case either. This technique is repeated in many games, including Super Mario Brothers 3 and Pac-Man.

Fortunately, it is straightforward to identify and process a jump table like this without changing too much code. I solved it with a state machine - essentially pattern-matching or a regular expression for instructions:

func (d *Disassembly) detectJumpTable(addr int) bool { const ( expectAsl = iota expectTay expectPlaA expectStaA expectPlaB expectStaB expectInyC expectLdaC expectStaC expectInYD expectLdaD expectStaD expectJmp ) state := expectAsl var memA, memC int for elem := d.prog.elemAtAddr(addr); elem != nil; elem = elem.Next() { switch state { case expectAsl: i, ok := elem.Value.(*Instruction) if !ok { return false } if i.OpCode != 0x0a { return false } state = expectTay case expectTay: i, ok := elem.Value.(*Instruction) if !ok { return false } if i.OpCode != 0xa8 { return false } state = expectPlaA case expectPlaA: i, ok := elem.Value.(*Instruction) if !ok { return false } if i.OpCode != 0x68 { return false } state = expectStaA case expectStaA: i, ok := elem.Value.(*Instruction) if !ok { return false } if i.OpCode != 0x85 && i.OpCode != 0x8d { return false } memA = i.Value state = expectPlaB case expectPlaB: i, ok := elem.Value.(*Instruction) if !ok { return false } if i.OpCode != 0x68 { return false } state = expectStaB case expectStaB: i, ok := elem.Value.(*Instruction) if !ok { return false } if i.OpCode != 0x85 && i.OpCode != 0x8d { return false } if i.Value != memA+1 { return false } state = expectInyC case expectInyC: i, ok := elem.Value.(*Instruction) if !ok { return false } if i.OpCode != 0xc8 { return false } state = expectLdaC case expectLdaC: i, ok := elem.Value.(*Instruction) if !ok { return false } if i.OpCode != 0xb1 { return false } if i.Value != memA { return false } state = expectStaC case expectStaC: i, ok := elem.Value.(*Instruction) if !ok { return false } if i.OpCode != 0x85 && i.OpCode != 0x8d { return false } memC = i.Value state = expectInYD case expectInYD: i, ok := elem.Value.(*Instruction) if !ok { return false } if i.OpCode != 0xc8 { return false } state = expectLdaD case expectLdaD: i, ok := elem.Value.(*Instruction) if !ok { return false } if i.OpCode != 0xb1 { return false } if i.Value != memA { return false } state = expectStaD case expectStaD: i, ok := elem.Value.(*Instruction) if !ok { return false } if i.OpCode != 0x85 && i.OpCode != 0x8d { return false } if i.Value != memC+1 { return false } state = expectJmp case expectJmp: i, ok := elem.Value.(*Instruction) if !ok { return false } if i.OpCode != 0x6c { return false } if i.Value != memC { return false } return true } } return false }

Given this detection function, it is a matter of adding logic to the JSR disassembly. If a jump table is detected, mark the following bytes as .dw label statements. Otherwise, continue marking the next address as an instruction as usual.

After adding jump table detection, it looks like all the program instructions are disassembled successfully. But there are still some tricks up the assembly programmers' sleeves.

Challenge: Indirect Jumps

We just looked at a dynamic jump table implementation which included this instruction:

JMP ($0006)

This is problematic because we do not know what will be in the memory addresses $0006 and $0007 until the instruction is actually executed.

One solution is to use our own jump table. We can use our knowledge of the address of each label to create a basic block to jump to when an indirect jump is encountered:

func (c *Compilation) addDynJumpTable() { c.dynJumpBlock = llvm.AddBasicBlock(c.mainFn, "DynJumpTable") c.builder.SetInsertPointAtEnd(c.dynJumpBlock) pc := c.builder.CreateLoad(c.rPC, "") // Here, panic block causes a runtime error and crashes. sw := c.builder.CreateSwitch(pc, panicBlock, len(c.dynJumpAddrs)) for addr, block := range c.dynJumpAddrs { addrVal := llvm.ConstInt(llvm.Int16Type(), uint64(addr), false) sw.AddCase(addrVal, block) } }

And when a JMP indirect 