Compilers
Time: Tues & Thurs, 10:45 AM-12:05 PM
Location: WEB 2230
Instructor: Matt Might (MEB 3450)
Office hours: After lecture, by appointment, or whenever my door is open.

TAs: Limou Wang (MEB 4152), Petey Aldous (MEB 3375)
Limou hours: 2:00pm - 3:00pm (Mon, Wed), 1:00pm - 2:00pm (Fri)
Petey hours: 9:30am - 10:30am (Mon, Wed), 12:30pm - 1:30pm (Mon, Wed)

Group: Utah Compilers, Spring 2013
Blog: blog.might.net (I'll post lecture notes here.)

Description The field of compilers concerns the automated translation of high-level programming languages into low-level code. This course covers the design and implementation of compilers, and it explores related topics such as interpreters, virtual machines and runtime systems. This is not an easy course, but it is a rewarding one for those who finish.

Few programmers will ever have to work on a production compiler, but knowing how compilers work is an asset. It makes you a better programmer. Knowing how a compiler thinks will change how you code. Once you know what compilers can and cannot optimize, you will target the intersection of human and mechanical comprehension. Your code will look better. Your code will run better.

Many programming tasks can be framed in terms of interpretation or translation. Compilers teaches you how to solve tasks like these properly. Many companies develop an internal programming or scripting language. Most of them get it wrong. If you've taken compilers, you can fix it. Finally, if you want to develop your own programming language, take compilers in order to spare the world some accidental indecency.

Schedule

Grading There are two ways to earn a grade in this class: tests and projects. Your grade is your tests grade or your projects grade, whichever is higher. (You do not have to take the final if you are satisfied with your projects.) Exams Exams will be impossible to finish in the allotted time. You should answer the questions you know best. The weight of each question is proportional to the percentage of students that failed to answer it or answered it incorrectly. The result will be curved.

Collaboration policy For the projects, you are allowed to pair with a partner and submit a single, joint assignment. You may also work alone. I encourage the formation of study groups to discuss the theory and algorithms behind the projects. You may not share code with anyone other than your partner. If you want to learn it, you have to implement it. Violators of the collaboration policy will be dealt with under the University's academic misconduct policy, and they will fail the course.

How to cheat in this class I've caught so many clueless attempts at cheating in my courses that I feel the need to provide a guide to students. You may not use code you found on the web, even if modified.

You may not use code you found on the web, even if it's GPL'd.

Do not turn in code with the same md5sum as your friend's assignment.

You can't simply change comments and spacing. I can tokenize your input to eliminate these differences.

You can't just change variable names. I'm going to diff a token stream of your assignment against others.

You can't just move definitions around. Same reason as before.

You can't copy only part of an assignment. Same reason as before. If you're going to cheat, you need to rewrite the assignment from scratch.
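To see concretely why the comment, spacing, and renaming tricks fail, here's a minimal sketch of the idea (hypothetical Python, not the actual tool used for this class): tokenize both files, drop comments, and canonicalize identifier names before comparing.

```python
import re

# Crude tokenizer: comments, identifiers, numbers, or any single symbol.
TOKEN = re.compile(r'#.*|[A-Za-z_]\w*|\d+|\S')

def normalize(src):
    """Tokenize src, drop comments and whitespace, and rename identifiers
    to canonical names (id0, id1, ...) in order of first appearance."""
    names = {}
    out = []
    for tok in TOKEN.findall(src):
        if tok.startswith('#'):
            continue                              # comments vanish
        if re.match(r'[A-Za-z_]\w*$', tok):
            tok = names.setdefault(tok, 'id%d' % len(names))
        out.append(tok)
    return out

a = "x = 1  # mine\nprint(x)"
b = "total=1 # totally different\nprint( total )"
print(normalize(a) == normalize(b))   # identical after normalization
```

This simplistic version even renames keywords, but it already defeats renamed variables, reflowed whitespace, and rewritten comments; diffing the normalized streams catches moved definitions and partial copies too.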

Projects This semester, we'll be building a Python 3.1 compiler. You can use any language you want to do the implementation. I strongly encourage learning and using a functional language such as Racket, Scala, OCaml/SML, Clojure or Haskell. Anything else is a knife in a gunfight.

Lexical analyzer
Parser
High-level translator
Normalizer & CPS converter
Low-level translator
Register allocator & Assembly code emitter

Bonus opportunities Valid until the week of finals.

+5% to P1 grade: handle Unicode.

+5% to P1 grade: handle tabs in input.

Use an exploit on caprica to gain root access: +10% for a local user exploit; +15% for a remote exploit (e.g. breaking in via apache). You must exploit a vulnerability (e.g. buffer overflow) for Ubuntu 12.10 on caprica to gain root; that is, you can't steal my laptop while it has an open ssh connection to caprica to claim the prize. You must write up a short summary of the vulnerability and how you exploited it. (You may use a prepackaged tool for exploitation.) Mail the summary to me for approval and then to the class. Each individual exploit may only be claimed once, and the first to exploit wins. To signal that you have claimed root, modify the message of the day.

Project 1: A lexer for Python Due: Friday, Feb 1, 11:59 PM Utah time

Your task is to construct a lexical analyzer for the Python programming language. Python is a relatively simple language to tokenize, with the notable exception of its significant whitespace. For reference, the lexical specification for Python is available as part of the official documentation.

Guidelines You can use any language you want for this project, but you must be able to run your program under Linux or Mac OS X. If the TA or I can't get your code to run, we'll ask you to help us get it running.

Requirements Include a README.txt file that contains: your name; a list of software dependencies necessary to run your assignment; a description of what you got working, what's partially working and what's completely broken; and a manifest briefly describing each file you've turned in. Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.

The program should use a Makefile:
make should compile the program (if compilation is necessary).
make run should accept input on STDIN and send output to STDOUT.
make test should run any test cases you used.

The program should output one s-expression per line for each token in the input. There are eight kinds of s-expressions you can emit:
(NEWLINE) -- for a logical newline.
(INDENT) -- for a logical increase in indentation.
(DEDENT) -- for a logical decrease in indentation.
(ID name) -- for an identifier, where name is a Racket-compatible literal string.
(LIT value) -- for a literal value, where value is a Racket-compatible literal value. (If it's a string, it must be properly escaped.)
(KEYWORD symbol) -- for a keyword, where symbol is a Racket-compatible literal symbol.
(PUNCT text) -- for operators and delimiters, where text is a Racket string containing the text of the operator or delimiter.
(ENDMARKER) -- for the end of the input.
If you encounter a lexical error, print (ERROR "explanation") and quit.

Simplifications You do not have to handle multiple character-set encodings. Assume the input is ASCII. (This nullifies the need to handle regular strings and byte strings differently.) Assume an identifier matches the regular expression [A-Za-z_][A-Za-z_0-9]* . You should not join adjacent strings together. That's easier to do with a later pass. If a tab appears as an indentation character, it is an error. (You do not have to bump it to the nearest multiple of 8.)

Hints and tips You can use a lexer generator like flex or Racket's lexer for this project. (You'll find flex preinstalled on most Unix systems.) If you go this route, be sure to exploit the ^ regular expression pattern for matching the start of a line. You could also write your own lexical analyzer from scratch, using existing regex support or derivatives of regular expressions. It's not unreasonable to write a lexer by hand for a language like Python. If you use a language that supports regular expressions, you can get the longest matching prefix of a string for the regular expression re with /^(re)/ . The special pattern ^ matches the start of the input (or in some cases, the start of a line). You may find it easier to write a two-pass lexical analyzer: the first pass could resolve significant whitespace, and the second could handle proper tokenization. For this project, using a functional language holds advantages over languages like C and Java, but it does not hold an advantage over a lexer generator like flex.

Example The input:

# Comment 1
# Comment 2
# Factorial:
def fact( x\
):
    if x == -1:
        return 1.j
    elif x ==0:
        return 1
    else:
        return x* fact(x - 1)
s = "foo\
\\
\'\""
fact(20)

should produce:

(KEYWORD def)
(ID "fact")
(PUNCT "(")
(ID "x")
(PUNCT ")")
(PUNCT ":")
(NEWLINE)
(INDENT)
(KEYWORD if)
(ID "x")
(PUNCT "==")
(PUNCT "-")
(LIT 1)
(PUNCT ":")
(NEWLINE)
(INDENT)
(KEYWORD return)
(LIT +1.i)
(NEWLINE)
(DEDENT)
(KEYWORD elif)
(ID "x")
(PUNCT "==")
(LIT 0)
(PUNCT ":")
(NEWLINE)
(INDENT)
(KEYWORD return)
(LIT 1)
(NEWLINE)
(DEDENT)
(KEYWORD else)
(PUNCT ":")
(NEWLINE)
(INDENT)
(KEYWORD return)
(ID "x")
(PUNCT "*")
(ID "fact")
(PUNCT "(")
(ID "x")
(PUNCT "-")
(LIT 1)
(PUNCT ")")
(NEWLINE)
(DEDENT)
(DEDENT)
(ID "s")
(PUNCT "=")
(LIT "foo\\
\'\"")
(NEWLINE)
(ID "fact")
(PUNCT "(")
(LIT 20)
(PUNCT ")")
(NEWLINE)
(ENDMARKER)

You may find the sample lexers written in flex to be useful.

Reference implementation A reference implementation is available online: pylex.

Submission instructions Copy your project to /home/unid/pylex/ on caprica.ucombinator.org. It should be possible to run your code with:

cd /home/unid/pylex
make
make run < test1.py > test1.py.out
make run < test2.py > test2.py.out
make run < test3.py > test3.py.out
...
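The significant-whitespace handling is the one part no lexer generator gives you for free. As a rough sketch of the standard indentation-stack algorithm, here is a hypothetical helper (not the reference pylex); it assumes logical lines are already split, uses only leading spaces, and emits a placeholder for each line's real tokens:

```python
def indent_tokens(lines):
    """Emit (INDENT)/(DEDENT)/(NEWLINE) s-expressions for a list of
    logical lines, using an indentation stack."""
    stack = [0]                               # indentation levels in force
    out = []
    for line in lines:
        if not line.strip():
            continue                          # blank lines are ignored
        width = len(line) - len(line.lstrip(' '))
        if width > stack[-1]:                 # deeper: one INDENT
            stack.append(width)
            out.append('(INDENT)')
        while width < stack[-1]:              # shallower: one DEDENT per level
            stack.pop()
            out.append('(DEDENT)')
        if width != stack[-1]:
            out.append('(ERROR "inconsistent dedent")')
        out.append(line.strip())              # stand-in for the real tokens
        out.append('(NEWLINE)')
    while stack[-1] > 0:                      # close any still-open blocks
        stack.pop()
        out.append('(DEDENT)')
    out.append('(ENDMARKER)')
    return out

print(indent_tokens(['def f():', '    return 1', 'f()']))
```

This is the first pass of the two-pass design suggested above; the second pass would replace the stand-in line with a real token stream (and would also have to handle line-continuation backslashes and open brackets when splitting logical lines).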

Project 2: A parser for Python Due: Wed, Feb 27, 11:59 PM Utah time

Your task is to construct a parser for the Python programming language. Unusually among programming languages, Python is straightforward to parse. In fact, it is one of the few popular programming languages with an LL(1) grammar, making it amenable to a hand-written recursive-descent parser. Of course, it is also possible to use a parsing tool.

Guidelines You can use any language you want for this project, but you must be able to run your program under Linux or Mac OS X. If the TA or I can't get your code to run, we'll ask you to help us get it running.

Requirements You must parse the simplified Python grammar. You should assume that Python source is coming in -- not pre-lexed input. If you need to use a working lexer, you can assume that the program pylex is in the current PATH, and that it matches the behavior of the web app by consuming a file on STDIN and printing its tokens as S-Expressions line-by-line on STDOUT. The output S-Expression should fit within the grammar for the concrete AST. Include a README.txt file that contains: your name; a list of software dependencies necessary to run your assignment; a description of what you got working, what's partially working and what's completely broken; and a manifest briefly describing each file you've turned in. Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.

The program should use a Makefile:
make should compile the program (if compilation is necessary).
make run should accept input on STDIN and send output to STDOUT.
make lex should run the lexer on STDIN.
make parse should run the parser on STDIN.
make test should run any test cases you used.

The program should output one s-expression containing the concrete parse tree on success or #f on failure.

Simplifications Use the simplified grammar provided.

Hints and tips Do not write the parser by hand, unless you have a lot of time.
I recommend that you use yacc, Racket's parser generator, or a derivative-based parsing tool. Check out the documentation for derp. If you want to write your lexer and parser in different languages, you can tie them together with a Unix pipe, like so:

$ pylex | pyparse

You can complete the entire assignment by modifying only the python-ast.grm.sx file provided with the stub code. You just add derp reductions to the rules in this file. Add a -n flag to pylex that suppresses the ENDMARKER token. To target subsets, change file_input in pyparse.rkt to the name of the production you want to target.

Code A reference web app is available. You may find the sample parsers useful. If you take the Racket route, stub code is available for parsing with derivatives.

Submission instructions Copy your project to /home/unid/pyparse/ on caprica.ucombinator.org. It should be possible to run your code with:

cd /home/unid/pyparse
make
make run < test1.py > test1.py.out
make run < test2.py > test2.py.out
make run < test3.py > test3.py.out
pylex < test1.py | make parse > test1.py.out
pylex < test2.py | make parse > test2.py.out
pylex < test3.py | make parse > test3.py.out
...
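To see why an LL(1) grammar invites recursive descent, here is a minimal sketch over a hypothetical toy arithmetic grammar (not the simplified Python grammar): one function per production, with a single token of lookahead choosing the branch.

```python
# Toy LL(1) grammar:
#   expr := term (('+' | '-') term)*
#   term := NUM | '(' expr ')'
# Each parse function takes a token list and an index, and returns
# (tree, next-index).

def parse_expr(toks, i):
    tree, i = parse_term(toks, i)
    while i < len(toks) and toks[i] in ('+', '-'):   # one token of lookahead
        op = toks[i]
        rhs, i = parse_term(toks, i + 1)
        tree = (op, tree, rhs)                       # left-associative
    return tree, i

def parse_term(toks, i):
    if toks[i] == '(':                               # lookahead picks the branch
        tree, i = parse_expr(toks, i + 1)
        assert toks[i] == ')', "expected ')'"
        return tree, i + 1
    return int(toks[i]), i + 1                       # otherwise: NUM

tree, _ = parse_expr(['(', '1', '+', '2', ')', '-', '3'], 0)
print(tree)   # ('-', ('+', 1, 2), 3)
```

The LL(1) property is exactly what makes the single `toks[i]` test sufficient at every choice point; a parser for the full simplified Python grammar is the same shape, just with many more productions.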

Project 3: A high-level translator for Python Due: Wed, March 27, 11:59 PM Utah time

Your task is to construct a translator from the subset of Python from Project 2 into a high-level intermediate representation (HIR). The high-level IR shares many features with Python, including dynamic typing and higher-order procedures. To focus on key concepts, some additional parts of Python have been removed for this project.

Guidelines You can use any language you want for this project, but you must be able to run your program on the class server. (Non-functional languages are ill-advised unless you have copious free time.) You will have a chance to slightly modify your assignment after submission to comply with the auto-grader.

Requirements You must translate the same simplified subset of Python from the previous project, but assume it is in the output format of project 2. If you need to use a working lexer, you can assume that the program pylex is in the current PATH, and that it matches the behavior of the web app by consuming a file on STDIN and printing its tokens as S-Expressions line-by-line on STDOUT. If you need to use a working parser, you can assume that the program pyparse is in the current PATH, and that it matches the behavior of the reference app by consuming a tokenized input on STDIN and printing the AST on STDOUT. The output should fit within the grammar for HIR. Include a README.txt file that contains: your name; a list of software dependencies necessary to run your assignment; a description of what you got working, what's partially working and what's completely broken; and a manifest briefly describing each file you've turned in. Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.

The program should use a Makefile:
make should compile the program (if compilation is necessary).
make trans should accept the output of project 2 on STDIN and send output to STDOUT.
make run should accept a Python program on STDIN and send output to STDOUT.
make test should run any test cases you used.

The program should output one S-expression containing the HIR code on success or print an error message containing the word error on failure. Your output need not match the reference implementation exactly. It should be semantically equivalent (unless the reference implementation has a bug). You can also test your output against a Python 3.x interpreter, but note that these may differ in how they display objects using print .

Simplifications Since project 2 snipped out the class system, you can't create objects, but you should still assume that you can access and set their fields with .name . (Note: The code to interpret HIR fails immediately with an error message if these constructs are invoked.) For numbers, assume only integers are used in the source and at run-time. (No floating point, no imaginary, no complex.) All try blocks have exactly one except clause, and no else or finally clauses, i.e.:

try:
    main code
except:
    failure code

A raise can raise any value (not just exceptions). The single except clause for a try block does not specify which exception it catches, nor does it name it. You cannot delete a variable from the local scope. You can only delete fields and indices. That is, you can write:

del a[i]

or

del a.f

but not:

del x

Of course, you could also write:

del f(3)["foo"][i]

You can assume all global variables are defined at the top level before their first reference. If you want to handle global variables first defined locally using the global construct, talk to Matt. That is, you have to handle:

y = 10
def f(x):
    global y
    y = x

but not:

def f(x):
    global y
    y = x

Hints and tips In this project, the expressive advantage of using a functional language like Racket or Haskell over a language like C or Java starts to become substantial.
Those wishing to complete this project in a non-functional language need to start immediately and ask for help early! A prominent feature in functional languages is the ability to pattern-match on data structures. This ability is the killer advantage in this project. If you want to write your lexer, parser and translator in different languages, you can tie them together with a Unix pipe, like so:

$ pylex | pyparse | ./pytrans

Break the transformation down into a transformer for each kind of Python term:
let transform-program transform an entire Python program into HIR;
let transform-stmt transform an individual Python statement into HIR;
let transform-exp transform an individual Python expression into HIR.

Use only very small programs for testing (initially). Start with print(3) and work up from there. Examine the output of the reference implementation on these small programs to unravel its logic.

Code This code is provided without warranty or guarantee of correctness. If you find a discrepancy between this code and the project description, notify Matt. A reference web application is available. To execute HIR code, you can prepend hir-header.rkt to the result of your translation, and then execute it in racket. ( hir-header.rkt provides macros and functions to define all of the elements of HIR in terms of Racket.) If you want, you can start the project with stub code; this file illustrates a common design pattern for compilers--a transformation function for each class of term. In this case, it contains transform-program , transform-stmt and transform-exp . You can grab a reference implementation from pytrans_rkt.zo. It's compiled Racket bytecode for Racket 5.3.1. To run, use racket pytrans_rkt.zo and send input on STDIN.

Submission instructions Copy your project to /home/unid/pytrans/ on caprica.ucombinator.org.
It should be possible to run your code with:

cd /home/unid/pytrans
make
make run < test1.py > test1.py.out
make run < test2.py > test2.py.out
make run < test3.py > test3.py.out
pylex < test1.py | pyparse | make trans > test1.py.out
pylex < test2.py | pyparse | make trans > test2.py.out
pylex < test3.py | pyparse | make trans > test3.py.out
...
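The transformer-per-term-class pattern described above can be sketched as follows (a hypothetical miniature AST and HIR, not the course's actual grammars): each transformer dispatches on the head of a tuple-encoded term and recurs on its subterms.

```python
def transform_stmt(stmt):
    """Translate one miniature 'Python' statement into a HIR-like form."""
    head = stmt[0]
    if head == 'assign':                     # ('assign', name, exp)
        _, name, exp = stmt
        return ('set!', name, transform_exp(exp))
    if head == 'while':                      # ('while', test, [stmt, ...])
        _, test, body = stmt
        return ('while', transform_exp(test),
                ('begin',) + tuple(transform_stmt(s) for s in body))
    raise ValueError('unknown statement: %r' % (stmt,))

def transform_exp(exp):
    """Translate one miniature expression."""
    if isinstance(exp, (int, str)):
        return exp                           # atoms pass through unchanged
    if exp[0] == 'call':                     # ('call', f, [arg, ...])
        _, f, args = exp
        return ('app', transform_exp(f)) + tuple(transform_exp(a) for a in args)
    raise ValueError('unknown expression: %r' % (exp,))

print(transform_stmt(('assign', 'x', ('call', 'f', [1, 2]))))
# ('set!', 'x', ('app', 'f', 1, 2))
```

In Racket or Haskell each `if head == ...` branch becomes one pattern-match clause, which is exactly the expressiveness advantage the hints describe; growing the translator means adding one clause per Python construct.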

Project 4: A CPS translator for Python Due: Wed, April 10, 11:59 PM Utah time

Your task is to construct a translator from the subset of Python in Project 3 into continuation-passing style (CPS). Of course, the expected way of accomplishing this is to translate HIR into CPS. It is tedious to translate HIR directly to CPS. It is strongly recommended that you push HIR through a desugaring phase--dropping it down to LIR first.

Guidelines You can use any language you want for this project, but you must be able to run your program under Linux or Mac OS X. Once again, non-functional languages are at a steep expressiveness disadvantage and will require substantially more time and code. Please plan and request help accordingly. If the TA or I can't get your code to run, we'll ask you to help us get it running.

Requirements You must translate the same simplified subset of Python from Project 2, less the additional restrictions from Project 3. You should assume that Python source is coming in, not pre-lexed, pre-parsed or pre-translated input. If you need to use a working lexer, you can assume that the program pylex is in the current PATH, and that it matches the behavior of the web app by consuming a file on STDIN and printing its tokens as S-Expressions line-by-line on STDOUT. If you need to use a working parser, you can assume that the program pyparse is in the current PATH, and that it matches the behavior of the web app by consuming a tokenized input on STDIN and printing the AST on STDOUT. If you need to use a working HIR translator, you can assume that the program pytrans is in the current PATH, and that it matches the behavior of the bytecode app by consuming a Python AST on STDIN and printing the HIR AST on STDOUT. The output should fit within the grammar for CPS.
Include a README.txt file that contains: your name; a list of software dependencies necessary to run your assignment; a description of what you got working, what's partially working and what's completely broken; and a manifest briefly (a few words) describing each file you've turned in. Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.

The program should use a Makefile:
make should compile the program (if compilation is necessary).
make run should accept Python code on STDIN and send CPS code to STDOUT.
make test should run any test cases you used.

The program should output one S-expression containing the CPS code. The output for invalid input is undefined, but it is recommended that you print an error message containing the word error on failure. Your output need not match the reference implementation exactly. It should be semantically equivalent (unless the reference implementation has a bug). When CPS-converting, the continuation argument always comes last, even for primitive operations. You can also test your output against a Python 3.x interpreter.

Simplifications All object- and field-related operations have been removed from the language.

Hints and tips This project is optimally split into two phases: desugaring and CPS conversion. It is strongly recommended that you first translate HIR into LIR, and then translate LIR to CPS. Even if you don't understand the define-syntax form in detail, you may learn a great deal about how to desugar from hir-header.rkt . This is particularly true for the desugaring of try . A substantial harness is provided in the stub code. Even if you don't plan to use Racket, it is recommended that you look at the harness. In particular, try:

$ make test-cps
$ make test-lir

The stub code contains a desugarer with all of the cases stubbed out. It also contains the "top level" for the reference CPS converter.
Pattern-matching tree transforms once again dominate this project. Reading material Of the reading material for the course, the following are most helpful: Continuation-passing style (wikipedia)

Programming with continuations by example

Continuation-passing style by example

On One-Pass CPS Transformations (the higher-order transform)

Code This code is provided without warranty or guarantee of correctness. If you find a discrepancy between this code and the project description, notify Matt. To execute LIR code, you can prepend lir-header.rkt to the result of your translation, and then execute it in racket. To execute CPS code, you can prepend cps-header.rkt to the result of your translation, and then execute it in racket; for example, you could do:

$ make run < myfile.py > myfile.cps
$ cat cps-header.rkt myfile.cps > myfile.rkt
$ racket myfile.rkt

You can grab a reference desugarer from pydesugar_rkt.zo. You can grab a reference CPS converter from pycps_rkt.zo. These are compiled bytecode for Racket 5.3.

Submission instructions Copy your project to /home/unid/pycps/ on caprica.ucombinator.org. It should be possible to run your code with:

cd /home/unid/pycps
make
make run < test1.py > test1.py.cps
make run < test2.py > test2.py.cps
make run < test3.py > test3.py.cps
pylex < test1.py | pyparse | pytrans | make cps > test1.py.cps
pylex < test2.py | pyparse | pytrans | make cps > test2.py.cps
pylex < test3.py | pyparse | pytrans | make cps > test3.py.cps
...
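To give a flavor of the conversion itself, here is a minimal CPS converter for a hypothetical miniature expression language (not the course's LIR), written in the naive first-order style, with the continuation always passed last as the project requires.

```python
import itertools

# Fresh-variable supply for continuation parameters.
fresh = ('$k%d' % n for n in itertools.count())

def cps(exp, k):
    """Convert exp to CPS. k is the syntactic continuation to receive
    exp's value; in every application it appears as the last argument."""
    if isinstance(exp, (int, str)):
        return (k, exp)                         # atoms: hand the value to k
    if exp[0] == 'call':                        # ('call', f, arg)
        _, f, arg = exp
        fv, av = next(fresh), next(fresh)
        # Evaluate f, then arg, then apply with k as the final argument.
        return cps(f, ('lambda', (fv,),
                   cps(arg, ('lambda', (av,),
                       (fv, av, k)))))
    raise ValueError('unknown expression: %r' % (exp,))

result = cps(('call', 'f', 42), 'halt')
print(result)
```

This naive version introduces administrative redexes (lambdas applied immediately to atoms); the one-pass higher-order transform from the reading above eliminates them by representing continuations as metalanguage functions instead of syntax.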