John Regehr recently solicited advice for what an introductory compilers class should cover in 2020. To save you the trouble of reading through the whole thread, I’ll quote/summarize a bit before throwing in my two cents.

Parsing’s definitely not the coolest part of compilation but it is important. I’d want students to understand the following about parsing:

Various approaches exist: hand-rolled, generated, combinators, PEGs. Focus on parser generators, and give coverage to the others with an eye towards the tradeoffs involved. Err towards broad coverage, because faced with a parsing task, they should understand both what they can do and what others might do.

The approaches come with tradeoffs about up-front work versus automated checking/guarantees (much like type systems). Parser generators sometimes complain for silly-seeming reasons (do I care if my grammar is k=1 vs k=2?) but can also catch actual unintended ambiguity in grammars. PEGs and combinators eliminate annoying complaints about grammar structure, but sometimes by silently choosing the “wrong” parse tree.

Having a specification distinct from implementation is appealing for parsers just as with any other domain.

The choice between LL/LALR/LR is (in practice) mostly an implementation detail that affects how the input grammar needs to be structured.

LL parsers correspond directly to the sort of recursive descent parser they might write by hand. I sometimes find it handy to reason about grammar conflicts by thinking about what “goes wrong” in the corresponding recursive descent code. (“Oh, when the parser sees symbol X, it doesn’t know if it should start parsing production Y or production Z”). LR parsers are recursive ascent, which I (still) find to be a trickier mental model when debugging grammar conflicts. I’d want students to get some exposure to shift/reduce vs reduce/reduce errors but I’m not sure quite how much.
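As a concrete instance of that reasoning, here’s a minimal recursive descent sketch (the grammar, token shapes, and `parse_stmt` helper are my own invented example, not from any tool) of a conflict where one token of lookahead isn’t enough:

```python
# With the grammar  stmt -> ID '=' expr | ID '(' ')'  both productions
# start with ID, so an LL(1) parser "doesn't know if it should start
# parsing production Y or production Z". The hand-written fix below
# peeks two tokens ahead -- i.e. the grammar is LL(2), not LL(1).

def parse_stmt(tokens, pos=0):
    """Return ('assign', name, value) or ('call', name), plus the next position."""
    kind, text = tokens[pos]
    assert kind == "ID", f"expected identifier, got {kind}"
    # Looking only at tokens[pos] (always an ID here) cannot distinguish
    # the two productions; peeking at tokens[pos + 1] resolves it.
    next_kind = tokens[pos + 1][0]
    if next_kind == "EQ":
        _, value = tokens[pos + 2]
        return ("assign", text, value), pos + 3
    elif next_kind == "LPAREN":
        assert tokens[pos + 2][0] == "RPAREN"
        return ("call", text), pos + 3
    raise SyntaxError(f"unexpected token {next_kind} after identifier")

tree, _ = parse_stmt([("ID", "x"), ("EQ", "="), ("NUM", "1")])
print(tree)  # ('assign', 'x', '1')
```

A parser generator would report the same problem up front as a conflict on the ID token, which is exactly the correspondence worth showing students.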

Parsers for programming languages don’t need to be 100% strict in what language they accept; it’s often better to shift work to the rest of the compiler. (Most(ly) useful for non-hand-rolled approaches). For example, instead of encoding operator precedence in the structure of the grammar itself, the generated parser can build flat lists of operator expressions which then get structured with a separate operator precedence pass. Another example is that by unifying certain lexical categories, such as number and identifier, errors for incorrect programs come from the compiler rather than the parser generator. A third example is the mini-syntax for strings, with escapes and interpolations and such.
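The operator precedence example can be made concrete with a few lines of precedence climbing (my own sketch; the token list shape and the `PREC` table are assumptions, not from any particular compiler):

```python
# The parser proper emits a flat alternating list of operands and
# operator tokens, e.g. [1, "+", 2, "*", 3]; this separate pass
# structures it into a tree via precedence climbing.

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}  # assumed precedence table

def climb(items, pos, min_prec):
    """Build a tree from items[pos:], consuming operators with precedence >= min_prec."""
    lhs = items[pos]
    pos += 1
    while pos < len(items) and PREC[items[pos]] >= min_prec:
        op = items[pos]
        # PREC[op] + 1 makes operators of equal precedence left-associative.
        rhs, pos = climb(items, pos + 1, PREC[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs, pos

def structure(flat):
    tree, _ = climb(flat, 0, 1)
    return tree

print(structure([1, "+", 2, "*", 3]))  # ('+', 1, ('*', 2, 3))
```

Keeping this out of the grammar also means adding a new operator is a one-line table change rather than a grammar restructuring.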

I think it’s illustrative to show how e.g. ANTLR’s EBNF-style syntax sugar for grammars, which I find quite natural, can be desugared. Definitely a “yo dawg” moment.
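As one concrete instance (a toy grammar of my own, not taken from ANTLR’s documentation), the `*` repetition sugar desugars into an explicit recursive production:

```
// EBNF-style rule with repetition sugar (hypothetical grammar)
args : expr (',' expr)* ;

// Desugared: the starred group becomes a recursive helper rule
args     : expr argsRest ;
argsRest : ',' expr argsRest
         |                     // empty
         ;
```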

Parsing a single well-defined language is (mostly) a solved problem with strong (and very pretty!) underlying theories, but parsing as a whole is not completely solved. Example: embedding/composing languages is common, especially on the web, but support in existing tools is not great so it’s mostly done by hand.

Unparsing (both serialization in general, and also faithful pretty-printing) has a much larger gap between theory and practice. Automated formatters are common for Go and Rust and C++, but there’s not much research/theory or tooling support for doing it well.

For the rest of the compiler, the pass-based architecture with a few judiciously chosen IRs seems like an obvious choice. A few more specific thoughts:

The details of abstract interpretation and Futamura projections are for sure advanced material, but I think intro students should get some intuition of the connection between compilation & interpretation, and how static analysis approximates dynamic semantics. It doesn’t have to be much: take a concrete snippet of code and show how the same operations are happening just at different phases/times.

Show how changing (properties of) the input language can make the compiler’s job significantly easier or harder. Potential examples: the difficulty of parallelization and vectorization for Fortran vs C vs NESL, or modularity and dependency management in C vs Go, or error messages for a language with local vs global type inference.
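The “same operations at different phases/times” point fits in a few lines. This is my own toy example (an `add`-only expression language), not a fragment of any real compiler: constant folding at compile time performs exactly the addition the interpreter would perform at run time.

```python
def interpret(expr, env):
    """Run-time evaluation of ('add', a, b) trees; leaves are ints or variable names."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return env[expr]
    _, a, b = expr
    return interpret(a, env) + interpret(b, env)

def fold(expr):
    """Compile-time pass: perform the additions whose operands are already known."""
    if not isinstance(expr, tuple):
        return expr
    op, a, b = expr
    a, b = fold(a), fold(b)
    if isinstance(a, int) and isinstance(b, int):
        return a + b  # the very addition interpret() would otherwise do later
    return (op, a, b)

prog = ("add", ("add", 2, 3), "x")
print(fold(prog))                  # ('add', 5, 'x')
print(interpret(prog, {"x": 10}))  # 15
```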

One concern that looms larger than in the past is latency, especially in the distributed systems context. The most obvious connection is with the runtime and GC, but the compiler can also play a role; many languages now feature generators, coroutines, or async/await, with compiler support.
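The flavor of that compiler support can be shown with a hand-lowered sketch (my own example, not any real compiler’s output): each `yield` in a generator becomes a state, and resuming jumps to the code after the last suspension point.

```python
# Source-level generator being lowered:
#   def pair():
#       yield 1
#       yield 2

class PairStateMachine:
    """State 0 = not started, 1 = after first yield, 2 = finished."""

    def __init__(self):
        self.state = 0

    def resume(self):
        if self.state == 0:
            self.state = 1    # suspend at the first yield
            return 1
        if self.state == 1:
            self.state = 2    # suspend at the second yield
            return 2
        raise StopIteration   # ran past the end of the body

g = PairStateMachine()
print(g.resume(), g.resume())  # 1 2
```

Async/await lowerings are the same trick with the scheduler, rather than the caller, deciding when to resume.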

Optimizations are fun, especially “heroic” optimizations, but they’re also a nice confluence of several camps/viewpoints: “let’s do it cuz it’s cool” versus “is the complexity worth it” versus “is the compile time hit worth it” versus “does this make the performance model more fragile”. And, of course, they’re a great starting point for exploring the ways in which modern architectures differ from historical assumptions.

I think it’s interesting that optimizations can have negative runtime cost, by reducing the workload on the remainder of the compiler (similar story with GC: the overhead of GC barriers can be more than offset by reducing the GC’s workload). Chez enforced this “pay their own way” rule.

Ultimately, a big question is what not to cover in an intro class. I’d go so far as to defer most “backend” issues, including data flow analysis (!) to an advanced class. Likewise, ignore pretty much all details related to concurrency and parallelism. It’s a tough balance: give enough information to be enticing without going too deep or spreading too thin. I’d also avoid CYK/Earley/GLR/derivative-based parsers and most details of LR parsing’s internals. Coverage of lexing probably depends on what they’ve seen in theory of computation.

In my undergrad software engineering class, we ended up using ANTLR to parse a very ad-hoc DSL. In retrospect, the experience was valuable precisely because it was less clear-cut how to separate the input into lexical vs syntactic categories. Not super actionable, I suppose, but to the extent that students will go on to do parsing, they probably won’t be given a clean language design to start from.

An intro compilers class shouldn’t get too into the weeds on “adjacent” topics like type checking, but it would be nice to convey some of the range of what can be done with static analysis: substructural typing in Rust, definite assignment in Java/Swift, Extended Static Checking, maybe a whiff of SMT-based refinement type checking (Refined TypeScript?). Another nice example of “things you can do with static analysis but not so much with runtime checks” could be static enforcement of constant timing (e.g. for cryptographic code). Students should also understand that it’s become more common to separate type checking from compilation. For example, JavaScript has tools that do compilation without type checking (Babel, webpack), checking without compilation (ESLint), and both together (TypeScript).
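Definite assignment in particular is small enough to sketch. The statement shapes and rules below are my own toy version, not Java’s or Swift’s actual specification: a variable counts as definitely assigned after an `if` only when both branches assign it.

```python
# Statements: ("assign", name), ("use", name), ("if", then_stmts, else_stmts).

def check(stmts, assigned=frozenset()):
    """Return the set of definitely-assigned names; raise on a possibly-unassigned use."""
    for stmt in stmts:
        if stmt[0] == "assign":
            assigned |= {stmt[1]}
        elif stmt[0] == "use":
            if stmt[1] not in assigned:
                raise ValueError(f"{stmt[1]} may be used before assignment")
        elif stmt[0] == "if":
            then_set = check(stmt[1], assigned)
            else_set = check(stmt[2], assigned)
            assigned = then_set & else_set  # must hold on *both* paths

    return assigned

ok = [("if", [("assign", "x")], [("assign", "x")]), ("use", "x")]
print(sorted(check(ok)))  # ['x']

bad = [("if", [("assign", "y")], []), ("use", "y")]
try:
    check(bad)
except ValueError as e:
    print(e)  # y may be used before assignment
```

The intersection over branches is the whole trick, and it’s a gentle on-ramp to the join operations students would meet in a full data flow analysis later.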

Likewise, this is too far afield to spend much time on, but it might be worth taking a few minutes from a lecture to give a high level overview of how compilers integrate with modern software engineering practice and infrastructure: CI/CD, coverage-guided fuzzing, hermetic builds, massively parallel builds, distributed tracing, etc.

Bootstrapping is a fun puzzle, and meshes well with Reflections on Trusting Trust.