I was convinced the idea of parallelizing and distributing Scala typechecking was an interesting project to hack on, and I was hoping somebody else would take up the baton. After a month of waiting I realized “Oh wait, there are around fifteen people in the world who can readily run an experiment on a different implementation of Scala typechecking and I know all of them”. And I was the only hobo in that circle; everybody else was busy with their projects.

In my previous blog post I described the idea of computing outline types that would carry just enough information to figure out an efficient plan for parallelizing and distributing full Scala typechecking. You can think of computing outline types as a pre-typechecking phase. The big unknown to me was: can pre-typechecking be done in a fraction of the time it takes to fully typecheck a Scala program? If pre-typechecking turned out to be too slow, then my idea from the blog post would be a flop.

Scala typechecking is a complicated problem and it’s a labour-intensive area to experiment in. The prospect of investing several weeks of work into prototyping my idea and having just a negative validation as a result didn’t sound exciting. I knew that the negative outcome was the most likely one: I’ve been in the business of speeding up Scala compilation since 2012, first working with Martin Odersky and later on my own.

I was looking for ways to hedge my risk. I realized I could reframe my problem into a winning-only, twofold question:

How to build a highly parallel typechecker for Scala, or does Scala have a fundamental language design flaw that prevents one from being built?

Either the Scala community would win by learning about a path towards the high-performance Scala typechecker that anybody programming in Scala wants, or there would be a lesson of high interest to the entire programming language community about what kind of language feature to avoid when designing a programming language.

Armed with two different sets of lenses for looking at my challenge, I capped myself at 15 days of full-time work to investigate.

Outline types are cheap

I’ll kill the suspense right away: Kentucky Mule pre-typechecks sources at a speed of 4.4 million lines of Scala code per second on a warm JVM. I got the units right; it’s that fast. On a cold JVM, it can process the entire scalap sources (2k lines of code) and summarize API-level dependencies in around 600ms. This time includes the JVM’s startup time, loading Kentucky Mule’s bytecode, parsing and the actual processing.

I’ve got some good news for all Scala programmers out there:

Scala does not come with a language design flaw that prevents a fast compiler from being written for it.

The pre-typechecking Kentucky Mule does is a small subset of the real typechecking the Scala compiler performs. It’s natural to ask: does Kentucky Mule have a better architecture and implementation, or is it simply cheating by not doing the same work as the Scala compiler does?

The Kentucky Mule repo on GitHub contains a simple benchmark that brings both implementations very close to each other in terms of the work they need to do. In that benchmark the Scala source is very simple and Kentucky Mule’s pre-typechecking is essentially the same as real typechecking. Kentucky Mule performs at a speed of 15 million lines of code per second and the Scala compiler at 7 thousand lines of code per second. In this specific case, Kentucky Mule is over three orders of magnitude faster than the Scala compiler.

The performance gap is so large that it’s interesting to explore what I did differently in Kentucky Mule.

Different architecture

When I was prototyping the basics of the typechecking in Kentucky Mule, I departed from the typical way of computing types and handling the dependencies between them.

Many compilers have a concept of a completer designated as a fundamental unit of computation in the typechecking process. The Scala compiler, the dotty compiler and the Java compiler all have a concept of a type completer at the center of their typechecking implementation.

A type completer is a deferred (lazy) computation that is triggered when someone asks for the type of an entity. For example, if a method call is typechecked, the typechecker will ask the symbol representing the method for its type. If this particular symbol hasn’t been asked for its type before, the associated type completer will be executed and the computed type will be stored. Once a type is completed, the associated type completer is discarded.
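A minimal sketch of that mechanism, for illustration only. The names `Sym`, `Type` and `Completer` are hypothetical, heavily simplified stand-ins and not the classes used by the Scala, dotty or Java compilers.

```scala
// Hypothetical, simplified stand-ins; not the real compiler classes.
final case class Type(name: String)

trait Completer {
  def complete(sym: Sym): Type
}

final class Sym(val name: String, private var completer: Completer) {
  private var cached: Type = null

  def isCompleted: Boolean = cached != null

  // Asking for the type runs the completer on the first request only.
  def info: Type = {
    if (cached == null) {
      cached = completer.complete(this)
      completer = null // the completer is discarded once the type is stored
    }
    cached
  }
}

object CompleterDemo {
  def main(args: Array[String]): Unit = {
    val intSym = new Sym("Int", new Completer {
      def complete(sym: Sym): Type = Type("Int")
    })
    // Completing `length` asks another symbol for its type,
    // which may in turn force that symbol's completer.
    val lengthSym = new Sym("length", new Completer {
      def complete(sym: Sym): Type = Type("() => " + intSym.info.name)
    })
    println(intSym.isCompleted)  // nothing has asked for Int's type yet
    println(lengthSym.info.name) // forces both completers
    println(intSym.isCompleted)
  }
}
```

Note how the order in which types get computed is never spelled out anywhere: it falls out of whoever happens to call `info` first.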

At a very high level, typechecking consists of setting up the right completers for symbols and then forcing all completers according to the dynamic graph of dependencies between types and other entities in the typechecked program.

At first sight, the idea of having lazy computations for all steps of the typechecking process looks elegant. However, on closer inspection, completers as implemented in Scala and Dotty turn out to be problematic. Let me list a few issues.

Leaky abstraction

The completer is supposed to hide the order in which types are computed. You just ask an entity for its type and either it has it already, or it will be computed for you and you don’t need to think about it. However, types in any non-trivial programming language permit many kinds of cyclic references. What happens when you have a cyclic dependency between completers? You get either some form of infinite recursion (usually observed as a StackOverflowError) or logical errors when one of the completers observes a half-baked type and derives wrong results based on a wrong state of its dependencies. This issue is real and is a life-sustaining source of bugs. Here’s one instance of code that tries to handle cycles in Dotty:

The long comment refers to implementation details of other completers (we don’t even know which ones exactly). Type completers as simple deferred computations do not offer any mechanism for dealing with cycles. Ad-hoc measures are invoked instead, and they cause the type completer abstraction to leak.
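To make the failure mode concrete, here is a minimal sketch, unrelated to the actual Dotty code, of two completers that depend on each other. The `inProgress` flag and the `<cyclic:...>` placeholder are assumptions made up for this example; they stand in for the kind of ad-hoc guard a completer might use. Without the guard, `a.info` and `b.info` would call each other until a StackOverflowError.

```scala
object CycleDemo {
  // Hypothetical, simplified symbol whose "type" is just a String.
  final class Sym(val name: String) {
    var completer: Sym => String = null
    private var tpe: String = null
    private var inProgress = false

    def info: String = {
      if (tpe != null) tpe
      else if (inProgress) "<cyclic:" + name + ">" // ad-hoc escape hatch
      else {
        inProgress = true
        try { tpe = completer(this) } finally { inProgress = false }
        completer = null
        tpe
      }
    }
  }

  // Builds two symbols whose completers depend on each other.
  def freshPair(): (Sym, Sym) = {
    val a = new Sym("A")
    val b = new Sym("B")
    a.completer = _ => "A extends " + b.info
    b.completer = _ => "B extends " + a.info
    (a, b)
  }

  def main(args: Array[String]): Unit = {
    val (a, b) = freshPair()
    println(a.info) // forcing A first: B observes A while A is half-baked
    println(b.info) // B keeps the type it derived from that half-baked state
  }
}
```

The guard avoids the crash, but the stored results now depend on which symbol was asked first: forcing `b` before `a` on a fresh pair yields different types for both. That order dependence is exactly what the completer abstraction was supposed to hide.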

Deep stacktraces

The design of type completers triggering other type completers and blocking on the result of the dependent computation promotes deep stack traces. Each step of forcing a type completer adds at least a few method calls to the stack. It’s common to see very deep stack traces while Scala’s typechecking is running.

Deep method call stacks are not handled well by the JVM’s JIT compiler. This results in suboptimal runtime performance.
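A small sketch of why the stack grows, under the simplifying assumption that each symbol has at most one dependency. `Sym`, `Stats` and `chain` are hypothetical names invented for this example. Each completer blocks inside `force` until its dependency is done, so a chain of n dependent symbols needs n nested calls.

```scala
object DepthDemo {
  final class Stats { var maxDepth = 0 }

  final class Sym(val id: Int) {
    var dependsOn: Sym = null // hypothetical single dependency
    private var done = false

    // Forcing a symbol first forces its dependency, one stack frame deeper.
    def force(depth: Int, stats: Stats): Unit =
      if (!done) {
        stats.maxDepth = math.max(stats.maxDepth, depth)
        if (dependsOn != null) dependsOn.force(depth + 1, stats)
        done = true
      }
  }

  // Builds sym0 -> sym1 -> ... -> sym(n-1) and returns sym0.
  def chain(n: Int): Sym = {
    val syms = Array.tabulate(n)(i => new Sym(i))
    for (i <- 0 until n - 1) syms(i).dependsOn = syms(i + 1)
    syms(0)
  }

  def maxDepthFor(n: Int): Int = {
    val stats = new Stats
    chain(n).force(1, stats)
    stats.maxDepth
  }

  def main(args: Array[String]): Unit =
    println(maxDepthFor(1000)) // nesting depth equals the chain length
}
```

In a real typechecker each level contributes several frames rather than one, so the call stack is a multiple of the dependency depth.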

Measuring performance is hard

The profiling tools for the JVM do not handle deep stack traces well either. The Java tooling for optimizing performance can be thrown out of the window because it doesn’t produce any actionable results for the Scala compiler or any other program written in a style resembling the Scala compiler.

A large graph of lazy computations makes performance accounting hard to do: you don’t know whether the completer you’re looking at is slow itself, or whether one of the thousand completers it happens to force is slow.

Unpredictable performance model

A programmer working with type completers has a hard time building a mental model for the performance characteristics of the code she is writing.

When an inquiry for a type can take anywhere between 1ms and 1s depending on how large a portion of the completer graph is forced, it’s the right time to give up on reasoning about the code’s performance.

If a programmer observes her completer waiting on another entity that takes 1 second to compute its type, she has no tools to help her determine whether it’s a genuine performance bug or an accidental order of computation she is observing.
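A sketch of that effect, counting forced completers instead of measuring wall-clock time so the result is deterministic. The graph shape (one `big` symbol with 1000 leaf dependencies, and a `small` symbol that depends on `big`) and all names are assumptions made up for this example.

```scala
object CostDemo {
  final class Sym(val name: String, val deps: List[Sym]) {
    private var done = false

    // Returns how many completers this single request had to run.
    def force(): Int =
      if (done) 0
      else {
        val forcedBelow = deps.map(_.force()).sum
        done = true
        forcedBelow + 1
      }
  }

  // Returns (small, big): `small` depends on `big`, `big` on 1000 leaves.
  def fresh(): (Sym, Sym) = {
    val leaves = List.tabulate(1000)(i => new Sym("leaf" + i, Nil))
    val big = new Sym("big", leaves)
    val small = new Sym("small", List(big))
    (small, big)
  }

  def main(args: Array[String]): Unit = {
    val (small1, big1) = fresh()
    println(small1.force()) // small asked first: pays for the whole graph
    println(big1.force())   // big is already completed

    val (small2, big2) = fresh()
    println(big2.force())   // big asked first: big pays for the leaves
    println(small2.force()) // the same completer for small is now cheap
  }
}
```

The completer for `small` does the same trivial work in both runs, yet the cost attributed to it differs by three orders of magnitude depending on who asked first.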