18 months ago I wrote about some work I did to speed up the Rust compiler (rustc). I’ve recently taken this work up again. Also, in the meantime rustc’s build system has been replaced and its benchmark suite has been overhauled. So it’s a good time for an update.

Getting the code

The steps for getting the rustc code haven’t changed. First, I fork the main Rust repository on GitHub. Then I make two local clones: a base clone that I won’t modify, which serves as a stable comparison point (rust0), and a second clone where I make my modifications (rust1). I use commands something like this:

user=nnethercote for r in rust0 rust1 ; do git clone https://github.com/$user/rust $r cd $r git remote add upstream https://github.com/rust-lang/rust git remote set-url origin git@github.com:$user/rust done

Building the Rust compiler

The compiler’s build system is complex with many possible invocations and configurations. I’ll cover the absolute minimum information required to understand how I’ve been using it.

(UPDATE, December 2019: I have updated this to account for config.toml changes since this post was written.)

First, you need a config.toml file, which sits at the root of the repository and dictates the compiler’s configuration. I used the provided config.toml.example as a starting point, cut it down a lot, and ended up with the following.

# Last updated in December 2019 [llvm] release-debuginfo = true [rust] debuginfo-level = 1

This configuration matches the compiler that ships (in particular, in terms of optimization levels), with the exception that it has maximal debug info present, which ensures that profiles are as detailed as possible.

Building rustc can be confusing, particularly because of the multiple compiler stages and the terminology around them. Here is the command I use.

(UPDATE, September 2020: I have updated this to account for the src directory being renamed compiler .)

./x.py build --stage 2 compiler/rustc

This produces a stage 2 compiler that can handle procedural macros and so is suitable for profiling the benchmark suite.

The benchmark suite

rustc’s benchmark suite is called rustc-perf. It consists of two parts:

collector: The 24 benchmark programs, 14 of which are “real” code (i.e. crates used in real applications), and 10 of which are toy programs or stress test microbenchmarks. Also, the harness code that runs and measures them.

site: Code for displaying measurements as a website.

The test harness is very thorough. For each benchmark, it measures Debug, Opt, and Check (no code generation) invocations. Furthermore, within each of those categories, it does five or more runs, including a normal build and various kinds of incremental builds. A full benchmarking run measures over 400 invocations of rustc. Note however that only the compilation of the final crate is measured in each benchmark; compilation of dependent crates is not included.

rustc-perf was created primarily for the perf.rust-lang.org site, which tracks rustc’s performance over time. I recently modified it so it could be used to compare two builds of rustc on a local machine, which is a fundamental operation when optimizing. This can be done by running the suite and site locally and navigating to the local “compare” page, which looks like this:

Note that rustc-perf uses perf-stat for its measurements, so the benchmarking functionality currently only works on Linux.

I also extended rustc-perf so the benchmarks can be run under a profiler. I implemented support for perf-record, Cachegrind, and DHAT, because they are the profilers that I am most familiar with; it isn’t hard to add support for other profilers (including non-Linux ones). The advantage of integrating support for a profiler intto rustc-perf is that it gets the profiler invocations underneath the Cargo invocations, ensuring that the right rustc invocations are measured.

Wins

Here are the improvements I’ve made to rustc over the past few weeks.

#49993: Cachegrind’s output showed that derived methods for the Token type were taking up significant time. This PR changed the ‘#’ counts in raw Lit variants from usize to u16 , which reduced the size of Token from 32 to 24 bytes, speeding up some of the runs for coercions and html5ever by 1%.

#50051: Rust’s Option::ok_or function transforms an Option<T> value into a Result<T, E> . It’s a bit of a footgun… as the docs say:

Arguments passed to ok_or are eagerly evaluated; if you are passing the result of a function call, it is recommended to use ok_or_else , which is lazily evaluated.

DHAT’s output showed that one hot allocation site involved the creation of a String while getting the value of the MIRI_BACKTRACE environment variable. This seemed to me like a strange thing to be happening frequently, and it turns out that it was at the end of a chain of calls that were only needed in the (rare) error case of an ok_or call. This PR changed the code to use a trivial closure with ok_or_else , speeding up runs for a lot of benchmarks, the best by 6%.

#50052: DHAT’s output showed that the char_lit function, which parses “\u{…}” literals, was doing a lot of allocations while stripping out ‘_’ chars. This PR avoided that allocation, speeding up various runs — particularly ones for regex, futures, clap, coercions, hyper, and encoding — the best by 6%.

#50106: Cachegrind’s output showed that the nearest_common_ancestor function, which computes the lowest common ancestor of two nodes in a scope tree, was very hot. The algorithm in use constructed the full scope chain for each node, and then worked backward from the end of the two scope chains until a difference was found. This is a reasonable algorithm in many circumstances, but some ad hoc instrumentation ( eprintln! statements plus some simple post-processing) showed that the scope chains usually only differ by a handful of elements at the front and then have very long common tails, with dozens or even hundreds of elements. This PR switched to a different algorithm that looks for differences from the front of the scope chain, speeding up runs for many benchmarks, the best by 8%.

#50174: By default, Rust’s HashSet and HashMap use a hash function that is very high quality but also very slow. Therefore, the Rust compiler internally uses different types, FxHashSet and FxHashMap , which are identical to the standard ones except they use a much faster hash function. Unfortunately, it’s easy to forget about them and use the standard hash tables instead. Cachegrind’s output showed that the default hash function code (SipHash) was executed a lot, and that one particularly hot hash table (the symbol interner) was using HashMap . This PR (trivially) changed that table to an FxHashMap , speeding up runs for numerous benchmarks, the best by 5%.

#50240: Some of Rust’s standard containers, such as Vec , HashSet , and HashMap , have the nice property that by default they don’t allocate until an element is inserted. This is good because it’s surprising how often such containers are created but never inserted into. DHAT’s output showed that such behaviour would also help with a couple of the compiler’s uses of BTreeMap. I tried and failed to implement this behaviour directly in BTreeMap; according to Gankro, “BTreeMap is some of the most complex unsafe code in libstd” and “I just scared off a grizzled firefox dev explaining it“! Instead this PR introduced a thin wrapper type ( LazyBTreeMap) around BTreeMap and used it in the handful of relevant places within the compiler, speeding up the runs for several benchmarks, the best by 3%. #50266 is open to do the general fix for BTreeMap , whereupon LazyBTreeMap will be able to be removed.

#50246: Cachegrind’s output showed that a function named dump_allocs was hot for some benchmarks. This sounded to me like a logging or debugging function of some kind, and investigation confirmed that it was traversing data structures in order to build up strings that went unused in the standard case where logging is disabled. This PR (trivially) changed this function and a couple of related ones to be no-ops if logging is disabled, speeding up runs for coercions, tuple-stress, html5ever, and encoding, the best by almost 15%! This shows how not doing unclever things is often as important as doing clever things when it comes to optimizing software.

Update: It’s worth noting that I also made three or four optimization attempts that didn’t work out — where I made a change that seemed like it should help, based on profiling data, but the effect was negligible. Success isn’t guaranteed!

Future work

All of the PRs mentioned above (except for the aborted BTreeMap change) involved small, simple changes to the Rust compiler’s code. I’m not a rustc expert, but I do know how to use a couple of profilers well, and I’ve been able to make a difference. I’m sure there are more improvements of this nature to be made, and I encourage other people to try profiling rustc with their favourite profilers to see what they can find. This is valuable because rustc’s speed is something that Rust users often complain about. And it’s fun, if you like that sort of thing 🙂 I’m happy to help people, and the members of the #rustc IRC channel are very friendly and helpful.

Having said that, in a lot of cases, especially for opt builds, the majority of execution time is within LLVM, which rustc uses for code generation. Speeding up LLVM itself may be difficult, but I hope/suspect there is room for improvement in the way that rustc interacts with LLVM. If anyone has ideas on that front I’d love to hear about them.