I built a time tracking tool named augr, and recently decided that it was time to make it faster. For some context, these were the results of running time augr:

~/p/u/augr ((8776075c…)|…) 1 $ time ./target/release/augr summary
# snip
0.22user 0.01system 0:00.23elapsed 99%CPU (0avgtext+0avgdata 7204maxresident)k
0inputs+0outputs (0major+948minor)pagefaults 0swaps

This is before I did any optimization.

Shock! Horror! Running one command takes a whole fifth of a second? How could anyone work under these conditions!

Ahem.

So, augr didn't really need to be faster. I wanted to make it faster, because I use augr several times a day, often on my phone. I also wanted to gain some experience optimizing, as I don't have much experience doing that.

And so, my journey began.

(Just a note, I didn't actually save most of the performance reports, so the graphs shown here are not the exact graphs I saw.)

Act 1: Thought Processes

The first thing I needed to do was find the right tool for the job. augr is written in Rust, so I searched for things along the lines of "rust profiling code".

After reading this article, I decided to use cargo-profiler.

~/p/u/augr ((8776075c…)|…) 1 $ cargo profiler callgrind --bin target/release/augr -n 10 -- summary
Profiling augr with callgrind...

Total Instructions...909,402,627

245,396,401 (27.0%) ???:_int_malloc
-----------------------------------------------------------------------
118,210,891 (13.0%) ???:_int_free
-----------------------------------------------------------------------
64,236,762 (7.1%) ???:__memcpy_avx_unaligned_erms
-----------------------------------------------------------------------
64,135,017 (7.1%) ???:malloc
-----------------------------------------------------------------------
41,613,000 (4.6%) ???:<toml::tokens
-----------------------------------------------------------------------
37,643,653 (4.1%) ???:malloc_consolidate
-----------------------------------------------------------------------
31,174,650 (3.4%) ???:<alloc::collections::btree::map
-----------------------------------------------------------------------
29,340,320 (3.2%) ???:free
-----------------------------------------------------------------------
25,118,543 (2.8%) ???:<alloc::collections::btree::map
-----------------------------------------------------------------------
24,456,263 (2.7%) ???:<alloc::collections::btree::map
-----------------------------------------------------------------------

Immediately, this tells us that most of the program runtime is being spent on allocation. It also tells us that the allocation is most likely being done in a BTreeMap .

Hmm.

At this point I decided that I needed something to tell me more specifically where my code was slow. I kept looking for tools.

Linux perf

The next thing I tried was the Linux perf profiler. It's a very shiny-looking tool for measuring the performance of a program. It even had a nice interactive report tool:

Samples: 36 of event 'cycles:ppp', Event count (approx.): 877654027
Overhead  Command  Shared Object      Symbol
  36.58%  augr     [kernel.kallsyms]  [k] __d_lookup_rcu
   9.92%  augr     libc-2.27.so       [.] _int_free
   7.95%  augr     libc-2.27.so       [.] __memmove_avx_unaligned_erms
   5.96%  augr     augr               [.] <augr_core::repository::event::PatchedEvent as core::clone::Clone>::clone
   5.95%  augr     libc-2.27.so       [.] _int_malloc
   3.97%  augr     augr               [.] <alloc::collections::btree::map::IntoIter<K,V> as core::ops::drop::Drop>::drop
   3.08%  augr     [kernel.kallsyms]  [k] copy_page
   2.46%  augr     augr               [.] __rust_alloc
   2.19%  augr     augr               [.] <alloc::collections::btree::map::BTreeMap<K,V> as core::clone::Clone>::clone::clone_subtree
   2.08%  augr     augr               [.] <alloc::collections::btree::map::BTreeMap<K,V> as core::ops::drop::Drop>::drop
   2.03%  augr     augr               [.] <alloc::collections::btree::map::BTreeMap<K,V> as core::clone::Clone>::clone::clone_subtree
   1.99%  augr     augr               [.] <alloc::collections::btree::map::BTreeMap<K,V> as core::clone::Clone>::clone::clone_subtree
   1.98%  augr     augr               [.] toml::tokens::Tokenizer::next
   1.98%  augr     augr               [.] <toml::tokens::CrlfFold as core::iter::traits::iterator::Iterator>::next
   1.98%  augr     libc-2.27.so       [.] cfree@GLIBC_2.2.5
   1.98%  augr     libc-2.27.so       [.] malloc
   1.98%  augr     libc-2.27.so       [.] malloc_consolidate
   1.98%  augr     augr               [.] __rust_dealloc
   1.98%  augr     augr               [.] <alloc::collections::btree::map::BTreeMap<K,V> as core::ops::drop::Drop>::drop
   1.98%  augr     augr               [.] toml::ser::Serializer::emit_str
   0.01%  .perf-w  [kernel.kallsyms]  [k] native_write_msr
Press '?' for help on key bindings

In the end, I decided that it wasn't the tool I needed. The report gave me about the same information as cargo-profiler. It seems like a very powerful tool, but I didn't want to take the time to learn it.

Flamer

The flame and flamer crates allow you to generate flame graphs by annotating your code or calling flame::start() and flame::end() .

They're essentially fancy crates for println debugging.

Because flame and flamer require you to mark up your code, and I wanted to save the changes I made, I made a flame_it feature as recommended in flamer's README.
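As a rough sketch of what that gating can look like (the function here is hypothetical, and the feature wiring in Cargo.toml is an assumption based on flamer's README):

```rust
// Sketch only. Assumes a `flame_it` Cargo feature that turns on the
// optional `flame` and `flamer` dependencies, roughly:
//
//   [features]
//   flame_it = ["flame", "flamer"]
//
// and a `use flamer::flame;` when the feature is active.

// flamer's attribute instruments the whole function; with cfg_attr it is
// only attached when the `flame_it` feature is enabled, so normal builds
// pay no cost.
#[cfg_attr(feature = "flame_it", flame)]
fn summarize() -> usize {
    // Hypothetical work to be profiled.
    0
}

fn main() {
    let rows = summarize();
    println!("summary rows: {}", rows);

    // When profiling, write the collected spans out as an HTML flame graph.
    #[cfg(feature = "flame_it")]
    flame::dump_html(std::fs::File::create("flame-graph.html").unwrap()).unwrap();
}
```

With the feature off, both attributes compile away and the binary behaves exactly as before.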

Anyway, this is the flame graph before I made any changes:

"Aha!" I said at this point. " augr appears to be slowing simply because of the number of files that need to be loaded! There are lots of clever ways to make this faster. I could make a cache file, or I could do some kind of automatic merging of files, or..."

It was at this point that I decided to take a moment and actually investigate the problem before rushing into a giant project to fix it. I added some code to track how long loading the files took.

The little orange rectangle at the top left of the stack is how long it took to load that patch from disk. So it wasn't the number of files.

After a bit more investigating that is harder to screenshot, I determined that applying the patches was taking the majority of the time, not loading them.

Making Things Faster

So what made this code slow?

pub fn apply_patch(&mut self, patch: &Patch) -> Result<(), Vec<Error>> {
    // Clone state
    let mut new_state = self.clone();

    // Make changes to state
    // *snip* (`errors` is collected in the snipped section)

    // Save changes to state if there were no errors
    if errors.len() >= 1 {
        Err(errors)
    } else {
        self.events = new_state.events;
        Ok(())
    }
}

Oh. As it turns out, cloning your entire state for every operation is bad for performance. In addition to that gem, there were about 20 other calls to clone in that block of code.

At this point I realized that the performance reports from cargo-profiler had already confirmed that the cloning was killing performance, especially given all the BTreeMap activity they showed, since the state was just a big BTreeMap.

The problem was that I couldn't make changes before I knew that the patch was valid. After a bit of thinking, I realized that patches could be verified before mutating the state. The code would ensure a patch is valid, and then just mutate the state directly, as it's already "checked the bounds", so to speak.
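A minimal sketch of the shape that fix takes (the types and the rename-based patch format here are hypothetical simplifications, not augr's actual data model):

```rust
use std::collections::BTreeMap;

#[derive(Debug)]
enum Error {
    UnknownEvent(String),
}

// Hypothetical patch: each entry renames an existing event key.
struct Patch {
    renames: Vec<(String, String)>,
}

struct State {
    events: BTreeMap<String, String>,
}

impl State {
    /// Check every operation in the patch against the current state
    /// without mutating anything.
    fn validate_patch(&self, patch: &Patch) -> Result<(), Vec<Error>> {
        let errors: Vec<Error> = patch
            .renames
            .iter()
            .filter(|(old, _)| !self.events.contains_key(old))
            .map(|(old, _)| Error::UnknownEvent(old.clone()))
            .collect();
        if errors.is_empty() { Ok(()) } else { Err(errors) }
    }

    /// Validate first, then mutate in place: no clone of the whole state.
    fn apply_patch(&mut self, patch: &Patch) -> Result<(), Vec<Error>> {
        self.validate_patch(patch)?;
        for (old, new) in &patch.renames {
            if let Some(value) = self.events.remove(old) {
                self.events.insert(new.clone(), value);
            }
        }
        Ok(())
    }
}

fn main() {
    let mut state = State { events: BTreeMap::new() };
    state.events.insert("lunch".to_string(), "12:00".to_string());
    let patch = Patch { renames: vec![("lunch".to_string(), "meal".to_string())] };
    state.apply_patch(&patch).unwrap();
    println!("{:?}", state.events);
}
```

The key property is that by the time any mutation happens, the patch is already known to be valid, so the "undo buffer" that the full clone used to provide is no longer needed.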

After making the required changes, augr ran much faster:

~/p/u/a/cli ((d4536c4e…)|…) 1 $ time ../target/release/augr
# snip
0.07user 0.00system 0:00.08elapsed 100%CPU (0avgtext+0avgdata 6192maxresident)k
0inputs+0outputs (0major+714minor)pagefaults 0swaps

But We Gotta Go Faster

Let's take a look at the flamegraph, now that we've made augr faster:

Hmm. Two identical-looking calls to load_patches? It turned out that the patches were being loaded twice. I had code checking if a patch had already been applied, but that check only ran after the patch had been loaded a second time. Fixing that gave augr another boost in performance:
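The shape of that fix, sketched with a hypothetical patch ID and a stand-in loader (augr's real loading code differs):

```rust
use std::collections::HashSet;

// Hypothetical stand-in for reading and parsing a patch file from disk.
fn load_patch(id: &str) -> String {
    format!("contents of {}", id)
}

fn load_patches(ids: &[&str]) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut patches = Vec::new();
    for id in ids {
        // Check *before* loading: `insert` returns false if the ID was
        // already present, so an already-seen patch is never read and
        // parsed a second time.
        if !seen.insert(*id) {
            continue;
        }
        patches.push(load_patch(id));
    }
    patches
}

fn main() {
    let patches = load_patches(&["a", "b", "a"]);
    println!("loaded {} patches", patches.len());
}
```

Moving the check from after the load to before it is a tiny diff, but it halves the parsing work when every patch shows up twice.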

~/p/u/augr ((3ec64318…)|…) 1 $ time ./target/release/augr
# snip
0.02user 0.00system 0:00.03elapsed 100%CPU (0avgtext+0avgdata 6364maxresident)k
0inputs+48outputs (0major+707minor)pagefaults 0swaps

At this point augr was ten times faster than when I began, clocking in at 0.02s.

Conclusion

Measuring performance was kind of fun! In this case, I just had to make two simple changes and got a 10x improvement.

Measuring was also more effective than my wild guessing. In my head, I assumed that loading so many files was going to be the bottleneck, and I thought I was going to need all kinds of crazy caching systems to make augr faster. But I didn't need anything remotely like that!

Always measure before optimizing, kids.

(P.S. After optimizing augr, I learned about a replacement for time called hyperfine! It's much nicer, and I recommend it.)