HHVM (HipHop Virtual Machine) is a PHP virtual machine with a JIT compiler. It was created at Facebook to replace HipHop, a PHP-to-C++ source transpiler.

When all of Facebook's code first ran on HHVM, after 12 months of development, it was 7x slower⁸.

I suggest looking at the presentation slides, but there is one thing I would like to highlight:

When code makes unrelated code faster or slower, suspect caching. ~ Keith Adams “PHP on the Metal” slides

The HHVM team made an impressive investigation to find the culprit: an aggressively inlined memcpy . The code of the memcpy function was 11 KB on their platform and caused cache thrashing: useful data was evicted from the CPU cache by this code all over the program.

Their solution was to implement a simpler version of memcpy without the many code paths that depend on the size of the copied data, the CPU model, etc. The simpler function's code fit into two cache lines instead.

The new memcpy performed much worse in various memory micro-benchmarks, but it significantly improved the overall application performance.

By the way, the inline keyword has almost nothing in common with the actual process of inlining.

Correction from u/quicknir on Reddit:

[The inline keyword is] actually used as a hint in gcc and probably other compilers. At -O1, only functions marked inline are considered for inlining

2. Memory, loads, and stores

As was said before, a compiler's frontend generates code as if memory were infinite.

For the following function:

LLVM produces this unoptimized code:

Every single variable is allocated, stored, and immediately loaded back from memory.

The optimized code, in contrast, looks very reasonable:

The SSA form shines here. The optimizer looks at memory loads and stores and tries to figure out what the actual values are. In the example above, it does so easily.

Now, let's try a slightly more complex example:

Unoptimized IR:

Note that the function's signature has changed. The new signature allows a compiler to pass the arguments in registers.

Before: (Point * this, Point arg)

After: (Point * this, i64 arg.x, i64 arg.y)

The data flow becomes more complex in this example: we access several memory locations, we store intermediate results on the stack, and we deal with the implicit this pointer.

The way LLVM tries to reason about what's happening is to partition memory: it attempts to reconstruct individual neighboring loads and stores back into structs and arrays.

Let's see how the optimizer handles this.

LLVM IR with -O1 :

The insertvalue instruction above inserts a value into a member field of a struct or array aggregate. It works by element index, similar to getelementptr .

<result> = insertvalue <aggregate type> <val>, <ty> <elt>, <idx>{, <idx>}* ; yields <aggregate type>

First, we take an undefined {i64, i64} value and insert %r.x at index 0:

%r_1 = insertvalue { i64, i64 } undef, i64 %r.x, 0

Then we create %r_2 from %r_1 :

%r_2 = insertvalue { i64, i64 } %r_1, i64 %r.y, 1

All right. Now we see no alloca : the intermediate struct is gone, and the values can be stored in CPU registers.

Otherwise, RAM access would be roughly 100 times slower.

The fastest code is the one that doesn’t run.

3. Loops

A loop is where a program spends most of its time, so loops are naturally the main optimizer targets. This is where a lot of research is going on in the compilers world.

Let’s take a look at the following loop:

In total, there are about 160 lines of unoptimized LLVM IR for this code (keep in mind the tons of memory allocations, the handful of STL function calls, the C++ runtime code). So there's a lot of stuff going on in this small loop.

Here is a small fragment of unoptimized IR:

LLVM turns these 160 lines into nearly 170 lines of entirely different IR code. We will get to it, step by step.

Assume all the abovementioned optimization techniques have already been applied, collapsing the memory and function abstractions. At this point, the IR of the sum function may look like this:

Again, there's a canonical form. It is crucial for the optimizer to distinguish a loop from any other control-flow statement.

Now we have loop.ph , the so-called loop pre-header, and the loop.exit node. Their purpose is to be the one and only loop entry and exit. The loop label has loop.ph and itself as predecessors, and loop.exit has the loop as its only predecessor.

The function's exit label is a separate node, since we may reach it by skipping the whole loop.

loop.exit has this unusual phi node with only one argument (line 23):

%x.lcssa = phi i32 [ %x.next, %loop ]

This is the SSA form's way of saying that %x.next outlives the loop, so its value can be used outside of it.

This is the canonical loop form. After recognizing it, several optimization techniques are applied.

First of all, the optimizer tries to decide whether there needs to be a loop at all. For instance, if the optimizer can figure out the size of the array, loop unrolling may take place. This is something every C++ compiler is actually good at, as the SSA form naturally facilitates constant propagation.

The second class of optimizations moves redundant operations out of the loop, and the SSA form makes this easy. There is only one entry point, loop.ph , and loop.exit is the only exit. The optimizer keeps track of every phi node, hence it knows what is modified inside and outside of the loop. There is no need for a programmer to cache vector.size() or similar calls in a local variable (this is not Python): the optimizer will move all the unnecessary code out of the loop's body.

Another technique deals with branch conditions inside a loop. Generally, the optimizer not only moves a loop-invariant condition outside of the loop, but creates a separate specialized loop for each case (so-called loop unswitching).

After the abovementioned techniques have run, the loop's code becomes cleaner and smaller, and that's where it gets optimized for maximum CPU utilization.

Modern processors have highly efficient single instruction, multiple data (SIMD) pipelines, so the optimizer performs loop vectorization, a particular application of SIMD. In our example with the sum function, this simply means: execute the same add instruction on several elements per iteration.

The optimized code may look somewhat scary, so before you try to get through it, here is a short intro.

There are three different loops in this code:

loop.vec32 is the main vectorized loop. Inside it we see add nsw <4 x i32> : add two vectors of four 32-bit integers each (the result is also a <4 x i32> vector). Moreover, this loop is unrolled, so it munches 32 32-bit integers per iteration. Of course, the array must have 32 or more properly aligned elements.

loop.vec8 is a smaller vectorized loop that works with two <4 x i32> vectors: 8 32-bit integers per iteration.

loop.scalar is our original loop. All it does is sum 32-bit integers one by one with add nsw i32 . This loop is the fallback for arrays smaller than 8 elements.

The array size arithmetic is hard to follow: LLVM replaced costly division instructions, like div or mod , with lshr / and / or , which works because the right operand is a power of two. Apart from the size tricks, the code should be pretty straightforward.

An array of length 129 will hit loop.vec32 and then loop.scalar , missing the loop.vec8 branch, while an array whose length leaves a remainder of 8 or more after the 32-wide loop (say, 140 elements) would hit all of the branches. A small array of 7 or fewer elements would only go through loop.scalar .

Where the programmer kicks in

A compiler will only optimize for the average case, under the "do no harm" slogan. You should know the specific code paths and the high-level data flow that need to be tuned.

Benchmarking is hard

In many cases, eliminating a bottleneck is not that hard; finding its cause is the real challenge. Non-sampling and built-in profilers may interfere with the original code and mask performance problems. A sampling profiler loses important information and can skew results due to precision errors.

The more you know, the better. Understanding your tools and environment is essential. How does the profiler work? Which compiler flags will affect the investigation? What exactly do they do? How to reduce hardware and OS noise? What's happening inside that long syscall? Are these instructions efficient on this CPU model? Am I measuring the performance of an empty loop?

Chandler Carruth has another wonderful keynote on benchmarking and using a profiler⁹ (I can't help but recommend this guy: it's his bread and butter to make our C and C++ code run faster).

Understanding assembly code doesn't hurt. Well, it hurts less over time. Agner Fog's site¹⁰ is a great starting point to learn about hardware and low-level optimization techniques.

Medians, averages, and 95th percentiles often mean little to nothing; on the contrary, they can bury all the useful signal in noise.¹¹ Interpreting benchmark results may be tricky. Learn you some statistics and the theory of experiments, for great good.

RAM access control

Unless there are very CPU-intensive calculations going on, memory loads and stores become the bottleneck.

The problem is complex, and there are many aspects to look at.

Modern CPUs tend to hide from us the fact that memory is slow. Understanding how the hardware works is crucial to getting the maximum out of it.

“What every programmer should know about memory” by Ulrich Drepper¹² is a must-read if you care about performance.

Sequential RAM access is much better than random

Sequential access is so important that sometimes a naive sorting algorithm may regularly outperform the most intricately optimized one on particular datasets. Matrix multiplication may perform several times better or worse depending on the row/column access order.

CryptoNote cryptocurrency algorithm tries to achieve protection from hardware miners (ASICs) by using a large chunk of memory that needs to be accessed randomly.

Random memory access hurts parallel execution in many ways.

Data Alignment

Some processors have made us believe that there's such a thing as free unaligned access. Sometimes we are even lucky enough to avoid the performance penalty (as long as we don't cross a CPU cache-line boundary).

Even though simple read and write operations may go smoothly on x86-based CPUs, atomics and SIMD instructions require strict data alignment, and unaligned memory access will trigger a hardware fault on most ARM processors.

Serialization/deserialization code is a common source of misaligned data.

CPU Caches

Again, there are many aspects to look at to get the optimal memory layout.

The simple rule of thumb (from the performance point of view): what is often accessed together should be kept together.

Let’s consider the following example.

Imagine we want to look up some elements based on their bit mask. Something like this:

Here we pump every element (32+ bytes) through the CPU caches just to look at its mask, evicting useful data and throwing the rest away on the next iteration. But we can utilize the cache much better by shuffling memory regions around.

Let’s remove int mask from the Foo struct, and move all masks to a separate array:

This way, CPU caches are not wasted on the (in our case) useless value1, value2, … , and a compiler will turn the loop body foomasks[i] & needle into vectorized code that munches several masks at once.

Such an optimization may show substantial improvement (provided that the item count is high and the mask selectivity is low).

But now we have two arrays to nurse and carry around. Code readability definitely suffers, and a handful of nasty bugs become possible.

Write memory bypassing CPU caches

If you are going to write a large amount of memory for later use, you may want to bypass the CPU cache while writing. In that case, the processor issues only write instructions without reading the memory first and, more importantly, the caches will not be polluted with data you don't need right now.

The _mm_stream_siXXX intrinsics family produces stores with a non-temporal memory hint.

Prefetch data from memory

Processors do this on their own most of the time, but we have the upper hand here: we know our specific use cases and needs.

Just keep in mind that it takes time to fetch the data, so there is no use in prefetching it right before you need it. We have to prefetch while there is still some other work to finish.

One way to prefetch the data is to issue compiler’s built-in instruction, like __builtin_prefetch on GCC or _mm_prefetch from the xmmintrin.h header.

Another way would be to start reading the desired data from another thread.

Of course, this is all easier said than done. Prefetching data upfront may cause cache thrashing and lead to the very opposite of the expected results.

Branch prediction

Memory access and branch prediction are highly correlated topics. One thing that may come in handy is to tip the compiler off about a particular branch's probability. This can be done either with the famous likely/unlikely macros (GCC's __builtin_expect ) or, better, with profile-guided optimization.

You may run a program under your typical workload and record a profile of that execution. This profile can then be used as an additional input to the optimizer. The data may include branch-taken counts, function call frequencies, and other information that allows the compiler to make more accurate decisions for the given workload.

From my experience, however, it doesn't give a huge performance improvement. Such optimizations should be considered a last resort to obtain a small overall boost.

Look at the CPU performance counters

Often there is no single bottleneck, and it is unclear where to even start. In such cases, hardware CPU performance counters provide priceless information: the number of mispredicted branches, the number of cache misses for code and data, memory writes, etc. So instead of reducing an unknown function's latency, you may look into improving these numbers.

The perf tool and OProfile on Linux give access to hardware performance counters, Xcode can do it on macOS, and there are numerous tools from CPU vendors, like Intel® Performance Counter Monitor.

Make your intentions visible to a compiler

Sometimes manual optimizations, like loop unrolling or replacing branches with pointer arithmetic, bring a significant local improvement but may surprisingly hurt the overall performance.

There's always a tradeoff between code readability and performance. Take care to make your code human-readable. Don't try to impress a compiler and your fellow humans with "100 Amazing Bit Twiddling Hacks."

All the sophisticated tricks you know will be outdated at some point. Hardware behavior will change. The codebase will live its own life.

Let’s take our sum function from the loops example and call it with a constant value:

LLVM IR (-O2):

The whole program was executed at compile time. All of that std::__XZZF_you_re_sick_of_this_cr*p_::__iterator__Stable_v1_XYZ_operator machinery is gone. The vector's heap allocations through std::allocator calls are gone: return 10 is all that's left.

I don't think this example is entirely fair (the SSA form makes constant propagation a breeze), but it is very impressive nonetheless.

Passing arguments by reference (pointer) may have hidden costs

This code may give an optimizer a hard time. Now the decision whether to inline foo depends both on a memory access (a data dependency) and on the bar function (a code dependency).

But const …

Unfortunately, for the optimizer there is no such thing as a const pointer.

Imagine situations like this:

…or a pointer-to-pointer cast, or array tricks, or integer overflows, or buffer overflows. Not to mention the dull const_cast , there are endless possibilities to alias the same memory region in C and C++, either intentionally or accidentally.

The second consideration is parallelism. If there is no shared data, there is nothing to lock on. Modern multi-core CPUs are capable of highly parallel execution, and there is no need to spend precious cycles on locks (we prefer to mine cryptocurrencies instead).

In many cases, returning values instead of passing them by reference enables a compiler to see the data flow and produce more efficient code.

Do not use struct members to store intermediate results

This is another way to make things more complicated for an optimizer:

Now the optimizer has to think about the this pointer and reason about possible side effects. Humans would also struggle to read such code.

Pure functions are much easier to read, for both humans and compilers.

Embrace modern C/C++ standards

C++11 and newer standards provide move semantics, and since C++17 return value optimization (RVO) is guaranteed.

Both C and C++ have too many cases where the behaviour of our programs relies on compiler implementation details. Modern standards try to remove these dark corners and minimize the possible damage from undefined behavior (UB).

Also, newer compilers come with supplementary tools: linters, checkers, formatters, etc.

Don’t do the compiler’s job

Try to spot an error in this manually unrolled loop:

The above example is from the great article from the PVS-Studio static analyzer team.¹³

One of the lines still contains a[1] , left over from copy-paste, where a[5] should be.

Again, loop unrolling is something C and C++ compilers do well without any help from a programmer.

Exceptions on modern processors have close to zero overhead

Importantly, “zero overhead” is not claiming zero cost — of course using something always incurs some cost. Rather, C++’s zero-overhead principle has always meant that (a) “you don’t pay for what you don’t use” and (b) “when you do use it you can’t [reasonably] write it more efficiently by hand.” ~ Herb Sutter “Zero-overhead deterministic exceptions: Throwing values” ¹⁵

Of course, the [close to] zero overhead applies only to the happy path. When an exception is thrown, the CPU has to execute frame finalizers and unwind the stack, causing a bunch of cache misses, TLB shuffling, etc.

You wouldn’t simulate pattern matching with exceptions anyway, would you?