Avoiding instruction cache misses



Modern processors are quite complex, with many parts having the potential to become a bottleneck. It is relatively easy to reason about the performance of short pieces of code, especially if memory effects are kept to a minimum. Both static analysis tools like LLVM MCA and microbenchmarks can provide a lot of information in such cases. However, the behaviour of the whole program is not just the sum of those small parts. As the code becomes larger and more complex other effects start appearing. One of such potential problems are excessive instruction cache misses.

Every program has different properties, and those large-scale effects will affect it differently. However, if its job is to execute complex logic on a small amount of data, the instruction cache is likely to become a problem at some point. The actual impact may vary significantly from codebase to codebase, which is why I won’t show any numbers in this article. Let’s consider this just a collection of ideas, but it is not easy to tell how much any of them will help for a given application.

First, let’s have a quick look at the processor front end. Below is a simplified diagram of how it is arranged in Skylake, the numbers between units are the maxima per cycle.

Each cycle the processor fetches up to 16 bytes from the instruction cache using information from the Branch Prediction Unit to predict the control flow. The pre-decode unit determines instruction lengths and puts up to six of them in the Instruction Queue. From the Instruction Queue, up to 5 instructions are brought each cycle to the decoders. There is one complex decoder that can handle instructions that translate to up to 4 µops and 4 simple decoders that can handle only single-µop instructions. In total, all decoders are limited to producing no more than 5 µops each cycle. Instructions that require more than 4 µops go through Microcode Sequence ROM, which emits 4 µops per cycle, while it is active, the decoders are disabled. There is also Decoded ICache ( DSB ) which caches decoded µops. It can emit up to 6 µops each cycle. All µops, regardless of their source, end up in the Instruction Decode Queue (IDQ). The Loop Stream Detector ( LSD ) detects small loops and keeps them in the queue, so that no fetched, decodes or reads from the DSB are needed during the duration of the loop. IDQ is the last part of the front end, and the queued µops continue to the back end.

From the instruction cache point of view, the front end has two weaknesses. Firstly, instructions are processed in-order which can severely limit the processor ability to hide latencies of cache misses. HyperThreading can make sure that this part of the processor still does some useful work, but it is also the source of the second problem – all resources, including the L1 instruction cache and µop cache are shared between the hardware threads.

Modern processors provide various metrics that help monitor their behaviour. However, the task of extracting the relevant data requires a proper approach if it is to be done efficiently. Top-down analysis is invaluable with helping to understand microarchitectural phenomena in large codebases. The idea is to monitor program behaviour with the PMU counters and identify the bottleneck starting with the major functional parts of the CPU and then digging deeper narrowing down on the exact source of the problem. It can be done in an automated way by tools like VTune or toplev.

FE Frontend_Bound: 39.48 +- 0.00 % Slots BE Backend_Bound: 16.19 +- 0.00 % Slots This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend... FE Frontend_Bound.Frontend_Latency: 24.92 +- 0.00 % Slots FE Frontend_Bound.Frontend_Bandwidth: 13.45 +- 0.00 % Slots FE Frontend_Bound.Frontend_Latency.ICache_Misses: 14.45 +- 0.00 % Clocks <== This metric represents fraction of cycles the CPU was stalled due to instruction cache misses... Sampling events: frontend_retired.l2_miss:pp frontend_retired.l1i_miss:pp FE Frontend_Bound.Frontend_Latency.ITLB_Misses: 8.71 +- 0.00 % Clocks This metric represents fraction of cycles the CPU was stalled due to instruction TLB misses... Sampling events: frontend_retired.stlb_miss:pp frontend_retired.itlb_miss:pp FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 8.42 +- 0.00 % Clocks_Est This metric represents fraction of cycles the CPU was stalled due to Branch Resteers... Sampling events: br_misp_retired.all_branches:u FE Frontend_Bound.Frontend_Bandwidth.MITE: 31.93 +- 0.00 % CoreClocks This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (Legacy Decode Pipeline)...

Above is an example of a toplev result. We can see that the instruction cache misses were the dominating bottleneck. Unsurprisingly instruction TLB misses also show up. On the front end bandwidth side of things, toplev points to the legacy decode pipeline. That makes perfect sense if the instructions are supplied from DSB or LSD , then there are no instruction fetches, and no cache misses.

Sometimes, the final summary may not provide sufficient information if there are changes in the code behaviour during the test. When that’s the case, a graph is likely to be a much more helpful way of presenting the results.

Tools like toplev are great for initial identification of the problem, but once that’s done what we need is a right way for comparing different solutions. Ultimately, the most important metric is the actual performance of the program in a real-life workload. toplev still can be helpful as it shows the balance between different performance-limiting factors. What also can be useful is perf stat which can show the performance counter statistics. The event most relevant for us is L1-icache-load-misses , though there are more model-specific registers that may be of interest.

Now, that we know how to diagnose excessive instruction cache misses, let’s see what can be done to deal with this problem.

Avoiding work

If the number of executed instructions is the problem, the most obvious solution is to try to reduce that number. Obviously, that’s much easier said than done, but there are some common patterns for dealing with this issue. One example would be prepared statements in databases. The general idea is that if a client knows it will send requests that have some commonality, it can tell the database engine early that the requests are going to match specific templates. This information allows the server to do as much work as possible during the preparation stage, thus reducing the amount of logic that needs to be executed for each individual request.

Extracting common patterns doesn’t have to be explicit. A server or any other kind of an application which actions depend on the user input could attempt to look for repeating patterns and cache some common parts. This is a very vague idea and most likely won’t be easily implementable in a lot of applications, but in some cases may be quite a natural solution. It also shows the main problem with the “just do less” approach – it is very application specific. On the plus, side, if this can be done, it is likely to help with the overall performance, not just the instruction cache misses.

Another potential problem is making sure that the preparation phase can really do something to help during the execution phase. If that means pre-computing some values, then it’s simple. However, if the only thing that preparation gives us is the knowledge which code paths are going to be exercised and which branches taken during the execution, it is going to be harder to take benefit from it. One option is to have some specialised code for the most common paths, C++ templates may come in handy here. If it is not easy to determine what may be the most common paths, then a just-in-time compiler may be used to generate code in the preparation stage.

Batching work

So far, we have been trying to reduce the number of executed instructions, by taking advantage of some earlier knowledge to avoiding repeating the same work. In other words, we have introduced two stages:

preparation, which is performed rarely and, consequently, less performance critical

execution, which is done many times and is expected to dominate overall performance

The way this can help with the instruction cache misses is that the execution stage, being smaller, is more likely to fit in the instruction cache. Whether this brings any measurable benefits depends highly on the type of the application and how much logic can be moved from execution to preparation stages.

We have already split our processing pipeline into preparation and execution stage. If the execution stage can fit in the instruction cache, we are done. However, often, that’s not the case. What we can do to improve the situation more is to split the execution into more stages. This time the goal is not to reuse work as it was with the preparation, but to group entities that need to have the same code executed for them. In other words, if the processing pipeline consists of steps A, B and C the idea is to separate them, add a queue in front of each of those stages, and then cycle through those stages each time handling multiple elements from the queue. Connections between the stages don’t have to be one-to-one, any directed graph is fine.

In the diagram above, there is one stage that feeds tasks to one of two stages. This could be, for example, a front-end of a database server. The first stage does some initial request processing and then, depending on whether it is a read or write, puts it in the appropriate queue. Each stage processes requests in batches, the first one warms up the instruction cache so that the subsequent ones can take advantage of it and hopefully won’t see any instruction cache misses. To make this work well we need to ensure that each stage fits in the cache and the control flow doesn’t diverge too much – we won’t benefit from warmed up cache if each request takes a different path through the code.

The architecture I’ve just described is very similar to SEDA. For the instruction cache purposes, there’s no need to run execution stages on different threads. That actually may be inadvisable since the increase in latency caused by excessive communication between threads is one of the SEDA significant problems. This can be entirely avoided with an asynchronous thread-per-core architecture if tasks are batched per thread. That’s how I did that in Scylla with quite good results.

Regardless of how batching fits in the overall program architecture, there’s still one crucial issue to watch out for. Unless the latency of an individual task is irrelevant, there needs to be a bound on how long it can stay in the queue. This is a trade-off between the throughput and latency and the right solution to this problem is going to be application specific.

Optimisation options

We have considered relatively high-level changes to the code and its architecture that may improve the instruction cache situation. It is time now too look at the lower level side of things, and more specifically what the compiler can do for us.

When it comes to instruction cache misses one optimisation that can change a lot is function inlining. On the one hand inserting a function body in its call sites may increase the code size and increase the number of instruction cache misses, but on the other hand, it allows more optimisations, avoids calling convention overhead, and improves code locality. The point is that inlining can make a significant difference in both directions, and it is definitely something worth checking.

It is also important to remember that the compilers usually allow more control than just, relatively coarse-grained, flags like -finline-functions or -finline-functions-called-once . There is also a set of tunable parameters that helps control the compiler behaviour. For example, inline-unit-growth limits the growth of the compilation unit. The actual effect depends on how the source code is organised, which may lead to quite confusing compiler behaviour if a parameter like this one is limiting the function inlining.

Another way of influencing inlining decisions is to use attributes like [[gnu::always_inline]] or [[gnu::noinline]] (and [[gnu::flatten]] for the brave). This may help in some cases, but adds clutter to the code and if the compiler is unable to make the right decisions without them then it probably means that its heuristics need more tuning.

Function cloning is also a type of optimisation that can directly affect the code size. GCC may use it to do constant propagation across functions if the callee is not inlined.

[[ gnu :: noinline ]] // disable inlining to show cloning int callee ( int x , int y ) { return x + 1000 / y ; } int caller ( int x ) { return callee ( x , 4 ); }

This gets compiled to:

callee(int, int) [clone .constprop.0]: lea eax, [rdi+250] ret callee(int, int): mov eax, 1000 cdq idiv esi add eax, edi ret caller(int): jmp callee(int, int) [clone .constprop.0]

Almost always function cloning will increase the total binary size. It may not necessarily be a problem, though. If the program is going to execute in the hot path only a few cloned versions that benefited significantly from the constant propagation, then we may expect a reduction of the instruction cache misses. Like most things in this section and, actually, the majority of this article, the effect depends on the codebase.

The function definition needs to be visible in the translation unit that contains the call site to be eligible for inlining or cloning. This, again, means that the optimisation will depend on the organisation of the source code, which isn’t really something desirable. The solution for this problem is Link-Time Optimisation, which enables inlining functions across object files as well as other optimisations.

Another kind of optimisation that may cause trouble in the CPU front end is loop unrolling. Ideally, a loop would qualify for being detected by the LSD , if not then we’d want it to at least fit in DSB . However, that’s not necessarily what the best for the back end. As a result, the compiler may be quite aggressive with loop unrolling and produce quite a lot of code. This is definitely something to watch out for.

Finally, the compiler can be asked to optimise code for size. At first, it sounds like a good idea, a smaller program is more likely to fit in the instruction cache. However, this is done at the cost of not doing some performance-oriented optimisations. The end result depends highly on how much code size reduction -Os can bring, and how vital for achieving good performance the optimisations that this flag doesn’t enable are.

Coding practices

While the compilers are prepared to do a wide range of quite aggressive optimisations, there are some limitations that they can’t easily overcome. That requires the behaviour mandated by the language standard (unless the “as-if” rule applies) or the ABI. To deal with that, the programmer needs to be aware of those restrictions, and, ideally, follow practices that lead to code that’s easier to optimise.

For example, the function calling conventions usually require non-trivial objects to be effectively passed by reference regardless of their size. Similarly, they can’t be returned in registers, and the callee needs to write the object to a buffer provided by the caller.

Let’s have a look at how passing a std::unique_ptr<int> to a function looks like on x86_64-linux-gnu :

void takes_unique_pointer ( std :: unique_ptr < int > ); void takes_unique_pointer_noexcept ( std :: unique_ptr < int > ) noexcept ; void takes_raw_pointer ( int * ); std :: unique_ptr < int > unique ; void gives_unique_pointer () { takes_unique_pointer ( std :: move ( unique )); } void gives_unique_pointer_noexcept () { takes_unique_pointer_noexcept ( std :: move ( unique )); } int * raw ; void gives_raw_pointer () { takes_raw_pointer ( std :: exchange ( raw , {})); }

This gets compiled to:

gives_unique_pointer(): push rbp sub rsp, 16 mov rax, QWORD PTR unique[rip] lea rdi, [rsp+8] mov QWORD PTR unique[rip], 0 mov QWORD PTR [rsp+8], rax call takes_unique_pointer(std::unique_ptr<int>) mov rdi, QWORD PTR [rsp+8] test rdi, rdi je .L17 mov esi, 4 call operator delete(void*, unsigned long) .L17: add rsp, 16 pop rbp ret mov rbp, rax jmp .L7 gives_unique_pointer() [clone .cold]: .L7: mov rdi, QWORD PTR [rsp+8] test rdi, rdi je .L16 mov esi, 4 vzeroupper call operator delete(void*, unsigned long) .L8: mov rdi, rbp call _Unwind_Resume .L16: vzeroupper jmp .L8 gives_unique_pointer_noexcept(): sub rsp, 24 mov rax, QWORD PTR unique[rip] lea rdi, [rsp+8] mov QWORD PTR unique[rip], 0 mov QWORD PTR [rsp+8], rax call takes_unique_pointer_noexcept(std::unique_ptr<int>) mov rdi, QWORD PTR [rsp+8] test rdi, rdi je .L24 mov esi, 4 call operator delete(void*, unsigned long) .L24: add rsp, 24 ret gives_raw_pointer(): mov rdi, QWORD PTR raw[rip] mov QWORD PTR raw[rip], 0 jmp takes_raw_pointer(int*)

That’s a lot of code. Even though a std::unique_ptr is just a single pointer with more logic associated with it, being non-trivial, it cannot be passed via registers. Moreover, the Itanium C++ ABI makes the caller responsible for destroying the objects passed by value to the caller. What that means is that this clean up needs to be repeated at each call site and can’t really be optimised since at this point the compiler doesn’t know what the callee does. Moreover, because of the exceptions, there is more than one way to leave a function via either a return or a throw. Unless the compiler can prove that no exception leaves the callee, because it is noexcept or the implementation is available and straightforward enough, an additional handler doing the appropriate cleanup needs to be emitted.

Obviously, many more things may noticeably contribute to the code size growth: restrictions on copy elision, virtual and non-virtual thunks adjusting the object pointer in calls to virtual functions, accessing thread-local variables – just to name the few. However, there are some general guidelines. Often, a lot of complexity is caused by non-trivial objects, e.g. working with a std::string_view instead of std::string may be a good idea, as long as it doesn’t result in dangling pointers, that is. Incidentally, the restrictions that constexpr imposes force the code to avoid the most expensive constructs and attempting to make as much code as possible constexpr is very likely to result in the compiler producing simpler code that’s easier to optimise.

Annotating cold code

Compilers usually allow the programmer to provide some information about the expected characteristics of the code. For our purposes the most interesting one are [[gnu::cold]] and [[gnu::hot]] attributes in GCC and Clang. They allow telling the compiler whether a function is unlikely executed or a hot spot. This affects the optimisations and the location of those functions in the output binary. Cold functions are going to be grouped together in a .text.cold section to make sure they stay out of the way of the rest of the code and hot functions are going to be grouped together in .text.hot to improve instruction cache locality.

Moreover, with -freorder-blocks-and-partition flag GCC can split functions and put cold blocks in .text.cold section. Block that contain calls to cold functions are considered cold, as well as, some other like exception handlers. Even though at a different location in memory, those cold blocks are still effectively considered to be a part of the same function so the control can flow to them with just a single jump instruction, without any need to obey the calling convention.

static constexpr int threshold = 100 ; void hot_path ( int value ); [[ gnu :: cold ]] void cold_path ( int value ); void do_work ( int value ) { if ( value > threshold ) { hot_path ( value / 5 ); return ; } cold_path ( value / 3 ); }

The code above is compiled to:

do_work(int): cmp edi, 100 jle .L2 movsx rax, edi sar edi, 31 imul rax, rax, 1717986919 sar rax, 33 sub eax, edi mov edi, eax jmp hot_path(int) do_work(int) [clone .cold]: .L2: mov eax, edi mov ecx, 3 cdq idiv ecx mov edi, eax jmp cold_path(int)

As we can see, the main part of the function contains a comparison and a conditional jump to the cold region, the rest is the hot path. Moreover, in the cold section, the compiler doesn’t transform a division by a constant into cheaper operations as it does in the hot part.

C++20 is going to gain two new attributes that would allow the programmer to annotate branches as likely or unlikely to be taken: [[likely]] and [[unlikely]] . Even now, some compiler provide extensions for that, such as __builtin_expect() , __builtin_expect_with_probability() . Their purpose is to rearrange the code in a way that optimises for the branches that are most likely to be taken. Apart from helping with the branch prediction itself, it may also improve instruction locality and reduce cache misses.

Profile-Guided Optimisation

Manually annotating code with hot/cold or likely/unlikely attributes can be a quite daunting task. There is also a matter of maintainability. As the code evolves, those properties may change, and there’s no easy way to identify the annotations that are no longer correct.

Fortunately, there is still a lot that the compiler can do for us. Profile-guided optimisation is a technique that instruments the program to monitor its behaviour and then uses the collected information to produce highly optimised binary. The burden of programmers having to maintain the annotations is gone. However, there is a risk of overfitting. If the training phase doesn’t exercise all possible workloads, those omitted may end up with unacceptable performance regressions. How much of a problem this depends on the program, the range of workloads and the differences between them. Another potential problem is the slowdown in the training phase caused by the instrumentation. This may reduce the accuracy of collected data.

There’s also BOLT, a standalone tool that optimises a program, focusing mainly on the instruction cache and the instruction TLB, by rearranging code using information from a profile collected in the training phase. Unlike PGO, profiles are created using sampling and LBR, and the overhead is much lower. BOLT can be combined with PGO and often brings additional improvements. Unfortunately, at the moment, split functions are not supported, meaning that previously discussed flag -freorder-blocks-and-partition has to be disabled.

Conclusion

There are multiple ways of improving the instruction cache miss situation. One can try to change the architecture of the program to make it more hardware-friendly. This is likely to be the most time consuming but may help in more ways than just improving instruction cache hit ratio. Other options include helping the compiler help us. Changing the optimisation options, using PGO, and adjusting the coding practices may have a noticeable effect. Often it will take a combination of those changes before the instruction cache stops being the problem, or rather, something else becomes a more critical issue.

In the end, all optimisations are some kind of a trade-off.