This blog post is based on the talk I gave at the Open Source Summit North America 2017 in Los Angeles. Let me start by thanking my employer Collabora, for sponsoring my trip to LA.

Last time I wrote about Performance Assessment, I discussed how an apparently naive code snippet can hide major performance drawbacks. In that example, the issue was caused by the randomness of the conditional branch direction, triggered by our unsorted vector, which really confused the Branch Predictor inside the processor.

An important thing to mention before we start, is that performance issues arise in many forms and may have several root causes. While in this series I have focused on processor corner-cases, those are in fact a tiny sample of how thing can go wrong for performance. Many other factors matter, particularly well-thought algorithms and good hardware. Without a well-crafted algorithm, there is no compiler optimization or quick hack that can improve the situation.

In this post, I will show one more example of how easy it is to disrupt performance of a modern CPU, and also run a quick discussion on why performance matters - as well as present a few cases where it shouldn't matter.

If you have any questions, feel free to start a discussion below in the Comments section and I will do my best to follow-up on your question.

CPU Complexity is continuously rising

Every year, new generations of CPUs and GPUs hit the market carrying an always increasing count of transistors inside their enclosures as show by the graph below, depicting the famous Moore's law. While the metric is not perfect on itself, it is a fair indication of the steady growth of complexity inside of our integrated circuits.



Figure 1: © Wgsimon. Licensed under CC-BY-SA 3.0 unported.



Much of this additional complexity in circuitry comes in the form of specialized hardware logic, whose main goal is to explore common patterns in data and code, in order to maximize a specific performance metric, like execution time or power saving. Mechanisms like Data and Instruction caches, prefetch units, processor pipelines and branch predictors are all examples of such hardware. In fact, multiple levels of data and instruction caches are so important for the performance of a system, that they are usually advertised in high caps when a new processor hits the market.

While all these mechanisms are tailored to provide good performance for the common case of programming and common data patterns, there are always cases where an oblivious programmer can end up hitting the corner case of such mechanisms, and not only write code which is unable to benefit from them, but also code which executes way worse than if there were no optimization mechanism at all.

As a general rule, compilers are increasingly great at detecting and modifying code to benefit from the CPU architecture, but there will always be cases where they won't be able to detect bad patterns and modify the code. In those cases, there is no replacement for a capable programmer who understands how the machine is designed, and who can adjust the algorithm to benefit from its design.

When does performance really matter?

The first reaction of an inexperienced developer after learning about some of the architectural issues that affect performance, might be to start profiling everything he can get his hands on, to obtain the absolute maximum capability of his expensive new hardware. This approach is not only misleading, but an actual waste of time.

In a city that experiences traffic jams every day, there is little point in buying a faster car instead of taking the public bus. In both scenarios, you are going be stuck in the traffic for hours instead of arriving at your destination earlier. The same happens with your programs. Consider an interactive program that performs a task in background while waiting for user input, there is little point in trying to gain a few cycles by optimizing the task, since the entire system is still limited by the human input, which will always be much, much slower than the machine. In a similar sense, there is little point in trying to speed-up the boot time of a machine that almost never reboots, since the reboot time cost will be payed only rarely, when a restart is required.

In a very similar sense, the speed-up you gain by recompiling every single program in your computer with the fastest compiler optimizations possible for your machine, like some people like to do, is completely irrelevant, considering the fact that the machine will spend most of the time in an idle state, waiting for the next user input.

What actually makes a difference, and should be target of every optimization work, are cases where the workload is so intensive that gaining a few extra cycles very often will result in a real increase of the computing done in the long run. This requires, first off all, that the code being optimized is actually in the critical path of performance, which means that that part of the code is actually what is holding the rest of the system back. If that is not the case, the gain will be minimum and the effort will be wasted.

Moving back to the reboot example, in a virtualization environment, where new VMs or containers boxes need to be spawned very fast and very often to respond to new service requests, it makes a lot of sense to optimize reboot time. In that case, every microsecond saved at boot time matters to reduce to overall response of the system.

The corollary of the Ahmdal's law states just that. It argues that there is little sense in aggressively optimizing a part of the program that executes only a few times, very quickly, instead of optimizing the part that occupies the largest part of the execution time. In another (famous) words, a gain of 10% of time in code that executes 90% of time is much better for the overall performance than a 90% speed up in code that executes only 10% of the time.



The dangers of premature optimization

If you think of performance penalties triggered by architectural limitations of the processor, like cache misses, you must know that these actually happen all the time, and they are very normal and expected.

In fact, every time a program starts to execute on a new memory page, walks through a large data structure, accesses a function on a different part of the code, or even when it gets pushed away to another CPU, these performance damaging events will occur. This is inherent to current computers design and while we, as performance engineers, do our best to prevent them with predictors and pre-fetching units, no one can really expect to eliminate them all - and this shouldn't even be a goal, either.

As I mentioned above, most of the time, these penalties are completely irrelevant to the overall performance of the application. A single pipeline stall waiting for the system to fetch data in the main RAM takes several cycles more than a direct L-1 cache hit, but it really doesn't matter in the big picture of things if it happens now and then. Who cares about those wasted cycles if the program is executing a unnecessary quadratic algorithm? Or if it is some interactive software that is waiting for input from the user?

Trying to deal with corner-case performance issues ahead of time is called premature optimization, and is a path for frustration. Before having actual profile information from the actual software running, it is very hard to figure out what it is going to happen on the platform level. It is much more important to write good quality algorithm to solves the problem in a maintainable manner, than to bother about performance issues that may never happen.

Most of the time, the compiler knows much better than the developer how to optimize the code for the specific architecture, and some micro-optimizations on one platform or processor generation might actually render the performance worse on a different machine. In fact, putting clever tricks in C code to attempt to optimize by hand, might actually reduce the semantic information available to the compiler, which confuses it, and can prevent it from actually optimizing the code.

As a rule of thumb, write good algorithms and only try to solve odd performance issues when they arise, and after you are confident you have enough data to justify the change.

Sometimes, on the other hand, a program actually hits a corner case of the design and performs very poorly, despite well written algorithms. On those cases, it is up to the programmer to find out what on earth is going on, and how the program can be improved.

The issue of Performance Assessment - When things go wrong

As a side effect of the increasing complexity of our computer systems, comes the challenge of understanding what is at stake when things go wrong. With so many complex mechanisms playing together at the hardware and software level, it can get quite hard to understand why a seemingly naive piece of code is performing the way it is.

Like starting a car with the hood open, sometimes the best way to understand a problem is to run your program with certain analysis tools attached, to get the right diagnostic information for the problem.

The Linux eco-system includes several of these diagnostic tools, which can attach to a running process and collect statistics about the execution, with minimal intrusion on the application being observed. A Sampling-based dynamic profiler, for instance, is usually capable of collecting samples of cache misses directly from the hardware, on a specific frequency while the test program is running, in order to estimate whether the program is making a good use of that resource.

The kind of data that can be collected by profilers varies deeply depending on the requirements of the user. For instance, one may be interested in the amount of memory used by a specific application, or maybe the number of cycles the program executed, or even how long the CPU was stuck waiting for data to be fetched from the disks. All this information is valuable when tracking performance issues, allowing the programmer to identify different kinds of bottlenecks in the code, or even to learn how to tune an application to a specific environment or workload.

The least intrusion of the tracer on the execution of the profiled application is an important requirement for an accurate analysis. With that in mind, all the main architectures implement hardware performance counters, which can be configured by the tracer with help from the kernel, to collect specific data for later analysis, which reduces the impact of the tracing process itself on the program being analyzed.

One must notice that the tools themselves help only to identify the symptoms of a performance issue. A tool might indicate a particular region of code where the execution is wasting the most time on, or an unexpected number of stalls in a specific Load instruction, but they won't tell you more than that. It is up to the developer, using his knowledge of the platform, to explain why a specific code pattern is causing that specific behavior. And for that, the programmer must not only have a good understanding of his code and data structures, but also some general understanding of computers architecture and design, as well as some knowledge of the specific machine he is working on.

Perf

The tool that first comes to mind when discussing low-level performance assessment in Linux is the perf_events tool, most commonly known as the perf command. Perf is a multi-purpose tool for collecting and analyzing performance data about a specific process or the whole system. It instruments the CPU performance counters as well as software Tracepoints to collect the data for live or later analysis.

Perf abstracts the specific hardware details of the platform allowing the performance engineer to work on a much higher level. For instance, you can simply ask perf to trace a specific event, like data cache branch misses, and let it handle the complex setup of the performance registers for each supported platform.

Next, let's take a look at an example of a program that performs really bad and how we can use the profile information provided by perf to improve the code.





Dealing with cache misses

We are used to think of the computer RAM as a huge pool of immediately available memory ready to be consumed by the processor when needed. But the truth is, while accessing the RAM is much much faster than reading data from a hard disk, it still takes order of magnitude more time than the processor is used to operate. To prevent the processor from stalling while waiting for each memory access to complete, computer designers added several layers of increasingly faster memory between the CPU and the main memory. Those layers compose the memory cache.

The closer these layers of memory are to the main processor, the faster they operate, but they also get capable of holding less and less data. A small part of the main memory, when accessed by the processor, gets replicated into each of these layers, and any further access to that region of memory is then served by the faster memory that holds that data. When the data is served by the cache instead of the main memory, we call it a Cache hit .

But there are times where the requested memory is simply not available in the cache and needs to be fetched from the deeper levels of cache or even the main memory. In that case, we call it a Cache miss , and it requires the processor to stall the execution for a few cycles, waiting for the data to be available.

Every time a new piece of memory is loaded from the main memory, it replaces a different part of memory that was already cached, which gets written back to the main memory. This process is invisible to the programmer, but since a cache miss is an expensive operation, the programmer must be careful about the way she accesses the data, in order to minimize the need for fetching the same blocks of data several times from the main memory.

One last important detail about how cache works is that, differently from the processor, which can operate on single architecture words or single bytes, the cache has a much larger granularity, meaning that it operates on a much larger block of data, which is called a cache line. Only for comparison, the computer I am using right now has a data L1 cache line of 64 bytes of size, while the architectural word is, obviously, 8 bytes..

The reason for loading an entire cache line at once during a cache miss is due to two important assumptions inherent to almost all caching systems: (1) temporal locality, which is the assumption that recently accessed data is likely to be accessed again, and (2) spacial locality, which is the assumption that data close to other data recently accessed is likely to also be accessed in the near future. By loading an entire cache line at once, the processor is placing a bet that that data will be used in the near future.

Improving the use of cache

Take a look at the code below. it performs a simple 5-bit Plane Slicing transformation by walking through the black-and-white input image, represented by a two dimensional vector of pixels, and masking some bits in every pixel. Since the data of a two dimensional array is stored linearly in the memory, which mean that elements of the same row follows each other, and each row is glued after its predecessor in memory, the direction in which we access the data becomes very important.

int plane_6(img[m][n]) { for (i = 0; i < w; i ++) { for (j = 0; j < h; j ++) { img[j][i] &= 0x20; } } }

An example input image and the corresponding output of the process is shown by the two images below:

If we walk through the image one column at a time, as show in the image below, we will access the element A_11, and then the element A_21, which is several memory addresses ahead in the cache. Eventually, we will get to the last line, A_X1, which will most definitely be in a different cache line of the first lines, and only then we will go back to the first cache line to access the element A_12. while this doesn't seem a problem on itself, if the cache is too small or two lines end up in the same line of cache, accessing the elements in this order may cause several cache misses, because every time the next element of a line is accessed, that data needs to be reloaded in the cache, since it could have been overwritten by another line.

If we try to run this code even on a small image, of 512x512 pixels, the execution time will be rather upsetting. For this image, in my system, it took over 23 seconds of wall-clock time.

Once we analyze the program with a tool like perf-stat , the root cause of slowness becomes very obvious. The relative number of cache-misses is extremely high. This is to be expected, given the way we are accessing the data.

The easiest fix would be to walk through the image one row at a time. This way, we make sure we consume the entire cache line before loading the next one, and it doesn't matter if the first line gets overwritten, because we are actually done with it by that time. This access method allows us to make a much better use of the cache, reducing the cache misses and increasing the overall performance.

To make this change, we could simply replace the line that accesses the vector with the code below (notice the change of the indexes):

img[i][j] &= 0x20;

For a visual representation, now we are walking the image like this:

And the perf-stat command shows a consistent reduction of cache misses and execution time:

This fact is not easily deductible without previous knowledge of how the caching mechanism works. But identifying this kind of hazard becomes quite trivial with a tool like perf .

Memory layout

I hope the above example helps to give a sense of how counter-intuitive performance assessment can be. On many cases, though, the craziest changes can have very interesting impacts on performance. There is a paper from the University of Colorado that I really enjoy, called Producing Wrong Data Without Doing Anything Obviously Wrong!. What I find awesome in this paper is that the authors some apparently innocuous changes on their setup, like changing the shell environment size before executing the benchmark, or changing the linking order of the libraries of the benchmark, and are able to observe a noticeable variation in the program performance!

This is a very interesting information not only for research purposes, but also because it gives an idea of how susceptible to external factors our programs really are.

In the cases presented in the paper the most reasonable explanation is that those small changes completely modified the layout of the program in memory, pushing data to different instruction and data cache lines, or branches to different prediction slots, or variables to different alignments. It becomes quite a complex problem to fix by hand, even with the help of perf .

But how much this matters? The paper authors mentioned variations of almost 10% in execution time. Well, definitely, it affects people trying to measure their code, like the researchers mentioned. There is little point in trying to fix cases like this by hand, since the smallest change in the system can affect performance. In fact, people have been working on automatic tools to do just that.

Wrapping up

Despite the curious examples, the points I would really like to emphasize are along these lines:

Performance assessment is hard and, many times, extremely counter-intuitive.

Optimize only when necessary. Avoid premature optimization at all costs.

With the increasing complexity of the current processors, the challenges are likely to increase in the future.

Thankfully, compilers and binary optimizers are getting increasingly better at avoiding pitfalls, but at some point, the programmer also shares the responsibility for performance.

There are very great tools out there which do a very good job at detecting symptoms of performance issue. Take your time to learn and use them.

But the tools only help to some extent. It is up to a skilled developer to have a good understanding of how the computer mechanisms interact with each other, in order to be able to address complex performance issues

Learn more from Gabriel Krisman Bertazi at this month's Open Source Summit Europe, as he presents "Code Detective: How to Investigate Linux Performance Issues" on Monday, October 23.



Original post