Hi, I’m Tony Albrecht, an engineer at Riot, and I like profiling and optimising things. In this article, I’m going to walk you through profiler basics and analyse some example C++ code as we profile it on a Windows machine. We’ll start at a very high level and delve deeper and deeper into the bowels of the CPU step by step. Once we identify opportunities for optimisation in this article, we’ll implement changes and explore some real-world examples from the League of Legends codebase in a future article. Let’s begin!

Code Overview

First, let's take a look at the code we’re going to profile. Our soon-to-be-optimised program is a simple little OpenGL renderer, specifically an object-oriented, hierarchical scene tree. I’ve cleverly named the base object “Object” - everything in a scene inherits from this base class. In this code there are three things that inherit from Object: Node, Cube, and Modifier.

Cube is an Object which renders itself as a cube onscreen, Modifier is an Object which lives in the scene tree and when “Updated” will transform any Objects that have been added to it, and Node is an Object which can contain other Objects. The system is designed so that you can build a hierarchy of objects by adding Cubes to a Node and then Nodes to other Nodes. When you transform a Node (via a Modifier), all of the Objects contained within it will be transformed. With this simple system, I’ve built a tree of cubes that rotate around each other.
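The hierarchy described above might look something like the following sketch. The article doesn’t show the actual class definitions, so everything here beyond the names Object, Node, and Cube is an assumption (the `updates` counter in particular is purely illustrative):

```cpp
#include <vector>

// Hypothetical skeleton of the scene-tree classes described above.
struct Matrix4 { float m[16]; };

class Object {
public:
    virtual ~Object() {}
    virtual void Update() {}
protected:
    Matrix4 mTransform;
};

class Cube : public Object {
public:
    int updates = 0;                 // illustrative counter, not in the article
    void Update() override { ++updates; }  // stand-in for "animate this cube"
};

class Node : public Object {
public:
    void AddObject(Object* obj) { mObjects.push_back(obj); }
    void Update() override {
        for (Object* obj : mObjects)
            obj->Update();           // recurse into children
    }
protected:
    std::vector<Object*> mObjects;
};
```

Adding Cubes to Nodes and Nodes to other Nodes builds the tree; a single `Update()` call on the root then walks the whole hierarchy.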

I’ll be the first to admit that this code is not the best implementation of a scene tree, but that’s ok - it exists purely as something for us to optimise. In fact, this code is a direct port of a PlayStation®3 example that I wrote in 2009 for performance analysis in a document titled Pitfalls of Object Oriented Programming, and this Tech Blog is, in part, a comparison against that article to see if the lessons learnt for the PS3 apply to the more modern hardware platforms found today.

Back to our cubes. The image above consists of about 55,000 rotating cubes. Note that I’m not profiling the rendering of the scene, just the animation and culling of the scene as well as its traversal for rendering. The libraries I used for building this sample were Dear ImGui and Vectormath from Bullet, both freely available. For profiling, I used AMD’s CodeXL and a simple instrumented profiler I whipped up for this article.

Before We Get Started

Units of Measurement

Before we dive into the juicy optimisations, I want to chat about measuring performance. We often see Frames Per Second (FPS) used as a performance metric in games. While this is a useful indication of performance, it is useless when looking at parts of frames or comparing improvements between different optimisations. Consider the statement “The game now runs 20 frames per second faster!” How much faster is that?

The answer is, “that depends”. If we were running at 60fps (or 1000/60 = 16.666ms per frame), and we’re now running at 80fps (1000/80 = 12.5 milliseconds (ms) per frame) then we’ve improved performance by 16.666ms - 12.5ms = 4.166ms per frame. That’s a nice speed boost.

But if we were running at 20fps and we’re now running at 40fps, then we’ve improved by (1000/20 - 1000/40 = 50ms - 25ms = 25ms) 25ms per frame! That’s a massive improvement and might elevate a game from “unplayable” to “tolerable.” The issue with FPS is that it’s a relative measurement, so we’ll always use milliseconds. Always.
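The arithmetic above is just a reciprocal, and keeping a tiny helper around makes the comparison concrete. A minimal sketch (the function name is my own):

```cpp
// Convert frames per second to milliseconds per frame.
// The same "+20fps" gain means very different ms savings:
// 60 -> 80 fps saves ~4.17ms, while 20 -> 40 fps saves 25ms.
double FpsToMs(double fps) {
    return 1000.0 / fps;
}
```

This is exactly why we compare frame times, not frame rates: milliseconds saved are additive and directly comparable across optimisations, while FPS deltas are not.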

Measuring the Units

There are a few different types of profilers, each with their own strengths and weaknesses.

Instrumented

Instrumented profilers rely on programmers manually marking up sections of code that they want to measure. They work by measuring and storing the time at the beginning and end of the section to be profiled along with a unique marker used to identify that section. For example:

void Node::Update()
{
    FT_PROFILE_FN
    for(Object* obj : mObjects)
    {
        obj->Update();
    }
}

In the case above, FT_PROFILE_FN creates an object that measures the current time at creation and then again upon destruction as it falls out of scope. Those times plus the function name are stored in an array somewhere for later inspection and visualisation. With some effort, visualisation can be done in your own code or, with a little less effort, in a tool like Chrome tracing.
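One way such a macro might be implemented is with a scoped RAII timer. This is a minimal sketch of the idea only - the article’s actual FT_PROFILE macro, storage format, and visualisation code are not shown, so all names here are assumptions:

```cpp
#include <chrono>
#include <vector>

// One recorded section: a name plus begin/end timestamps.
struct ProfileEvent {
    const char* name;
    long long beginUs, endUs;   // microseconds on a monotonic clock
};

static std::vector<ProfileEvent> gEvents;  // dumped/visualised later

class ScopedTimer {
public:
    explicit ScopedTimer(const char* name) : mName(name), mBegin(NowUs()) {}
    // Destructor fires when the scope ends, capturing the end time.
    ~ScopedTimer() { gEvents.push_back({mName, mBegin, NowUs()}); }
private:
    static long long NowUs() {
        using namespace std::chrono;
        return duration_cast<microseconds>(
            steady_clock::now().time_since_epoch()).count();
    }
    const char* mName;
    long long mBegin;
};

// The macros just drop a uniquely named timer object into the scope.
#define FT_CONCAT2(a, b) a##b
#define FT_CONCAT(a, b) FT_CONCAT2(a, b)
#define FT_PROFILE(name) ScopedTimer FT_CONCAT(ftProfile, __LINE__)(name)
#define FT_PROFILE_FN    ScopedTimer ftProfileFn(__FUNCTION__)
```

Dumping `gEvents` as JSON in the Trace Event format is enough to get the Chrome tracing views shown below.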

Instrumented profiling is fantastic for detecting performance spikes and providing a visual sense of the performance layout of your code. With a top-down view of the performance characteristics of an application, you can quickly see which functions are generally the slowest, which ones call lots of other functions, which ones have a lot of variation in execution time, etc.

In the above example of instrumented profiling in Chrome://tracing, each coloured bar corresponds to a function. Bars nested directly beneath other bars correspond to functions called by other functions. The length of the bar is proportional to the time taken to execute that function.

While instrumented profiling provides valuable visual information, it’s not without downsides. Instrumented profiling is intrusive and will slow down the execution of your program. The more you instrument, the slower the program gets. Care must be taken when writing such a profiler to make it as fast as possible, minimising its impact on the application’s performance. Additionally, the instrumentation process is manual so you must mark up each of the functions you want measured. If you miss a slow function, you’ll find a large gap in your profile output. You won’t get per line information on performance either - you can mark up scopes easily enough, but the overhead of instrumented profiling usually dwarfs the cost of individual lines of code, making it useless in that situation.

Sampling Profilers

Sampling profilers work by interrogating the execution state of the process you want to profile. They periodically sample and store the Program Counter (PC), which tells you which instruction is currently being executed, and the Stack, which tells you which functions called the function containing the current instruction. This information is useful since the functions or lines with the most samples are the ones consuming the most time. The longer a sampling profiler runs, the more instruction samples and stacks are collected, which leads to better results.

Sampling profilers provide a fantastic low-level view of the performance characteristics of a program and require no manual markup like instrumented profilers do. Additionally, they automatically cover the entire execution state of your program. They do have a couple of major drawbacks, though: they don’t show per-frame spikes very well, and they don’t tell us when a particular function is being called relative to other functions, so we have less hierarchical call information than a good instrumented profiler provides.

Specialised Profilers

Specialised profilers provide specific information about a process. Often, they are associated with a piece of hardware like a CPU or GPU that can emit particular events when certain things happen, like a cache miss or branch misprediction. Hardware manufacturers have built in the ability to measure these events to help identify the causes of performance issues and, as such, understanding these profiles can require some understanding of the hardware being used.

Game-Specific Profiling

At a much higher level, game-specific profiling might record things like the number of minions on screen or number of visible particle effects in view. This too is very useful and can help identify errors in game logic at a high level.

Profiling

Profiling an application without something to compare it against is pointless, so we need to ensure that we have a solid test case when optimising. This is harder than it sounds - modern computers don’t just run your application, they run dozens if not hundreds of other processes at the same time, switching between them all constantly. This means other processes can slow down the process you’re profiling, either via device contention (like multiple processes trying to read from disk) or CPU/GPU contention. To provide a good baseline you should perform multiple profile passes on the code you want to profile before you even begin. If you find a lot of variation in performance between runs you’ll need to figure out why and try to remove or at least reduce that variation.

Once you’ve reduced the variation as much as you can, it’s important to remember that small improvements (smaller than your variation) will be hard to measure as they could exist within the noise of your system. For example, imagine a game that runs between 14 and 18ms per frame on a given scene, averaging around 16ms per frame. Imagine further that you have spent 2 weeks optimising a function in this game and you now re-profile it to find that it runs at 15.5ms per frame. Have you sped it up? To be sure you’ll need to run the game many times, profiling that given scene and then averaging those results to see if the trend of frame times is better than it was before. In the case I’ll present here, I measure hundreds of frames and average them all to get a fairly reliable measurement.
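The averaging step itself is trivial, but it’s worth being explicit about, since it’s the backbone of every before/after comparison in this series. A minimal sketch (function name is my own):

```cpp
#include <numeric>
#include <vector>

// Average hundreds of per-frame times so that one noisy frame can't
// masquerade as a real regression or improvement. Comparing these
// averages across runs is how we establish a baseline.
double AverageFrameMs(const std::vector<double>& frameMs) {
    if (frameMs.empty()) return 0.0;
    double total = std::accumulate(frameMs.begin(), frameMs.end(), 0.0);
    return total / static_cast<double>(frameMs.size());
}
```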

Additionally, many games run on multiple threads and these threads are executed in an order determined by your hardware and operating system, which can introduce non-deterministic behaviour or at least variable execution times. You need to be aware of what impact these factors have on the case you’re optimising.

This is why I’ve built a small test case to profile and optimise. It’s simple enough to understand and complex enough that there’s still room for significant performance optimisation. Note that for the sake of simplicity I turn off rendering when profiling so that we’re only looking at CPU-related costs.

Profiling the Code

Below is a high level look at the code we’re going to optimise. Keep in mind that a single example can only teach us so much about profiling. You’ll certainly run into unique challenges with your own code, and I hope that this article helps you build your own framework for diagnosis.

{
    FT_PROFILE("Update");
    mRootNode->Update();
}
{
    FT_PROFILE("GetWBS");
    BoundingSphere totalVolume = mRootNode->GetWorldBoundingSphere(Matrix4::identity());
}
{
    FT_PROFILE("cull");
    uint8_t clipFlags = 63;
    mRootNode->Cull(clipFlags);
}
{
    FT_PROFILE("Node Render");
    mRootNode->Render(mvp);
}

I’ve added some instrumented macros, FT_PROFILE(), to different scopes in order to measure the time it takes to execute those sections of the code. We’ll go into more detail on what each section of the code does later.

When I run this code and record the data from the instrumented profile, I get the following view in Chrome://tracing:

This is the view of one frame. We can see the relative duration for each of the function calls above. Note that we can also see the calling order. If we’d instrumented the functions that these functions call, we would see them directly under the bar that corresponds to the parent function. As an example, I’ve instrumented Node::Update() and we can see the recursive structure of the calls below:

The time taken to execute a single frame of this code varies by a couple of milliseconds, so we take an average over at least a few hundred frames and use that as a baseline to compare against. As we can see below, over 297 frames, we have an average frame time of 17.5ms with some frames taking up to 19ms and some as little as 16.5ms even though each frame is doing pretty much the same thing. This is the implicit frame variation. Running this multiple times and comparing our frame times consistently gives us values around 17.5ms, so this should be a decent base time to compare against.

Running AMD’s CodeXL sampling profiler over the same code with instrumentation turned off gives the following view:

If we look at the 5 hottest functions, we see the following:

Looks like the matrix multiply is the slowest function. That would make sense, as it’s doing a lot of number crunching to perform all those rotations. If we look closer at the stack hierarchy two images up, you’ll notice that the Matrix multiply operator is called by Modifier::Update(), Node::Update(), GetWorldBoundingSphere(), and Node::Render(). Given that it’s called so frequently and from so many places, it looks like a good candidate for optimisation.

Matrix::operator*

Inspecting the multiply code with the sampling profiler, we can see a cost per line.

Unfortunately, the matrix multiply code is only one line long (effectively) so this doesn’t tell us much. Or does it?

Looking at the assembly, we can identify the function prolog and epilog.

This is the internal instruction cost of a function call. The prolog sets up the new stack space (ESP is the current stack pointer and EBP is the base pointer for the current stack frame) and the epilog clears it and returns. Every time you call a function that uses any stack space (i.e. has a local variable) and isn’t inlined, these instructions will be inserted and called.

Let’s expand the rest of the function so we can see what the matrix multiply is really doing.

Wow - that’s a lot of code! And that’s only the first page. The full function is over one kilobyte of code with around 250 to 300 instructions! Let’s look closer at the start of this function.

The line above the blue highlight accounts for 10% of the total time. Why is that line much slower than the lines around it? That MOVSS instruction takes a float from the memory location eax+34h and puts it in register xmm4. The line above it does a similar thing to xmm1 yet it’s much faster. Why?

The short answer is that a cache miss is causing the slow down.

The longer answer is that the attribution of samples to individual instructions is subject to a wide range of things; modern CPUs have multiple instructions in flight at any instant in time and multiple instructions can retire per clock cycle. Even things like event-based samples can attribute events to the wrong instruction. So, when looking at samples in assembly, some logic must be applied. In the case we’re looking at, the instruction that has the most samples may not actually be the slowest instruction. What we can say with some level of certainty is that something around that line is slow. And since the CPU is doing a series of MOVs from and to memory, we can guess that the cause of the slow down is probably these MOVs. To be certain, we could run a profile with event-based sampling enabled for cache misses and see what that reports, but for now let’s trust our instincts and run with the cache miss hypothesis.

An L3 cache miss takes well over 100 cycles (hundreds in some cases), while an L2 cache miss takes about 40 cycles, though both depend heavily on the CPU hardware. By comparison, x86 instructions vary from 1 cycle to around 100 with most being less than 20 cycles (some divide instructions on some hardware are quite slow). With the Core i7 I’ve tested on, multiply, add, and even divide are on the order of a few cycles. The instructions pipeline, so multiple instructions are in process at the same time. This means a single L3 cache miss - a single load direct from memory - can take as long as executing hundreds of instructions. Quite simply, reading from memory is very slow.
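The magnitude of this penalty is easy to feel with a toy access-pattern experiment (a sketch of my own, not the article’s code): summing the same array via a sequential index order versus a shuffled one does identical arithmetic, but the shuffled walk defeats the hardware prefetcher and pays a cache miss on nearly every step once the array outgrows the caches.

```cpp
#include <cstddef>
#include <vector>

// Identical work, different memory-access order. Fed a sequential
// `order`, the hardware prefetcher hides memory latency; fed a
// shuffled `order` over a large array, nearly every read misses
// cache and the loop runs several times slower doing the same sums.
double SumByIndex(const std::vector<float>& data,
                  const std::vector<std::size_t>& order) {
    double sum = 0.0;
    for (std::size_t i : order)
        sum += data[i];
    return sum;
}
```

Try timing this over a few million floats with `std::shuffle` applied to the order - the gap between the two runs is the cost of reading from memory rather than cache.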

Modifier::Update()

We now see that memory access slows our code down, so let’s take a step back and look at what our C++ code is doing to cause this. Our instrumented profiler showed us that Node::Update() is slow, while the sampling profiler’s stack info exposed Modifier::Update() as a particularly slow function. That looks like a good place to start.

void Modifier::Update()
{
    for(Object* obj : mObjects)
    {
        Matrix4 mat = obj->GetTransform();
        mat = mTransform*mat;
        obj->SetTransform(mat);
    }
}

Modifier::Update() runs through a vector of pointers to Objects, grabbing their transform matrix, multiplying it by the Modifier’s mTransform matrix, and then setting that transform back in the Object. Looking at the code above we can see that the code is actually copying the transform from the object to the stack, multiplying it, and then copying it back.

GetTransform() just returns a copy of mTransform while SetTransform() copies a new matrix to mTransform and sets the mDirty state of that object:

void SetDirty(bool dirty)
{
    if (dirty && (dirty != mDirty))
    {
        if (mParent)
            mParent->SetDirty(dirty);
    }
    mDirty = dirty;
}

void SetTransform(Matrix4& transform)
{
    mTransform = transform;
    SetDirty(true);
}

inline const Matrix4 GetTransform() const { return mTransform; }

The internal data layout of the Object looks like this:

protected:
    Matrix4        mTransform;
    Matrix4        mWorldTransform;
    BoundingSphere mBoundingSphere;
    BoundingSphere mWorldBoundingSphere;
    bool           m_IsVisible = true;
    const char*    mName;
    bool           mDirty = true;
    Object*        mParent;
};

To give a better idea of the layout, I’ve colorized a Node object’s memory by entry:

From the top, the first entry is the virtual table pointer. This is part of the C++ inheritance implementation. It’s a pointer to an array of function pointers which serve as the virtual functions for this particular polymorphic object. There’s a different vtable for Object, Node, Modifier, and any class inherited from the base class (we’ll see this in more detail in a future article). After this 4-byte pointer there’s a 64-byte array of floats; the matrix mTransform, followed by another matrix mWorldTransform and then two bounding spheres. Note that the next entry, m_IsVisible, is a single byte but takes up a full 4 bytes. This is normal, as the next entry is a pointer and needs to be at least 4-byte aligned. If we had put another boolean next to m_IsVisible, it would have packed neatly into the available 3 bytes. The mName pointer is next (4-byte aligned), then the mDirty boolean (also badly packed), then a pointer to the parent Object. That’s all of the Object-specific data. The mObjects vector after that corresponds to the Node-specific vector and is 12 bytes on this platform, though it can be a different size on other platforms.
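You can verify this padding behaviour yourself. The sketch below uses a simplified, non-polymorphic stand-in for Object (offsetof is only well-defined for standard-layout types, which the real polymorphic Object is not), and the offsets it reports reflect whatever build you compile, not necessarily the 32-bit figures above:

```cpp
#include <cstddef>

// A simplified, non-polymorphic stand-in for the Object layout above,
// so that offsetof is well-defined. The two lone bools each get padded
// up to the alignment of the pointer that follows them - the "badly
// packed" bytes described in the text.
struct Layout {
    float mTransform[16];       // 64 bytes
    float mWorldTransform[16];  // 64 bytes
    bool  mIsVisible;           // 1 byte + padding before the pointer
    const char* mName;
    bool  mDirty;               // 1 byte + padding again
    void* mParent;
};
```

Printing `offsetof(Layout, mName) - offsetof(Layout, mIsVisible)` shows the wasted bytes directly; grouping the two bools together would reclaim them.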

If we step through the Modifier::Update() code we can see where the cache misses might come from.

void Modifier::Update()
{
    for(Object* obj : mObjects)
    {

Firstly, the mObjects vector is an array of pointers to Objects which are dynamically allocated. Iterating through that vector is cache-friendly (the red arrows below) as each pointer is just after the last one. There are few cache misses there, but what they point to is probably not cache-friendly. Since each Object has been allocated with new, all we can say is that it’s somewhere in memory.
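The split between the cache-friendly pointer array and the scattered objects it points to can be sketched like this (a toy illustration of my own, not the article’s code):

```cpp
#include <vector>

struct Obj { float transform[16]; };

// The vector's pointer storage is contiguous, so iterating it reads
// sequential memory: cheap and prefetch-friendly. Each dereference,
// though, jumps to wherever the allocator happened to place that Obj,
// which is where the cache misses come from.
float SumFirstElements(const std::vector<Obj*>& objects) {
    float sum = 0.0f;
    for (const Obj* obj : objects)   // sequential reads: cheap
        sum += obj->transform[0];    // scattered reads: likely misses
    return sum;
}
```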

Once we have the Object pointer, we call GetTransform():

Matrix4 mat = obj->GetTransform();

This is an inlined function which just returns a copy of the Object’s mTransform, so the above is the equivalent of the following:

Matrix4 mat = obj->mTransform;

As you can see from the diagram above, the Objects that the mObjects array points to are scattered throughout memory. Every time we pull in a new Object and call GetTransform(), it will most likely result in a cache miss as we load in mTransform and put it on the stack. On the hardware that I’m using, a cache line is 64 bytes, so if we’re lucky and the object starts 4 bytes before a 64-byte boundary, then the entire mTransform will load into the cache at once. The more likely case is that loading mTransform will result in 2 cache misses to load the entire matrix. The arbitrary alignment of the matrix is evident from the sampling profile of Modifier::Update().
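The "1 line if we’re lucky, 2 if we’re not" arithmetic generalises to a small helper (a sketch of my own, assuming the 64-byte cache line mentioned above):

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // line size on the hardware above

// How many cache lines does an object starting at `addr` and spanning
// `bytes` touch? A 64-byte matrix fits in one line only when it is
// 64-byte aligned; at any other alignment it straddles two.
std::size_t CacheLinesTouched(std::uintptr_t addr, std::size_t bytes) {
    std::uintptr_t first = addr / kCacheLine;
    std::uintptr_t last  = (addr + bytes - 1) / kCacheLine;
    return static_cast<std::size_t>(last - first + 1);
}
```

Since mTransform sits 4 bytes into the Object (after the vtable pointer), only an Object allocated 4 bytes before a 64-byte boundary gives the matrix single-line residence - hence "the more likely case is 2 cache misses".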

In this fragment, edx is the location of the Object, and as we know, 4 bytes from the start of the object is where mTransform starts. So this code is copying mTransform to the stack (MOVUPS copies 4 unaligned floats to a register). Note the 7% hits for the last 3 MOVUPS instructions. This indicates that the cache misses are equally spread amongst those MOVs. I’m not sure why the first MOVUPS to stack isn’t as costly as the others. I *think* what is happening is that the cost is being deferred to the subsequent MOVUPS due to the magic of instruction pipelining (if you know why, please let me know in a comment). Regardless, what we’re seeing here looks like evidence of costly memory access, so we’ll run with that.

void Modifier::Update()
{
    for(Object* obj : mObjects)
    {
        Matrix4 mat = obj->GetTransform();
        mat = mTransform*mat;
        obj->SetTransform(mat);
    }
}

After the matrix multiplication, we call Object::SetTransform() which takes the result of the multiplication (newly placed on the stack) and copies it into that Object instance. The copy is pretty fast as that transform is currently cached, but the SetDirty() is slow as it’s reading the mDirty flag which is probably not in the cache. So, to test and potentially set that one byte, the CPU will have to read in another 63 bytes around it.

Conclusion

If you’ve read this far - well done! I know this can be very confusing at first, especially if you’re unfamiliar with assembly, but I thoroughly recommend that everyone take the time to look at what compilers do with the code they write. A good tool for this is the Compiler Explorer.

So far, we’ve collected some evidence that points to memory access as a major cause of our performance problems in the example code. The next step is to try and minimise our memory access costs and then measure any differences in performance to see if we are correct.

And that’s exactly what we’ll do next time. I hope you enjoyed the first part of this journey. In the next article, we’ll dig deeper into our example code, identifying and then optimizing the bottlenecks that we find. Eventually, we’ll look at some examples from the League of Legends codebase.

Stay tuned for the rest of the series:

Part 1: Measurement and Analysis (this article)

Part 2: Optimisation

Part 3: Real World Performance in League

Part 4: The Case of the Missing Milliseconds