Sometimes being hot is not cool.

When Valve’s customers have performance problems in our games we sometimes ask them to send in xperf traces for us to examine. In some cases this lets me find performance bugs that didn’t show up during our testing, and fixing these issues makes our games run faster for everyone.

In some situations, however, what I find is significantly stranger. In at least four cases this year (2013) these traces showed performance problems caused by thermal throttling – CPUs overheating enough that they throttle back their performance in order to stop themselves from melting, and the computers’ owners had no idea.

Update October, 2015: IMHO, if a CPU is advertised as an n-core X-GHz part then an inability to run all n cores at X GHz is false advertising or defective hardware. If you care about performance then you should buy computers that advertise their specs and you should hold the manufacturers to them. Thermal throttling of phones and ultra-thin laptops may be okay, but it should be documented that they cannot always run at full speed.

I want to share this because I think it’s nerdy cool/horrible, because developers should know about this possibility, and because there may be a lot of people out there whose games are running poorly because of overheating.

Thermal throttling is extremely difficult to detect in xperf traces. It is done automatically by the CPU or the motherboard, and the operating system (OS) doesn’t realize that it is happening. I use the ETW toolset for these investigations but its CPU frequency graphs show the CPU running at full throttle and its power management events say that all is well, and yet…

Thermal throttling confirmations

One might reasonably ask how I know that these computers – customer machines that I have never seen – were being thermally throttled. Before I explain what originally made me suspect thermal throttling I’ll explain the confirmations that I received after doing my analysis.

In the first case the customer was running an AMD Phenom processor. When I told the customer of my suspicions they said that they had recently replaced their processor heat sink. When they put the old heat sink back on their performance problems went away. Myth confirmed!

In the second case the thermal throttling theory was never confirmed but is still strongly suspected. Myth plausible?

In the third case the customer was running an Intel Core i3 550 processor and, after analysis suggested thermal throttling, I asked them to install and run RealTemp, which shows readings from the CPU’s temperature sensors. The screen to the right shows the results: both cores reached temperatures of 105 degrees Celsius – hotter than the boiling point of water! The “LOG” indicators under Thermal Status indicate that the cores were thermally throttled. Once again, myth confirmed.

In the fourth case the customer was running an AMD FX-8150 processor. They had already run temperature monitoring software and it showed reasonable temperatures, but the trace clearly showed that the CPU was being throttled down to 30% of normal speed. This frequency throttling was confirmed using CPU-Z, and when they disabled AMD Turbo Core (thus reducing the maximum CPU speed) the problem went away. CPU throttling confirmed, cause uncertain.

In all cases the customer had initially said that the problem must be with our game because only our game was hitting these mysterious slowdowns. I don’t doubt their story, but this shows how difficult it is to find the cause of a slowdown. I rely on xperf traces because they give me enough information to identify slowdowns through science instead of through guessing.

Thermal throttling signatures

Because the CPU doesn’t tell the OS when it is thermally throttling there are no direct indicators in an xperf trace that the CPU has been slowed down. The trace will show the frame rate dropping suddenly with the game CPU bound, but that can have many causes. The first suspicious sign is that every part of the frame loop runs more slowly. Normally a frame rate drop is caused by the load on one or two systems increasing, so a sudden across-the-board slowdown is unusual.

UIforETW has several options to make detection of thermal throttling easier. If Intel Power Gadget is installed then UIforETW will record the CPU temperature, and UIforETW also periodically measures the actual CPU frequency.

In order to interpret the data correctly it is important to understand your game’s architecture. Some systems, like rendering, will happily use all available CPU time. Other systems, perhaps game simulation or audio, run a fixed number of times per second with their workload per second roughly constant. So, when a CPU is running more slowly the tasks with a fixed workload will take longer. This leaves less time for rendering, so the amount of time spent in rendering will actually drop. It takes a careful eye to distinguish increased load (such as more simulation or audio work) from a slower CPU.
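The distinction between increased load and a slower CPU can be sketched as a small check over per-subsystem CPU times. This is only an illustration, not the analysis I actually ran: the subsystem names, the sample numbers, and the two thresholds (`min_ratio` and `spread`) are all assumptions chosen for the example, and only fixed-workload subsystems should be fed in.

```python
def slowdown_ratios(before_ms, after_ms):
    """Return after/before CPU-time ratios for subsystems present in both samples."""
    return {
        name: after_ms[name] / before_ms[name]
        for name in before_ms
        if name in after_ms and before_ms[name] > 0
    }


def looks_like_slower_cpu(before_ms, after_ms, min_ratio=1.5, spread=0.25):
    """Heuristic: every fixed-workload subsystem slowed down by a similar factor.

    min_ratio and spread are illustrative thresholds, not measured values.
    """
    ratios = list(slowdown_ratios(before_ms, after_ms).values())
    if len(ratios) < 2:
        return False  # one subsystem alone cannot separate load from CPU speed
    lowest, highest = min(ratios), max(ratios)
    return lowest >= min_ratio and (highest - lowest) / lowest <= spread


# Hypothetical CPU milliseconds per tick for fixed-workload subsystems.
before = {"simulation": 2.0, "audio": 0.5, "physics": 1.0}
everything_slower = {"simulation": 4.8, "audio": 1.2, "physics": 2.5}
more_simulation_work = {"simulation": 4.8, "audio": 0.5, "physics": 1.0}

print(looks_like_slower_cpu(before, everything_slower))     # True
print(looks_like_slower_cpu(before, more_simulation_work))  # False
```

Rendering is deliberately left out of the input because it expands to fill whatever CPU time is left, so its ratio says nothing about CPU speed.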

If a subsystem is taking longer to run than normal this can have many causes: CPU starvation (other threads stealing CPU time), page faults, disk I/O, or other things that would cause the CPU to not run the subsystem’s code. CPU throttling should only be suspected if a subsystem is taking more CPU time – that is, time spent executing code, not just elapsed time. Luckily, the WPA CPU usage graphs show us how much CPU time is consumed. The CPU Usage (Sampled) data shows a statistical estimate of where CPU time was spent, handy for finding which subsystems are consuming CPU time.

The CPU Usage (Precise) data gives us an extremely accurate measure of how much CPU time our process is consuming in total. This lets us be certain that we are correctly accounting for all CPU time.

Another reason why the CPU might run our code more slowly is cache contention – other code evicting our code or data from the cache. This would require that there be other code running that is using the cache heavily, and in all cases that I examined it was clear that there was not enough other code running for this to be a problem.

So, we know that the CPU is taking longer to do what we believe is a constant workload. But in order to confirm that this slowdown is caused by the processor running more slowly, we really need something else that has a constant CPU workload so that we can see if it is also running more slowly. Luckily we have just such a thing.

Audiodg

The xperf traces showed a sudden drop in frame rate and a sudden increase in CPU consumption in our game, but that could still represent a bug in our code. An extra vote was needed – something else that should have a constant CPU load so I could see if it also slowed down. I found that in the Windows Audio Device Graph Isolation process, also known as audiodg. This process does audio processing and, in my experience, its CPU overhead is very stable during normal gameplay, changing only occasionally. So, it was very interesting when I noticed that the CPU consumption of audiodg increased by a factor of 2.4 at exactly the same time that our game started slowing down.

Enough talk – here’s a pretty picture:

This is a Windows Performance Analyzer (WPA – the xperf trace viewer) screenshot from an xperf trace which caught a customer’s CPU in the act of being throttled. The blue diamonds at the top represent frame boundaries. The game started at 100-300 fps (a solid line of diamonds) and then plummeted to about 15 fps. The jagged blue graph represents CPU consumption by the game process – the spikes on the right are variations in the number of active game threads, peaking once per frame.

The red graph along the bottom is the CPU consumption of audiodg. You can see, plain as day, that audiodg’s CPU consumption increases significantly (about 2.4x) at exactly the same time that the frame rate drops.

As an additional vote take a look at the green spikes. Those are from sidebar.exe which is updating the clock gadget once a second. After the CPU slows down it takes more CPU time to update the clock. That is what made me certain that the CPU was going slower, because the load on those two processes is quite constant, so if they are taking longer to run it must be because the CPU is slower.

Of course, one explanation for this would be that audiodg and sidebar were the cause of the problem. Correlation is sometimes causation, so maybe they were starving the game of CPU time. That is certainly something that had to be considered, but it was clearly not the case. Audiodg and sidebar went from using about 1% of CPU time to about 2.4% of CPU time – they couldn’t starve anybody.

This graph is from another customer, zoomed in to show the dramatic increase in audiodg CPU usage, from 0.79% of total CPU power to 2.37% – a threefold increase at precisely the time where the frame rate drops. Meanwhile the CPU Frequency graph says that all CPUs are running steadily at 3.6 GHz – but that just isn’t true.
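The arithmetic behind these numbers is simple enough to sketch. Assuming the process really does have a constant workload, the ratio of its CPU usage before and after the slowdown gives a rough estimate of how far the effective CPU speed has dropped. The function name here is made up for illustration, and the estimate is only as good as the constant-workload assumption.

```python
def implied_speed_fraction(cpu_before_pct, cpu_after_pct):
    """Estimate effective CPU speed after a slowdown as a fraction of the speed before.

    Assumes the observed process does the same amount of work in both periods,
    so any extra CPU time is explained by the CPU running more slowly.
    """
    if cpu_before_pct <= 0 or cpu_after_pct <= 0:
        raise ValueError("CPU usage percentages must be positive")
    return cpu_before_pct / cpu_after_pct


# audiodg going from 0.79% to 2.37% of total CPU time:
print(round(implied_speed_fraction(0.79, 2.37), 2))  # 0.33

# audiodg and sidebar going from about 1% to about 2.4%:
print(round(implied_speed_fraction(1.0, 2.4), 2))  # 0.42
```

A threefold increase in CPU time therefore suggests a CPU running at roughly a third of its previous speed, whatever the CPU Frequency graph claims.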

Measuring CPU frequency

Diagnosing these problems is tricky, and I am a fundamentally lazy person so I decided to write some code to make my job easier. I wrote some test code that measures the frequency of a processor. This code starts by creating one high priority thread for each logical core on the system. Every five seconds these threads wake up. They call a function that does 500,000 dependent integer adds, which on any modern processor should take 500,000 clock cycles. They use QueryPerformanceCounter to time the code and infer the clock frequency. Because this function will sometimes be slowed down by an interrupt I call it seven times and retain the fastest clock frequency, which I emit into the xperf trace. It’s crude, but effective. Here are some actual results from a machine that was hitting performance problems, graphed in Excel:
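The inference step can be sketched in a few lines. This is not the actual test code (that is native code timed with QueryPerformanceCounter); it is a Python stand-in that shows only the arithmetic and the best-of-seven filtering. Python cannot execute one dependent add per clock cycle, so `time_runs` is useful here only with a workload whose cycle count is known, and all function names are assumptions made for the example.

```python
import time

ADDS_PER_RUN = 500_000  # dependent adds, assumed to cost one clock cycle each
RUNS = 7                # repeat to filter out runs slowed by interrupts


def time_runs(workload, runs=RUNS):
    """Time a callable several times and return the elapsed seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return timings


def best_frequency_hz(timings, cycles=ADDS_PER_RUN):
    """Infer clock frequency from the fastest run (shortest elapsed time)."""
    valid = [t for t in timings if t > 0]
    if not valid:
        raise ValueError("no usable timings")
    return cycles / min(valid)


# Hypothetical timings in seconds; the slower runs were hit by interrupts.
sample_timings = [0.000139, 0.000125, 0.000410, 0.000126,
                  0.000125, 0.000131, 0.000127]
print(best_frequency_hz(sample_timings) / 1e9)  # about 4.0 (GHz)
```

Keeping the fastest run rather than the average is the important design choice: an interrupt can only make a run slower, so the shortest time is the one closest to the true cycle count.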

UIforETW contains an updated and improved version of this frequency measurement code, sampling the frequency every three seconds. And, with WPA 10 you can graph the results inside WPA.

I find it remarkable how stable the results are – except when the CPU suddenly and catastrophically dropped to 27% of its previous clock rate, sending the frame rate plummeting. The drop in frequency was perfectly correlated with the game performance drop, increased audiodg CPU usage, and real-time frequency monitoring from CPU-Z.

What to do?

If you suspect your PC is not performing as well as it should then it is worth checking to see if your CPU is overheating or otherwise being throttled. There are a number of tools which can help you do this. Keep in mind that RealTemp (Intel only) is the only one that I have used, and I’m not actually endorsing any of these, but here are a few options:

SpeedFan – http://www.almico.com/speedfan.php

RealTemp – http://www.techpowerup.com/realtemp/

CoreTemp – http://alcpu.com/CoreTemp/

Speccy – http://www.piriform.com/speccy

Intel® Extreme Tuning Utility (with thermal throttling graph!) – http://www.intel.com/content/www/us/en/motherboards/desktop-motherboards/desktop-boards-software-extreme-tuning-utility.html

Intel Power Gadget – https://software.intel.com/en-us/articles/intel-power-gadget-20

In the most recent case the temperature monitoring tools insisted that all was well, but the CPU frequency was still dropping during gameplay. It is fine for your CPU’s frequency to drop when your machine is under light load, or when running on battery – that saves a lot of power which can extend your battery life, reduce your power bill, and keep your house cool. It’s also fine if your machine doesn’t stay at its Turbo Boost or Turbo Core frequency (temporarily raised frequencies) for long. However, your CPU should be able to maintain its rated frequency under load. If it cannot then your machine is not behaving correctly, either due to bad design or defective parts. Therefore, even if your machine is not overheating you may want to try monitoring its CPU frequency to see if it is dropping when your game performance drops. To be precise, if you are running a game that is CPU bound and your CPU frequency drops when game performance drops then the reduced frequency is probably the problem. Many of the temperature monitoring tools can display CPU frequency, as can CPU-Z.
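If you log frequency readings while a CPU-bound game is running, the check described above amounts to flagging any reading that falls well below the rated frequency. Here is a minimal sketch, assuming the readings have already been collected as (seconds, MHz) pairs by some monitoring tool. The sample values, the function name, and the 90% tolerance are all assumptions made for illustration.

```python
def throttled_samples(samples, rated_mhz, tolerance=0.9):
    """Return the (time_s, mhz) samples that fall below tolerance * rated_mhz.

    Only meaningful for samples taken while the CPU is under load; dropping
    below the rated frequency when idle or on battery is expected.
    """
    floor = rated_mhz * tolerance
    return [(t, mhz) for t, mhz in samples if mhz < floor]


# Hypothetical readings from a 3600 MHz CPU during gameplay.
readings = [(0, 3600), (3, 3598), (6, 972), (9, 975)]
print(throttled_samples(readings, rated_mhz=3600))  # [(6, 972), (9, 975)]
```

Comparing against the rated frequency rather than the turbo frequency keeps the check from flagging a machine that has merely dropped out of its temporarily raised frequency.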

If you suspect your CPU is overheating then there are a few steps that you can try:

Open the case and check for dust, especially on the heat sink, fans, and the vents to the outside. Your CPU can only be cooled effectively if cool air comes from outside the case and is pulled over the heat sink by the fan. Dust can be removed manually or with compressed air

Make sure your computer is not in an enclosed space – a computer in a stereo cabinet may not get enough cool air

If you have replaced the heat sink then be sure that it is rated for your processor, is firmly attached and is using the recommended thermal paste. CPUs need to dissipate a lot of heat and tiny obstacles can slow this process

Unfortunately some computer cases are just designed poorly. If your case is badly designed then your CPU may be trying to cool itself with recirculated hot air. In one test a poorly designed case was fixed by simply adding a plastic tube that directed cool air from the case vent to the CPU fan and this lowered the CPU temperature by 20-25 degrees Celsius! However, I don’t recommend trying to fix poorly designed cases yourself – buying a case that is properly designed is a better option.

Please share your experiences with finding (and fixing!) overheating problems.

Extrapolating from anecdotes

I’m sure that there are a lot of unsuspecting people with this problem, but I have no idea how many because it’s tough to extrapolate from my highly biased sample to the computing population at large. It’s tempting to write a test that will proactively look for this problem, but since thermal throttling is workload dependent it is impossible for such a test to say whether the games that you play will trigger thermal throttling.

Data from a range of customers playing games showed that several percent of them were being significantly thermally throttled.

I’m hopeful that this post will raise awareness of the issue and that the suggestions will let users detect whether or not they are hitting this problem. In the end I suspect that most cases of poor per-game performance are, in fact, due to bugs in the game or other less esoteric causes – but sometimes you need to look to your hardware.

Code for measuring CPU frequency

The code for measuring CPU frequency can now be found in UIforETW, right here.

I hope that some day the ETW code in Windows that provides the CPU frequency will be fixed to detect thermal throttling – and a temperature provider would also be nice. Chapter 14 in the Intel Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1 would be a good starting point… UIforETW now has the ability to measure CPU temperature directly, as long as Intel Power Gadget is installed. Recommended.

Other reading

If you liked this bit of investigative reporting then you might want to look at other discoveries from the series, such as how a driver caused 4 GB of RAM to be allocated, and why Alt+Tab behaves badly for some people.

If you want to do this type of investigation then read all about xperf, practice, and investigate the next performance slowdown you see.

The reddit discussion of this post can be found here.