Update 2018-10-30: the problem is fixed in today's PTS patch. Cheers to the devs!

TL;DR: The current version of Quake Champions (October 2018 patch) may perform badly if your Windows uses HPET for timekeeping. You can check whether this is your case by running a benchmark from this page. If you get dozens of ns per call, you are good (Windows uses TSC for timekeeping); if you get thousands, Windows is most likely using HPET. In that case check the useplatformclock flag. If useplatformclock is false and QPC is still slow, try upgrading your motherboard firmware (this helped in my case) and disabling CPU overclocking.

QC and QPC

Context

I used to have pretty good FPS in QC, something like 160-180 in Duels and 120-150 in modes with 8 players. It wasn’t exactly “buttery smooth”, but it was really OK and I was happy with it. The last (September) patch promised even more optimizations in the patch notes and I was looking forward to it. To my surprise, after the update my FPS dropped to about 60% of what it had been before:

Something was obviously wrong, especially since most players were saying that the patch had improved performance for them. I checked my settings everywhere: in the game, in the Nvidia control panel, in the overclocking settings. Everything was exactly the same as before the patch. Nothing had changed in my PC between the time I played the previous version with good FPS and the time I ran the new version with sub-100 FPS. Just to be on the safe side (and to compare performance between Windows 7 and Windows 10) I took an empty SSD and installed everything from scratch (Windows 10, all the latest drivers, the game, no additional software at all). FPS was the same: way below 100 in 8-player modes and only about 120 on an empty map. At that point I was almost sure the problem was not in my PC. I posted my story to r/QuakeChampions and the Bethesda forum and found out that I was not alone with this problem. There were also reports in other threads:

https://bethesda.net/community/post/1234640 (my report)

https://goo.gl/nvtFqY

https://bethesda.net/community/post/1179328

https://goo.gl/uDagpw

https://goo.gl/j1sQm5

https://bethesda.net/community/post/1234669

https://bethesda.net/community/post/1234687

https://bethesda.net/community/post/1235050

https://bethesda.net/community/post/1235263

https://bethesda.net/community/post/1236690

https://bethesda.net/community/post/1180022

https://bethesda.net/community/post/1237459

...

A friend of mine also reported the same thing. So there was a small fraction of players who didn’t get the promised optimizations but instead found their game unplayable. What was really strange was that there was no single common factor between the reports: the problem appeared on different CPUs and GPUs. Intel, AMD, Nvidia, it didn’t matter. I have a Ryzen 1500X @ 3.9 GHz, 16 GB of dual-channel 3200 MHz RAM, a GTX 1060 3 GB and an ASRock AB350M-HDV motherboard. My friend has an Intel CPU and still has the same FPS drop.

Identifying the problem

At first I tried experimenting with my settings. I set quality to the lowest and changed the render scale to 25%. FPS didn’t change at all. MSI Afterburner showed that the GPU was only 15-20% utilized. However, all 8 logical CPU cores (4 physical cores with 2-way SMT) were 90-100% busy. This told me that the graphics subsystem could be excluded from the list of suspects: FPS was heavily bottlenecked by the CPU. So the next step was to download some profilers and see what was happening. I don’t use a profiler every day, so I just chose some names I liked from a Wikipedia page :D I tried GlowCode, CodeXL, and Intel VTune. The first one was almost useless, but the other two showed me pretty much the same picture. Here is a screenshot from CodeXL:

Almost half of the time the CPU was busy with one function, NtQueryPerformanceCounter (you can see that a function named KeSynchronizeExecution is in first place, but it’s actually mostly called from inside NtQueryPerformanceCounter). It doesn’t look good, does it?

What is QueryPerformanceCounter?

I will refer to this function as QueryPerformanceCounter or just QPC, because NtQueryPerformanceCounter is in fact an internal Windows function which programs shouldn’t use directly. QueryPerformanceCounter is the actual function that the Windows API offers and that the game calls.

The performance counter is a high-resolution clock which allows accurate measurement of time intervals, usually with a resolution of at least 0.3 microseconds (BTW, note that I’m saying “resolution”, not “accuracy”. There is a difference). It can be used for analyzing the performance of a program (you can measure accurately how long different parts of a program take to execute); it is also useful for measuring the time passed between game frames (the engine needs to know it to calculate physics), for implementing a frame rate limiter etc.: anywhere you need a precise time value.

There are several hardware devices which can provide a high-resolution time stamp to the operating system or a user program (game). Modern PCs probably have at least three of them: HPET (High Precision Event Timer), the ACPI PM timer and TSC (Time Stamp Counter). Each has its own peculiarities and drawbacks. For example, it’s harder to access a timer from several CPU cores when there is a single timer for the whole system (the case of HPET and ACPI PM). On the other hand, when there is a separate timer for each CPU core, it’s hard (or even impossible) to make them all run at exactly the same speed, so the timers will return different results on different cores (the case of TSC and Invariant TSC). Some timers even depend on the current CPU frequency, which changes constantly and independently in every core (the case of TSC). Such peculiarities can cause all kinds of strange bugs. For example, it may happen (and regularly does) that two consecutive queries to the TSC show that time has gone… backwards. Nothing magical: the operating system just decided to move the thread from one core to another (for performance or power saving) and the cores happened to have different counter values.

The QueryPerformanceCounter function provided by Windows takes all of this into account and guarantees that programmers can use it freely on any core and in any number of concurrent threads. QPC will always give you correct non-decreasing time stamps. But the price for it may be high. Or it may not. It depends entirely on the available hardware, the HAL (drivers and firmware) and Windows itself (some settings are under user control, but you can’t really force QPC to work in a particular way). Here is an article by Microsoft with more technical details.
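To make the "time goes backwards" effect concrete, here is a small toy simulation in Python. It is only a sketch under loud assumptions: the per-core offsets, the 10-tick step and the class names SimulatedCores and NonDecreasingClock are made up for illustration, and the simple clamp shown here is not how Windows implements QPC. It only shows why a raw per-core counter can produce a negative interval after a thread migration, and what a non-decreasing guarantee means for a single thread.

```python
class SimulatedCores:
    """Toy model: every core has its own tick counter with its own offset."""

    def __init__(self, offsets):
        self.offsets = offsets      # hypothetical per-core offsets, in ticks
        self.true_ticks = 0         # shared "real" time, in ticks
        self.current_core = 0       # core the thread is currently running on

    def read_raw(self):
        """Read the counter of the current core (an RDTSC-like raw read)."""
        self.true_ticks += 10       # pretend 10 ticks pass between reads
        return self.true_ticks + self.offsets[self.current_core]


class NonDecreasingClock:
    """Wrap a raw reader so that one thread never sees time go backwards."""

    def __init__(self, read_raw):
        self.read_raw = read_raw
        self.last = None

    def now(self):
        value = self.read_raw()
        if self.last is not None and value < self.last:
            value = self.last       # clamp: repeat the previous time stamp
        self.last = value
        return value


cores = SimulatedCores(offsets=[0, -500])
first = cores.read_raw()            # thread runs on core 0
cores.current_core = 1              # scheduler moves the thread to core 1
second = cores.read_raw()           # core 1 lags behind by 500 ticks
print("raw delta:", second - first)       # prints -490

cores = SimulatedCores(offsets=[0, -500])
clock = NonDecreasingClock(cores.read_raw)
a = clock.now()
cores.current_core = 1
b = clock.now()
print("guarded delta:", b - a)            # prints 0
```

The clamp hides the jump but loses the real elapsed time for that interval, and it is only safe with one reader thread; a guarantee that holds across threads and cores needs more work, which is part of why a general-purpose call can be expensive.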

Measuring performance of performance measuring :D

I became curious about how long a call to QPC takes on my system. Of course I first tried to find the answer on the web. I found an article describing a dramatic change in call duration after a hardware change: 11 nanoseconds on one system and 2490 nanoseconds on another (more than 220 times slower!). I wrote a small program which calls QPC one million times and measures the time taken. And I got about… 2500 nanoseconds per call! Almost exactly the same value. The article also says that resetting Windows helped to bring the result down to 22 ns. But I already had a freshly installed Windows 10 on another disk, so I just tested QPC there. Unfortunately, it was 2500 ns there as well. I tried switching HPET usage on and off both in the BIOS (UEFI) settings and in Windows (the bcdedit /set useplatformclock command). By the way, there is another function, QueryPerformanceFrequency. It returns the frequency at which QPC ticks (its resolution, if you like). QueryPerformanceFrequency showed me that changing these settings does in fact have an effect: the frequency switches between ~3 and ~14 MHz. But in both cases a call to QPC took 2500 ns, in both Windows 7 and Windows 10. I reviewed every possible setting in UEFI and googled tons of forums, but I was unable to make Windows run QPC faster than this.
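A benchmark of this kind can be sketched in a few lines of Python. This is not the program used for the numbers above; it is a minimal illustration that relies on the fact that on Windows Python's time.perf_counter_ns is built on QueryPerformanceCounter (on other systems it uses a different clock). The function name benchmark_clock is made up for this sketch. The reported figure includes interpreter and loop overhead, so it will not reproduce an 11 ns reading, but a call that costs microseconds should still stand out.

```python
import time


def benchmark_clock(clock=time.perf_counter_ns, calls=1_000_000):
    """Return the average cost of one clock() call, in nanoseconds.

    The loop and call overhead of the interpreter is included, so treat
    the result as an upper bound on the cost of the underlying timer.
    """
    start = time.perf_counter_ns()
    for _ in range(calls):
        clock()
    elapsed_ns = time.perf_counter_ns() - start
    return elapsed_ns / calls


# Which OS facility backs the clock and what resolution it reports.
# On Windows the implementation string names QueryPerformanceCounter and
# the resolution is the inverse of the performance counter frequency.
info = time.get_clock_info("perf_counter")
print("implementation:", info.implementation)
print("resolution, s:", info.resolution)
print("ns per call:", benchmark_clock(calls=100_000))
```

Running it before and after a change to useplatformclock or a firmware update gives a quick before/after comparison on the same machine.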

Confirming the guess

But is it really a problem? Would Quake run faster if QPC ran faster? At this point I wasn’t sure enough and needed to check. First I wanted to know how often Quake calls QPC. I tried to get this info from the profiler, but no luck (probably I didn’t dig deep enough). Then I thought: what if I could intercept all calls to QPC? That way I could easily count how often it is called, or even replace the system QPC with some timer implementation of my own. Shortly after, I found Microsoft Detours. It allows you to inject code into any DLL function, including system DLLs like kernel32.dll (where QPC resides). Not so shortly after :D I was able to run my custom code every time QPC was called, and I could also replace it completely.

At first I tested it with my previous program (the benchmark that calls QPC a million times). It showed that QPC was called about 400,000 times per second (no surprise: one call takes 2.5 us and everything runs in one thread). The next step was to finally check it with Quake. It didn’t work at first and I almost gave up: my code just didn’t inject. It turned out that Steam has some protection against this. I downloaded the Bethesda Launcher, installed the game and everything worked. The frequency of calls to QPC by Quake was… about 300,000-400,000 per second.

I have no idea why the game calls QPC so often. Maybe the developers added some kind of performance statistics collection and measure literally every line of code (I don’t think so). Maybe they constantly poll QPC to implement precise sleeps (looping until QPC reaches a particular value) or to align some game events in some other way. Anyway, it was not a good decision. Per Microsoft,

While RDTSC is much faster than QueryPerformanceCounter, since the latter is an API call, it is an API that can be called several hundred times per frame without any noticeable impact. (Nevertheless, developers should attempt to have their games call QueryPerformanceCounter as little as possible to avoid any performance penalty.)

Several hundred times per frame, not several thousand! And I guess this advice should be even stricter for a game like Quake, where the performance of the game affects the performance of the player so much. By the way, QueryPerformanceFrequency is also called insanely often. That’s not as harmful, because it happens to be fast. But according to Microsoft it should really be called only once:

The frequency of the performance counter is fixed at system boot and is consistent across all processors so you only need to query the frequency from QueryPerformanceFrequency as the application initializes, and then cache the result.
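To put the measured call rate next to the "several hundred times per frame" guidance, here is a back-of-the-envelope calculation in Python. The inputs are assumptions picked for illustration: 350,000 calls per second is simply the middle of the 300,000-400,000 range measured above, 100 FPS is a round number, and the helper name qpc_load is made up. The per-call costs are the 2500 ns and 11 ns figures quoted earlier.

```python
def qpc_load(calls_per_second, ns_per_call, fps):
    """Return (calls per frame, fraction of one thread's time spent in the timer)."""
    calls_per_frame = calls_per_second / fps
    busy_fraction = calls_per_second * ns_per_call / 1e9
    return calls_per_frame, busy_fraction


# Slow timer: about 2500 ns per call.
slow_calls_per_frame, slow_busy = qpc_load(350_000, 2500, 100)
# Fast timer: about 11 ns per call.
fast_calls_per_frame, fast_busy = qpc_load(350_000, 11, 100)

print(slow_calls_per_frame, slow_busy)   # 3500 calls per frame, 0.875 of the thread's time
print(fast_calls_per_frame, fast_busy)   # same 3500 calls per frame, under 0.4% of the time
```

Under these assumptions the game makes thousands of calls per frame, and with a slow timer the calling thread spends most of every second inside the timer, while with a fast timer the same number of calls is close to free. That matches the picture from the profiler.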

Fixing

I don’t have the QC source code :), so I can’t make it call QPC less often. But I could try to make QPC fast and see what happens. My timer will count backwards from time to time and break all the physics in the game? OK, no problem, I’m not going to play it, I just want to check how it could perform with a fast QPC. So I replaced the original QPC code with a single assembler instruction, RDTSC (a slight understatement: there are a few more instructions to move the result etc. :)). My benchmark now showed 10 ns per call (and sometimes a negative duration per call :D). Now we are talking! Then I ran Quake. The result was astonishing. Just compare:

Vanilla (original, not patched) game:

Game with RDTSC instead of QPC:

Twice as many FPS in the menu… That was the moment I understood it was time to open Discord and write to the community managers :D Then I started a match. 180 FPS! I hadn’t seen such numbers since the patch. Physics and animations were completely broken, however: I couldn’t move a step without teleporting. Not surprising with time jumping backwards. I had just started taking screenshots when the game crashed. For the next half an hour it kept crashing almost immediately after loading, so I couldn’t take screenshots or record video. I was probably just very lucky the first time to have it run for a whole minute.

So I started thinking about how to implement a fast, non-decreasing, thread-safe timer. But shortly after, I discovered that Quake calls QPC from only one thread! That made things trivial, because the only remaining problem with the timer was Windows moving the thread from one core to another. I added a call to SetThreadAffinityMask(GetCurrentThread(), 1) to pin the thread firmly to one (the first) core and ran the game. This time everything went perfectly. Animations and movement were smooth, no crashes. And 180+ FPS. Wow.
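For readers who want to try the pinning step in isolation, here is a rough Python sketch of the same SetThreadAffinityMask(GetCurrentThread(), 1) call made through ctypes. It is not the injected code from the fix, which ran inside the game process. The helper names affinity_mask and pin_current_thread are made up for this example, and the non-Windows branch (os.sched_setaffinity, available on Linux) is added only so the sketch is usable elsewhere.

```python
import ctypes
import os
import sys


def affinity_mask(core_index):
    """Bit mask with only the given core set; core 0 gives 1, as in the call above."""
    return 1 << core_index


def pin_current_thread(core_index=0):
    """Restrict the calling thread to a single core. Returns True on success."""
    if sys.platform == "win32":
        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        # Declare pointer-sized types so the handle and mask are not truncated.
        kernel32.GetCurrentThread.restype = ctypes.c_void_p
        kernel32.SetThreadAffinityMask.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
        kernel32.SetThreadAffinityMask.restype = ctypes.c_size_t
        previous_mask = kernel32.SetThreadAffinityMask(
            kernel32.GetCurrentThread(), affinity_mask(core_index))
        return previous_mask != 0   # the API returns the old mask, or 0 on failure
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {core_index})   # 0 refers to the caller
        return True
    return False
```

Pinning trades scheduling freedom for a consistent counter: the thread can no longer be moved to an idle core, which is one more reason to treat this as an experiment and not as a general recommendation.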

Before:

After:

I played two matches this way and it was awesome. Almost no drops below 140 FPS in an 8-player match. No problems with stability. Today I played about 5 more matches and everything was fine. Here is one recorded game.

Should developers incorporate my fix into the game?

I don’t think so; it’s a dirty fix. Here is an article from Microsoft describing why you should prefer QPC. What really should be done is to reduce the number of calls to QPC to a few dozen per frame at most. That is more than enough to run the physics. If the developers need to profile a large number of small functions in the code, RDTSC may be used, because outliers (wrong samples) can easily be ignored, and threads shouldn’t move from core to core while measuring anyway if you want meaningful results.

To cut a long story short, I would do the following:

1. Reduce the number of calls to QPC as much as possible.
2. Reduce the number of calls to QPF to 1.
3. Create a custom function for code profiling if needed. It may use QPC if it's fast on the particular system (the game could benchmark it at startup) or if Invariant TSC isn’t available (the game should check CPUID). If QPC is slow and Invariant TSC is available, use RDTSC or switch off profiling completely.
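The decision in point 3 can be written down as a tiny function. The Python sketch below only encodes the rule as stated in the list; the function name, the string labels and the 100 ns threshold are assumptions made for illustration, and the two inputs (the measured QPC cost and the Invariant TSC flag reported by CPUID) would have to be obtained by the game at startup.

```python
def choose_profiling_timer(qpc_ns_per_call, has_invariant_tsc,
                           slow_threshold_ns=100, allow_rdtsc=True):
    """Pick a time source for in-game profiling.

    qpc_ns_per_call   -- cost of one QPC call measured at startup
    has_invariant_tsc -- Invariant TSC flag as reported by CPUID
    slow_threshold_ns -- hypothetical cut-off between "fast" and "slow" QPC
    allow_rdtsc       -- set to False to disable profiling instead of using RDTSC
    """
    if qpc_ns_per_call <= slow_threshold_ns:
        return "qpc"        # QPC is cheap here, so just use it
    if not has_invariant_tsc:
        return "qpc"        # no trustworthy TSC, stay with QPC
    return "rdtsc" if allow_rdtsc else "off"


print(choose_profiling_timer(11, True))      # qpc
print(choose_profiling_timer(2500, True))    # rdtsc
print(choose_profiling_timer(2500, False))   # qpc
```

Points 1 and 2 still apply whichever source is chosen: the fewer timer calls per frame, the less the choice matters.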

Will I share the fix with everyone affected by the problem?

I don’t currently plan to do so, because I’m not sure it will work well on other systems and I don’t have time to provide support for everybody if it doesn’t. I believe the developers will find a better way in the near future. I would also have to improve some things before distributing the fix. For example, I just measured my TSC frequency once and hardcoded it. A portable fix must measure the actual frequency on every run. If I find the time and motivation to do it, I may share a binary with a disclaimer: “Run at your own risk. Don’t ask me if it doesn’t work”.
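Measuring the frequency at startup instead of hardcoding it could look roughly like the Python sketch below: read the unknown counter and a reference clock, wait a little, read both again and divide. Python has no direct access to the TSC, so the "unknown" counter in the usage example is a stand-in that ticks three times per nanosecond; the function name estimate_frequency and its parameters are assumptions of this sketch, not part of the actual fix.

```python
import time


def estimate_frequency(read_ticks, read_ref_ns=time.perf_counter_ns,
                       wait=time.sleep, interval_s=0.05):
    """Estimate how many ticks per second read_ticks() advances.

    read_ticks  -- counter with unknown frequency (for example a TSC reader)
    read_ref_ns -- reference clock returning nanoseconds
    wait        -- function that lets interval_s seconds pass
    """
    ref_start = read_ref_ns()
    ticks_start = read_ticks()
    wait(interval_s)
    ticks_end = read_ticks()
    ref_end = read_ref_ns()
    elapsed_s = (ref_end - ref_start) / 1e9
    return (ticks_end - ticks_start) / elapsed_s


def fake_tsc():
    """Stand-in counter that advances three ticks per nanosecond (3 GHz)."""
    return time.perf_counter_ns() * 3


print("estimated frequency, Hz:", estimate_frequency(fake_tsc))
```

A longer interval gives a more stable estimate at the cost of a slower start; the 50 ms default here is arbitrary.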

T9UnSeen aka AHCIH