Application Profiling Tells the Story

We might have found another, and more relevant, data point to answer the 1080p gaming concerns for Ryzen.

It should come as no surprise to anyone who has been paying attention over the last two months that the latest AMD Ryzen processors and architecture are getting a lot of attention. Ryzen 7 launched with a $499 part that bested Intel's $1000 CPU at heavily threaded applications, and Ryzen 5 launched with great value as well, positioning a 6-core/12-thread CPU against quad-core parts from the competition. But part of the story that permeated both the Ryzen 7 and the Ryzen 5 processor launches was the situation surrounding gaming performance, in particular 1080p gaming, and the surprising delta that we see in some games.

Our team has done quite a bit of research and testing on this topic. This included a detailed look at the first asserted reason for the performance gap, the Windows 10 scheduler. Our summary there was that the scheduler was working as expected and that minimal difference was seen when moving between different power modes. We also talked directly with AMD to find out its then-current stance on the results, which backed up our claims on the scheduler and presented a better outlook for gaming going forward. When AMD wanted to test a new custom Windows 10 power profile to help improve performance in some cases, we took part in that too. In late March we saw the first gaming performance update occur courtesy of Ashes of the Singularity: Escalation, where an engine update to utilize more threads resulted in as much as a 31% increase in average frame rate.

As a part of that dissection of the Windows 10 scheduler story, we also discovered interesting data about the CCX construction and how the two modules on the 1800X communicated. The result was significantly longer thread-to-thread latencies than we had seen on any platform before, and the cause was the fabric implementation that AMD integrated with the Zen architecture.

This has led me down another hole recently, wondering if we could further compartmentalize the gaming performance of the Ryzen processors using memory latency. As I showed in my Ryzen 5 review, memory frequency and throughput directly correlate with gaming performance improvements, on the order of 14% in some cases. But what about looking at memory latency alone?

At the outset of the Ryzen product rollout, AMD cautioned media that some traditional synthetic benchmark applications might need to be updated to properly show the performance of a brand-new, ground-up architecture like Zen. Included in that are applications like SiSoft Sandra and AIDA64 that look at memory bandwidth, latency, cache speeds, etc. But in truth, based on my conversations with several benchmark developers, memory latency testing is one of the more straightforward tests. (Even SiSoftware felt confident enough in its testing to write an editorial evaluating Ryzen performance in April.)

There are three tests in the Sandra suite for memory latency: full random, in-page and sequential. Sequential memory testing shows the lowest times of latency because it measures numerically sequential memory location access times. Those sequential accesses are easily prefetched by a modern CPU and thus are cached. Full random testing is exactly what it says as well – a fully random memory walk that will counter the TLB and pre-fetch systems, resulting in the worst-case scenario for memory latency.

In-page testing is more complex in that it attempts to balance between full random and sequential. The TLBs on Ryzen and Kaby Lake can map about 6MB of memory (1,536 4K pages), but as soon as the application wants to reference more, each access will miss the TLB window and force a page walk, adding more accesses and TLB miss cost to the latency. With the in-page test, Sandra attempts to minimize page walks by randomly accessing data in a smaller-than-TLB window, then moving on to another full window. This ensures that the latency test performs a page walk once per page and not once per access.
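To make the difference between the three access patterns concrete, here is a minimal sketch of how such visit orders could be generated and how often each one would miss a TLB. This is not Sandra's code: the function names, the 64-byte stride and the simple fully associative LRU TLB model are all assumptions for illustration (real TLBs are multi-level and set-associative), and the sketch counts page walks rather than measuring time.

```python
import random
from collections import OrderedDict

PAGE_SIZE = 4096   # bytes per page (the 4K pages mentioned in the text)
STRIDE = 64        # bytes between sampled locations (assumed: one cache line)
SLOTS_PER_PAGE = PAGE_SIZE // STRIDE


def tlb_reach_bytes(entries, page_size=PAGE_SIZE):
    """Amount of memory a TLB with `entries` entries can map at once."""
    return entries * page_size


def sequential_order(num_pages):
    """Visit every slot in ascending address order."""
    return list(range(num_pages * SLOTS_PER_PAGE))


def full_random_order(num_pages, seed=0):
    """Visit every slot once, in a random order across the whole buffer."""
    order = sequential_order(num_pages)
    random.Random(seed).shuffle(order)
    return order


def in_page_order(num_pages, window_pages, seed=0):
    """Random order inside a window of pages, then move to the next window."""
    rng = random.Random(seed)
    window_slots = window_pages * SLOTS_PER_PAGE
    total_slots = num_pages * SLOTS_PER_PAGE
    order = []
    for start in range(0, total_slots, window_slots):
        window = list(range(start, min(start + window_slots, total_slots)))
        rng.shuffle(window)
        order.extend(window)
    return order


def count_tlb_misses(order, tlb_entries):
    """Count misses (page walks) in a hypothetical fully associative LRU TLB."""
    tlb = OrderedDict()
    misses = 0
    for slot in order:
        page = slot // SLOTS_PER_PAGE
        if page in tlb:
            tlb.move_to_end(page)        # mark as most recently used
        else:
            misses += 1                  # this access would need a page walk
            tlb[page] = True
            if len(tlb) > tlb_entries:
                tlb.popitem(last=False)  # evict the least recently used page
    return misses


if __name__ == "__main__":
    print("TLB reach:", tlb_reach_bytes(1536) // (1024 * 1024), "MB")
    pages, entries = 64, 8  # toy sizes, far smaller than real hardware
    for name, order in [
        ("sequential", sequential_order(pages)),
        ("in-page", in_page_order(pages, window_pages=entries)),
        ("full random", full_random_order(pages)),
    ]:
        print(name, count_tlb_misses(order, entries))
```

With the toy sizes above (64 pages, an 8-entry TLB and an 8-page window), the sequential and in-page orders each miss exactly 64 times, once per page, while the full random order misses on most of its 4,096 accesses. That is the behavior the in-page test is designed to isolate: random accesses, but without paying for a page walk on nearly every one.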

This graph shows the results from SiSoft Sandra’s memory latency test as well as the Intel Memory Latency Checker, which I tossed in for good measure. Clearly, the actual memory latency of an AMD Ryzen processor is higher than that of Intel Kaby Lake.

The Ryzen 7 1800X is slower in all three methods from Sandra, and disproportionately so in the in-page result, coming in 3.6x slower than the Core i7-7700K. By comparison, under the full random scenario, the Ryzen 7 1800X is 56% slower. Even on the sequential test, the Ryzen part is 45% slower. The Intel Memory Latency Checker, meanwhile, puts the latency comparison somewhere in between SiSoft Sandra’s full random and in-page results. It reports roughly twice the latency (92% slower) for the Ryzen 7 1800X compared to the 7700K.
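Since the figures above mix "x times slower" and "percent slower", a small conversion sketch may help put them on one scale. The helper names are hypothetical, and I am assuming that "3.6x slower" means the Ryzen latency is 3.6 times the Intel latency and that "N% slower" means N% higher latency. Only the relative figures quoted in the text are used, not raw nanosecond results.

```python
def ratio_from_percent_slower(percent):
    """'N% slower' read as latency that is N% higher, i.e. a ratio of 1 + N/100."""
    return 1.0 + percent / 100.0


def percent_slower_from_ratio(ratio):
    """A latency ratio expressed as 'percent higher than the baseline'."""
    return (ratio - 1.0) * 100.0


# Relative Ryzen 7 1800X vs. Core i7-7700K latency, as quoted in the article.
latency_ratio = {
    "Sandra sequential": ratio_from_percent_slower(45),
    "Sandra full random": ratio_from_percent_slower(56),
    "Intel MLC": ratio_from_percent_slower(92),
    "Sandra in-page": 3.6,  # assumed to be a direct latency ratio
}

for test, ratio in sorted(latency_ratio.items(), key=lambda item: item[1]):
    print(f"{test:20s} {ratio:.2f}x  (+{percent_slower_from_ratio(ratio):.0f}%)")
```

On this scale the Intel Memory Latency Checker result (1.92x) does land between Sandra's full random (1.56x) and in-page (3.6x) results, and "roughly twice the latency" is a fair description of 92% slower.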

These numbers are an improvement over the launch results that many media reviews were seeing with Ryzen 7. AMD worked closely with the motherboard vendors to find ways to optimize the BIOS and default settings to improve memory efficiency and round-trip latency. This effort has been ongoing from the launch day of Ryzen 7 through the Ryzen 5 release, and honestly continues today. AMD still wants to bump up default supported memory speeds (that will, by definition, improve memory latencies) and help spread knowledge that buying memory faster than DDR4-2400 is the best course of action for AMD Ryzen buyers.

It is also worth noting that I ran these tests with the Ryzen 7 platform at slightly tighter / faster timings, though both platforms were running at 2400 MHz. The same Corsair memory was running at a 1T command rate on the AMD system while we had it set to 2T on Intel. (This was simply a result of out-of-box settings with this memory on each platform and mirrors the settings we used in our initial Ryzen 7 and Ryzen 5 processor reviews in March.)

Using Intel vTune to Measure Application Sensitivity

With that data in hand, I wanted to profile different applications and games to determine how much of an impact we would expect memory bandwidth or latency to have on them. Intel’s vTune application is built for exactly this – it runs counters in the background and measures the impact of each instruction and memory request on the system. Intel vTune is used by software developers to see how their applications perform and to optimize them. Getting the best information out of this kind of tool requires very specialized knowledge of the architecture, and because of that, vTune does not work on Ryzen processors. And since AMD still doesn’t have a publicly available toolset to optimize for the Zen architecture or to provide the kind of data that vTune provides, we have to limit part of our exploration to Intel platforms.

In this example result page you can see essentially any kind of metric you would like to gauge, and if you dig down even deeper, you can analyze any application on a per-function, per-instruction basis. While I don’t have time (or the background) to detail everything, there are interesting results to see. Anything that is in red font is assumed to have a negative effect on application performance, though to what degree depends on any number of factors. Take the CPI rate (clocks per instruction) as an example, an average result across the 120s of captured system profiling. A result of 1.323 is considered poor. This is from one of our tested games; a result from Handbrake, for example, is 0.755, which works out to more than one instruction executed per clock. Looking under the processor back end and the memory bound rate, a result of 45% indicates that in 45% of processor clock cycles, the memory system was the limiting factor of performance. As we narrow it down further, you can see that memory latency shows a 29% instruction sensitivity, meaning 29% of the active clock cycles were waiting on a memory request to return. While we do not expect this to ever be zero, we found that game workloads tend to show much higher dependency on memory latency than most other benchmarked application workloads.
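CPI is easier to reason about when flipped into instructions per clock (IPC), its reciprocal. The snippet below is just that arithmetic applied to the two CPI values quoted above. The function name is my own, not something vTune exposes.

```python
def ipc_from_cpi(cpi):
    """Instructions per clock is the reciprocal of clocks per instruction."""
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return 1.0 / cpi


# CPI values quoted in the article for a 120-second capture.
workload_cpi = {"tested game": 1.323, "Handbrake": 0.755}

for name, cpi in workload_cpi.items():
    print(f"{name}: CPI {cpi:.3f} -> IPC {ipc_from_cpi(cpi):.3f}")
```

The game's CPI of 1.323 corresponds to roughly 0.76 instructions per clock, while Handbrake's 0.755 corresponds to roughly 1.32 instructions per clock.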

The CPU usage histogram is also interesting to look at in benchmarks and games. In this title, the average CPU utilization is just under 4 threads, indicated as “poor” with the default Intel vTune profile. You would like to see this weighted more towards the right of the graph, indicating that all 8 threads of the Core i7-7700K are being utilized for that particular workload.

While slightly out of order, this seems like a good point to mention an important characteristic of memory latency. Depending on simultaneous multi-threading utilization and coding practices, you can “hide” memory latency by distributing work in such a way that the system is rarely dependent on outstanding memory requests at any given point. Lower core/thread utilization is not always indicative of higher resulting sensitivity to memory latency, but it is more likely that a program that is optimized to use threads more efficiently on any given architecture will be less susceptible to any memory latency deficiencies of a given processor design. For an application to prevent dependency on memory latency, it would have to focus on data layout optimizations, access patterns and even software prefetching. These are non-trivial design goals and would likely require a lot of development effort on the complex data structures found in games.
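As one illustration of what a data layout optimization can mean in practice, the toy model below counts how many distinct 64-byte cache lines are touched when a loop reads a single 4-byte field from 1,000 records, first with the field embedded in a 64-byte record and then with the field packed into its own array. The record sizes and the function name are hypothetical and not taken from any real game engine. The model only counts cache lines and says nothing about actual timings.

```python
CACHE_LINE = 64  # bytes per cache line (assumed)


def cache_lines_touched(num_records, record_size, field_offset, field_size):
    """Distinct cache lines read when one field is read from every record.

    Records are assumed to sit back to back in memory starting at address 0.
    """
    lines = set()
    for i in range(num_records):
        first_byte = i * record_size + field_offset
        last_byte = first_byte + field_size - 1
        for line in range(first_byte // CACHE_LINE, last_byte // CACHE_LINE + 1):
            lines.add(line)
    return len(lines)


# Hypothetical 64-byte entity record with a 4-byte field at offset 0.
interleaved = cache_lines_touched(1000, record_size=64, field_offset=0, field_size=4)

# Same field stored alone in a tightly packed array of 4-byte values.
packed = cache_lines_touched(1000, record_size=4, field_offset=0, field_size=4)

print("interleaved layout:", interleaved, "cache lines")
print("packed layout:     ", packed, "cache lines")
```

In this model the interleaved layout touches 1,000 cache lines and the packed layout touches 63, so the same loop issues far fewer requests that can end up waiting on memory. Restructuring real game data this way is exactly the kind of non-trivial effort described above.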

What does a range of applications and games, typical of those used in reviews, show when looking specifically at the memory latency sensitivity of the workload through Intel vTune?

Games, with the lone exception of the Civilization 6 graphics test, show a fairly high memory latency dependency, ranging from 21.7% to 29.3%. In comparison, general applications, with the exception of WinRAR, show very low latency dependency, in the mid- to low-teens. WinRAR is an interesting example, as the specific workload uses a dictionary file that is quite large, often and repeatedly exceeding the TLB sizes of the Kaby Lake processors.