Through the looking glass

The new 3DMark Time Spy lets us compare asynchronous compute performance across GPUs. What interesting stuff did we find?

Futuremark has been the most consistent and most utilized benchmark company for PCs for quite a long time. While other companies have faltered and faded, Futuremark continues to push forward with new benchmarks and capabilities in an attempt to maintain a modern way to compare performance across platforms with standardized tests.

Back in March of 2015, 3DMark added support for an API Overhead test to help gamers and editors understand the performance advantages of Mantle and DirectX 12 compared to existing APIs. Though the results were purely “peak theoretical” numbers, the data helped showcase to consumers and developers what low-level APIs brought to the table.

Today Futuremark is releasing a new benchmark that focuses on DX12 gaming. No longer just a feature test, Time Spy is a fully baked benchmark with its own rendering engine and scenarios for evaluating the performance of graphics cards and platforms. It requires Windows 10 and a DX12-capable graphics card, and includes two different graphics tests and a CPU test. Oh, and of course, there is a stunningly gorgeous demo mode to go along with it.

I’m not going to spend much time here dissecting the benchmark itself, but it does make sense to have an idea of what kind of technologies are built into the game engine and tests. The engine is based purely on DX12, and integrates technologies like asynchronous compute, explicit multi-adapter and multi-threaded workloads. These are highly topical ideas and will be the focus of my testing today.

Futuremark provides an interesting diagram to demonstrate the advantages DX12 has over DX11. Below you will find a listing of the average number of vertices, triangles, patches and shader calls in 3DMark Fire Strike compared with 3DMark Time Spy.

It’s not even close here: for some of these items, the new Time Spy engine has more than 10 times the count of Fire Strike. As Futuremark states, however, this kind of capability isn’t free.

With DirectX 12, developers can significantly improve the multi-thread scaling and hardware utilization of their titles. But it requires a considerable amount of graphics expertise and memory-level programming skill. The programming investment is significant and must be considered from the start of a project.

3DMark Time Spy is Beautiful

If you haven’t seen the 3DMark Time Spy demo yet, it’s worth checking out the embedded video I created below. It is running on a pair of GTX 1080 cards in SLI at 2560×1440, and is captured externally, not with any on-system tools.

I also put together a compilation of the benchmark tests that run separately from the demo itself. They were run on the same GTX 1080 SLI setup, though I have enabled vertical sync on the system to reduce the on-screen tearing from the capture. (This does mean the frame rates you see on the screen are not indicative of any kind of Time Spy score.)

Performance Results – Testing Asynchronous Compute

One of the more interesting aspects for me with Time Spy was the ability to do a custom run of the benchmark with asynchronous compute disabled in the game engine. By using this toggle we should be able to get our first verified data on the impact of asynchronous compute on AMD and NVIDIA architectures.

Here is how Futuremark details the integration of asynchronous compute in Time Spy.

With DirectX 11, all rendering work is executed in one queue with the driver deciding the order of the tasks.

With DirectX 12, GPUs that support asynchronous compute can process work from multiple queues in parallel.

There are three types of queue: 3D, compute, and copy. A 3D queue executes rendering commands and can also handle other work types. A compute queue can handle compute and copy work. A copy queue only accepts copy operations.

The queues all race for the same resources so the overall benefit depends on the workload.

In Time Spy, asynchronous compute is used heavily to overlap rendering passes to maximize GPU utilization. The asynchronous compute workload per frame varies between 10-20%. To observe the benefit on your own hardware, you can optionally choose to disable async compute using the Custom run settings in 3DMark Advanced and Professional Editions.
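The queue rules Futuremark describes above can be sketched in a few lines of Python. This is only an illustrative model, not 3DMark or Direct3D code: the names `Work`, `QUEUE_ACCEPTS` and `queues_for` are my own, and the mapping simply restates the quoted description. (In the Direct3D 12 API itself, the three queue types correspond to the DIRECT, COMPUTE and COPY command list types.)

```python
from enum import Flag, auto


class Work(Flag):
    """Kinds of GPU work, as named in Futuremark's description."""
    COPY = auto()
    COMPUTE = auto()
    RENDER = auto()


# Which kinds of work each queue type accepts (hypothetical table
# built from the quoted text, not taken from any real API).
QUEUE_ACCEPTS = {
    "3D": Work.RENDER | Work.COMPUTE | Work.COPY,
    "compute": Work.COMPUTE | Work.COPY,
    "copy": Work.COPY,
}


def queues_for(work: Work) -> list[str]:
    """Return the queue types that can accept the given kind of work."""
    return [name for name, accepted in QUEUE_ACCEPTS.items() if work in accepted]


if __name__ == "__main__":
    for kind in (Work.RENDER, Work.COMPUTE, Work.COPY):
        print(kind.name, "->", queues_for(kind))
```

The point of the model is that only rendering work is tied to the 3D queue; compute and copy work can be submitted to more than one queue type, which is what gives a GPU the chance to overlap them with rendering.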

I gathered data using our normal GPU test bed and the latest beta drivers from both NVIDIA and AMD.

I ran 8 different GPU configurations through Time Spy with and without asynchronous compute enabled to see what kind of performance differences we saw as a result.

Let’s start with our basic 3DMark Time Spy results. These show clearly that the GeForce GTX 1080 is the fastest single-GPU card on the market, followed by the GeForce GTX 1070. The GTX 980 and 970 have decent showings, though they are definitely on their way out of the market. The AMD Fury X lands somewhere between the GTX 980 and the GTX 1070, falling 10% behind the current lowest-priced Pascal part. The R9 Nano does very well against the GTX 980, beating it by 11%.

AMD’s Radeon RX 480 based on Polaris does well against the GTX 970 and nearly matches the performance of the GTX 980! This is a good sign for the company’s new $199-239 graphics card.

This next graph is more complex – it combines the results above with our scores with asynchronous compute disabled.

First, an explanation of the data: the blue bar is the graphics score in Time Spy with asynchronous compute enabled, the red bar is the graphics score with asynchronous compute disabled, and the green text shows how much each GPU configuration scales going from async disabled to enabled. The higher the scaling shown in green, the more advantage asynchronous compute offers for that graphics card and platform.
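For readers who want to work out the same scaling figure from their own runs, here is a minimal sketch. It assumes the percentage is the gain of the async-enabled graphics score relative to the async-disabled score; the function name and the two example scores are hypothetical placeholders, not results from our testing.

```python
def async_scaling_percent(score_enabled: float, score_disabled: float) -> float:
    """Percent gain from async disabled to async enabled.

    Assumes the async-disabled score is the baseline.
    """
    if score_disabled <= 0:
        raise ValueError("score_disabled must be positive")
    return (score_enabled / score_disabled - 1.0) * 100.0


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    enabled, disabled = 5500, 5000
    print(f"{async_scaling_percent(enabled, disabled):.2f}%")
```

Run Time Spy twice with the Custom run settings, once with async compute on and once with it off, and plug in the two graphics scores.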

Let’s start with the positive results here, in particular with AMD hardware. Both Fiji and Polaris see sizeable gains with the inclusion of asynchronous compute. The Fury X is 12.89% faster with async enabled, the R9 Nano is 11.06% faster and the RX 480 is 8.51% faster. This backs up AMD’s claims that the fundamental architecture design of GCN was built for asynchronous compute, with dedicated hardware schedulers (two in the case of Polaris) included specifically for this purpose.

NVIDIA’s Pascal-based graphics cards are able to take advantage of asynchronous compute, despite some AMD fans continuing to insist otherwise. The GeForce GTX 1080 sees a 6.84% jump in performance with asynchronous compute enabled versus having it turned off. The gain for the GTX 1070 is 5.42%. The scaling is less than we saw on both Fiji and Polaris from AMD, which again indicates that AMD has engineered GCN around asynchronous compute more than NVIDIA has with Pascal.

I did add in GTX 1080 SLI scores to this graph to show that asynchronous compute scaling drops dramatically in a multi-GPU configuration, while also demonstrating the scaling capability of Time Spy. Adding in the second GTX 1080 results in a score that is 80% better than a single card; that’s excellent scaling and much better than I expected for early DX12 results.
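To put that 80% figure in context, it can be expressed as a fraction of ideal linear scaling, where two cards would score exactly twice as much as one. The helper below is a hypothetical sketch of that arithmetic, not something taken from 3DMark.

```python
def multi_gpu_efficiency(gain_percent: float, gpu_count: int) -> float:
    """Fraction of ideal linear scaling achieved by a multi-GPU setup.

    gain_percent is the score improvement over a single card.
    """
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    speedup = 1.0 + gain_percent / 100.0  # e.g. an 80% gain is a 1.8x speedup
    return speedup / gpu_count


if __name__ == "__main__":
    # 80% gain from a second GTX 1080, as measured above.
    print(f"{multi_gpu_efficiency(80, 2):.0%}")
```

By this measure, the second card delivers 1.8x out of a possible 2x.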

Now, let’s talk about the bad news: Maxwell. Performance on 3DMark Time Spy with the GTX 980 and GTX 970 is basically unchanged with asynchronous compute enabled or disabled, telling us that the technology isn’t being used. In my discussion with NVIDIA about this topic, I was told that async compute support isn’t enabled at the driver level for Maxwell hardware, and that it would require both the driver and the game engine to be coded for that capability specifically.

This is just a guess based on history and my talks with NVIDIA, but I think there is some ability to run work asynchronously in Maxwell that will likely never see the light of day. If NVIDIA were going to enable it, they would have done so for the first wave of DX12 titles that used it (Ashes of the Singularity, Hitman) or at the very least for 3DMark Time Spy, an application that the company knows will be adopted by nearly every reviewer immediately and will be used for years.

Why this is the case is something we may never know. Is the hardware support actually non-existent? Is it implemented in a way that coding for it is significantly more work for developers compared to GCN or even Pascal? Does NVIDIA actually have a forced obsolescence policy at work to push gamers toward new GTX 10-series cards? I think that final option is untrue – NVIDIA surely sees the negative reactions to a lack of asynchronous compute capability as a drain on the brand and would do just about anything to clean it up.

Regardless of why, the answer is pretty clear: NVIDIA’s Maxwell architecture does not currently have any ability to scale in DX12 games that use asynchronous compute.

Closing Thoughts

Frequent PC Perspective readers will know that we do not put much weight on 3DMark scores when making our final recommendations for graphics cards, but the data the benchmark provides can be very useful for helping differentiate specific features as well as giving users a “spot check” to compare their own hardware to results in our reviews. Futuremark’s new 3DMark Time Spy is a great addition to this suite of tools and, as one of the first dedicated DX12 tests, it will have thousands of users looking forward to running it on their own hardware this week. If nothing else, it provides a gorgeous new demonstration to show off your new GPU hardware to your family and friends!

You can pick up the free demo or the full version over on Steam or at Futuremark’s website.