Our first DX12 Performance Results

We got a couple of days with the new Futuremark 3DMark API Overhead Feature Test. How do NVIDIA and AMD fare?

Late last week, Microsoft approached me to see if I would be interested in working with them and with Futuremark on the release of the new 3DMark API Overhead Feature Test. Of course I jumped at the chance, with DirectX 12 being one of the hottest discussion topics among gamers, PC enthusiasts and developers in recent history. Microsoft set us up with the latest iteration of 3DMark and the latest DX12-ready drivers from AMD, NVIDIA and Intel. From there, off we went.

First we need to discuss exactly what the 3DMark API Overhead Feature Test is (and also what it is not). The feature test will be a part of the next revision of 3DMark, which will likely ship in time with the full Windows 10 release. Futuremark claims that it is the "world's first independent" test that allows you to compare the performance of three different APIs: DX12, DX11 and even Mantle.

It was almost one year ago that Microsoft officially unveiled the plans for DirectX 12: a move to a more efficient API that can better utilize the CPU and platform capabilities of future, and most importantly current, systems. Josh wrote up a solid editorial on what we believe DX12 means for the future of gaming, and in particular for PC gaming, that you should check out if you want more background on the direction DX12 has set.

One of DX12's keys to becoming more efficient is the ability for developers to get "closer to the metal," a phrase indicating that game and engine coders can access more of the system's power (CPU and GPU) without having their hands held by the API itself. The most direct benefit of this, as we saw with AMD's Mantle implementation over the past couple of years, is an increase in the number of draw calls that a given hardware system can handle in a game engine.

A draw call is, to put it concisely, a request from the CPU (and the game engine running on it) to draw and render an object. A modern game typically issues thousands of draw calls every frame, but each of those requests adds overhead to the system, limiting performance in some extreme cases. As the draw call count rises, game engines can become limited by that API overhead. New APIs like Mantle and DX12 reduce that overhead by giving developers more control. The effect is one clearly shown by Stardock and the Oxide Engine: removing draw call overhead limits can immediately, and drastically, change how a game functions and how a developer can create new and exciting experiences.

This new feature test from Futuremark, which will be integrated into an upcoming 3DMark release, measures API performance by looking at the balance between frame rates and draw calls. The goal: find out how many draw calls a PC can handle with each API before the frame rate drops below 30 FPS.

At a high level, here is how the test works: starting with a small number of draw calls per frame, the test increases the number of calls in steps every 20 frames until the frame rate drops below 30 FPS. Once that occurs, it keeps that draw call count and measures frame rates for 3 seconds. It then computes the draw calls per second (frame rate multiplied by draw calls per frame) and the result is displayed for the user.
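The procedure described above can be sketched in a few lines of Python. This is only an illustration, not Futuremark's implementation: the starting count, the step size, the per-step averaging and the `toy_frame_time` cost model are all assumptions made for the example. Only the 20-frame step, the 30 FPS floor, the 3-second window and the final multiplication come from the description.

```python
from typing import Callable


def run_overhead_test(
    frame_time_s: Callable[[int], float],
    start_calls: int = 1_000,      # assumed starting draw call count
    step_calls: int = 1_000,       # assumed step size
    frames_per_step: int = 20,
    fps_floor: float = 30.0,
    measure_s: float = 3.0,
) -> float:
    """Return draw calls per second once the frame rate falls below fps_floor."""
    calls = start_calls
    while True:
        # Render a block of frames at the current draw call count.
        elapsed = sum(frame_time_s(calls) for _ in range(frames_per_step))
        fps = frames_per_step / elapsed
        if fps < fps_floor:
            break
        calls += step_calls

    # Hold the draw call count and measure the frame rate over a fixed window.
    frames, elapsed = 0, 0.0
    while elapsed < measure_s:
        elapsed += frame_time_s(calls)
        frames += 1
    measured_fps = frames / elapsed

    # Score: frame rate multiplied by draw calls per frame.
    return measured_fps * calls


def toy_frame_time(draw_calls: int) -> float:
    """Hypothetical CPU-bound model: 1 ms fixed cost plus 1 microsecond per draw call."""
    return 0.001 + draw_calls * 1e-6


if __name__ == "__main__":
    score = run_overhead_test(toy_frame_time)
    print(f"{score:,.0f} draw calls per second")
```

In this toy model, halving the per-call cost roughly doubles the score, which is the kind of effect the feature test is built to expose.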

In order to ensure that the API is the bottleneck in this test, the scene is built procedurally with unique geometries that have an indexed mesh of 112-127 triangles. There is no post-processing and the shaders are very simple to make sure the GPU is not a primary bottleneck.

There are three primary tests the application runs through for all hardware, and a fourth if you have Mantle-capable AMD hardware. First, a DirectX 11 pass is done in a single-threaded method where all draw calls are made from a single thread. Another DX11 pass is made in a multi-threaded method where all draw calls are divided evenly between a number of threads equal to one less than the number of addressable cores. That balance leaves one dedicated core for the display driver.

The DX12 and Mantle paths in the feature test are, of course, multi-threaded and utilize all cores available. They divide the draw calls evenly between the total thread count.
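As a rough sketch of the threading rules for each pass, the Python below picks a thread count per API path and splits a frame's draw calls evenly across those threads. The function names and API labels are hypothetical, and how the real test handles a remainder that does not divide evenly is an assumption.

```python
from typing import List


def worker_threads(api: str, addressable_cores: int) -> int:
    """Thread count for each pass, following the rules described in the text."""
    if api == "dx11_st":
        return 1                                  # all draw calls from one thread
    if api == "dx11_mt":
        return max(1, addressable_cores - 1)      # one core left for the display driver
    if api in ("dx12", "mantle"):
        return addressable_cores                  # all available cores
    raise ValueError(f"unknown API path: {api}")


def split_draw_calls(total_calls: int, threads: int) -> List[int]:
    """Divide draw calls as evenly as possible (remainder handling is assumed)."""
    base, extra = divmod(total_calls, threads)
    return [base + (1 if i < extra else 0) for i in range(threads)]


if __name__ == "__main__":
    cores = 16  # e.g. 8 cores with HyperThreading enabled
    for api in ("dx11_st", "dx11_mt", "dx12", "mantle"):
        n = worker_threads(api, cores)
        print(api, n, split_draw_calls(100_000, n)[:3])
```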

First 3DMark API Overhead Feature Test Results

Our test system was built around the following hardware:

Intel Core i7-5960X

ASUS X99-Deluxe

16GB Corsair DDR4-2400

ADATA SP910 120GB SSD

The GPUs we used for this short feature test are the reference NVIDIA GeForce GTX 980, an ASUS R9 290X DirectCU II, the MSI GeForce GTX 960 100ME and a Sapphire R9 285 Tri-X. Driver revision for NVIDIA hardware was 349.90 and for AMD we used 15.200.1012.2.

For our GTX 980 and R9 290X results, you'll see a number of scores. The Haswell-E processor was run in its stock state (8 cores, HyperThreading on) to get baseline numbers, but we also started disabling cores on the CPU in order to get some idea of the drop-off as we reduce the amount of processor horsepower available to DirectX 12. As you'll no doubt see, six cores look like they will be plenty to maximize draw call capability.

Let's digest our results.

First on the bench is the GeForce GTX 980 and the results are immediately impressive. Even using the best case for DirectX 11 multi-threading, our system can only handle 2.62 million draw calls per second, just over 2x the score from the single-threaded DX11 result. However, DX12 sees a substantial increase in efficiency, reaching as high as 15.67M draw calls per second, which is an increase of nearly 6x! While you should definitely not expect to see 6x improvements in gaming performance when DX12 titles begin to ship late this year, the additional CPU headroom that the new API offers means that developers can begin planning next-generation game engines accordingly.
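To put those GTX 980 numbers in per-frame terms, here is a small Python sketch. It assumes the measured frame rate sits right at the 30 FPS floor, which is only an approximation since the test measures just below that point, so the per-frame figures are rough estimates rather than reported results.

```python
def speedup(new_rate: float, old_rate: float) -> float:
    """Ratio of two draw call rates."""
    return new_rate / old_rate


def calls_per_frame(calls_per_second: float, fps: float = 30.0) -> float:
    """Invert the score formula (frame rate x draw calls per frame)."""
    return calls_per_second / fps


DX11_MT = 2.62e6   # draw calls per second, GTX 980, DX11 multi-threaded
DX12 = 15.67e6     # draw calls per second, GTX 980, DX12

if __name__ == "__main__":
    print(f"DX12 vs DX11 MT: {speedup(DX12, DX11_MT):.2f}x")
    print(f"DX11 MT: ~{calls_per_frame(DX11_MT):,.0f} draw calls per frame")
    print(f"DX12:    ~{calls_per_frame(DX12):,.0f} draw calls per frame")
```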

For our core count reduction, we see that 8 cores with HyperThreading, 8C with no HT and 6C without HT all result in basically the same maximum draw call throughput. Once we drop to 4C, we decrease the peak draw call rate by nearly 24%. A move to a dual-core system falls to 7.22M draw calls per second, resulting in another 74% drop. Finally, with a single core, the rate hits only 4.23M draw calls per second. We will still need to test other CPU platforms to see how they handle both CPU core and CPU clock speed scaling, but it appears that even high-end quad-core rigs will have more than enough performance headroom to stretch DX12's legs.
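Percentage drops depend heavily on which baseline you pick, so one way to put the core-scaling figures on a common footing is to express each against the 15.67M peak. The Python sketch below does that for the rates quoted in the text. The quad-core rate is not quoted as an absolute number, so it is left out.

```python
def percent_drop(baseline: float, value: float) -> float:
    """Drop from baseline to value, as a percentage of the baseline."""
    return (baseline - value) / baseline * 100.0


PEAK = 15.67       # million draw calls/s, GTX 980 DX12 peak
DUAL_CORE = 7.22   # million draw calls/s with two cores
SINGLE_CORE = 4.23 # million draw calls/s with one core

if __name__ == "__main__":
    print(f"2C vs peak: {percent_drop(PEAK, DUAL_CORE):.1f}% lower")
    print(f"1C vs peak: {percent_drop(PEAK, SINGLE_CORE):.1f}% lower")
    print(f"1C vs 2C:   {percent_drop(DUAL_CORE, SINGLE_CORE):.1f}% lower")
```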

Our results with the Radeon R9 290X in the same platform look similar. We see a peak draw call rate of 19.12M per second on DX12 but an even better result under Mantle, hitting 20.88M draw calls per second. That shouldn't surprise us: Mantle was written specifically for the AMD GPU architecture and drivers while DX12 has to be more agnostic to function on AMD, Intel and NVIDIA GPU hardware. Clearly the current implementation of drivers from AMD is doing quite well, besting the maximum draw call rate of the GTX 980 by 4M per second or so. That said, comparisons across GPU platforms at this point are less relevant than you might think. More on that later.

DX12 draw call performance remains basically the same across 8C with HT on, 8C and 6C testing, but it drops by about 33% with the move to a quad-core configuration. On Mantle, we do see a small but measurable 11% drop going from 8 cores to 6 cores, but it is also the only result that scales UP when given the full 8 cores on the Core i7-5960X.

Interestingly, AMD shows little to no scaling between the DX11 single threaded and DX11 multi-threaded scores with the API Overhead Feature Test, which gives credence to the idea that AMD's current driver stack is not as optimized for DX11 gaming as it should be. The DX12 results are definitely forward looking and things could shift in that area, but the DX11 results are very important to gamers and enthusiasts today – so these are results worth considering.

I also did some testing with a couple of more mainstream GPUs: the GTX 960 and the R9 285. The results here are more than a bit surprising:

The green bar is the stock performance of our platform with the GTX 980, the blue bar is the stock GTX 960, but the yellow bar in the middle shows the results with a reasonably overclocked GTX 960 card. (We hit 1590 MHz peak GPU clock and a 2000 MHz memory clock.) At stock settings, the GTX 960 shows a 60% drop from the GTX 980 when it comes to peak draw calls; that's not totally unexpected. However, with a modest overclock on the mainstream card, we were able to record a DX12 draw call rate of 15.36M, only 2% slower than the GTX 980!

Now, clearly we do not and will never expect the in-game performance of the GTX 980 and GTX 960 to be within a margin of 2%, even with the latter heavily overclocked. No game available today puts the two cards that close together; in fact we would expect the GTX 960 to be about 60-70% slower than the GTX 980 in average frame rates. Exactly why we see this scale so high with the overclocked GPU is still an unknown, and we have asked Microsoft and Futuremark for some insight. What it does prove is that the API Overhead Feature Test should not be used to compare the performance of GeForce and Radeon GPUs to any degree; if the differences in performance inside NVIDIA's own GPU stack can't match up with real-world performance, then it is very unlikely that competing architectures will fare better.

Of course we ran the Radeon R9 285 through the same kind of comparison, stock and then overclocked. In this case we did not see the drastic increase in draw call rate with the overclocked R9 285, but we do see the R9 290X and R9 285 producing scores within 5% of one another. Again, these two GPUs definitely have real-world performance metrics that are further apart than 5%, proving the above point once again.

And how could we let a test like this pass us by without testing out an AMD APU?

The DX11 MT results refused to complete in our testing, but we are working with pre-release drivers, pre-release operating systems and an unfinished API, so just this one hiccup is actually a positive outcome. Moving from DX11 single threaded results to what you get with both DX12 and Mantle, the A10-7850K APU benefits from a 7.8x increase in draw call handling capability. That should improve game performance for properly written DX12 applications tremendously, and do so on a platform that desperately needs it.

Initial Thoughts

Though minimal in quantity compared to the grand scheme of things we want to test with, the results we are showing here today paint a very positive picture about the future of DirectX 12. Since the announcement of Mantle from AMD and its subsequent release in a couple of key titles, the move to an API with less overhead and higher efficiency has been clamored for by enthusiasts, developers and even hardware vendors. Microsoft stepped up to the plate, willing to sacrifice so much of what made DirectX a success in the past to pave a new trail with DirectX 12.

Futuremark's new 3DMark API Overhead Feature Test proves that something as fundamental as draw calls can be drastically improved upon with forward thinking and a large dose of effort. We saw improvements in API efficiency as high as 18-19x with the Radeon R9 290X when comparing DX12 and DX11 results, and while we definitely won't see that same kind of outright gaming performance gain with the new API, it gives developers a completely new outlook on engine development and integration. Processor bottlenecks that users didn't even know existed can now be pushed aside to stretch the bounds of what games can accomplish. It might not turn the world on its head on day one, but I truly think that APIs like DX12 and Vulkan (what Mantle has become for Khronos) will alter gaming more than anyone previously thought.

As for the AMD and NVIDIA debate, both Futuremark and Microsoft continue to impress upon us that this feature test is not a reasonable test of GPU performance. Based on our overclocked results with the GTX 960 in particular, that is definitely the case. I'm sure you will soon see stories claiming that one party is ahead of the other in terms of DX12 driver development, or that one GPU brand is going to be faster in DX12 than the other, but that is simply not a conclusion you can derive from the data sets provided by this test. Just keep calm and wait: you'll see more DX12 gaming tests in the near future that will paint a better picture of what the gaming landscape will look like in 2016. For now, let's just wait for Windows 10 to roll out so we can get final DX12 feature and comparison information, and to allow Intel, NVIDIA and AMD a little time to tweak drivers.

It's going to be a great year for PC gamers. There is simply no doubt.