Yesterday after an all-day session of benchmarking on Wednesday, we published our initial performance results for Civilization: Beyond Earth. As can often be the case with limited testing, we ran into a problem and were unable to find a solution at the time. In short, while there was a lot of talk about how developers Firaxis had spent some effort to improve latency using a custom Split-Frame Rendering (SFR) approach with Mantle on CrossFire configurations, we were unable to produce anything that corroborated that story. Emails were sent, but it took half a day before we finally had the answer: enabling SFR actually requires manual editing of the configuration file. Oops.

We could ask why manual editing of the INI file is even necessary, and there are other user interface items that would be nice to address as well as I noted in the conclusion of the original Benchmarked article. But that's all water under the bridge at this point, so let me issue a public apology for not having the complete information yesterday.

I've updated the text of the original article (and added a discussion of minimum frame rates in case you missed that), but since many people have potentially read the article already and are unlikely to revisit the subject, I wanted to post a separate Pipeline to update everyone on the true performance of CrossFire with Mantle and SFR. But before we get to that, let me also take this opportunity to provide some of the additional information from Firaxis and AMD on why SFR matters. Firaxis has a couple blog posts on the subject (including one highlighting the benefits of Mantle with multiple GPUs), and here's the direct quote from AMD's marketing folks:

With a traditional graphics API, multi-GPU (MGPU) arrays like AMD CrossFire are typically utilized with a rendering method called "alternate-frame rendering" (AFR). AFR renders odd frames on the first GPU, and even frames on the second GPU. Parallelizing a game’s workload across two GPUs working in tandem has obvious performance benefits.

As AFR requires frames to be rendered in advance, this approach can occasionally suffer from some issues:

Large queue depths can reduce the responsiveness of the user’s mouse input

The game’s design might not accommodate a queue sufficient for good MGPU scaling

Predicted frames in the queue may not be useful to the current state of the user’s movement or camera

Thankfully, AFR is not the only approach to multi-GPU. Mantle empowers game developers with full control of a multi-GPU array and the ability to create or implement unique MGPU solutions that fit the needs of the game engine. In Civilization: Beyond Earth, Firaxis designed a "split-frame rendering" (SFR) subsystem. SFR divides each frame of a scene into proportional sections, and assigns a rendering slice to each GPU in AMD CrossFire configuration. The "master" GPU quickly receives the work of each GPU and composites the final scene for the user to see on his or her monitor.

If you don’t see 70-100% GPU scaling, that is working as intended, according to Firaxis. Civilization: Beyond Earth’s GPU-oriented workloads are not as demanding as other recent PC titles. However, Beyond Earth’s design generates a considerable amount of work in the producer thread. The producer thread tracks API calls from the game and lines them up, through the CPU, for the GPU’s consumer thread to do graphics work. This producer thread vs. consumer thread workload balance is what establishes Civilization as a CPU-sensitive title (vs. a GPU-sensitive one).

Because the game emphasizes CPU performance, the rendering workloads may not fully utilize the capacity of a high-end GPU. In essence, there is no work leftover for the second GPU. However, in cases where the GPU workload is high and a frame might take a while to render (affecting user input latency), the decision to use SFR cuts input latency in half, because there is no long AFR queue to work through. The queue is essentially one frame, each GPU handling a half. This will keep the game smooth and responsive, emphasizing playability, vs. raw frame rates.

Let me provide an example. Let’s say a frame takes 60 milliseconds to render, and you have an AFR queue depth of two frames. That means the user will experience 120ms of lag between the time they move the map and that movement is reflected on-screen. Firaxis’ decision to use SFR halves the queue down to one frame, reducing the input latency to 60ms. And because each GPU is working on half the frame, the queue is reduced by half again to just 30ms.

In this way the game will feel very smooth and responsive, because raw frame-rate scaling was not the goal of this title. Smooth, playable performance was the goal. This is one of the unique approaches to MGPU that AMD has been extolling in the era of Mantle and other similar APIs.

When I first read the above, my initial reaction was: "This is awesome!" I've always been a bit leery of AFR and the increase in input latency that it can create, so using SFR to avoid the issue is an excellent idea. Unfortunately, it requires more work and testing to get it working right, so most games simply stick with AFR. Ironically, while reducing input latency is never a bad thing, it honestly doesn't matter nearly as much in a turn-based strategy game like Civilization: Beyond Earth. What we'd really love to see is use of techniques like SFR to reduce input latency on games from genres where input latency is a bigger deal – first-person games like Crysis, Battlefield, Far Cry, etc. and third-person games like Batman, Shadow of Mordor, Assassin's Creed, etc. being prime examples. With that said, let's revisit the subject of Civilization: Beyond Earth and CrossFire performance, with and without Mantle:

Our graphing engine doesn't allow for sorting on multiple criteria, otherwise I might try sorting by average + minimum frame rate. Regardless, you can see that across the range of options the CrossFire Mantle SFR support is now doing what we'd expect and improving frame rates. But it's not just about improving frame rates; as the above commentary notes, improving input latency is also important. We aren't really equipped to test for input latency (that would require a very high speed camera as well as additional time filming and measuring input latency), but the minimum frame rates definitely improve as well.

What's interesting is that CrossFire without Mantle (which uses AFR) has higher average FPS in many cases, but the minimum frame rates are worse than with a single GPU. The two images above show why this isn't necessarily a good thing. We haven't tested SLI performance, but I have at least one source that says SLI performance is similar to CrossFire AFR: higher average FPS but lower minimum FPS. It's entirely possible that driver updates will improve the situation with D3D, but for now CrossFire with Mantle SFR definitely scores a win over Direct3D AFR as it provides for a smoother gaming experience.

Let's look at the above charts in a different format before we continue this discussion.

We can see that even with just two GPUs splitting the workload, our CPU has apparently become a bottleneck with the R9 290X. Average frame rates still show an increase going from 4K Ultra to QHD Ultra to 1080p Ultra to 1080p High, but when we look at minimum FPS we've apparently run straight into a wall. For the R9 290X with Mantle, CrossFire effectively tops out with a minimum FPS of roughly 65FPS while a single GPU hits a lower minimum of around 50FPS without Mantle, and regular CrossFire on the 290X (i.e. without Mantle) has a minimum of 45FPS. Again, there are likely some optimizations that could be made in both drivers and the game to improve the situation, but it wouldn't be too surprising to find that Mantle and SFR with three or four GPUs doesn't show much of an increase over two GPUs.

I do have to wonder how applicable the above results are to other games. Last I checked, Mantle CrossFire rendering on Sniper Elite 3 was basically not working, but if other software developers can use Mantle to effectively implement SFR instead of AFR that would be nice to see. But didn't we have SFR way back in the early days of multiple GPUs? Of course we did! 3dfx initially called their solution SLI – Scan Line Interleave – and had each GPU rendering every other line. That approach had problems with things like anti-aliasing, but there are many other ways to divide the workload between GPUs, and both AMD (formerly ATI) and NVIDIA have done variations on SFR in the past.

The problem is that when DirectX 9 rolled around and we started getting programmable shaders and deferred rendering, at some point synchronization issues cropped up and basically developers were locked out of doing creative things like SFR (or geometry processing on one GPU and rendering on another). The only thing you can do with multiple GPUs using Direct3D right now is AFR. That may change with Direct3D 12, but we're still a ways out from that release. Basically, AFR is the easiest approach to implement, but it has various drawbacks even when it does work properly.

Of course there are other potential pitfalls with doing alternative workload splitting like SFR. They can require more work from the CPU, and as you add GPUs the CPU already creates a potential bottleneck. AMD informed us that the engine in Civilization: Beyond Earth is actually extremely scalable with CPU cores, so while we're testing with an overclocked i7-4770K, AMD said they even saw a 20% improvement in performance (with Mantle) going from hex-core Ivy Bridge-E to octal-core Haswell-E with R9 290X CrossFire. There are apparently other cases where certain hardware configurations and game settings can result in an even greater improvement in performance thanks to Mantle (e.g. the 50% increase in minimum frame rates on the R9 290X at our 1080p High settings).

The bottom line is that if you have an AMD GPU, games like Civilization: Beyond Earth can certainly benefit. Maybe Direct3D 12 will bring similar options to developers next year, but in the meantime, congrats to both AMD and Firaxis for shining the light on the latency subject once again. NVIDIA made some waves with similar discussions when they released FCAT last year, but the topic of latency and jitters is definitely important – and don't even get me started on silliness like capping frame rates at 30FPS by default (cough, The Evil Within, cough).