It’s an understatement to say Ubisoft have been front and center when it comes to controversy of late, whether it be frame rate, resolution, graphical downgrades or perhaps the most dreaded term of all, visual parity. Recently, a viewer and reader contacted me on Facebook and pointed out a rather interesting technical document I’d missed from GDC Europe 2014, in which Alexis Vaisse, Lead Programmer at Ubisoft Montpellier, goes into detail on the GPU and CPU performance of both machines.

During this conference Alexis pushed home the importance of reducing the workload on the new systems’ CPUs. Not only do Ubisoft show the relative performance of the PS4’s GPU versus the Xbox One’s (which we cover at the end), but they also highlight the rather obvious performance differential between the AMD Jaguar CPUs which power both the PS4 and X1 and the previous generation consoles (the PS3’s SPUs in particular were noted). So with the preface out of the way, let’s get on with the show.

In the above image, we can see the results of running cloth simulation on the systems’ CPUs. The previous generation consoles, Microsoft’s Xbox 360 and Sony’s Playstation 3, had very robust CPUs. Naughty Dog (along with many other developers) have often highly praised the PS3’s SPUs (Synergistic Processing Units), as they were in effect vector processors (read more on the PS3’s Cell processor here). They were in some ways a precursor to the Playstation 4’s GPGPU structure. So the Jaguars lag behind here; running the cloth simulation on the CPU is simply too expensive.

You’ll notice the Playstation 4’s CPU performs slightly worse than the Xbox One’s; this difference can be attributed to the raw clock speed difference between the two machines. Microsoft, as you might recall, bumped up the clock speed of their CPU (an AMD Jaguar in the same configuration as Sony’s, that is, four cores per module, two modules total) from 1.6GHz to 1.75GHz. Sony didn’t, and despite plenty of speculation over the PS4’s CPU clock frequency, Sony eventually confirmed it was left at 1.6GHz, thus providing Microsoft a slight advantage.

The above image might appear momentarily confusing, but in reality it’s fairly simple. The graph demonstrates the relative performance of each console’s own GPU versus its own CPU. So, in the case of the Xbox One graph, it’s demonstrating that there’s a 15x performance difference between the system’s x86-64 Jaguar CPU and its GCN (Graphics Core Next) based GPU. Both next generation systems use only six of their CPU cores for gaming (the other two are strictly reserved for OS functionality). You can easily do this math yourself.

Take the AMD Jaguar’s clock speed (we’ll use the Xbox One as an example), which is 1.75GHz. Multiply 1.75 by 6 (six cores) and then by 8 (eight FLOPs per core per clock), and you get 84 GFLOPS. Now divide the GPU’s throughput by that figure: the Xbox One’s GPU is rated at 1.31 TFLOPS, or 1,310 GFLOPS, and 1,310 divided by 84 leaves us with roughly 15.6, which matches the 15x figure on the slide.
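If you’d rather let a script do the arithmetic, the calculation above can be sketched in a few lines of Python (the core count, FLOPs-per-clock and TFLOPS figures are simply the ones quoted in this article):

```python
# GPU-vs-CPU throughput ratio for the Xbox One, using the article's figures.
cpu_clock_ghz = 1.75    # Jaguar clock after Microsoft's bump
game_cores = 6          # six of the eight cores are available to games
flops_per_clock = 8     # FLOPs per core per cycle (figure used in the article)

cpu_gflops = cpu_clock_ghz * game_cores * flops_per_clock  # 84 GFLOPS
gpu_gflops = 1.31 * 1000                                   # 1.31 TFLOPS in GFLOPS

ratio = gpu_gflops / cpu_gflops
print(f"CPU: {cpu_gflops:.0f} GFLOPS, GPU/CPU: {ratio:.1f}x")
```

The same sums for the PS4 (1.84 TFLOPS over 1.6GHz × 6 × 8 = 76.8 GFLOPS) give roughly 24x, which is why the PS4’s bar on the slide is taller.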

For those who are following along, you’ll realize that this is really the crux of the matter. The CPUs in both machines are relatively weak compared to their GPUs, and thus offloading work to the GPU remains the only logical choice. Ubisoft (like many other developers) are starting to offload a lot of work to the GPU. Its high shader count allows for SIMD (Single Instruction, Multiple Data) nirvana. Quite simply put, multiple processors inside the system’s GPU can be assigned to the task, and they’ll work at it until it’s complete.

Unfortunately, it still requires careful management of the dispatch, or you’ll cause excessive CPU workload. Think of the dispatch as the CPU telling the GPU “Hey, I need you to do this, like this, and give me these results and put them here”. If the CPU isn’t doing this efficiently then, given the massive discrepancy between CPU and GPU performance, it can still become bogged down simply with telling the GPU what to do.

Their new approach wasn’t without initial trouble, however. A single compute shader was used, rather than the CPU needing to dispatch to multiple different shaders. This cut down on both CPU use and memory problems. Up to 32 cloth items could now be handled in a single dispatch, which is clearly much more efficient than throwing fifty dispatches at a problem. CPU time is still very valuable, and so, in turn, is GPU time.
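To see why batching matters, here’s a toy model of the CPU-side cost. The per-dispatch overhead figure is purely illustrative (a made-up constant, not a measured number); the scaling is the point:

```python
import math

DISPATCH_COST_US = 50.0  # hypothetical CPU cost per dispatch, in microseconds
BATCH_SIZE = 32          # cloth items per dispatch, as quoted by Ubisoft

def cpu_dispatch_cost_us(num_items: int, batch_size: int) -> float:
    """CPU time spent just issuing dispatches for a given batch size."""
    dispatches = math.ceil(num_items / batch_size)
    return dispatches * DISPATCH_COST_US

naive = cpu_dispatch_cost_us(50, 1)             # 50 dispatches -> 2500 us
batched = cpu_dispatch_cost_us(50, BATCH_SIZE)  # 2 dispatches  -> 100 us
print(naive / batched)  # 25x less CPU time spent on dispatch
```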

If you’ve been following the next generation since the initial launch rumors, you might recall much being made of the memory bandwidth situation in both consoles, with Sony clearly holding a bandwidth advantage. Despite both consoles using the same memory bus width (256-bit), Sony opted for GDDR5 running at an effective 5500MHz, versus Microsoft’s slower DDR3 running at 2133MHz. This leaves the Xbox One with about 68 GB/s of bandwidth (using purely DDR3) and the PS4 with 176 GB/s.
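Those bandwidth figures fall straight out of the bus width and the effective transfer rate. A quick sketch (peak theoretical numbers only, ignoring the Xbox One’s eSRAM):

```python
def peak_bandwidth_gbs(effective_rate_mts: float, bus_width_bits: int) -> float:
    """Peak bandwidth = transfers per second x bytes per transfer."""
    bytes_per_transfer = bus_width_bits // 8   # 256-bit bus -> 32 bytes
    return effective_rate_mts * bytes_per_transfer / 1000.0  # in GB/s

ps4_gbs = peak_bandwidth_gbs(5500, 256)  # GDDR5 -> 176.0 GB/s
xb1_gbs = peak_bandwidth_gbs(2133, 256)  # DDR3  -> ~68.3 GB/s
print(ps4_gbs, xb1_gbs)
```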

This isn’t so bad if you consider that Microsoft’s console has a third fewer shaders (768 versus the PS4’s 1152, though they do run at a slightly higher clock speed), which slightly lessens the bandwidth requirement. Aside from this, the Xbox One uses the rather infamous eSRAM to make up for the deficit. But despite all of this, the Playstation 4’s memory bandwidth isn’t endless. We’ve found out, of course, that the PS4’s memory latency (suspected to be an issue because of a myth about GDDR5 RAM) won’t actually be a problem; see here for more info.

Rather, the PS4’s memory bandwidth is roughly what you’d expect given the relative performance of its GPU. Robert Hallock from AMD told us in an exclusive interview:

“GPU design is analogous to the game engine design question you previously posed. Everything does have to be balanced. You could throw fistfuls of render backends at a GPU, but if your memory bandwidth is insufficient, then that hardware is wasted. And vice versa, of course.

I think it can best be explained by working backwards, asking yourself: “What performance and resolution target do I want to hit?” Then you build a core out on paper that, by your mathematical models, would yield performance roughly equivalent to your target. Then you build it!

For 512 shader units, then I would say: 32 texture units, 16 ROPs and a 128-bit bus.” You can find out much more in our interview with AMD here.

Compressing the data as it’s moved about the system is extremely important, both in terms of the main system memory (in the case of the Playstation 4, that’d be the GDDR5 RAM) and the various caches and data shares. In the below image, you’ll spot Ubisoft preaching the usage of the LDS (Local Data Share). With the vector units producing their own independent streams, it was important to include a high bandwidth scheduler. The scheduler works alongside the unified cache and the 64KB of Local Data Share to facilitate the information flow between data lanes. Just to be clear, each Compute Unit has its own LDS.

In the below image, it’s shown just how important using the LDS is. For reference, though it’s fairly well known at this point, the Playstation 4 contains 1152 shaders, while the Xbox One contains a total of 768. There are 64 shaders per Compute Unit in the GCN architecture.
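Those shader counts also fix the Compute Unit counts, and with them the total LDS on each machine (64 shaders and 64KB of LDS per CU, per the GCN figures above):

```python
SHADERS_PER_CU = 64  # GCN: 64 stream processors per Compute Unit
LDS_KB_PER_CU = 64   # each CU carries its own 64KB Local Data Share

ps4_cus = 1152 // SHADERS_PER_CU  # 18 CUs
xb1_cus = 768 // SHADERS_PER_CU   # 12 CUs
print(ps4_cus, xb1_cus)                                    # CU counts
print(ps4_cus * LDS_KB_PER_CU, xb1_cus * LDS_KB_PER_CU)    # total LDS in KB
```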

With the Playstation 4’s compute, much of the work is down to the developer. This means the developer has to control the syncing of data, something mentioned a few times previously by other developers, and crucial to ensuring good performance. Simply put, you need to make sure not only that the correct data is being worked on by either the CPU or GPU, but that the data is being processed in the correct order. As we said in our Sucker Punch Second Son Post Mortem: “Sucker Punch point out that it’s sometimes better to avoid using the cache (the Garlic bus of the PS4 actually bypasses the caches of the GPU). Sometimes the only way around this is to issue a “sync” command to the GPU’s buffer – but it’s not something you wish to constantly issue.”

Careful mapping and usage of memory, caches and management of data are crucial. Even having one of the CPU cores from module 0 accessing data in the cache of module 1 can result in a large hit to performance, as discussed in our SINFO analysis (based on a lecture from Naughty Dog).

For those familiar with the GCN architecture, the above graphic won’t come as much of a surprise. From the way the architecture has been designed, from the LDS to the shaders and everything in between, running 64+ pieces of data at a time (typically, multiples of 64 work best, though clearly that’s rarely possible) is the way to go. Ubisoft’s Alexis Vaisse perhaps said it best during the presentation, with the phrase “Port an algorithm to the GPU only if you find a way to handle 64+ data at a time 95+% of the time”.
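The reason for the “multiples of 64” advice is that GCN executes work in wavefronts of 64 lanes in lockstep; any remainder still occupies a full wavefront with idle lanes. A small sketch of the resulting lane utilization:

```python
import math

WAVEFRONT = 64  # GCN wavefront width: 64 work-items execute in lockstep

def lane_utilization(num_items: int) -> float:
    """Fraction of SIMD lanes doing useful work for a given item count."""
    wavefronts = math.ceil(num_items / WAVEFRONT)
    return num_items / (wavefronts * WAVEFRONT)

print(lane_utilization(64))   # 1.0: a full wavefront, no waste
print(lane_utilization(65))   # ~0.51: one extra item idles most of a second wavefront
```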

Sucker Punch have said themselves (in regards to running particles on GPGPU) that you could use a “fancy API to prioritize, but we don’t”. Unfortunately there’s still a lot of CPU time eaten up by particles and compute, as it’s down to the CPU to set up and dispatch the compute tasks.

Ubisoft actually suggest that you write ten versions of your feature. It’s very hard to know which version is going to work best, due to the way the compiler works on these systems and the out-of-order nature of the Jaguar processors, as opposed to the in-order execution of the CPUs sitting at the heart of both the Playstation 3 and Xbox 360. Therefore, much of the time you’re left with little choice but to perform a rather manual testing phase: write ten or so versions of the code, test each of them in turn and see which one (if any) works best. If you find one that’s acceptable (in terms of performance and compute time), you have a winner.
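That trial-and-error workflow is easy to mechanize. A minimal sketch of a “write N versions, time each, keep the winner” harness (the two variants here are placeholder stand-ins, not real cloth code):

```python
import timeit

def variant_a(data):
    # placeholder candidate implementation #1
    return [x * 2 for x in data]

def variant_b(data):
    # placeholder candidate implementation #2
    return [x + x for x in data]

def pick_fastest(variants, data, repeat=5, number=200):
    """Time each candidate and return the name of the quickest one."""
    timings = {
        fn.__name__: min(timeit.repeat(lambda f=fn: f(data),
                                       number=number, repeat=repeat))
        for fn in variants
    }
    return min(timings, key=timings.get)

best = pick_fastest([variant_a, variant_b], list(range(1000)))
print(best)
```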

I’m sure the above is what you’re all waiting for, and it makes a lot of sense. But at the same time, the performance difference between the Xbox One’s GPU and the Playstation 4’s GPU is actually slightly higher than you might think. What Ubisoft do say is “PS4 – 2 ms of GPU time – 640 dancers”, but they give no equivalent metric for the Xbox One, which is a bit of a shame. It’s clear, however, that in the benchmarking Ubisoft have used here, the PS4’s GPU is virtually double the speed.

With a little guesswork, it’s likely down to a few reasons. The first is pure shader power: 1.84 TFLOPS versus 1.31 TFLOPS. The second is the memory bandwidth equation, and the third the more robust compute structure of the Playstation 4. The additional ACEs (Asynchronous Compute Engines) buried inside the PS4’s GPU help out a lot with task scheduling, and generally speed up the compute / graphics commands which are issued to the shaders. The story goes that Sony knew the future of this console generation was compute and requested changes to the GPU, hence the many similarities between the PS4’s GPU and AMD’s Volcanic Islands architecture.
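It’s worth making the size of that gap explicit: on raw shader throughput alone the PS4 should only be around 40 percent faster, so if the benchmark really shows close to 2x, the remainder has to come from somewhere else (bandwidth, the extra ACEs, and so on):

```python
ps4_tflops = 1.84
xb1_tflops = 1.31

raw_ratio = ps4_tflops / xb1_tflops  # ~1.40x from shader FLOPS alone
observed = 2.0                       # "virtually double" per Ubisoft's numbers
print(f"raw: {raw_ratio:.2f}x, unexplained by FLOPS alone: {observed / raw_ratio:.2f}x")
```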

Another possibility is the PS4’s robust bus structure, or finally the so-called ‘Volatile Bit’. Either way, it’s clearly a performance win on the GPU side for Sony. So parity in some ways could either be a PR effort, or in some cases, perhaps the relatively weak CPU in both machines is holding everything back.

It’s unfortunate Ubisoft were rather quiet on the Xbox One front. We do have a little information, for example from Microsoft and AMD’s Developer Day Conference (analysis here), but other than that we’re left with precious little. It remains to be seen how much DX12 will change the formula, but for right now, gamers are still going to want answers on the reasons for ‘parity’.