If Pascal and GDDR5X had a baby

At the heart of that efficiency is the GP104 GPU, which is built on a small 314mm², 16nm die and an all-new architecture. Well, I say all-new, but at first glance Pascal shares much with its predecessor Maxwell. There are the same four Graphics Processing Clusters (GPCs), each of which contains a collection of Streaming Multiprocessors (SMs) bound to a total of 64 raster operators (ROPs) and 2MB of L2 cache. The key difference at the top level is that there are five SMs crammed into each GPC instead of four this time around, each of which contains 128 CUDA cores, 48KB of L1 cache, and eight texture units.

This gives the 1080 a grand total of 2,560 CUDA cores and 160 texture units, a substantial increase over the 2,048 CUDA cores and 128 texture units of the 980.
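Those totals fall straight out of the per-SM figures. A quick sanity check (assuming five SMs per GPC for GP104 and four for GM204, which is what the stated core counts imply):

```python
# Back-of-the-envelope totals from the GPU layout described above.
# Per-SM figures are from the article; the helper is illustrative.

def totals(gpcs, sms_per_gpc, cores_per_sm, tex_per_sm):
    sms = gpcs * sms_per_gpc
    return sms * cores_per_sm, sms * tex_per_sm

# GP104 (GTX 1080): 4 GPCs x 5 SMs, 128 CUDA cores and 8 texture units per SM
cores_1080, tex_1080 = totals(4, 5, 128, 8)
# GM204 (GTX 980): 4 GPCs x 4 SMs
cores_980, tex_980 = totals(4, 4, 128, 8)

print(cores_1080, tex_1080)  # 2560 160
print(cores_980, tex_980)    # 2048 128
```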

GP104 isn't the first Pascal part—that honour goes to the GP100 chip of the Tesla P100, a professional GPU bound for servers and high-performance computing. The P100 is a far bigger chip at 610mm², and as such crams in an extra two GPCs to make up its array of 3,584 CUDA cores. The key difference is that the P100 splits its resources between science-focused FP64 cores and the FP32 cores more useful for gaming, dedicating one FP64 core for every two FP32 cores.

Coupled with a dramatic boost in clock speed—1,607MHz base and 1,733MHz boost—this means the 1080 pushes almost the same level of FP32 performance as the P100—nine teraflops versus 10.2 teraflops—in a smaller, more efficient, and much cheaper chip. The big Maxwell cards topped out at just 1,000MHz at stock, and even the most talented of overclockers would struggle to get past 1,300MHz without the aid of some seriously exotic cooling.
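The "almost nine teraflops" figure follows from the core count and boost clock, since each CUDA core can retire one fused multiply-add (two floating-point operations) per clock:

```python
# Peak FP32 throughput: cores x 2 ops (fused multiply-add) x clock.
def fp32_tflops(cuda_cores, boost_clock_ghz):
    return cuda_cores * 2 * boost_clock_ghz / 1000.0  # TFLOPS

gtx1080 = fp32_tflops(2560, 1.733)
print(round(gtx1080, 1))  # 8.9 -- the "almost nine teraflops" in the text
```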

In the early days of Pascal, most rumours pointed to Nvidia using second-generation High Bandwidth Memory (HBM2), particularly as AMD used the original version of HBM to great effect in its Fury range. While the P100 does use HBM2 for its incredible 720GB/s of bandwidth, pricing and availability concerns surrounding the technology have pushed Nvidia towards using Micron's GDDR5X memory for the 1080. While not quite as impressive as HBM or HBM2, the 8GB of GDDR5X in the 1080 is essentially a drop-in replacement for GDDR5, boasting a 10,000MHz memory clock (versus 7,000MHz on the 980) attached to a 256-bit bus for 320 GB/s of bandwidth.

That's another big leap over the 980 with its 224GB/s of bandwidth, but does come in slightly under the 336GB/s of the Titan X and GTX 980 Ti with their wider 384-bit buses, or the Fury X's 4,096-bit bus. Still, GDDR5X operates at the same 1.35V as GDDR5, giving Nvidia the option to slot in standard GDDR5 across the rest of its range (as it plans to do with the GTX 1070). Plus there's room to grow if Nvidia does implement GDDR5X on a wider 384-bit bus on future cards, which would result in an impressive 480GB/s of bandwidth.
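All of the bandwidth figures above come from the same simple relationship between bus width and per-pin data rate:

```python
# Memory bandwidth = bus width (bits) x effective data rate (Gbps per pin) / 8 bits per byte.
def bandwidth_gbs(bus_bits, gbps_per_pin):
    return bus_bits * gbps_per_pin / 8

print(bandwidth_gbs(256, 10))  # GTX 1080, GDDR5X: 320.0 GB/s
print(bandwidth_gbs(256, 7))   # GTX 980, GDDR5: 224.0 GB/s
print(bandwidth_gbs(384, 7))   # 980 Ti / Titan X: 336.0 GB/s
print(bandwidth_gbs(384, 10))  # hypothetical wider GDDR5X bus: 480.0 GB/s
```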

Nvidia has improved its delta colour compression technology in Pascal too. Where Maxwell featured 2:1 compression—that is, where the GPU calculates the colour difference between a range of pixels and halves the data required if the delta between them is small enough—Pascal can do 4:1 or even 8:1 colour compression with small enough deltas. The result is that Pascal significantly reduces the number of bytes that have to be fetched from memory each frame for roughly 20 percent additional effective bandwidth. Clever stuff.
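The basic idea is easy to sketch, even if Nvidia's actual scheme and thresholds aren't public: store one reference value per tile, then only the small per-pixel deltas. A toy illustration, with all names and thresholds invented for the example:

```python
# Toy illustration of delta colour compression: store one reference pixel
# channel, then per-pixel deltas. If every delta fits in a few bits, the
# tile packs down dramatically; otherwise it's stored uncompressed.
# This is a sketch of the concept only, not Nvidia's implementation.

def compress_tile(pixels):
    """pixels: list of 8-bit channel values for one tile. Returns
    (reference, deltas) if all deltas are small, else None (stored raw)."""
    ref = pixels[0]
    deltas = [p - ref for p in pixels]
    if all(-8 <= d < 8 for d in deltas):  # deltas fit in 4 bits each
        return ref, deltas
    return None

smooth = [120, 121, 122, 121, 120, 119, 120, 121]  # gentle gradient: compresses
noisy = [10, 250, 37, 190, 4, 222, 90, 160]        # large deltas: stored raw
print(compress_tile(smooth) is not None)  # True
print(compress_tile(noisy))               # None
```

Smooth regions of a frame (skies, walls, gradients) compress well; noisy regions simply fall back to uncompressed storage, which is why the gain is an average "effective" bandwidth figure rather than a guarantee.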

Specs at a glance:

                   GTX 1080    GTX Titan X  GTX 980 Ti  GTX 980    GTX 970    GTX 780 Ti
CUDA cores         2,560       3,072        2,816       2,048      1,664      2,880
Texture units      160         192          176         128        104        240
ROPs               64          96           96          64         56         48
Core clock         1,607MHz    1,000MHz     1,000MHz    1,126MHz   1,050MHz   875MHz
Boost clock        1,733MHz    1,050MHz     1,050MHz    1,216MHz   1,178MHz   928MHz
Memory bus width   256-bit     384-bit      384-bit     256-bit    256-bit    384-bit
Memory speed       10GHz       7GHz         7GHz        7GHz       7GHz       7GHz
Memory bandwidth   320GB/s     336GB/s      336GB/s     224GB/s    196GB/s    336GB/s
Memory size        8GB GDDR5X  12GB GDDR5   6GB GDDR5   4GB GDDR5  4GB GDDR5  3GB GDDR5
TDP                180W        250W         250W        165W       145W       250W

There's also Simultaneous Multi-Projection, a new hardware rendering pipeline for Pascal cards (so no, it's not coming to Maxwell) that allows them to render 16 independent "viewpoints" in a single rendering pass. On a regular graphics card, a single viewpoint—i.e. what a user sees on a monitor—is rendered in one pass. That's fine for most applications, but problems occur with multi-monitor setups and VR. In a triple-monitor setup where a user curves the monitors, the graphics card can only render a single viewpoint. It assumes all the monitors are arranged in a straight line, resulting in the images on the left and right monitors looking warped.

Traditionally, this problem is solved by using three separate graphics cards in supported games, but with multi-projection a single GPU can render three different viewpoints in one pass, angling the two side projections to match the monitors and correct the distortion. Nvidia uses a similar technique to speed up VR rendering, allowing a stereo image to be rendered in a single pass. This dramatically improves the frame rate—a particularly big problem to solve when VR needs to run at a hefty 90 FPS. Without a VR headset or multiple monitors to hand, I can't say for sure how well this works just yet. In live demos it was impressive, and developers apparently don't need to do a thing to see the performance benefits.
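The distortion problem comes down to view direction: a single planar projection stretches geometry toward the edges of one very wide frustum, whereas per-panel projections keep each frustum narrow and facing its monitor. A minimal sketch of the per-panel geometry, with the angles purely illustrative:

```python
# Illustrative only: in a curved surround setup, each monitor wants its own
# view direction rather than one ultra-wide planar projection. Simultaneous
# Multi-Projection renders all of these viewpoints in a single pass, per the
# article; the per-panel FOV here is an assumption for the example.

def view_yaws(num_monitors, panel_fov_deg):
    """Yaw angle (degrees) of each monitor's view direction, centred on the
    middle panel, for monitors angled edge-to-edge."""
    centre = (num_monitors - 1) / 2
    return [(i - centre) * panel_fov_deg for i in range(num_monitors)]

print(view_yaws(3, 45))  # [-45.0, 0.0, 45.0] -- one narrow projection per panel
```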

Asynchronous shaders, or lack thereof

While Nvidia has led GPU performance for some time—bar AMD's impressive turn with the release of the 290X back in 2013—in recent months it's suffered a few setbacks when it comes to DirectX 12 and performance under Stardock's Ashes of the Singularity. The problem for Nvidia has been asynchronous shaders, or rather, the lack of them in its hardware. AMD took a gamble early on when designing its GCN range of GPUs (the 7000-series and up) with hardware-based asynchronous shaders. These allow its GPUs to take the multithreaded workloads of DX12 and execute them in parallel and asynchronously, greatly improving performance over serial processing.

Pascal still doesn't have hardware-based asynchronous shaders. In DX12 games like Ashes of the Singularity that take advantage of them, Nvidia doesn't enjoy the same kind of performance boost as AMD. In early tests it even dropped in performance, although recent driver updates have seen Nvidia cards at least achieve parity between DX11 and DX12.

Instead of asynchronous shaders, Pascal uses a technique called pre-emption. Effectively, this enables the GPU to prioritise one set of more complex tasks over another (for example, favouring compute tasks like physics over graphics). The trouble is, long-running compute jobs can end up monopolising the GPU. This was a particular issue for Maxwell, where the GPU could only pre-empt tasks at the end of each command. That means extra time spent waiting for the command to end, which increases latency.

Pascal implements pixel-level pre-emption, allowing the GPU to pause smaller tasks at any point and save their state to memory while bigger tasks complete. It's an interesting solution, but it still doesn't match the performance of hardware-based asynchronous shaders. Fortunately for Nvidia, even with the increasing number of DX12 games being released, few of them take full advantage of asynchronous shaders. Fewer still have shown any real improvement in performance over DX11.
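The latency difference between the two pre-emption schemes is easy to model: the worst-case wait before a high-priority task can start is the time to the next pre-emption boundary. A toy model, with all numbers illustrative rather than measured:

```python
# Toy model of pre-emption latency. A long compute job occupies the GPU when
# a high-priority task (say, a VR frame) arrives. Command-level pre-emption
# (Maxwell) must wait for the whole command to finish; finer-grained
# pre-emption (Pascal) can pause at the next boundary almost immediately.
# Numbers are illustrative, not measurements.

def preemption_wait_us(remaining_work_us, granularity_us):
    """Worst-case wait before the GPU can switch tasks: the time needed to
    reach the next pre-emption boundary."""
    return min(remaining_work_us, granularity_us)

long_job_remaining = 5000  # 5ms of the current command still to run
print(preemption_wait_us(long_job_remaining, 5000))  # command-level: 5000us
print(preemption_wait_us(long_job_remaining, 10))    # fine-grained: 10us
```

Against a 90FPS VR budget of about 11ms per frame, a multi-millisecond stall from command-level pre-emption is exactly the kind of latency spike the finer granularity is meant to avoid.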

That will change over time (spoiler: it does a little here too), but there's more work required on the developer side to support the low-level hardware features of DX12. Right now, most simply aren't bothering. That's not to mention that despite its lack of async, Nvidia has one very big advantage over the competition: clock speed.