NVIDIA Pascal GP100 Architecture Deep-Dive P2: GP100 Cache & HBM2 vs. HBM1

The Order of 10 puzzle preempted a forthcoming event that we'll be covering, publicly disclosed as occurring on May 6 at 9PM EST. Although nVidia has not technically, officially laid claim to the “#OrderOf10” puzzle, the countdown timer happens to expire precisely when nVidia's yet-undetailed Twitch.tv streaming event will kick-off. UPDATE: View our GTX 1080 & GTX 1070 coverage here. And that full day of decoding mysteries led us to brush-up on nVidia's “Pascal” architecture, revealed years ago and announced at this year's GTC as hitting volume production. The first Pascal chip publicly known to enter production is the GP100, found on the Tesla P100 scientific and computational accelerator card. GP100 is the “Big GPU” for this generation of nVidia devices, measuring in at an intimidating 610mm^2 die size, and stands as the trailhead for imminent derivatives of the GPU architecture. Those derivatives will invariably include gaming-targeted devices – something for which the P100 is not remotely targeted – in the GeForce GTX lineup. This article dives deep into NVIDIA's new Pascal architecture. We'll talk streaming multiprocessor layout, memory subsystems and HBM1 vs. HBM2, L1 & L2 Cache, unified memory, GDMA, and more. As for the live-streamed event tonight, we're hoping that it'll offer us some official names for the rumored “GTX 1000” series (e.g. GTX 1080, GTX 1070, GTX 1060 Ti), but we really don't know for certain what's being unveiled. We will be covering that event tonight in full detail, whatever it may involve. Be sure to check the site and YouTube channel for updates as they're released live.

Pascal GP100 vs. K40, M40, GM200

Tesla GPUs Tesla K40 Tesla M40 Tesla P100 GPU Kepler GK110 Maxwell GM200 Pascal GP100 SMs 15 24 56 TPCs 15 24 28 FP32 CUDA Cores 192 128 64 FP32 CUDA Cores / GPU 2880 3072 3584 FP64 CUDA Cores / SM 64 4 32 FP64 CUDA Cores / GPU 960 96 1792 Base Clock 745MHz 948MHz 1328MHz GPU Boost Clock 810/875MHz 1114MHz 1480MHz Peak FP32 TFLOPs 5.04 6.8 10.6 Peak FP64 TFLOPs 1.68 2.1 5.3 Texture Units 240 192 224 Memory Interface 384-bit GDDR5 384-bit GDDR5 4096-bit HBM2 Memory Size Up to 12GB Up to 24GB 16GB L2 Cache 1536KB 3072KB 4096KB Register File Size / SM 256KB 256KB 256KB Register File Size / GPU 3840KB 6144KB 14336KB TDP 235W 250W 300W Transistor Count 7.1B 8B 15.3B GPU Die Size 551mm^2 601mm^2 610mm^2 Manufacturing Process 28nm 28nm 16nm FinFET

NVIDIA Full Pascal GP100 Block Diagram

This is the architectural block diagram for the GP100 Pascal chip that was revealed at GTC last month.

GP100 is the largest GPU that nVidia has ever built, and is fabricated by TSMC on a 16nm FinFET process node. That's a major shift from the preceding 28nm process node of currently active gaming architectures from both nVidia and AMD, and it champions an era of reduced wattage and more packed-in transistors. Both items work to better the performance-per-watt “ratio,” a change nVidia began making post-Fermi and that AMD executed with Fiji. The die shrink isn't alone in performance and power consumption improvements, either; a move to FinFET transistors means that power leakage becomes less significant, and marks the EOL for planar FETs in GPUs as all major silicon manufacturers have now transitioned.

FinFET transistors use a three-dimensional design which extrudes a fin to form the drain and source, with the gate encircling the transistor's fins. GP100 has a transistor count of 15.3 billion across its 610mm^2 GPU die size. It's rated for a 300W TDP and offers a staggering 5.3 TFLOPS of FP64 (double-precision) COMPUTE performance and 10.6 TFLOPS of FP32. FP16 is also available natively (21.2 TFLOPS) on GP100 and is more critical than it might sound on paper, something we'll discuss more momentarily.

The Tesla P100 Pascal science-targeted accelerator card (not quite a “graphics card”) hosts a total of 3584 CUDA Cores capable of single-precision FP32 and 1792 CUDA Cores capable of FP64 DP COMPUTE. The full GP100 GPU would host 1920 FP64 cores. The clockrate is natively high, pushing a frequency of 1328MHz base and 1480MHz boosted (stock), and bandwidth is further aided by the new move to HBM2 memory – an absurd speed increase over GDDR5.

Let's start with the graphics processing clusters and streaming multiprocessor (SM) architecture, then get into memory.

Pascal Streaming Multiprocessor (SM) Architecture

Although variants will exist, the GP100 chip we're looking at today (Tesla P100 GPU) is modeled in the above diagram. This chip hosts six Graphics Processing Clusters (GPCs), each of which contains a set of Texture Processing Clusters (TPCs). For every one TPC in Pascal, there are two Streaming Multiprocessors (SMs); in total, there are 60 SMs on GP100, with the Tesla P100 Accelerator hosting 56. This is what an SM looks like in the Pascal architecture:

The above SM sample is reproduced ten times per GPC, accompanied by one half that count of TPCs (five per GPC). Each SM contains 64 FP32 single-precision CUDA cores and 32 FP64 cores (half) on a fully-enabled GP100 GPU; this may interest astute readers as being a reduction in total core count per SM versus previous architectures, despite an overall higher per-GPU core count. Maxwell, for instance, ran 128 FP32 CUDA cores per SM, and predecessor Kepler ran 192 FP32 SP CUDA cores per SM. The reduced CUDA core count per SM is because GP100 has been segmented into two sets of 32-core processing blocks, each containing independent instruction buffers, warp schedulers, and dispatch units. There is one warp scheduler and one instruction buffer per 32-core partition. Each partition also contains two, independent dispatch units per (four total per SM), and each partition further contains a 32,768 x 32-bit register file. The SM segments share a single instruction cache, unified texture / L1 Cache, four texture units (TMUs), and one 64KB shared memory block.

Here is the COMPUTE potential of various architectures:

GPU Kepler GK110 Maxwell GM200 Pascal GP100 Compute Capability 3.5 5.2 6.0 Threads / Warp 32 32 32 Max Warps / Multiprocessor 64 64 64 Max Threads / Multiprocessor 2048 2048 2048 Max Thread Blocks / Multiprocessor 16 32 32 Max 32-bit Registers / SM 65536 65536 65536 Max Registers / Block 65536 32768 65536 Max Registers / Thread 255 255 255 Max Thread Block Size 1024 1024 1024 Shared Memory Size / SM 16KB / 32KB / 48KB 96KB 64KB

(From GP100 white paper.)

So, even though Pascal has half the cores-per-SM as Maxwell, it's got the same register size and comparable warp and thread count elements; the elements comprising the GPU are more independent in this regard, and better able to divvy workloads efficiently. GP100 is therefore capable of sustaining more active (“in-flight”) threads, warps, and blocks, partially resultant of its increased register access presented to the threads.

The new architecture yields an overall greater core count (by nature of hosting more SMs) and also manages to increase its processing efficiency by changing the datapath configuration. The move to FinFET also greatly increases per-core performance-per-watt, a change consistent with previous fabrication changes.

Because the cores-per-SM have been halved and total SM count has increased (to 10, in this instance), the per-SM memory registers and warps improve efficiency of code execution. This gain in execution efficiency is also supported by increased aggregate shared memory bandwidth (an effective doubling) because the per-SM shared memory access increases.

Pascal's datapath structure requires less power for data transfer management. Pascal schedules tasks with greater efficiency (and less die space consumed by datapath organization) than Maxwell by dispatching two warp instructions per clock, with one warp scheduled per block (or “partition,” as above).

With four texture units per SM, Pascal GP100 is outfitted with a total of 240 TMUs. A split L1/Texture cache is made available to the SM, explained in the next section.

As for why FP16 matters, it's mostly deep learning. This will only be given cursory coverage as it exits the scope of our gaming-tech website. The native FP16 support and high performance (>21TFLOPS) COMPUTE aids in applications which do not require the accuracy tolerances of double-precision. NVidia explains in official documentation that “deep neural network architectures have a natural resilience to errors due to the backpropagation algorithm used in their training.” FP16 becomes preferable in deep learning use cases, as it requires significantly less memory usage, reduces data transfer, and can improve performance ~2x over FP32.

Continue to Page 2 for L1 & L2 Cache, Unified Memory, & HBM2 vs. HBM.