We've known the name of Nvidia's next-generation architecture for some time now: Pascal. Everything beyond that has largely been speculation, some of it reasonable and some of it pie-in-the-sky dreaming. Today at Jen-Hsun's keynote for GTC 2016, Nvidia revealed some of the first details of the hardware. If you were hoping to see the GPU launch first for consumers, followed by professional versions later, we're still waiting to see how that plays out. For now, Nvidia is talking about a few high-level details, and the halo P100 product is shaping up to be an absolute monster.

What you need to understand first is that P100 is apparently going "all in" on deep learning, and it may or may not end up limited to Tesla and Quadro products. Things like NVLink, a high-speed bus linking multiple GPUs together, won't necessarily be used or needed in the world of PC gaming, but even if Pascal is focused more on deep learning and supercomputing applications, that doesn't mean it won't be a killer gaming chip. Let's start with what we know about Pascal P100:

Behold, the Nvidia Tesla P100 graphics module, with 150 billion total transistors of performance.

If the above image looks a bit reminiscent of AMD's Fiji processors, there's good reason. Like Fiji, Nvidia is tapping HBM (High-Bandwidth Memory) for the P100, only they're using HBM2 instead of HBM1. The net result is four stacks of memory running on a 4096-bit bus, only the memory this time is running at 1.4Gbps per pin instead of 1.0Gbps, yielding a total memory bandwidth of 720GB/s. That's all well and good, but perhaps more important than simply providing tons of memory bandwidth, HBM2 significantly increases the amount of memory per HBM stack, with P100 sporting a total of 16GB of VRAM. This was obviously a critical factor for Tesla cards, considering the older Tesla K40 already had 12GB of memory and the M40 likewise launched with 12GB, not to mention the newly released M40 variant that comes with 24GB of GDDR5. HBM2 also includes "free" ECC protection, which is a plus for professional applications where reliability and accuracy are paramount.
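As a rough sanity check on those bandwidth figures, here is a minimal sketch of the usual peak-bandwidth arithmetic: bus width times per-pin data rate, divided by eight bits per byte. The function name is made up for illustration, and the result is a theoretical peak, not a measured number.

```python
def peak_bandwidth_gb_per_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Theoretical peak bandwidth in GB/s.

    bus_width_bits: total memory bus width (4096 for four 1024-bit HBM stacks)
    data_rate_gbps: per-pin data rate in Gbit/s
    """
    return bus_width_bits * data_rate_gbps / 8  # 8 bits per byte


# HBM1 at 1.0Gbps on a 4096-bit bus (the Fiji-style configuration)
hbm1_bandwidth = peak_bandwidth_gb_per_s(4096, 1.0)

# HBM2 at 1.4Gbps on a 4096-bit bus (the P100 configuration described above)
hbm2_bandwidth = peak_bandwidth_gb_per_s(4096, 1.4)

print(f"HBM1: {hbm1_bandwidth:.1f} GB/s")
print(f"HBM2: {hbm2_bandwidth:.1f} GB/s")
```

The HBM2 result works out to about 717GB/s, which lines up with the 720GB/s figure once you allow for rounding in the quoted clock.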

Thanks to the move to the 16nm FinFET process technology, Nvidia has also been able to substantially increase the number of transistors in the GPU core. Where GM200 in the M40 has 3072 CUDA cores and consists of eight billion transistors, P100 nearly doubles the transistor count to 15.3 billion. Nvidia also noted that this is their largest GPU ever, measuring 610mm², but while that's impressive, GM200 also measured around 600mm², so that aspect hasn't changed too much. That size does not include the silicon interposer, however, which has to cover the area of both the GPU and the HBM2 chips, so this definitely qualifies as a gargantuan chip. If you count all the transistors in the GPU, interposer, and HBM2 modules, Nvidia says there are 150 billion transistors all told.

Jen-Hsun in traditional leather jacket, showing off P100.

What about core counts? Here's where things get a bit interesting. The Pascal architecture has once again evolved, changing the size of each SM (streaming multiprocessor). In Kepler, a single SMX consisted of 192 CUDA cores, with the GK110 supporting up to 15 SMX units for 2880 CUDA cores total. Maxwell dropped the core count to 128 per SM, but the architecture was built to better utilize each core, leading to improved efficiency. In Pascal P100, Nvidia drops to just 64 CUDA cores per SM, and apparently there are further improvements to efficiency. What's interesting to note is that each SM in the P100 has 64 FP32 cores, along with 32 FP64 cores, and P100 also adds support for half-precision FP16, potentially doubling throughput in situations where raw performance takes priority over precision.
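To make the generational shift concrete, the sketch below backs out SM counts by dividing each chip's total CUDA cores by its cores per SM. The 3072-core total for GM200 is the M40 figure cited elsewhere in this article; the dictionary layout and names are just for illustration.

```python
# (total CUDA cores, CUDA cores per SM) for the fully enabled chips discussed
chips = {
    "GK110 (Kepler)": (2880, 192),
    "GM200 (Maxwell)": (3072, 128),
    "P100 (Pascal)": (3840, 64),
}

sm_counts = {}
for name, (total_cores, cores_per_sm) in chips.items():
    # Totals should divide evenly, since every SM on a chip is identical
    assert total_cores % cores_per_sm == 0
    sm_counts[name] = total_cores // cores_per_sm
    print(f"{name}: {sm_counts[name]} SMs x {cores_per_sm} cores = {total_cores}")

# P100 pairs 64 FP32 cores with 32 FP64 cores in each SM
p100_fp64_to_fp32_ratio = 32 / 64
print(f"P100 FP64:FP32 core ratio = {p100_fp64_to_fp32_ratio}")
```

The trend is toward more, smaller SMs: the per-SM core count halves going from Maxwell to Pascal, while the number of SMs more than doubles.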

A fully enabled P100 has 60 SMs, giving a potential 3840 cores, but Tesla P100 disables four SMs to give 3584 total cores. That might sound like only a small step forward, considering the M40 has 3072 cores, but clock speeds have improved. Where M40 runs at 948-1114MHz, P100 can run at 1328-1480MHz. Raw compute power at the boost clock ends up being 21.2 half-precision FP16 TFLOPS, 10.6 single-precision FP32 TFLOPS, or 5.3 double-precision FP64 TFLOPS. M40 by comparison had half- and single-precision rates of 6.8 TFLOPS, but a double-precision rate of just 213 GFLOPS; that's because GM200 only included four FP64 cores per SMM, a significant departure from the GK110 Kepler architecture.
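Those TFLOPS numbers can be reproduced with some back-of-the-envelope math. The sketch below assumes the usual convention that each core retires one fused multiply-add (counted as two floating-point operations) per clock at the boost frequency, and that FP16 runs at twice the FP32 rate; neither assumption was spelled out in the keynote. The 24-SMM figure for GM200 is simply 3072 cores divided by 128 per SMM, and the helper name is hypothetical.

```python
def peak_tflops(cores: int, clock_mhz: float, flops_per_clock: float = 2.0) -> float:
    """Theoretical peak TFLOPS, assuming one FMA (2 FLOPs) per core per clock."""
    return cores * flops_per_clock * clock_mhz * 1e6 / 1e12


# Tesla P100: 56 enabled SMs, 64 FP32 and 32 FP64 cores each, 1480MHz boost
p100_sms = 60 - 4
p100_fp32 = peak_tflops(p100_sms * 64, 1480)
p100_fp64 = peak_tflops(p100_sms * 32, 1480)
p100_fp16 = 2 * p100_fp32  # assumption: FP16 at double the FP32 rate

# Tesla M40 (GM200): 3072 FP32 cores, 4 FP64 cores per SMM, 1114MHz boost
m40_smms = 3072 // 128
m40_fp32 = peak_tflops(3072, 1114)
m40_fp64 = peak_tflops(m40_smms * 4, 1114)

print(f"P100 FP16/FP32/FP64: {p100_fp16:.1f} / {p100_fp32:.1f} / {p100_fp64:.1f} TFLOPS")
print(f"M40  FP32/FP64:      {m40_fp32:.1f} TFLOPS / {m40_fp64 * 1000:.0f} GFLOPS")
```

These are theoretical peaks at boost clocks; sustained throughput in real workloads will be lower.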

What all this means is that P100 may never be utilized in a mainstream consumer device. At best, I suspect we might see some new variant of Titan based on P100 in the future, but that could be a long way off. You see, even though Nvidia is spilling the beans on Tesla P100 today (or at least, some of the beans) and the chips are in volume production, Nvidia doesn't plan on full retail availability from OEMs (meaning servers and workstations that ship with Tesla P100) until Q1 2017, with P100 showing up "cloud first" at an unspecified date. That means we're far more likely to see a GP104 chip that skips all the ECC, HBM2, and FP64 stuff and potentially stuffs more FP32 cores into a smaller die than P100. Sadly, Nvidia is not commenting on any future consumer-facing products at this time. Looks like we'll have to wait for Computex to hear more about the consumer lines.