Hot Chips is a yearly conference where the best and the brightest of the semiconductor industry provide deep-dive details on the latest cutting-edge processors. This year featured a diverse range of topics that mirror the latest industry trends. Naturally, that means a huge dose of AI. Presentations from Nvidia, Google, and Microsoft, among many others, outlined the latest developments.

Nvidia was on hand to present some of the finer microarchitectural details of its new Volta GV100 SMs. We'll get to the presentation shortly, but we also had a meeting with Rob Ober, Tesla Chief Platform Architect at Nvidia, for a closer look at the GV100.

Nvidia Volta GV100

Ober happened to have a Tesla V100 tucked away in his bag, so we took the opportunity to snap a few pictures. The GV100 comes in the SXM2 form factor. Four stacks of HBM2 (16GB total) ride on a silicon substrate carrier, visible on opposing sides of the die, and they're flanked by four "wings" that help support the package when heatsink mounting pressure is applied.

Nvidia is pushing the boundaries of semiconductor manufacturing with Volta; it's the company's largest die yet. The massive 815mm² Volta die, which wields 21 billion transistors built on TSMC's 12nm FFN process, is almost the size of a full reticle. A die that large is more likely to pick up a manufacturing defect, which makes yields a challenge.

Nvidia ships the GPU with 80 activated SMs (5,120 CUDA cores), but the company designed the die with 84 SMs to maximize yields. The four spare SMs offset any defects in the manufacturing process; the probability of one SM suffering from a defect is high, whereas the chances of more than four defective SMs are quite low. Nvidia simply disables defective SMs to skirt defects, thus boosting yields. However, if an irreparable defect falls into a more critical area of the chip, such as I/O interconnects or critical pathways, the die is (usually) discarded. In either case, Volta is an engineering feat; its die size surpasses Nvidia's GP100 610mm² die (15.3 billion transistors) by 33%.
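
To put rough numbers on the spare-SM argument, here is a minimal sketch that treats each SM as failing independently with the same probability. That model, the function names, and the 2% per-SM defect rate in the usage note are illustrative assumptions, not figures from Nvidia; real defects tend to cluster, and the sketch ignores defects that land outside the SMs.

```python
from math import comb

TOTAL_SMS = 84     # SMs designed into the die (from the article)
REQUIRED_SMS = 80  # SMs enabled on a shipping GPU (from the article)


def prob_usable_with_spares(p_defect: float,
                            total: int = TOTAL_SMS,
                            required: int = REQUIRED_SMS) -> float:
    """Chance that no more than (total - required) SMs are defective.

    Assumes every SM fails independently with probability p_defect,
    which is a simplification made for illustration only.
    """
    spares = total - required
    return sum(
        comb(total, k) * p_defect**k * (1 - p_defect) ** (total - k)
        for k in range(spares + 1)
    )


def prob_usable_without_spares(p_defect: float,
                               required: int = REQUIRED_SMS) -> float:
    """Chance that a hypothetical die with exactly `required` SMs has no bad SM."""
    return (1 - p_defect) ** required
```

Calling both functions with a hypothetical per-SM defect rate such as `0.02` shows the gap the spares are meant to close: the no-spares die needs all 80 SMs to be clean, while the 84-SM die tolerates up to four bad ones.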

The Volta die resides on a block of steel, so the GV100 has quite a bit of heft to it. Nvidia equipped the bottom of the GV100 with two mezzanine connectors. One connector primarily serves typical PCIe traffic, and the other is dedicated to NVLink connections. The GV100 modules are secured to custom boards (Nvidia offers its HGX reference board) via eight fasteners, and the boards reside inside server chassis of varying heights.

A hefty array of 16 inductors and voltage regulators lines the edge of the card. The package pulls an average of 300W at a little below 1V, so over 300A flows into the die. Nvidia provides reference cooling designs, but most of its HPC customers opt for custom liquid cooling solutions, while many hyperscalers go with air cooling. The thermal solution attaches to the four silver-edged holes next to the die.
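
The current figure follows directly from I = P / V. The sketch below works through that arithmetic; the 0.95V value in the usage note is a stand-in for "a little below 1V", and the even split across 16 phases is an assumption made for illustration, since the article does not say how the regulators share the load.

```python
def supply_current_amps(power_watts: float, voltage_volts: float) -> float:
    """Current drawn at a given power and supply voltage (I = P / V)."""
    if voltage_volts <= 0:
        raise ValueError("voltage must be positive")
    return power_watts / voltage_volts


def per_phase_current_amps(power_watts: float,
                           voltage_volts: float,
                           phases: int) -> float:
    """Current per regulator phase, ASSUMING the load is split evenly."""
    if phases <= 0:
        raise ValueError("phases must be positive")
    return supply_current_amps(power_watts, voltage_volts) / phases
```

For example, `supply_current_amps(300, 0.95)` comes out above 300A, which matches the article's claim, and `per_phase_current_amps(300, 0.95, 16)` gives a feel for why the card needs so many inductors.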

The Tesla V100-powered DGX-1 packs eight Volta GPUs into a 3U chassis to deliver a whopping 960 TFLOPS of compute from 40,960 CUDA cores. It also adds 5,120 Tensor cores, and six NVLink 2.0 connections increase throughput to 300GB/s, or 10X that of a standard PCIe connection. The DGX-1 draws up to 3,200W inside a single chassis, so effective cooling is a must.
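
The DGX-1 totals are simple multiples of the per-GPU figures quoted in this article, and a short sketch makes the arithmetic easy to check. The PCIe figure used for comparison (roughly 32GB/s bidirectional for a PCIe 3.0 x16 slot) is an assumption added here and is not stated in the article, and the function name is hypothetical.

```python
GPUS_PER_DGX1 = 8
CUDA_CORES_PER_GPU = 5120
TENSOR_CORES_PER_GPU = 640
DGX1_TFLOPS = 960
NVLINK_LINKS_PER_GPU = 6
NVLINK_TOTAL_GBPS = 300
GPU_AVG_POWER_WATTS = 300
DGX1_MAX_POWER_WATTS = 3200
# ASSUMPTION: approximate bidirectional bandwidth of a PCIe 3.0 x16 slot.
PCIE3_X16_GBPS = 32


def dgx1_totals() -> dict:
    """Derive system-level and per-unit figures from the quoted specs."""
    return {
        "cuda_cores": GPUS_PER_DGX1 * CUDA_CORES_PER_GPU,
        "tensor_cores": GPUS_PER_DGX1 * TENSOR_CORES_PER_GPU,
        "tflops_per_gpu": DGX1_TFLOPS / GPUS_PER_DGX1,
        "nvlink_gbps_per_link": NVLINK_TOTAL_GBPS / NVLINK_LINKS_PER_GPU,
        "nvlink_vs_pcie": NVLINK_TOTAL_GBPS / PCIE3_X16_GBPS,
        "gpu_share_of_power": (GPUS_PER_DGX1 * GPU_AVG_POWER_WATTS)
        / DGX1_MAX_POWER_WATTS,
    }
```

Under that PCIe assumption the NVLink ratio lands a little under ten, so the "10X" claim reads as a rounded figure. The power entry also shows that the eight GPUs alone account for most of the chassis budget.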

Nvidia’s NVLink accommodates several system topologies, such as the hybrid cube mesh in the DGX-1 for machine learning applications and the HPC-specific P9 Coral System’s unique design. These topologies minimize peer-to-peer latency and provide multipathing capabilities.

Nvidia designed its proprietary NVLink protocol specifically for low latency and high-throughput peer-to-peer GPU communication. The company has considered opening up the NVLink protocol as a standard, but ultimately, Nvidia feels that could hinder development. Several large industry consortiums are developing competing open standards, such as CCIX and CAPI, but Nvidia believes NVLink is best suited for its specific use case.

Nvidia claims impressive performance advantages over its previous-generation P100. Highlights include a 12x boost to training and a 6x boost for inferencing. Performance is fed, in part, by faster HBM2 memory and faster L2 and L1 caches. Meanwhile, NVLink 2’s expanded bandwidth nearly doubles inter-GPU throughput.

The Volta GV100 SM

80 SMs, wielding a total of 5,120 CUDA cores and 640 Tensor cores, populate the die. Improvements include doubled warp schedulers, a large L1 instruction cache, and tensor acceleration. The shared L1 instruction cache feeds one warp instruction per clock to the independently scheduled sub-cores. Each sub-core processes one warp instruction per clock and feeds into the shared MIO unit. The MIO unit houses the texture units, the shared L1 data cache, and shared memory.

Each SM sub-core has its own L0 instruction cache and a dedicated branch unit. The warp scheduler feeds the math dispatch unit, sends MIO instructions to the MIO instruction queue for later scheduling, and feeds the two 4x4x4 Tensor cores (which are used specifically for deep learning).
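
For readers who want a concrete picture of a 4x4x4 Tensor core operation, the sketch below models it in plain Python as D = A x B + C on 4x4 matrices, with FP16 inputs and FP32 accumulation. This is a numerical illustration only: the function names are made up for this example, and the real hardware's internal accumulation order and rounding are not described in this article, so the per-step FP32 rounding here is an assumption.

```python
import struct

N = 4                      # matrix dimension of one Tensor core operation
FMAS_PER_OP = N * N * N    # 4x4x4 = 64 multiply-adds per operation


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def tensor_core_mma(a, b, c):
    """Compute D = A x B + C for 4x4 lists of lists.

    A and B are rounded to FP16 first; the running sum is kept in FP32.
    ASSUMPTION: rounding after every add is a modelling choice, not a
    description of the hardware.
    """
    d = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            acc = to_fp32(c[i][j])
            for k in range(N):
                acc = to_fp32(acc + to_fp16(a[i][k]) * to_fp16(b[k][j]))
            d[i][j] = acc
    return d
```

Feeding in an identity matrix for `a` is a quick sanity check, since the result should then be `b + c` element by element. Values such as 0.1 are not exactly representable in FP16, which is why the accumulator is kept at higher precision.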

The four sub-cores send instructions into the MIO scheduler. The 128KB L1 data cache provides 128 bytes of bandwidth per clock. Each sub-core has a 64 byte-per-clock connection to the L1 data cache. Nvidia noted that it designed the cache subsystem for superior data streaming performance; it offers four times the bandwidth and capacity compared to the GP100.

Nvidia also shared information on the Tesla V100’s independent thread scheduling, along with the mixed-precision FP16/FP32 Tensor cores for deep learning matrix arithmetic. The company has already released many of these details.

Nvidia is entrenched with several high-volume hyperscale customers, such as Facebook. Facebook developed its Big Basin platform, which is a custom system that leverages the V100 and Nvidia’s HGX reference board design. Facebook plans to release it to the Open Compute Project (OCP) soon, so we can expect a wave of OEM and ODM systems to come to market with the new design. That’ll open more avenues for Nvidia to sell its data center GPUs, but Nvidia isn't commenting on when Volta-fueled GPUs will come to the desktop.