Today, Nvidia takes the wraps off Kepler, its next-generation GPU architecture, and quite possibly the greatest step forward for graphics processing since the company unveiled the G80 in November 2006. The first desktop card out of the gate is the GTX 680. Unlike the 5xx series, which was based on a refined version of the Fermi architecture Nvidia debuted back in 2010, the GTX 680 uses a new GK104 GPU, and its design is a sharp departure from Nvidia’s previous architectures.

Over the past five years, Nvidia’s GPU strategy has more-or-less amounted to “Everything+Kitchen Sink and we’ll sort things out when we do the refresh.” After the disastrous debut of its R600 architecture in 2007, AMD adopted a strategy of building smaller, mid-range oriented parts and doubling them up to address the high end of the market. Nvidia, in contrast, adamantly stuck to its monolithic guns. Until Kepler.

The transistor counts below are from Nvidia; Kepler’s die size is estimated but should be close to the mark. Kepler’s die size and transistor count are notable achievements in and of themselves, but we’ve barely scratched the surface of the new core. Here’s a table comparing the vitals of NV’s GT200, which debuted in 2008 (the “Tesla” moniker refers to the GPU family, not the high-end scientific computing cards), with those of Fermi and GK104.

Shaders are now clocked at the same speed as the graphics core, whereas Fermi ran its shaders at twice the core clock. Kepler’s core clock is 30% higher than Fermi’s and the chip packs 3x as many cores, but we want to highlight a change Nvidia wouldn’t explain during its presentation: the GK104’s cores aren’t as capable as the GF110’s. With 3x the core count and a 30% core clock boost, Kepler “only” offers twice the GFLOPS, largely because each core now runs at the core clock rather than at double that speed. Not that that’s a bad thing.
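The arithmetic behind that "twice the throughput" figure can be sketched in a few lines of Python. This is a back-of-the-envelope estimate, not Nvidia's own math: the core counts and clocks are the commonly published GTX 580 and GTX 680 reference specs rather than numbers taken from this article, and the helper name `peak_gflops` is made up for illustration. It assumes each core retires one fused multiply-add (two floating-point operations) per shader clock.

```python
def peak_gflops(cores, shader_clock_mhz, flops_per_clock=2):
    """Peak single-precision GFLOPS = cores x FLOPs per core per clock x clock in GHz."""
    return cores * flops_per_clock * shader_clock_mhz / 1000.0

# Assumed reference specs (not from this article's table):
# GTX 580 (GF110): 512 cores, shaders at 1544 MHz, i.e. 2x the 772 MHz core clock.
gtx580 = peak_gflops(cores=512, shader_clock_mhz=1544)
# GTX 680 (GK104): 1536 cores, shaders at the 1006 MHz core clock.
gtx680 = peak_gflops(cores=1536, shader_clock_mhz=1006)

core_clock_gain = 1006 / 772      # core clock vs. core clock
throughput_gain = gtx680 / gtx580  # peak GFLOPS vs. peak GFLOPS

print(f"GTX 580: {gtx580:.0f} GFLOPS")
print(f"GTX 680: {gtx680:.0f} GFLOPS")
print(f"core clock gain: {core_clock_gain:.2f}x, throughput gain: {throughput_gain:.2f}x")
```

Under those assumptions the core clock rises by roughly 30%, but the shader clock actually falls, so tripling the core count lands at a little under 2x the peak throughput.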

A number of other GPU resources have been shuffled around as well.

Nvidia’s ratio column is remarkably unhelpful; it only describes the increase between Fermi and Kepler rather than how resources are distributed relative to each other. GK104 packs four times as many special function units (SFUs) and twice as many texture units as GF110; the core is capable of processing twice as many instructions per clock (though it has three times as many cores to fill with those instructions).

One area Nvidia did shed some light on is the set of changes it made to its warp scheduler. In weaving (with a loom), the term “warp” refers to the longitudinal threads in a pattern; Nvidia uses the term to mean a group of 32 threads that execute together. For our purposes, the warp scheduler roughly corresponds to a thread scheduler.
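To make the warp idea concrete, here is a minimal sketch of how a flat thread index maps onto warps. It assumes the 32-thread warp size Nvidia GPUs use; the function names are hypothetical and chosen only for this illustration.

```python
WARP_SIZE = 32  # threads per warp (assumed; this is the size Nvidia GPUs use)

def warp_and_lane(thread_id):
    """Return (warp index, lane within that warp) for a flat thread index."""
    return divmod(thread_id, WARP_SIZE)

def warps_needed(num_threads):
    """Number of warps required to cover num_threads (ceiling division)."""
    return -(-num_threads // WARP_SIZE)

print(warp_and_lane(33))   # thread 33 sits in warp 1, lane 1
print(warps_needed(100))   # 100 threads occupy 4 warps, the last one partly empty
```

The scheduler issues work per warp rather than per thread, which is why the grouping matters for the changes described here.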

Fermi’s scheduler was designed with hardware stages to “prevent data hazards in the math datapath itself.” Registers were tracked and checked before instructions were issued to ensure that their operands were ready, while decoded instructions were kept available for fast dispatch when applicable. Kepler simplifies this structure and handles some of the checking in software: because the math pipeline latencies are fixed, the compiler can work out dispatch latency ahead of time, and that information is now issued alongside the instructions themselves.
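The difference between checking hazards in hardware and precomputing them in software can be illustrated with a toy model. This is a sketch under heavy assumptions, not a description of either chip's real pipeline: it models a single in-order issue slot, fixed instruction latencies, and made-up names (`Instr`, `scoreboard_issue_cycles`, `compile_stall_counts`, `static_issue_cycles`). The first function stalls at issue time by consulting a register-ready table, the second computes the same stalls ahead of time, and the third simply obeys the stall count carried with each instruction.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    dest: str       # register written
    srcs: tuple     # registers read
    latency: int    # fixed pipeline latency in cycles (assumed known up front)

def scoreboard_issue_cycles(program):
    """Dynamic model: check register readiness at issue time, stall if needed."""
    ready_at, cycle, issued = {}, 0, []
    for ins in program:
        while any(ready_at.get(r, 0) > cycle for r in ins.srcs):
            cycle += 1                      # hazard detected, wait a cycle
        issued.append(cycle)
        ready_at[ins.dest] = cycle + ins.latency
        cycle += 1                          # one instruction issued per cycle
    return issued

def compile_stall_counts(program):
    """Static model: precompute how long each instruction must wait."""
    ready_at, cycle, stalls = {}, 0, []
    for ins in program:
        need = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        stall = max(0, need - cycle)
        stalls.append(stall)
        cycle += stall
        ready_at[ins.dest] = cycle + ins.latency
        cycle += 1
    return stalls

def static_issue_cycles(stalls):
    """Issue logic that only obeys the precomputed stall counts."""
    cycle, issued = 0, []
    for stall in stalls:
        cycle += stall
        issued.append(cycle)
        cycle += 1
    return issued

# Hypothetical four-instruction program with 4-cycle latencies.
PROGRAM = [
    Instr("r1", ("r0",), 4),
    Instr("r2", ("r1",), 4),       # depends on r1
    Instr("r3", ("r0",), 4),       # independent
    Instr("r4", ("r2", "r3"), 4),  # depends on r2 and r3
]

stalls = compile_stall_counts(PROGRAM)
print(stalls)
print(scoreboard_issue_cycles(PROGRAM) == static_issue_cycles(stalls))
```

Both paths produce the same issue schedule in this model; the static version just moves the dependency bookkeeping out of the issue logic, which is the kind of trade the article describes Kepler making to save power.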

The company adds: “We also developed a new design for the processor execution core, again with a focus on best performance per watt. Each processing unit was scrubbed to maximize clock gating efficiency and minimize wiring and retiming overheads.”

What all this adds up to is a rearchitected GPU with a focus on power efficiency that’s been notably lacking from the company’s previous high-end efforts. Those of you familiar with Nvidia’s historic naming schemes will recognize the GK104 moniker as one that Team Green typically would reserve for a mid-range GPU. Thus far, there’s no indication of a higher-end part in the works, and no obvious places where NV might have disabled compute units to improve yields, as it did with GF100.
