At Nvidia’s GPU Technology Conference in 2010, CEO Jen-Hsun Huang made some pretty dramatic claims about his company’s future GPU architecture, code-named Kepler. Huang predicted the chip would be nearly three times more efficient, in terms of FLOPS per watt, than the firm’s prior Fermi architecture. Those improvements, he said, would go “far beyond” the traditional advances chip companies can squeeze out of the move to a newer, smaller fabrication process. The gains would come from changes to the chip’s architecture, design, and software together.

Fast forward to today, and it’s time to see whether Nvidia has hit its mark. The first chip based on the Kepler architecture is hitting the market, aboard a new graphics card called the GeForce GTX 680, and we now have a clear sense of what was involved in the creation of this chip. Although Kepler’s fundamental capabilities are largely unchanged versus the last generation, Nvidia has extensively refined and polished nearly every aspect of this GPU with an eye toward improved power efficiency.

Kepler was developed under the direction of lead architect John Danskin and Sr. VP of GPU engineering Jonah Alben. Danskin and Alben told us their team took a rather different approach to chip development than what’s been common at Nvidia in the past, with much closer collaboration between the different disciplines involved, from the architects to the chip designers to the compiler developers. An idea that seemed brilliant to the architects would be nixed if it didn’t work well in silicon or didn’t serve the shared goal of building a very power-efficient processor.

Although Kepler is, in many ways, the accumulation of many small refinements, Danskin identified the two biggest changes as the revised SM (or shader multiprocessor, the GPU’s processing “core”) and a vastly improved memory interface. Let’s start by looking at the new SM, which Nvidia calls the SMX, because it gives us the chance to drop a massive block diagram on you. Warm up your scroll wheels for this baby.

Logical block diagrams of the Kepler SMX (left) and Fermi SM (right). Source: Nvidia.

To some extent, GPUs are just massive collections of floating-point computing power, and the SM is the locus of that power. The SM is where nearly all of the graphics processing work takes place, from geometry processing to pixel shading and texture sampling. As you can see, Kepler’s SMX is clearly more powerful than past generations, because it’s over 700 pixels tall in block diagram form. Fermi is, like, 520 or so, tops. More notably, the SMX packs a heaping helping of ALUs, which Nvidia has helpfully labeled as “cores.” I’d contend the SM itself is probably the closest analog to a CPU core, so we’ll avoid that terminology. Whatever you call it, though, the new SMX has more raw computing power—192 ALUs versus 32 ALUs in the Fermi SM. According to Alben, about half of the Kepler team was devoted to building the SMX, which is a new design, not a derivative of Fermi’s SM.

The organization of the SMX’s execution units isn’t truly apparent in the diagram above. Although Nvidia likes to talk about them as individual “cores,” the ALUs are actually grouped into execution units of varying widths. In the SMX, there are four 16-ALU-wide vector execution units and four 32-wide units. Each of the four schedulers in the diagram above is associated with one vec16 unit and one vec32 unit. There are eight special function units per scheduler to handle, well, special math functions like transcendentals and interpolation. (Incidentally, the partial use of vec32 units is apparently how the GF114 got to have 48 ALUs in its SM, a detail Alben let slip that we hadn’t realized before.)

Although each of the SMX’s execution units works on multiple data simultaneously according to its width—and we’ve called them vector units as a result—work is scheduled on them according to Nvidia’s customary scheme, in which the elements of a pixel or thread are processed sequentially on a single ALU. (AMD has recently adopted a similar scheduling format in its GCN architecture.) As in the past, Nvidia schedules its work in groups of 32 pixels or threads known as “warps.” Those vec32 units should be able to output a completed warp in each clock cycle, while the vec16 units and SFUs will require multiple clocks to output a warp.
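The unit counts and warp timing described above boil down to some simple arithmetic, sketched here in Python. This is only a back-of-the-envelope model: it assumes each lane retires one thread's operation per clock and ignores pipeline latency entirely, and the function names are ours, not Nvidia's.

```python
# Execution-unit layout of one Kepler SMX, using the figures quoted in the text.
WARP_SIZE = 32  # threads (or pixels) per warp

# (unit width in lanes, number of such units per SMX)
SMX_VECTOR_UNITS = [(16, 4), (32, 4)]
SFUS_PER_SCHEDULER = 8


def alus_per_smx(units=SMX_VECTOR_UNITS):
    """Total ALU lanes across all vector execution units in one SMX."""
    return sum(width * count for width, count in units)


def clocks_per_warp(unit_width, warp_size=WARP_SIZE):
    """Clocks needed to push one warp through a unit (ceiling division).

    Assumption: one thread per lane per clock, no pipeline fill time.
    """
    return -(-warp_size // unit_width)


if __name__ == "__main__":
    print(alus_per_smx())                       # 4*16 + 4*32 lanes
    print(clocks_per_warp(32))                  # vec32 unit
    print(clocks_per_warp(16))                  # vec16 unit
    print(clocks_per_warp(SFUS_PER_SCHEDULER))  # eight SFUs per scheduler
```

Under those assumptions, a vec32 unit turns around a warp every clock, a vec16 unit needs two clocks, and a scheduler's eight SFUs need four.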

The increased parallelism in the SMX is a consequence of Nvidia’s decision to seek power efficiency with Kepler. In Fermi and prior designs, Nvidia used deep pipelining to achieve high clock frequencies in its shader cores, which typically ran at twice the speed of the rest of the chip. Alben argues that arrangement made sense from the standpoint of area efficiency—that is, the extra die space dedicated to pipelining was presumably more than offset by the performance gained at twice the clock speed. However, driving a chip at higher frequencies requires increased voltage and power. With Kepler’s focus shifted to power efficiency, the team chose to use shorter pipelines and to expand the unit count, even at the expense of some chip area. That choice simplified the chip’s clocking, as well, since the whole thing now runs at one speed.
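The trade-off Alben describes follows from the usual first-order rule for CMOS switching power, P ≈ C·V²·f. Here is a rough Python sketch of it. The voltages and unit counts below are made-up illustrative values, not Fermi or Kepler specifications, and the model ignores leakage and the extra pipeline registers a hot-clocked design carries.

```python
def dynamic_power(units, volts, freq_ghz, cap_per_unit=1.0):
    """First-order switching power, P ~ C * V^2 * f, in arbitrary units."""
    return units * cap_per_unit * volts ** 2 * freq_ghz


def throughput(units, freq_ghz):
    """Operations per nanosecond, assuming one op per unit per clock."""
    return units * freq_ghz


# Hypothetical design points, chosen only to show the shape of the trade-off.
HOT_CLOCK = {"units": 32, "volts": 1.10, "freq_ghz": 2.0}  # fewer, faster units
WIDE = {"units": 64, "volts": 0.90, "freq_ghz": 1.0}       # more, slower units


def power_ratio(new, old):
    """Switching power of `new` relative to `old`."""
    return dynamic_power(**new) / dynamic_power(**old)


if __name__ == "__main__":
    same_work = throughput(32, 2.0) == throughput(64, 1.0)
    print(same_work)                                # equal peak throughput
    print(round(power_ratio(WIDE, HOT_CLOCK), 2))   # reduces to (0.90 / 1.10) ** 2
```

With throughput held constant, the unit count and frequency cancel out, and the power ratio reduces to the square of the voltage ratio. That is why a wider, slower design can come out ahead on FLOPS per watt even though it costs more die area.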

Another, more radical change is the elimination of much of the control logic in the SM. The key to many GPU architectures is the scheduling engine, which manages a vast number of threads in flight and keeps all of the parallel execution units as busy as possible. Prior chips like Fermi have used lots of complex logic to decide which warps should run when, logic that takes a lot of space and consumes a lot of power, according to Alben. Kepler has eliminated some of that logic entirely and will rely on the real-time compiler in Nvidia’s driver software to help make scheduling decisions. In the interests of clarity, permit me to quote from Nvidia’s whitepaper on the subject, which summarizes the change nicely:

Both Kepler and Fermi schedulers contain similar hardware units to handle scheduling functions, including, (a) register scoreboarding for long latency operations (texture and load), (b) inter-warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates), and (c) thread block level scheduling (e.g., the GigaThread engine); however, Fermi’s scheduler also contains a complex hardware stage to prevent data hazards in the math datapath itself. A multi-port register scoreboard keeps track of any registers that are not yet ready with valid data, and a dependency checker block analyzes register usage across a multitude of fully decoded warp instructions against the scoreboard, to determine which are eligible to issue. For Kepler, we realized that since this information is deterministic (the math pipeline latencies are not variable), it is possible for the compiler to determine up front when instructions will be ready to issue, and provide this information in the instruction itself. This allowed us to replace several complex and power-expensive blocks with a simple hardware block that extracts the pre-determined latency information and uses it to mask out warps from eligibility at the inter-warp scheduler stage.
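To make the whitepaper's idea a little more concrete, here is a toy Python sketch of compile-time latency annotation. The instruction format, the four-cycle latency, and the names `Instr`, `annotate_waits`, and `issue_cycles` are all our own inventions for illustration. Nothing here reflects Kepler's actual instruction encoding, which the excerpt above doesn't detail.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dest: str          # register written
    srcs: tuple        # registers read
    latency: int       # fixed math-pipeline latency, in cycles (hypothetical)
    wait: int = 0      # stall count, filled in by the "compiler"


def annotate_waits(program):
    """Compile-time pass: because latencies are fixed, work out up front how
    long each instruction must wait for its sources, and store that count in
    the instruction itself."""
    ready_at = {}      # register -> cycle at which its value is available
    cycle = 0          # earliest cycle the next instruction could issue
    for ins in program:
        earliest = max([cycle] + [ready_at.get(r, 0) for r in ins.srcs])
        ins.wait = earliest - cycle
        ready_at[ins.dest] = earliest + ins.latency
        cycle = earliest + 1
    return program


def issue_cycles(program):
    """Run-time side: no dependency checker, just honor the encoded waits.
    While a warp is waiting, a real scheduler would mark it ineligible and
    issue from some other warp instead."""
    cycle, issued = 0, []
    for ins in program:
        cycle += ins.wait
        issued.append(cycle)
        cycle += 1
    return issued


def demo_program():
    """r2 depends on r0 and r1; r3 depends on r2."""
    return [
        Instr("r0", (), 4),
        Instr("r1", (), 4),
        Instr("r2", ("r0", "r1"), 4),
        Instr("r3", ("r2",), 4),
    ]


if __name__ == "__main__":
    prog = annotate_waits(demo_program())
    print([ins.wait for ins in prog])
    print(issue_cycles(prog))
```

The point of the exercise is that `issue_cycles` never inspects registers at all. All of the hazard analysis happened ahead of time, which is the work Kepler reportedly hands off to the compiler.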

The short story here is that, in Kepler, the constant tug-of-war between control logic and FLOPS has moved decidedly in the direction of more on-chip FLOPS. The big question we have is whether Nvidia’s compiler can truly be effective at keeping the GPU’s execution units busy. Then again, it doesn’t have to be perfect, since Kepler’s increases in peak throughput are sufficient to overcome some loss of utilization efficiency. Also, as you’ll soon see, this setup obviously works pretty well for graphics, a well-known and embarrassingly parallel workload. We are more dubious about this arrangement’s potential for GPU computing, where throughput for a given workload could be highly dependent on compiler tuning. That’s really another story for another chip on another day, though, as we’ll explain shortly.

The first chip: GK104

Now that we’ve looked at the SMX, we can dial back the magnification a bit and consider the overall layout of the first chip based on the Kepler architecture, the GK104.

Logical block diagram of the GK104. Source: Nvidia.

You can see that there are four GPCs, or graphics processing clusters, in the GK104, each nearly a GPU unto itself, with its own rasterization engine. The chip has eight copies of the SMX onboard, for a gut-punching total of 1536 ALUs and 128 texels per clock of texture filtering power.

The L2 cache shown above is 512KB in total, divided into four 128KB “slices,” each with 128 bits of bandwidth per clock cycle. That adds up to double the per-cycle bandwidth of the GF114 or 30% more than the biggest Fermi, the GF110. The rest of the specifics are in the table below, with the relevant comparisons to other GPUs.

| GPU | ROP pixels/clock | Texels filtered/clock (int/fp16) | Shader ALUs | Rasterized triangles/clock | Memory interface width (bits) | Estimated transistor count (millions) | Die size (mm²) | Fabrication process node |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GF114 | 32 | 64/64 | 384 | 2 | 256 | 1950 | 360 | 40 nm |
| GF110 | 48 | 64/64 | 512 | 4 | 384 | 3000 | 520 | 40 nm |
| GK104 | 32 | 128/128 | 1536 | 4 | 256 | 3500 | 294 | 28 nm |
| Cypress | 32 | 80/40 | 1600 | 1 | 256 | 2150 | 334 | 40 nm |
| Cayman | 32 | 96/48 | 1536 | 2 | 256 | 2640 | 389 | 40 nm |
| Pitcairn | 32 | 80/40 | 1280 | 2 | 256 | 2800 | 212 | 28 nm |
| Tahiti | 32 | 128/64 | 2048 | 2 | 384 | 4310 | 365 | 28 nm |

In terms of basic, per-clock rates, the GK104 stacks up reasonably well against today’s best graphics chips. However, if the name “GK104” isn’t enough of a clue for you, have a look at some of the vitals. This chip’s memory interface is only 256 bits wide, all told, and its die size is smaller than the middle-class GF114 chip that powers the GeForce GTX 560 series. The GK104 is also substantially smaller, and composed of fewer transistors, than the Tahiti GPU behind AMD’s Radeon HD 7900 series cards. Although the product based on it is called the GeForce GTX 680, the GK104 is not a top-of-the-line, reticle-busting monster. For the Kepler generation, Nvidia has chosen to bring a smaller chip to market first.

Die shot of the GK104. Source: Nvidia.

Although Nvidia won’t officially confirm it, there is surely a bigger Kepler in the works. The GK104 is obviously more tailored for graphics than GPU computing, and GPU computing is an increasingly important market for Nvidia. The GK104 can handle double-precision floating-point data formats, but it only does so at 1/24th the rate it processes single-precision math, just enough to maintain compatibility. Nvidia has suggested there will be some interesting GPU-computing related announcements during its GTC conference in May, and we expect the details of the bigger Kepler to be revealed at that point. Our best guess is that the GK100, or whatever it’s called, will be a much larger chip, presumably with six 64-bit memory interfaces and 768KB of L2 cache. We wouldn’t be surprised to see its SM exchange those 32-wide execution units for 16-wide units capable of handling double-precision math, leaving it with a total of 128 ALUs per SM. We’d also expect full ECC protection for all local storage and off-chip memory, just like the GF110.

The presence of a larger chip at some point in Nvidia’s future doesn’t mean the GK104 lacks for power. Although it “only” has four 64-bit memory controllers, this chip’s memory interface is probably the most notable change outside of the SMX. As Danskin very carefully put it, “Fermi, our memory wasn’t as fast as it could have been. This is, in fact, as fast as it could be.” The interface still supports GDDR5 memory, but data rates are up from about 4 Gbps in the Fermi products to 6 Gbps in the GeForce GTX 680. As a result, the GTX 680 is able essentially to match the GeForce GTX 580 in total memory bandwidth, at 192 GB/s, while having a data path only two-thirds as wide (256 bits versus 384).
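For anyone who wants to check the bandwidth math, peak memory bandwidth is simply the per-pin data rate times the bus width, divided by eight bits per byte. Here is a minimal Python sketch. The function name is ours, and the 4 Gbps figure for the GTX 580 is the rounded number from the text rather than an exact spec.

```python
def memory_bandwidth_gbs(data_rate_gbps, bus_width_bits):
    """Peak bandwidth in GB/s: per-pin rate (Gbps) x pins / 8 bits per byte."""
    return data_rate_gbps * bus_width_bits / 8


if __name__ == "__main__":
    print(memory_bandwidth_gbs(4, 384))  # GeForce GTX 580, roughly
    print(memory_bandwidth_gbs(6, 256))  # GeForce GTX 680
```

Both cards land at the same 192 GB/s because the 680's 50% higher data rate exactly offsets its narrower bus.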

The other novelty in the GK104 is Nvidia’s first PCI Express 3.0-compatible interconnect, which doubles the peak data rate possible for GPU-to-host communication. We don’t expect major performance benefits for graphics workloads from this faster interface, but it could matter in multi-GPU scenarios or for GPU computing applications.
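The "doubles the peak data rate" claim is easy to sanity-check with the published signaling rates: PCIe 2.0 runs at 5 GT/s per lane with 8b/10b encoding, while PCIe 3.0 runs at 8 GT/s with the leaner 128b/130b encoding. The Python sketch below works out usable bandwidth in one direction for a x16 slot. It ignores packet and protocol overhead, so real-world numbers will be lower, and the function name is our own.

```python
def pcie_gbs(gt_per_s, payload_bits, line_bits, lanes=16):
    """Usable one-direction bandwidth in GB/s after line-code overhead."""
    return gt_per_s * payload_bits / line_bits * lanes / 8


if __name__ == "__main__":
    gen2 = pcie_gbs(5.0, 8, 10)      # PCIe 2.0 x16
    gen3 = pcie_gbs(8.0, 128, 130)   # PCIe 3.0 x16
    print(gen2, round(gen3, 2), round(gen3 / gen2, 2))
```

The raw signaling rate only rises by 60%, so most of the rest of the gain comes from the cheaper line code. The result is a hair under a true doubling.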

Several new features

On this page, we intend to explain some of the important new features Nvidia has built into the GK104 or its software stack. However, in the interests of getting this review posted before our deadline, we’ve decided to put in a placeholder, a radically condensed version of the final product. Don’t worry, we’ll fix it later in software—like the R600’s ROPs.

GPU Boost: As evidenced by the various “turbo” schemes in desktop CPUs, dynamic voltage and frequency schemes are all the rage these days. The theory is straightforward enough. Not all games and other graphics workloads make use of the GPU in the same way, and even relatively “intensive” games may not cause all of the transistors to flip and thus heat up the GPU quite like the most extreme cases. As a result, there’s often some headroom left in a graphics card’s designated thermal envelope, or TDP (thermal design power), which is generally engineered to withstand a worst-case peak workload. Dynamic clocking schemes attempt to track this headroom and to take advantage of it by raising clock speeds opportunistically.

Although the theory is fairly simple, the various implementations of dynamic clocking vary widely in their specifics, which can make them hard to track. Intel’s Turbo Boost is probably the gold standard at present; it uses a network of thermal sensors spread across the die in conjunction with a programmable, on-chip microcontroller that governs Turbo policy. Since it’s a hardware solution with direct inputs from the die, Turbo Boost reacts very quickly to changes in thermal conditions, and its behavior may differ somewhat from chip to chip, since the thermal properties of the chips themselves can vary.

Although distinct from one another in certain ways, both AMD’s Turbo Core (in its CPUs) and PowerTune (in its GPUs) combine on-chip activity counters with pre-production chip testing to establish a profile for each model. In use, power draw for the chip is then estimated based on the activity counters, and clocks are adjusted in response to the expected thermal situation. AMD argues the predictable, deterministic behavior of its DVFS schemes is an admirable trait. The price of that consistency is that it can’t squeeze every last drop of performance out of each individual slab of silicon.

GPU Boost is essentially a first-generation crack at a dynamic clocking feature, and it combines some traits of each of the competing schemes. Fundamentally, the logic is more like the two Turbos than it is like AMD’s PowerTune. With PowerTune, AMD runs its GPUs at a relatively high base frequency, but clock speeds are sometimes throttled back under atypically high GPU utilization. By contrast, GPU Boost starts with a more conservative base clock speed and ranges into higher frequencies when possible.

The inputs for Boost’s decision-making algorithm include power draw, GPU and memory utilization, and GPU temperatures. Most of this information is collected from the GPU itself, but I believe the power use information comes from external circuitry on the GTX 680 board. In fact, Nvidia’s Tom Petersen told us board makers will be required to include this circuitry in order to get the GPU maker’s stamp of approval.

The various inputs for Boost are then processed in software, in a portion of the GPU driver, not in an on-chip controller. The combination of software control and external power circuitry is likely responsible for Boost’s relatively high clock-change latency. Stepping up or down in frequency takes about 100 milliseconds, according to Petersen. A tenth of a second is a very long time in the life of a gigahertz-class chip, and Petersen was frank in admitting that this first generation of GPU Boost isn’t everything Nvidia hopes it will become in the future.

Graphics cards with Boost will be sold with a couple of clock speed numbers on the side. The base clock is the lower of the two, 1006MHz on the GeForce GTX 680, and represents the lowest operating speed in thermally intensive workloads. Curiously enough, the “boost clock” (1058MHz on the GTX 680) isn’t the maximum speed possible. Instead, it’s “sort of a promise,” according to Petersen, the clock speed at which the GPU should run during typical operation. GPU Boost performance will vary slightly from card to card, based on factors like chip quality, ambient temperatures, and the effectiveness of the cooling solution. GTX 680 owners should expect to see their cards running at the Boost clock frequency as a matter of course, regardless of these factors. Beyond that, GPU Boost will make its best effort to reach even higher clock speeds when feasible, stepping up and down in increments of 13MHz.

Petersen demoed several interesting scenarios to illustrate Boost behavior. In a very power-intensive scene, 3DMark11’s first graphics test, the GTX 680 was forced to remain at its base clock throughout. When playing Battlefield 3, meanwhile, the chip spent most of its time at about 1.1GHz, above both the base and boost levels. In a third application, the classic DX9 graphics demo “rthdribl,” the GTX 680 throttled back to under 1GHz, simply because additional GPU performance wasn’t needed. One spot where Nvidia intends to make use of this throttling capability is in-game menu screens, and we’re happy to see it. Some menu screens can cause power use and fan speeds to shoot skyward as frame rates reach quadruple digits.

Nvidia has taken pains to ensure GPU Boost is compatible with user-driven tweaking and overclocking. A new version of its NVAPI allows third-party software, like EVGA’s slick Precision software, control over key Boost parameters. With Precision, the user may raise the GPU’s maximum power limit by as much as 32% above the default, in order to enable operation at higher clock speeds. Interestingly enough, Petersen said Nvidia doesn’t consider cranking up this slider to be overclocking, since its GPUs are qualified to work properly at every voltage-and-frequency point along the curve. (Of course, you could exceed the bounds of the PCIe power connector specification by cranking this slider, so it’s not exactly 100% kosher.)

True overclocking happens by grabbing hold of a separate slider, the GPU clock offset, which raises the chip’s frequency at a given voltage level. An offset of +200MHz, for instance, raised our GTX 680’s clock speed while running Skyrim from 1110MHz (its usual Boost speed) to 1306MHz. EVGA’s tool allows GPU clock offsets as high as +549MHz and memory clock offsets up to +1000MHz, so users are given quite a bit of leeway for experimentation.

Although GPU Boost is only in its first incarnation, Nvidia has some big ideas about how to take advantage of these dynamic clocking capabilities. For instance, Petersen openly telegraphed the firm’s plans for future versions of Boost to include control over memory speeds, as well as GPU clocks. More immediately, one feature exposed by EVGA’s Precision utility is frame-rate targeting. Very simply, the user is able to specify a desired frame rate with a slider, and if the game’s performance exceeds that limit, the GPU steps back down the voltage-and-frequency curve in order to conserve power.

We were initially skeptical about the usefulness of this feature for one big reason: the very long latency of 100 ms for clock speed adjustments. If the GPU has dialed back its speed because the workload is light and then something changes in the game (say, an explosion that adds a bunch of smoke and particle effects to the mix), ramping the clock back up could take quite a while, causing a perceptible hitch in the action. We think that potential is there, and as a result, we doubt this feature will appeal to twitch gamers and the like. However, in our initial playtesting of this feature, we’ve not noticed any problems. We need to spend more time with it, but Kepler’s frame rate targeting may prove to be useful, even in this generation, so long as its clock speed leeway isn’t too wide.

At some point in the future, when the GPU’s DVFS logic is moved into hardware and frequency change delays are measured in much smaller numbers, we expect features like this one to become standard procedure, especially for mobile systems.

Adaptive vsync: Better than dumb vsync.

TXAA: Quincunx 2.0, or Nvidia erects a narrower tent.

Bindless textures: Megatexturing in hardware, but not for DX11.

NVENC: Hardware video encoding, or right back atcha, QuickSync.

Display output improvement: Eye-nvidi-ty.
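To give a feel for how a Boost-style controller might behave, here is a deliberately crude Python sketch. Only the 1006MHz base clock, the 13MHz step, and the roughly 100 ms step interval come from the description above. The linear watts-per-MHz power model, the 190W power target, the 3W hysteresis margin, and the 1200MHz ceiling are all invented for illustration, and the real algorithm also weighs utilization and temperature, which this toy ignores.

```python
BASE_MHZ = 1006   # GTX 680 base clock (from the text)
STEP_MHZ = 13     # Boost step size (from the text)
TICK_MS = 100     # approximate time per clock change (from the text)


def next_clock(clock_mhz, power_w, target_w, max_mhz, margin_w=3.0):
    """One controller tick: step down if over the power target, step up if
    there is clear headroom, otherwise hold. margin_w is a hypothetical
    hysteresis band to stop the clock from bouncing between two bins."""
    if power_w > target_w:
        return max(BASE_MHZ, clock_mhz - STEP_MHZ)
    if power_w < target_w - margin_w:
        return min(max_mhz, clock_mhz + STEP_MHZ)
    return clock_mhz


def settle(watts_per_mhz, target_w=190.0, max_mhz=1200, ticks=50):
    """Run the controller against a made-up linear power model and return
    the clock it ends up at."""
    clock = BASE_MHZ
    for _ in range(ticks):
        clock = next_clock(clock, clock * watts_per_mhz, target_w, max_mhz)
    return clock


if __name__ == "__main__":
    print(settle(0.17))  # lighter workload: climbs several bins above base
    print(settle(0.20))  # power-hungry workload: pinned at the base clock
    print((1110 - BASE_MHZ) // STEP_MHZ * TICK_MS, "ms to ramp from base to 1110MHz")
```

Even in this toy, the 100 ms tick shows why ramp-up latency matters: climbing eight bins from the base clock takes the better part of a second.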

The GeForce GTX 680

Now that we’ve looked at the GPU in some detail, let me drop the specs on you for the first card based on the GK104, the GeForce GTX 680.

| Card | GPU base clock (MHz) | GPU boost clock (MHz) | Shader ALUs | Textures filtered/clock | ROP pixels/clock | Memory transfer rate | Memory interface width (bits) | Idle/peak power draw |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GeForce GTX 680 | 1006 | 1058 | 1536 | 128 | 32 | 6 GT/s | 256 | 15W/195W |

The GTX 680 has (as far as we know, at least) all of the GK104’s functional units enabled, and it takes that revised memory interface up to 6 GT/s, as advertised. The board’s peak power draw is fairly tame, considering its positioning, but not perhaps considering the class of chip under that cooler.

| Card | Peak pixel fill rate (Gpixels/s) | Peak bilinear filtering (Gtexels/s) | Peak bilinear FP16 filtering (Gtexels/s) | Peak shader arithmetic (TFLOPS) | Peak rasterization rate (Mtris/s) | Memory bandwidth (GB/s) |
| --- | --- | --- | --- | --- | --- | --- |
| GeForce GTX 560 Ti | 29 | 58 | 58 | 1.4 | 1800 | 134 |
| GeForce GTX 560 Ti 448 | 29 | 41 | 41 | 1.3 | 2928 | 152 |
| GeForce GTX 580 | 37 | 49 | 49 | 1.6 | 3088 | 192 |
| GeForce GTX 680 | 32 | 129 | 129 | 3.1 | 4024 | 192 |
| Radeon HD 5870 | 27 | 68 | 34 | 2.7 | 850 | 154 |
| Radeon HD 6970 | 28 | 85 | 43 | 2.7 | 1780 | 176 |
| Radeon HD 7870 | 32 | 80 | 40 | 2.6 | 2000 | 154 |
| Radeon HD 7970 | 30 | 118 | 59 | 3.8 | 1850 | 264 |

Multiply the chip’s capabilities by its clock speeds, and you get a sense of how the GTX 680 stacks up to the competition. In most key rates, its theoretical peaks are higher than the Radeon HD 7970’s—and our estimates conservatively use the base clock, not the boost clock, as their basis. The only deficits are in peak shader FLOPS, where the 7970 is faster, and in memory bandwidth, thanks to Tahiti’s 384-bit memory interface.
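Here is that multiplication spelled out for the GTX 680 as a small Python sketch, using the per-clock figures for the GK104 and the 1006MHz base clock. The function and key names are ours, and the FLOPS figure assumes the usual convention of counting a fused multiply-add as two operations per ALU per clock.

```python
def peak_rates(clock_mhz, rops, texels_per_clk, alus, tris_per_clk):
    """Theoretical peaks: per-clock capability x clock speed."""
    ghz = clock_mhz / 1000
    return {
        "Gpixels/s": rops * ghz,
        "Gtexels/s": texels_per_clk * ghz,
        "TFLOPS": alus * 2 * ghz / 1000,   # assumes 2 flops (one FMA) per ALU per clock
        "Mtris/s": tris_per_clk * clock_mhz,
    }


if __name__ == "__main__":
    gtx680 = peak_rates(1006, rops=32, texels_per_clk=128, alus=1536, tris_per_clk=4)
    for name, value in gtx680.items():
        print(f"{name}: {value:.2f}")
```

Rounded off, these reproduce the GTX 680's row in the table above. Swapping in the 1058MHz boost clock would nudge every figure up by about 5%.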

With that said, you may or may not be pleased to hear that Nvidia has priced the GeForce GTX 680 at $499.99. On one hand, that undercuts the Radeon HD 7970 by 50 bucks and should be a decent deal given its specs. On the other, that’s a lot more than you’d expect to pay for the spiritual successor to the GeForce GTX 560 Ti—and despite its name, the GTX 680 is most definitely that. Simply knowing that fact may create a bit of a pain point for some of us, even if the price is justified based on this card’s performance.

Thanks to its relatively low peak power consumption, the GTX 680 can get away with only two six-pin power inputs. Strangely, Nvidia has staggered those inputs, supposedly to make them easier to access. However, notice that the orientation on the lower input is rotated 180° from the upper one. That means the tabs to release the power plugs are both “inside,” facing each other, which makes them harder to grasp. I don’t know what part of this arrangement is better than the usual side-by-side layout.

The 680’s display outputs are a model of simplicity: two dual-link DVI ports, an HDMI output, and a full-sized DisplayPort connector.

At 10″, the GTX 680 is just over half an inch shorter than its closest competitor, the Radeon HD 7970.

Our testing methods

This review marks the debut of our new GPU test rigs, which we’ve already outed here. They’ve performed wonderfully for us, with lower operating noise, higher CPU performance in games, and support for PCI Express 3.0.

Oh, before we move on, please note below that we’ve tested stock-clocked variants of most of the graphics cards involved, including the Radeon HD 7970, 7870, 6970, and 5870 and the GeForce GTX 580 and 680. We agonized over whether to use a Radeon HD 7970 card like the XFX Black Edition, which runs 75MHz faster than AMD’s reference clock. However, we decided to stick with stock clocks for the higher-priced cards this time around. We expect board makers to offer higher-clocked variants of the GTX 680, which we’ll happily compare to higher-clocked 7970s once we get our hands on ’em. Although we’re sure our decision will enrage some AMD fans, we don’t think the XFX Black Edition’s $600 price tag would have looked very good in our value scatter plots, and we just didn’t have time to include multiple speed grades of the same product.

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and we’ve reported the median result.
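In case the choice of the median seems arbitrary, this quick Python sketch shows why it's handy for benchmark runs: a single hiccup barely moves it. The frame rates below are made-up numbers, not results from this review, and `reported_result` is just an illustrative helper.

```python
from statistics import mean, median


def reported_result(runs):
    """Median of at least three benchmark runs."""
    if len(runs) < 3:
        raise ValueError("need at least three runs")
    return median(runs)


if __name__ == "__main__":
    runs_fps = [59.0, 60.0, 31.0]     # hypothetical: the third run hit a hiccup
    print(reported_result(runs_fps))  # the median shrugs off the outlier
    print(mean(runs_fps))             # the mean gets dragged down by it
```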

Our test systems were configured like so:

| Component | Configuration |
| --- | --- |
| Processor | Core i7-3820 |
| Motherboard | Gigabyte X79-UD3 |
| Chipset | Intel X79 Express |
| Memory size | 16GB (4 DIMMs) |
| Memory type | Corsair Vengeance CMZ16GX3M4X1600C9 DDR3 SDRAM at 1600MHz |
| Memory timings | 9-9-11-24 1T |
| Chipset drivers | INF update 9.3.0.1019, Rapid Storage Technology Enterprise 3.0.0.3020 |
| Audio | Integrated X79/ALC898 with Realtek 6.0.1.6526 drivers |
| Hard drive | Corsair F240 240GB SATA |
| Power supply | Corsair AX850 |
| OS | Windows 7 Ultimate x64 Edition Service Pack 1, DirectX 11 June 2010 Update |

| Card | Driver revision | GPU core clock (MHz) | Memory clock (MHz) | Memory size (MB) |
| --- | --- | --- | --- | --- |
| Asus GeForce GTX 560 Ti DirectCU II TOP | ForceWare 295.73 | 900 | 1050 | 1024 |
| Zotac GeForce GTX 560 Ti 448 | ForceWare 295.73 | 765 | 950 | 1280 |
| Zotac GeForce GTX 580 | ForceWare 295.73 | 772 | 1002 | 1536 |
| GeForce GTX 680 | ForceWare 300.99 | 1006 | 1502 | 2048 |
| Asus Matrix Radeon HD 5870 | Catalyst 8.95.5-120224a | 850 | 1200 | 2048 |
| Radeon HD 6970 | Catalyst 8.95.5-120224a | 890 | 1375 | 2048 |
| Radeon HD 7870 | Catalyst 8.95.5-120224a | 1000 | 1200 | 2048 |
| Radeon HD 7970 | Catalyst 8.95.5-120224a | 925 | 1375 | 3072 |

Thanks to Intel, Corsair, and Gigabyte for helping to outfit our test rigs with some of the finest hardware available. AMD, Nvidia, and the makers of the various products supplied the graphics cards for testing, as well.

Unless otherwise specified, image quality settings for the graphics cards were left at the control panel defaults. Vertical refresh sync (vsync) was disabled for all tests.

We used the following test applications:

Some further notes on our methods:

We used the Fraps utility to record frame rates while playing a 90-second sequence from the game. Although capturing frame rates while playing isn’t precisely repeatable, we tried to make each run as similar as possible to all of the others. We tested each Fraps sequence five times per video card in order to counteract any variability. We’ve included frame-by-frame results from Fraps for each game, and in those plots, you’re seeing the results from a single, representative pass through the test sequence.

We measured total system power consumption at the wall socket using a Yokogawa WT210 digital power meter. The monitor was plugged into a separate outlet, so its power draw was not part of our measurement. The cards were plugged into a motherboard on an open test bench. The idle measurements were taken at the Windows desktop with the Aero theme enabled. The cards were tested under load running Skyrim at its Ultra quality settings with FXAA enabled.

We measured noise levels on our test system, sitting on an open test bench, using an Extech 407738 digital sound level meter. The meter was mounted on a tripod approximately 10″ from the test system at a height even with the top of the video card. You can think of these noise level measurements much like our system power consumption tests, because the entire system’s noise levels were measured. Of course, noise levels will vary greatly in the real world along with the acoustic properties of the PC enclosure used, whether the enclosure provides adequate cooling to avoid a card’s highest fan speeds, placement of the enclosure in the room, and a whole range of other variables. These results should give a reasonably good picture of comparative fan noise, though.

We used GPU-Z to log GPU temperatures during our load testing.
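For readers who want to crunch frame-by-frame logs of their own, the basic arithmetic behind the Fraps results looks something like the Python sketch below. It assumes the log has already been read into a list of cumulative timestamps in milliseconds, one per frame. The sample numbers are invented, and the helper names are ours.

```python
def frame_times_ms(timestamps_ms):
    """Per-frame render times: the gaps between consecutive timestamps."""
    return [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]


def average_fps(timestamps_ms):
    """Frames completed divided by elapsed seconds."""
    elapsed_s = (timestamps_ms[-1] - timestamps_ms[0]) / 1000
    return (len(timestamps_ms) - 1) / elapsed_s


if __name__ == "__main__":
    stamps = [0, 16, 33, 50, 100]   # hypothetical: the last frame took 50 ms
    print(frame_times_ms(stamps))
    print(average_fps(stamps))
```

Note how the average hides that one slow frame at the end, which is a big part of why we plot frame-by-frame results rather than relying on FPS averages alone.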

The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Texture filtering

We’ll begin with a series of synthetic tests aimed at exposing the true, delivered throughput of the GPUs. In each instance, we’ve included a table with the relevant theoretical rates for each solution, for reference.

| Card | Peak pixel fill rate (Gpixels/s) | Peak bilinear filtering (Gtexels/s) | Peak bilinear FP16 filtering (Gtexels/s) | Memory bandwidth (GB/s) |
| --- | --- | --- | --- | --- |
| GeForce GTX 560 Ti | 29 | 58 | 58 | 134 |
| GeForce GTX 560 Ti 448 | 29 | 41 | 41 | 152 |
| GeForce GTX 580 | 37 | 49 | 49 | 192 |
| GeForce GTX 680 | 32 | 129 | 129 | 192 |
| Radeon HD 5870 | 27 | 68 | 34 | 154 |
| Radeon HD 6970 | 28 | 85 | 43 | 176 |
| Radeon HD 7870 | 32 | 80 | 40 | 154 |
| Radeon HD 7970 | 30 | 118 | 59 | 264 |

The pixel fill rate is, in theory, determined by the speed of the ROP hardware, but this test usually winds up being limited by memory bandwidth long before the ROPs run out of steam. That appears to be the case here. Somewhat surprisingly, the GTX 680 manages to match the Radeon HD 7970 almost exactly, even though the Radeon has substantially more potential memory bandwidth on tap.

Nvidia’s new toy comes out looking very good in terms of texturing capacity, more than doubling the performance of the GeForce GTX 580 in the texture fill and integer filtering tests. Kepler’s full-rate FP16 filtering allows it to outperform the 7970 substantially in the final test. In no case does the GTX 680’s relatively lower memory bandwidth appear to hinder its ability to keep up with the 7970.

Tessellation and geometry throughput

| Card | Peak rasterization rate (Mtris/s) | Memory bandwidth (GB/s) |
| --- | --- | --- |
| GeForce GTX 560 Ti | 1800 | 134 |
| GeForce GTX 560 Ti 448 | 2928 | 152 |
| GeForce GTX 580 | 3088 | 192 |
| GeForce GTX 680 | 4024 | 192 |
| Radeon HD 5870 | 850 | 154 |
| Radeon HD 6970 | 1780 | 176 |
| Radeon HD 7870 | 2000 | 154 |
| Radeon HD 7970 | 1850 | 264 |

Although the GTX 680 has a higher theoretical rasterization rate than the GTX 580, the GK104 GPU has only half as many setup and tessellator units (aka PolyMorph engines) as the GF110. Despite that fact, the GTX 680 achieves twice the tessellation performance of Fermi. The GTX 680 even exceeds that rate in TessMark’s 64X expansion test, where it’s nearly three times the speed of the Radeon HD 7970. We doubt we’ll see a good use of a 64X geometry expansion factor in a game this year, but the Kepler architecture clearly has plenty of headroom here.

Shader performance

| Card | Peak shader arithmetic (TFLOPS) | Memory bandwidth (GB/s) |
| --- | --- | --- |
| GeForce GTX 560 Ti | 1.4 | 134 |
| GeForce GTX 560 Ti 448 | 1.3 | 152 |
| GeForce GTX 580 | 1.6 | 192 |
| GeForce GTX 680 | 3.1 | 192 |
| Radeon HD 5870 | 2.7 | 154 |
| Radeon HD 6970 | 2.7 | 176 |
| Radeon HD 7870 | 2.6 | 154 |
| Radeon HD 7970 | 3.8 | 264 |

Our first look at the performance of Kepler’s re-architected SMX yields some mixed, and intriguing, results. The trouble with many of these tests is that they split so cleanly along architectural or even brand lines. For instance, the 3DMark particles test runs faster on any GeForce than on any Radeon. We’re left a little flummoxed by the fact that the 7970 wins three tests outright, and the GTX 680 wins the other three. What do we make of that, other than to call it even?

Nonetheless, there are clear positives here, such as the GTX 680 taking the top spot in the ShaderToyMark and GPU cloth tests. The GTX 680 improves on the Fermi-based GTX 580’s performance in five of the six tests, sometimes by wide margins. Still, for a card with the same memory bandwidth and ostensibly twice the shader FLOPS, the GTX 680 doesn’t appear to outperform the GTX 580 as comprehensively as one might expect.

GPU computing performance

This benchmark, built into Civ V, uses DirectCompute to perform compression on a series of textures. Again, this is a nice result from the new GeForce, though the 7970 is a smidge faster in the end.

Here’s where we start to worry. In spite of doing well in our graphics-related shader benchmarks and in the DirectCompute test above, the GTX 680 tanks in LuxMark’s OpenCL-driven ray-tracing test. Even a quad-core CPU is faster! The shame! More notably, the GTX 680 trails the GTX 580 by a mile—and the Radeon HD 7970 by several. Nvidia tells us LuxMark isn’t a target for driver optimization and may never be. We suppose that’s fine, but we’re left wondering just how much Kepler’s compiler-controlled shaders will rely on software tuning in order to achieve good throughput in GPU computing applications. Yes, this is only one test, and no, there aren’t many good OpenCL benchmarks yet. Still, we’re left to wonder.

Then again, we are in the early days for OpenCL support generally, and AMD seems to be very committed to supporting this API. Notice how the Core i7-3820 runs this test faster when using AMD’s APP driver than when using Intel’s own OpenCL ICD. If a brainiac monster like Sandy Bridge-E can benefit that much from AMD’s software tuning over Intel’s own, well, we can’t lay much fault at Kepler’s feet just yet.

The Elder Scrolls V: Skyrim

Our test run for Skyrim was a lap around the town of Whiterun, starting up high at the castle entrance, descending down the stairs into the main part of town, and then doing a figure-eight around the main drag.

Since these are pretty capable graphics cards, we set the game to its “Ultra” presets, which turns on 4X multisampled antialiasing. We then layered on FXAA post-process anti-aliasing, as well, for the best possible image quality without editing an .ini file.

At this point, you may be wondering what’s going on with the funky plots shown above. Those are the raw data for our snazzy new game benchmarking methods, which focus on the time taken to render each frame rather than a frame rate averaged over a second. For more information on why we’re testing this way, please read this article, which explains almost everything.

Frame time (ms)   FPS rate
8.3               120
16.7              60
20                50
25                40
33.3              30
50                20

If that’s too much work for you, the basic premise is simple enough. The key to creating a smooth animation in a game is to flip from one frame to the next as quickly as possible in continuous fashion. The plots above show the time required to produce each frame of the animation, on each card, in our 90-second Skyrim test run. As you can see, some of the cards struggled here, particularly the GeForce GTX 560 Ti, which was running low on video memory. Those long waits for individual frames, some of them 100 milliseconds (that’s a tenth of a second) or more, produce less-than-fluid action in the game.

Notice that, in dealing with render times for individual frames, longer waits are a bad thing—lower is better, when it comes to latencies. For those who prefer to think in terms of FPS, we’ve provided the handy table at the right, which offers some conversions. See how, in the last plot, frame times are generally lower for the GeForce GTX 680 than for the Radeon HD 7970, and so the GTX 680 produces more total frames? Well, that translates into…

…higher FPS averages for the new GeForce. Quite a bit higher, in this case. Also notice that some of our worst offenders in terms of long frame times, such as the GeForce GTX 560 Ti and the GTX 560 Ti 448, produce seemingly “acceptable” frame rates of 41 and 50 FPS, respectively. We might expect that FPS number to translate into adequate performance, but we know from looking at the plot that’s not the case.
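For those who like to see the arithmetic, here is a minimal Python sketch of the conversions in our table and of how an FPS average can paper over long frames. The function names and the sample run are our own illustrative inventions, not data from the cards tested.

```python
from typing import Sequence


def frame_time_to_fps(frame_time_ms: float) -> float:
    """Convert one frame time in milliseconds to the equivalent frame rate."""
    return 1000.0 / frame_time_ms


def average_fps(frame_times_ms: Sequence[float]) -> float:
    """Average FPS over a run: frames rendered divided by elapsed seconds."""
    total_seconds = sum(frame_times_ms) / 1000.0
    return len(frame_times_ms) / total_seconds


# The frame times listed in the conversion table.
for ms in (8.3, 16.7, 20, 25, 33.3, 50):
    print(f"{ms:>5} ms -> {frame_time_to_fps(ms):.0f} FPS")

# Hypothetical run (made-up numbers): mostly 16.7 ms frames plus a few 100 ms stalls.
spiky_run = [16.7] * 95 + [100.0] * 5
print(f"average: {average_fps(spiky_run):.1f} FPS")
```

The made-up run averages roughly 48 FPS even though five of its frames took a tenth of a second apiece, which is the sort of thing an FPS average alone won't tell you.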

To give us a better sense of the frame latency picture, or the general fluidity of gameplay, we can look at the 99th percentile frame latency—that is, 99% of all frames were rendered during this frame time or less. Once we do that, we can see just how poorly the GTX 560 Ti handles itself here compared to everything else.
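As a rough sketch of the idea, the Python below picks out a percentile frame time from a list of per-frame render times. It assumes the simple nearest-rank method, which may not match the exact calculation behind our graphs, and the sample run is made up for illustration.

```python
import math
from typing import Sequence


def percentile_frame_time(frame_times_ms: Sequence[float], pct: float = 99.0) -> float:
    """Smallest frame time such that at least pct% of frames took that long or less.

    Uses the nearest-rank method (an assumption; other percentile
    definitions interpolate between neighboring samples).
    """
    if not frame_times_ms:
        raise ValueError("need at least one frame time")
    ordered = sorted(frame_times_ms)
    rank = math.ceil(pct * len(ordered) / 100.0)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical 100-frame run: 97 quick frames and three slow ones.
run = [16.0] * 97 + [40.0, 55.0, 120.0]
print(percentile_frame_time(run))      # 99th percentile
print(percentile_frame_time(run, 95))  # 95th percentile
```

Passing a different `pct` gives other points on the latency curve, so the same helper covers the 50th or 95th percentile, too.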

We’re still experimenting with our new methods, and I’m going to drop a couple of new wrinkles on you here today. We think the 99th percentile latency number is a good one, but since it’s just one point among many, we have some concerns about using it alone to convey the general latency picture. As a bit of an experiment, we’ve decided to expand our look at frame times to cover more points, like so.

This illustrates how close the matchup is between several of the cards, especially our headliners, the Radeon HD 7970 and GeForce GTX 680. Although the GeForce generally produces frames in less time than the Radeon, both are very close to that magic 16.7 ms (60 FPS) mark 95% of the time. Adding in those last few percentage points, that last handful of frames that take longer to render, makes the GTX 680’s advantage nearly vanish.

Our next goal is to focus more closely on the tough parts, places where the GPU’s performance limitations may be contributing to less-than-fluid animation, occasional stuttering, or worse. For that, we add up all of the time each GPU spends working on really long frame times, those above 50 milliseconds or (put another way) below about 20 FPS. We’ve explained our rationale behind this one in more detail right here, if you’re curious or just confused.
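Here is a minimal Python sketch of that tally. It assumes that only the portion of each frame beyond the 50 ms cutoff counts toward the total; summing the whole of each long frame would be another reasonable reading. The function name and sample frame times are hypothetical.

```python
from typing import Sequence


def time_beyond_threshold(frame_times_ms: Sequence[float], threshold_ms: float = 50.0) -> float:
    """Total milliseconds spent past the threshold, summed over all long frames.

    Assumption: a 180 ms frame with a 50 ms threshold contributes 130 ms,
    not the full 180 ms.
    """
    return sum(t - threshold_ms for t in frame_times_ms if t > threshold_ms)


# Hypothetical frame times in milliseconds, including two long frames.
sample = [16.7, 20.0, 55.0, 180.0, 33.3]
print(time_beyond_threshold(sample))
```

Counting only the excess keeps a card that barely crosses 50 ms from looking as bad as one that stalls for a fifth of a second.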

Only the two offenders we’ve already identified really spend any significant time working on really long-to-render frames. The rest of the pack (and I’d include the GTX 580 in this group) handles Skyrim at essentially the highest quality settings quite well.

Batman: Arkham City

We did a little Batman-style free running through the rooftops of Gotham for this one.

Several factors converged to make us choose these settings. One of our goals in preparing this article was to avoid the crazy scenario we had in our GeForce GTX 560 Ti 448 review, where every card tested could run nearly every game adequately. We wanted to push the fastest cards to their limits, not watch them tie a bunch of other cards for adequacy. So we cranked up the resolution and image quality and, yes, even enabled DirectX 11. We had previously avoided using DX11 with this game because the initial release had serious performance problems on pretty much any video card. A patch has since eliminated the worst problems, and the game is now playable in DX11, so we enabled it.

This choice makes sense for benchmarking ultra high-end graphics cards, I think. I have to say, though, that the increase in image quality with DX11 tessellation, soft shadows, and ambient occlusion isn’t really worth the performance penalty you’ll pay. The image quality differences are hard to see; the performance differences are abundantly obvious. This game looks great and runs very smoothly at 2560×1600 in DX9 mode, even on a $250 graphics card.

The GTX 680 again takes the top spot in the FPS sweeps, but as you can see in the plots above, all of the cards produce some long frame times with regularity. As a result of those higher-latency frames, the GTX 680 ties the 7970 in the 99th percentile frame time metric.

A broader look at the latency picture shows that the GTX 680 generally produces lower-latency frames than the 7970, which is why its FPS average is so high. However, that last 1% gives it trouble.

Lots of trouble, when we look at the time spent on long-latency frames. What happened to the GTX 680? Well, look up at the plots above, and you’ll see that, very early in our test run, there was a frame that took nearly 180 ms to produce—nearly a fifth of a second. As we played the game, we experienced this wait as a brief but total interruption in gameplay. That stutter, plus a few other shorter ones, contributed to the 680’s poor showing here. Turns out we ran into this problem with the GTX 680 in four of our five test runs, each time early in the run and each time lasting about 180 ms. Nvidia tells us the slowdown is the result of a problem with its GPU Boost mechanism that will be fixed in an upcoming driver update.

Battlefield 3

We tested Battlefield 3 with all of its DX11 goodness cranked up, including the “Ultra” quality settings with both 4X MSAA and the high-quality version of the post-process FXAA. We tested in the “Operation Guillotine” level, for 60 seconds starting at the third checkpoint.

Blessedly, there aren’t many wrinkles at all in BF3 performance from any of the cards. The 99th percentile frame times mirror the FPS averages, and all is well with the world. Even the slow cards are just generally slow and not plagued with excessively spiky, uneven frame times like we saw in Arkham City. This time, the GeForce GTX 680 outperforms the Radeon HD 7970 in every metric we throw at it, although its advantage is incredibly slim in every case.

Crysis 2

Our cavalcade of punishing but pretty DirectX 11 games continues with Crysis 2, which we patched with both the DX11 and high-res texture updates.

Notice that we left object image quality at “extreme” rather than “ultra,” in order to avoid the insane over-tessellation of flat surfaces that somehow found its way into the DX11 patch. We tested 90 seconds of gameplay in the level pictured above, where we gunned down several bad guys, making our way up the railroad bridge.

The GTX 680 just trails the 7970 in the FPS average, but its 99th percentile frame time falls behind a couple of other cards, including the Radeon HD 7870. Why? If you look at the plot for the GTX 680, you can see how, in the opening portion of the test run, its frame times regularly climb into the 30-millisecond range. That’s probably why its 99th percentile frame time is 32 milliseconds. Translated, that’s roughly 30 FPS, and therefore nothing to worry about in the grand scheme. The GTX 680 devotes almost no time to really long frames, and its performance is quite acceptable here, just not quite as good as the 7970’s during those opening moments of the test sequence.

Serious Sam 3: BFE

We tested Serious Sam 3 at its “Ultra” quality settings, only tweaking it to remove the strange two-megapixel cap on the rendering resolution.

How interesting. Generally, this is one of those games where a particular sort of GPU architecture tends to do well—Radeons, in this case. However, the GeForce GTX 680 is different enough from its siblings that it utterly reverses that trend, effectively tying the Radeon HD 7970.

Power consumption

We’re pretty pleased with the nice, low power consumption numbers our new test rigs are capable of producing at idle. Not bad for quad memory channels, Sandy Bridge Extreme, and an 850W PSU, eh?

Although the entire system’s power draw is part of our measurement, the display is not. The reason we’re testing with the display off is that the new Radeons are capable of going into a special ultra-low-power mode, called ZeroCore Power, when the display goes into standby. Most of the chip is turned off, and the GPU cooling fans spin down to a halt. That allows them to save about 12W of power draw on our test system, a feat the GTX 680 can’t match. Still, the 680’s power draw at idle is otherwise comparable to the 7970’s, with only about a watt’s worth of difference between them.

We’re running Skyrim for this test, and here’s where Kepler’s power efficiency becomes readily apparent. When equipped with the Radeon HD 7970, our test rig requires over 40W more power under load than it does when a GeForce GTX 680 is installed. You can see why I’ve said this is the same class of GPU as the GeForce GTX 560 Ti, although its performance is a generation beyond that.

Since we tested power consumption in Skyrim, we can mash that data up with our performance results to create a rough picture of power efficiency. By this measure, the GTX 680 is far and away the most power-efficient performer we’ve tested.
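One simple way to boil such a mash-up down to a single number is frames per second per watt of system power. The Python sketch below shows that ratio. The ratio itself is our own illustrative shorthand rather than the method behind the plot, and the cards and figures in it are placeholders, not measurements.

```python
def fps_per_watt(avg_fps: float, system_watts: float) -> float:
    """Average frame rate divided by total system power draw under load."""
    if system_watts <= 0:
        raise ValueError("power draw must be positive")
    return avg_fps / system_watts


# Placeholder values for two hypothetical cards (not measured results).
cards = {
    "Card A": (80.0, 320.0),  # (average FPS, system watts under load)
    "Card B": (76.0, 360.0),
}
for name, (fps, watts) in cards.items():
    print(f"{name}: {fps_per_watt(fps, watts):.3f} FPS per watt")
```

Because the wattage covers the whole test rig, a ratio like this understates the gap between the graphics cards themselves.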

Noise levels and GPU temperatures

Even though the Radeon HD 7970 can turn off its cooling fan when the display goes into power-save, it doesn’t convey any measurable advantage here. The GTX 680 essentially adds nothing to our system’s total noise levels, which consist almost entirely of noise from the (very quiet) CPU cooler.

Under load, the GTX 680’s cooler performs admirably, maintaining the same GPU temperature as the 7970 while generating substantially less sound pressure. Of course, the GTX 680’s cooler has quite a bit less power (and thus heat) to deal with, but Nvidia has a long tradition of acoustic excellence for its coolers, dating back to at least the GeForce 8800 GTX (though not, you know, NV30).

We’re not terribly pleased with the fan speed profile AMD has chosen for its stock 7970 cards, which seems to be rather noisy. However, we should note that we’ve seen much better cooling and acoustic performance out of XFX’s Radeon HD 7970 Black Edition, a card with slightly higher clock speeds. It’s a little pricey, but it’s also clearly superior to the reference design.

Going scatter-brained

The scatter plot of power and performance on the previous page has inspired me to try a bit of an experiment. This is just for fun, so feel free to skip ahead if you’d like. I’m just curious to see what we can learn by mashing up some other bits of info with our overall performance data across all of the games we tested.

This one isn’t really fair at all, since we haven’t normalized for the chip fabrication process involved. The three GPUs produced on a 28-nm process are all vastly superior, in terms of performance per area, to their 40-nm counterparts. The difference in size between the GeForce GTX 580 and the Radeon HD 7870, for roughly equivalent performance, is comical. The GTX 680 looks quite good among the three 28-nm chips, with higher performance and a smaller die area than the 7970.

The next few scatters are for the GPU architecture geeks who might be wondering about all of those graphics rates we’re always quoting and measuring. Here’s a look at how the theoretical peak numbers in different categories track with delivered performance in games. What we’re looking for here is a strong or weak correlation; a stronger correlation should give us a nice collection of points roughly forming a diagonal line, or something close to it.
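For anyone who would rather put a number on "roughly forming a diagonal line" than eyeball it, the Pearson correlation coefficient is one common yardstick. The Python sketch below computes it by hand. We judged the plots by eye, so this is an illustration rather than our method, and the sample values are made up.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: near +1 or -1 means points fall close to a line, near 0 means they don't."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: a theoretical peak rate and an average FPS for four hypothetical cards.
peak_rate = [1.0, 2.0, 3.0, 4.0]
avg_fps = [30.0, 41.0, 48.0, 62.0]
print(f"r = {pearson_r(peak_rate, avg_fps):.2f}")
```

With only a handful of cards per plot, a coefficient like this is a rough guide at best, so it complements the scatter plots rather than replacing them.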

The first couple of plots, with rasterization rate and FLOPS, don’t show us much correlation at all between these properties and in-game performance. The final three begin to fall into line a little bit, with memory bandwidth and ROP rate (or pixel fill) being most strongly correlated, to my eye. Notice that the GeForce GTX 680 is apparently very efficient with its memory bandwidth, well outside of the norm.

These results led me to wonder whether the correlations would grow stronger if we subbed in the results of directed tests instead of theoretical peak numbers. We do have some of that data, so…

ShaderToyMark gives us the strongest correlation, which shouldn’t be too much of a surprise, since it’s the most game-like graphics workload among our directed tests. Otherwise, I’m not sure we can draw too many strong conclusions from these results, other than to say that the GTX 680 sure looks to have an abundance of riches when it comes to FP16 texture filtering.

Conclusions

With a tremendous amount of information now under our belts, we can boil things down, almost cruelly, to a few simple results in a final couple of scatter plots. First up is our overall performance index, in terms of average FPS across all of the games we tested, matched against the price of each card. As usual, the most desirable position on these plots is closer to the top left corner, where the performance is higher and the price is lower.

The GeForce GTX 680 is slightly faster and 50 bucks less expensive than the Radeon HD 7970, so it lands in a better position on this first plot. However, if we switch to an arguably superior method of understanding gaming performance and smoothness, our 99th percentile frame time (converted to FPS so the plot reads the same), the results change a bit.

The GTX 680’s few instances of higher frame latencies, such as that apparent GPU Boost issue in Arkham City, move it just a couple of ticks below the Radeon HD 7970 in overall performance. Then again, the GTX 680 costs $50 less, so it’s still a comparable value.

The truth is that, either way you look at it, there is very little performance difference between these two cards, and any difference is probably imperceptible to the average person.

GeForce GTX 680 March 2012