Oh, man. Just a few days before Christmas, AMD uncorked a massive jug of holiday cheer in the form of the Radeon HD 7970 graphics card. Sloshing around inside? The world’s first GPU produced on a 28-nm manufacturing process. This incredibly fine new production process has allowed AMD to cram more transistors (and thus more graphics horsepower of virtually every sort) into this puppy than any graphics chip to come before. While many kids were looking forward to the latest Xbox 360 game under the tree on Christmas morning, the Radeon HD 7970 delivers nearly fifteen times the texel filtering speed of Microsoft’s venerable game console, to name one key graphics rate. I don’t want to dwell on it, but this new Radeon is nearly an order of magnitude more powerful than an Xbox 360 in nearly every respect that matters.

Ok, so I kinda do want to dwell on it, but we need to move on, just like the former ATI has done since creating the Xbox 360’s GPU.

This new Radeon’s true competitors, of course, are the other PC graphics processors on the market, and it has nearly all of them beaten on paper. The chip behind the action is known as “Tahiti,” part of AMD’s “Southern Islands” lineup of next-gen GPUs. As a brand-new design, Tahiti is, of course, infused with all of the latest features—and a few new marketing buzzwords, too. The highlights alone are breathtaking: 2048 shader ALUs, a 384-bit memory interface, PCI Express 3.0, support for DirectX 11.1, and a hardware video encoding engine. Tahiti features the “Graphics Core Next” (note to Rory Read: time to stop letting engineers name these things) shader architecture that promises more efficient scheduling and thus higher delivered throughput, especially for non-graphics applications.

If the prior paragraph wasn’t sufficient to impress you, perhaps the block diagram above will do the trick. One of the themes of modern GPUs is massive parallelism, and nowhere is that parallelism more massive than in Tahiti. Honestly, the collection of Chiclets above leaves much to be desired as a functional representation of a GPU, especially the magic cloudy bits that represent the shader cores. Still, the basic outlines of the thing are obvious, if you’ve looked over such diagrams in the past. Across the bottom are six memory controllers, each with a pair of 32-bit memory channels. Running up and down the center are the shader or compute units, of which there are 32. Flanking the CUs are eight ROP partitions, each with four color and 16 Z/stencil ROP units. The purple bits represent cache and buffers of various types, which are a substantial presence in Tahiti’s floorplan.

We will get into these things in more detail shortly, but first, let’s take a quick look at how Tahiti stacks up, in a general sense, versus the DirectX 11 GPUs presently on the market.

The Tahiti GPU between my thumb and forefinger

| GPU | ROP pixels/clock | Texels filtered/clock (int/fp16) | Shader ALUs | Rasterized triangles/clock | Memory interface width (bits) | Estimated transistor count (millions) | Die size (mm²) | Fabrication process node |
|---|---|---|---|---|---|---|---|---|
| GF114 | 32 | 64/64 | 384 | 2 | 256 | 1950 | 360 | 40 nm |
| GF110 | 48 | 64/64 | 512 | 4 | 384 | 3000 | 520 | 40 nm |
| Cypress | 32 | 80/40 | 1600 | 1 | 256 | 2150 | 334 | 40 nm |
| Barts | 32 | 56/28 | 1120 | 1 | 256 | 1700 | 255 | 40 nm |
| Cayman | 32 | 96/48 | 1536 | 2 | 256 | 2640 | 389 | 40 nm |
| Tahiti | 32 | 128/64 | 2048 | 2 | 384 | 4310 | 365 | 28 nm |

The most immediate comparison we’ll want to make is between Tahiti and the chip it succeeds, the Cayman GPU that powers the Radeon HD 6900 series. At 4.3 billion, its transistor count doesn’t quite double Cayman’s, but Tahiti is easily the most complex GPU ever. Tahiti improves a bunch of key graphics resources by at least a third over Cayman, including texture filtering capacity, memory interface width, and number of shader ALUs. Even so, Tahiti is a smaller chip than Cayman, and it carries on AMD’s recent practice of building “mid-sized” chips to serve the upper portions of the market. As you can see, Nvidia’s GF110 still dwarfs Tahiti, although Tahiti crams in more transistors courtesy of TSMC’s 28-nm fabrication process.
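The Cayman comparison is easy to check from the table figures. Here is a quick sketch; the dictionary layout and key names are ours, chosen only for illustration, while the numbers come straight from the table.

```python
# Per-chip figures copied from the comparison table.
cayman = {"texels_per_clock": 96, "alus": 1536, "bus_bits": 256,
          "transistors_m": 2640, "die_mm2": 389}
tahiti = {"texels_per_clock": 128, "alus": 2048, "bus_bits": 384,
          "transistors_m": 4310, "die_mm2": 365}

# Ratio of each Tahiti figure to its Cayman counterpart.
ratios = {key: tahiti[key] / cayman[key] for key in cayman}

# Transistor density in millions of transistors per square millimeter.
density = {
    "Cayman": cayman["transistors_m"] / cayman["die_mm2"],
    "Tahiti": tahiti["transistors_m"] / tahiti["die_mm2"],
}

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
for name, value in density.items():
    print(f"{name}: {value:.1f} Mtransistors/mm^2")
```

The density lines show where the 28-nm process pays off: Tahiti packs far more transistors into each square millimeter, which is how a smaller die ends up as the more complex chip.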

Of course, Tahiti is just the first of a series of GPUs, and it will contribute its DNA to at least two smaller chips still in the works. One, code-named “Pitcairn,” will supplant Barts and drive the Radeon HD 7800 series of graphics cards in a more affordable (think $250 or less) portion of the market. Below that, another chip, known as “Cape Verde,” will at last relieve the Juniper GPU of its duties, which have included both the Radeon HD 5700 series and the re-branded 6700 series. Although we believe both of these new chips are imminent, we don’t yet know exactly when AMD plans to introduce them. Probably before they arrive, AMD will unleash at least one additional card based on Tahiti, the more affordable Radeon HD 7950.

There is one other code name in this collection of Southern Islands. At its press event for the 7970, AMD simply showed the outline of a pair of islands along with the words, “Coming soon.” The rest isn’t too hard to parse out, since the contours of the islands pictured match those of New Zealand—which also happens to be the name of the rumored upcoming dual-GPU video card based on a pair of Tahiti chips. New Zealand will probably end up being called the Radeon HD 7990 and serving the very high end of the market by being really, truly, obnoxiously, almost disturbingly powerful. We’re curious to see whether New Zealand will be as difficult to find in stock at Newegg as Antilles, also known as the Radeon HD 6990. Maybe, you know, the larger land mass will help folks locate it more consistently.

Absent any additional code names, we’re left to speculate that AMD may rely on older chips to serve the lower reaches of the market. The recent introduction of the mobile Radeon HD 7000M series, based entirely on Cypress derivatives, suggests that’s the plan, at least for a while.

The one card: Radeon HD 7970

We’ve discussed Tahiti’s improvements in key graphics specs versus Cayman, but AMD has another bit of good news in store, too. Onboard the Radeon HD 7970, Tahiti will flip bits at a pretty good clip: 925MHz. That’s up slightly from the highest default clocks for products based on Cypress (the 5870 at 850MHz) and Cayman (the 6970 at 880MHz). The 7970 has the same 5500 MT/s memory speed as its predecessor, so it will rely on 50% more memory channels to provide additional bandwidth.
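To see what those extra memory channels buy, peak bandwidth is just the transfer rate multiplied by the width of the interface in bytes. A minimal sketch, with a function name of our own choosing:

```python
def memory_bandwidth_gbs(transfer_rate_mts: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s: transfers per second times bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9

hd6970 = memory_bandwidth_gbs(5500, 256)  # Cayman's 256-bit interface
hd7970 = memory_bandwidth_gbs(5500, 384)  # Tahiti's 384-bit interface

print(f"6970: {hd6970:.0f} GB/s, 7970: {hd7970:.0f} GB/s, "
      f"ratio {hd7970 / hd6970:.2f}x")
```

With the memory speed held constant, the gain tracks the interface width exactly.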

The 7970’s combination of clock speeds and per-clock throughput gives it the highest theoretical memory bandwidth, texture filtering rate, and shader arithmetic rate of any single-GPU video card. Thus, AMD has taken direct aim at Nvidia’s single-chip flagship, the GeForce GTX 580, by pricing the Radeon HD 7970 at $549. That price will get you a card with 3GB of GDDR5 memory onboard, enough to drive multiple displays at high resolutions, and it undercuts the 3GB versions of the GTX 580, which are selling for just under $600 at Newegg right now. AMD says it is shipping cards into the channel now, and the plan of record is for formal availability to start on January 9th. We wouldn’t be surprised to see cards for sale before the official date, though, if they make it to the right retailers.

At 10.75″, the 7970 matches the length of its two predecessors almost exactly. However, the deletion of one DVI port has opened up additional real estate on the expansion plate cover for venting. This change, along with the use of a somewhat larger blower pushing air across the card’s vapor chamber-based cooler, should improve cooling efficiency and allow for more air movement at lower fan speeds—and thus lower noise levels.

The downsides of this config are all related to the removal of that DVI port. What remains are two mini-DisplayPort outputs, an HDMI port, and one dual-link DVI output. To offset the loss of the second DVI port, AMD is asking board makers to pack two adapters in the box with every 7970: one HDMI-to-DVI cable and one active mini-DP-to-DVI converter. That config should suffice for folks wanting to run a three-way Eyefinity setup on 1080p displays or the like, but I believe neither of those adapters supports dual-link DVI, so folks hoping to drive multiple 30″ monitors via DVI may have to seek another solution.

Incidentally, like the 6970 before it, the 7970 should in theory be able to drive as many as six displays concurrently when its DisplayPort outputs are multiplied via a hub. Unfortunately, the world is still waiting for DisplayPort hub solutions to arrive. AMD tells us it is working with “multiple partners” on enabling such hubs, and it expects some products to arrive next summer.

Although the 7970’s clock speeds are fairly high to start, AMD claims there’s still quite a bit of headroom left in the cards and in their power delivery hardware. The GPUs have the potential to go over 1GHz, with “a good chunk” capable of reaching 1.1GHz or better. The memory chips, too, may be able to reach 6500 MT/s. In addition to giving end users some healthy overclocking headroom, that sort of flexibility could allow AMD’s board partners to build some seriously hopped-up variants of the 7970.

A revised graphics architecture

The biggest change in Tahiti and the rest of the Southern Islands lineup is undoubtedly the shader core, the computational heart of the GPU, where AMD has implemented a fairly major reorganization of the way threads are scheduled and instructions are executed. AMD first revealed partial details of this “Graphics Core Next” at its Fusion Developer Summit last summer, so some information about Tahiti’s shader architecture has been out there for a while. Now that the first products are arriving, we’ve been able to fill in most of the rest of the details.

As we’ve noted, Tahiti doesn’t look like too much of a departure from its Cayman predecessor at a macro level, as in the overall architecture diagram on page one. However, the true difference is in the CU, or compute unit, that is the new fundamental building block of AMD’s graphics machine. These blocks were called SIMD units in prior architectures, but this generation introduces a very different, more scalar scheme for scheduling threads, so the “SIMD” name has been scrapped. That’s probably for the best, because terms like SIMD get thrown around constantly in GPU discussions in ways that often confuse rather than enlighten.

In AMD’s prior architectures, the SIMDs are arrays of 16 execution units, and each of those units is relatively complex, with either four (in Cayman) or five (in Cypress and derivatives) arithmetic logic units, or ALUs, grouped together. These execution units are superscalar—each of the ALUs can accept a different instruction and operate on different data in one clock cycle. Superscalar execution can improve throughput, but it relies on the compiler to manage a problem it creates: none of the instructions being dispatched in a cycle can rely on the output of one of the other instructions in the same group. If the compiler finds dependencies of this type, it may have to leave one or more of the ALUs idle in order to preserve the proper program order and obtain the correct results.

The superscalar nature of AMD’s execution units has been both a blessing and a curse over time. On the plus side, it has allowed AMD to cram a massive number of ALUs and FLOPS into a relatively small die area, since it’s economical in terms of things like chip area dedicated to control logic. The downside is, as we’ve noted, that those execution units cannot always reach full utilization, because the compiler must schedule around dependencies.
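A toy model makes the utilization problem concrete. The sketch below is not AMD’s compiler: it is a simple in-order greedy packer of our own invention that opens a new instruction group whenever the current one is full or the next instruction reads a value produced in the same group. The instruction tuples and register names are hypothetical, and real compilers also reorder instructions, which this sketch does not attempt.

```python
def pack_bundles(instructions, width):
    """Greedy in-order packing: start a new bundle when the current one is
    full or when an instruction reads a value written in the same bundle."""
    bundles, current, written = [], [], set()
    for dest, sources in instructions:
        if len(current) == width or written & set(sources):
            bundles.append(current)
            current, written = [], set()
        current.append(dest)
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles

def utilization(bundles, width):
    """Fraction of ALU slots that hold a real instruction."""
    return sum(len(b) for b in bundles) / (len(bundles) * width)

# Four independent operations, such as one per color channel, fill a bundle.
independent = [("out_r", ["in_r"]), ("out_g", ["in_g"]),
               ("out_b", ["in_b"]), ("out_a", ["in_a"])]
# A dependent chain: each operation needs the previous result.
chain = [("t0", ["x"]), ("t1", ["t0"]), ("t2", ["t1"]), ("t3", ["t2"])]

for name, program in (("independent", independent), ("chain", chain)):
    bundles = pack_bundles(program, 4)
    print(name, bundles, f"utilization {utilization(bundles, 4):.0%}")
```

In this model the four-channel pixel work fills every slot of a four-wide unit, while the serial chain leaves three of every four ALUs idle.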

Folks who know at AMD, including Graphics CTO Eric Demers, have consistently argued that these superscalar execution units have not been a problem for graphics simply because the machine maps well to graphics applications. For instance, DirectX-compliant GPUs typically process pixels in two-by-two blocks known as quads. Each pixel is treated as a thread, and 16-thread groups known as “wavefronts” or (in Nvidia’s lexicon) “warps” are processed together. In an architecture like Cypress, a wavefront could be dispatched to a SIMD array, and each of the 16 execution units would handle a single thread or pixel. As I understand it, then, the four components of a pixel can be handled in parallel across the superscalar ALUs: red, green, blue, and alpha (plus, in the case of Cypress, a special function like a transcendental in that fifth slot). In just one clock cycle, a SIMD array can process an operation for every element of an entire wavefront, with very full utilization of the available ALU resources.

The problems come when moving beyond the realm of traditional graphics workloads, either with GPU computing or simply when attempting to process data that has only a single component, like a depth buffer. Then, the need to avoid dependencies can limit the utilization of those superscalar ALUs, making them much less efficient. This dynamic is one reason Radeon GPUs have had very high theoretical FLOPS peaks but have sometimes had much lower delivered performance.

Logical block diagram of a Tahiti CU. Source: AMD.

In a sense, Tahiti’s compute unit is the same basic “width” as the SIMDs in Cayman and Cypress, capable of processing the equivalent of one wavefront per clock cycle. Beneath the covers, though, many things have changed. The most basic execution units are actually wider than before, 16-wide vector units (also called SIMD-16 in the diagram above), of which there are four. Each CU also has a single scalar unit to assist, along with its own scheduler. The trick here is that those vec16 execution units are scheduled very much like the 16-wide execution units in Nvidia’s GPUs since the G80—in scalar fashion, with each ALU in the unit representing its own “lane.” With graphics workloads, for instance, pixel components would be scheduled sequentially in each lane, with red on one clock cycle, blue on the next, and so on. In the adjacent ALUs on the same vec16 execution unit, the other pixels in that wavefront would be processed at the same time, in the same one-component-per-clock fashion. At the end of four clocks, each vec16 unit will have processed 16 pixels or one wavefront. Since the CU has four of those execution units, it is capable of processing four wavefronts in four clock cycles—as we noted, the equivalent of one wavefront per cycle. Like Cayman, Tahiti can process double-precision floating-point datatypes for compute applications at one quarter the usual rate, which is, ahem, 947 GFLOPS in this case, just shy of a teraflop.
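The arithmetic behind those figures can be sketched in a few lines. The unit counts and clock come from this article; the factor of two FLOPs per lane per clock is our assumption (the usual convention of counting a fused multiply-add as two operations), and the variable names are ours.

```python
CUS = 32                      # compute units in Tahiti
VEC_UNITS_PER_CU = 4          # vec16 execution units per CU
LANES_PER_VEC = 16            # ALU lanes per vec16 unit
CLOCK_GHZ = 0.925             # Radeon HD 7970 core clock
FLOPS_PER_LANE_PER_CLOCK = 2  # assumption: one multiply-add counts as two FLOPs

alus = CUS * VEC_UNITS_PER_CU * LANES_PER_VEC
sp_gflops = alus * FLOPS_PER_LANE_PER_CLOCK * CLOCK_GHZ
dp_gflops = sp_gflops / 4     # double precision runs at one quarter rate

# Lane-based scheduling: each vec16 unit spends four clocks on a wavefront,
# so four units together retire the equivalent of one wavefront per clock.
clocks_per_wavefront = 4
wavefronts_per_clock_per_cu = VEC_UNITS_PER_CU / clocks_per_wavefront

print(f"{alus} ALUs, {sp_gflops:.1f} SP GFLOPS, {dp_gflops:.1f} DP GFLOPS")
print(f"{wavefronts_per_clock_per_cu:.0f} wavefront per clock per CU")
```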

For graphics, the throughput of the new CU may be similar to that of Cypress or Cayman. However, the scalar, lane-based thread scheduling scheme simplifies many things. The compiler no longer has to detect and avoid dependencies, since each thread is executed in an entirely sequential fashion. Register port conflicts are reduced, and GPU performance in non-traditional workloads should be more stable and predictable, reaching closer to those peak FLOPS throughput numbers more consistently. If this list of advantages sounds familiar to you, well, it is the same set of things Nvidia has been saying about its scheduling methods for quite some time. Now that AMD has switched to a similar scheme, the same advantages apply to Tahiti.

That’s not to say the Tahiti architecture isn’t distinctive and, in some ways, superior to Nvidia’s Fermi. One unique feature of the Tahiti CU is its single scalar execution unit. Nvidia’s shader multiprocessors have a special function unit in each SM, and one may be tempted to draw parallels. However, AMD’s David Nalasco tells us Tahiti handles special functions like transcendentals in the vec16 units, at a very nice rate of four ops per clock cycle. The scalar unit is a separate, fully programmable ALU. In case you’re wondering, it’s integer-only, which is why it doesn’t contribute to Tahiti’s theoretical peak FLOPS count. Still, Nalasco says this unit can do useful things for graphics, like calculating a dot product and forwarding the results for use across multiple threads. This unit also assists with flow control and handles address generation for pointers, as part of Tahiti’s support of C++-style data structures for general-purpose computing.

Another place where Tahiti stands out is its rich complement of local storage. The chip has tons of SRAM throughout, in the form of registers (260KB per CU), hardware caches, software-managed caches or “data shares,” and buffers. Each of these structures has its own point of access, which adds up to formidable amounts of total bandwidth across the chip. Also, Tahiti adds a hardware-managed, multi-level read/write cache hierarchy for the first time. There’s a 16KB L1 instruction cache and a 32KB scalar data cache shared across four CUs and backed by the L2 caches. Each CU also has its own L1 texture/data cache, which is fully read/write. Meanwhile, the CU retains the 64KB local data share from prior AMD architectures.

Nvidia has maintained a similar split between hardware- and software-managed caches in its Fermi architecture by allowing the partitioning of local storage into 16KB/48KB of texture cache and shared memory, or vice-versa. Nalasco points out, however, that the separate structures in Tahiti can be accessed independently, with full bandwidth to each.

Tahiti has six L2 cache partitions of 128KB, each associated with one of its dual-channel memory controllers, for a total of 768KB of L2 cache, all read/write. That’s the same amount of L2 cache in Nvidia’s Fermi, although obviously Tahiti’s last-level caches service substantially more ALUs. The addition of robust caching should be a big help for non-graphics applications, and AMD clearly has its eye on that ball. In fact, for the first time, an AMD GPU has gained full ECC protection—not just of external DRAMs like in Cayman, but also of internal storage. All of Tahiti’s SRAMs are single-error correct, double-error detect protected, which means future FirePro products based on this architecture should be vying in earnest for deployment in supercomputing clusters and the like against Nvidia’s Tesla products. Nvidia has a big lead in the software and tools departments with CUDA, but going forward, AMD has the assistance of both Microsoft, via its C++ AMP initiative, and the OpenCL development ecosystem.

How this architecture stacks up

Understanding the basics of an architecture like this one is good, but in order to truly grok the essence of a modern GPU, one must develop a sense of the scale involved when the basic units are replicated many times across the chip. With Tahiti, those numbers can be staggering. Tahiti has 33% more compute units than Cayman has SIMDs (32 versus 24), with a third more peak FLOPS and a third higher texture sampling and filtering capacity, clock for clock.

If you’d like to look at it another way, just four of Tahiti’s CUs would add up to the same pixel-shading capacity as an entire R600, the chip behind the Radeon HD 2900 XT (though Tahiti has much more robust datatype support and a host of related enhancements).

Today’s quad-core Sandy Bridge CPUs have four cores and can track eight threads via simultaneous multi-threading (SMT), but GPUs use threading in order to keep their execution units busy on a much broader scale. Each of Tahiti’s CUs can track up to 40 wavefronts in flight at once. Across 32 CUs, that adds up to 1280 wavefronts or 20,480 threads in flight. Meanwhile, by Demers’ estimates, Tahiti’s L1 caches have an aggregate bandwidth of about 2 TB/s, while the L2s can transfer nearly 710 GB/s at 925MHz.
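Those in-flight totals, and a rough sense of what the quoted cache bandwidths mean per clock, fall out of simple multiplication. In this sketch the wavefront width is kept as a parameter set to the 16-thread grouping used in this article, and the per-clock byte counts are only back-of-the-envelope figures derived from Demers’ estimates, not published specs.

```python
CUS = 32
WAVEFRONTS_PER_CU = 40
THREADS_PER_WAVEFRONT = 16  # grouping used in this article; treat as a parameter
CLOCK_HZ = 925e6

wavefronts_in_flight = CUS * WAVEFRONTS_PER_CU
threads_in_flight = wavefronts_in_flight * THREADS_PER_WAVEFRONT

# Implied bytes moved per clock, working backward from the quoted bandwidths.
l2_bytes_per_clock = 710e9 / CLOCK_HZ
l1_bytes_per_clock_per_cu = 2e12 / CLOCK_HZ / CUS

print(wavefronts_in_flight, threads_in_flight)
print(f"L2: ~{l2_bytes_per_clock:.0f} bytes/clock chip-wide")
print(f"L1: ~{l1_bytes_per_clock_per_cu:.0f} bytes/clock per CU")
```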

| Card | Peak pixel fill rate (Gpixels/s) | Peak bilinear filtering (Gtexels/s) | Peak bilinear FP16 filtering (Gtexels/s) | Peak shader arithmetic (TFLOPS) | Peak rasterization rate (Mtris/s) | Memory bandwidth (GB/s) |
|---|---|---|---|---|---|---|
| GeForce GTX 280 | 19 | 48 | 24 | 0.6 | 602 | 142 |
| GeForce GTX 480 | 34 | 42 | 21 | 1.3 | 2800 | 177 |
| GeForce GTX 580 | 37 | 49 | 49 | 1.6 | 3088 | 192 |
| Radeon HD 5870 | 27 | 68 | 34 | 2.7 | 850 | 154 |
| Radeon HD 6970 | 28 | 85 | 42 | 2.7 | 1760 | 176 |
| Radeon HD 7970 | 30 | 118 | 59 | 3.8 | 1850 | 264 |

In terms of key graphics rates, the Tahiti-driven Radeon HD 7970 eclipses the Cayman-based Radeon HD 6970 and the Fermi-powered GeForce GTX 580 in nearly every respect. The exceptions are the ROP rates and the triangle rasterization rate.
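Each of those peak rates is simply a per-clock figure multiplied by the core clock. The sketch below reproduces the two Radeon rows from the per-clock numbers and clock speeds given earlier in this article; the function and key names are ours.

```python
def peak_rates(clock_mhz, rop_pixels, texels_int, texels_fp16, tris):
    """Theoretical peaks from per-clock throughput and core clock."""
    ghz = clock_mhz / 1000
    return {
        "pixel_fill_gpix": rop_pixels * ghz,
        "bilinear_gtex": texels_int * ghz,
        "fp16_gtex": texels_fp16 * ghz,
        "raster_mtris": tris * clock_mhz,
    }

hd7970 = peak_rates(925, rop_pixels=32, texels_int=128, texels_fp16=64, tris=2)
hd6970 = peak_rates(880, rop_pixels=32, texels_int=96, texels_fp16=48, tris=2)

for name, rates in (("7970", hd7970), ("6970", hd6970)):
    print(name, {key: round(value, 1) for key, value in rates.items()})
```

Because both chips output 32 pixels and rasterize two triangles per clock, the 7970’s edge in those two columns comes entirely from its higher clock speed.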

ROP rates, of course, include the pixel fill rate and, more crucially these days, the amount of blending power for multisampled antialiasing. The 7970 is barely faster than the 6970 on this front because it sports the same basic mix of hardware: eight ROP partitions, each capable of outputting four colored pixels or 16 Z/stencil pixels per clock. Rather than increasing the hardware counts here, AMD decided on a reorganization. In previous designs, two ROP partitions (or render back-ends) were associated with each memory controller, but AMD claims the memory controllers were “oversubscribed” in that setup, leaving the ROPs twiddling their thumbs at times. Tahiti’s ROPs are no longer associated with a specific memory controller. Instead, the chip has a crossbar allowing direct, switched communication between each ROP partition and each memory controller. (The ROPs are not L2 cache clients, incidentally.) With this increased flexibility and the addition of two more memory controllers, AMD claims Tahiti’s ROPs should achieve up to 50% higher utilization and thus efficiency. Higher efficiency is a good thing, but the big question is whether Tahiti’s relatively low maximum ROP rates will be a limiting factor, even if the chip does approach its full potential more frequently. The GeForce GTX 580 still has quite an advantage in max possible throughput over the 7970, 37 to 30 Gpixels/s.

Tahiti’s peak polygon rasterization rates haven’t improved too much on paper, either. It still has dual rasterizers, like Cayman before it. Rather than trying to raise the theoretical max throughput, which is already quite high considering the number of pixels and polygons likely to be onscreen, AMD’s engineers have focused on delivered performance, especially with high degrees of tessellation. Geometry expansion for tessellation can have one input but many outputs, and that can add up to a difficult data flow problem. To address this issue, the parameter caches for Tahiti’s two geometry engines have doubled in size, and those caches can now read from one another, a set of changes Nalasco says amounts to tripling the effective cache size. If those caches become overwhelmed, they can now spill into the chip’s L2 cache, as well. If even that fails, AMD says vaguely that Tahiti is “better” when geometry data must spill into off-chip memory. This collection of tweaks isn’t likely to allow Tahiti to match Fermi’s distributed geometry processing architecture step for step, but we do expect a nice increase over Cayman. That alone should be more than sufficient for everything but a handful of worst-case games that use polygons gratuitously and rather bizarrely.

In addition to everything else, Tahiti has a distinctive new capability called partially resident textures, or in the requisite TLA, PRTs. This feature amounts to hardware acceleration for virtual or streaming textures, a la the “MegaTexture” feature built into id Software’s recent game engines, including the one for Rage. Tahiti supports textures up to 32 terabytes in size, and it will map and filter them. Large textures can be broken down into 64KB tiles and pulled into memory as needed, managed by the hardware.
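For a feel of how tiled, on-demand texturing works, here is a software analogy. It is emphatically not how Tahiti’s hardware or any driver implements PRTs. The 64KB tile size comes from AMD’s description; the 32-bit texel format, the resulting 128-texel tile edge, the `SparseTexture` class, and the `blank_tile` loader are all our own assumptions for illustration.

```python
import math

TILE_BYTES = 64 * 1024   # tile size quoted for Tahiti's PRTs
BYTES_PER_TEXEL = 4      # assumption: uncompressed 32-bit RGBA texels
TILE_DIM = math.isqrt(TILE_BYTES // BYTES_PER_TEXEL)  # texels per tile edge

class SparseTexture:
    """Toy virtual texture: only tiles that have been touched are resident."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        self.resident = {}  # (tile_x, tile_y) -> bytearray holding tile data

    def tile_of(self, x, y):
        return (x // TILE_DIM, y // TILE_DIM)

    def sample(self, x, y, load_tile):
        key = self.tile_of(x, y)
        if key not in self.resident:  # tile miss: pull it in on demand
            self.resident[key] = load_tile(*key)
        tile = self.resident[key]
        offset = ((y % TILE_DIM) * TILE_DIM + (x % TILE_DIM)) * BYTES_PER_TEXEL
        return bytes(tile[offset:offset + BYTES_PER_TEXEL])

    def resident_bytes(self):
        return len(self.resident) * TILE_BYTES

def blank_tile(tile_x, tile_y):
    """Hypothetical loader; a real one would stream tile data from storage."""
    return bytearray(TILE_BYTES)

tex = SparseTexture(16384, 16384)
tex.sample(10, 10, blank_tile)
tex.sample(100, 20, blank_tile)     # lands in the same tile as the first sample
tex.sample(5000, 9000, blank_tile)  # touches a second tile
total_tiles = math.ceil(tex.width / TILE_DIM) * math.ceil(tex.height / TILE_DIM)

print(f"{len(tex.resident)} of {total_tiles} tiles resident, "
      f"{tex.resident_bytes() // 1024}KB in memory")
```

The point of the exercise is the ratio: a texture that would occupy a full gigabyte if fully loaded needs only the handful of tiles actually sampled.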

AMD’s internal demo team has created a nifty animated demonstration of this feature running on Tahiti. The program implements, in real time, a method previously reserved primarily for offline film rendering. In it, textures are mapped on a per-polygon basis, eliminating the need for an intermediate UV map serving as a two-dimensional facsimile of each 3D object. AMD claims this technique solves one of the long-standing problems with tessellation: the cracking and seams that can appear when textures are mapped onto objects of varying complexity.

The firm is hopeful methods like this one can remove one of the long-standing barriers to the wider use of tessellation in future games. Trouble is, Tahiti’s PRT capability isn’t exposed in any current or near-future version of Microsoft’s DirectX API, so it’s unlikely more than a handful of game developers—those who favor OpenGL—will make use of it anytime soon. We’re also left wondering whether the tessellation hardware in AMD’s prior two generations of DirectX 11 GPUs will ever be used as effectively as we once imagined, since they lack PRT support and are subject to the texture mapping problems Tahiti’s new hardware is intended to solve.

ZeroCore power

GPUs continue to become more like CPUs not just in terms of computational capabilities, but also in the way they manage power consumption and heat production. AMD took a nice step forward with Cayman by introducing a power-limiting feature called PowerTune, which is almost the inverse of the Turbo Core capability built into AMD microprocessors. By measuring chip activity, PowerTune estimates likely power consumption and, if needed in specific cases of very high utilization, reduces the GPU’s clock speed and voltage to keep power in check. The cases where PowerTune steps in are relatively rare and are usually caused by synthetic benchmarks or the like, not typical games. Knowing PowerTune is watching, though, allows AMD to set the default clock speeds and voltages for its GPUs higher than it otherwise could. That’s one reason Tahiti is able to operate at a very healthy 925MHz aboard the Radeon HD 7970.
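The general idea of an activity-based power cap can be sketched as a simple loop. To be clear, this is not AMD’s algorithm: the linear power model, every wattage figure, the step size, and the clock floor below are hypothetical values we made up for illustration, and the sketch adjusts only clock speed, not voltage.

```python
def powertune_clock(activity, default_mhz=925, cap_watts=250.0,
                    watts_at_full_activity=300.0, step_mhz=5, floor_mhz=500):
    """Return the highest clock, stepping down from the default, whose
    estimated power fits under the cap. All constants are hypothetical."""
    clock = default_mhz
    while clock > floor_mhz:
        # Toy estimate: power scales linearly with activity and clock.
        estimate = watts_at_full_activity * activity * (clock / default_mhz)
        if estimate <= cap_watts:
            break
        clock -= step_mhz
    return clock

print(powertune_clock(0.7))  # typical load: stays at the default clock
print(powertune_clock(1.0))  # worst-case load: throttled to fit under the cap
```

The payoff of a scheme like this is in the first call: because the worst case is handled by throttling, the default clock can be set for typical workloads rather than for a power virus.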

This new chip takes things a step further by introducing a new GPU state somewhat similar to the “C6” or “deep sleep” feature added to CPUs some years ago. AMD has experience with deploying such tech on the graphics front from the development of its “Llano” CPU-GPU hybrid. Now, with Tahiti, AMD calls the feature ZeroCore, in a sort of play on the whole Turbo Core thing, I suppose. The concept is simple. The Tahiti chip has multiple voltage planes. When the host system sits idle long enough to turn off its display (and invoke power-save mode on the monitor), voltage to the majority of the chip is turned off. Power consumption for the whole video card drops precipitously, down to about three watts, and its cooling fan spins to a halt, no longer needed. A small portion of the GPU remains active, ready to wake up the rest of the chip on demand. AMD says waking a Radeon from its ZeroCore state ought to happen in “milliseconds” and be essentially imperceptible. In our experience, that’s correct. As someone who tends to leave his desktop computer turned on at all times, ready to be accessed via a remote connection or the flick of a mouse, I’m a big fan of this feature.

ZeroCore has even more potential to please users of systems with CrossFire multi-GPU configs. Even during active desktop use where the primary video card is busy, the second (and third, and fourth, if present) video card will drop into ZeroCore mode if not needed. Although we haven’t had a chance to try it yet, we expect this capability will make CrossFire-equipped systems into much better citizens of the average home.

Even when the display isn’t turned off, sitting at a static screen, the 7970 should use less power than the 6970—about 15W versus about 20W, respectively—thanks to several provisions, including putting its DRAM into an idle state. We’ll test all of these power draw improvements shortly, so hang tight.
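For those of us who leave a PC running around the clock, the savings are easy to estimate. The sketch below uses the roughly 15W idle and 3W ZeroCore figures quoted above; the 16 hours per day spent with the display off is purely our assumption, and the function name is ours.

```python
def yearly_kwh_saved(idle_watts, zerocore_watts, hours_off_per_day):
    """Energy saved per year by dropping from idle power to ZeroCore power."""
    return (idle_watts - zerocore_watts) * hours_off_per_day * 365 / 1000

saved = yearly_kwh_saved(idle_watts=15, zerocore_watts=3, hours_off_per_day=16)
print(f"{saved:.1f} kWh per year for one card")
```

The same arithmetic applies to each additional card in a CrossFire setup, though we would want measured numbers before leaning on it.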

Finally, true video encode acceleration in a desktop GPU

Desktop graphics chips have had video decoding engines embedded in them for ages, growing in functionality over time, and Tahiti participates in that trend. Its Universal Video Decoder block adds hardware decode acceleration for two standards: the MPEG-4 format (used by DivX and the like) and the MVC extension to H.264 for stereoscopic content. Also, the UVD block has the related ability to decode dual HD video streams simultaneously.

More exciting is a first for discrete desktop GPUs: a hardware video encoder. Since UVD refers explicitly to decoding, AMD has cooked up a new acronym for the encoder, VCE or Video Codec Engine. Like the QuickSync feature of Intel’s Sandy Bridge processors (and probably the SoC driving the smart phone in your pocket), VCE can encode videos using H.264 compression with full, custom hardware acceleration. We’re talking about hardware purpose-built to encode H.264, not just an encoder that does its calculations on the chip’s shader array. As usual, the main advantages of custom logic are higher performance and lower power consumption. Tahiti’s encode logic looks to be quite nice, with the ability to encode 1080p videos at 60 frames per second, at least twice the rate of the most widely used formats. The VCE hardware supports multiple compression and quality levels, and it can multiplex inputs from various sources for the audio and video tracks to be encoded. Interestingly, the video card’s frame buffer can act as an input source, allowing for a hardware-accelerated HD video capture of a gaming session.

AMD plans to enable a hybrid mode for situations where raw encoding speed is of the essence. In this mode, the VCE block will take care of entropy encoding and the GPU’s shader array will handle the other computational work. On a high-end chip like Tahiti, this mode should be even faster than the fixed encoding mode, with the penalty of higher power draw.

Unfortunately, software applications that support Tahiti’s VCE block aren’t available yet, so we haven’t been able to test its performance. We fully expect support to be forthcoming, though. AMD had reps on hand from both ArcSoft and Sony Creative Software at its press event for the 7970, in a show of support. We’ll have to revisit VCE once we can get our hands on software that uses it properly.

...and even more stuff

Tahiti is the first GPU to support PCI Express 3.0, which uses a combination of higher signaling rates and more efficient encoding to achieve essentially twice the throughput of second-generation PCIe. Right now, the only host systems capable of PCIe 3.0 transfer rates are based on Intel’s Sandy Bridge-E processors and the X79 Express chipset. We don’t expect many tangible graphics performance benefits from higher PCIe throughput, since current systems don’t appear to be particularly bandwidth-limited, even in dual eight-lane multi-GPU configs. In his presentation about Tahiti, Demers downplayed the possibility of graphics performance gains from PCIe 3.0, but did suggest there may be benefits for GPU computing applications.
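The "essentially twice" claim is straightforward to verify. The signaling rates and encoding schemes below come from the PCI Express specifications rather than from AMD’s materials: 5 GT/s with 8b/10b encoding for second-generation PCIe, and 8 GT/s with 128b/130b encoding for the third generation. The function name is ours.

```python
def pcie_lane_gbs(gigatransfers_per_s, payload_bits, total_bits):
    """Usable GB/s per lane, per direction, after line-code overhead."""
    return gigatransfers_per_s * payload_bits / total_bits / 8

gen2 = pcie_lane_gbs(5.0, 8, 10)     # 8b/10b: 20% encoding overhead
gen3 = pcie_lane_gbs(8.0, 128, 130)  # 128b/130b: about 1.5% overhead

print(f"Gen2 x16: {gen2 * 16:.2f} GB/s, Gen3 x16: {gen3 * 16:.2f} GB/s, "
      f"ratio {gen3 / gen2:.2f}x")
```

The signaling rate rises by only 60%, so most of the remaining gain comes from the leaner encoding.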

AMD claims Tahiti is capable of supporting the upcoming DirectX 11.1 standard, a minor incremental bump whose feature list is fairly esoteric but includes provisions for native support of stereoscopic 3D rendering. A future beta driver for Windows 8 will add hooks for DX11.1 support, according to AMD.

As if all of that weren’t enough, the Radeon HD 7970 is hitting the market alongside a gaggle of software upgrades to AMD’s Eyefinity multi-display graphics technology. Collectively, these modifications have been labeled Eyefinity 2.0. Some of the changes are available for older Radeons in current drivers, including tweaks to enable several new display layouts and multi-monitor stereoscopic gaming. Upcoming releases in the first couple months of 2012 will do even more, including the display-geek-nirvana unification: Eyefinity multi-displays, HD3D stereoscopy, and CrossFire multi-GPU should all work together starting with the Catalyst 12.1 driver rev. You’ll either have a truly mind-blowing gaming experience or get an unprecedentedly massive headache from such a setup, no doubt.

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and we’ve reported the median result.
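We report the median rather than the mean because a single stray run then can’t drag the reported figure around. A tiny illustration, using made-up numbers rather than results from this review:

```python
from statistics import mean, median

# Hypothetical frame rates from repeated runs of one test; the third run
# was disturbed by a background task.
runs_fps = [60.7, 61.8, 48.2, 61.1, 60.9]

print(f"median: {median(runs_fps):.1f} FPS, mean: {mean(runs_fps):.1f} FPS")
```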

Our test systems were configured like so:

Processor: Core i7-980X
Motherboard: Gigabyte EX58-UD5
North bridge: X58 IOH
South bridge: ICH10R
Memory size: 12GB (6 DIMMs)
Memory type: Corsair Dominator CMD12GX3M6A1600C8 DDR3 SDRAM at 1333MHz
Memory timings: 9-9-9-24 2T
Chipset drivers: INF update 9.2.0.1030, Rapid Storage Technology 10.8.0.1003
Audio: Integrated ICH10R/ALC889A with Realtek 6.0.1.6482 drivers
Graphics:
Asus Radeon HD 5870 1GB with Catalyst 8.921 drivers
Asus Matrix Radeon HD 5870 2GB with Catalyst 8.921 drivers
Radeon HD 6970 2GB with Catalyst 8.921 drivers
Radeon HD 7970 3GB with Catalyst 8.921 drivers
XFX GeForce GTX 280 1GB with ForceWare 290.36 beta drivers
GeForce GTX 480 1.5GB with ForceWare 290.36 beta drivers
Zotac GeForce GTX 580 1.5GB with ForceWare 290.36 beta drivers
Hard drive: Corsair F240 240GB SATA
Power supply: PC Power & Cooling Silencer 750 Watt
OS: Windows 7 Ultimate x64 Edition Service Pack 1, DirectX 11 June 2010 Update

Thanks to Intel, Corsair, Gigabyte, and PC Power & Cooling for helping to outfit our test rigs with some of the finest hardware available. AMD, Nvidia, and the makers of the various products supplied the graphics cards for testing, as well.

Unless otherwise specified, image quality settings for the graphics cards were left at the control panel defaults. Vertical refresh sync (vsync) was disabled for all tests.

We used the following test applications:

Some further notes on our methods:

We used the Fraps utility to record frame rates while playing a 90-second sequence from the game. Although capturing frame rates while playing isn’t precisely repeatable, we tried to make each run as similar as possible to all of the others. We tested each Fraps sequence five times per video card in order to counteract any variability. We’ve included frame-by-frame results from Fraps for each game, and in those plots, you’re seeing the results from a single, representative pass through the test sequence.

We measured total system power consumption at the wall socket using a Yokogawa WT210 digital power meter. The monitor was plugged into a separate outlet, so its power draw was not part of our measurement. The cards were plugged into a motherboard on an open test bench. The idle measurements were taken at the Windows desktop with the Aero theme enabled. The cards were tested under load running Skyrim at its Ultra quality settings with FXAA enabled.

We measured noise levels on our test system, sitting on an open test bench, using an Extech 407738 digital sound level meter. The meter was mounted on a tripod approximately 10″ from the test system at a height even with the top of the video card. You can think of these noise level measurements much like our system power consumption tests, because the entire system’s noise levels were measured. Of course, noise levels will vary greatly in the real world along with the acoustic properties of the PC enclosure used, whether the enclosure provides adequate cooling to avoid a card’s highest fan speeds, placement of the enclosure in the room, and a whole range of other variables. These results should give a reasonably good picture of comparative fan noise, though.

We used GPU-Z to log GPU temperatures during our load testing.

The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Texture filtering

Card: peak bilinear filtering (Gtexels/s), peak bilinear FP16 filtering (Gtexels/s), memory bandwidth (GB/s)
GeForce GTX 280: 48, 24, 142
GeForce GTX 480: 42, 21, 177
GeForce GTX 580: 49, 49, 192
Radeon HD 5870: 80, 40, 154
Radeon HD 6970: 85, 42, 176
Radeon HD 7970: 118, 59, 264

Now that we’ve talked about the architecture ad nauseam, it’s nice to get into some test results. On paper, Tahiti has massive texture filtering throughput compared to any other current GPU, and in this quick synthetic test, it delivers on that promise quite nicely. The only saving grace for the competition is the GF110’s full-rate FP16 filtering, which allows the GeForce GTX 580 to avoid being completely embarrassed.

Tessellation and geometry throughput

Card: peak rasterization rate (Mtris/s), memory bandwidth (GB/s)
GeForce GTX 280: 602, 142
GeForce GTX 480: 2800, 177
GeForce GTX 580: 3088, 192
Radeon HD 5870: 850, 154
Radeon HD 6970: 1760, 176
Radeon HD 7970: 1850, 264

Given that Tahiti has substantially more buffering for geometry expansion than its predecessors, we’d expected the 7970 to perform better in this test. Instead, it’s not much faster at the “Extreme” tessellation level—and is actually slower at the lower “Normal” setting. Our sense is that this result may be caused by a software quirk, at least in part. TessMark is written in OpenGL, and AMD’s driver support there doesn’t always get the attention the DirectX drivers do.

We do have another option, which is to try a program that can act as a tessellation benchmark via DirectX 11. Unigine Heaven fits the bill by offering gratuitous amounts of tessellation on its “Extreme” setting. The additional polygons don’t really improve image quality in the demo, which is a shame, but they do push the graphics hardware pretty hard, so this demo will serve our need for a synthetic test of geometry throughput.

Now that’s more like it. The 7970 shows major improvement over the past two generations of Radeon graphics hardware, enough to put it at the front of the pack. Now, I’m not convinced Tahiti is outright faster than the GF110 in tessellation throughput. The Heaven demo includes things other than ridiculous numbers of polygons, including lots of pixel shader effects. My sense is that the tessellation hardware on the top few GPUs is simply fast enough that something else, like pixel shader performance, becomes the primary performance limiter. When that happens, Tahiti’s massive shader array kicks in, and the contest is over. The relevant point is that Tahiti’s geometry throughput is sufficiently improved that it’s not an issue, even with an extremely complex tessellation workload like this one.

Putting those new shaders to work

Card: peak shader arithmetic (TFLOPS), memory bandwidth (GB/s)
GeForce GTX 280: 0.6, 142
GeForce GTX 480: 1.3, 177
GeForce GTX 580: 1.6, 192
Radeon HD 5870: 2.7, 154
Radeon HD 6970: 2.7, 176
Radeon HD 7970: 3.8, 264

The first couple of tests above, the cloth and particles simulations, primarily use vertex and geometry shaders to do their work. In those tests, the 7970 easily outperforms the 6970, but it’s not quite as fast as the two Fermi-based GeForces. As we’ve noted, vertex processing remains a strength of Nvidia’s architecture.

Boy, things turn around in a hurry once we move into the last three tests, which rely on pixel shader throughput. True to form, AMD’s older GPUs tend to outrun the GeForces in these tests, since they’re quite efficient with pixel-centric workloads. Even so, Tahiti is substantially faster. In a couple of cases, the 7970 delivers on its potential to crank out over twice the FLOPS of the GeForce GTX 580.
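The "over twice the FLOPS" figure can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below takes the 2048 ALUs quoted in the introduction and assumes a 925MHz core clock and two floating-point operations per ALU per clock (one fused multiply-add). Neither assumption is stated in this section, so treat the result as an estimate rather than an official spec.

```python
def peak_tflops(alus, clock_ghz, flops_per_alu_per_clock=2):
    """Theoretical peak single-precision throughput in TFLOPS."""
    return alus * flops_per_alu_per_clock * clock_ghz / 1000.0

# 2048 ALUs is from the article; 0.925 GHz and 2 FLOPs/clock are assumptions.
hd7970_peak = peak_tflops(2048, 0.925)

# Peak figures for the 7970 and GTX 580 as listed in the table above.
ratio_vs_gtx580 = 3.8 / 1.6

print(f"Estimated 7970 peak: {hd7970_peak:.2f} TFLOPS")
print(f"7970 / GTX 580 peak ratio: {ratio_vs_gtx580:.2f}x")
```

Under those assumptions the estimate rounds to the 3.8 TFLOPS shown in the table, and the on-paper ratio against the GTX 580 comes out a bit above 2x.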

GPU computing performance

These results are instructive. When we move from pixel shaders into DirectCompute performance, the Fermi-based GeForces recapture the lead from the Cypress- and Cayman-based Radeons. The Radeons have much higher theoretical FLOPS peaks, but the GeForces tend to be more efficient here. Tahiti, though, changes the dynamic. The Radeon HD 7970 outruns the GTX 580 and is nearly 50% faster than the Cypress-based Radeon HD 5870.

LuxMark is a ray-traced rendering test that uses OpenCL to harness any compatible processor to do its work. As you can see, we’ve even included the Core i7-980X CPU in our test system as a point of comparison. Obviously, though, the 7970 is the star of this show. The newest Radeon nearly doubles the throughput of its elder siblings—and nearly triples the performance of the Fermi-based GeForces. We’ve only run a couple of GPU computing tests, so our results aren’t the last word on the matter, but Tahiti may be the best GPU computing engine out there. AMD appears to have combined two very desirable traits in this chip’s shader array: much higher utilization (and thus efficiency) than previous DX11-class Radeons, and gobs of FLOPS in the given chip area.

The Elder Scrolls V: Skyrim

Our test run for Skyrim was a lap around the town of Whiterun, starting up high at the castle entrance, descending down the stairs into the main part of town, and then doing a figure-eight around the main drag.

Since these are pretty capable graphics cards, we set the game to its “Ultra” presets, which turns on 8X multisampled antialiasing. We then layered on FXAA post-process anti-aliasing, as well, for the best possible image quality without editing an .ini file.

The plots above show the time required to render the individual frames produced during our 90-second test run. If you’re unfamiliar with our fancy new testing methods, let me direct you to this article, which explains what we’re doing. In a nutshell, our goal is to measure graphics performance in a way that more fully quantifies the quality of the gaming experience—the smoothness of the animation and the ability of the graphics card to avoid momentary pauses or periods of poor performance.

Because Skyrim is a DirectX 9 game, it’s one of the few places where our representative of older GPU generations, the GeForce GTX 280, is able to participate fully. However, as you can see, the GTX 280 is slow enough to have earned its own plot, separate from the other GeForces. Our decision to test at 2560×1600 with 8X AA and 16X aniso has laid low this geezer of a GeForce; its 1GB of RAM isn’t sufficient for this task, which is why it’s churning out frame times as high as 100 ms. We had the same video memory problem with our Radeon HD 5870 1GB card, so we swapped in a 2GB card from Asus to work around it.

You can tell just by looking at the plots that the Radeon HD 7970 performs well here; it produces more frames than anything else, and not a single frame time stretches over the 40 ms mark.

The fact that the 7970 produces the most frames in the plots should be a dead giveaway that it would have the highest average frame rate. The newest Radeon reigns supreme in this most traditional measure of performance.

This number is about frame latencies, so it’s a little different than the FPS average. This result simply says “99% of all frames produced were created in less than x milliseconds.” We’re ruling out the last one percent of outliers in order to get a general sense of frame times, which will determine how smoothly the game plays.
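For readers who want to compute a similar figure from their own Fraps logs, here is a minimal sketch using the nearest-rank method. The article doesn't say exactly how the percentile was calculated, so the method is an assumption, and the frame times are invented.

```python
def percentile_frame_time(frame_times_ms, pct=99):
    """Nearest-rank percentile: the frame time that pct% of frames do not exceed."""
    ordered = sorted(frame_times_ms)
    # Integer ceiling of pct * n / 100, to sidestep floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

# Made-up data: 980 quick frames plus 20 slow ones (2% of the total).
frame_times = [16.7] * 980 + [45.0] * 20

p99 = percentile_frame_time(frame_times)
avg = sum(frame_times) / len(frame_times)

print(f"99th percentile: {p99} ms, average: {avg:.1f} ms")
```

With this data the average frame time still looks healthy, while the 99th percentile lands on the slow frames, which is the sort of thing an FPS average tends to hide.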

I’ll admit, I had to stare at the frame time plots above for a little while in order to understand why those two GeForces would have a lower 99th percentile frame latency than the Radeon HD 7970, which looks so good. The culprit, I think, is those first 150 or so frames where all of the cards are slowest. That section of the test run comprises more than 1% of the frames for each card, and in it, the GeForces deliver somewhat lower frame latencies.

Now, a difference of two milliseconds is nearly nothing, but those opening moments are the only place where the fastest cards struggle, and the GeForces are ever so slightly quicker there. I do think some focus on the pain points for gaming performance is appropriate. What we seem to be finding over time is that viewing graphics as a latency-sensitive subsystem is a great equalizer. To give you a sense of what this result means, note that a score between 33 and 37 milliseconds translates to momentary frame rates between 27 and 30 FPS. For the vast majority of the time, then, all of these cards are churning out frames quickly enough to maintain relatively smooth motion, especially for an RPG like this one that doesn’t rely on quick-twitch reactions.
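The conversion between frame times and momentary frame rates is just a reciprocal. Here is a quick sketch, with an illustrative helper name:

```python
def frame_time_to_fps(frame_time_ms):
    """Momentary frame rate implied by a single frame's render time."""
    return 1000.0 / frame_time_ms

for ms in (33, 37):
    print(f"{ms} ms -> {frame_time_to_fps(ms):.1f} FPS")
```

That works out to roughly 30 FPS at 33 ms and 27 FPS at 37 ms, matching the range quoted above.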

Our next goal is to find out about worst-case scenarios—places where the GPU’s performance limitations may be contributing to less-than-fluid animation, occasional stuttering, or worse. For that, we add up all of the time each GPU spends working on really long frame times, those above 50 milliseconds or (put another way) below about 20 FPS. We’ve explained our rationale behind this one in more detail right here, if you’re curious or just confused.
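Here is one way such a tally might be computed. This sketch assumes that only the portion of each frame time past the 50 ms threshold counts toward the total, which is one reading of the description above. The frame times are invented.

```python
def time_beyond_threshold_ms(frame_times_ms, threshold_ms=50.0):
    """Total milliseconds spent past the threshold across all long frames."""
    return sum(t - threshold_ms for t in frame_times_ms if t > threshold_ms)

# Made-up frame times in milliseconds, not measured data.
frame_times = [20.0, 35.0, 62.0, 18.0, 110.0, 49.0]

# Only the 62 ms and 110 ms frames cross the line: 12 ms + 60 ms.
print(time_beyond_threshold_ms(frame_times))  # 72.0
```

Counting only the excess means a frame that barely crosses 50 ms adds almost nothing, while a single 110 ms hitch adds a lot, so the total tracks how bad the slowdowns feel rather than just how many there were.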

In this case, our results are crystal clear. Only the GeForce GTX 280, which doesn’t have enough onboard video RAM to handle the game at these settings, struggles at all with avoiding major slowdowns in Skyrim. We’ve noted in the past that Skyrim performance appears to be more CPU limited than anything else. Don’t worry, though. We’ll be putting these GPUs through the wringer shortly.

Batman: Arkham City

We did a little Batman-style free running through the rooftops of Gotham for this one.

Several factors converged to make us choose these settings. One of our goals in preparing this article was to avoid the crazy scenario we had in our GeForce GTX 560 Ti 448 review, where every card tested could run nearly every game adequately. The Radeon HD 7970 is a pretty pricey bit of hardware, and we wanted to push it to its limits, not watch it tie a bunch of other cards for adequacy. So we cranked up the resolution and image quality and, yes, even enabled DirectX 11. We had previously avoided using DX11 with this game because the initial release had serious performance problems on pretty much any video card. A patch has since eliminated the worst problems, and the game is now playable in DX11, so we enabled it.

This choice made sense for benchmarking ultra high-end graphics cards, I think. I have to say, though, that the increase in image quality with DX11 tessellation, soft shadows, and ambient occlusion isn’t really worth the performance penalty you’ll pay. The image quality differences are hard to see; the performance differences are abundantly obvious. This game looks great and runs very smoothly at 2560×1600 in DX9 mode, even on a $250 graphics card.

As you can see, all of the cards produce some long frame times; the frame time plots are more jagged than in Skyrim. This will make for an interesting comparison. Also, it’s pretty clear the Radeon HD 5870 is overmatched here, even with 2GB of video RAM onboard.

We’ve found that average FPS and 99th percentile frame times don’t always track together, especially when there are wide swings in frame times involved, like we have here. However, in this case, they mirror each other pretty closely. All of the cards seem to have some long frame times in relatively proportional measure. Thus, in both FPS and 99th percentile latency, the Radeon HD 7970 manages to outperform the GeForce GTX 580 by a small margin.

The 7970’s slight edge holds when we turn our attention toward longer-latency frames. The new Radeon is the only card of the bunch to spend less than half a second working on rendering frames beyond 50 ms. The GTX 580 isn’t far behind, though.

Battlefield 3

We tested Battlefield 3 with all of its DX11 goodness cranked up, including the “Ultra” quality settings with both 4X MSAA and the high-quality version of the post-process FXAA. We tested in the “Operation Guillotine” level, for 60 seconds starting at the third checkpoint.

Yes, at these settings, we’re pushing these cards very hard. We very much wanted to avoid a situation where the GPUs weren’t really challenged. I think we succeeded there, although we may have overshot.

Nevertheless, the Radeon HD 7970 comes out of this contest looking very good indeed, with a clear lead over the GeForce GTX 580 in both average FPS and 99th percentile frame times. That’s true even though we didn’t encounter any of the big frame time spikes that we have on other levels of this game with GeForce cards.

In fact, even with the relatively low average frame rates we saw, this stretch of BF3 runs quite well, with pretty even frame times throughout, especially on the three fastest cards. As a result, even the GeForce GTX 480, which averaged 27 FPS, avoids long frame times very effectively—and is thus quite playable.

The 7970 is easily the best solution here, though, both subjectively and in every way we’ve measured performance.

Crysis 2

Our cavalcade of punishing but pretty DirectX 11 games continues with Crysis 2, which we patched with both the DX11 and high-res texture updates.

Notice that we left object image quality at “extreme” rather than “ultra,” in order to avoid the insane over-tessellation of flat surfaces that somehow found its way into the DX11 patch. We tested 90 seconds of gameplay in which we tracked down an alien, killed him, and harvested his DNA. Cruel, yes, but satisfying.

You can tell the Radeon HD 7970 is relatively fast here from the frame time plots. The 7970 generates more frames than any other card and generally has lower frame latencies. However, the 7970’s frame time plot has quite a few spikes in it compared to the GeForces. That results in a dead heat between the 7970 and the GTX 580 in 99th percentile frame times.

Those frame time spikes cause the 7970 to spend more time processing frames beyond 50 ms than the GTX 580 does, as well. However, a total of 51 ms in this category isn’t bad.

Civilization V

We’ll round out our punishment of these GPUs with one more DX11-capable game. Rather than get all freaky with the Fraps captures and frame times, we simply used the scripted benchmark that comes with Civilization V.

GeForces have long been at the top of the performance charts in this game, and we’ve suspected that the reason was geometry throughput. The terrain is tessellated in this game, and there are zillions of tiny, animated units all over the screen. Even so, the 7970 grabs the top spot with room to spare. That’s solid progress for the Radeon camp.

Power consumption

The first two graphs above give us a look at the 7970’s ZeroCore feature at work. The 7970 system’s total power draw drops by 17W when the display goes into power save, almost entirely courtesy of the 7970’s new low-power state for long idle. (The display’s power consumption is not part of our measurement.)

Overall, the 7970’s power consumption picture is quite nice. Even when idling at the Windows desktop, the newest Radeon shaves off about 5W of system-wide power consumption—more than that compared to the GeForces. When running Skyrim, the 7970 system draws a gobsmacking 80W less than the otherwise-identical GTX 580 rig. It seems the 28-nm process at TSMC is coming along quite nicely, doesn’t it?

Noise levels and GPU temperatures

When ZeroCore kicks in and the 7970’s fan stops spinning, we hit the noise floor for the rest of the components in our test system, mainly the PSU and the CPU cooler. By itself, without its fan spinning, the 7970 is pretty much silent.

I had hoped the bigger blower and larger exhaust venting area would make the 7970 quieter than the competition, especially since its power draw is relatively low overall. The 7970 is fairly quiet during active idle, but its fan ramps up quite a bit when running a game. Judging by the temperatures we measured, it appears AMD has biased its cooling policy toward restraining GPU temperatures rather than noise levels. I’d prefer a somewhat quieter card that runs a little hotter, personally. Still, nothing about the 7970’s acoustic profile is terribly offensive; it’s just not quite as nice as the GTX 580’s, which is a surprise given the gap in power consumption between the two. It’s a shame AMD didn’t capitalize on the chance to win solidly in this category. Perhaps the various Radeon board makers can remedy the situation.

Conclusions

For several generations now, whenever a new Radeon GPU was making its debut, I have bugged AMD Graphics CTO Eric Demers about whatever features were missing compared to the competition. There have always been feature deficits, whether it be graphics-oriented capabilities like coverage sampled antialiasing and faster geometry processing or compute-focused capabilities like better scheduling, caching, and ECC protection. Each time, Demers has answered my questions about what’s missing with quiet confidence, preaching the gospel of making the correct tradeoffs in each successive generation of products without compromising on architectural efficiency or time-to-market.

That confidence has seemed increasingly well founded as the years have progressed, in part because we often seem to be comparing what AMD is doing right now to what Nvidia will presumably be doing later. After all, AMD has been first to market with revamped GPU architectures based on new process tech for quite a few generations in a row. It hasn’t hurt that, since the introduction of the Radeon HD 4800 series, AMD has been at least competitive with Nvidia’s flagship chips in contemporary games, if not outright faster, while building substantially smaller, more efficient GPUs. Meanwhile, the firm has steadily ratcheted up the graphics- and compute-focused features in its new chips, gaining ground on Nvidia seemingly every step of the way.

With Tahiti and the Radeon HD 7970, AMD appears to have reached a very nice destination. In graphics terms, the Radeon HD 6970 and Cayman had very nearly achieved feature parity with the GF110. Tahiti moves AMD a few steps ahead on that front, though the changes aren’t major. The biggest news there may be the improvements to tessellation performance. Tahiti may not have caught up to Nvidia entirely in geometry throughput, but it’s fast enough now that no one is likely to notice the difference in any way that matters.

The more consequential changes in this GPU are primarily compute-related features, including caching, C++ support, ECC protection, and the revamped shader array. AMD has dedicated substantial space on this chip to things like SRAM and ECC support, and Tahiti looks poised to take on Nvidia in the nascent market for GPUs in the data center as a result. Nvidia has one heckuva head start in many ways, but AMD can make its case on several fronts, including comparable feature sets, superior power efficiency, and more delivered FLOPS.

Radeon HD 7970 January 2012