AMD Radeon R9 Fury X Review, Benchmark, & Architecture Drill-Down vs. GTX 980 Ti

The Fury X has been a challenging video card to review. This is AMD's best attempt at competition and, as it so happens, the card includes two items of critical importance: a new GPU architecture and the world's first implementation of high-bandwidth memory. Some system builders may recall AMD's HD 4870, a video card that was once a readily recommended solution for mid-to-high range builds. The 4870 was the world's first graphics card to incorporate high-speed GDDR5 memory, reinforcing AMD's record of early moves in the memory field. Prior to the AMD acquisition, graphics manufacturer ATI designed the GDDR3 memory that remained in use all the way through to GDDR5 (GDDR4 had a lifecycle of less than a year, more or less, but was also first used on ATI devices).

The Fury X is AMD's flag-bearer for a new form of high-speed memory that, regardless of how the Fury X performs, will inevitably become the future of graphics memory for both major manufacturers. In this respect, the card has been difficult to review, as fully understanding the implications of HBM required a substantial investment in research. The architecture is somewhat analogous to the Tonga GPU found in the R9 285, but occupies the place in the product stack where Hawaii (found on the R9 290X) once rested.

This Fury X review will explain the architecture and HBM, then dig into CrossFire performance, frametimes, the thermal envelope and CLC, and overclocking potential. Direct comparisons pitting the AMD Radeon R9 Fury X vs. the GTX 980 Ti will be made, including coverage of EVGA's GTX 980 Ti Hybrid, a liquid-cooled competitor at the high-end.

AMD R9 Fury X Specs

AMD R9 Fury X
Fab Process: 28nm
Stream Processors: 4096
Base Clock (GPU): 1050MHz
COMPUTE: 8.6 TFLOPs
TMUs: 256
Texture Fill-Rate: 268.8GT/s
ROPs: 64
Pixel Fill-Rate: 67.2GP/s
Z/Stencil: 256
Memory Config: 4GB HBM
Memory Interface: 4096-bit
Memory Speed: 500MHz / 1Gbps
Power: 2x8-pin, 275W TDP
Others: PCI-e 3.0; Dx12, Vulkan, Mantle support
Price: $650
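The throughput figures in the spec list follow from the unit counts and the clock. The sketch below reproduces them, assuming the usual peak-rate conventions: two FLOPs (one fused multiply-add) per stream processor per clock, and one texel or pixel per TMU or ROP per clock. The function names are ours, for illustration only.

```python
# Inputs taken from the spec list above; the helper names are hypothetical.
STREAM_PROCESSORS = 4096
TMUS = 256
ROPS = 64
CLOCK_MHZ = 1050


def peak_tflops(shaders: int, clock_mhz: float, flops_per_clock: int = 2) -> float:
    """Peak single-precision TFLOPs, assuming 2 FLOPs (one FMA) per shader per clock."""
    return shaders * flops_per_clock * clock_mhz * 1e6 / 1e12


def peak_fill_rate(units: int, clock_mhz: float) -> float:
    """Peak fill rate in giga-operations per second, assuming one op per unit per clock."""
    return units * clock_mhz / 1000.0


print(f"COMPUTE:           {peak_tflops(STREAM_PROCESSORS, CLOCK_MHZ):.1f} TFLOPs")
print(f"Texture fill-rate: {peak_fill_rate(TMUS, CLOCK_MHZ):.1f} GT/s")
print(f"Pixel fill-rate:   {peak_fill_rate(ROPS, CLOCK_MHZ):.1f} GP/s")
```

These are theoretical peaks; delivered performance depends on how well a given workload keeps the units fed.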

Up to Speed: Recapping 300 Series, Pump Whine, & Drivers

AMD has been busy over the past few weeks. The company hadn't released a new round of GPUs in around two years (excluding the R9 285, which launched late last year), and AMD has remedied this with back-to-back graphics card launches. The first was the Radeon 300 series, of which we reviewed the R9 390 and R9 380.

To bring everyone up to speed, the 300 series cards are all refreshes (not quite a hard “rebadge”) of their 200 series counterparts. The R9 285 introduced “Tonga” with a few years' worth of architectural tuning, but did everything short of implementing a fresh architecture. These improvements were distributed across the board on the 300 series. Other upgrades to the 300 series cards include a tuned power envelope that marginally lowered TDP, a ~50MHz clockrate boost, and the introduction of a few software-side features. Our general conclusion was that owners of 200 series cards would have no reason to “side-grade” to the 300 series, and that some 200 series prices (like a $270 R9 290X, now expired) made the previous line more desirable than the 300 series. These prices haven't remained as favorable as we've moved further from launch, but that was the first impression.

The 300 series was largely uninspiring in this regard.

Then came Fury, finally. Our initial analysis found a high-frequency pump whine emitted by both of our retail cards, a finding supported by numerous other outlets reporting the same issue. Our conclusion was that the whine was a non-issue for users building in an enclosure, though open-air use was discouraged given how irritating the whine is. More on this later.

The question of drivers lasted only for a brief period. Online discussion was spurred by the disparity between AMD's initial benchmarks – showing huge gains over the 980 Ti – and reviews, which often showed the 980 Ti winning. A few users pointed toward drivers, thinking that the B8 and B9 press drivers were underperformers against the official 15.15.1004 launch drivers. We tested all three, finding zero difference.

AMD's New Fiji Architecture

AMD's Fiji architecture is potentially the last AMD candidate on the 28nm process. AMD has been producing 28nm chips since 2011 (nVidia is in a similar boat) and TSMC's process is almost fully mature at this point. The Fiji die has met the reticle limit imposed by lithographic lenses in stepper / scanner systems; it is not possible to go “bigger” on the die size without moving to a new fabrication process.

This “Big GPU” approach is one that both AMD and nVidia have taken for their high-end chips, with the GM200 sizing up similarly: the Fury X's Fiji GPU is 596mm^2, only slightly smaller than the massive GM200's 601mm^2. Building a larger die makes room for more transistors and, as a bonus, helps spread heat over a larger surface area.

Fiji hosts an updated version of AMD's Graphics Core Next (GCN) architecture. The Fiji version of GCN (1.2) adds a new level of delta color compression that offers a 40% bandwidth efficiency increase – not dissimilar from Maxwell's own strides with delta color compression – and can compress tiles at a ratio of 8:1. Delta color compression is something we've described in the past. The below graphics are from an nVidia deck, but describe the technology well and apply to AMD as much as they do to nVidia:

For those who didn't see this in our GTX 980 review, the top-level is fairly simple. Delta color compression analyzes pixel color temporally, frame-to-frame, and calculates a delta value output when drawing the pixel's new color. To this end, the GPU modifies color values only when there is a change, and even then, it further reduces workload by functioning on delta values rather than absolute values.

In the above samples, the first frame (n) has been drawn by the GPU and rendered. Frame n+1 – that is, the next frame – doesn't need to redraw all of these absolute color values for every pixel on the screen. Instead, frame n+1 draws only the delta values (the pink highlight helps show color change temporally). This reduces bandwidth saturation.
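The arithmetic behind "delta values rather than absolute values" can be shown with a toy example. This is only a sketch of delta encoding in general, not the hardware's actual compression scheme; the pixel values, the 8-bit channel assumption, and the function names are all made up for illustration.

```python
def delta_encode(reference, current):
    """Per-value differences between current and reference color values."""
    return [c - r for r, c in zip(reference, current)]


def delta_decode(reference, deltas):
    """Rebuild the current values from the reference plus the stored deltas."""
    return [r + d for r, d in zip(reference, deltas)]


def bits_needed(values):
    """Conservative signed bit width for the largest-magnitude value (0 if all zero)."""
    peak = max((abs(v) for v in values), default=0)
    return 0 if peak == 0 else peak.bit_length() + 1


# Illustrative 8-bit channel values only; not captured from a real frame.
reference = [200, 201, 203, 203, 204, 210, 210, 211]
current = [200, 201, 204, 203, 204, 212, 210, 211]

deltas = delta_encode(reference, current)
assert delta_decode(reference, deltas) == current  # the round trip is lossless

print("deltas:", deltas)
print("bits per delta:", bits_needed(deltas), "vs. 8 bits per absolute value")
```

When most values barely change relative to the reference, the deltas fit in far fewer bits than the absolute colors, which is where the bandwidth saving comes from.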

Other GCN updates further what was already done on the 300 series, primarily including improved power efficiency. This is done in several ways on the Fury X, not the least of which is its implementation of a CLC – but more on that in a moment. Reducing the power envelope was made possible by gating power provided to internal components of the GPU. The most noteworthy is the shader array, which throttles when not in demand: If the GPU detects that shader units are unloaded, it will modulate power assignment to those shader units (effectively cutting off the power) to reduce maximum power consumption. The shift to HBM inherently grants further power efficiency gains. Fiji also saw the introduction of APU frequency scaling algorithms, which throttle-down the clockrate when high speeds are unnecessary. This is similar to nVidia and Intel throttling chips and “boosting” when under load. AMD's dynamic voltage tech on APUs was also migrated to Fiji, modulating voltage levels dependent upon load. Both of these assist in power reduction.

The Fury X's next major improvement is its tessellation throughput, which uses small instance caching to improve draw call performance (somewhat of a specialty for AMD) by caching repeat geometry local to the GPU. As with any caching system, using the GPU cache eliminates the need to traverse the memory interface to access the on-card VRAM, resulting in a significant speed improvement for cached assets. Vertex data is also cached more extensively than on Hawaii and Tonga's iterations of GCN.

AMD has historically suffered from poor tessellation and processing of complex geometric objects, but has long offered a strength in draw call performance that has been restricted by APIs. Fiji's architecture helps improve tessellation performance substantially with lower taps (x8, x16) over Hawaii and Tonga.

Fiji's SIMD uses a 64-wide wavefront, providing a large number of lanes for pixel and vertex shaders, depending on the width of each. This grants GCN some versatility and theoretical advantages for multi-purpose use over Maxwell's gaming-targeted 32-wide warp (nVidia's name for a collection of threads). This is part of why we see COMPUTE-bound users on nVidia often favoring the Kepler architecture over Maxwell. For AMD, algorithm-level tuning of SIMD lane sharing will grant advantages in OpenCL for data structures, but doesn't do much for gamers. This contributes to the AMD-nVidia performance disparity between COMPUTE and gaming applications.
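One way to see the granularity difference between a 64-wide wavefront and a 32-wide warp is to count how many groups a batch of work-items needs and how many lanes sit idle in the last group. The sketch below is a simplification: it ignores scheduling, divergence, and occupancy, and the function name and work-item counts are our own illustration.

```python
import math


def lane_utilization(work_items: int, width: int):
    """Return (groups issued, idle lanes in the final group, utilization fraction)."""
    groups = math.ceil(work_items / width)
    idle = groups * width - work_items
    return groups, idle, work_items / (groups * width)


for items in (70, 4096):
    for name, width in (("64-wide wavefront", 64), ("32-wide warp", 32)):
        groups, idle, util = lane_utilization(items, width)
        print(f"{items:>5} work-items, {name}: {groups} groups, "
              f"{idle} idle lanes, {util:.0%} utilized")
```

Large, regular workloads fill either width completely; small or irregular batches leave more lanes idle on the wider group.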

We'll talk about whether Fiji is more ROPs- or triangle-bound at higher resolutions once we get to the benchmarks.

High-Bandwidth Memory is the big “thing” for AMD's Fury X. The first implementation is limited to 4GB capacity, but exhibits unbelievably high throughput on an unprecedented 4096-bit wide memory interface. We'll talk more about how HBM works in an in-depth article soon.
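To put the 4096-bit interface in numbers, peak bandwidth is the bus width multiplied by the per-pin data rate. The sketch below uses the Fury X figures from the spec list (500MHz, double data rate, so 1Gbps per pin). The GTX 980 Ti comparison figures (384-bit, 7Gbps GDDR5) are not from this article and are included only as a point of reference.

```python
def peak_bandwidth_gbs(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s: bus width (bits) x per-pin rate (Gbps) / 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8


# Fury X: 500MHz memory clock, two transfers per clock -> 1Gbps per pin.
fury_x = peak_bandwidth_gbs(4096, 500 * 2 / 1000)

# GTX 980 Ti GDDR5 figures are an outside reference point, not from this review.
gtx_980_ti = peak_bandwidth_gbs(384, 7.0)

print(f"Fury X (HBM):       {fury_x:.0f} GB/s")
print(f"GTX 980 Ti (GDDR5): {gtx_980_ti:.0f} GB/s")
```

HBM gets there with a very wide, slow bus, while GDDR5 relies on a narrow, fast one.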

Testing Methodology

We tested using our updated 2015 Multi-GPU test bench, detailed in the table below. Our thanks to supporting hardware vendors for supplying some of the test components.

The latest AMD Catalyst drivers (15.7) and nVidia's 353.3 drivers were used for testing. Game settings were manually controlled for the DUT. All games were run at 'ultra' presets, with the exception of The Witcher 3, where we disabled HairWorks completely, disabled AA, and left SSAO on. GRID: Autosport saw custom settings with all lighting enabled. GTA V used two types of settings: those with Advanced Graphics ("AG") on and those with them off, acting as a VRAM stress test.

Each game was tested for 30 seconds in an identical scenario, then repeated three times for parity.

Average FPS, 1% low, and 0.1% low times are measured. We do not measure maximum or minimum FPS results as we consider these numbers to be pure outliers. Instead, we take an average of the lowest 1% of results (1% low) to show real-world, noticeable dips; we then take an average of the lowest 0.1% of results for severe spikes. Anti-Aliasing was disabled in all tests except GRID: Autosport, which looks significantly better with its default 4xMSAA. HairWorks was disabled where prevalent. Manufacturer-specific technologies were used when present (CHS, PCSS).
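As a rough illustration of these metrics, the sketch below computes average FPS, 1% low, and 0.1% low from a list of per-frame times. It is our own minimal interpretation of the description above, not the exact in-house tooling: the conversion of each frametime to an instantaneous FPS value, the rounding of the sample count, and the synthetic frametimes are all assumptions.

```python
def fps_summary(frametimes_ms):
    """Average FPS plus the mean of the slowest 1% and 0.1% of per-frame FPS values."""
    per_frame_fps = sorted(1000.0 / t for t in frametimes_ms)  # slowest frames first

    def low(fraction):
        count = max(1, round(len(per_frame_fps) * fraction))
        return sum(per_frame_fps[:count]) / count

    average = len(frametimes_ms) / (sum(frametimes_ms) / 1000.0)
    return {"avg": average, "low_1pct": low(0.01), "low_0.1pct": low(0.001)}


# Synthetic capture: mostly 10ms frames, a few 20ms dips, one 50ms spike.
sample = [10.0] * 990 + [20.0] * 9 + [50.0]
for name, value in fps_summary(sample).items():
    print(f"{name}: {value:.1f} FPS")
```

In this synthetic capture the average barely registers the spike, while the 1% and 0.1% lows expose it, which is the point of reporting them.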

Overclocking was performed incrementally using MSI Afterburner and AMD's OverDrive. Parity of overclocks was checked using GPU-Z. Overclocks were applied and tested for five minutes at a time and, if the test passed, would be incremented to the next step. Once a failure was provoked or instability found -- either through flickering / artifacts or through a driver failure -- we stepped-down the OC and ran a 30-minute endurance test using 3DMark's FireStrike Extreme on loop (GFX test 2).
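The stepping procedure can be sketched as a simple loop. The two callables below are hypothetical stand-ins for the five-minute test and the 30-minute endurance run; stepping down again after a failed endurance run is our assumption, since the text does not say what happens in that case. The clock thresholds in the example are made up and are not measured results.

```python
def find_stable_clock(base_mhz, step_mhz, limit_mhz, short_test, endurance_test):
    """Step the clock up while each short test passes, then back off until endurance passes."""
    clock = base_mhz
    # Only adopt the next step if it passes, so a failing step is never kept.
    while clock + step_mhz <= limit_mhz and short_test(clock + step_mhz):
        clock += step_mhz
    # Assumed behavior: keep stepping down if the endurance run fails.
    while clock > base_mhz and not endurance_test(clock):
        clock -= step_mhz
    return clock


# Made-up pass/fail thresholds standing in for real test runs.
passes_short = lambda mhz: mhz <= 1130
passes_endurance = lambda mhz: mhz <= 1120

print(find_stable_clock(1050, 10, 1200, passes_short, passes_endurance), "MHz")
```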

Thermals and power draw were both measured using our secondary test bench, which we reserve for this purpose. The bench uses the below components. Thermals are measured using AIDA64. We execute an in-house automated script to ensure identical start and end times for the test. 3DMark FireStrike Extreme is executed on loop for 25 minutes and logged. Parity is checked with GPU-Z.

Acoustics Testing

Acoustics testing requires stricter environment controls than any other tests we perform. All ambient sources of noise must be removed from the environment. For purposes of this testing, we powered-down all devices not under test in the room, turned off AC, and disconnected bench fans that were not mission-critical. The only fans left enabled were for the radiators and video cards (where applicable). Solid-state drives are used exclusively on our bench to eliminate hard drive vibration and spindle noise.

A reporter-class Roland R-05 recorder was mounted atop a tripod positioned precisely 11.5” away from the system. The recorder was positioned at a 45-degree angle above the video cards, with the recorder pointed down toward the center of the card – where the pump resides. The R-05 was left in this position, untouched, throughout the entirety of the test process, including during video card swap-outs.

A rough one-foot distance from the DUT (device under test) is necessary for user-representative acoustics data. We also tested with a sensitive microphone mounted atop the memory for a 1.75” distance from the AIB, but the results of such a close measurement are in no way representative of what an end user would encounter, even on an open bench. This would be the equivalent of placing your ear against the backplate of the video card (do not do this). The additional layer of audio logging was used as a means to validate our own testing and offer redundancy. The proximity allowed us to hear the actual liquid exchange within the chamber of the 980 Ti Hybrid, which is more of a curiosity than an actionable dataset. This will be played back in our next YouTube video.

A third microphone (Sennheiser MKE600 shotgun) was mounted to our Canon XA20 at 11.5" from the rear of the video card. Note that these additional two microphones were used during analysis to determine if the noise was worse from any particular direction; they were also used in the event that one microphone struggled to pick-up the audio. Ultimately, all charts below were generated using data from the R-05.

We tested the following cards:

Acoustics logs were dumped into Audacity and validated in Adobe Soundbooth. We then used a frequency spectrum analysis tool to map the dB levels against the frequency range. Note that the R-05 uses various hi- and low-gain hardware and software switches to improve pick-up.
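For readers unfamiliar with this kind of analysis, the sketch below shows the basic idea of mapping level against frequency with a naive discrete Fourier transform. It is not the tool we used. The synthetic 2kHz tone, sample rate, and function name are illustrative assumptions, and the levels are relative to full scale rather than calibrated sound pressure.

```python
import cmath
import math


def spectrum_db(samples, sample_rate):
    """Return (frequency_hz, level_db) pairs for the positive-frequency bins of a naive DFT."""
    n = len(samples)
    bins = []
    for k in range(n // 2 + 1):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples))
        magnitude = abs(acc) / n
        # Floor the magnitude so silent bins do not produce log10(0).
        bins.append((k * sample_rate / n, 20 * math.log10(max(magnitude, 1e-12))))
    return bins


# Synthetic stand-in for a recording: a pure 2kHz tone, not measured data.
RATE = 8000
tone = [math.sin(2 * math.pi * 2000 * i / RATE) for i in range(256)]

peak_hz, peak_db = max(spectrum_db(tone, RATE), key=lambda pair: pair[1])
print(f"Loudest bin: {peak_hz:.0f} Hz at {peak_db:.1f} dB (relative)")
```

A narrow spike like this is what a tonal whine looks like in a spectrum, as opposed to the broad hump produced by fan noise.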

Note that AMD's Fury X manual demands installation of the radiator with the tubes oriented at the bottom of the radiator (vertically installed). For our testing, we installed the radiator as suggested, orienting the tubes toward the bottom of the additional liquid chamber.

3DMark FireStrike Extreme was run on loop to provoke the whine.