You’ve been waiting long enough for a new generation of GPUs, haven’t you? The whole world has, in fact. The extended stay at the 28nm production process did no-one any favours in terms of advancement, and it forced the hands of engineers at AMD and NVIDIA to bring forward their architectural changes to offset the stagnation. While we’ve definitely gotten a few gems out of this GPU cycle, like the excellent Geforce GTX 750 Ti, and AMD’s Hawaii architecture that just keeps on trucking, this year it all changes as we move to newer processes and better technology. NVIDIA started this off with Pascal in the past month and a half, launching the Geforce GTX 1080 and GTX 1070 in a manner that many claim was a paper launch.

But that was then. Today, it’s AMD’s turn, and the Radeon RX 480 is not only launching today, but it’s also on sale today. Local retailers have had stock for about a week now, and pricing should be quite reasonable. I was even told it might be lower than R5500, which makes it a fantastic bargain for the right person who was looking at their options for a mid-range card this week. Let’s get into this launch and dissect the details.

Before we get going, please excuse the watermarks on the slides and the horrible aliasing on the text. These were new slides that AMD sent me this morning that had the watermarks applied following the leak by Videocardz late last night. Once I receive slides that don’t look like this, I’ll replace them as soon as possible.

Let’s start with an overview of the RX 480. This is AMD’s mid-range contender, sitting at the lower end of the market priced at $199 for the 4GB model and $229 for the 8GB model. Initially, most sales will be of the 8GB model, partly because of its more future-proof VRAM pool, but also because the 4GB models will be produced in smaller numbers. AMD expects buyers to opt for the 4GB model only if they’re not planning to drive a 2560 x 1440 monitor or triple 768p displays; 4GB should be enough for most games played at 1080p, provided you don’t go crazy on the texture quality.

So, the specs. At 5.8TFLOPS of single-precision compute capability, the RX 480 sits roughly square with the Geforce GTX 1070 at that card’s base clock (around 5.8TFLOPS; NVIDIA quotes 6.5TFLOPS at boost). For the first time, it features both a base and a boost clock, set to 1120MHz and 1266MHz respectively. In the press briefing AMD gave, it was explained that consumers felt they were getting a raw deal from previous releases, because the advertised clock was often higher than the frequencies they actually saw reported during use. Now, AMD offers the same sort of scheme as NVIDIA – the base clock of 1120MHz is hardcoded into the BIOS, and the guarantee is that no matter what you’re doing with the card, you’re almost never going to see frequencies drop below that.
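As a sanity check on those numbers, peak single-precision throughput is just shader count × clock × 2 FLOPs per cycle (one fused multiply-add per shader per clock). A quick sketch using the figures above:

```python
# Peak single-precision throughput: shaders * clock * 2 FLOPs per cycle,
# since each shader can retire one fused multiply-add (2 FLOPs) per clock.
def peak_tflops(shaders: int, clock_ghz: float) -> float:
    return shaders * clock_ghz * 2 / 1000.0

rx480_boost = peak_tflops(2304, 1.266)  # ~5.83 TFLOPS at the 1266MHz boost clock
rx480_base = peak_tflops(2304, 1.120)   # ~5.16 TFLOPS at the 1120MHz base clock
```

AMD’s 5.8TFLOPS figure lines up with the boost clock, so the guaranteed base-clock floor works out closer to 5.2TFLOPS.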

At 224GB/s of memory bandwidth, it’s expected that most reference cards from AMD today will have GDDR5 memory clocked at 7.0GHz. The “or higher” addition to that line suggests that some partners will have the option to use newer 8.0GHz memory from Samsung or Hynix, and it is also possible that GDDR5X implementations might creep into the picture. AMD’s memory controller in Polaris has been reworked to support GDDR5X, so I wouldn’t be surprised to see this arrive later.

The thermal design power (TDP) limit is set to 150W. I don’t have a review sample on hand to verify this claim and see what the actual power draw is during gaming and benchmarks, but the average draw for the RX 480 should be well below this limit, possibly sitting at around 120W. To this end, the RX 480 ships with a single 6-pin PEG power connector, and it’s reasonable to assume that AMD’s partner designs will offer an 8-pin connector later, raising the board power limit to around 225W. Like NVIDIA’s Pascal cards, the RX 480 supports DisplayPort 1.3 and 1.4 (to be tested and verified at a later date), as well as HDMI 2.0 with HDCP 2.2 capability. There’s also H.265 (HEVC) decode and encode support, which means you’ll be able to stream out to Twitch or YouTube using software like OBS or Raptr while using less bandwidth.
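The connector arithmetic behind those TDP figures follows directly from the PCIe spec limits (75W from the slot, 75W per 6-pin plug, 150W per 8-pin plug) – these are spec ceilings, not measured draw:

```python
# PCIe power delivery ceilings (per spec, not measured draw).
SLOT_W = 75        # PCIe x16 slot
SIX_PIN_W = 75     # 6-pin PEG connector
EIGHT_PIN_W = 150  # 8-pin PEG connector

reference_limit = SLOT_W + SIX_PIN_W    # 150W, matching the quoted TDP
partner_limit = SLOT_W + EIGHT_PIN_W    # 225W ceiling for 8-pin partner boards
```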

While reference designs will not ship with a DVI connector, there will be partner custom designs that will feature it, although you won’t be able to use VGA displays with this card. It’s the final death knell of the D-sub standard, which I’m not really concerned about, to be honest.

Diving into some of the improvements in the latest version of GCN, officially dubbed GCN 4.0, AMD has made a lot of changes to the long-running architecture that started out in 2012. Compared to GCN 3.0, we have improved geometry processing capabilities (more on that in a bit), LiquidVR support with foveated rendering, big upgrades to the asynchronous compute capabilities that further extend AMD’s technical lead over NVIDIA (though we’ll have to wait for benchmarks to discern any actual performance benefits), and native support for half-precision (FP16) operations on the GPU, which come in handy for uses like machine learning. There’s also a new version of AMD TrueAudio, geared towards improving the VR experience.

First, the geometry changes. One of the primary advantages NVIDIA had over AMD with Maxwell, at least until recently, was an engine that discarded unnecessary triangles – ones that weren’t needed, or weren’t even in the current viewport – before rendering them. The new primitive discard accelerator drops work on any triangles that enter the rendering pipeline but aren’t needed, either because they’ll never be seen or because they cover no sample points. This helps in scenarios like high levels of tessellation applied to polygons, where the GPU can discard triangles introduced into the pipeline that are too small to be noticed.
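To illustrate the idea (a conceptual software sketch, not AMD’s actual hardware logic), a primitive discard test boils down to rejecting triangles that are degenerate or whose bounding box covers no pixel-sample centre:

```python
import math

# Conceptual sketch of small-primitive culling, not AMD's hardware logic.
# A triangle can be discarded when it has zero area, or when its bounding
# box spans no sample centre (centres sit at integer + 0.5 positions).
# The bounding-box test is conservative: it never discards a visible
# triangle, but may keep some slivers a full coverage test would drop.
def covers_any_sample(v0, v1, v2):
    # Signed area (cross product); zero means a degenerate triangle.
    area = (v1[0] - v0[0]) * (v2[1] - v0[1]) - (v2[0] - v0[0]) * (v1[1] - v0[1])
    if area == 0:
        return False  # degenerate: discard

    def spans_sample(lo, hi):
        # Is there an integer n with lo <= n + 0.5 <= hi?
        return math.floor(hi - 0.5) >= math.ceil(lo - 0.5)

    xs, ys = (v0[0], v1[0], v2[0]), (v0[1], v1[1], v2[1])
    return spans_sample(min(xs), max(xs)) and spans_sample(min(ys), max(ys))
```

A sub-pixel triangle sitting between sample centres – say one spanning (0.1, 0.1) to (0.2, 0.2) – fails the test and never reaches the shaders, which is exactly the work the accelerator saves.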

Looking at the slide above, AMD claims a huge performance benefit with primitive discarding turned on – a 3x improvement in overall throughput when rendering a tessellated model with 4x MSAA at a density of 18 triangles per pixel. 32 tri/px sees a drop as the workload almost doubles, but returns grow again as you approach 98 tri/px, where the triangles are so small that they’re basically indistinguishable. While this improvement doesn’t benefit us much now, it will become necessary as VR gaming gains ground and highly tessellated models become a necessity to maintain some level of realism.

AMD mentions that the benefits grow as you apply more levels of MSAA: triangles that don’t need to be sampled are also dropped from the MSAA workload, boosting performance. MSAA has always been a taxing addition to games, and while techniques like FXAA and TXAA help by producing an approximate image that looks about the same for less work, discarding triangles minimises the amount of work required to run higher levels of MSAA.

The new index cache is a bit trickier to understand. The key point for GCN 4.0 designs – and Polaris 10 in particular, as seen today in the RX 480 – is that there’s a larger amount of L2 cache available, which helps to reduce the number of copy operations required in certain workloads. The index cache also keeps some geometry resident, so the same object doesn’t have to be copied and re-rendered. It’s a small change, but these things add up.

Like Maxwell 1.0, and later 2.0, AMD claims per-compute-unit improvements in overall shader efficiency compared to the Hawaii architecture. For a given number of shaders in a compute unit, GCN 4.0 shaders should be 15% faster than those found in Hawaii, which is technically GCN 2.0. If the RX 480 had the same number of shaders as an R9 290, it would be 15% faster right off the bat, even before clock speed boosts come into the equation. If you’re paying attention to how the math works out (the R9 290 has 2560 shaders versus the RX 480’s 2304), you might have some idea of the raw performance of the RX 480. For those of you who didn’t whip out a calculator: at the same clock speed as the R9 290, the RX 480 works out about 3.5% faster overall. That’s not too shabby for a card that has fewer functional units and a TDP set 100 watts lower.
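Checking that claim is simple arithmetic – assuming the quoted 15% per-shader uplift applies uniformly across the workload (my assumption, not AMD’s published math):

```python
# Back-of-the-envelope check: scale the RX 480's shader count by the
# claimed 15% per-shader uplift and compare against the R9 290.
r9_290_shaders = 2560
rx_480_shaders = 2304
per_shader_uplift = 1.15  # AMD's claimed GCN 4.0 vs GCN 2.0 efficiency gain

relative = (rx_480_shaders * per_shader_uplift) / r9_290_shaders
# relative ≈ 1.035, i.e. about 3.5% faster at equal clock speeds
```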

I won’t go into too much detail with these changes, but the two I do want to point out are the reduction of pipeline stalls and the native FP16 support. The former finally solves AMD’s issues with shader utilisation in older versions of GCN, particularly large designs like Fiji and Hawaii. As the shader engine grew wider, AMD’s engineers ran into resource utilisation issues: while the hardware and architecture were technically sound, they couldn’t adapt very efficiently to games and other software that introduced stalls into the pipeline. While the GPU waited for an operation to complete, the other shaders sat twiddling their thumbs and picking bogeys out of their noses. AMD’s driver team mitigated those stalls over time by working with software vendors to fix the issues, though some of this was also achieved through driver hacks. That should be a thing of the past now.

The native FP16 and Int16 support also puts AMD back in the running for deep learning initiatives, a market NVIDIA recently started hammering with the GP100-based Tesla P100 accelerator, a GPU designed specifically for deep learning and slicing through big data workloads. This puts AMD’s FirePro solutions into contention for large-scale supercomputer projects, though those cards may come later – NVIDIA’s P100 is also experiencing availability issues, along with extremely high prices as a result of low yields.

The GCN 4.0 architecture, and the RX 480 in particular, now supports AMD’s second-generation delta colour compression engine. While big improvements to compression were introduced with the Fiji family, those benefits weren’t immediately obvious: with so much HBM bandwidth on tap, it was a pointless exercise to talk about how memory compression boosted performance, and it was never a concern for AMD.

In the RX 480, however, tweaks AMD has made to their compression engine give the memory controllers a substantial gain in effective bandwidth. This graph is slightly confusing, but the math works out: starting from 224GB/s of raw bandwidth, the RX 480 lands at around 358GB/s of effective bandwidth – roughly a 60% uplift. That’s 38GB/s higher than the R9 290X’s raw peak of 320GB/s, but still well below the Fury X’s peak of 604GB/s, assuming an 18% uplift from delta colour compression on that card.
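The arithmetic above can be reproduced from the published specs; note that the compression uplift factor is inferred from AMD’s 358GB/s figure rather than being an official number:

```python
# Raw bandwidth: bus width (bits) / 8 * effective memory clock (Gbps).
bus_width_bits = 256
effective_clock_gbps = 7.0                           # 7.0GHz GDDR5

raw_gbs = bus_width_bits / 8 * effective_clock_gbps  # 224.0 GB/s
effective_gbs = 358.0                                # AMD's quoted effective figure
dcc_uplift = effective_gbs / raw_gbs                 # ~1.60; inferred, not official

fury_x_effective = 512 * 1.18                        # ~604 GB/s, assuming 18% DCC uplift
```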

This is still a good result. AMD’s 256-bit memory bus implementations on GCN 4.0 can match and exceed Hawaii’s raw bandwidth numbers, even if that only holds when most of the image can be compressed, which can be an infrequent occurrence. This allows AMD to keep using GDDR5 in their designs without needing to move to GDDR5X immediately, and by the time they do make that move, yields on GDDR5X will be higher.

AMD’s changes to their asynchronous compute engines (ACEs) are quite simple, but further extend their lead over NVIDIA. Whereas Pascal improves NVIDIA’s ability to switch contexts between graphics and compute workloads at the drop of a hat (measured somewhere in the microsecond range), GCN 4.0 can match Pascal’s ability and then some. AMD can still run compute workloads asynchronously, with no delineated point at which the shader cores do compute or graphics work, but they can also pre-empt compute workloads (just like Pascal does now), and they can switch contexts very quickly (just like Pascal).

These changes mostly benefit games and software running under modern renderers like DirectX 12 and Vulkan, but the improved pre-emption and the addition of a quick response queue also lift performance for compute workloads in DirectX 11 titles. Don’t expect big gains there, though – most DirectX 11 titles didn’t pile on the compute workloads, so these changes won’t improve AMD’s performance in existing titles by much.

With all the small details out of the way, let’s look at AMD’s claimed benchmark results. As soon as I have a benchmark rig and an RX 480 to test, I’ll be able to provide a second set of data to compare against AMD’s numbers.

AMD, surprisingly, pits the RX 480 against the ASUS Geforce GTX 970 Strix. This isn’t the highest-clocked GTX 970 around, but it is one of the nicest cards available, and definitely one of the quietest. The DirectX 11 performance is a little… disappointing. Even though the GTX 970 is overclocked, the stock RX 480 pulls ahead in Far Cry 4 and Primal, Overwatch, Middle-Earth: Shadow of Mordor, Tom Clancy’s The Division, and Thief. It’s unable to draw ahead in the other titles on the list, many of which use NVIDIA-specific GameWorks effects, but take note that all these benchmarks are run at 1440p, and only two games fail to average over 60fps. With the RX 480 replacing the Radeon R9 380 at the same price point, there’s a big performance leap on offer here.

Where things get really interesting is AMD’s clear dominance in DirectX 12 benchmarks. The RX 480 pulls wins across the board against the GTX 970 Strix, and the only titles where it doesn’t have a big lead are Rise of the Tomb Raider and Total War: Warhammer – the former takes a lackluster approach to DirectX 12, while the latter’s adoption is still in its infancy. The patch adding DirectX 12 support to Warhammer isn’t even out yet, so more improvements may be waiting in the wings. And again, compared to the R9 380, there’s a large gap in performance here. Even the performance in Forza 6 Apex, benchmarked at 4K, is impressive.

AMD also threw the Vulkan API a bone in their reviewers guide, showing clear gains when running DOTA 2 Reborn at 1440p and 4K. The gains at 4K aren’t as high, but that’s because the GPU is almost tapped out and can’t run any faster. Overclocking should see it reach about 80fps or thereabouts, keeping it well within the FreeSync window of most supported 4K monitors.

As always with AMD’s press releases, though, the benchmark settings chosen for this launch are not what you’d typically see in the reviews coming out today from other sites. Only some games are run with 16x anisotropic filtering (AF), while many others have no MSAA applied at all. Shadow of Mordor oddly has SMAA applied, which runs better on AMD’s cards and is lighter on resources as well. Forza 6 Apex has 8x MSAA applied, and is also a game that makes good use of tessellation.

This raises the question of where the RX 480 stands relative to the GTX 970 Strix when run at settings most gamers will actually use. Will it come out ahead, or will it fall behind and only match a stock GTX 970? Coming out ahead is preferable, because an overclocked GTX 970 Strix performs at the level of a stock GTX 980 – if that’s where the reference RX 480 sits, AMD has a winner on their hands and NVIDIA had better come up with a retort fast. If it ends up slower than an overclocked GTX 970, it’s still a good card and excellent value for money, but it won’t persuade GTX 960 owners, for example, to upgrade – they’ll want to wait for the GTX 1060 instead for an equivalent performance boost. AMD needs to match the GTX 980 in real-world tests to make a significant dent in the mid-range market.

That’s all from the Radeon RX 480 release today. It promises to be AMD’s strongest opening in the mid-range market since the launch of the Radeon HD 4850 and HD 4870, and I think we’re in for a treat with NVIDIA’s GTX 1060 expected to appear on the market soon.