The features of AMD's upcoming Radeon RX 500 Series Vega architecture have been discovered in the code of the just launched ve.ga teaser site and they're incredibly impressive. The company's upcoming next generation Vega graphics architecture is due for a major preview at CES on Thursday, less than three days away.

However, thanks to our crafty friends over at 3DCenter, who have managed to dig up some major as-yet-unreleased details regarding the brand new architecture, you don't have to wait one more minute. All the details have been pulled from within the code-base of the Vega teaser website, ve.ga, which not only makes this the biggest Vega leak yet but also the most significant, because the information comes straight from AMD itself. So without any further delay, let's get to the juicy bits!


Vega, AMD's Most Advanced & Most Impressive Graphics Architecture To Date

Let's start off with a simple summary of Vega's key features. This should help paint a picture of how much of a drastic step forward the new architecture is compared to Polaris.

Vega Architecture

- 4x Power Efficiency

- 2x Peak Throughput/Performance Per Clock

- High Bandwidth Cache

- 2x Bandwidth per pin

- 8x Capacity Per Stack (2nd Generation High Bandwidth Memory)

- 512TB Virtual Address Space

- Next Generation Compute Engine

- Next Generation Pixel Engine

- Next Compute Unit Architecture

- Rapid Packed Math

- Draw Stream Binning Rasterizer

- Primitive Shaders

AMD Vega Lineup

| Graphics Card | Radeon R9 Fury X | Radeon RX 480 | Radeon RX Vega Frontier Edition | Radeon Vega Pro | Radeon RX Vega (Gaming) | Radeon RX Vega Pro Duo |
|---|---|---|---|---|---|---|
| GPU | Fiji XT | Polaris 10 | Vega 10 | Vega 10 | Vega 10 | 2x Vega 10 |
| Process Node | 28nm | 14nm FinFET | FinFET | FinFET | FinFET | FinFET |
| Stream Processors | 4096 | 2304 | 4096 | 3584 | 4096 (?) | Up to 8192 |
| Performance (FP32) | 8.6 TFLOPS | 5.8 TFLOPS | ~13 TFLOPS | 11 TFLOPS | >13 TFLOPS | TBA |
| Performance (FP16) | 8.6 TFLOPS | 5.8 TFLOPS | ~25 TFLOPS | 22 TFLOPS | >25 TFLOPS | TBA |
| Memory | 4GB HBM | 8GB GDDR5 | 16GB HBM2 | TBA | TBA | TBA |
| Memory Bus | 4096-bit | 256-bit | 2048-bit | 2048-bit | 2048-bit | 4096-bit |
| Bandwidth | 512GB/s | 256GB/s | 480GB/s | 400GB/s | TBA | TBA |
| TDP | 275W | 150W | TBA | TBA | TBA | TBA |
| Launch | 2015 | 2016 | June 2017 | June 2017 | July 2017 | TBA |

Vega's Next Compute Unit (NCU), 2x Peak Throughput per Clock And 4x The Power Efficiency

According to the newly dug up data, Vega delivers four times the graphics performance at the same power compared to AMD's previous generation. There isn't much detail to expand upon in terms of context here, but it's very clear that AMD is referring to half precision compute, which would mean that Vega delivers double the single precision compute at the same power.

This is the most impressive figure of the bunch. Doubling the power efficiency of a graphics architecture whilst maintaining or boosting performance is an incredibly challenging engineering feat, one that's made even harder in Vega's case considering that it's built on the same 14nm manufacturing process as Polaris. If the claim holds true, AMD's engineers will have pulled off nothing short of a miracle.


2x peak throughput/clock is another impressive figure that stands as a testament to how radically different Vega is compared to AMD's previous generation GCN architecture. It means that Vega should deliver double the performance at any given clock speed compared to AMD's previous-generation GCN-based GPUs.
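As a rough illustration of where the throughput figures in the table come from, peak GPU throughput is conventionally computed as stream processors × 2 (one fused multiply-add, i.e. two floating-point operations, per clock) × clock speed, with packed FP16 math doubling that again. The clock speed below is a hypothetical placeholder for illustration, not a confirmed Vega specification:

```python
def peak_tflops(stream_processors, clock_ghz, fp16_packed=False):
    """Peak throughput: each ALU retires one FMA (2 FLOPs) per clock.
    Packed FP16 math processes two half-precision values per ALU per clock."""
    flops = stream_processors * 2 * clock_ghz * 1e9
    if fp16_packed:
        flops *= 2
    return flops / 1e12

# Hypothetical 4096-SP Vega part at an assumed ~1.59 GHz clock
fp32 = peak_tflops(4096, 1.59)        # ≈ 13.0 TFLOPS FP32
fp16 = peak_tflops(4096, 1.59, True)  # ≈ 26.1 TFLOPS FP16
```

Plugging the Fury X's 4096 stream processors and 1.05 GHz clock into the same formula yields its 8.6 TFLOPS figure, which is a decent sanity check on the model.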

High Bandwidth Cache, 8x Capacity Per Stack, 2x Bandwidth Per Pin And 512TB Address Space

These specs and features are specific to Vega's second generation High Bandwidth Memory technology. HBM2 offers 8x the capacity per stack compared to first generation HBM and twice the bandwidth per stack thanks to a higher clock speed. First generation HBM found in AMD's Fury series of enthusiast graphics cards features a maximum of 1GB capacity per stack and 128GB/s of bandwidth per stack.

Second generation HBM comes in stacks of up to 8GB with 256GB/s of bandwidth each. Interestingly, the Vega engineering sample that AMD demoed last month was actually an 8GB model with 512GB/s of bandwidth, which would indicate that it was equipped with two 4GB HBM2 stacks, each delivering 256GB/s of bandwidth, rather than a single 8GB stack. However, the Radeon Instinct MI25 deep-learning accelerator, based on the same Vega GPU, features 16GB of memory and 512GB/s of bandwidth, which means that AMD had to equip it with two 8GB stacks.

Each HBM stack connects to the GPU via a 1024-bit memory interface. HBM2 comes out of the factory clocked at double the frequency of first generation HBM, which is how it delivers double the bandwidth per pin. The 512TB virtual address space is quite an interesting feature and is likely achieved by quickly swapping data in and out of the HBM cache.
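The per-stack numbers above fall out of simple arithmetic: bandwidth is interface width times per-pin data rate, divided by eight to convert bits to bytes. A quick sanity check, using the commonly cited 1 Gbps/pin for first-generation HBM and 2 Gbps/pin for HBM2:

```python
def stack_bandwidth_gbs(bus_width_bits, gbps_per_pin):
    """Per-stack bandwidth in GB/s: interface width in bits times the
    per-pin data rate, divided by 8 to convert bits to bytes."""
    return bus_width_bits * gbps_per_pin / 8

hbm1 = stack_bandwidth_gbs(1024, 1.0)  # 128.0 GB/s per first-gen HBM stack
hbm2 = stack_bandwidth_gbs(1024, 2.0)  # 256.0 GB/s per HBM2 stack
# Two HBM2 stacks (a 2048-bit interface) therefore deliver 512 GB/s total.
```

This also explains the demoed Vega sample's 512GB/s figure: two stacks at 256GB/s each.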

Below you will find a quick recap of what we know about AMD's Vega architecture & the upcoming RX 500 series graphics cards.

A New Top-To-Bottom Range Of Radeon RX 500 Series Graphics Cards Based On The Vega Architecture

AMD will be rolling out its next generation Vega architecture across the entire range of its 2017 Radeon graphics cards and it'll do it "soon". The new lineup will span everything from a top-end 4K 60FPS triple-A gaming Radeon graphics card, the very same one that was demoed last week, down to mid-range and entry-level offerings for 1440p and 1080p gaming. The highest end models will feature HBM2 whilst the mid-range and more budget oriented cards will feature GDDR5/X memory.

We've already seen one upcoming Radeon graphics card based on Vega in action. The yet unreleased graphics card was demoed in a head-to-head comparison with NVIDIA's GTX 1080. The demo Vega graphics card had 8GB of HBM2 and it outperformed the 1080 by 10% whilst running Doom in Vulkan at 4K.

The Vega Architecture - AMD's Next Generation Compute Unit

One big announcement that AMD made in its recent press event where Vega was demoed is that the new architecture features what the company calls its NCU, short for Next Compute Unit. We had already detailed key parts of this new design in our exclusive piece about Vega 10 and Vega 11 a couple of months ago.

This new architecture holds several key advantages over its predecessor. Chief among these is that each SIMD inside a given Vega NCU is now capable of simultaneously processing variable-length wavefronts. To the average person that sounds like a bunch of meaningless technical jargon, and I admit it did to me when I first learned about it. However, once you scratch the surface and truly understand what it means, you quickly begin to realize what a big deal it really is.

In AMD's current GCN implementation, each compute unit has four 16-wide vector SIMD units, each of which executes a 64-thread wavefront (a group of threads) over four cycles. Alongside them sits one scalar unit capable of executing one instruction per cycle; this unit is delegated time-critical tasks where the four-cycle turnaround of the SIMD units is simply not good enough.

Unfortunately, these 16-wide SIMD units take exactly the same time no matter how full the wavefront they're fed is. The SIMD unit has to spend four cycles executing whatever threads are presented to it, no matter what, which means that a wavefront with only a handful of active threads takes just as long as a fully packed 64-thread one, leaving the remaining ALUs in the SIMD doing no useful work. Graphics workloads are inherently non-uniform, which means it's effectively impossible to find a scenario where all the SIMD units are fully occupied at any given time.
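To make the cost concrete, here is a toy model of that fixed four-cycle execution. Because the cycle count never changes, ALU utilization is simply the fraction of the 64 wavefront thread slots that hold active threads. The 64-thread wavefront and 16-lane SIMD width are standard GCN figures; the rest is purely illustrative:

```python
WAVEFRONT_SLOTS = 64  # a GCN wavefront always occupies 64 thread slots
SIMD_WIDTH = 16       # executed 16 lanes at a time, over 4 fixed cycles

def alu_utilization(active_threads):
    """Fraction of ALU work that is useful when a partially filled
    wavefront still burns the full four-cycle slot."""
    cycles = WAVEFRONT_SLOTS // SIMD_WIDTH  # always 4, regardless of occupancy
    issued = cycles * SIMD_WIDTH            # 64 lane-cycles consumed either way
    return active_threads / issued

print(alu_utilization(64))  # 1.0  - fully packed wavefront
print(alu_utilization(16))  # 0.25 - three quarters of the ALU cycles wasted
```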

Variable Width Wavefront SIMDs, Getting More Performance Out Of Fewer Cycles

This is no longer the case in AMD's new GCN implementation inside Vega. The Vega architecture includes clever new schedulers and coherency subsystems that allow several wavefronts of different widths to be executed simultaneously inside any compute unit able to accommodate the workload, so that more ALUs are doing useful work at any given time instead of idling or executing predicated-off threads that produce no results.



This in effect allows each NCU to finish considerably more work in the same amount of time than a traditional CU, in addition to freeing up valuable cache and memory resources for other compute units. It's very hard to predict how much of a difference this improvement in resource utilization and CU occupancy will yield, given how unpredictable and inherently fluctuating graphics workloads are. Vega's Next Compute Units are therefore not only faster but also more power efficient, although by how much exactly remains to be seen.
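A back-of-the-envelope way to see the potential win is to compare the cycles a fixed-slot scheduler burns (four per wavefront, regardless of occupancy) against an idealized scheduler that only issues as many lane-cycles as there are active threads. This is purely an illustrative model of the utilization gap, not AMD's actual scheduling algorithm:

```python
import math

SIMD_WIDTH = 16

def fixed_slot_cycles(wavefronts):
    """Classic GCN model: every wavefront burns 4 cycles, full or not."""
    return 4 * len(wavefronts)

def ideal_packed_cycles(wavefronts):
    """Idealized variable-width model: active threads from different
    wavefronts are packed together, so only the total thread count
    (rounded up to SIMD width) determines the cycle cost."""
    total_threads = sum(wavefronts)
    return math.ceil(total_threads / SIMD_WIDTH)

# Four sparse wavefronts with 16, 8, 24 and 16 active threads each
waves = [16, 8, 24, 16]
print(fixed_slot_cycles(waves))    # 16 cycles on the fixed-slot model
print(ideal_packed_cycles(waves))  # 4 cycles if packing were perfect
```

Real hardware will land somewhere between the two extremes, which is exactly why the practical gains are so hard to predict.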