Fiji is the world's first GPU to feature stacked High Bandwidth Memory and is AMD's largest and most advanced chip to date. So we got pretty excited when we found out that ChipWorks has successfully X-ray imaged and released die shots of AMD's most powerful GPU yet - the chip powering the R9 Fury X, R9 Fury and R9 Nano - as well as the stacked High Bandwidth Memory that feeds this massive GPU with the bandwidth it needs.



Sadly, both AMD and Nvidia have developed a habit of not releasing die shots of their latest chips, so we've come to rely on independent outfits like ChipWorks to visualize the inner workings of these cutting-edge pieces of silicon. The latest of these is AMD's Fiji GPU, and this particular chip is of great interest because it's the very first GPU in the industry to support the brand-new HBM JEDEC standard for stacked memory.

GDDR5, or Graphics Double Data Rate 5, has been the primary memory standard for high performance GPUs for seven years now. However, as GPUs grew more powerful their bandwidth requirements grew exponentially, and continued reliance on GDDR5 to feed these powerful new engines was simply not viable.

HBM offers extremely high memory bandwidth at a considerably lower thermal and power cost than GDDR5. But it was also said to require fewer transistors than GDDR5 to support on any SoC, because the memory interface that links HBM to the SoC is considerably smaller than a GDDR5 interface of equivalent bandwidth and memory capacity. Without any die shots of Fiji, however, we were never able to verify this claim - that is, until now.

Diving Deep Into AMD's Fiji GPU

Below we have three GPUs. Tahiti, the industry's first high performance 28nm GPU, debuted with AMD's HD 7970 and HD 7950 graphics cards in January of 2012. Then we have Tonga, a re-worked, updated version of Tahiti that was released in 2014 with the R9 285. And finally we have Fiji, the largest GPU AMD has ever created in its history and the history of ATI before it.







The first item on my list when I found out that a die shot of Fiji had been released was to look for the HBM memory interface on the die and verify whether the claims about it being significantly smaller than a GDDR5 interface were accurate. I'm happy to report that they are.

We've recreated a to-scale image of both Fiji and Tonga and compared the die area taken up by their respective memory interfaces.

What we found is that a 1024-bit HBM memory interface is only marginally larger than the 64-bit GDDR5 memory interface in Tonga. This means that the entire 4096-bit HBM memory interface inside Fiji takes up roughly the same silicon area as a 256-bit GDDR5 memory interface. However, the maximum bandwidth that can be driven through a 256-bit GDDR5 interface with the fastest GDDR5 modules available today - 8Gb/s memory chips - is 256GB/s. That's half the bandwidth that HBM1 delivers today over a 4096-bit memory interface, and only a quarter of what HBM2 will deliver next year.
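The arithmetic behind these figures is straightforward: peak memory bandwidth is the bus width (in bytes) multiplied by the per-pin data rate. Here's a minimal Python sketch, assuming the commonly quoted per-pin rates of 8Gb/s for GDDR5, 1Gb/s for HBM1 and 2Gb/s for HBM2:

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: (bus width in bits / 8) * per-pin rate in Gb/s."""
    return bus_width_bits / 8 * data_rate_gbps

# GDDR5: 256-bit interface, 8 Gb/s per pin
gddr5 = peak_bandwidth_gbs(256, 8.0)     # 256 GB/s

# HBM1 (Fiji): 4096-bit interface, 1 Gb/s per pin
hbm1 = peak_bandwidth_gbs(4096, 1.0)     # 512 GB/s - double the GDDR5 figure

# HBM2 (projected): 4096-bit interface, 2 Gb/s per pin
hbm2 = peak_bandwidth_gbs(4096, 2.0)     # 1024 GB/s - four times the GDDR5 figure

print(gddr5, hbm1, hbm2)                 # 256.0 512.0 1024.0
```

Note how HBM gets there with a per-pin rate eight times lower than GDDR5's - the width of the interface does all the work, which is exactly why it can run at lower clocks and voltages.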

So yes, there are tremendous benefits associated with HBM, extending even to a reduction in the die area needed to support the memory standard - area that can instead be dedicated to more shader units, driving the overall performance of the GPU even higher.



Why 3D Stacked High Bandwidth Memory Is A Cornerstone For Next Generation GPUs

Traditionally, performance and power efficiency have been driven primarily by the progress of more advanced manufacturing processes in the semiconductor industry, which enabled the creation of smaller, denser and more power efficient transistors as time went by - an observation that became known as Moore's Law.



AMD and SK Hynix spent seven years and poured an extraordinary amount of resources into developing an entirely new memory standard, and for very good reason. High Bandwidth Memory was born, and exists today, out of necessity. Traditional memory standards had simply reached a point where they were no longer architecturally or economically viable. That point is represented in the graph above at the intersection between the energy consumption of the memory and that of the processor it is feeding.



Accelerators, especially bandwidth hungry ones such as GPUs, continued to scale rapidly because they benefited the most from increasing transistor counts and efficiency. Unfortunately, memory standards did not, and the more powerful these accelerators became the more memory bandwidth they required.



The situation ended up being that engineers had to trade GPU power for more memory bandwidth just to keep the GPU fed. AMD shared a similar graph nearly three years ago, when the head of the company's die stacking program, Bryan Black, first discussed stacked memory and why the industry needed it. And that's why the industry had to look at memory alternatives that would solve these performance and power efficiency challenges.



Memory standards such as GDDR5 hit a wall on several fronts, failing to continue to scale effectively in performance and power efficiency. Pushing GDDR5 frequencies higher to attain more bandwidth meant sacrificing power efficiency, and designing processors with wider GDDR5 memory interfaces inflated both cost and power consumption.

The Real World Advantages of Stacked High Bandwidth Memory

GDDR5 also has a density limitation that prevents its integration into high performance, small form factor accelerators. High Bandwidth Memory, on the other hand, is vertically stacked, which means the connections from one DRAM die to another are much shorter and thus more efficient.





And because vertical stacking enables much greater densities, there are immense area savings on the printed circuit board as a result, enabling far more compact form factors. The area savings extend to the DRAM die itself as well: HBM is far more compact than GDDR5.



Also, unlike GDDR5, HBM is packaged together with its host processor - the accelerator - on a single interposer. The closer proximity to the GPU enables significantly wider memory interfaces and reduces latency, and the smaller, shorter connections also improve power efficiency.



This means that HBM requires a lower voltage to operate and is connected via a considerably wider interface, enabling significant memory bandwidth increases while keeping frequencies low for better power efficiency.



The end result is a 300% improvement in bandwidth per watt, attained through significantly higher bandwidth and significantly reduced power. More interestingly, AMD estimates that 15-20% of the R9 290X's 250W TDP goes to the memory sub-system. This means that AMD can cut memory power consumption from 40-50W down to 15W just by switching from GDDR5 to HBM - something AMD used to its advantage with the R9 Fury X, R9 Fury and R9 Nano graphics cards.