The AMD Kaveri Architecture

AMD is launching Kaveri this week, though it is taking a different approach: rather than sending us the high-end 7850K for testing, AMD has us looking at the 45 watt A8-7600.

Kaveri: AMD’s New Flagship Processor

How big is Kaveri? We already know the die size of it, but what kind of impact will it have on the marketplace? Has AMD chosen the right path by focusing on power consumption and HSA? Starting out an article with three questions in a row is a questionable tactic for any writer, but these are the things that first come to mind when considering a product the likes of Kaveri. I am hoping we can answer a few of these questions by the end of this article, but alas it seems as though the market will have the final say as to how successful this new architecture is.

AMD has been pursuing the “Future is Fusion” line for several years, but it can be argued that Kaveri is truly the first “Fusion” product that completes the overall vision for where AMD wants to go. The previous several generations of APUs were initially not all that integrated in a functional sense, but the complexity and completeness of that integration have improved with each iteration. Kaveri takes this integration to the next step, one which fulfills the promise of a truly heterogeneous computing solution. While AMD has the hardware available, we have yet to see if the software companies are willing to leverage the compute power afforded by a robust and programmable graphics unit powered by AMD’s GCN architecture.

(Editor's Note: The following two pages were written by our own Josh Walrath, discussing the technology and architecture of AMD Kaveri. Testing and performance analysis by Ryan Shrout starts on page 3.)

Process Decisions

The first step in understanding Kaveri is taking a look at the process technology that AMD is using for this particular product. Since AMD divested itself of its manufacturing arm, it has had to rely on GLOBALFOUNDRIES to produce nearly all of its current CPUs and APUs. Bulldozer, Piledriver, Llano, Trinity, and Richland based parts were all produced on GF’s 32 nm PD-SOI process. The lower power APUs such as Brazos and Kabini have been produced by TSMC on their 40 nm and 28 nm processes respectively.

Kaveri will take a slightly different approach here. It will be produced by GLOBALFOUNDRIES, but it will forgo SOI and utilize a bulk silicon process. 28 nm HKMG is very common around the industry, but few pure play foundries were willing to tailor their process to the direct needs of AMD and the Kaveri product. GF was able to do such a thing. APUs are a different kind of animal when it comes to fabrication, primarily because the two disparate units require different characteristics to perform at the highest efficiency. As such, compromises had to be made.

GPUs perform best using high density transistors running at lower speeds, as more parallel units can be packed into a chip. The lower clock speeds are not necessarily a hindrance to these massively parallel processors, so the focus is primarily that of maximizing transistor count to die space. CPUs on the other hand seem to work better with more spacing between transistors and being able to run at a higher clock speed without breaking any power and TDP envelopes. These are generalizations, but the truth of the matter is that CPUs and GPUs are very different beasts when it comes to design considerations at a very low level.

The 28 nm bulk/HKMG process at GF is more of a compromise that is optimized for good performance for both the GPU and CPU. It offers good enough density and good enough speed to make for a competitive product in the marketplace. It is a bit more biased towards the GPU portion, as the CPU takes a hit when it starts to run at the higher TDPs. So at 95 watts, the CPU portion of Kaveri is running as fast as it can while being constrained by TDP concerns. Even though 28 nm HKMG in theory should offer a little more headroom than the previous 32 nm PD-SOI based process, in the end Kaveri will run oh-so-slightly slower than the previous generation Richland in terms of raw CPU clockspeed. The GPU portion will also be clocked significantly lower than the previous VLIW4 based part in Richland. These are not necessarily bad things, because the efficiency improvements in both the CPU and GPU offset the clockspeed disadvantages.

Steamroller Improvements

Some years back AMD decided to go the CMT (clustered multi-threading) route for multi-threaded efficiency vs. die cost. The first product to sport these new cores was the Bulldozer based FX-8150. The results were not very positive. The part showed some real issues with power consumption, heat production, and single threaded performance. While it did very well in heavily multi-threaded apps, it was not exactly a winning formula. The next update to the architecture was Piledriver. This is found in both the Trinity/Richland line of APUs as well as the FX 8300/6300/4300 series of parts. Piledriver had some small improvements in performance per clock, but the biggest improvement was power. Piledriver did not get as hot or pull as much power per clock as did Bulldozer.

Kaveri introduces the new Steamroller architecture for the CPU portion of the APU. Steamroller is another improvement over Piledriver, especially in terms of performance per clock. Kaveri is composed of two Steamroller modules which each contain two cores, so a two module unit can address four threads. The front end of the module was reworked in a very significant manner to improve not only single thread performance, but multi-threaded performance as well.

The biggest improvement is the addition of another decoder. Previous iterations only had one instruction decode unit per module, so each module was limited to one thread per clock. We can see right off the bat that single threaded performance will suffer because a good portion of the execution units in each core will be waiting for instructions every clock. Multi-threading also suffers because each module only addresses half of the potential threads vs. core count.
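The decode bottleneck described above can be illustrated with a toy model. This is only a sketch under loud assumptions: the function name `decode_slots`, the round-robin arbitration, and the cycle count are all hypothetical, and the model counts nothing but how often each thread gets a turn at a decoder when both cores in a module are busy. It does not model the real front end.

```python
def decode_slots(cycles, active_threads, decoders_per_module):
    """Count decode turns per thread in one module (toy model, not real hardware)."""
    slots = [0] * active_threads
    for cycle in range(cycles):
        if decoders_per_module >= active_threads:
            # Enough decoders: every active thread decodes on every clock.
            for thread in range(active_threads):
                slots[thread] += 1
        else:
            # Assumption: a single shared decoder alternates between threads,
            # so only one thread is served per clock.
            slots[cycle % active_threads] += 1
    return slots


if __name__ == "__main__":
    shared = decode_slots(cycles=1000, active_threads=2, decoders_per_module=1)
    dedicated = decode_slots(cycles=1000, active_threads=2, decoders_per_module=2)
    print("one shared decoder :", shared)
    print("one decoder per core:", dedicated)
```

In this simplified picture, two busy threads each get a decode turn on half of the clocks with a shared decoder and on every clock with a decoder per core. Real gains will be smaller, since decode is only one of several limits on throughput.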

AMD did not just stop there. They improved essentially every piece of the front end, as well as how the D-caches handle and store data. The integer and floating point units look to be left untouched, but every other aspect of the chip was touched upon and improved by AMD’s engineers. The integer and floating point/SIMD units were seemingly fast enough for the job, but they just could not be fed data and instructions effectively and efficiently.

AMD showed us estimates of a peak 20% improvement in performance per clock. They then told us that in most real-world situations that number is likely to be 10%. Still, this is a pretty big jump in single thread performance, and it will be able to handle multi-threaded loads more efficiently as well.
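To see how a per-clock gain can offset a lower clockspeed, a rough first-order model treats single-thread throughput as performance per clock multiplied by frequency. The sketch below uses the 10% and 20% figures quoted above; the 0.95 clock ratio is a hypothetical placeholder and not a measured Kaveri or Richland number, and the function names are made up for illustration.

```python
def relative_performance(ipc_gain, clock_ratio):
    """Throughput relative to the older part, modeled as (per-clock gain) x (clock ratio)."""
    return (1.0 + ipc_gain) * clock_ratio


def break_even_clock_ratio(ipc_gain):
    """Clock ratio at which the newer part merely matches the older one."""
    return 1.0 / (1.0 + ipc_gain)


# Hypothetical: the new part runs at 95% of the old part's clock.
ASSUMED_CLOCK_RATIO = 0.95

if __name__ == "__main__":
    for label, gain in (("typical", 0.10), ("peak", 0.20)):
        perf = relative_performance(gain, ASSUMED_CLOCK_RATIO)
        floor = break_even_clock_ratio(gain)
        print(f"{label}: {perf:.3f}x at {ASSUMED_CLOCK_RATIO:.2f} clock, "
              f"break-even clock ratio {floor:.3f}")
```

Under this model a 10% per-clock gain tolerates a clock drop of roughly 9% before performance falls behind, which is why a slightly lower clockspeed need not mean a slower CPU.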

Power does not seem to be an issue with this design, though as mentioned in the process section AMD did take a hit in CPU performance in the high TDP range. With more tweaking of the process we can expect faster parts to be released down the line, but for now the A10-7850K will be the top SKU for this introduction. Also, AMD will be offering these products in the 15 watt TDP range later on this year. That is a pretty significant range of TDPs for essentially a single design. AMD did disclose all of the power saving features, but they seem to be very comparable to what was introduced with Richland.

Definition of Compute Cores

AMD is coming out with a new description for cores with Kaveri. Compute cores were bandied about during the tech day, and they actually make a bit of sense. At CES, NVIDIA came out with their “192 core” Tegra K1, but that actually seems a bit of a misnomer as compared to how AMD is defining “cores”. Those Tegra cores are more akin to SIMD units than standalone cores. My understanding is that a single SMX unit could be considered a “compute core”.

On the other hand, AMD’s GCN compute clusters can be defined as cores in a more historical sense. The top end APU has a total of 12 compute cores; 4 of them are the CPU cores in the Steamroller modules, while the other 8 are the GCN units. Each GCN unit contains 4 x 16 wide vector units (SIMD), a single scalar unit, branch and message unit, a scheduler, texture and texture fetch units, and a bunch of cache. Each GCN unit has around 146 KB of cache divided between vector registers, a scalar register, local data share, and L1 cache. It also has such basics as a program counter, which certainly fits in with their traditional definition of cores. Each GCN unit can theoretically assign new jobs/work to the CPU when needed. While you certainly can’t boot up an OS from a GCN core, it can do a significant amount of work independently from the CPU.
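The compute core arithmetic is easy to check. The short sketch below simply multiplies out the figures quoted above for the top end APU; the variable names are illustrative, and the per-unit cache figure is the approximate 146 KB given in the text.

```python
# Figures taken from the description of the top end APU above.
CPU_CORES = 4                 # two Steamroller modules x two cores each
GCN_UNITS = 8                 # GCN compute units counted as "compute cores"
SIMDS_PER_GCN_UNIT = 4        # vector units per GCN unit
LANES_PER_SIMD = 16           # width of each vector unit
CACHE_KB_PER_GCN_UNIT = 146   # approximate, as quoted in the text

compute_cores = CPU_CORES + GCN_UNITS
lanes_per_unit = SIMDS_PER_GCN_UNIT * LANES_PER_SIMD
total_lanes = GCN_UNITS * lanes_per_unit
total_gcn_cache_kb = GCN_UNITS * CACHE_KB_PER_GCN_UNIT

if __name__ == "__main__":
    print("compute cores       :", compute_cores)
    print("vector lanes per GCN:", lanes_per_unit)
    print("vector lanes total  :", total_lanes)
    print("GCN cache total (KB):", total_gcn_cache_kb)
```

So the 12 "compute cores" break down into 4 CPU cores plus 8 GCN units, and those 8 units together carry 512 vector lanes, which shows how different a lane count is from AMD's core count.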