
AMD’s Steamroller core, found in Kaveri APUs, is an interesting duck. It’s the first chip to ship with HSA support, the first CPU core that can be paired with a GCN-based graphics core, and it promised significant improvements in performance-per-clock that were then largely offset by a lower clock speed. Our investigation shows that the situation isn’t that simple, however — there are some areas, particularly on the APU side of the equation, where Steamroller is far better than its predecessor. This article is a low-level investigation into where the chip improved, where it fell short, and what AMD might do going forward.

Since Bulldozer launched in 2011, the architecture's perennial problem has been the disconnect between the sorts of gains AMD seemed to be promising and the real-world gains that actually materialized. AMD's slides, if you recall, promised enormous gains:

A 30% improvement in total operations delivered per cycle. A 20% reduction in missed branches. A 30% reduction in instruction cache misses. The branch prediction improvement is critically important in a chip with a long pipeline, because branch mispredicts kill CPU performance. But the truly puzzling thing about Steamroller is this: Despite a great many improvements, the core still seems handicapped, held back from its full potential.

Let’s try and find out why.

Test setup

All of our tests were conducted with an Asus A88X Pro motherboard, 8GB of DDR3-2133, a Samsung 840 Pro SSD, and Windows 8.1. We tested the A10-7850K against the Piledriver/Richland-based A10-6800K. AMD’s previous APU is also a quad-core design, but has higher clock speeds balanced against an older GPU core and second-generation CPU core.

Turbo Mode was disabled for these tests; the A10-7850K and A10-6800K were both locked to a 4.2GHz clock speed. This is a significant overclock compared to the A10-7850K's turbo speed of 4.0GHz, but it's a downclock for the A10-6800K, which normally runs at 4.4GHz Turbo. Both integrated memory controllers and northbridges were run at their default clock speeds (1.8GHz for the northbridge; Sandra reports that this results in a memory controller clock speed of 3.6GHz).
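To put the locked clock in perspective, here is a minimal sketch of the arithmetic using only the clock speeds quoted in the test setup. The function and variable names are our own, chosen for illustration.

```python
# Clock speeds in GHz, taken from the test setup described in the text.
TEST_CLOCK = 4.2
STOCK_TURBO = {"A10-7850K": 4.0, "A10-6800K": 4.4}


def clock_delta_percent(test_clock, stock_clock):
    """Percent change of the locked test clock relative to the stock turbo clock."""
    return (test_clock - stock_clock) / stock_clock * 100.0


# Positive values are an overclock, negative values a downclock.
deltas = {
    chip: clock_delta_percent(TEST_CLOCK, stock)
    for chip, stock in STOCK_TURBO.items()
}

for chip, delta in deltas.items():
    print(f"{chip}: {delta:+.1f}% vs. stock turbo")
```

With both chips held at the same frequency, any remaining difference in a benchmark result reflects work done per clock rather than clock speed.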

We'll start our investigation with a look at the cache performance of the two chips. This is one area where AMD has historically struggled, so let's see if Steamroller improves the equation. We'd like to thank CPU analyst and programmer Agner Fog for his assistance in testing and identifying ongoing bottlenecks in the Steamroller microarchitecture. Agner has written multiple guides to optimizing code for x86 processors, built an open-source set of benchmarks for testing specific characteristics (available on his website), and compiled detailed latency charts and edge case information for both Intel and AMD chips.
