One of the significant problems facing the microprocessor industry, particularly as we push for exascale computing, is the fact that existing memory systems don’t scale well. Now, a new paper from the University of Connecticut has proposed multiple methods of improving cache performance and reducing total power consumption — with extremely positive results.

Before we dive into the discussion, let’s review how modern caches work. Chips from Intel, AMD, and ARM all use a similar structure of small private caches backed up by common shared caches. Sometimes the shared cache is L2, sometimes it’s L3, but the basic concept is the same. The problem with this system is that while it works fine for dual-core or even quad-core systems, it starts to break down beyond that. CPU designers have to start carving unified blocks of cache into smaller sections (dubbed slices) in order to preserve locality: each slice of cache needs to be physically close to the CPU cores that it services. This is one place where block diagrams and actual die layouts tend to differ. A block diagram from Intel or AMD will include a single contiguous area and the label “Shared L3 cache.” In reality, cache architectures tend to look more like this:

Those bright green blocks in the center are Bulldozer’s 8MB L3, and they all have to be cross-connected to each other to ensure that each core can access the entire cache structure. Bulldozer has just eight cores, though. Things get far more complicated when we step up to 15 cores, as in Intel’s Ivytown.

The diagram below shows how the cache is arranged. Each red arrow comes off a 2.5MB slice of cache, with the entire structure connected in a ring bus topology that links to the memory controllers as well. This design is part of why Intel’s Ivytown has come out years after the company launched Ivy Bridge: it’s far more difficult to design a low-latency topology with 15 separate cores to feed.
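To get a feel for why latency becomes harder to control as cores are added, here is a rough sketch of a ring interconnect in Python. It is a toy model, not Intel's actual layout: it assumes a single bidirectional ring with exactly one stop per cache slice, and the function names are made up for illustration. Only the 15 slices and the 2.5MB slice size come from the description above.

```python
def ring_hops(src: int, dst: int, stops: int) -> int:
    """Shortest hop count between two stops on a bidirectional ring."""
    distance = abs(src - dst) % stops
    # Traffic can travel either way around the ring, so take the shorter path.
    return min(distance, stops - distance)


def average_hops(stops: int) -> float:
    """Mean shortest-path hops from one stop to every other stop."""
    total = sum(ring_hops(0, dst, stops) for dst in range(1, stops))
    return total / (stops - 1)


def total_cache_mb(slices: int, slice_mb: float = 2.5) -> float:
    """Total shared cache if every slice has the same capacity."""
    return slices * slice_mb


if __name__ == "__main__":
    for stops in (4, 8, 15):
        print(f"{stops:2d} stops: {average_hops(stops):.2f} average hops")
    print(f"15 slices x 2.5MB = {total_cache_mb(15)}MB of shared cache")
```

Under these assumptions, a request on a 15-stop ring travels 4 hops on average to reach another slice, against roughly 1.3 hops on a 4-stop ring, and the 15 slices add up to 37.5MB of shared cache. Real designs complicate the picture, but the trend is the point: more slices mean more distance between a core and the data it wants.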

What the research team did in this instance was create a locality-aware system that places data in specific areas of the cache depending on how that particular block of data is being used. Data that one core requests frequently can be written to that core’s own area of the cache. If two cores are requesting the same large block of data but only updating it occasionally, the read-only portion of the data can be dropped in each core’s private cache (typically L2), while the shared version is tucked into an optimized location.
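The general idea of a placement decision driven by access patterns can be sketched in a few lines of Python. This is an illustration only, not the researchers' protocol: the `LineStats` class, the `place` function, the returned labels, and both thresholds are hypothetical, chosen to mirror the two cases described above (one core dominating a block, and several cores mostly reading it).

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LineStats:
    """Per-block access counters, keyed by core id (hypothetical bookkeeping)."""
    reads: Counter = field(default_factory=Counter)
    writes: Counter = field(default_factory=Counter)

    def record(self, core: int, is_write: bool) -> None:
        (self.writes if is_write else self.reads)[core] += 1


def place(stats: LineStats,
          dominant_share: float = 0.75,   # assumed threshold, not from the paper
          max_write_ratio: float = 0.1) -> str:  # assumed threshold as well
    """Pick a placement for a block based on who has been touching it."""
    total_writes = sum(stats.writes.values())
    total = sum(stats.reads.values()) + total_writes
    if total == 0:
        return "home-slice"

    # Case 1: one core accounts for most accesses, so keep the block near it.
    core, count = (stats.reads + stats.writes).most_common(1)[0]
    if count / total >= dominant_share:
        return f"slice-near-core-{core}"

    # Case 2: several readers and rare writes, so read-only copies are cheap.
    if len(stats.reads) >= 2 and total_writes / total <= max_write_ratio:
        return "replicate-read-only"

    # Otherwise keep a single shared copy at its default location.
    return "home-slice"


if __name__ == "__main__":
    hot = LineStats()
    for _ in range(9):
        hot.record(core=3, is_write=False)
    hot.record(core=1, is_write=False)
    print(place(hot))
```

The design choice worth noticing is that the decision depends on observed behavior per block rather than on a fixed address mapping, which is what "locality-aware" means here. A hardware implementation would have to do this with a handful of bits per cache line rather than full counters.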

These protocol changes can be extended to tell the chip when to look into the last-level cache (LLC) on a cache miss and when data should be pushed to the LLC or evicted from it altogether. By tracking more information about how each block is used than the Least Recently Used (LRU) replacement policy, which is currently used to decide which cached data gets evicted, the team found it could substantially improve performance.
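To see why extra tracking can beat plain LRU, consider the following Python sketch. The `LRUCache` class is a standard textbook LRU. The `ReuseAwareCache` class is a deliberately simple stand-in that evicts whichever block has been reused least since it was inserted; it is an assumption made for illustration, not the policy from the paper. The access trace is also invented: one hot block, "A", interleaved with a stream of blocks that are touched once and never again.

```python
from collections import OrderedDict


class LRUCache:
    """Classic LRU: evict whichever block was touched longest ago."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> None, least recent first
        self.hits = 0
        self.misses = 0

    def access(self, addr: str) -> None:
        if addr in self.lines:
            self.hits += 1
            self.lines.move_to_end(addr)  # mark as most recently used
            return
        self.misses += 1
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # drop the least recent block
        self.lines[addr] = None


class ReuseAwareCache:
    """Illustrative alternative: evict the block with the fewest reuses."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.reuse = {}  # address -> hits since insertion
        self.hits = 0
        self.misses = 0

    def access(self, addr: str) -> None:
        if addr in self.reuse:
            self.hits += 1
            self.reuse[addr] += 1
            return
        self.misses += 1
        if len(self.reuse) >= self.capacity:
            # Ties go to the oldest entry, since dicts keep insertion order.
            victim = min(self.reuse, key=self.reuse.get)
            del self.reuse[victim]
        self.reuse[addr] = 0


def make_trace(rounds: int = 4) -> list:
    """Hot block 'A' touched twice per round, then three one-off blocks."""
    trace = []
    for r in range(rounds):
        trace += ["A", "A", f"s{3 * r}", f"s{3 * r + 1}", f"s{3 * r + 2}"]
    return trace


def run(cache, trace):
    for addr in trace:
        cache.access(addr)
    return cache.hits, cache.misses


if __name__ == "__main__":
    trace = make_trace()
    print("LRU         (hits, misses):", run(LRUCache(3), trace))
    print("Reuse-aware (hits, misses):", run(ReuseAwareCache(3), trace))
```

With a three-block cache and this 20-access trace, LRU scores 4 hits because the streaming blocks keep pushing "A" out, while the reuse-aware variant scores 7 because it recognizes that "A" is the only block worth keeping. The gap is an artifact of a contrived trace, but it shows the kind of opportunity that better usage tracking opens up.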

The final figures are impressive. As Ars Technica reports, after simulating a set of benchmarks on a 64-core processor, the team found that various iterations of its locality-aware protocol reduced power consumption by 13-21%, while benchmark completion time fell by 4-13%. It’s not clear how well these gains would scale down to chips with fewer cores, but we know both Intel and AMD are pursuing lower power consumption by any means necessary.

With Moore’s law price scaling no longer functioning, any steps that can improve cache performance and possibly allow for smaller caches or lower power consumption are going to be critically important to the future of mobile. Meanwhile, the problem of exascale computing is dominated by waste heat and access latencies. Our biggest problem isn’t putting enough cores in a box to hit one exaflop, but doing so within a power envelope that doesn’t require two new Hoover Dams and a three-week synchronization period to bring benchmarks online.