It’s been almost a decade since CPU developers began talking up many-core chips with core counts potentially into the hundreds or even thousands. Now, a paper presented at the 2016 Symposium on VLSI Technology describes a 1,000-core CPU built on IBM’s 32nm PD-SOI process. The “KiloCore” is an impressive beast, capable of executing up to 1.78 trillion instructions per second using just 621 million transistors. The chip was designed by a team at UC Davis.

First, a clarifying note: If you Google “KiloCore,” most of what shows up relates to a much older IBM alliance with a company named Rapport. We reached out to project lead Dr. Bevan Baas, who confirmed to us that “This project is unrelated to any other projects outside UC Davis other than that the chip was manufactured by IBM. We developed the entire architecture, chip, and software tools ourselves.”

The KiloCore is similar to many-core architectures we’ve seen from other companies, in that it relies on an on-chip network to carry information across the CPU. What sets the KiloCore apart from those other solutions is that it doesn’t include L1/L2 caches or rely on expensive cache coherency circuitry.

The historical problem with building large arrays of hundreds or thousands of CPU cores on a single die is that even very small per-core caches drive up power consumption and die size very quickly. GPUs utilize both L1 and L2 caches, but GPUs are also designed for power budgets orders of magnitude higher than a chip like KiloCore, with much larger die sizes. According to the VLSI paper, KiloCore cores store data inside very small amounts of local memory, within other nearby processors, in independent on-chip memory banks, or in off-chip memory. Information is transferred within the processor via “a high throughput circuit-switched network and a complementary very-small-area packet-switched network.”

Taken as a whole, the KiloCore is designed to maximize efficiency by spending power to transfer data only when that transfer is necessary for a given task. The routers, independent memory blocks, and processors can all spin up or down as needed for any task, while the cores themselves are in-order designs with a seven-stage pipeline. Cores that have been fully power-gated off leak no power at all, while idle cores leak just 1.1% of their estimated energy consumption. Total RAM across the independent memory blocks is 768KB (12 blocks of 64KB each), and the entire chip fits into a package measuring 7.94mm by 7.82mm.

Why build such tiny cores?

The numerous research projects into many-core architectures over the past 5-10 years are at least partly a reaction to the death of single-core scaling and voltage reductions at new process nodes. Before 2005, there was little reason to invest in building the smallest, most power-efficient CPU cores available. If it took five years to move your project from the drawing board to commercial production, you’d be facing down Intel and AMD CPUs that were cheaper, faster, and more power efficient than the cores you started off trying to beat. Issues like this were part of why cores from companies like Transmeta failed to gain traction, despite arguably pioneering power-efficient computing.

The failure of conventional silicon scaling has brought alternate approaches to computing into sharper focus. Each individual CPU inside a KiloCore offers laughable performance compared to a single Intel or even AMD CPU core, but collectively they may be capable of vastly higher power efficiency in certain specific tasks.

“The cores do not utilize explicit hardware caches and they operate more like autonomous computers that pass information by messages rather than a shared-memory approach with caches,” Dr. Baas told Vice. “From the chip level point of view, the shared memories are like storage nodes on the network that can be used to store data or instructions and in fact can be used in conjunction with a core so it can execute a much larger program than what fits inside a single core.”
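The distinction Dr. Baas draws can be illustrated with a toy example. The sketch below is purely conceptual and assumes nothing about KiloCore's actual toolchain or ISA: it just shows two "cores" cooperating by passing messages over a point-to-point channel, with no shared caches and no coherency protocol, which is the programming style the quote describes.

```python
from queue import Queue
from threading import Thread

# Conceptual sketch only -- not KiloCore's real programming interface.
# Each "core" runs its own small kernel and communicates solely by
# messages over a channel, standing in for the on-chip network.

def producer(out_q):
    # First stage: generate values and forward them over the link.
    for value in range(5):
        out_q.put(value)
    out_q.put(None)  # sentinel marking end of stream

def consumer(in_q, results):
    # Second stage: receive messages and process them with local state only.
    while (value := in_q.get()) is not None:
        results.append(value * value)

link = Queue()  # stands in for one circuit-switched network channel
results = []
stage1 = Thread(target=producer, args=(link,))
stage2 = Thread(target=consumer, args=(link, results))
stage1.start(); stage2.start()
stage1.join(); stage2.join()
print(results)  # [0, 1, 4, 9, 16]
```

Because each stage touches only its own data, there is nothing for a coherency protocol to track, which is exactly the property that lets KiloCore omit that circuitry.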

The point of architectures like this is to find extremely efficient methods of executing certain workloads, then refine those architectures for still greater efficiency or execution speed without compromising the extremely low power consumption of the initial platform. In this case, the KiloCore’s per-instruction energy can be as low as 5.8 pJ, including instruction execution, data reads/writes, and network accesses.
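For a rough sense of scale, the two headline figures can be multiplied together. This is a back-of-the-envelope illustration only: the peak throughput and minimum per-instruction energy quoted above are likely measured at different operating points, so assuming they hold simultaneously is an approximation.

```python
# Back-of-the-envelope estimate from the article's two quoted figures.
# Assumption (hypothetical): both numbers apply at the same operating point.
instructions_per_second = 1.78e12   # peak: 1.78 trillion instructions/sec
energy_per_instruction = 5.8e-12    # minimum: 5.8 pJ per instruction

power_watts = instructions_per_second * energy_per_instruction
print(f"{power_watts:.2f} W")  # prints "10.32 W"
```

Even under that generous assumption, the whole 1,000-core array would draw on the order of 10W at full tilt, which underlines how aggressively the design trades per-core performance for efficiency.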