CPUs no longer deliver the same kind of performance improvements as in the past, raising questions across the industry about what comes next.

The growth in processing power delivered by a single CPU core began stalling out at the beginning of the decade, when power-related issues such as heat and noise forced processor companies to add more cores rather than pushing up the clock frequency. Multi-core designs, plus a boost in power and performance at the next processor node, provided enough improvement in performance to sustain the processor industry for the past several process nodes. But as the benefits from technology scaling slow down, or for many companies stop completely, this is no longer a viable approach.

This reality has implications well beyond CPU design. Software developers have come to expect ever-growing compute and memory resources, but the CPU no longer can deliver the kinds of performance benefits that scaling used to provide. Software programmability and rich feature sets have been a luxury provided by Moore’s Law, which has provided a cushion for both hardware and software engineers.

“Because of Moore’s Law, the way that computing has grown and accelerated is partly because Intel and others kept pushing on the next-generation node, and thus the need to optimize the compute engine itself has been less important,” says Nilam Ruparelia, senior director for strategic marketing at Microsemi, a Microchip company. “But it also happened because software productivity has gone up faster than Moore’s Law. If you make it easy to program, you enable a greater number of people to program. The ability of software to do a variety of things has grown significantly.”

This is becoming far more difficult at each new node. “Processors are no longer the one-size-fits-all answer for processing,” says Geoff Tate, CEO of Flex Logix. “Look at data centers. It used to be that the only processing device was an x86. Now almost all data centers have added both FPGA and GPU processors in various configurations.”

This heterogeneous approach is particularly apparent in AI/ML designs. “Offloading the matrix operations or dense linear algebra operations onto GPUs became necessary,” says Kurt Shuler, vice president of marketing at Arteris IP. “You could optimize further by doing your own ASIC or put some stuff into an FPGA. You might still have some CPUs managing the high-level data control flow, but the processing elements are growing in number. They are becoming more complex. And if you look outside the data center, they are becoming more heterogeneous architectures.”

At the root of this shift are the laws of physics, which are fixed. “We are hitting limits in what can be achieved by RISC and CISC architectures,” warns Gordon Allan, Questa product manager at Mentor, a Siemens Business. “Today the issue is programmability versus conventional logic, with custom logic implementing common functions and smart interconnect keeping that all together, rather than software putting together micro-operations to form an algorithm.”

This certainly doesn’t mean the CPU will disappear or stop improving. But the job of CPU architects has become much more difficult. “The general-purpose CPU architecture and microarchitecture will continue to evolve and be efficient enough for most tasks, keeping design and ecosystem costs as well as complexity at sustainable levels,” says Tim Whitfield, vice president of strategy in Arm’s Embedded & Automotive Line of Business.

One of the largest barriers to change is programmability. “Programmability, or the lack of it, does not have the biggest impact on overall efficiency,” says Russell Klein, HLS Platform program director at Mentor. “Traditional CPUs, DSPs, many-core CPUs, and FPGAs are all programmable, but have vastly different efficiencies and different levels of difficulty of programming. Programmability introduces some inefficiency, but parallelism is far more impactful. DSPs gain efficiency over CPUs by having task specific capabilities. GPUs also have domain specific computational units, but also introduce parallelism. FPGAs and ASICs benefit from parallelism even more.”

Breaking free of old software paradigms will be tough. “The industry has transitioned through application programming to accommodate a wider array of silicon,” says Allan. “This has led to new software ecosystems and new APIs, but all of that is just building more and more layers on top of the technology. This is all trying to compensate for a processor that has hit its performance limits, its low power limits, and we need to break free of that and do something different.”

“This situation is swinging back from software development to hardware, as real power saving can occur only with the appropriate hardware,” says Yoan Dupret, managing director and vice president of business development for Menta. “This is ultimately pushing for heterogeneous chips with high flexibility. This was predicted by Dr Tsugio Makimoto, and today we are entering the ‘highly flexible super integration’ era.”

Improving the CPU

Perhaps the CPU should be viewed in Mark Twain’s terms: “The reports of my death are greatly exaggerated.”

There are several directions that CPUs can take. One is to add special instructions tailored to specific functions. “If you are programmable and add instructions, you are adding complexity into the hardware,” warns Martin Croome, vice president of business development at GreenWaves Technologies. “If low power is required, every transistor leaks, and this is bad. It adds both cost and power. You have to be very careful about what you are adding in terms of instructions, and whether you will get the benefit from them.”

The trend has been to optimize processors. “As more gates became available to processor designers, they added more and more functions to accelerate the single-threaded programs running on them,” says Mentor’s Klein. “They added things like speculative execution, branch prediction, register aliasing, and many more. But these functions had diminishing returns.”

In many cases, the integration of capability can yield benefits. “Throughout the history of processing we have seen many iterations of accelerators being added alongside a general-purpose CPU,” explains Arm’s Whitfield. “Often, what started life as an accelerator is integrated back into the general-purpose CPU through the architecture and microarchitecture. Examples of this cycle include floating point and cryptographic accelerators.”

The other direction is to take things away. “We are seeing RISC-V as a new and better approach that helps us overcome this to some extent,” says Microsemi’s Ruparelia. “A ground-up, optimized architecture is providing us with more ways to overcome the challenges posed by the end of Moore’s Law. If silicon scaling ends and silicon-related performance no longer comes for free, like clockwork, the answer is that you have to optimize all layers of the stack — the CPU, domain specific architectures, the toolchain. Even compilers need to become application-specific.”

And processors have to be cognizant of the end-product goals. “Comparing the energy efficiency of different processors shows that algorithms can be executed using the least amount of energy (think joules per computation) on relatively simple processors,” adds Klein. “Bigger processors will perform the work faster, but they are much less efficient. It is much more power-efficient to run an algorithm on several small processors, in parallel, than on one big one. As more simple cores are added, both voltage and clock frequencies can be scaled back, further improving efficiency.”
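Klein’s argument follows from the standard dynamic-power relation P ≈ C·V²·f: splitting the work across several slower cores lets both voltage and frequency drop, so total energy falls even when wall-clock time does not. A rough back-of-the-envelope sketch (the capacitance, voltage, and frequency figures below are illustrative assumptions, not measurements of any real part):

```python
# Illustrative energy comparison: one fast core vs. several slow cores.
# Dynamic power scales roughly as P = C * V^2 * f; assume voltage can be
# scaled down along with frequency (a common DVFS approximation).

def dynamic_energy(capacitance, voltage, freq_hz, cycles):
    """Energy in joules to execute `cycles` clock cycles of work."""
    power = capacitance * voltage**2 * freq_hz   # watts
    runtime = cycles / freq_hz                   # seconds
    return power * runtime

WORK = 1e9   # total cycles of work in the algorithm
CAP = 1e-9   # effective switched capacitance (farads), illustrative

# One big core at 2 GHz, 1.0 V.
single = dynamic_energy(CAP, 1.0, 2e9, WORK)

# Four small cores at 500 MHz, voltage scaled to 0.6 V;
# the work splits evenly, so each core runs WORK / 4 cycles.
quad = 4 * dynamic_energy(CAP, 0.6, 500e6, WORK / 4)

print(f"single core: {single:.2f} J")   # 1.00 J
print(f"four cores:  {quad:.2f} J")     # 0.36 J
```

Note that both configurations finish in the same 0.5 seconds here; the energy saving comes almost entirely from the V² term, which is why voltage scaling is the lever that parallelism unlocks.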

Optimized compute engines

The creation of cores for specific tasks has been a successful strategy. “A programmable DSP is an ideal candidate for offloading advanced applications from the CPU,” says Lazaar Louis, senior director of product management, marketing and business development for Tensilica IP at Cadence. “DSPs offer both flexibility and programmability. Support for open-standard cross-platform acceleration like OpenVX and OpenCL allows for easy porting of applications to DSPs. For certain applications that are well known, DSPs also can be paired with specialized hardware accelerators to benefit from the higher power efficiency of the accelerators and the DSP’s programmability to meet the evolving needs of the application over the product’s life.”

Many architectures have failed because they did not provide a robust software development environment. “The GPU is a classic example where the benefits can sustain the cost of maintaining a separate development and software ecosystem,” says Whitfield. “It appears that some machine learning algorithms will require specialist accelerators and will persist as a coarse-grained acceleration engine alongside a general-purpose CPU.”

The GPU has taken an interesting path. “The GPU is a domain-specific architecture, which was more oriented toward gaming and then got re-used for blockchain and neural networks,” points out Ruparelia. “Optimizing domain specific architectures gives you the advantage of both higher compute performance per unit and better software productivity, by enabling certain functionality that would be very hard to enable on a traditional CPU. Neural networks are a classic example, where if you try to run on a CPU it will take 10X the time and 10X the power compared to a programmable platform designed for the intended function.”

But the GPU is not optimized for neural networks. “Within a convolutional neural network, 80% of the time is spent doing convolution,” says GreenWaves’ Croome. “Convolutions come in all shapes and sizes. You have concepts such as padding and dilation and stride and the size of the filter, etc. There are many parameters to a convolution. If you try to build something that can do all convolutions in hardware, it will use a lot of hardware and will only cover what is known today. You have to take the most common convolutional parameters and build something that still retains enough flexibility.”
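The parameter space Croome describes shows up directly in the standard output-size formula for a convolution; each of padding, dilation, stride, and filter size changes the loop bounds that fixed-function hardware would have to cover. A small sketch of the relationship along one spatial dimension:

```python
def conv_output_size(in_size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a convolution.

    Standard formula used by most frameworks:
      out = floor((in + 2*p - d*(k - 1) - 1) / s) + 1
    """
    effective_kernel = dilation * (kernel - 1) + 1  # span of the dilated filter
    return (in_size + 2 * padding - effective_kernel) // stride + 1

# A 3x3 filter with stride 1 and padding 1 preserves spatial size:
print(conv_output_size(224, kernel=3, stride=1, padding=1))  # 224
# Stride 2 halves it:
print(conv_output_size(224, kernel=3, stride=2, padding=1))  # 112
```

Four independent knobs per dimension, times the range of filter sizes in common networks, is what makes an all-in-hardware implementation so expensive, and why accelerator designers hard-wire only the most common cases.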

So can anyone afford a complete software development environment for custom accelerators?

“We are writing optimized kernels, and we do some hand coding for vectorized stuff,” continues Croome. “We use standard vector operations, but even then, when you are writing code at that level, you are thinking about register loading, you are thinking about how to optimize code such that the compiler will target it in a very specific way.”

This is where it starts to get difficult. “Training a neural network using a bank of GPUs or CPUs and then running it on a GPU is very accessible,” says Gordon Cooper, product marketing manager in the Solutions Group at Synopsys. “People can go into Caffe or TensorFlow and do that. When we talk about going to specialized hardware to take advantage of embedded requirements, such as low power or small area, a GPU will give you performance. But it will not provide the power savings. The downside is that using a heterogeneous approach, whether you call it an accelerator or a dedicated processor, there is a different toolchain, or multiple toolchains, that you have to learn and manage, and this is not as simple as programming for a GPU.”

It is a delicate balance. “Too flexible means you will not win on power and area, which is the advantage over a GPU, but if you are not flexible enough and not programmable or easy to use, then that can knock you as well,” adds Cooper. “It will never be as simple as writing code for a CPU. You are trying to optimize things in a similar way that you would in the DSP world, where you might write in C and then optimize the inner loops. There is a balance.”

Changing the hardware

The FPGA has long held promise as a programmable hardware element. “Hardware RTL engineers can use the FPGA as a programmable platform, and they have no problems,” says Ruparelia. “It is when software engineers try to use FPGAs as a programmable platform that the problems arise. That has been a challenge for the industry for a long time.”

Today, FPGAs are being embedded into ASICs as well. “The usage of eFPGA IPs as part of heterogeneous architectures really lies in the architecture definition and code partitioning,” says Menta’s Dupret. “HLS tools can help with this, but the ultimate goal is to have a way to automate code partitioning for heterogeneous architectures. We are not quite there yet, but we are certain this is the direction in which the industry is moving.”

This may become an important part of hardware development for the IoT. “How do we make sure that IoT devices are flexible and field upgradeable?” asks Allan. “A combination of SW and smart FPGA technology may be required. They are all pieces of the solution in a post CPU world. We are talking about relying less on the HW/SW interaction to define the product and relying more on compiled logic and compiled memory and compiled programmable fabrics to define flexible products.”

And this may mean taking a different view of software. “The FPGA toolchain still has not enabled a software engineer to directly program FPGAs without knowledge of them,” points out Ruparelia. “I do not see that changing too much in a generic manner. What I do see is making it easier to use in a domain-specific or application-specific manner. For a neural network, we are working on very specific middleware that abstracts away the FPGA complexity and retains the flexibility. It offers that to the upper layers.”

Another part of the architecture under pressure is memory. “More of the available memory is being deployed within the hardware accelerators,” says Shuler. “The less they have to go off-chip to a DRAM or HBM2, the better the efficiency. How do we keep all of the data within the processing elements and passed between them? Sometimes they will have their own scratchpad memory, and sometimes they will be connected to a separate memory within a mesh array, in which case it would be non-coherent. The memory is becoming sprinkled throughout the architecture.”

“With silicon and processors, we evolved multi-level cache architectures where we had content addressable memory being a key technology to control that optimization,” explains Allan. “Then we moved towards coherent cache architectures where multiple processors collaborated around a shared memory space. Now as we enter an age where we are introducing neural networks into the fabric of compute solutions, memory is a key element. Memory will continue to evolve, and we will see some new paradigms emerge. We will see HLS evolve to allow for customized memory architectures to help accelerate particular algorithms. There is a lot of innovation possible in that space to spot the algorithm being fed into HLS flow and to optimize the solution with smart memory techniques.”

At the extreme end, farthest from CPUs, are dedicated hardware solutions. “It is the single-threaded programming model that is really limiting,” points out Klein. “Moving an algorithm from a CPU to a bespoke hardware implementation without introducing any parallelism would improve efficiency, but not by as much as one would expect. The real key to improvement is more about identifying and exploiting the parallelism in the algorithm.”

Ultimately, that requires a change in the software paradigm, one that pushes designers away from solving problems in a serial fashion.
