In an optimized system, no component is waiting for another component while there is useful work to be done. Unfortunately, this is not the case with the processor/memory interface.

Put simply, memory cannot keep up. Accessing memory is slow, and it can consume a significant fraction of the power budget. And the general consensus is that this problem is not going away anytime soon, despite efforts to make memory faster and less power-hungry.

“There are charts that show how bad the memory bottleneck is getting,” says Steven Woo, fellow and distinguished inventor at Rambus. “If you can avoid going to memory you should. That is just good standard practice. If you can avoid going to the network, you should. If you can avoid going to disk, you should. But when you look at applications like AI and the growth of the network sizes that people are wanting to implement, they are growing faster than other technology curves can keep up. The only recourse is to use DRAM. Of course, people want to avoid using memory, but I don’t see how that is practical for the tougher, more demanding networks.”

The problem is that as soon as you are forced to stop using SRAM, which is integrated on-chip, performance and power quickly become problematic. Studies have shown that when analyzing the energy consumed by a simple mathematical operation like a multiply or add, a large fraction of the power is spent setting up the computation.

“If you go off-chip to DRAM to get the data, a lot of energy is spent doing that and moving the data around, then across the chip,” says Woo. “That can be 95% of the power. What this tells you is that it is an energy problem that we have, and this is why everyone wants to bring computation and data closer together.”
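The imbalance Woo describes can be made concrete with a little arithmetic. The energy figures below are illustrative ballpark assumptions (per-operation energies vary widely by process node and design), not numbers from the article, but they show why the data movement, not the arithmetic, dominates the budget:

```python
# Illustrative ballpark energies in picojoules -- these values are
# assumptions for the sake of the calculation, not measured figures.
ENERGY_PJ = {
    "32bit_add":   0.1,     # the arithmetic operation itself
    "sram_access": 10.0,    # fetching an operand from on-chip SRAM
    "dram_access": 1300.0,  # fetching an operand from off-chip DRAM
}

def movement_fraction(source):
    """Fraction of total energy spent moving data rather than computing."""
    move = ENERGY_PJ[source]
    return move / (move + ENERGY_PJ["32bit_add"])

print(f"SRAM-fed add: {movement_fraction('sram_access'):.1%} on data movement")
print(f"DRAM-fed add: {movement_fraction('dram_access'):.1%} on data movement")
```

With numbers anywhere in this neighborhood, the off-chip case puts well over 95% of the energy into movement, which is the figure Woo cites and the motivation for bringing compute and data closer together.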

Memory and standards bodies are not standing still. “Now is an interesting time for memory,” says Vadhiraj Sankaranarayanan, technical marketing manager for Synopsys. “We have LPDDR5, which was released by JEDEC earlier this year, and the new DDR5 will be released shortly. These memories are taking speeds to a higher level than their predecessors. Both LPDDR5 and DDR5 will have a max speed of 6,400Mbps, and that is a considerable speed increase. The memory standards are evolving both to increase performance and, architecturally, to address reliability, availability and serviceability (RAS), which is directly tied to the robustness of the channel. Some of the new low-power features also will bring system power down.”

Moving memory closer to processing

Shorter wires have lower capacitance, which helps both performance and power. “The goal is to minimize the overall latency,” says Karthik Srinivasan, senior product manager for ANSYS. “That makes the memory appear as close to the compute as possible. For that, the best we have so far is high-bandwidth memory (HBM), which actually brings the memory into the package rather than having them 5 or 6 cm apart. With HBM, memory is a couple of millimeters away from the compute. The next best thing is to have on-chip memory, which offers higher bandwidth and very minimal latency.”

But you may have to work hard to make the application fit into on-chip memory. “To overcome the memory bottleneck, you have to make partitioning decisions,” says Farzad Zarrinfar, managing director of IP at Mentor, a Siemens Business. “You have to make decisions about what you want to do in an embedded fashion and what will use external memory. We believe that if you can minimize data movement, or reduce the distance data has to move, you improve your performance and minimize power.”

There are always tradeoffs. “HBM is the clear winner for devices that need very high bandwidth at the lowest energy per bit, and that need to achieve it in a constrained amount of PCB area,” says Marc Greenberg, group director for product marketing in the IP Group of Cadence. “There are other metrics where HBM does not stack up so well. You can expect all of the DDR technologies to be lower in cost than HBM.”

Synopsys’ Sankaranarayanan adds a few more advantages that come with HBM. “With HBM you save a lot of area because you do not need as many PHYs or DRAMs as you would with GDDR to get the same bandwidth. HBM also has very good power efficiency compared to GDDR. So HBM provides a lot of bandwidth, better power efficiency and area efficiency, but the important thing is that it requires an interposer to integrate the SoC and the HBM. That makes it more costly.”

The incorporation of an interposer is not a decision made lightly, and the industry is still in the learning phase for several aspects of it. “How stable is the overall structure?” asks ANSYS’ Srinivasan. “What is the overall reliability? When you have thermal cycling, how will that impact warpage and fatigue? Organic substrates are much thicker and more stable, but we are looking at much thinner silicon, especially when stacking multiple dies. As you put more compute and more memory into a smaller form factor, the power density is higher, which impacts thermal, which in turn impacts fatigue, warpage, etc. There is a stated need in the industry to look at the structural aspects of these multi-die systems.”

Architecting the right memory

Von Neumann established the architecture that we use for computing systems. It was simple, scalable and flexible. But today, every decision has to be re-examined, and it may not offer the best solution for all problems.

This is particularly true as more power states and use cases are designed into devices, which can include non-volatile memories such as flash, MRAM and phase-change memory, as well as volatile memories such as DRAM and SRAM.

“We had a customer recently working on a Bluetooth Low Energy application where they were constantly accessing the device in read mode,” says Paul Hill, director of marketing at Adesto Technologies. “They’re fetching the software from the device, flowing that into the cache and executing the code. For that reason, they have a power consumption issue in read mode. But occasionally, when the BLE device goes dormant, they want to turn off the chip’s memory device and go into an ultra-low-power mode. The problem with that is the ultra-low-power mode has a longer wake-up time, so when the BLE device goes active again there is a longer latency before they can get the next read instruction. We have to take that into account. We have different power modes that our memory device can operate in. There is standby mode, low-power mode and ultra-low-power mode. The customer can then determine what mode is appropriate.”
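The mode choice Hill describes is a break-even calculation: a deeper sleep mode saves idle power but charges a wake-up penalty, so it only pays off when the dormant period is long enough. The sketch below models that tradeoff with invented currents and wake-up times (not from any datasheet) for the three modes named above:

```python
# Hypothetical power modes for a memory device, loosely modeled on the
# scenario above. All currents and wake-up times are invented for
# illustration -- real devices differ.
MODES = {
    #  mode name          idle current (uA)  wake-up time (ms)
    "standby":         {"idle_ua": 100.0, "wake_ms": 0.01},
    "low_power":       {"idle_ua": 15.0,  "wake_ms": 0.1},
    "ultra_low_power": {"idle_ua": 1.0,   "wake_ms": 5.0},
}
ACTIVE_UA = 10_000.0  # current drawn while awake, paid during wake-up

def charge_uc(mode, dormant_ms):
    """Charge (microcoulombs) for one dormant period plus the wake-up."""
    m = MODES[mode]
    return (m["idle_ua"] * dormant_ms + ACTIVE_UA * m["wake_ms"]) / 1000.0

def best_mode(dormant_ms):
    """Cheapest mode for a dormant period of the given length."""
    return min(MODES, key=lambda m: charge_uc(m, dormant_ms))

for ms in (1, 50, 5000):
    print(f"dormant {ms:>5} ms -> {best_mode(ms)}")
```

With these assumed numbers, short gaps favor standby, medium gaps favor low-power mode, and only long dormant stretches justify the ultra-low-power mode's wake-up latency, which is exactly the determination Hill says the customer has to make.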

That also has an impact on where exactly the data is being stored, which has been a long-running problem in the memory world.

“Data locality is an issue that the industry has struggled with for many years, and it is not completely solved,” says Cadence’s Greenberg. “When people are doing dedicated hardware, they can control the locality of data a little better by matching the memory management to the application. They may be able to organize data to all be in the same page of memory, which would reduce the amount of power used. Also, you want to organize data in a way that, if you request a burst of data from the memory, you actually are able to use all of the data requested. Some things, like cache line fills and evictions, do that very well. Video data is sometimes poor at that.”
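The page-locality effect Greenberg mentions can be illustrated with a toy model. Each time an access lands outside the currently open DRAM page (row), the device must perform a row activation, which costs time and energy. The page size and single-bank, open-page policy below are simplifying assumptions for illustration:

```python
# Toy model of DRAM page (row) locality. Accesses that stay within the
# open page are cheap; crossing into a new page forces a row activation.
# Single bank, open-page policy, 2KB page -- all assumptions.
PAGE_BYTES = 2048

def count_activates(addresses):
    """Count row activations for an in-order byte-address stream."""
    activates, open_page = 0, None
    for addr in addresses:
        page = addr // PAGE_BYTES
        if page != open_page:
            activates += 1
            open_page = page
    return activates

# 64 cache-line (64B) reads, packed back-to-back vs. scattered far apart.
sequential = range(0, 64 * 64, 64)
scattered  = range(0, 64 * 64 * 100, 64 * 100)

print(count_activates(sequential))  # 2  -- accesses share pages
print(count_activates(scattered))   # 64 -- one activation per access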

It is this problem that caused GDDR to be created. “If you look at the graphics industry, they are a great example of CAS granularity,” adds Woo. “The application has a natural granularity of data that it wants, and giving it more than that is detrimental. It wastes resources, it destroys caching algorithms, etc. Graphics wants 32-byte accesses. A couple of times the industry has tried to make a graphics memory with 64-byte access granularity, and both times the future standards went back to 32-byte access granularity. With GDDR6, compared to GDDR5, instead of having all 32 bits of the DRAM interface dedicated to a single request, we split it up into two 16-bit interfaces and treat them almost like two separate DRAMs. Because each is now half the width, you can afford to dump out twice as many bits on those wires and still get the same column granularity. So it is 16 bits wide, but it is twice as deep, and that helps the design of the core. It more closely matches the DRAM desire to dump out more data per request on each wire, with the more natural granularity wanted by the application.”
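Woo's "half the width, twice as deep" point is just the product of interface width and burst length. The sketch below works through that arithmetic using the burst lengths defined in the JEDEC standards (BL8 for GDDR5, BL16 per channel for GDDR6):

```python
# Access granularity = (interface width in bytes) * (burst length).
# GDDR6 halves the channel width but doubles the burst length, so each
# of its two channels preserves the 32-byte granularity graphics wants.
def access_granularity_bytes(width_bits, burst_length):
    return width_bits // 8 * burst_length

gddr5 = access_granularity_bytes(32, 8)    # one 32-bit channel, BL8
gddr6 = access_granularity_bytes(16, 16)   # each 16-bit channel, BL16

print(gddr5, gddr6)  # 32 32 -- same granularity, but GDDR6 serves
                     # two independent requests at once
```

The payoff is that GDDR6 can have two requests in flight, one per 16-bit channel, while each still delivers exactly the 32-byte column granularity the application wants.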

Is it possible that AI will have specific needs in terms of channel width, and that eventually a new memory standard will be optimized for this application? “Mobile phones used to use DDR, and then when volume got high enough, and the needs became different enough from the mainstream, it warranted a new standard,” Woo says. “The same happened with graphics. Early graphics used DDR, and eventually there was enough volume and demand to develop a new type of memory. Something similar could happen for AI. The question is, as the needs for that community evolve, does HBM no longer meet those needs? And do you need something even more specific? The things that will determine that are the longevity of the demand for this type of use case. If there is money available and enough volume, that could motivate enough DRAM manufacturing capacity to build to a new standard. Historically, we have seen that the market will come in and build the new standard around it.”



Fig. 1: Common memory systems for AI applications. Source: Rambus

Driving memory faster

All of the memory interfaces reside under the standards umbrella of JEDEC, an organization that has existed since 1958. As previously noted, JEDEC is actively advancing all of the memory standards. “There are four current and next-generation standards that are all under active development,” explains Greenberg. “GDDR6 is out, and we expect faster GDDR6 parts over time. HBM2E has been out for a short period of time, and there was an announcement in September about a higher speed grade of device that is in excess of the JEDEC standard. We have seen announcements from memory vendors about their plans for DDR5. We are in the early days of that standard, so we can expect that to get faster over time. And the LPDDR5 standard has the potential for a mid-life extension to its frequency range.”

Signal integrity (SI) and power integrity (PI) are often the limiting factors for how fast the interface can operate. “DDR is architected for module-based memory systems that are found in servers and PCs,” says Woo. “That bus topology is not as clean from a signal integrity perspective, because you have to go through connectors and discontinuities. That is a reason why it is harder to scale the speeds so quickly. But if you take a look at HBM or GDDR, where the memory is soldered straight to a board, it is a much cleaner interface and doesn’t have to go through a connector. That makes it a little easier from a signal integrity standpoint, at least up until now, to ramp the speed. The physical implementation is cleaner.”

Still, a lot of analysis has to be performed. “SI simulations are used to model the channel, along with the power delivery network, in order to ensure the fidelity of the signals from the drivers to the receiver is actually maintained,” says Srinivasan. “You also need to consider a significant amount of coupling or loss because of power noise or the coupling between various interconnects. One of the biggest challenges is the simulation capacity. For HBM you are looking at a 128-bit channel for each stack. You have to simulate the entire signal traces and this traverses from one die to another using through silicon vias (TSVs) down to the interposer traces, across to the parent logic die, along with all of the power delivery network.”

The standards also build in advanced capabilities to ensure reliable communications. “The intent of every standard is to offer a higher speed at a lower I/O voltage,” says Sankaranarayanan. “We do not want to lose sight of power. So you are increasing the speed and bringing down the voltage, and that is accomplished through electrical and architectural features. From the electrical point of view, a new feature in LPDDR5 is decision feedback equalization (DFE). What this does is open up the eye for the write data. As the data is sent by the PHY, the DRAM captures that, and the DFE on the front end opens up the margin of the eye. So as the sampler reads the data, you have a higher probability of capturing it correctly. It is quite common for the controller PHY to have DFE for the read data. As the speeds increase, these measures taken on the channel allow us to operate with higher reliability.”

Conclusion

For many applications, the processor/memory bottleneck is here to stay, although it will improve at times and get worse at others as standards evolve. High-volume applications do have the ability to introduce new memory architectures and interfaces, as has been witnessed a couple of times already for mobile and graphics. An interesting question for the future is whether AI/ML, or new non-volatile memories, will bring about new memory standards.
