For systems to become faster and consume less power, they must stop wasting the power required to move data around and start adding processing near memory. This approach has been proven, and products designed to fill a number of roles are entering the marketplace.

Processing near memory, also known as computational memory, has been hiding in the shadows for more than a decade. Ever since the introduction of flash memory, powerful processing capabilities have been required in the memory just to make it work. Wear leveling and garbage collection are just two of the functions being performed.

But a lot more processing could be done in the memory. Consider a system that is meant to be secure. Why not process the data in the memory, where the data is encrypted, rather than having to decrypt it and transfer it across the bus where it could be intercepted? Why not perform searches on large amounts of data in the memory, transferring only the likely matches for more in-depth processing?

A recent Arm whitepaper notes that “solid state storage devices (SSDs) already have a large amount of DRAM, typically >1GB per terabyte (TB) of NAND. In terms of the overall power and cost of the components in an SSD drive, the requirement of the processing enables computational storage to be deployed without significantly increasing power or cost. Some devices may rely on the processing that is already available—either working on background tasks or when the drive is less loaded, for example, overnight. For more computational storage performance, additional processing can be added to the drive.”

Flash plus computation

One such example is provided by NGD Systems. “We have an NVMe-attached SSD, which is attached to the PCIe bus,” says Scott Shadley, vice president of marketing for NGD Systems. “We said, ‘Let’s do more with that storage, especially as the devices are getting larger. I want to be able to manipulate data in place and provide value to the user about what that data actually represents.’ We provided the ability to move process execution into the storage device.”



Fig. 1: Inside an NGD SSD. Source: NGD Systems

Consider that you have a haystack of information, and what you really want to find is the needle hiding within it. “We allow the user to process the haystack inside the storage device and just provide the needles to the host processors,” adds Shadley. “It is much more efficient, has lower latency, and frees up bandwidth. Networking, connectivity, buses—those are the limiters, not the processing power, not the size of storage. It is moving the data. We allow users to look through long-term stored data in a way they never have been able to in the past. If you have millions of pictures stored across several drives of data, you may know roughly where each of those pictures is, but you still have to do an algorithmic search of that data to get the ones you really want back.”
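The traffic savings of this filter-in-storage model are easy to quantify. Below is a hedged sketch in Python; the record size, predicate, and function names are illustrative assumptions, not NGD's actual interface.

```python
# Illustrative model: filtering on the host vs. filtering inside the drive.
# RECORD_SIZE and the predicate are assumptions for the sketch.

RECORD_SIZE = 4096  # bytes per stored record (assumed)

def host_side_search(records, predicate):
    """Conventional flow: ship every record across the bus, filter on the host."""
    transferred = len(records) * RECORD_SIZE
    matches = [r for r in records if predicate(r)]
    return matches, transferred

def in_storage_search(records, predicate):
    """Computational-storage flow: filter inside the drive, ship only matches."""
    matches = [r for r in records if predicate(r)]  # runs on the drive's CPU
    transferred = len(matches) * RECORD_SIZE
    return matches, transferred

records = list(range(1_000_000))
needle = lambda r: r % 10_000 == 0       # 1-in-10,000 records match

_, host_bytes = host_side_search(records, needle)
_, drive_bytes = in_storage_search(records, needle)
print(f"host-side transfer:  {host_bytes:,} bytes")
print(f"in-storage transfer: {drive_bytes:,} bytes")
```

Both paths touch every record, but the bus traffic drops by the selectivity of the filter, which is the point Shadley is making: moving the data, not computing on it, is the bottleneck.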

This requires a change of mindset and programming model, and some companies are taking it one step at a time and extending memory devices with additional predefined behaviors.

“Memory is generally regarded as a necessary commodity technology that is bought solely based on the dollar per GB cost,” says Amr Elashmawi, vice president of Cypress’ Memory Product Division. “However, in industrial, medtech and automotive, ‘dumb’ commodity memory is simply no longer sufficient. I am not just storing data. There has to be a hardware root of trust. For example, you do not want transactions to be impersonated. You do not want people to be able to tamper with the memory. You do not want to enable them to do a rollback to a previous version, or to a version they have inserted into the system. There are a host of different things that are dependent on the memory itself. We can perform various functions in the memory, such as security and functional safety, and potentially things like artificial intelligence.”



Fig. 2: Cypress Semper NOR flash memory architecture. Source: Cypress Semiconductor

Some of the functions may seem basic, but others are vital for a secure, reliable system. “I can add security and add cryptography, securely boot the processor, store the keys in the memory, and protect them,” adds Elashmawi. “I can encrypt the image on the flash. I can be using it as a storage device, booting the processor, but when it is in a working mode I can use the memory to monitor functional safety. I can use it to do a secure boot. I can do some offload of AI processing in the memory because it is more efficient to do it there. As a user, I can decide where I want to do things, and that depends on the architecture.”

DRAM plus computation

Processing near memory is not restricted to flash drives. UPMEM is attempting to make this technology even more pervasive. “We put processors in DRAMs,” says Gilles Hamou, CEO at UPMEM. “For data intensive operations, as found in genomics or database applications, we get acceleration in the range of 20X and energy efficiency gains in the range of 10X. We put thousands of cores into a single server. We have terabytes of data bandwidth on a single server, and so we end up being much more efficient.”



Fig. 3: Adding processing to DRAMs. Source: UPMEM

We normally think of the DRAM process as being highly optimized for the memory cell, and of DRAM as a commodity product. “It is fairly low cost because the insertion of logic is a small increase in the size of the DRAM,” adds Hamou. “However, it is not easy. The processor is not as good as that available with an ASIC, but that is compensated for by the reduced data movement. Moving data takes thousands of picojoules and doing an operation is in the range of tens of picojoules. So, you can afford to be less good. We can also access the memory with finer granularity. We can go from 8 bytes to 2KB of granularity, and our 1GB/s bandwidth is efficient when we are talking about irregular accesses. The more irregular the access patterns, the more our relative performance increases.”
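Hamou's energy figures can be sanity-checked with simple arithmetic. The exact per-word costs below are illustrative assumptions within the ranges he quotes:

```python
# Back-of-the-envelope energy comparison: move-then-compute vs. compute in place.
# The per-word figures are assumed values inside the quoted ranges.
PJ_MOVE_PER_WORD = 2000   # "thousands of picojoules" to move a word to the CPU
PJ_OP_IN_MEMORY  = 20     # "tens of picojoules" per in-memory operation

words = 1_000_000         # a simple scan touching each word once

conventional = words * (PJ_MOVE_PER_WORD + PJ_OP_IN_MEMORY)  # move, then compute
in_memory    = words * PJ_OP_IN_MEMORY                       # compute in place

print(f"conventional: {conventional / 1e6:.0f} uJ")
print(f"in-memory:    {in_memory / 1e6:.0f} uJ")
print(f"energy ratio: {conventional / in_memory:.0f}x")
```

Even if the in-memory processor needed several operations where an ASIC needs one, the two-orders-of-magnitude gap on data movement still dominates, which is why "you can afford to be less good."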

New memory types with computation

There are several new memory types being developed, but one specifically is ideal for this type of application — ReRAM. “It is a CMOS back-end of line (BEOL) technology,” says Sylvain Dubois, vice president of business development and strategic marketing for Crossbar. “This means we can integrate the ReRAM element between the metal routing layers of the CMOS. You can integrate this with any CMOS, so you go to the big foundries and you can integrate that memory space on top of the controllers.”



Fig. 4: Integrating logic and ReRAM. Source: Crossbar

Being monolithic means that you can provide huge bandwidth between the memory and compute. “We have been showcasing a 50GB/s interface from logic for inference, object detection, face recognition directly connected to the memory,” adds Dubois. “If you compare that with external DRAM memory, you will see that it is more like 3GB/s. Nothing prevents companies from instantiating multiples of these macros so they can get however much they need to fuel the compute with a massive amount of data. This is crucial to AI and inference, where you have to react to some context or environment.”
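Dubois' bandwidth figures translate directly into time-per-pass arithmetic. A rough model follows; the 200MB weight-set size is an illustrative assumption, not a Crossbar figure.

```python
# Time to stream one set of model weights into the compute logic at each
# of the two quoted interface speeds. Weight size is assumed.
WEIGHTS_BYTES = 200e6          # assumed model size: 200 MB
ON_CHIP_BW    = 50e9           # 50 GB/s, monolithic ReRAM interface (quoted)
DRAM_BW       = 3e9            # ~3 GB/s, external DRAM (quoted)

t_on_chip = WEIGHTS_BYTES / ON_CHIP_BW
t_dram    = WEIGHTS_BYTES / DRAM_BW

print(f"on-chip:  {t_on_chip * 1e3:.1f} ms per pass")
print(f"external: {t_dram * 1e3:.1f} ms per pass")
print(f"ratio:    {t_dram / t_on_chip:.1f}x")
```

For a workload that must re-read its weights every inference, that roughly 17x gap in streaming time is the difference between reacting in real time and not.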

But ReRAM can be taken even further than that into the realm of true in-memory processing. “We are investigating some of the more advanced flavors of this, where you are doing the processing in an analog fashion,” says Gideon Intrater, CTO at Adesto Technologies. “Using conventional approaches for a matrix that is 1,000 x 1,000, you have to go serially through and multiply elements in a vector by a column of the matrix and then add them all up. But if you have a way to do that in an analog fashion, you could do all of the multiplies instantly and reduce the complexity of the operation dramatically.”

Intrater explains how this works. “The weights in a system are stored in the ReRAM, where instead of having every bit of the weight stored in a separate cell of the memory, you store the whole thing in one cell and have the resistance be a linear function of the value you want to store. Then, if you pass current through the cell, the result will be the multiplication of the current you were driving times the resistance—Ohm’s law. From there, if you sum up a bunch of these currents—Kirchhoff’s current law—you get the sum of the multiplications. Doing that can provide you with a whole vector by column in a single operation—something that cannot be done in a digital fashion unless you have tons of multipliers in parallel. This is one of the most intriguing ways of doing this AI processing. This is really in-memory processing.”
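A digital simulation makes the analog scheme concrete. The sketch below uses the standard crossbar formulation, in which weights are stored as conductances, inputs are applied as voltages, each cell contributes a current by Ohm's law, and each bit line sums those currents by Kirchhoff's current law. All values, and the function itself, are illustrative.

```python
# Digital simulation of an analog crossbar vector-matrix multiply.
# In hardware every column's sum happens simultaneously in one "operation";
# here we just loop to show the arithmetic being performed.

def analog_mvm(voltages, conductances):
    """Per column: bit-line current = sum of cell currents I = V * G."""
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    currents = []
    for col in range(n_cols):
        # All cell currents on this bit line sum at the same node (Kirchhoff).
        i = sum(voltages[row] * conductances[row][col] for row in range(n_rows))
        currents.append(i)
    return currents

V = [1.0, 0.5, 2.0]                  # input activations, applied as voltages
G = [[0.1, 0.2],                     # weights, stored as cell conductances
     [0.4, 0.0],
     [0.3, 0.5]]

print([round(i, 6) for i in analog_mvm(V, G)])  # → [0.9, 1.2]
```

The 1,000 x 1,000 matrix Intrater mentions would need a million multiply-accumulates done serially, or a million digital multipliers in parallel; the crossbar gets the same result from one read of the array.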

Boosting overall performance

To reap the benefits of near-memory computing, some changes do have to be made at the applications level.

“In one project, we put our drives into one of our clients’ test platforms and ran it just as if it were a standard drive plugged into a slot, and executed the program,” says NGD’s Shadley. “We then turned on our compute engine and it reduced the amount of time to move the data by 5X to 6X. Unfortunately, the code was expecting some delay in the data gathering, so the processing time did not actually change. If the application isn’t rewritten to some extent, you may get benefit, but you will not get a lot of net benefit. So then they modified their code and got a 40X net improvement. This is where hardware and software folks have to talk. We provided a hardware solution, but the software will not see the value of it until they use it.”

The application has to be parallelized so that part of it can be farmed out to the computation in memory. “The memory plus processor becomes your compute engine, but it is not doing the full computation,” says Adesto’s Intrater. “It is accelerating a parallelizable portion of it, but you still typically need a microprocessor to do the stuff that cannot run in parallel. This is very similar to a general-purpose CPU and an accelerator that does a lot of the heavy lifting, but the processor still needs to do a lot of the pre- and post-processing to the workload. You can offload the work to a number of these accelerators. So there are tradeoffs, and we are a long way today from knowing where the dust will settle.”
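The serial-plus-accelerated split Intrater describes is governed by Amdahl's law: the fraction of the workload that cannot be offloaded caps the overall gain. A quick model with assumed numbers:

```python
# Amdahl's-law model of CPU-plus-accelerator offload. The 40x accelerator
# figure and the offloadable fractions are illustrative assumptions.

def overall_speedup(parallel_fraction, accel_speedup):
    """Serial part runs at 1x on the host; parallel part runs on the accelerator."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / accel_speedup)

for p in (0.50, 0.90, 0.99):
    print(f"{p:.0%} offloadable, 40x accelerator -> "
          f"{overall_speedup(p, 40):.1f}x overall")
```

With only half the workload offloadable, a 40x accelerator yields under 2x overall, which is why the pre- and post-processing left on the host, and the software work to shrink it, matter as much as the accelerator itself.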

There is investment and work required at the software level. “One key takeaway from the latest conference is that any hardware company has to invest and hire software engineers so that they can make use of the new hardware,” says Crossbar’s Dubois. “Many companies started as hardware companies and just produced chips, but now there is a huge wave of software new hires to make sure that the hardware will be efficiently used for the end applications.”

Today, each product uses a different processor and has different interfaces and APIs, making the solutions less plug-and-play. “It is a bit of the Wild West, just like it was when PCI SSDs came out,” admits Shadley. “Over time, the market created NVMe, which solved the problem. You will see that happen with persistent memory, in-memory processing, and in-storage processing. The markets and the people providing those solutions realize they need to take the Wild West out of it so that everyone can take advantage of it.”

The standards efforts have started. “The Storage Networking Industry Association (SNIA) has formed a Computational Storage Technical Working Group, but it is not a standards body yet,” says Ben Whitehead, storage specialist in the emulation division of Mentor, a Siemens Business. “There is a certain level of anxiety about making sure that standards do not constrain creativity too early. We can see where the industry is going, and we have to understand why certain things are so important to them. They know what they need and they are very demanding of us to provide the tools that they need.”

Arm explains that more than 40 companies are represented and working together to define relevant approaches for different types of computational storage. In most cases, the server system must be able to deploy workloads to the drive, and then invoke these workloads and receive results. However, dedicated standalone capabilities also have applications. Methods to provide computational storage drive services and capabilities are being developed to ensure the drives are standardized, and that drives from multiple vendors can be adopted and deployed.

When can we expect to see these start to appear in products? “The infrastructure does not need to change,” says Whitehead. “This is a big deal. It is not a huge uplift to a standard SSD to make it a CSD. A perfect storm of many things is coming together.”

Related Stories

Using Memory Differently

Optimizing complex chips requires decisions about overall system architecture, and memory is a key variable.

New Memory Options

Using data as the starting point for designs opens up new architectural choices.

In-Memory Vs. Near-Memory Computing

New approaches are competing for attention as scaling benefits diminish.

HBM2 Vs. GDDR6: Tradeoffs In DRAM

Choices vary depending upon application, cost and the need for capacity and bandwidth, but the number of options is confusing.

In Memory And Near-Memory Compute

Steven Woo, Rambus fellow and distinguished inventor, talks about how much power is spent storing and moving data.