The cost associated with moving data in and out of memory is becoming prohibitive, both in terms of performance and power, and it is being made worse by poor data locality in many algorithms, which limits the effectiveness of caches.

The result is the first serious assault on the von Neumann architecture, which made computers simple, scalable and modular. It separated the notion of a computational unit from a contiguous memory space that contained instructions and data. But machine learning is making some reconsider this direction, particularly as system design pushes memory progressively closer to computation. The shorter the wires, the faster the data can be moved and the less energy expended.

The question now is whether that transfer can be eliminated altogether. Placing computation inside memory runs counter to the von Neumann approach, but a growing number of people believe this direction is inevitable for machine learning designs. As with any new concept, there are significant barriers to overcome and plenty of naysayers. Still, it may represent one of the most promising directions for the future.

There is some confusion of terms when talking about in-memory computing. The term has been used for some time in the relational database and middleware worlds to mean keeping data in main memory rather than on disk. That is not what is being discussed in this article.

The new direction is to incorporate logic into the bit lines inside the memory itself. This could be done either in main memory or in cache, and most memory types could be considered, including DRAM, SRAM, MRAM and ReRAM. The downside is that the memory is fundamentally altered, which could have negative implications. If the same memory is used to hold both instructions and data, area and power are consumed when not required, meaning the solution would be inferior for non-machine-learning applications.

In-memory solutions

Most of this work is still in the research phase today.

“Researchers are focusing on the elimination of data movement altogether,” says Dave Pursley, product management director at Cadence. “If you read the academic papers, much of the research used to be about how to reduce the amount of computation. Now we are in a phase where we are looking at the reduction in data movement or improving locality so that you don’t need such massive amounts of storage and those very costly memory accesses in terms of power.”

Research teams are looking at several possible ways to do in-memory computation. Some deploy digital techniques, while others take a mixed-signal approach. Reetuparna Das, assistant professor in the EECS department at the University of Michigan, proposes a digital solution. Her technique stores data in a transposed format on the bit-lines of an SRAM array used for a cache. A small amount of logic is added so that each of the arrays within the cache can be used to accelerate deep learning algorithms. The computation is bit-serial. She initially utilized the technique for logic operations, then extended it to addition and multiplication. Her claim is that it can outperform a GPU. (“Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks,” by Charles Eckert et al., ACM/IEEE 45th Annual International Symposium on Computer Architecture, 2018.)
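To make the transposed, bit-serial idea concrete, here is a minimal software sketch (an illustration only, not the Neural Cache implementation): values are stored as bit-planes, so one "cycle" operates on the same bit position of every word in parallel, the way logic on the bit-lines would.

```python
# Illustrative sketch of bit-serial addition over transposed (bit-plane) data.
# Assumes 8-bit unsigned values; names and structure are invented for clarity.

def to_bitplanes(values, width=8):
    """Transpose integers into bit-planes: plane[i] holds bit i of every word."""
    return [[(v >> i) & 1 for v in values] for i in range(width)]

def from_bitplanes(planes):
    """Recombine bit-planes back into integers."""
    n = len(planes[0])
    return [sum(planes[i][j] << i for i in range(len(planes))) for j in range(n)]

def bitserial_add(a_planes, b_planes):
    """Full-adder logic applied one bit position per step, across all columns at once."""
    n = len(a_planes[0])
    carry = [0] * n
    out = []
    for a_bit, b_bit in zip(a_planes, b_planes):  # one step per bit position
        out.append([ai ^ bi ^ ci for ai, bi, ci in zip(a_bit, b_bit, carry)])
        carry = [(ai & bi) | (ci & (ai ^ bi))
                 for ai, bi, ci in zip(a_bit, b_bit, carry)]
    return out  # final carry dropped, as in fixed-width hardware

a, b = [3, 100, 200], [5, 27, 55]
result = from_bitplanes(bitserial_add(to_bitplanes(a), to_bitplanes(b)))
# result == [8, 127, 255]
```

The key property is that latency grows with the bit width, not with the number of words, because every column of the array computes simultaneously.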

Some of the new memory types offer an alternative solution. “We are attempting to calculate a dot product in the analog domain with bit-lines that connect into a resistive mesh,” says Engin Ipek, associate professor in the departments of computer science and electrical engineering at the University of Rochester. “ReRAM gives us a resistive memory that can be programmed across a range of resistances. You can calculate the dot product by mapping the coefficients of one vector in inverse proportion to the resistances of the mesh and applying voltages proportional to the coefficients of the second vector, such that the current drawn is proportional to the dot product.”

Fig. 1: In-memory processing concept. Source: DAC presentation/Mythic/Semiconductor Engineering

So how does the performance of these approaches compare to traditional methods? “They are using analog multiply/accumulate (MAC) inside the memory, and this is highly efficient,” explains Patrick Soheili, vice president of business and corporate development for eSilicon. “They deliver 10X or 100X better performance than an SRAM sitting next to a MAC in a typical semiconductor process.”

The thought of analog computation would appear to be a return to the past. “A lot of the challenges associated with these techniques are associated with the precision or the signal-to-noise ratios that are required, and these are unachievable any time in the near future,” says Ipek. “But you can accept the devices for what they are and deal with this at the architecture level. Can I design an architecture that works around the imperfections? We believe the answer to that is yes. In many cases you don’t need more than one bit of precision. The nice thing is that we are dealing with linear algebra, which means I can take a matrix and split it into bit slices. Then I repeat an operation for each of those slices and, using linearity, put everything back together, so we only require the memories to work as binary switches.”
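The bit-slicing trick works because the matrix-vector product is linear: a matrix of 4-bit weights can be split into four binary matrices, each evaluated on a crossbar that only needs to store 0 or 1, and the partial results recombined by powers of two. A small sketch (the crossbar here is simulated in software; the structure is an illustration, not Ipek's actual architecture):

```python
# Bit-slicing an integer matrix into binary slices, each suitable for a
# crossbar holding only binary switches, then recombining exactly via linearity.

def matvec(m, v):
    """Reference matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def bitsliced_matvec(m, v, width=4):
    # Each slice holds one bit of every weight, so the array only needs
    # binary states; results are shifted and summed to reconstruct the answer.
    total = [0] * len(m)
    for bit in range(width):
        slice_m = [[(w >> bit) & 1 for w in row] for row in m]
        partial = matvec(slice_m, v)            # one binary-crossbar operation
        total = [t + (p << bit) for t, p in zip(total, partial)]
    return total

m = [[5, 3], [2, 7]]   # 4-bit weights
v = [1, 2]
assert bitsliced_matvec(m, v) == matvec(m, v)   # exact reconstruction: [11, 16]
```

The cost is one crossbar operation per bit of weight precision, traded against never needing a multi-level analog cell to behave precisely.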

Progress is being made to bring some of these solutions to market. “You have companies like Mythic that are doing it in flash, and there are a bunch of startups that are not yet out of stealth mode using in-memory based on SRAM,” says Soheili. “There are MRAM and other novel types of architectures and materials also being experimented with, and they all have advantages and disadvantages. They are all overcoming issues that will determine when each solution will be able to go into production.”

Issues

There are a lot of issues that have to be overcome before these solutions will be ready for production.

“Consider MRAM, which is at 28nm or 22nm today,” says Daniel Morris, research scientist at Facebook. “Compare that to 7nm SRAM and you will find a 6X density gap. The gap also exists in energy per byte. So while you may not need to move the data, you do need to read the data, and there is a 6X difference between reading MRAM and reading SRAM. Writes are another ballgame, because NVM requires high power to write. Then there is the logic performance. If you are stuck in a trailing technology, such as 28nm, your logic will be a lot less efficient, with lower performance per watt. If you have to keep to the same power budget, you will get less performance. That equates to a 4X difference. So when you architect a conventional accelerator, you have all of these choices and can choose the best memory technology and logic technology, whereas if you are using an embedded NVM, you are handcuffed to a more limited technology.”

Shekhar Borkar, senior director of technology for Qualcomm Technologies, agrees. “There are two different technologies—logic technology, which is optimized for power and performance, and memory technology that is optimized for density. If you try to put them together you get the worst of both. At the end of the day it is about money.”

Memory companies have used a lot of tricks to increase yield. “Memories have yield issues, reliability issues, variation issues,” says Morris. “With conventional memory products, all of these are handled nicely with techniques such as repair or dealing with failing bits. There is a lot of work to solve them if you add compute into memory.”

There are other issues to contend with, too. Some or all of the process and voltage variability, which could affect the results of in-memory processing, could be mitigated by including sensors within the array. Highly skilled analog designers could then build calibration circuitry into the ADCs and DACs to offset the variability.

Even the researchers admit there are significant problems to overcome. “Many matrices in the real world are sparse (contain many zeros), because quite often the natural phenomenon you are trying to model has some interactions that are strong and others that are weak enough to be ignored,” says Ipek. “When you are working with a crossbar, the energy cost of activating it and sampling the results is pretty much fixed. So the question is, given a sparse matrix, how can you make the computation energy-efficient? One solution is a heterogeneous collection of crossbars. By carefully partitioning the matrices, it turns out that you can map much of the problem onto a subset of crossbars that executes it efficiently.”
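One way to picture the heterogeneous-crossbar idea is a simple packing policy: rows (or blocks) with few nonzeros go to small crossbars, dense ones to large crossbars, so fixed activation energy is not wasted on zeros. The sizes and row-level granularity below are invented purely for illustration; real partitioning schemes operate on submatrix blocks.

```python
# Hypothetical illustration of assigning sparse-matrix rows to a heterogeneous
# pool of crossbar widths. Policy and sizes are invented, not from the research.

def assign_crossbars(matrix, sizes=(2, 4, 8)):
    """Return (row_index, nonzero_count, crossbar_width) per row, picking
    the smallest available crossbar that holds the row's nonzeros."""
    assignments = []
    for i, row in enumerate(matrix):
        nnz = sum(1 for x in row if x != 0)
        xbar = next(s for s in sizes if s >= nnz)  # smallest width that fits
        assignments.append((i, nnz, xbar))
    return assignments

m = [[0, 0, 3, 0, 0, 0, 1, 0],
     [5, 2, 0, 1, 4, 0, 6, 2],
     [0, 1, 0, 0, 0, 0, 0, 0]]
print(assign_crossbars(m))
# row 0 (2 nonzeros) -> width-2; row 1 (6 nonzeros) -> width-8; row 2 (1) -> width-2
```

The sparse rows end up on arrays a quarter the size, so their fixed activation and sampling cost shrinks proportionally.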

Mapping problems onto these accelerators is a challenge. “Conventional accelerators can scale memory capacity independent of processing elements,” says Morris. “Each layer of a neural network has a very different structure and dimensions for filters, different amounts of weights, different compute, so always being able to use a fixed memory structure is challenging.”

Solutions require changes in mindset for both the hardware and the software. “Analog in-memory solutions require ADCs and DACs,” says eSilicon’s Soheili. “If you are a good ADC/DAC designer and you know what you are doing in the analog domain, the power that you can save, from an overall system perspective—at least from the numbers we have seen—is overwhelming. So you may spend more time and need a different set of expertise, but your gain from the in-memory computation far outweighs the extra effort. It requires a different skill set.”

Conclusion

When industrial engineers look at the technology and view it as additional logic placed into existing memory, it does not paint a rosy picture for them. Qualcomm’s Borkar analyzed several industrial-sized workloads and came to a grim conclusion. “When we look at performance improvement across all workloads, there are only two or three examples that are above zero. Looking at energy, almost all cases have gone up. On average I see a 5% reduction in performance with a 33% increase in energy. So while compute in memory may look sensible, please don’t do it.”

Nuwan Jayasena, principal member of technical staff at AMD, has a more positive view. “There are challenges with processing in memory, and they will take a lot of time and investment, but there is a lot of opportunity. When we did get GPUs to work, it enabled the current revolution that we are seeing today. Processing in memory has a similar potential once you get it to work, and it will enable the next step in machine learning.”

If the industry is willing to give von Neumann the boot, it should perhaps go all the way and stop considering memory to be something shared between instructions and data, and instead start thinking about it as an accelerator. Viewed that way, it no longer has to be compared against logic or memory, but can be judged on its own merits. If it accelerates the task and uses less power, then whether the area used is worth it becomes a purely economic decision, the same as for every other accelerator.
