The second key property that enables brain-inspired computing is the accumulative behavior arising from the crystallization dynamics. As shown in Fig., one can induce a progressive reduction in the size of the amorphous region (and hence in the device resistance) by the successive application of SET pulses of the same amplitude. However, it is not possible to achieve a progressive increase in the size of the amorphous region. Hence, the curve shown in Fig., typically referred to as the accumulation curve, is unidirectional. The SET pulses typically consume less energy (approx. 5 pJ) than the RESET pulses. As we will see later on, it is often desirable to achieve a linear increase in conductance as a function of the number of SET pulses. However, as shown in Fig., real devices do not tend to exhibit this desired behavior. It can also be seen that there is significant cycle-to-cycle randomness associated with the accumulation process, attributed to the inherent stochasticity of the crystallization process.
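To make the accumulation dynamics concrete, the following is a minimal simulation sketch, not a measured device model: it assumes a saturating conductance update per SET pulse and Gaussian pulse-to-pulse variability, with all parameter values chosen purely for illustration. It reproduces the two qualitative features described above, namely the nonlinear (non-ideal) accumulation curve and the cycle-to-cycle spread.

```python
import numpy as np

rng = np.random.default_rng(0)

G_MIN, G_MAX = 0.1, 20.0   # conductance range in microsiemens (illustrative)
N_PULSES, N_CYCLES = 20, 25

def set_pulse(g):
    """One SET pulse: a saturating (nonlinear) conductance increase plus
    Gaussian variability modeling crystallization stochasticity."""
    dg = 0.25 * (G_MAX - g)                   # update shrinks as the device saturates
    dg += rng.normal(0.0, 0.3 * abs(dg))      # cycle-to-cycle randomness
    return float(np.clip(g + dg, G_MIN, G_MAX))

curves = np.empty((N_CYCLES, N_PULSES + 1))
for c in range(N_CYCLES):
    g = G_MIN                                 # start from the RESET state each cycle
    curves[c, 0] = g
    for p in range(1, N_PULSES + 1):
        g = set_pulse(g)                      # identical pulses, accumulating response
        curves[c, p] = g

# The mean accumulation curve is strongly sublinear, and the spread across
# cycles reflects the stochasticity of the crystallization process.
print("mean:", curves.mean(axis=0).round(2))
print("std :", curves.std(axis=0).round(2))
```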

Even though it is possible to achieve a desired resistance value through iterative programming, there are significant temporal fluctuations associated with the resistance values [see Fig.]. For example, PCM devices exhibit significant 1/f noise behavior. There is also a temporal evolution of resistance, commonly referred to as drift, arising from a spontaneous structural relaxation of the amorphous phase. The thermally activated nature of electrical transport also leads to significant resistance changes resulting from ambient temperature variations.
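The drift component of these fluctuations is commonly described in the PCM literature by an empirical power law, R(t) = R0 (t/t0)^ν, with a drift exponent ν typically reported on the order of 0.1 for the fully amorphous phase (smaller for intermediate states). The sketch below combines this power law with a crude multiplicative noise term standing in for the low-frequency (1/f) fluctuations; the function and all parameter values are illustrative assumptions, not measured device data.

```python
import numpy as np

def pcm_read(r0, t, t0=1.0, nu=0.1, noise_sigma=0.02, rng=None):
    """Read-out model: power-law structural-relaxation drift
    R(t) = r0 * (t / t0)**nu, multiplied by a crude noise term standing in
    for the low-frequency (1/f) fluctuations. All parameters illustrative."""
    rng = rng or np.random.default_rng()
    return r0 * (t / t0) ** nu * (1.0 + rng.normal(0.0, noise_sigma))

rng = np.random.default_rng(1)
for t in np.logspace(0, 6, 7):                # 1 s to ~11.6 days after programming
    print(f"t = {t:9.0f} s   R = {pcm_read(1e6, t, rng=rng):.3e} ohm")
```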

The first key property of PCM that enables brain-inspired computing is its ability to achieve not just two levels but a continuum of resistance or conductance values. This is typically achieved by creating intermediate phase configurations through the application of suitable partial RESET pulses. For example, Fig. shows how one can achieve a continuum of resistance levels by applying RESET pulses of varying amplitude. The device is first programmed to the fully crystalline state. Thereafter, RESET pulses are applied with progressively increasing amplitude. The resistance is measured after the application of each RESET pulse. It can be seen that the device resistance, related to the size of the amorphous region (shown in red), increases with increasing RESET current. The curve shown in Fig. is typically referred to as the programming curve. The programming curve is usually bidirectional (the resistance can be increased as well as decreased by modulating the programming current) and is typically employed when one has to program a PCM device to a certain desired resistance value. This is achieved through iterative programming, by applying several pulses in a closed-loop manner. The programming curves are shown in terms of the programming current due to the highly nonlinear current-voltage characteristics of the PCM devices. A slight variation in the programming voltage would result in large variations in the programming current. For example, for the devices shown in Fig., a voltage drop across the PCM device of 1.0 V corresponds to 100 μA and 1.2 V corresponds to 500 μA. The latter results in a dissipated power of 600 μW, and the energy expended, assuming a pulse duration of 50 ns, is 30 pJ. An additional consideration is that the amorphous phase-change material has to undergo threshold switching before it can conduct such high currents at such low voltage values. This could necessitate voltage values of up to 2.5 V.
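As a concrete illustration of the closed-loop (program-and-verify) operation, here is a minimal sketch. The device response function apply_reset_pulse and the feedback rule are hypothetical stand-ins (in practice, the measured programming curve would be used); the loop structure itself, apply a pulse, read back at low voltage, adjust the programming current, is the iterative programming scheme described above.

```python
import numpy as np

rng = np.random.default_rng(2)

def apply_reset_pulse(i_prog):
    """Hypothetical programming curve: resistance grows with programming
    current (a larger melt volume leaves a larger amorphous plug), with
    lognormal programming variability. Not a measured device model."""
    r = 1e4 + 8e9 * max(i_prog - 150e-6, 0.0)     # ohms per ampere, illustrative
    return r * rng.lognormal(0.0, 0.1)

def program_to_target(r_target, rel_tol=0.05, i_prog=300e-6, max_pulses=20):
    """Closed-loop iterative programming: apply a pulse, verify with a
    low-voltage read, then nudge the programming current up or down."""
    for pulse in range(1, max_pulses + 1):
        r = apply_reset_pulse(i_prog)             # program
        err = (r - r_target) / r_target           # verify
        if abs(err) < rel_tol:
            return r, pulse
        i_prog = float(np.clip(i_prog * (1.0 - 0.2 * np.sign(err)), 150e-6, 500e-6))
    return r, max_pulses

r, n = program_to_target(2e6)
print(f"reached R = {r:.3e} ohm after {n} pulse(s)")
```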

A PCM device consists of a nanometric volume of this phase-change material sandwiched between two electrodes. A schematic illustration of a PCM device with a "mushroom-type" device geometry is shown in Fig.. The phase-change material is in the crystalline phase in an as-fabricated device. In a memory array, the PCM devices are typically placed in series with an access device such as a field-effect transistor (FET), referred to as a 1T1R configuration. When a current pulse of sufficiently high amplitude is applied to the PCM device (typically referred to as the RESET pulse), a significant portion of the phase-change material melts owing to Joule heating. The typical melting temperature of phase-change materials is approx. 600 °C. When the pulse is stopped abruptly, so that the temperature inside the heated device drops rapidly, the molten material quenches into the amorphous phase via the glass transition. In the resulting RESET state, the device will be in a high-resistance state if the amorphous region blocks the bottom electrode. A transmission electron micrograph of a PCM device in the RESET state is shown in Fig.. When a current pulse (typically referred to as the SET pulse) is applied to a PCM device in the RESET state, such that the temperature reached in the cell via Joule heating is high but below the melting temperature, a part of the amorphous region crystallizes. The temperature that corresponds to the highest rate of crystallization is typically ≈400 °C. In particular, if the SET pulse induces complete crystallization, then the device will be in a low-resistance state. In this scenario, we have a memory device that can store one bit of information. The memory state can be read by biasing the device with a read voltage of small amplitude that does not disturb the phase configuration.

In the brain, memory and processing are highly entwined. Hence, the memory unit can be expected to play a key role in brain-inspired computing systems. In particular, very high-density, low-power, variable-state, programmable and non-volatile memory devices could play a central role. One such nanoscale memory device is phase-change memory (PCM). PCM is based on the property of certain compounds of Ge, Te, and Sb that exhibit drastically different electrical characteristics depending on their atomic arrangement. In the disordered amorphous phase, these materials have very high resistivity, while in the ordered crystalline phase, they have very low resistivity.

We are on the cusp of a revolution in artificial intelligence (AI) and cognitive computing. The computing systems that run today's AI algorithms are based on the von Neumann architecture, where large amounts of data need to be shuttled back and forth at high speeds during the execution of computational tasks (see Fig.). This creates a performance bottleneck and also leads to significant area/power inefficiency. Thus, it is becoming increasingly clear that to build efficient cognitive computers, we need to transition to novel architectures where memory and processing are better collocated. Brain-inspired computing is a key non-von Neumann approach that is being actively researched. It is natural to be drawn to the human brain for inspiration: a remarkable engine of cognition that performs computation on the order of peta-ops per joule, thus providing an "existence proof" for an ultralow-power cognitive computer. Unfortunately, we are still quite far from attaining a comprehensive understanding of how the brain computes. However, we have uncovered certain salient features of this computing system, such as the collocation of memory and processing, a computing fabric comprising large-scale networks of neurons and plastic synapses, and spike-based communication and processing of information. Based on these insights, we could begin to realize brain-inspired computing systems at multiple levels of inspiration or abstraction.

II. COMPUTATIONAL MEMORY

At a basic level, a key attribute of brain-inspired computing is the co-location of memory and processing. It can be shown that it is possible to perform in-place computation with data stored in PCM devices. The essential idea is not to treat memory as a passive storage entity, but to exploit the physical attributes of the memory devices described in Sec. I, and thus realize computation exactly at the place where the data are stored. We will refer to this first level of inspiration as in-memory computing and to the memory unit that performs in-memory computing as computational memory (see Fig.). Several computational tasks such as logical operations, arithmetic operations, and even certain machine learning tasks can be implemented in such a computational memory unit.

One arithmetic operation that can be realized is matrix-vector multiplication. As shown in Fig. 5(a), in order to perform Ax = b, the elements of A should be mapped linearly to the conductance values of PCM devices organized in a cross-bar configuration. The x values are encoded into the amplitudes or durations of read voltages applied along the rows. The positive and negative elements of A could be coded on separate devices together with a subtraction circuit, or negative vector elements could be applied as negative voltages. The resulting currents along the columns will be proportional to the result b. If inputs are encoded into durations, the result b is the total charge (e.g., current integrated over time). The device properties that are used are the multi-level storage capability and the Kirchhoff circuit laws: Ohm's law and Kirchhoff's current law. The same cross-bar configuration can be used to perform a matrix-vector multiplication with the transpose of A. For this, the input voltages have to be applied to the column lines and the resulting currents have to be measured along the rows. Mapping of the matrix elements to the conductance values of the resistive memory devices can be achieved via iterative programming using the programming curve. Figure 5(b) shows an experimental demonstration of a matrix-vector multiplication using real PCM devices fabricated in the 90 nm technology node. A is a 256 × 256 Gaussian matrix coded in a PCM chip and x is a 256-long Gaussian vector applied as voltages to the devices. It can be seen that the matrix-vector multiplication has a precision comparable to that of 4-bit fixed-point arithmetic. This precision is mostly determined by the conductance fluctuations discussed in Sec. I.
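The following sketch emulates this scheme in software. The conductance mapping, the differential (two-device) coding of signed matrix elements, and the additive conductance error are simplified assumptions of the sketch; the arithmetic itself, currents I = GV summed along each column, is exactly the Ohm's-law/Kirchhoff's-law computation described above.

```python
import numpy as np

rng = np.random.default_rng(3)
G_MAX = 20e-6                                  # max device conductance (S), illustrative

def program_crossbar(A):
    """Store A[j, k] in the device at row k, column j, so that the current
    on column j is sum_k G[k, j] * V[k], proportional to (A x)[j]. Signed
    values use two device arrays combined by a subtraction circuit; the
    additive error models the finite programming precision."""
    scale = G_MAX / np.abs(A).max()
    G = A.T * scale
    noise = lambda: rng.normal(0.0, 0.02 * G_MAX, G.shape)
    g_pos = np.clip(np.where(G > 0, G, 0.0) + noise(), 0.0, G_MAX)
    g_neg = np.clip(np.where(G < 0, -G, 0.0) + noise(), 0.0, G_MAX)
    return g_pos, g_neg, scale

def crossbar_matvec(g_pos, g_neg, scale, x, v_max=0.2):
    """Encode x in read-voltage amplitudes; Ohm's law (I = G V) and
    Kirchhoff's current law (current summation on each column) then
    perform the multiply-accumulate in one step."""
    v_per_unit = v_max / np.abs(x).max()
    i_col = (x * v_per_unit) @ (g_pos - g_neg)   # column currents
    return i_col / (scale * v_per_unit)          # decode back to A x

A = rng.standard_normal((256, 256))
x = rng.standard_normal(256)
g_pos, g_neg, scale = program_crossbar(A)
b = crossbar_matvec(g_pos, g_neg, scale, x)
print("relative error:", np.linalg.norm(b - A @ x) / np.linalg.norm(A @ x))
```

Driving the columns with the input voltages and reading the currents on the rows of the same arrays would compute the product with the transpose of A, as noted above.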

Compressed sensing and recovery is one of the applications that could benefit from a computational memory unit that performs matrix-vector multiplications. The objective behind compressed sensing is to acquire a large signal at a sub-Nyquist sampling rate and subsequently reconstruct that signal accurately. Unlike most other compression schemes, sampling and compression are done simultaneously, with the signal getting compressed as it is sampled. Such techniques have widespread applications in the domains of medical imaging, security systems, and camera sensors. The compressed measurements can be thought of as a mapping of a signal x of length N to a measurement vector y of length M < N. If this process is linear, then it can be modeled by an M × N measurement matrix M. The idea is to store this measurement matrix in the computational memory unit, with PCM devices organized in a cross-bar configuration [see Fig. 6(a)]. This allows us to perform the compression in O(1) time complexity. An approximate message passing (AMP) algorithm can be used to recover the original signal from the compressed measurements, using an iterative algorithm that involves several matrix-vector multiplications with the very same measurement matrix and its transpose. In this way, we can also use the same matrix that was coded in the computational memory unit for the reconstruction, reducing the reconstruction complexity from O(MN) to O(N). An experimental illustration of compressed sensing recovery in the context of image compression is shown in Fig. 6(b). A 128 × 128 pixel image was compressed by 50% and recovered using the measurement matrix elements encoded in a PCM array. The normalized mean square error associated with the recovered signal is plotted as a function of the number of iterations. A remarkable property of AMP is that its convergence rate is independent of the precision of the matrix-vector multiplications. The lack of precision only results in a higher error floor, which may be considered acceptable for many applications. Note that, in this application, the measurement matrix remains fixed, and hence the property of PCM that is exploited is the multi-level storage capability.
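A toy end-to-end sketch follows. For brevity it uses ISTA, a simpler sparse-recovery iteration, as a stand-in for AMP, and it models the crossbar with a noisy matrix-vector product; the matrix sizes, noise level, and step size are illustrative choices. Note how both the forward and transpose products reuse the single stored matrix, which is the source of the complexity reduction.

```python
import numpy as np

rng = np.random.default_rng(4)

N, M, K = 256, 128, 10                    # signal length, measurements (M < N), sparsity
Phi = rng.standard_normal((M, N)) / np.sqrt(M)   # measurement matrix, stored once in PCM

x_true = np.zeros(N)
x_true[rng.choice(N, K, replace=False)] = rng.standard_normal(K)

def noisy_matvec(A, v, rel_sigma=0.03):
    """Stand-in for the crossbar: an inexact product whose error mimics the
    limited precision of the conductance-encoded matrix."""
    y = A @ v
    return y + rel_sigma * np.linalg.norm(y) / np.sqrt(y.size) * rng.standard_normal(y.size)

y = noisy_matvec(Phi, x_true)             # compression: one crossbar operation

# Recovery by ISTA (a simpler proxy for the AMP solver mentioned above).
# Both the forward (Phi) and transpose (Phi.T) products reuse the same
# stored matrix: rows are driven for one, columns for the other.
x = np.zeros(N)
step = 1.0 / np.linalg.norm(Phi, 2) ** 2  # safe gradient step
lam = 0.05
for _ in range(200):
    r = y - noisy_matvec(Phi, x)
    x = x + step * noisy_matvec(Phi.T, r)
    x = np.sign(x) * np.maximum(np.abs(x) - lam * step, 0.0)   # soft threshold

print("NMSE:", np.linalg.norm(x - x_true) ** 2 / np.linalg.norm(x_true) ** 2)
```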

Another interesting demonstration of in-memory computing is that of unsupervised learning of temporal correlations between binary stochastic processes. This problem arises in a variety of fields from finance to life sciences. Here, we exploit the accumulative behavior of the PCM devices. Each process is assigned to a PCM device as shown in Fig. 7(a). Whenever the process takes the value 1, a SET pulse is applied to the device. The amplitude of the SET pulse is chosen to be proportional to the instantaneous sum of all processes. With this procedure, it can be seen that the devices which are interfaced to the processes that are temporally correlated will go to a high conductance value. The simplicity of this approach belies the fact that a rather intricate operation, finding the sum of the elements of an uncentered covariance matrix, is performed using the accumulative behavior of the PCM devices. An experimental demonstration of the learning algorithm is presented involving a million pixels that are turning on and off, representing a million binary stochastic processes. Some of the pixels turn on and off with a weak correlation of c = 0.01, and the overall objective is to find them. Each pixel is assigned to a corresponding PCM device and the algorithm is executed as described earlier. It can be seen that after a certain period of time, the PCM devices associated with the correlated processes progress towards a high conductance value. This way, just by reading back the conductance values, we can decipher which of the binary random processes are temporally correlated [Fig. 7(b)]. The computation is massively parallel, with the final result of the computation imprinted onto the PCM devices. The reduction in computational time complexity is from O(N) to O(k log(N)), where k is a small constant and N is the number of data streams. A detailed system-level comparative study with respect to state-of-the-art computing hardware was also performed. Various implementations were compiled and executed on an IBM Power System S822LC with two Power8 central processing units (CPUs) (each comprising 10 cores) and four Nvidia Tesla P100 graphical processing units (GPUs) attached using the NVLink interface. A multi-threaded implementation was designed that can leverage the massive parallelism offered by the GPUs, as well as a scale-out implementation that runs across several GPUs. For the PCM, a write latency of 100 ns and a programming energy of 1.5 pJ were assumed for each SET operation.
It was shown that using such a computational memory module, it is possible to accelerate the task of correlation detection by a factor of 200 relative to an implementation that uses 4 state-of-the-art GPU devices. Moreover, power profiling of the GPU implementation indicates that the improvement in energy consumption is over two orders of magnitude.
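A software sketch of the scheme is given below. The construction of the weakly correlated binary processes (each member of the correlated group copies a hidden common process with probability sqrt(c)) and the soft-thresholded conductance update standing in for the device's nonlinear response to pulse amplitude are modeling choices of this sketch, with toy sizes (10^3 processes rather than 10^6, and a larger correlation) so it runs in seconds.

```python
import numpy as np

rng = np.random.default_rng(5)

N, N_CORR, T, C = 1000, 50, 3000, 0.1     # processes, correlated subset, steps, correlation
g = np.zeros(N)                           # device conductances (normalized to [0, 1])

def step_processes():
    """All processes are Bernoulli(0.5). Each process in the correlated group
    copies a hidden common process with probability sqrt(C), which yields a
    pairwise correlation of C while leaving the marginals unchanged."""
    x = (rng.random(N) < 0.5).astype(float)
    z = float(rng.random() < 0.5)
    copy = rng.random(N_CORR) < np.sqrt(C)
    x[:N_CORR] = np.where(copy, z, x[:N_CORR])
    return x

for _ in range(T):
    x = step_processes()
    # SET amplitude proportional to the instantaneous sum of all processes;
    # the soft threshold at the mean mimics the device's nonlinear response
    # (weak pulses cause negligible crystallization).
    amp = max(x.sum() / N - 0.5, 0.0)
    g = np.minimum(g + 0.05 * amp * x, 1.0)   # pulse applied only where x_k = 1

# Correlated devices receive systematically stronger pulses and separate out:
print("mean conductance, correlated group  :", g[:N_CORR].mean().round(3))
print("mean conductance, uncorrelated group:", g[N_CORR:].mean().round(3))
```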

The compressed sensing recovery and unsupervised learning of temporal patterns are two applications that clearly demonstrate the potential of PCM-based computational memory in tackling certain data-centric computational tasks. The former exploits the multi-level storage capability, whereas the latter mostly relies on the accumulative behavior. However, one key challenge associated with computational memory is the lack of high precision. Even though approximate solutions are sufficient for many computational tasks in the domain of AI, some applications require solutions of arbitrarily high accuracy. Fortunately, many such computational tasks can be formulated as a sequence of two distinct parts. In the first part, an approximate solution is obtained; in the second part, the resulting error in the overall objective is calculated accurately. Then, based on this, the approximate solution is refined by repeating the first part. The first part typically carries a high computational load, whereas the second part carries a light one. This forms the foundation for the concept of mixed-precision in-memory computing: the use of a computational memory unit in conjunction with a high-precision von Neumann machine. The low-precision computational memory unit can be used to obtain an approximate solution as discussed earlier, while the high-precision von Neumann machine can be used to calculate the error precisely. The bulk of the computation is still realized in computational memory, and hence we still achieve significant areal/power/speed improvements while addressing the key challenge of imprecision associated with computational memory.

A practical application of mixed-precision in-memory computing is that of solving systems of linear equations (given Ax = b, find x). As shown in Fig. 8(a), an initial solution is chosen as a starting point and is then iteratively updated based on a low-precision error-correction term, z. This error-correction term is computed by solving Az = r with an inexact inner solver, using the residual r = b − Ax calculated with high precision. The matrix multiplications in the inner solver are performed inexactly using computational memory. The algorithm runs until the norm of the residual falls below a desired pre-defined tolerance, tol. An experimental demonstration of this concept using model covariance matrices is shown in Fig. 8(b). The model covariance matrices exhibit a decaying behavior that simulates the decreasing correlation of features away from the main diagonal. The matrix multiplications in the inner solver are performed using PCM devices. The norm of the error between the estimated solution and the actual solution is plotted against the number of iterative refinements. It can be seen that for all matrix dimensions, the accuracy is not limited by the precision of the computational memory unit. Several system-level measurements using Power8 CPUs and P100 GPUs serving as the high-precision processing unit showed that up to 6.8× improvements in time/energy to solution can be achieved for large matrices. Moreover, this gain can increase to more than one order of magnitude for more accurate computational memory units.
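The sketch below implements this mixed-precision iterative refinement in software. The noisy matrix-vector product stands in for the computational memory unit, and Richardson iteration is used as a simple inexact inner solver (our choice for this sketch; the scheme described above only requires the inner solver to be inexpensive and approximate). Because the residual is always computed in full float64 precision, the final accuracy is set by tol rather than by the noise, mirroring the experimental observation above.

```python
import numpy as np

rng = np.random.default_rng(6)

def inexact_matvec(A, v, rel_sigma=0.05):
    """Low-precision stand-in for the computational memory unit."""
    y = A @ v
    return y + rel_sigma * np.linalg.norm(y) / np.sqrt(y.size) * rng.standard_normal(y.size)

def inexact_inner_solve(A, r, n_iter=30):
    """Approximately solve A z = r with Richardson iteration; every matrix
    product runs on the noisy 'crossbar'."""
    z = np.zeros_like(r)
    omega = 1.0 / np.linalg.norm(A, 2)      # conservative step size
    for _ in range(n_iter):
        z = z + omega * (r - inexact_matvec(A, z))
    return z

def mixed_precision_solve(A, b, tol=1e-10, max_outer=50):
    """Outer refinement loop in full float64: compute the exact residual
    r = b - A x, correct x with the inexact inner solution z of A z = r,
    and stop once the residual norm falls below tol."""
    x = np.zeros_like(b)
    for k in range(max_outer):
        r = b - A @ x                       # high-precision residual
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k
        x = x + inexact_inner_solve(A, r)   # low-precision error correction
    return x, max_outer

# Model covariance matrix: correlations decay away from the main diagonal.
n = 64
A = np.fromfunction(lambda i, j: 0.5 ** np.abs(i - j), (n, n))
x_true = rng.standard_normal(n)
x, iters = mixed_precision_solve(A, A @ x_true)
print(f"{iters} refinements, error = {np.linalg.norm(x - x_true):.2e}")
```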