NeuRRAM: RRAM Compute-In-Memory Chip for Efficient, Versatile, and Accurate AI Inference

Weier Wan1, Rajkumar Kubendran2,5, Clemens Schaefer4, S. Burc Eryilmaz1, Wenqiang Zhang3, Dabin Wu3, Stephen Deiss2, Priyanka Raina1, He Qian3, Bin Gao3*, Siddharth Joshi4,2, Huaqiang Wu3, H.-S. Philip Wong1, Gert Cauwenberghs2

1 Stanford University, CA, USA;
2 University of California San Diego, CA, USA;
3 Tsinghua University, Beijing, China;
4 University of Notre Dame, IN, USA;
5 University of Pittsburgh, PA, USA

AI-powered edge devices such as smart wearables, smart home appliances, and smart Internet-of-Things (IoT) sensors are already pervasive in our lives. Yet most of these devices are only smart when they are connected to the internet. Constrained by battery capacity and cost, the local chipsets inside these devices can handle only relatively simple data processing, while the more computationally demanding AI tasks are offloaded to the remote cloud.

Dependence on the internet is not the only issue with such cloud-based solutions. The long latency, data-privacy risks, and high cost associated with cloud services are all major pain points.

So how can we make AI algorithms run locally on these edge devices? The answer is energy efficiency: without it, today's edge hardware would drain its battery too quickly when performing AI tasks locally.

To improve the energy efficiency of AI hardware, we can draw inspiration from an extremely energy-efficient alternative: the brain. One major difference between today's chips and the brain pertains to the locations of synaptic memory (i.e. the weights) and synaptic computation (i.e. the multiply-accumulate operations). In today's chips they are separate: synaptic weights are stored in digital memory banks, while synaptic computations are performed by separate compute units that fetch the weights one by one from memory. During AI processing, such chips constantly shuttle data back and forth between the memory banks and the compute units, which consumes significant energy and time.

In the brain, by contrast, synaptic computations are performed locally by each synapse in a highly parallel fashion. As a result, the brain processes information with much higher energy efficiency and throughput.

We designed and experimentally validated an AI inference chip, NeuRRAM [1], which implements this brain-like compute-in-memory (CIM) architecture natively using resistive random-access memory (RRAM) [2]. The analog RRAM devices not only store the synaptic weights but also perform the synaptic computations locally. This eliminates the power-hungry movement of synaptic weights and allows NeuRRAM to perform AI inference at a fraction of the energy consumed by today's digital AI chips.
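To make this concrete, below is a minimal behavioral sketch in Python (illustrative values only, not the NeuRRAM design itself) of how an RRAM crossbar computes a matrix-vector multiplication in place: weights are stored as device conductances, inputs are applied as voltages, and Kirchhoff's current law sums the per-device currents on each output line. The differential conductance pair for signed weights is one common mapping, similar in spirit to the scheme in [16]; the conductance range here is an assumed placeholder.

```python
# Behavioral sketch of an RRAM crossbar matrix-vector multiply (MVM).
# Conductance range and weight mapping are illustrative assumptions.
import numpy as np

def crossbar_mvm(weights, v_in, g_min=1e-6, g_max=1e-4):
    """I_j = sum_i (G+_ij - G-_ij) * V_i, computed in one analog read."""
    w_max = max(np.abs(weights).max(), 1e-12)
    scale = (g_max - g_min) / w_max
    g_pos = g_min + scale * np.clip(weights, 0, None)   # encodes positive part
    g_neg = g_min + scale * np.clip(-weights, 0, None)  # encodes negative part
    # Every multiply-accumulate happens inside the array itself: a single
    # parallel read yields the whole output vector, with no weight movement.
    return (g_pos - g_neg).T @ v_in

weights = np.random.randn(4, 3)     # 4 input rows x 3 output columns
v_in = np.random.rand(4)            # input voltages on the rows
print(crossbar_mvm(weights, v_in))  # one output current per column
```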

Fig. 1. Innovations across the full stack of hardware and algorithm enable NeuRRAM to simultaneously deliver high versatility, high efficiency, and high accuracy (Wan et al., Nature 2022 [1])

Although RRAM-CIM chips have been demonstrated in previous studies [3-14], the reported AI benchmark results were often obtained through partial software simulation. Hardware experiments were shown only for simple datasets (e.g. MNIST) and a single type of task (e.g. image classification). Moreover, previous studies typically focused on improving energy efficiency, often at the cost of the chip's reconfigurability or inference accuracy.

NeuRRAM is the first fully integrated (including all essential modules for end-to-end neural network support) and large-scale (48 cores, 3 million synapses, and 12 thousand neurons) demonstration of complete RRAM-CIM hardware capable of performing diverse AI tasks. It simultaneously achieves energy efficiency 2× better than previous RRAM-CIM chips, reconfigurability to support diverse neural network architectures, and inference accuracy comparable to software models, including 99.0% on MNIST and 85.7% on CIFAR-10 image classification, 84.7% on Google speech command recognition, and a 70% error reduction on an image-denoising task.

Neurosynaptic Array

Unlike existing CIM chips, where CMOS neurons sit at the periphery of the synapse array, the NeuRRAM chip physically interleaves RRAM synapses and CMOS neurons (Fig. 2). Together they form an array of small “corelets” that resembles the network motifs of interleaved neuron and synapse fabrics found in the microcolumn circuitry of the mammalian brain. This transposable neurosynaptic array architecture gives NeuRRAM great reconfigurability: it allows neuronal signals to be sent and received in arbitrary directions across the corelet array. As a result, NeuRRAM is the first CIM neuromorphic chip that supports diverse neural network architectures with minimal hardware and energy overhead (Fig. 1c).
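The practical consequence of transposability can be seen in a toy Python model (hypothetical, and far simpler than the chip's circuits): the same weight matrix, stored once, can be driven from either edge of the array, so one corelet serves both forward dataflow (a dense layer computing y = Wx) and the reverse direction (x = W^T y) needed by recurrent and bidirectional networks.

```python
# Toy model of a transposable weight array: one stored matrix, two dataflows.
import numpy as np

W = np.random.randn(8, 8)  # weights programmed once into the RRAM corelet

def mvm(x, direction="forward"):
    # In hardware, "direction" selects which edge of the array drives the
    # input voltages and which edge's neurons sense the result; the weights
    # themselves never move.
    return W @ x if direction == "forward" else W.T @ x

x = np.random.rand(8)
y_fwd = mvm(x, "forward")   # standard feed-forward pass
y_rev = mvm(x, "reverse")   # transposed pass for bidirectional dataflow
```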

Neuron

Fig. 3. Reconfigurable voltage-mode neuron with different feedback paths around a single amplifier [1]

In addition, NeuRRAM realizes CIM differently from existing implementations. The conventional approach is to apply voltages as inputs and measure currents as the results, based on Ohm's law. Such a design cannot fully exploit the parallelism and energy-efficiency benefits of CIM. NeuRRAM instead implements an innovative voltage-domain computation that boosts both parallelism and efficiency (Fig. 1d). It is realized through a voltage-mode analog neuron circuit that uses voltage for both its inputs and outputs. The neuron is also versatile: it can be configured to implement leaky integrate-and-fire dynamics for biophysical spiking neural networks [11, 15], or activation functions such as sigmoid, tanh, and ReLU for artificial neural networks [1, 16]. All of these functions fit within a compact area footprint by reusing different feedback paths around a single amplifier (Fig. 3).
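The neuron's dual-mode behavior can be summarized with a short behavioral model in Python (assumed parameters, nothing like the transistor-level circuit): the same unit either applies a static activation function to the array's output, as in an artificial neural network, or leaks, integrates, and fires over time, as in a spiking network.

```python
# Behavioral model of a reconfigurable neuron: ANN activations or LIF spiking.
import numpy as np

def ann_neuron(weighted_sum, mode="relu"):
    """Static activation applied to the crossbar's MVM result."""
    if mode == "relu":
        return np.maximum(weighted_sum, 0.0)
    if mode == "sigmoid":
        return 1.0 / (1.0 + np.exp(-weighted_sum))
    if mode == "tanh":
        return np.tanh(weighted_sum)
    raise ValueError(f"unknown mode: {mode}")

def lif_neuron(input_currents, leak=0.9, v_th=1.0):
    """Leaky integrate-and-fire over a sequence of input currents."""
    v, spikes = 0.0, []
    for i in input_currents:
        v = leak * v + i           # leaky integration of each MVM result
        if v >= v_th:              # threshold crossing emits a spike...
            spikes.append(1)
            v = 0.0                # ...and resets the membrane potential
        else:
            spikes.append(0)
    return spikes
```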

Algorithm-Hardware Co-optimization

Fig. 4. Algorithm-hardware co-optimization techniques mitigate the impact of hardware non-idealities on inference accuracy [1]

The computation accuracy of analog CIM is affected by various hardware non-idealities such as non-linearity, noise, and random device variability. To mitigate their impact, we developed a series of algorithm-hardware co-optimization techniques (Fig. 4): (a) model-driven hardware calibration, which calibrates the hardware using real model weights and real data from the training datasets; (b) non-ideality-aware model training, which models hardware non-idealities during training to make the model resilient to them; and (c) chip-in-the-loop progressive model fine-tuning, which uses the chip itself to perform the forward pass, one layer at a time, during back-propagation fine-tuning, so that the model learns to adapt to each individual chip's characteristics. With these techniques, the inference accuracies measured on NeuRRAM were comparable to those of ideal software models (with 4-bit weights) across all measured AI benchmarks.
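As an illustration of technique (b), here is a minimal PyTorch sketch of non-ideality-aware training. Gaussian noise, standing in for RRAM conductance variability, is injected into the weights on every training forward pass so that the learned model tolerates it; the 5% noise level and the layer sizes are assumptions for this sketch, not the paper's calibrated values.

```python
# Sketch of non-ideality-aware training: inject weight noise during training.
import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    def forward(self, x):
        if self.training:
            # Noise magnitude scaled to the layer's weight range (assumed 5%).
            sigma = 0.05 * (self.weight.max() - self.weight.min())
            w = self.weight + sigma * torch.randn_like(self.weight)
        else:
            w = self.weight  # inference uses the clean (programmed) weights
        return nn.functional.linear(x, w, self.bias)

model = nn.Sequential(NoisyLinear(784, 256), nn.ReLU(), NoisyLinear(256, 10))
# Train with any standard loop/optimizer; because each forward pass sees a
# different weight perturbation, the model converges to a solution that is
# robust to the device-level variability it will encounter on-chip.
```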


References

[1] W. Wan, R. Kubendran, C. Schaefer, S. B. Eryilmaz, W. Zhang, D. Wu, S. Deiss, P. Raina, H. Qian, B. Gao, S. Joshi, H. Wu, H.-S. P. Wong and G. Cauwenberghs, “A compute-in-memory chip based on resistive random-access memory,” Nature, vol. 608, no. 7923, pp. 504–512, 2022.

[2] H.-S. P. Wong et al., “Metal–Oxide RRAM,” Proceedings of the IEEE, vol. 100, no. 6, pp. 1951–1970, Jun. 2012, doi: 10.1109/JPROC.2012.2190369.

[3] R. Mochida et al., “A 4M synapses integrated analog ReRAM based 66.5 TOPS/W neural-network processor with cell current controlled writing and flexible network architecture,” in Digest of Technical Papers – Symposium on VLSI Technology. IEEE, 2018, pp. 175–176.

[4] W. H. Chen et al., “CMOS-integrated memristive non-volatile computing-in-memory for AI edge processors,” Nature Electronics, vol. 2, no. 9, pp. 420–428, Sep. 2019.

[5] R. Khaddam-Aljameh et al., “HERMES Core – A 14nm CMOS and PCM-based In-Memory Compute Core using an array of 300ps/LSB Linearized CCO-based ADCs and local digital processing,” in IEEE Symposium on VLSI Circuits, Digest of Technical Papers. IEEE, 2021.

[6] J. M. Hung et al., “A four-megabit compute-in-memory macro with eight-bit precision based on CMOS and resistive random-access memory for AI edge devices,” Nature Electronics, vol. 4, no. 12, pp. 921–930, Dec. 2021.

[7] C. X. Xue et al., “A 1Mb Multibit ReRAM Computing-In-Memory Macro with 14.6ns Parallel MAC Computing Time for CNN Based AI Edge Processors,” in Digest of Technical Papers – IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2019, pp. 388–390.

[8] F. Cai et al., “A fully integrated reprogrammable memristor–CMOS system for efficient multiply–accumulate operations,” Nature Electronics, vol. 2, no. 7, pp. 290–299, Jul. 2019.

[9] M. Ishii et al., “On-Chip Trainable 1.4M 6T2R PCM Synaptic Array with 1.6K Stochastic LIF Neurons for Spiking RBM,” in Technical Digest – International Electron Devices Meeting (IEDM). IEEE, 2019, pp. 14.2.1–14.2.4.

[10] B. Yan et al., “RRAM-based Spiking Nonvolatile Computing-In-Memory Processing Engine with Precision-Configurable in Situ Non-linear Activation,” in Digest of Technical Papers – Symposium on VLSI Technology. IEEE, 2019, pp. T86–T87.

[11] W. Wan et al., “A 74 TMACS/W CMOS-RRAM Neurosynaptic Core with Dynamically Reconfigurable Dataflow and In-situ Transposable Weights for Probabilistic Graphical Models,” in Digest of Technical Papers – IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2020, pp. 498–500.

[12] Q. Liu et al., “A Fully Integrated Analog ReRAM Based 78.4TOPS/W Compute-In-Memory Chip with Fully Parallel MAC Computing,” in Digest of Technical Papers – IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2020, pp. 500–502.

[13] C. X. Xue et al., “A 22nm 4Mb 8b-Precision ReRAM Computing-in-Memory Macro with 11.91 to 195.7TOPS/W for Tiny AI Edge Devices,” in Digest of Technical Papers – IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2021.

[14] P. Narayanan et al., “Fully On-Chip MAC at 14 nm Enabled by Accurate Row-Wise Programming of PCM-Based Weights and Parallel Vector-Transport in Duration-Format,” IEEE Transactions on Electron Devices, vol. 68, no. 12, pp. 6629–6636, Dec. 2021.

[15] R. Kubendran, W. Wan, S. Joshi, H.-S. P. Wong and G. Cauwenberghs, “A 1.52 pJ/Spike Reconfigurable Multimodal Integrate-and-Fire Neuron Array Transceiver,” in International Conference on Neuromorphic Systems (ICONS), 2020.

[16] W. Wan, R. Kubendran, B. Gao, S. Joshi, P. Raina, H. Wu, G. Cauwenberghs and H.-S. P. Wong, “A Voltage-Mode Sensing Scheme with Differential-Row Weight Mapping for Energy-Efficient RRAM-Based In-Memory Computing,” in IEEE Symposium on VLSI Technology, 2020.