RRAM Architecture
Qiwen Wang1,2, Xinxin Wang1,2, Seung Hwan Lee1, Fan-Hsuan Meng1, and Wei D. Lu1*
1 Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, USA
2 These authors contributed equally to this work.
*Email: wluee@umich.edu
Abstract—State-of-the-art deep neural networks (DNNs) have been successfully mapped on an RRAM-based tiled in-memory computing (IMC) architecture. Effects of moderate array size and quantized partial products (PPs) due to ADC precision constraints have been analyzed. Methods were developed to solve these challenges and preserve DNN accuracies and IMC performance gains in the tiled architecture. Popular models including VGG-16 and MobileNet have been successfully implemented and tested on the ImageNet dataset.

I. INTRODUCTION

DNNs are widely used for artificial intelligence applications with considerable success. However, neural networks often come with high computation complexity and cost, as traditional computing architectures are not well optimized for DNN computation. DNN accelerators are crucial in enabling wider adoption, particularly for edge use cases. Among these, accelerators based on analog IMC concepts implemented on RRAM arrays have gained increasing interest. RRAM arrays can perform vector-matrix multiplication (VMM) in the analog domain efficiently by accumulating total current or charge at each column. At the same time, the high density and non-volatile properties make it possible to store entire DNN models on chip, thus eliminating inefficient off-chip memory access and promising much higher energy efficiency. However, prior studies on RRAM-based accelerators have focused on small networks and datasets such as MNIST [1], which may not capture the challenges faced when implementing advanced models and complex tasks (e.g. ImageNet, Fig. 1). For example, device and circuit non-idealities limit the maximum size of the crossbar array (Fig. 2), making it impractical to map a whole layer of state-of-the-art DNN models on a single crossbar, contrary to assumptions in prior studies. Complex tasks such as ImageNet may also be more sensitive to quantization errors and device non-idealities. In this work, we implemented large-scale DNNs in a reconfigurable tiled RRAM architecture, where weights in a single layer are mapped onto multiple crossbar arrays tiled through digital interfaces. This approach is scalable and practical. However, the output activations need to be calculated by summing multiple PPs from the arrays, and loss of information when generating PPs due to limited ADC precision can severely degrade model performance. We carefully examined the performance of popular models such as VGG-16 [2] and MobileNet V1 [3] and developed effective mitigation methods that preserve model accuracy while maintaining the IMC performance gains using practical 8-bit ADCs.

II. MAPPING LARGE-SCALE DNNS ONTO TILED CROSSBAR ARRAYS

In this work, we focus on inference operations, which are expected to be popular with edge-type devices. Weights from pre-trained DNN models are programmed as conductance values of RRAM devices. Since weight updates occur very infrequently, a write-verify scheme can be used during model loading to improve the accuracy of the programmed conductances. During inference, the RRAM cells operate in read mode. More specifically, we target implementation of 8-bit models, since most state-of-the-art DNNs have been successfully implemented in 8-bit digital pipelines for inference [4]. The model mapping and simulation framework are shown in Fig. 3.

For practical DNNs, weights from a typical layer cannot fit on a single array. We have developed methods to map common neural network layers of any size through a tiled architecture, shown in Fig. 4. To map the signed weights, two approaches are studied. One is to map positive and negative weights on two different arrays (termed "dual array"), and the other is to map each weight as the difference in conductance between two devices on two adjacent rows (termed "dual row"). For fully connected (FC) layers, the weights of each output neuron are mapped onto a column. If the number of weights is larger than the number of rows in the array, the weights are divided into multiple arrays (Fig. 7(a)). For a regular convolution (Conv) layer, each filter is flattened to a 1D vector and mapped the same way as an FC layer (Fig. 7(b)).

Beyond these common layer types, MobileNet requires depthwise convolution (DW Conv) layers. Here each 2D filter corresponds to a single input channel, which leads to poor array utilization. We propose two methods to improve the utilization rate. First, the overlap between adjacent convolution windows in the input allows filters to be mapped on adjacent columns with an offset (Fig. 7(c)). In this way, multiple outputs can be calculated simultaneously through the VMM operations. Second, when different filters are stored without offset, the computation for each filter needs to be carried out sequentially since they act on different input channels. However, since time-multiplexing of the ADCs may already be necessary (discussed in more detail later), the different channels can be computed in a sequential manner to improve array utilization without an overall speed penalty on the system performance.

To perform VMM, all input rows and all output columns of the arrays are simultaneously activated. The input activations are
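The tiled mapping and digital PP summation described in Section II can be sketched in a few lines of NumPy. This is an illustrative model only, not the authors' simulator: the 256×64 tile size matches the paper, but the row-interleaving order, function names, and use of raw integer weights as conductances are assumptions.

```python
# Illustrative sketch of the tiled "dual row" mapping: an 8-bit signed
# weight matrix is split across 256x64 crossbar tiles, each signed weight
# is stored as the conductance difference of two devices on adjacent rows,
# each tile returns a partial product (PP), and PPs are summed digitally.
import numpy as np

ROWS, COLS = 256, 64          # crossbar tile size used in the paper

def dual_row_conductances(w_int8):
    """Map signed integer weights to non-negative conductance pairs."""
    g_pos = np.maximum(w_int8, 0).astype(np.float64)
    g_neg = np.maximum(-w_int8, 0).astype(np.float64)
    # Interleave: even physical rows carry +w, odd rows carry -w (assumed order).
    g = np.empty((2 * w_int8.shape[0], w_int8.shape[1]))
    g[0::2], g[1::2] = g_pos, g_neg
    return g

def tiled_vmm(x, w_int8):
    """Compute x @ w by splitting rows/columns across tiles and summing PPs."""
    n_in, n_out = w_int8.shape
    out = np.zeros(n_out)
    for r0 in range(0, n_in, ROWS // 2):         # dual row halves usable rows
        for c0 in range(0, n_out, COLS):
            w_blk = w_int8[r0:r0 + ROWS // 2, c0:c0 + COLS]
            g = dual_row_conductances(w_blk)
            # Each input drives its +/- row pair with opposite polarity,
            # so the accumulated column current is the signed dot product.
            v = np.repeat(x[r0:r0 + w_blk.shape[0]], 2)
            v[1::2] *= -1
            pp = v @ g                            # analog PP of this tile
            out[c0:c0 + w_blk.shape[1]] += pp     # digital PP summation
    return out
```

Note that this idealized sketch omits the 8-bit ADC quantization of each PP, which is exactly the information-loss step analyzed in the paper.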
IEDM19-319 14.4.2
Authorized licensed use limited to: University of Newcastle. Downloaded on May 30,2020 at 08:21:50 UTC from IEEE Xplore. Restrictions apply.
REFERENCES
[1] P. Y. Chen, et al., IEDM, p. 6-1, 2017.
[2] K. Simonyan, et al., arXiv preprint, 2014.
[3] A. G. Howard, et al., arXiv preprint, 2017.
[4] R. Krishnamoorthi, arXiv preprint, 2018.
[5] B. Jacob, et al., CVPR, pp. 2704-2713, 2018.
[6] Datasheet from vendor.
[7] M. Zidan, et al., Nature Electronics, p. 411, 2018.
Fig. 1. Different types of layers in DNN: fully connected layers (a), convolution layers (b), and depthwise convolution layers (c).
Fig. 2. (a) SEM image of a typical RRAM crossbar array. (b) Device non-idealities include finite on/off ratio and variability.
Fig. 5. Quantized partial product pipeline that takes into account weight and
activation quantization. Scaling of the quantized PPs is fused together with the
scaling of the quantized 8bit digital model, so only one digital multiplication
operation is needed for each PP without additional overhead.
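A minimal sketch of the fused-scaling idea in Fig. 5, assuming the standard scale-based 8-bit quantization scheme of [5]. The function names and scale bookkeeping here are illustrative assumptions, not the paper's implementation: the point is only that the ADC's PP scale can be folded into the model's requantization factor, leaving one digital multiply per PP.

```python
# Hedged sketch of fused PP scaling: in an 8-bit quantized model the real
# product is s_x * s_w * (x_q @ w_q); when the ADC quantizes each analog PP
# with its own scale s_pp, that scale can be fused offline into the model's
# requantization factor, so only one multiply is needed per PP at runtime.
import numpy as np

def quantize(v, scale, n_bits=8, signed=True):
    """Uniform quantizer: round to the nearest code and clip to range."""
    lo, hi = (-(2**(n_bits - 1)), 2**(n_bits - 1) - 1) if signed else (0, 2**n_bits - 1)
    return np.clip(np.round(v / scale), lo, hi)

def fused_pp_accumulate(pps_analog, s_x, s_w, s_pp, s_out):
    """Quantize analog PPs, then apply ONE fused multiplier per PP."""
    fused = (s_pp * s_x * s_w) / s_out   # computed once, offline
    acc = 0.0
    for pp in pps_analog:                # PPs from different tiles
        pp_q = quantize(pp, s_pp)        # 8-bit ADC output code
        acc += fused * pp_q              # single digital multiply per PP
    return acc                           # accumulated output in s_out units
```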
Fig. 3. Simulation framework of the tiled RRAM accelerator. The users can map their own neural networks by calling different layer functions and initialize the weights by calling different weight mapping functions through the Crossbar class. The activation mapping functions are used in the layer functions to map input activations into vectors, which are stored in the Activation class. Right shows classification results of an input in the ImageNet database by the implemented MobileNet model.
Fig. 4. Tiled architecture. As deep neural networks contain a large number of parameters, a single layer may need to be mapped onto multiple crossbar arrays. During inference, each array will produce a PP. The PPs are quantized into 8-bit and stored in the SRAM buffers. The PPs in the same channel are then added up digitally to produce the output neuron activation.
Fig. 6. Number of crossbar arrays needed for each layer after mapping MobileNet (a) and VGG-16 (b), with dual array and dual row mapping approaches. [Log-scale plots: # of crossbar vs. layer index, for cb_size = 256x64, dual array and dual row.]
Fig. 7. Weight mapping of FC layers (a), Conv layers (b) and DW Conv layers (c). The weights in FC layers are divided into different blocks to fit the size of the crossbar arrays. The Conv kernels are first reshaped to a 1D vector and then mapped in the same way as the FC weights. In the mapping of the DW Conv kernels, two parameters, duplicate and time multiplex, can help regulate the mapping to improve the array utilization and the computation time.
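The offset mapping of Fig. 7(c) can be illustrated with a 1D depthwise filter (an assumed simplification; the paper uses 2D kernels): shifted copies of one kernel on adjacent columns let a single array read produce several convolution outputs at once.

```python
# Illustrative 1D sketch of Fig. 7(c): column j of the crossbar holds the
# depthwise kernel shifted down by j*stride rows, so one VMM over a single
# input segment yields several consecutive convolution outputs.
import numpy as np

def offset_mapped_kernel(k, stride, n_copies):
    """Rows = input segment length; column j holds k shifted by j*stride."""
    rows = len(k) + stride * (n_copies - 1)
    m = np.zeros((rows, n_copies))
    for j in range(n_copies):
        m[j * stride : j * stride + len(k), j] = k
    return m

k = np.array([1.0, -2.0, 1.0])        # one 1D depthwise filter (assumed values)
x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0])
m = offset_mapped_kernel(k, stride=1, n_copies=5)
outs = x @ m                          # 5 outputs from one array read
# Same result as sliding the kernel over x:
ref = np.array([x[i:i + 3] @ k for i in range(5)])
assert np.allclose(outs, ref)
```

The overlap between adjacent windows is what makes this work: neighboring columns reuse the same input rows, so array utilization rises without any extra input reads.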
Fig. 8. Model accuracies, comparing the floating point and 8-bit digital models reported by Google and the 8-bit models mapped on the tiled architecture using only one 8-bit ADC range. [Bar chart, accuracy (%) for VGG and MobileNet; plotted values include 71.5, 71, 70.9, 70.1, 68.59, and 0.32.]
Fig. 9. Partial product distribution. The PP distributions per column are shown in (a) (for MobileNet) and (b) (for VGG-16). [Histograms: count vs. range (µA).]
Fig. 10. Simulation results for (a) single ADC range with different ADC precisions, (b) 8-bit ADC with arbitrary range per-layer (black and blue dots) and different ADC ranges per column (red and green lines) scenarios, (c) 8-bit ADC with 2% device variation for different on/off ratios and the two mapping methods. 8 and 4 ADC ranges are used for MobileNet and VGG-16 in (c).
[Fig. 11 plots: # of ADC (log scale, panels (a) and (b)) and latency (ms, panels (c) and (d)) vs. layer index, for MobileNet and VGG-16 with cb_size 128x128 and 256x64, with and without copies and sharing.]
Fig. 11. Number of ADCs required for different layers for MobileNet (a) and VGG-16 (b), and latencies of different layers for the two models ((c) and (d)). By creating multiple parallel copies in the first few layers and sharing ADCs among multiple columns in the last few layers, the latency can be balanced among the layers, resulting in greatly reduced overall system latency (blue bars in (c), (d)).
Table 1: Optimization of the mapping leads to reduced # of ADCs and latency for both models.

Model     | # of crossbar | # of crossbar, with copies and sharing | # of ADC | # of ADC, with copies and sharing | Latency (ms) | Latency with copies and sharing (ms)
VGG-16    | 16908         | 18240                                  | 1081920  | 87168                             | 20.07        | 0.314
MobileNet | 4732          | 4900                                   | 151424   | 18848                             | 5.02         | 0.314

Table 2: System performance estimation based on 256×64 RRAM tiles in 65nm technology, and projected performance if the ADC power can be balanced with the RRAM power through scaling and circuit optimizations.

          |       | TOPS   | TOPS/W | VGG-16: Latency per image (ms) | Power (W) | Energy per image (mJ) | Area (mm²) | MobileNet: Latency per image (ms) | Power (W) | Energy per image (mJ) | Area (mm²)
65nm      | RRAM  |        |        |                                | 1.48      |                       | 50.50      |                                   | 0.34      |                       | 13.57
          | ADC   |        |        |                                | 17.43     |                       | 261.50     |                                   | 3.78      |                       | 56.54
          | Total | 111.58 | 5.90   | 0.314                          |           | 5.94                  |            | 0.314                             |           | 1.29                  |
Projected | RRAM  |        |        |                                | 1.48      |                       | 50.50      |                                   | 0.34      |                       | 13.57
          | ADC   |        |        |                                | 1.48      |                       | 50.50      |                                   | 0.34      |                       | 13.57
          | Total | 111.58 | 37.70  | 0.314                          |           | 0.93                  |            | 0.314                             |           | 0.21                  |
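Reading the flattened Table 2 values as per-model latency, power, energy per image, and area (an inferred assignment), the numbers are internally consistent: energy per image = total power × latency, and TOPS/W = TOPS / total power. A quick arithmetic check on the VGG-16 column:

```python
# Consistency check of the Table 2 numbers for VGG-16 (values from the paper;
# the column assignment is inferred from the flattened table).
p_rram, p_adc = 1.48, 17.43        # 65nm power (W): RRAM and ADC
latency_ms = 0.314                 # latency per image (ms)
energy_mj = (p_rram + p_adc) * latency_ms          # W * ms -> mJ
assert abs(energy_mj - 5.94) < 0.01                # matches Table 2
tops = 111.58
assert abs(tops / (p_rram + p_adc) - 5.90) < 0.01  # TOPS/W at 65nm
# Projected case: ADC power scaled down to match the RRAM power.
assert abs(tops / (2 * p_rram) - 37.70) < 0.01     # projected TOPS/W
```

The same arithmetic reproduces the MobileNet energies (0.34 W + 3.78 W at 0.314 ms gives 1.29 mJ), which is why the ADC power dominates the projected-improvement story.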