
A Deep Neural Network Accelerator Based on Tiled RRAM Architecture
Qiwen Wang¹·², Xinxin Wang¹·², Seung Hwan Lee¹, Fan-Hsuan Meng¹, and Wei D. Lu¹*
¹Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, USA
²These authors contributed equally to this work.
*Email: wluee@umich.edu

Abstract—State-of-the-art deep neural networks (DNNs) have been successfully mapped onto an RRAM-based tiled in-memory computing (IMC) architecture. The effects of moderate array sizes and of partial products (PPs) quantized under ADC precision constraints have been analyzed. Methods were developed to solve these challenges and to preserve both DNN accuracy and the IMC performance gains in the tiled architecture. Popular models including VGG-16 and MobileNet have been successfully implemented and tested on the ImageNet dataset.
I. INTRODUCTION

DNNs are widely used for artificial intelligence applications with considerable success. However, neural networks often come with high computational complexity and cost, as traditional computing architectures are not well optimized for DNN computation. DNN accelerators are therefore crucial in enabling wider adoption, particularly for edge use cases. Among these, accelerators based on analog IMC concepts implemented with RRAM arrays have gained increasing interest. RRAM arrays can perform vector-matrix multiplication (VMM) efficiently in the analog domain by accumulating the total current or charge at each column. At the same time, their high density and non-volatility make it possible to store entire DNN models on chip, eliminating inefficient off-chip memory access and promising much higher energy efficiency. However, prior studies on RRAM-based accelerators have focused on small networks and datasets such as MNIST [1], which may not capture the challenges faced when implementing advanced models and complex tasks (e.g. ImageNet, Fig. 1). For example, device and circuit non-idealities limit the maximum size of the crossbar array (Fig. 2), making it impractical to map a whole layer of a state-of-the-art DNN model onto a single crossbar, contrary to the assumptions in prior studies. Complex tasks such as ImageNet may also be more sensitive to quantization errors and device non-idealities. In this work, we implemented large-scale DNNs in a reconfigurable tiled RRAM architecture, where the weights of a single layer are mapped onto multiple crossbar arrays tiled through digital interfaces. This approach is scalable and practical. However, the output activations then need to be calculated by summing multiple PPs from the arrays, and the loss of information when generating PPs under limited ADC precision can severely degrade model performance. We carefully examined the performance of popular models such as VGG-16 [2] and MobileNet V1 [3] and developed effective mitigation methods that preserve model accuracy while maintaining the IMC performance gains using practical 8-bit ADCs.
II. MAPPING LARGE-SCALE DNNS ONTO TILED CROSSBAR ARRAYS

In this work, we focus on inference operations, which are expected to be prevalent on edge-type devices. Weights from pre-trained DNN models are programmed as conductance values of the RRAM devices. Since weight updates occur very infrequently, a write-verify scheme can be used during model loading to improve the accuracy of the programmed conductances. During inference, the RRAM cells operate in read mode. More specifically, we target the implementation of 8-bit models, since most state-of-the-art DNNs have been successfully implemented in 8-bit digital pipelines for inference [4]. The model mapping and simulation framework are shown in Fig. 3.
For practical DNNs, the weights of a typical layer cannot fit on a single array. We have developed methods to map common neural network layers of any size onto a tiled architecture, shown in Fig. 4. To map the signed weights, two approaches are studied: one maps positive and negative weights onto two different arrays (termed "dual array"), and the other maps each weight as the difference in conductance between two devices on two adjacent rows (termed "dual row"). For fully connected (FC) layers, the weights of each output neuron are mapped onto a column. If the number of weights is larger than the number of rows in the array, the weights are divided across multiple arrays (Fig. 7(a)). For a regular convolution (Conv) layer, each filter is flattened to a 1D vector and mapped the same way as an FC layer (Fig. 7(b)).

Beyond these common layer types, MobileNet requires depthwise convolution (DW Conv) layers, where each 2D filter corresponds to a single input channel and thus leads to poor array utilization. We propose two methods to improve the utilization rate. First, the overlap between adjacent convolution windows in the input allows filters to be mapped on adjacent columns with an offset (Fig. 7(c)), so that multiple outputs can be calculated simultaneously through the VMM operations. Second, when different filters are stored without offset, the computation for each filter needs to be carried out sequentially, since the filters act on different input channels. However, since time-multiplexing of the ADCs may already be necessary (discussed in more detail later), the different channels can be computed sequentially to improve array utilization without an overall speed penalty on the system performance.
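As a concrete illustration, the sketch below (a minimal NumPy example, not the authors' implementation; the conductance bounds and the row-interleaving convention are assumptions) tiles a signed weight matrix onto 256×64 crossbars using the dual row scheme, where each weight is the difference in conductance between two devices on adjacent rows.

```python
# Minimal sketch (not the authors' code): tile a signed weight matrix
# onto 256x64 crossbars with the "dual row" scheme. G_MIN/G_MAX and the
# interleaving convention are illustrative assumptions.
import numpy as np

ROWS, COLS = 256, 64        # crossbar size selected in the paper
G_MIN, G_MAX = 1e-6, 1e-4   # assumed min/max device conductances (S)

def dual_row_tiles(W):
    """Split an (inputs x outputs) weight matrix into conductance tiles."""
    scale = np.abs(W).max()
    g_pos = np.where(W > 0,  W, 0.0) / scale * (G_MAX - G_MIN) + G_MIN
    g_neg = np.where(W < 0, -W, 0.0) / scale * (G_MAX - G_MIN) + G_MIN
    # Interleave so each weight's +/- devices sit on adjacent rows of the
    # same column; ROWS is even, so a pair never straddles two tiles.
    g = np.empty((2 * W.shape[0], W.shape[1]))
    g[0::2], g[1::2] = g_pos, g_neg
    return [g[r:r + ROWS, c:c + COLS]
            for r in range(0, g.shape[0], ROWS)
            for c in range(0, g.shape[1], COLS)], scale

# e.g. a flattened 3x3x128 Conv filter bank with 128 output channels
tiles, scale = dual_row_tiles(np.random.randn(1152, 128))
print(len(tiles))  # ceil(2304/256) * ceil(128/64) = 18 tiles
```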

To perform VMM, all input rows and all output columns of the arrays are simultaneously activated. The input activations are represented in bit-serial format as fixed read-voltage pulses. The VMM output represents the PP and is read out and quantized by an 8-bit ADC. The PPs are then added together in the digital domain to produce the neuron output activations (Fig. 5).
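A minimal sketch of this bit-serial read follows, assuming ideal devices, a unit read voltage, and a single fixed ADC full-scale current per array (all assumptions for illustration): one input bit is applied per read pulse, each column current is quantized, and a digital shift-and-add reassembles the PP.

```python
# Sketch of the bit-serial VMM read with 8-bit PP quantization.
import numpy as np

ADC_BITS, ACT_BITS = 8, 8

def adc(i_col, i_fs):
    """Quantize column currents to 2**ADC_BITS levels over [0, i_fs]."""
    code = np.round(np.clip(i_col / i_fs, 0.0, 1.0) * (2**ADC_BITS - 1))
    return code * i_fs / (2**ADC_BITS - 1)

def bit_serial_vmm(g_tile, acts, i_fs):
    """acts: unsigned ACT_BITS-bit integers, one per row of the tile."""
    pp = np.zeros(g_tile.shape[1])
    for b in range(ACT_BITS):
        bits = (acts >> b) & 1             # one input bit per read pulse
        i_col = bits @ g_tile              # analog column currents
        pp += adc(i_col, i_fs) * (1 << b)  # digital shift-and-add
    return pp  # this array's PP; PPs from all tiles are summed digitally

g = np.random.uniform(1e-6, 1e-4, (256, 64))
x = np.random.randint(0, 2**ACT_BITS, size=256)
pp = bit_serial_vmm(g, x, i_fs=256 * 1e-4)
```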
III. ADC QUANTIZATION EFFECTS

The summation of many intermediate low-precision PPs can cause significant loss of information and deteriorate network performance. We selected MobileNet and VGG-16 as examples to study this effect. For VGG-16, we quantized the pre-trained floating-point model to 8-bit and found that this simple quantization does not degrade accuracy. VGG-16 is a standard large-scale model for image classification; it is over-parameterized and thus not very sensitive to errors in computation. MobileNet, on the other hand, contains a much smaller number of parameters and is much more sensitive to computation errors. For MobileNet, we therefore used a quantization-aware-trained 8-bit model [5]. Afterwards, the models are mapped onto the tiled RRAM crossbars (Fig. 6). We first tested the case where all the 8-bit ADCs are configured identically. Fig. 8 shows that while the PPs in VGG-16 can be directly quantized to 8-bit with minimal accuracy loss for practical array sizes, MobileNet fails miserably. To recover the network accuracy, 12-bit ADCs are needed to quantize the PPs for MobileNet (Fig. 10(a)), which is impractical.
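For reference, a simple post-training quantization of this kind might look as follows; a per-tensor symmetric scheme is assumed here, since the paper does not specify the exact scheme (and MobileNet instead uses the quantization-aware-trained model of [5]).

```python
# Sketch of simple post-training 8-bit weight quantization
# (per-tensor symmetric scaling is an assumption, not the paper's spec).
import numpy as np

def quantize_weights_8bit(w):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale  # dequantize as q * scale

q, s = quantize_weights_8bit(np.random.randn(4096, 1000))
```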
To investigate the cause of MobileNet's poor performance, we inspected the distribution of the maximum output currents of each column in all crossbar arrays. We found that the output current distribution for MobileNet is much more spread out, with a large number of low-value columns, whereas the distribution for VGG-16 is more concentrated (Fig. 9). The large range of maximum currents means that columns with small PP values poorly utilize the ADC dynamic range, causing large errors in the final neuron activations. To mitigate this issue, configurable ADCs with multiple ranges can be used for different columns, depending on the output current distribution. Since only the front-end current-sensing circuit needs to be changed, this configurability can be achieved with little area and power penalty. The approach is feasible for inference since the output distributions are known for a given model. The PPs from the ADCs are then rescaled and accumulated to obtain the final output (Fig. 5). With this approach, we can recover the MobileNet model performance with 8 ADC ranges (Fig. 10(b)).
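A sketch of this calibration is shown below. The octave-spaced candidate ranges and the profiling procedure are illustrative assumptions, but the sketch captures the idea: assign one of a few full-scale ranges to each column from its known maximum output current, then rescale the quantized PP before accumulation.

```python
# Sketch of per-column ADC range assignment from profiled maximum
# column currents (candidate spacing is an assumption; the paper uses
# 8 ranges for MobileNet and 4 for VGG-16).
import numpy as np

ADC_BITS = 8

def assign_ranges(col_max, n_ranges=8):
    """Pick, per column, the smallest of n_ranges candidate full-scale
    currents that still covers the column's profiled maximum."""
    cand = col_max.max() / 2.0 ** np.arange(n_ranges)  # descending
    fs = np.empty_like(col_max)
    for j, m in enumerate(col_max):
        fits = cand[cand >= m]
        fs[j] = fits.min() if fits.size else cand[0]
    return fs

def quantize_and_rescale(i_col, fs):
    """Per-column quantization; the code-to-current rescale is one
    digital multiply per PP, fused with the model scales (Fig. 5)."""
    code = np.round(np.clip(i_col / fs, 0.0, 1.0) * (2**ADC_BITS - 1))
    return code * fs / (2**ADC_BITS - 1)

col_max = np.random.lognormal(0.0, 1.0, 64) * 1e-6  # profiled maxima
fs = assign_ranges(col_max)
pp = quantize_and_rescale(col_max * np.random.rand(64), fs)
```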
IV. DEVICE NON-IDEALITY EFFECTS

RRAM non-idealities, as shown in Fig. 2, can also degrade DNN performance. As the RRAM on/off ratio cannot be infinite, the zeros in the weights, which are mapped to the high-resistance state (HRS), still contribute a small but non-zero output current during current accumulation. This causes the obtained PP to deviate from the ideal result. Additionally, device conductance variation causes errors in the accumulated currents. To evaluate the influence of these effects, we simulated VGG-16 and MobileNet with on/off ratios of 10, 100 and 1000 and 2% device variation. The simulations were performed with the proposed 8-bit ADC approach; 4 and 8 ADC ranges are utilized for the VGG-16 and MobileNet simulations, respectively.

As Fig. 10(c) shows, accuracies drop significantly with the conventional dual array mapping due to the finite-HRS effects. For example, in DW Conv layers, more than 96% of devices are in the HRS in the 256×64 array, and the input-dependent offset can introduce significant errors in the PPs and destroy the accuracy when the on/off ratio is not very high. The dual row approach can effectively mitigate this offset issue and leads to satisfactory performance at a moderate on/off ratio of 100, in the presence of reasonable device variations (2%).
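The sketch below illustrates how these two non-idealities can be injected into a simulated conductance map (the conductance values, variation model, and random seed are illustrative assumptions). Note that with dual row mapping the HRS leakage is subtracted in the analog column current before quantization, whereas with dual array mapping each PP is quantized with the input-dependent offset still included.

```python
# Sketch of injecting finite on/off ratio (HRS "zeros" still conduct)
# and Gaussian conductance variation into a conductance map.
import numpy as np

def nonideal_g(w01, g_on=1e-4, on_off=100, sigma=0.02,
               rng=np.random.default_rng(0)):
    """w01: weights normalized to [0, 1]; returns noisy conductances."""
    g_off = g_on / on_off                 # finite-HRS leakage floor
    g = g_off + w01 * (g_on - g_off)
    return g * (1.0 + sigma * rng.standard_normal(g.shape))

w = np.random.randn(256, 64)
wp = np.maximum(w, 0) / np.abs(w).max()   # positive-device map
wn = np.maximum(-w, 0) / np.abs(w).max()  # negative-device map
x = np.random.rand(256)                   # input activations
pp = x @ nonideal_g(wp) - x @ nonideal_g(wn)  # differential read:
# the g_off contributions largely cancel before the ADC
```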
V. SYSTEM PERFORMANCE

We then analyzed the system performance using known parameters in a 65nm technology. The area of each RRAM cell is 1.69×10⁻⁷ mm², and the highest current during inference is 3μA per cell at 0.3V [6]. To perform the VMM, we synthesized a bit-serial ADC that uses current as its input signal; results from each bit are then shifted and added. The area and power consumption of each ADC are estimated to be 3×10⁻³ mm² and 2×10⁻⁴ W at an operating frequency of 5MHz [6]. Based on these analyses, each ADC can fit under two columns of the 256×64 array. The numbers of ADCs required for MobileNet and VGG-16 are shown in Fig. 11. The 256×64 crossbar enables higher TOPS and more analog operations per ADC than 128×128 and smaller crossbars, while the area utilization may be lower for arrays with more than 64 columns considering the number of output channels in these models, so 256×64 crossbars are selected for implementation.

With pipelining, the total latency of the network is determined by the slowest layer. The latencies of each layer in MobileNet and VGG-16 are shown in Fig. 11(c) and (d). With simple mapping, the first several layers are very slow and the last several layers are extremely fast. To increase throughput, the latencies should be balanced among layers. This can be achieved by creating parallel copies of the slow layers (e.g. the first few layers) to speed them up. On the other hand, since the FC layers and the last few Conv layers are extremely fast, ADCs can be time-multiplexed among columns there to reduce the number of required ADCs. A summary of the optimized mapping is shown in Table 1: both the total latency and the number of ADCs needed are significantly improved.

Performance numbers of this system were calculated and are shown in Table 2. In the 65nm design, the power consumption is dominated by the ADCs, leading to an overall power efficiency of 5.9 TOPS/W for the system, where each OP is defined as an 8-bit MAC operation. If the ADC power is balanced with the RRAM power by using a more advanced technology (e.g. 14nm) and through further optimizations [7], the system can achieve a power efficiency of 37.7 TOPS/W. More importantly, since all weights are stored on-chip, off-chip DRAM access is no longer necessary, and the high energy efficiency is maintained end-to-end during model operation, leading to very low energy per image (0.94mJ for VGG-16 and 0.21mJ for MobileNet).
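As a sanity check, the efficiency and energy figures follow directly from the power and latency values in Table 2; the back-of-envelope recomputation below uses the table's VGG-16 numbers.

```python
# Back-of-envelope check of Table 2 (VGG-16 column; each OP is an
# 8-bit MAC, as defined in the paper).
p_rram, p_adc = 1.48, 17.43        # W, 65nm design
latency, tops = 0.314e-3, 111.58   # s per image, tera-OPs/s
print(tops / (p_rram + p_adc))            # ~5.9 TOPS/W
print((p_rram + p_adc) * latency * 1e3)   # ~5.94 mJ per image
p_adc = p_rram                     # projected: ADC power = RRAM power
print(tops / (p_rram + p_adc))            # ~37.7 TOPS/W
print((p_rram + p_adc) * latency * 1e3)   # ~0.93 mJ per image
```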
VI. ACKNOWLEDGEMENTS

This work was supported in part by SRC and DARPA through the Applications Driving Architectures (ADA) Research Center, and by Applied Materials.

REFERENCES
[1] P. Y. Chen, et al., IEDM, pp. 6-1, 2017.
[2] K. Simonyan, et al., arXiv preprint, 2014.
[3] A. G. Howard, et al., arXiv preprint, 2017.
[4] R. Krishnamoorthi, arXiv preprint, 2018.
[5] B. Jacob, et al., CVPR, pp. 2704-2713, 2018.
[6] Datasheet from vendor.
[7] M. Zidan, et al., Nature Electronics, p. 411, 2018.

Fig. 2. (a) SEM image of a typical RRAM crossbar array. (b) Device non-idealities include finite on/off ratio and variability.

Fig. 1. Different types of layers in a DNN: fully connected layers (a), convolution layers (b), and depthwise convolution layers (c).

Fig. 5. Quantized partial product pipeline that takes into account weight and activation quantization. Scaling of the quantized PPs is fused with the scaling of the quantized 8-bit digital model, so only one digital multiplication operation is needed for each PP, without additional overhead.
Fig. 3. Simulation framework of the tiled RRAM accelerator. Users can map their own neural networks by calling different layer functions and initialize the weights by calling different weight mapping functions through the Crossbar class. The activation mapping functions are used in the layer functions to map input activations into vectors, which are stored in the Activation class. Right: classification results of an input from the ImageNet database by the implemented MobileNet model.

Fig. 4. Tiled architecture. As deep neural networks contain a large number of parameters, a single layer may need to be mapped onto multiple crossbar arrays. During inference, each array produces a PP. The PPs are quantized to 8-bit and stored in SRAM buffers. The PPs in the same channel are then added up digitally to produce the output neuron activation.

Fig. 6. Number of crossbar arrays needed for each layer after mapping MobileNet (a) and VGG-16 (b), with the dual array and dual row mapping approaches (crossbar size 256×64; number of crossbars on a log scale versus layer index).

14.4.3 IEDM19-320
Authorized licensed use limited to: University of Newcastle. Downloaded on May 30,2020 at 08:21:50 UTC from IEEE Xplore. Restrictions apply.
Fig. 7. Weight mapping of FC layers (a), Conv layers (b) and DW Conv layers (c). The weights in FC layers are divided into different blocks to fit the size of the crossbar arrays. The Conv kernels are first reshaped to a 1D vector and then mapped in the same way as the FC weights. In the mapping of the DW Conv kernels, two parameters, duplicate and time multiplex, can help regulate the mapping to improve the array utilization and the computation time.

Fig. 8. Model accuracies, comparing the floating-point and 8-bit digital models reported by Google with the 8-bit models mapped on the tiled architecture using only one 8-bit ADC range (bar values: 71.5, 71 and 70.9% for VGG-16; 70.1, 68.59 and 0.32% for MobileNet).

Fig. 9. Partial product range distribution. The PP distributions per column are shown in (a) (for MobileNet) and (b) (for VGG-16); the horizontal axes give the range in µA.

Fig. 10. Simulation results for (a) a single ADC range with different ADC precisions, (b) an 8-bit ADC with an arbitrary range per layer (black and blue dots) and with different ADC ranges per column (red and green lines), and (c) an 8-bit ADC with 2% device variation for different on/off ratios and the two mapping methods; 8 and 4 ADC ranges are used for MobileNet and VGG-16, respectively, in (c).
Fig. 11. Number of ADCs required for the different layers of MobileNet (a) and VGG-16 (b), and latencies of the different layers for the two models ((c) and (d)), comparing 128×128 and 256×64 crossbars with and without parallel copies and ADC sharing. By creating multiple parallel copies of the first few layers and sharing ADCs among multiple columns in the last few layers, the latency can be balanced among the layers, greatly reducing the overall system latency (blue bars in (c) and (d)).
Table 1: Optimization of the mapping leads to a reduced number of ADCs and reduced latency for both models.

Model     | # of crossbars | # of crossbars (with copies and sharing) | # of ADCs | # of ADCs (with copies and sharing) | Latency (ms) | Latency (ms, with copies and sharing)
VGG-16    | 16908          | 18240                                    | 1081920   | 87168                               | 20.07        | 0.314
MobileNet | 4732           | 4900                                     | 151424    | 18848                               | 5.02         | 0.314

Table 2: System performance estimation based on 256×64 RRAM tiles in 65nm technology, and projected performance if the ADC power can be balanced with the RRAM power through scaling and circuit optimizations.

Design    | Component | Power (W), VGG-16 | Area (mm²), VGG-16 | Power (W), MobileNet | Area (mm²), MobileNet
65nm      | RRAM      | 1.48              | 50.50              | 0.34                 | 13.57
65nm      | ADC       | 17.43             | 261.50             | 3.78                 | 56.54
Projected | RRAM      | 1.48              | 50.50              | 0.34                 | 13.57
Projected | ADC       | 1.48              | 50.50              | 0.34                 | 13.57

Design    | Latency per image (ms) | TOPS   | TOPS/W | Energy per image (mJ), VGG-16 | Energy per image (mJ), MobileNet
65nm      | 0.314                  | 111.58 | 5.90   | 5.94                          | 1.29
Projected | 0.314                  | 111.58 | 37.70  | 0.93                          | 0.21
