
Deep Learning Acceleration using Digital-based Processing In-Memory

Mohsen Imani†, Saransh Gupta*, Yeseong Kim‡, Tajana Rosing*
†Department of Computer Science, UC Irvine
‡Department of Information and Communication Engineering, DGIST
*Department of Computer Science and Engineering, UC San Diego

Abstract—Processing In-Memory (PIM) has shown great potential to accelerate inference tasks of Convolutional Neural Networks (CNNs). However, existing PIM architectures do not support high-precision computation, e.g., floating-point precision, which is essential for training accurate CNN models. In addition, most of the existing PIM approaches require analog/mixed-signal circuits, which do not scale, and exploit insufficiently reliable multi-bit Non-Volatile Memory (NVM). In this paper, we propose FloatPIM, a fully-digital scalable PIM architecture that accelerates CNNs in both training and testing phases. FloatPIM natively supports floating-point representation, thus enabling accurate CNN training. FloatPIM also enables fast communication between neighboring memory blocks to reduce internal data movement of the PIM architecture. We break the CNN computation into computing and data transfer modes. In computing mode, all blocks process a part of CNN training/testing in parallel, while in data transfer mode FloatPIM enables fast and row-parallel communication between neighboring blocks. Our evaluation shows that FloatPIM training is on average 303.2x and 48.6x (4.3x and 15.8x) faster and more energy efficient as compared to a GTX 1080 GPU (the PipeLayer [1] PIM accelerator).

I. INTRODUCTION

Artificial neural networks, in particular deep learning [2], [3], have a wide range of applications in diverse areas. Processing CNNs in conventional von Neumann architectures is inefficient as these architectures have separate memory and computing units. Processing in-memory (PIM) is a promising solution to address the data movement issue [4]. Prior works exploited digital PIM operations to accelerate different applications such as DNNs [5]-[10], brain-inspired computing [11]-[13], object recognition [14], graph processing [15], [16], and database applications [17]-[19]. ISAAC [20] and PRIME [21] exploit analog characteristics of non-volatile memory to support matrix multiplication in memory. These architectures transfer the digital input data into the analog domain and pass the analog signal through a crossbar ReRAM to compute the matrix multiplication. The matrix values are stored as multi-bit memristors in a crossbar memory. Although these PIM-based designs presented superior efficiency, there are several limitations when using PIM for CNN training. First, the precision of the design is bounded to fixed-point precision as determined by the number of multi-bit memristors used to represent a value. However, CNN models often need to be trained with floating-point precision to achieve high classification accuracy [22]. For example, GoogleNet trained with 32-bit fixed-point values achieves 3% lower classification accuracy than the one trained with 32-bit floating point. In addition, earlier work showed that, without enough precision, the model training is likely to diverge or provide low accuracy [22]. Most commercial CNN accelerators train their models using floating-point precision, e.g., bfloat16. The bfloat16 format is a half-precision floating-point format utilized in many AI processors.
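To make the precision trade-off concrete, the following minimal NumPy sketch (added here for illustration; it is not part of the FloatPIM design, and the helper name to_bfloat16 is hypothetical) emulates bfloat16 by truncating a float32 word to its upper 16 bits, which keeps the 8-bit exponent (dynamic range) while retaining only 7 mantissa bits:

```python
# Illustrative only: bfloat16 keeps float32's sign and 8-bit exponent but
# truncates the mantissa from 23 to 7 bits, so it can be emulated by zeroing
# the 16 least-significant bits of a float32 word (round-toward-zero).
import numpy as np

def to_bfloat16(x):
    """Truncate float32 values to bfloat16 precision."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

w = np.array([0.1234567, 3.1415927], dtype=np.float32)
print(w, to_bfloat16(w))   # same dynamic range, roughly 2-3 decimal digits kept
```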
Another limitation is that the state-of-the-art PIM-based designs utilize costly digital-to-analog (DAC) and analog-to-digital (ADC) converter blocks. For example, recent work in [23] designed an analog memristive accelerator to support floating-point operations. However, the mixed-signal ADC/DAC blocks take the majority of the chip area and power, e.g., 98% of the total area and 89% of the total power, and do not scale as fast as the CMOS technology does [20], [24], [25]. In addition, prior PIM designs use multi-bit memristor devices that are not sufficiently reliable for commercialization, unlike commonly-used single-level NVMs, e.g., Intel 3D XPoint. Their very expensive write operations occur frequently during training. For example, the works in [26]-[28] extend the application of analog crossbar memory to accelerate training, but they still rely on expensive converter units and multi-bit devices. PipeLayer [1] modified the ISAAC [20] pipeline architecture and uses spike-based inputs to eliminate the ADC and DAC blocks. However, the computation of PipeLayer still happens on the converted data, and its precision is limited to fixed-point operations.

In this paper, we propose FloatPIM, a novel high-precision PIM architecture which significantly accelerates CNNs in both training and testing with the floating-point representation. FloatPIM directly supports floating-point representations, thus enabling high-precision CNN training and testing. To the best of our knowledge, FloatPIM is the first PIM-based CNN training architecture that exploits analog properties of the memory without explicitly converting data into the analog domain. FloatPIM is flexible in that it works with floating-point as well as fixed-point precision. We introduce several key design features that optimize the CNN computations in PIM designs. FloatPIM breaks the computation into computing and data transfer phases. In the computing mode, all blocks work in parallel to compute the matrix multiplication and convolution tasks. We evaluate the efficiency of FloatPIM on popular large-scale networks with comparisons to the state-of-the-art solutions. Our evaluation shows that FloatPIM in training can achieve 303.2x and 48.6x (4.3x and 15.8x) speedup and energy efficiency as compared to the state-of-the-art GPU (the PipeLayer PIM accelerator [1]).

II. FLOATPIM OVERVIEW

In this paper, we propose a digital and scalable processing in-memory architecture (FloatPIM), which accelerates CNNs in both training and testing phases with precise floating-point computations. Figure 1a shows the overview of the FloatPIM architecture consisting of multiple crossbar memory blocks.

Fig. 1. Overview of FloatPIM: (a) FloatPIM memory blocks with computing mode and data transfer mode, (b) computation in memory across neighboring blocks connected by switches, and (c) row-parallel operation in each block (row-parallel floating-point operations, row-parallel NOR-based activation/pooling, and row-parallel data transfer).

Fig. 2. Overview of CNN training: (a) feed-forward, (b) error backward, and (c) weight update.

As an example, Figure 1b shows how three adjacent layers are mapped to the FloatPIM memory blocks to perform the feed-forward computation. Each memory block represents a layer and stores the data used in either testing (i.e., weights) or training (i.e., weights, the output of each neuron before activation, and the derivative of the activation function (g')), as shown in Figure 1c. With the stored data, FloatPIM operates in two phases: (i) the computing phase and (ii) the data transfer phase. During the computing phase, all memory blocks work in parallel, where each block processes an individual layer using PIM operations. Then, in the data transfer phase, the memory blocks transfer their outputs to the blocks corresponding to the next layers, i.e., to proceed with either the feed-forward or back-propagation. The switches shown in Figure 1b control the data transfer flows.

In Section III, we present how each FloatPIM memory block performs the CNN computations for a layer. The block supports in-memory operations for the key CNN computations, including vector-matrix multiplication, convolution, and pooling (Section III-A). We also support activation functions like ReLU and Sigmoid in memory. MIN/MAX pooling operations are implemented using in-memory search operations. Our proposed design optimizes each of the basic operations to provide high performance. For example, for the convolution, which requires shifting convolution kernels across different parts of an input matrix, we design shifter circuits that allow accessing weight vectors across different rows of the input matrix. The feed-forward step is performed entirely inside memory by executing the basic PIM operations (Section III-B). FloatPIM also performs all the computations of the back-propagation with the same key operations and hardware as the ones used in the feed-forward (Section III-C).

We also describe how the memory blocks compose the entire FloatPIM architecture. FloatPIM further accelerates the feed-forward and back-propagation by fully utilizing the parallelism provided in the PIM architecture, e.g., row/block-parallel PIM operations. We show how these tasks can be parallelized for both feed-forward and back-propagation across a batch, i.e., multiple inputs at a time, using multiple data copies pre-stored in different blocks in memory.
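As a rough illustration of this two-phase execution model, the following Python sketch (added for clarity and not taken from the paper; the names pipeline, blocks, and stages are illustrative abstractions) alternates a computing phase, in which every block holding data evaluates its layer, with a data transfer phase, in which each block hands its result to the neighboring block:

```python
# Illustrative sketch (not the FloatPIM implementation): pipelined execution of
# a stream of inputs over per-layer memory blocks. In each computing phase every
# occupied block processes its layer "in parallel"; in each data transfer phase
# every block passes its result to the neighboring block.

def pipeline(blocks, inputs):
    stages = [None] * len(blocks)        # data currently held by each block
    outputs = []
    stream = list(inputs)
    while stream or any(s is not None for s in stages):
        # Computing phase: all occupied blocks process their layer.
        computed = [blocks[i](s) if s is not None else None
                    for i, s in enumerate(stages)]
        # Data transfer phase: shift results toward the next block / output.
        if computed[-1] is not None:
            outputs.append(computed[-1])
        stages = [stream.pop(0) if stream else None] + computed[:-1]
    return outputs

layers = [lambda v: 2 * v, lambda v: v + 1, lambda v: max(v, 0.0)]
print(pipeline(layers, [1.0, -2.0, 3.0]))   # -> [3.0, 0.0, 7.0]
```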
III. CNN COMPUTATION IN FLOATPIM

In this section, we show how a FloatPIM memory block performs the training/testing task of a single CNN layer. (Note that with only the feed-forward step, FloatPIM supports the testing task, i.e., inference, where an input data point is processed through the different CNN layers.) Figure 2 shows a high-level illustration of the training procedure of a fully-connected layer in FloatPIM. CNN training has two steps: feed-forward and back-propagation. During the feed-forward step, FloatPIM processes the input data in a pipelined manner. For each data point, FloatPIM stores two intermediate neuron values: (i) the output of each neuron after the activation function (Z_i) and (ii) the gradient of the activation function for the accumulated results (g'(a_i)). In the back-propagation step, FloatPIM first measures the loss function in the last output layer and accordingly updates the weights of each layer using the intermediate values stored during the feed-forward step. As Figures 2b and 2c show, the error sequentially propagates backward and updates the weights of the previous layer.

CNNs use similar operations for both the fully-connected and convolution layers. In the feed-forward step, the CNN computations are vector-matrix multiplication for the fully-connected layers and the convolution operation for the convolution layers. In the back-propagation, FloatPIM uses the same vector-matrix multiplication to update the weights of the fully-connected layers, while the weights of the convolution layers are updated using the in-memory vector-matrix multiplication and convolution. In the next subsection, we first describe how FloatPIM supports the basic testing/training operations of a single CNN layer in digital PIM.

A. Building Blocks of CNN

Vector-Matrix Multiplication: One of the key operations of CNN computation is vector-matrix multiplication. The vector-matrix multiplication is accomplished by multiplications of the stored inputs and weights, followed by additions that accumulate the results of the multiplications. Figure 3a shows an example of the vector-matrix multiplication. The in-memory operations on digital data can be performed in a row-parallel way, by applying NOR-based operations to data located in different columns. Thus, the input-weight multiplication can be processed by the row-parallel PIM operation. In contrast, the subsequent addition cannot be done in the row-parallel way, as its operands are located in different rows. This hinders achieving the maximum parallelism that the digital PIM operations offer.
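Because the digital PIM operations above are built on the NOR primitive, any bit-level arithmetic can in principle be composed from it. The sketch below is a purely software analogy (not the FloatPIM circuit, and all function names are illustrative) showing that NOT, OR, AND, XOR, and a 1-bit full adder can be derived from a single NOR function:

```python
# Illustrative only: NOR is functionally complete, so the bitwise arithmetic used
# by digital PIM (e.g., addition of column-aligned operands) can be expressed with
# NOR gates alone. This is a software analogy, not the FloatPIM gate-level design.

def NOR(a, b):
    return 1 - (a | b)

def NOT(a):    return NOR(a, a)
def OR(a, b):  return NOT(NOR(a, b))
def AND(a, b): return NOR(NOT(a), NOT(b))
def XOR(a, b): return AND(OR(a, b), NOT(AND(a, b)))

def full_adder(a, b, cin):
    """1-bit full adder built only from the NOR-derived gates above."""
    s = XOR(XOR(a, b), cin)
    cout = OR(AND(a, b), AND(cin, XOR(a, b)))
    return s, cout

print(full_adder(1, 1, 0))   # -> (0, 1)
```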

Fig. 3. Vector-matrix multiplication: (a) vector-matrix multiplication and (b) PIM-compatible vector-matrix multiplication.

Figure 3b shows how our design implements row-parallel operations by locating the data in a PIM-compatible manner. FloatPIM stores multiple copies of the input vector horizontally and the transposed weight matrix (W^T) in memory. FloatPIM first performs the multiplication of the input columns with each corresponding column of the weight matrix. The multiplication result is written into another column of the same memory block. Finally, FloatPIM accumulates the stored multiplication results column-wise with multiple PIM addition operations into another column.

FloatPIM enables the multiplication and accumulation to be performed independently of the number of rows. Let us assume that each multiplication and addition take T_Mult and T_Add latency, respectively. Then we require M x T_Mult and N x T_Add latencies to perform the multiplication and accumulation, respectively, where the size of the weight matrix is M by N.
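A functional sketch of this data layout follows (an illustration added here, not the paper's exact mapping; the function name pim_vecmat and the latency constants t_mult and t_add are placeholders). The input vector is replicated across rows, each weight column is multiplied element-wise in a row-parallel fashion, and the per-column products are then accumulated:

```python
# Illustrative model of the PIM-compatible vector-matrix multiplication layout.
# Rows hold copies of the input; the transposed weight matrix is stored alongside;
# products are formed row-parallel and then accumulated per output column.
import numpy as np

def pim_vecmat(x, W, t_mult=1, t_add=1):
    """x: input vector (M,), W: weight matrix (M, N). Returns (y, latency)."""
    M, N = W.shape
    block = np.tile(x, (N, 1))          # N rows, each a copy of the input vector
    products = block * W.T              # row-parallel multiply against W^T rows
    y = products.sum(axis=1)            # column-wise PIM additions per output
    latency = M * t_mult + N * t_add    # latency model quoted in the text
    return y, latency

x = np.array([1.0, 2.0, 3.0])
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # M = 3, N = 2
print(pim_vecmat(x, W))                 # -> (array([4., 5.]), 5)
```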
Fig. 4. Convolution operation: (a) convolution, (b) PIM-compatible convolution, and (c) barrel shifter.

Convolution: As shown in Figure 4a, the convolution layer consists of many multiplications, where a shared weight kernel shifts over and multiplies with an input matrix. A naive way to implement the convolution is to write all the partial convolutions for each window movement by reading and writing the convolution weights repeatedly in memory. However, this method has a high performance overhead in PIM, since non-volatile memories (NVMs) have slow write operations.

FloatPIM addresses this issue by replacing the convolution with lightweight interconnect logic for the multiplication operation. Figure 4b illustrates the proposed method, which consists of two parts. (i) It writes all convolution weights in a single row and then copies them to other rows using the row-parallel write operation, which takes just two cycles. This enables the input values to be multiplied with any convolution weights stored in another column. (ii) It exploits a configurable interconnect to virtually model the shift procedure of the convolution kernel. This interconnect is a barrel shifter which connects two parts of the same memory. Figure 4c shows, as an example, the structure of a barrel shifter that provides a 3-bit shift operation. Depending on the Bs control signals, the barrel shifter connects different {b_1, ..., b_8} bits to {b'_1, ..., b'_8}. The number of required shift operations depends on the size of the convolution window. For the example shown in Figure 4b, with a 2 x 2 convolution window, the barrel shifter supports a single shift operation using the Bs = 0 or Bs = 1 control signal. Similarly, for an n x n convolution kernel, the number of shift operations is n - 1. FloatPIM supports up to a 7 x 7 convolution kernel, which covers all the tested popular CNN structures. Note that FloatPIM can also support n larger than 7 by rewriting shifted input matrices into other columns.
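The following NumPy sketch (a functional analogy added for clarity; it models the shift-based access pattern in software rather than the actual barrel-shifter wiring, and conv2d_via_shifts is a hypothetical helper) shows how a convolution can be evaluated as a sum of shifted element-wise products, so the kernel never has to be rewritten for each window position:

```python
# Functional analogy of PIM-compatible convolution: instead of rewriting the
# kernel for every window, the input is (virtually) shifted for an n x n kernel
# and multiplied element-wise; the partial products are then summed.
import numpy as np

def conv2d_via_shifts(x, k):
    """Valid 2-D convolution (cross-correlation form) of x with kernel k."""
    n, m = k.shape
    H, W = x.shape
    out = np.zeros((H - n + 1, W - m + 1))
    for dy in range(n):
        for dx in range(m):
            # Shifted view of the input, scaled by one kernel weight.
            out += k[dy, dx] * x[dy:dy + out.shape[0], dx:dx + out.shape[1]]
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[1.0, 0.0], [0.0, 1.0]])       # 2 x 2 kernel -> one shift level
print(conv2d_via_shifts(x, k))               # matches a direct sliding-window conv
```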
Row-Parallel Write: Both vector-matrix multiplication and convolution require copying the input or weight vectors into multiple rows. Since writing multiple rows sequentially would degrade performance, FloatPIM supports a row-parallel write operation that writes the same value to all rows in only two cycles. In the first cycle, the block activates all columns containing "1" by connecting the corresponding bitlines to the V_SET voltage, while the row driver sets the wordlines of the destination rows to zero. This writes 1s into all the selected memory cells at the same time. In the second cycle, the column driver connects only the bitlines that carry a "0" bit to the zero voltage, while the row driver sets the wordlines to V_RESET. This writes the input to all memory rows.
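A behavioral sketch of this two-cycle broadcast write follows (added for illustration; the set/reset voltages are abstracted into two passes over a plain bit matrix, and row_parallel_write is a hypothetical name):

```python
# Behavioral model of the two-cycle row-parallel write (illustrative only).
# Cycle 1: columns whose source bit is 1 are driven to SET -> those cells become 1.
# Cycle 2: columns whose source bit is 0 are driven to RESET -> those cells become 0.

def row_parallel_write(array, row_ids, bits):
    """Broadcast `bits` into every row in `row_ids` of a 2-D list `array`."""
    # Cycle 1: set phase for all '1' columns of every selected row.
    for r in row_ids:
        for c, bit in enumerate(bits):
            if bit == 1:
                array[r][c] = 1
    # Cycle 2: reset phase for all '0' columns of every selected row.
    for r in row_ids:
        for c, bit in enumerate(bits):
            if bit == 0:
                array[r][c] = 0
    return array

mem = [[0] * 4 for _ in range(3)]
print(row_parallel_write(mem, row_ids=[0, 1, 2], bits=[1, 0, 1, 1]))
# -> [[1, 0, 1, 1], [1, 0, 1, 1], [1, 0, 1, 1]]
```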
MAX/MIN Pooling: The goal of a MAX (MIN) pooling layer is to find the maximum (minimum) value among the neuron outputs of the previous layer. To implement pooling in memory, we use a crossbar memory with the capability of searching for the nearest value. Work in [18] exploited different supply voltages to give weight to different bitlines and enable the nearest-search capability. Using this hardware, we implement MAX pooling by searching for the value that has the nearest similarity to the largest possible value. Similarly, MIN pooling can be implemented by searching for the memory row that has the closest distance to the minimum possible value. Since the values are floating point, the search happens in two phases. First, we find the values with the highest exponent; then, among the values with the same maximum exponent, we search for the value with the largest mantissa.
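As a plain-software analogy of this two-phase search (added here for clarity; the bit-field extraction assumes the IEEE-754 float32 layout and positive values, not a FloatPIM-specific format), MAX pooling can be modeled as first filtering by exponent and then comparing mantissas:

```python
# Illustrative two-phase maximum search over floating-point values:
# phase 1 keeps the candidates with the largest exponent, phase 2 picks the
# largest mantissa among them (positive IEEE-754 float32 values assumed).
import struct

def fields(x):
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return exponent, mantissa

def max_pool(values):
    exps = [fields(v)[0] for v in values]
    top_exp = max(exps)                                   # phase 1: exponent search
    candidates = [v for v, e in zip(values, exps) if e == top_exp]
    return max(candidates, key=lambda v: fields(v)[1])    # phase 2: mantissa search

print(max_pool([1.5, 3.25, 3.75, 0.1]))   # -> 3.75
```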
B. Feed-Forward Acceleration

There are three major types of CNN layers: fully-connected, convolution, and pooling layers. For each of the three layer types, we exploit a different data allocation mechanism to enable high parallelism and perform the computation tasks with minimal internal data movement. For the fully-connected layer, the main computation is vector-matrix multiplication. The CNN weights (W_ij) are stored as a matrix in memory and multiplied with the input vector stored in a different column. This multiplication and addition can happen between the memory columns using the same approach we introduced for the PIM-compatible vector-matrix multiplication. The convolution is another commonly used operation in deep neural networks, which is implemented using the PIM-compatible convolution hardware.

C. Back-Propagation Acceleration

Figure 5 shows the CNN training phases in a fully-connected layer: (i) error backward, where the error propagates through the different CNN layers, and (ii) weight update, which calculates the new CNN weights depending on the propagated error.

Fig. 5. Back-propagation of FloatPIM: (a) data stored during feed-forward, (b) error backward, and (c) weight update.

Fully-Connected Layer: Figure 5 shows the overview of the DNN operations used to update the error vector (δ). Figure 5a shows the layout of the pre-stored values in each memory block needed to perform the back-propagation. Each memory block stores the weights, the outputs of the neurons (Z), and the derivatives of the activation (g'(a)) for its layer. During the back-propagation, the δ vector is the only input to each memory block. The error vector propagates backward through the network. The error backward starts by multiplying the weights of the j-th layer (W_jk) with the δ_k error vector. To enhance the performance of this multiplication, we copy the same δ_k vector into j rows of the memory (as shown in Figure 5b). The multiplication of the transposed weights and the copied δ_k matrix is performed in a row-parallel way.

Finally, FloatPIM accumulates all the stored multiplication results (Σ δ_k W_jk). One way is to use k x bw-bit columns that store all the results of the multiplications, where bw is the bit-width of the values. Instead, we design an in-memory multiply-accumulate (MAC) operation which reuses the memory columns for the accumulation: FloatPIM consecutively performs multiplication and addition operations. This reduces the number of required columns to bw bits and results in significant improvements in the area efficiency per computation. Assuming that multiplication and addition take T_Mult and T_Add respectively, the multiplication of the weights and the δ_k matrix is computed in (T_Mult + T_Add) x k. Since FloatPIM performs the computation in a row-parallel way, the performance of the computation is independent of j. The result of Σ δ_k W_jk is a vector with j elements (Figure 5b). This vector is multiplied element-wise by the g'(a_j) vector in a row-parallel way. Note that during the feed-forward, g'(a_j) is written in a suitable memory location, which enables this column-wise multiplication with no internal data movement.

The result of the multiplication is the δ_j error vector, which is sent to the next memory block to update the weights (Figure 5c). The error vector is used both for updating the weights (W_ij) and for computing the backward error vector (δ_i) in layer i. Next, FloatPIM transfers the δ_j vector to the next memory block, which is responsible for updating the W_ij weights. The δ_j vector is copied into i memory rows next to the W_ij matrix using the copy operation.

For the weight update, the δ_j matrix is multiplied with the Z_i vector, where Z_i was calculated and stored during the feed-forward step. This takes i x T_Mult. As Figure 5b shows, the result of the multiplication is a matrix with j x i elements. Finally, FloatPIM updates the weights by subtracting W_ij from the δ_j Z_i matrix. This subtraction happens column by column, and the result is rewritten in the same column as the new weight matrix. This reduces the number of required memory columns from k x bw to bw columns.
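To make the data flow concrete, here is a dense NumPy sketch of the fully-connected backward pass described above (added for illustration; it reproduces the mathematics rather than FloatPIM's in-memory layout or latency, and the function name fc_backward and the learning-rate parameter lr are hypothetical):

```python
# Mathematical sketch of the fully-connected back-propagation step described
# above (illustrative; `lr` is a hypothetical learning-rate parameter).
import numpy as np

def fc_backward(delta_k, W_jk, g_prime_aj, z_i, W_ij, lr=0.1):
    """delta_k: error from layer k; W_jk: weights of the j-th layer;
    g_prime_aj: stored activation derivative of layer j;
    z_i: stored feed-forward output of layer i; W_ij: weights to update."""
    # Error backward: accumulate delta_k over W_jk, then scale by g'(a_j).
    delta_j = (W_jk @ delta_k) * g_prime_aj
    # Weight update: outer product of delta_j with the stored z_i.
    W_ij_new = W_ij - lr * np.outer(delta_j, z_i)
    return delta_j, W_ij_new

delta_k = np.array([0.2, -0.1])
W_jk = np.random.randn(3, 2)            # layer j has 3 neurons, layer k has 2
g_prime_aj = np.array([1.0, 0.5, 1.0])
z_i = np.array([0.3, 0.7, 0.1, 0.9])    # layer i has 4 neurons
W_ij = np.random.randn(3, 4)
print(fc_backward(delta_k, W_jk, g_prime_aj, z_i, W_ij))
```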
Convolution Layer: There are a few differences between the fully-connected and convolution layers in the back-propagation step. Unlike the fully-connected layer, the error term is defined as a matrix, i.e., the error backward computes the error matrix of a layer j (δ^j) depending on the error matrix of a layer k (δ^k). The update of the error matrix happens by computing the convolution of δ^k with the weight matrix, where the size of the weights is usually much smaller than that of δ (m, n << r, s). This operation can be implemented in memory using the same hardware we used to accelerate the convolution in the feed-forward step. Next, the matrix generated by the convolution is multiplied with the derivatives of the activation function (g'), which are already stored in memory during the feed-forward step. This is computed with the same PIM functionalities used for the fully-connected layers. Finally, the generated matrix is convolved with Z^i, which is the matrix corresponding to the output of the previous CNN layer. When a pooling layer is used, Z^i is the output of that layer.
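A compact NumPy analogy of this convolutional backward step is sketched below (added for illustration; it follows the standard convolutional back-propagation equations, not FloatPIM's exact in-memory schedule, and the helpers valid_corr and conv_backward are hypothetical names):

```python
# Sketch of the convolution-layer backward pass (illustrative only; standard
# convolutional back-propagation equations, not FloatPIM's in-memory schedule).
import numpy as np

def valid_corr(x, k):
    """Valid cross-correlation of x (H, W) with kernel k (n, m)."""
    n, m = k.shape
    H, W = x.shape
    out = np.zeros((H - n + 1, W - m + 1))
    for dy in range(n):
        for dx in range(m):
            out += k[dy, dx] * x[dy:dy + out.shape[0], dx:dx + out.shape[1]]
    return out

def conv_backward(delta_k, W_next, g_prime_aj, z_i):
    """delta_k: error matrix of layer k; W_next: kernel of layer k;
    g_prime_aj: stored activation derivative of layer j; z_i: stored input."""
    pad = [(s - 1, s - 1) for s in W_next.shape]
    # Error backward: pad delta_k and correlate it with the 180-degree-rotated
    # next-layer kernel, then multiply element-wise by the stored derivative.
    delta_j = valid_corr(np.pad(delta_k, pad), W_next[::-1, ::-1]) * g_prime_aj
    # Weight gradient of layer j: correlate the stored input with delta_j.
    dW_j = valid_corr(z_i, delta_j)
    return delta_j, dW_j

delta_k = np.ones((2, 2))                       # error of layer k
W_next = np.array([[1.0, 0.0], [0.0, 1.0]])     # 2 x 2 kernel of layer k
g_prime_aj = np.ones((3, 3))                    # stored g'(a_j)
z_i = np.arange(16, dtype=float).reshape(4, 4)  # stored output of layer i
dj, dW = conv_backward(delta_k, W_next, g_prime_aj, z_i)
print(dj.shape, dW.shape)                       # -> (3, 3) (2, 2)
```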
IV. EVALUATION

A. Experimental Setup

We have designed and used a cycle-accurate simulator based on TensorFlow [29], [30] which emulates the memory functionality during the DNN training and testing phases. For the accelerator design, we use HSPICE for circuit-level simulations to measure the energy consumption and performance of all the FloatPIM floating-point/fixed-point operations in 28nm technology. The energy consumption and performance are also cross-validated using NVSim [31]. We used SystemVerilog and Synopsys Design Compiler [32] to implement and synthesize the FloatPIM controller. For parasitics, we used the same simulation setup considered by the work in [33]. The robustness of all proposed circuits, i.e., the interconnects, has been verified by considering 10% process variations on the size and threshold voltage of the transistors using 5000 Monte Carlo simulations. FloatPIM works with any bipolar resistive technology, which is the most commonly used in existing NVMs. Here, we adopt a memristor device with the VTEAM model [34]. Table I summarizes the device characteristics of each FloatPIM component. FloatPIM consists of 32 tiles, where each tile has 256 memory blocks, to cover all the tested CNN structures. Each tile takes 0.96 mm² of area and consumes 7.64 mW of power. In total, FloatPIM takes 30.64 mm² of area and consumes 62.60 W of power on average.

TABLE I
FLOATPIM PARAMETERS

| Component      | Params        | Spec             | Area        | Power   |
|----------------|---------------|------------------|-------------|---------|
| Crossbar Array | size          | 1Mb              | 3,449.6 µm² | 6.14 mW |
| Shifter        | shift         | 6 levels         | 19.26 µm²   | 0.69 mW |
| Switches       | number        | 1K bits          | 32.69 µm²   | 0.42 mW |
| Max Pool       | number        | 1                | 80 µm²      | 0.38 mW |
| Controller     | number        | 1                | 401.4 µm²   | 0.65 mW |
| Memory Block   | size / number | 1Mb / 256 blocks | 3,468.8 µm² | 6.83 mW |
| Tile           | size / number | 256Mb / 32 tiles | 0.96 mm²    | 7.64 mW |
| Total          | size          | 8Gb              | 30.64 mm²   | 62.60 W |

TABLE II
ERROR RATE COMPARISON AND PIM SUPPORT

| Network    | Float-32 | bFloat-16 | Fixed-32 | Fixed-16 |
|------------|----------|-----------|----------|----------|
| AlexNet    | 27.4%    | 27.4%     | 29.6%    | 31.3%    |
| GoogleNet  | 15.6%    | 15.6%     | 18.5%    | 21.4%    |
| VGGNet     | 17.5%    | 17.7%     | 21.4%    | 23.1%    |
| SqueezeNet | 25.9%    | 26.1%     | 29.6%    | 32.1%    |

PIM designs support:

| Design        | Float-32 | bFloat-16 | Fixed-32 | Fixed-16 |
|---------------|----------|-----------|----------|----------|
| ISAAC [20]    | ✗        | ✗         | ✓        | ✓        |
| PipeLayer [1] | ✗        | ✗         | ✓        | ✓        |
| FloatPIM      | ✓        | ✓         | ✓        | ✓        |
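As a quick sanity check of the capacity and area figures in Table I (our own arithmetic, added here; the small gap with the reported total area comes from rounding in the per-tile value):

```python
# Back-of-the-envelope check of Table I's capacity and area totals.
blocks_per_tile, tiles = 256, 32
block_capacity_mb = 1                      # 1 Mb per crossbar memory block
tile_capacity_mb = blocks_per_tile * block_capacity_mb
total_capacity_gb = tiles * tile_capacity_mb / 1024
print(tile_capacity_mb, "Mb per tile")     # -> 256 Mb, as listed
print(total_capacity_gb, "Gb total")       # -> 8.0 Gb, as listed
print(tiles * 0.96, "mm^2 total")          # -> 30.72 mm^2 vs. 30.64 mm^2 reported
```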
B. Workload

We perform our experiments on ImageNet [35], a large dataset with about 1.2M training samples and 50K validation samples. The objective is to classify each image into one of 1000 categories. We tested four popular large-scale networks, i.e., AlexNet [35], VGGNet [36], GoogleNet [37], and SqueezeNet [38], to classify the ImageNet dataset.

C. FloatPIM & Data Representation

Table II reports the classification error rates of the different networks when they are trained with floating-point and fixed-point representations. For floating-point precision, we used 32-bit floating point (Float-32) and bfloat16 (bFloat) [39], a commonly used representation in many CNN accelerators. For fixed-point precision, we used 32-bit fixed-point (Fixed-32) and 16-bit fixed-point (Fixed-16) representations for FloatPIM training. For all networks, we perform the testing using Fixed-32 precision. To achieve maximum classification accuracy, it is essential to train CNN models using a floating-point representation. For example, using Fixed-16 and Fixed-32 for training, VGGNet provides 5.2% and 2.6% lower classification accuracy as compared to the same network trained with bFloat. In addition, we observe that for all applications, bFloat can provide the same accuracy as Float-32 while being processed computationally much faster. This is because FloatPIM works based on the bitwise NOR operation, so it can simply skip processing the least significant bits of the mantissas in the floating-point representation in order to accelerate the computation. Table II also lists the computation precision supported by two recent PIM-based CNN accelerators [1], [20]. All existing PIM architectures can support CNN acceleration using only fixed-point values, which results in up to 5.1% lower classification accuracy than the floating-point precision supported by FloatPIM.

Figure 6 shows the speedup and energy saving of FloatPIM, on average over the four CNN models, using the fixed-point and floating-point representations for CNN training and testing. All results are normalized to Float-32. Our evaluation shows that FloatPIM using bFloat can achieve 2.9x speedup and 2.5x energy savings as compared to FloatPIM using Float-32, while providing similar classification accuracy. In addition, FloatPIM using bFloat can provide higher efficiency than Fixed-32. For example, FloatPIM using bFloat can achieve 1.5x speedup and 1.42x energy efficiency as compared to Fixed-32.

Fig. 6. FloatPIM energy saving and speedup using floating-point and fixed-point representations.

D. FloatPIM Training

Figure 7 compares the performance and energy efficiency of FloatPIM with a GPU-based implementation and PipeLayer [1], a state-of-the-art hardware accelerator for CNN training based on the ISAAC [20] hardware. For PipeLayer, we used a read/write latency of 29.31ns/50.88ns and energy of 1.08pJ/3.91nJ per spike, as reported in the reference paper [1]. In addition, we used a configuration parameter of 4, which provides reasonable efficiency. During training, a CNN requires a significantly large memory to store the feed-forward information of the different data points in a batch. For large networks, this information cannot fit in the GPU memory, which results in slow training.

Our evaluation shows that FloatPIM can achieve on average 303.2x speedup and 48.6x energy efficiency in training as compared to the GPU-based approach. The higher efficiency of FloatPIM is more pronounced on CNNs with a larger number of convolution layers. Figure 7 also compares FloatPIM efficiency over PipeLayer when the in-parallel data transfer between the memory blocks is enabled and disabled. Our evaluation shows that FloatPIM without parallelized data transfer provides 1.6x lower speedup, but 3.5x higher energy efficiency, as compared to PipeLayer. However, exploiting the switches significantly accelerates the FloatPIM computation by removing the internal data movement between the neighboring blocks. Our evaluation shows that FloatPIM with in-parallel data transfer enabled can achieve on average 4.3x speedup and 15.8x energy efficiency as compared to PipeLayer. The higher energy efficiency of FloatPIM comes from (i) its digital-based operation, which avoids paying the extra cost of transferring data between the digital and analog/spike domains, and (ii) the higher density of FloatPIM, which enables significantly better parallelism. The PipeLayer computing precision is bounded to fixed-point operations, while FloatPIM provides the floating-point precision which is essential for highly accurate CNN training.

Fig. 7. FloatPIM efficiency during training (GPU, PipeLayer, FloatPIM without switches, and FloatPIM).

V. CONCLUSION

In this paper, we proposed FloatPIM, the first PIM-based DNN training architecture that exploits analog properties of the memory without explicitly converting data into the analog domain. FloatPIM is a flexible PIM-based accelerator that works with floating-point as well as fixed-point precision. FloatPIM addresses the internal data movement issue of the PIM architecture by enabling in-parallel data transfer between the neighboring blocks. Our evaluation shows that FloatPIM can achieve on average 4.3x and 15.8x (6.3x and 21.6x) higher speedup and energy efficiency as compared to PipeLayer (ISAAC), the state-of-the-art PIM accelerators, during training (testing).

ACKNOWLEDGEMENTS

This work was supported by Semiconductor Research Corporation GRC contract #2020-AH-2988. Mohsen Imani and Yeseong Kim are co-corresponding authors of the paper.

REFERENCES

[1] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), IEEE, 2017.
[2] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[3] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85-117, 2015.
[4] M. N. Bojnordi and E. Ipek, "Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning," in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 1-13, IEEE, 2016.
[5] M. Imani, M. Samragh, Y. Kim, S. Gupta, F. Koushanfar, and T. Rosing, "RAPIDNN: In-memory deep neural network acceleration framework," arXiv preprint arXiv:1806.05794, 2018.
[6] S. Gupta, M. Imani, H. Kaur, and T. S. Rosing, "NNPIM: A processing in-memory architecture for neural network acceleration," IEEE Transactions on Computers, 2019.
[7] M. Imani, S. Gupta, and T. Rosing, "GenPIM: Generalized processing in-memory to accelerate data intensive applications," in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1155-1158, IEEE, 2018.
[8] M. Imani, S. Gupta, Y. Kim, and T. Rosing, "FloatPIM: In-memory acceleration of deep neural network training with high precision," in 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), pp. 802-815, IEEE, 2019.
[9] M. Imani, M. S. Razlighi, Y. Kim, S. Gupta, F. Koushanfar, and T. Rosing, "Deep learning acceleration with neuron-to-memory transformation," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 1-14, IEEE, 2020.
[10] D. Gao, D. Reis, X. S. Hu, and C. Zhuo, "EVA-CiM: A system-level performance and energy evaluation framework for computing-in-memory architectures," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 1-1, 2020.
[11] S. Gupta et al., "FELIX: Fast and energy-efficient logic in memory," in ICCAD, pp. 1-7, IEEE, 2018.
[12] M. Imani, A. Rahimi, D. Kong, T. Rosing, and J. M. Rabaey, "Exploring hyperdimensional associative memory," in HPCA, pp. 445-456, IEEE, 2017.
[13] K. Ni, X. Yin, A. F. Laguna, S. Joshi, S. Dünkel, M. Trentzsch, J. Müller, S. Beyer, M. Niemier, X. S. Hu, et al., "Ferroelectric ternary content-addressable memory for one-shot learning," Nature Electronics, vol. 2, no. 11, pp. 521-529, 2019.
[14] Y. Kim et al., "ORCHARD: Visual object recognition accelerator based on approximate in-memory processing," in ICCAD, pp. 25-32, IEEE, 2017.
[15] M. Zhou, M. Imani, S. Gupta, and T. Rosing, "GAS: A heterogeneous memory architecture for graph processing," in Proceedings of the International Symposium on Low Power Electronics and Design, p. 27, ACM, 2018.
[16] M. Zhou et al., "GRAM: Graph processing in a ReRAM-based computational memory," in DAC, pp. 591-596, ACM, 2019.
[17] M. Imani et al., "NVQuery: Efficient query processing in non-volatile memory," IEEE TCAD, 2018.
[18] M. Imani, D. Peroni, Y. Kim, A. Rahimi, and T. Rosing, "Efficient neural network acceleration on GPGPU using content addressable memory," in 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1026-1031, IEEE, 2017.
[19] X. Yin, C. Li, Q. Huang, L. Zhang, M. Niemier, X. S. Hu, C. Zhuo, and K. Ni, "FeCAM: A universal compact digital and analog content addressable memory using ferroelectric," arXiv preprint arXiv:2004.01866, 2020.
[20] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Proceedings of the 43rd International Symposium on Computer Architecture, pp. 14-26, IEEE Press, 2016.
[21] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in ISCA, pp. 27-39, IEEE Press, 2016.
[22] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[23] B. Feinberg, U. K. R. Vengalam, N. Whitehair, S. Wang, and E. Ipek, "Enabling scientific computing on memristive accelerators," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 367-382, IEEE, 2018.
[24] X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi, "Scaling for edge inference of deep neural networks," Nature Electronics, vol. 1, no. 4, pp. 216-222, 2018.
[25] C. Zhuo, S. Luo, H. Gan, J. Hu, and Z. Shi, "Noise-aware DVFS for efficient transitions on battery-powered IoT devices," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 7, pp. 1498-1510, 2020.
[26] M. Cheng, L. Xia, Z. Zhu, Y. Cai, Y. Xie, Y. Wang, and H. Yang, "TIME: A training-in-memory architecture for memristor-based deep neural networks," in Proceedings of the 54th Annual Design Automation Conference, p. 26, ACM, 2017.
[27] Y. Cai, T. Tang, L. Xia, M. Cheng, Z. Zhu, Y. Wang, and H. Yang, "Training low bitwidth convolutional neural network on RRAM," in Proceedings of the 23rd Asia and South Pacific Design Automation Conference, pp. 117-122, IEEE Press, 2018.
[28] Y. Cai, Y. Lin, L. Xia, X. Chen, S. Han, Y. Wang, and H. Yang, "Long live TIME: Improving lifetime for training-in-memory engines by structured gradient sparsification," in Proceedings of the 55th Annual Design Automation Conference, p. 107, ACM, 2018.
[29] F. Chollet, "Keras." https://github.com/fchollet/keras, 2015.
[30] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[31] X. Dong, C. Xu, N. Jouppi, and Y. Xie, "NVSim: A circuit-level performance, energy, and area model for emerging non-volatile memory," in Emerging Memory Technologies, pp. 15-50, Springer, 2014.
[32] Synopsys, Inc., "Design Compiler user guide," see http://www.synopsys.com, 2000.
[33] N. Talati, S. Gupta, P. Mane, and S. Kvatinsky, "Logic design within memristive memories using memristor-aided logic (MAGIC)," IEEE Transactions on Nanotechnology, vol. 15, no. 4, pp. 635-650, 2016.
[34] S. Kvatinsky et al., "VTEAM: A general model for voltage-controlled memristors," IEEE TCAS II, vol. 62, no. 8, pp. 786-790, 2015.
[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[36] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.
[38] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.
[39] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
