Processing In-Memory
Mohsen Imani†, Saransh Gupta*, Yeseong Kim‡, Tajana Rosing*
†Department of Computer Science, UC Irvine
‡Department of Information and Communication Engineering, DGIST
*Department of Computer Science and Engineering, UC San Diego
2020 IEEE 33rd International System-on-Chip Conference (SOCC) | 978-1-7281-8746-4/20/$31.00 ©2020 IEEE | DOI: 10.1109/SOCC49529.2020.9524776

Abstract—Processing In-Memory (PIM) has shown great potential to accelerate inference tasks of Convolutional Neural Networks (CNNs). However, existing PIM architectures do not support high precision computation, e.g., in floating point precision, which is essential for training accurate CNN models. In addition, most of the existing PIM approaches require analog/mixed-signal circuits, which do not scale, exploiting insufficiently reliable multi-bit Non-Volatile Memory (NVM). In this paper, we propose FloatPIM, a fully-digital scalable PIM architecture that accelerates CNNs in both training and testing phases. FloatPIM natively supports floating-point representation, thus enabling accurate CNN training. FloatPIM also enables fast communication between neighboring memory blocks to reduce internal data movement of the PIM architecture. We break the CNN computation into computing and data transfer modes. In computing mode, all blocks process a part of CNN training/testing in parallel, while in data transfer mode FloatPIM enables fast and row-parallel communication between the neighboring blocks. Our evaluation shows that FloatPIM training is on average 303.2× and 48.6× (4.3× and 15.8×) faster and more energy efficient as compared to a GTX 1080 GPU (the PipeLayer [1] PIM accelerator).

I. INTRODUCTION

Artificial neural networks, in particular deep learning [2], [3], have a wide range of applications in diverse areas. Processing CNNs in conventional von Neumann architectures is inefficient as these architectures have separate memory and computing units. Processing in-memory (PIM) is a promising solution to address the data movement issue [4]. Prior works exploited digital PIM operations to accelerate different applications such as DNNs [5]-[10], brain-inspired computing [11]-[13], object recognition [14], graph processing [15], [16], and database applications [17]-[19]. ISAAC [20] and PRIME [21] exploit analog characteristics of non-volatile memory to support matrix multiplication in memory. These architectures transfer the digital input data into an analog domain and pass the analog signal through a crossbar ReRAM to compute matrix multiplication. The matrix values are stored as multi-bit memristors in a crossbar memory. Although these PIM-based designs presented superior efficiency, there are several limitations when using PIM for CNN training. First, the precision of the design is bounded to fixed-point precision as determined by the number of multi-bit memristors used to represent a value. However, CNN models often need to be trained with floating point precision to achieve high classification accuracy [22]. For example, GoogleNet, trained with 32-bit fixed point values, achieves 3% lower classification accuracy than the one trained with 32-bit floating points. In addition, earlier work showed that, without enough precision, the model training is likely to diverge or provide low accuracy [22]. Most commercial CNN accelerators train their models using floating point precision, e.g., bfloat16. The bfloat16 is a half precision floating point format utilized in many AI processors.

Another limitation is that the state-of-the-art PIM-based designs utilize costly digital-to-analog converter (DAC) and analog-to-digital converter (ADC) blocks. For example, recent work in [23] designed an analog-based memristive accelerator to support floating point operations. However, the mixed-signal ADC/DAC blocks take the majority of the chip area and power, e.g., 98% of the total area and 89% of the total power, and do not scale as fast as the CMOS technology does [20], [24], [25]. In addition, prior PIM designs use multi-bit memristor devices that are not sufficiently reliable for commercialization, unlike commonly-used single-level NVMs, e.g., Intel 3D XPoint. Their very expensive write operations occur frequently during training. For example, the works in [26]-[28] extend the application of analog crossbar memory to accelerate training, but they still have expensive converter units and multi-bit devices. PipeLayer [1] modified the ISAAC [20] pipeline architecture and uses spike-based input to eliminate ADC and DAC blocks. However, the computation of PipeLayer still happens on the converted data and its precision is limited to fixed-point operations.

In this paper, we propose FloatPIM, a novel high precision PIM architecture, which significantly accelerates CNNs in both training and testing with the floating-point representation. FloatPIM directly supports floating-point representations, thus enabling high precision CNN training and testing. To the best of our knowledge, FloatPIM is the first PIM-based CNN training architecture that exploits analog properties of the memory without explicitly converting data into the analog domain. FloatPIM is flexible in that it works with floating-point as well as fixed-point precision. We introduce several key design features that optimize the CNN computations in PIM designs. FloatPIM breaks the computation into computing and data transfer phases. In the computing mode, all blocks work in parallel to compute the matrix multiplication and convolution tasks. We evaluate the efficiency of FloatPIM on popular large-scale networks with comparisons to the state-of-the-art solutions. Our evaluation shows that FloatPIM in training can achieve 303.2× and 48.6× (4.3× and 15.8×) speedup and energy efficiency as compared to the state-of-the-art GPU (PipeLayer PIM accelerator [1]).

II. FLOATPIM OVERVIEW

In this paper, we propose a digital and scalable processing in-memory architecture (FloatPIM), which accelerates CNNs in both training and testing phases with precise floating-point computations. Figure 1a shows the overview of the FloatPIM architecture consisting of multiple crossbar memory blocks.
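The Introduction contrasts fixed-point training with floating-point formats such as bfloat16. As a rough illustration of why the format matters, the sketch below derives a bfloat16 value by keeping only the upper 16 bits of an IEEE 754 float32 (truncation, not round-to-nearest) and compares it with a 16-bit fixed-point grid. The helper names and the fixed-point layout (8 fractional bits) are assumptions made for this example and are not taken from the paper.

```python
import struct


def float32_to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern obtained by truncating a float32."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    # Keep the sign bit, the 8 exponent bits and the top 7 mantissa bits.
    return bits32 >> 16


def bfloat16_bits_to_float(bits16: int) -> float:
    """Expand a bfloat16 bit pattern back to a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))
    return value


def to_fixed_point(x: float, frac_bits: int = 8, total_bits: int = 16) -> float:
    """Round x to a signed fixed-point grid and saturate (assumed format)."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    quantized = max(lowest, min(highest, round(x * scale)))
    return quantized / scale


small = 3.0e-5  # e.g., a small gradient value
as_bfloat = bfloat16_bits_to_float(float32_to_bfloat16_bits(small))
as_fixed = to_fixed_point(small)
print(as_bfloat, as_fixed)
```

In this assumed fixed-point layout the small value collapses to zero, while the bfloat16 value keeps its magnitude with a relative error below about one percent, because the 8-bit exponent is preserved and only mantissa bits are dropped.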
Authorized licensed use limited to: INSTITUTE OF COMPUTING TECHNOLOGY CAS. Downloaded on October 25,2022 at 11:29:09 UTC from IEEE Xplore. Restrictions apply.
Fig. 1. Overview of FloatPIM. (a) FloatPIM memory blocks in computing mode and data transfer mode. (b) Computation in memory. (c) Row-parallel operation in each block.
Fig. 2. Overview of CNN training. (a) Feed forward. (b) Error backward. (c) Weight update.
Fig. 3. Vector-matrix multiplication. (a) Vector-matrix multiplication. (b) PIM-compatible vector-matrix multiplication.
Fig. 4. Convolution operation. (a) Convolution. (b) PIM-compatible convolution. (c) Barrel shifter.

achieving maximum parallelism that the digital PIM operations offer.

Figure 3b shows how our design implements row-parallel operations by locating the data in a PIM-compatible manner. FloatPIM stores multiple copies of the input vector horizontally and the transposed weight matrix in memory ($W^T$). FloatPIM first multiplies the input columns with each corresponding column of the weight matrix. The multiplication result is written in another column of the same memory block. Finally, FloatPIM accumulates the stored multiplication results column-wise with multiple PIM addition operations to the other column.

FloatPIM enables the multiplication and accumulation to be performed independently of the number of rows. Let us assume that each multiplication and addition take $T_{Mul}$ and $T_{Add}$ latencies, respectively. Thus, we require $M \times T_{Mul}$ and $N \times T_{Add}$ latencies to perform the multiplication and accumulation, respectively, where the size of the weight matrix is $M$ by $N$.

Convolution: As shown in Figure 4a, the convolution layer consists of many multiplications, where a shared weight kernel shifts and multiplies with an input matrix. A naive way to implement the convolution is to write all the partial convolutions for each window movement by reading and writing the convolution weights repeatedly in memory. However, this method has a high performance overhead in PIM, since non-volatile memories (NVMs) have slow write operations.

FloatPIM addresses this issue by replacing the convolution with light-weight interconnect logic for the multiplication operation. Figure 4b illustrates the proposed method, which consists of two parts: (i) It writes all convolution weights in a single row and then copies them to other rows using the row-parallel write operation, which takes just two cycles. This method enables the input values to be multiplied with any convolution weights stored in another column. (ii) It exploits a configurable interconnect to virtually model the shift procedure of the convolution kernel. This interconnect is a barrel shifter which connects two parts of the same memory.

Figure 4c shows the structure of a barrel shifter that provides a 3-bit shift operation as an example. Depending on the $B_S$ control signals, the barrel shifter connects a different group of input bits $b$ to the output bits $\{b'_1, \ldots, b'_4\}$. The number of required shift operations depends on the size of the convolution window. For the example shown in Figure 4b, for a $2 \times 2$ convolution window, the barrel shifter supports a single shift operation using the $B_S = 0$ or $B_S = 1$ control signal. Similarly, for an $n \times n$ convolution kernel, the number of shift operations is $n - 1$. Our FloatPIM supports up to a $7 \times 7$ convolution kernel, and it covers all the tested popular CNN structures. Note that FloatPIM can also support $n$ larger than 7 by rewriting shifted input matrices into other columns.

Row-Parallel Write: Both vector-matrix multiplication and convolution require copying the input or weight vectors into multiple rows. Since writing multiple rows sequentially would degrade performance, FloatPIM supports a row-parallel write operation that writes the same value to all rows in only two cycles. In the first cycle, the block activates all columns containing "1" by connecting the corresponding bitlines to the $V_{SET}$ voltage, while the row driver sets the wordlines for the destination rows to zero. It writes 1s on all the selected memory cells at the same time. In the second cycle, the column driver connects only the bitlines which carry a "0" bit to the zero voltage, while the row driver sets the wordlines to $V_{RESET}$. This writes the input to all memory rows.

MAX/MIN Pooling: The goal of the MAX (MIN) pooling layer is to find the maximum (minimum) value among the neurons' outputs in the previous layer. To implement pooling in memory, we use a crossbar memory with the capability of searching for the nearest value. Work in [18] exploited different supply voltages to give weight to different bitlines and enable the nearest search capability. Using this hardware, we implement MAX pooling by searching for a value which has the nearest similarity to the largest possible value. Similarly, MIN pooling can be implemented by searching for a row of the memory which has the closest distance to the minimum possible value. Since the values are floating point, the search happens in two phases. First, we find the values with the highest exponent; then, among the values with the same maximum exponent, we search for the value which has the largest mantissa.

B. Feed-Forward Acceleration

There are three major types of CNN layers: fully-connected, convolution, and pooling layers. For each of the three layer types, we exploit different data allocation mechanisms to enable high parallelism and perform the computation tasks with minimal internal data movement. For the fully connected layer, the main computation is vector-matrix multiplication. CNN weights ($W_{ij}$) are stored as a matrix in memory and multiplied with the input vector stored in a different column. This multiplication and addition can happen between the memory columns using the same approach we introduced for PIM-compatible vector-matrix multiplication. The convolution is another commonly used operation in deep neural networks, which is implemented using the PIM-compatible convolution hardware.

C. Back-Propagation Acceleration

Figure 5 shows the CNN training phases in fully-connected layer: (i) Error backward, where the error propagates through
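As a rough software analogy for the PIM-compatible vector-matrix multiplication described with Figure 3b in this section, the following sketch lays out one copy of the input vector per memory row next to one row of the transposed weight matrix, then processes one column per step. The function name and the step counters are illustrative assumptions; the inner loops over rows stand in for work that the memory block would perform on all rows at once.

```python
def pim_vector_matrix_multiply(a, W):
    """Compute a @ W with a row-parallel, column-serial schedule.

    a is a list of n inputs and W is an n x m weight matrix (list of rows).
    Memory row j is modelled as holding a copy of a next to row j of W^T.
    """
    n, m = len(a), len(W[0])
    copied_inputs = [list(a) for _ in range(m)]            # same input in every row
    w_t = [[W[k][j] for k in range(n)] for j in range(m)]  # transposed weights
    products = [[0.0] * n for _ in range(m)]

    mul_steps = 0
    for k in range(n):
        # One multiplication step: the inner loop stands in for all rows
        # operating at once in the memory block.
        for j in range(m):
            products[j][k] = copied_inputs[j][k] * w_t[j][k]
        mul_steps += 1

    add_steps = 0
    result = [0.0] * m
    for k in range(n):
        # One addition step: every row accumulates its k-th product.
        for j in range(m):
            result[j] += products[j][k]
        add_steps += 1
    return result, mul_steps, add_steps


a = [1.0, 2.0, 3.0]
W = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
result, mul_steps, add_steps = pim_vector_matrix_multiply(a, W)
print(result, mul_steps, add_steps)
```

In this model the number of multiplication and addition steps grows with the number of columns processed, not with the number of rows, which mirrors the claim that the latency is independent of the row count.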
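The two-phase search used for MAX pooling on floating-point values (highest exponent first, then largest mantissa) can be mimicked in software. The sketch below is only an analogy: it reads the exponent and mantissa fields of IEEE 754 float32 values with Python's `struct` module instead of performing a nearest-value search in a crossbar memory, and it assumes non-negative inputs (for example, activations after a ReLU), which the paper does not state. The function names are hypothetical.

```python
import struct


def float32_fields(x: float):
    """Return the (biased exponent, mantissa) fields of x encoded as float32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF, bits & 0x7FFFFF


def two_phase_max(values):
    """Pick the maximum of non-negative values in two search phases."""
    if any(v < 0 for v in values):
        raise ValueError("this sketch assumes non-negative values")
    fields = [float32_fields(v) for v in values]
    # Phase 1: keep only the entries that share the highest exponent.
    max_exponent = max(exponent for exponent, _ in fields)
    candidates = [i for i, (exponent, _) in enumerate(fields) if exponent == max_exponent]
    # Phase 2: among those, select the entry with the largest mantissa.
    best = max(candidates, key=lambda i: fields[i][1])
    return values[best]


window = [0.0, 0.25, 1.5, 1.75, 0.9]
print(two_phase_max(window))
```

MIN pooling could follow the same pattern with the lowest exponent and the smallest mantissa.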
in a layer $j$. Next, FloatPIM transfers the $\delta_j$ vector to the next memory block, which is responsible for updating the $W_{ij}$ weights. The $\delta_j$ vector is copied into $i$ memory rows next to the $W_{ij}^{T}$ matrix using the copy operation.

For the weight update, the $\delta_j$ matrix is multiplied with the $\eta Z_i$ vector, where $\eta Z_i$ is calculated and stored during the feed-forward step. This takes $j \times T_{Mul}$. As Figure 5b shows, the result of the multiplication is a matrix with $j \times i$ elements. Finally, FloatPIM updates the weights by subtracting the $\eta \delta_j Z_i$ matrix from $W_{ij}^{T}$. This subtraction happens column by column
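The weight-update step described above can be sketched in software as follows. This is a minimal illustration that assumes the usual gradient-descent form, in which the product of the back-propagated error, the learning rate and the stored feed-forward output is subtracted from each weight. The memory layout (one row per element of the stored feed-forward vector, one column per error element), the function name and the variable names are assumptions for this example.

```python
def pim_weight_update(weights, delta, eta_z):
    """Return weights minus the outer product of eta_z and delta.

    weights: matrix with len(eta_z) rows and len(delta) columns.
    delta:   back-propagated error vector (one entry per column).
    eta_z:   learning rate times the stored feed-forward output (one entry per row).
    """
    rows, cols = len(eta_z), len(delta)
    delta_copies = [list(delta) for _ in range(rows)]  # delta copied into every row
    updated = [row[:] for row in weights]
    for c in range(cols):
        # Column by column; the loop over rows stands in for the
        # row-parallel multiply and subtract inside the memory block.
        for r in range(rows):
            gradient = delta_copies[r][c] * eta_z[r]
            updated[r][c] -= gradient
    return updated


weights = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
delta = [1.0, 2.0]
eta_z = [0.5, 0.0, 0.25]
print(pim_weight_update(weights, delta, eta_z))
```

The outer loop runs once per column, so the number of sequential steps in this model depends on the length of the error vector and not on the number of rows.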
TABLE I. FLOATPIM PARAMETERS
TABLE II. ERROR RATE COMPARISON AND PIM SUPPORTS
Fig. 6. FloatPIM energy saving and speedup using floating point and fixed point representations. Panels: testing speedup, testing energy saving, training speedup, training energy saving.

a large dataset with about 1.2M training samples and 50K validation samples. The objective is to classify each image into one of 1000 categories. We tested with four popular large-scale networks, i.e., AlexNet [35], VGGNet [36], GoogleNet [37], and SqueezeNet [38], to classify the ImageNet dataset.

C. FloatPIM & Data Representation

Table II reports the classification error rate of different networks when they are trained with floating point and fixed point representations. For float precision, we used 32-bit floating point (Float-32) and bfloat16 (bFloat) [39], a commonly used representation in many CNN accelerators. For fixed-point precision, we used 32-bit fixed point (Fixed-32) and 16-bit fixed point (Fixed-16) representations for FloatPIM training. For all networks, we perform the testing using Fixed-32 precision. To achieve maximum classification accuracy, it is essential to train CNN models using floating point representation. For example, using Fixed-16 and Fixed-32 for training, VGGNet provides 5.2% and 2.6% lower classification accuracy as compared to the same network trained with bFloat. In addition, we observe that for all applications, bFloat can provide the same accuracy as Float-32, while being processed much faster. This is because FloatPIM works based on the bitwise NOR operation, thus it can simply skip the least significant bits of the mantissas in floating point representation in order to accelerate the computation.

Table II lists the computation precision supported by two recent PIM-based CNN accelerators [1], [20]. All existing PIM architectures support CNN acceleration using only fixed-point values, which results in up to 5.1% lower classification accuracy than the floating point precision supported by FloatPIM.

Figure 6 shows the speedup and energy saving of FloatPIM, on average for the four CNN models, using the fixed point and floating point representations for CNN training and testing. All results are normalized to Float-32. Our evaluation shows that FloatPIM using bFloat can achieve 2.9× speedup and 2.5× energy savings as compared to FloatPIM using Float-32, while providing similar classification accuracy. In addition, FloatPIM using the bFloat model can provide higher efficiency than Fixed-32. For example, FloatPIM using bFloat can achieve 1.5× speedup and 1.42× energy efficiency as compared to Fixed-32.

D. FloatPIM Training

Figure 7 compares the performance and energy efficiency of FloatPIM with the GPU-based implementation and PipeLayer [1], which is a state-of-the-art hardware accelerating CNN training using ISAAC [20] hardware. For PipeLayer, we used read/write latency of 29.31ns/50.88ns and energy of 1.08pJ/3.91nJ per spike as reported in the reference paper [1]. In addition, we used λ = 4, which provides reasonable efficiency. During training, a CNN requires a significantly large memory size to store the feed-forward information of different data points in a batch. For large networks, this information cannot fit in the GPU memory, which results in slow training.

Our evaluation shows that FloatPIM can achieve on average 303.2× speedup and 48.6× energy efficiency in training as compared to the GPU-based approach. The higher efficiency of FloatPIM is more obvious on CNNs with a larger number of convolution layers. Figure 7 also compares FloatPIM efficiency over PipeLayer when it enables and disables the in-parallel data transfer between the memory blocks. Our evaluation shows that FloatPIM without parallelized data transfer provides 1.6× lower speedup, but 3.5× higher energy efficiency as compared to PipeLayer. However, exploiting switches significantly accelerates the FloatPIM computation by removing the internal data movement between the neighboring blocks. Our evaluation shows that FloatPIM enabling in-parallel data transfer can achieve on average 4.3× speedup and 15.8× energy efficiency as compared to PipeLayer. The higher energy efficiency of FloatPIM comes from (i) its digital-based operation, which avoids paying the extra cost of transferring data between the digital and analog/spike domain; and (ii) the higher density of FloatPIM, which enables significantly better parallelism. The PipeLayer computing precision is bounded to fixed point operations, while FloatPIM provides the floating point precision which is essential for highly accurate CNN training.

V. CONCLUSION

In this paper, we proposed FloatPIM, the first PIM-based DNN training architecture that exploits analog properties of
the memory without explicitly converting data into the analog domain. FloatPIM is a flexible PIM-based accelerator that works with floating-point as well as fixed-point precision. FloatPIM addresses the internal data movement issue of the PIM architecture by enabling in-parallel data transfer between the neighboring blocks. Our evaluation shows that FloatPIM can achieve on average 4.3× and 15.8× (6.3× and 21.6×) higher speedup and energy efficiency as compared to PipeLayer (ISAAC), the state-of-the-art PIM accelerator, during training (testing).

Fig. 7. FloatPIM efficiency during training. Legend: GPU, PipeLayer, FloatPIM (No Switches), FloatPIM.

ACKNOWLEDGEMENTS

This work was supported by Semiconductor Research Corporation GRC contract #2020-AH-2988. Mohsen Imani and Yeseong Kim are co-corresponding authors of the paper.

REFERENCES

[1] L. Song, X. Qian, H. Li, and Y. Chen, "Pipelayer: A pipelined reram-based accelerator for deep learning," in High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on, IEEE, 2017.
[2] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," nature, vol. 521, no. 7553, p. 436, 2015.
[3] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural networks, vol. 61, pp. 85-117, 2015.
[4] M. N. Bojnordi and E. Ipek, "Memristive boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning," in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, pp. 1-13, IEEE, 2016.
[5] M. Imani, M. Samragh, Y. Kim, S. Gupta, F. Koushanfar, and T. Rosing, "Rapidnn: In-memory deep neural network acceleration framework," arXiv preprint arXiv:1806.05794, 2018.
[6] S. Gupta, M. Imani, H. Kaur, and T. S. Rosing, "Nnpim: A processing in-memory architecture for neural network acceleration," IEEE Transactions on Computers, 2019.
[7] M. Imani, S. Gupta, and T. Rosing, "Genpim: Generalized processing in-memory to accelerate data intensive applications," in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1155-1158, IEEE, 2018.
[8] M. Imani, S. Gupta, Y. Kim, and T. Rosing, "Floatpim: In-memory acceleration of deep neural network training with high precision," in 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), pp. 802-815, IEEE, 2019.
[9] M. Imani, M. S. Razlighi, Y. Kim, S. Gupta, F. Koushanfar, and T. Rosing, "Deep learning acceleration with neuron-to-memory transformation," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 1-14, IEEE, 2020.
[10] D. Gao, D. Reis, X. S. Hu, and C. Zhuo, "Eva-cim: A system-level performance and energy evaluation framework for computing-in-memory architectures," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 1-1, 2020.
[11] S. Gupta et al., "Felix: Fast and energy-efficient logic in memory," in ICCAD, pp. 1-7, IEEE, 2018.
[12] M. Imani, A. Rahimi, D. Kong, T. Rosing, and J. M. Rabaey, "Exploring hyperdimensional associative memory," in HPCA, pp. 445-456, IEEE, 2017.
[13] K. Ni, X. Yin, A. F. Laguna, S. Joshi, S. Dünkel, M. Trentzsch, J. Müller, S. Beyer, M. Niemier, X. S. Hu, et al., "Ferroelectric ternary content-addressable memory for one-shot learning," Nature Electronics, vol. 2, no. 11, pp. 521-529, 2019.
[14] Y. Kim et al., "Orchard: Visual object recognition accelerator based on approximate in-memory processing," in ICCAD, pp. 25-32, IEEE, 2017.
[15] M. Zhou, M. Imani, S. Gupta, and T. Rosing, "Gas: A heterogeneous memory architecture for graph processing," in Proceedings of the International Symposium on Low Power Electronics and Design, p. 27, ACM, 2018.
[16] M. Zhou et al., "Gram: graph processing in a reram-based computational memory," in DAC, pp. 591-596, ACM, 2019.
[17] M. Imani et al., "Nvquery: Efficient query processing in non-volatile memory," IEEE TCAD, 2018.
[18] M. Imani, D. Peroni, Y. Kim, A. Rahimi, and T. Rosing, "Efficient neural network acceleration on gpgpu using content addressable memory," in 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1026-1031, IEEE, 2017.
[19] X. Yin, C. Li, Q. Huang, L. Zhang, M. Niemier, X. S. Hu, C. Zhuo, and K. Ni, "Fecam: A universal compact digital and analog content addressable memory using ferroelectric," arXiv preprint arXiv:2004.01866, 2020.
[20] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Proceedings of the 43rd International Symposium on Computer Architecture, pp. 14-26, IEEE Press, 2016.
[21] P. Chi et al., "Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory," in ISCA, pp. 27-39, IEEE Press, 2016.
[22] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[23] B. Feinberg, U. K. R. Vengalam, N. Whitehair, S. Wang, and E. Ipek, "Enabling scientific computing on memristive accelerators," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 367-382, IEEE, 2018.
[24] X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi, "Scaling for edge inference of deep neural networks," Nature Electronics, vol. 1, no. 4, pp. 216-222, 2018.
[25] C. Zhuo, S. Luo, H. Gan, J. Hu, and Z. Shi, "Noise-aware dvfs for efficient transitions on battery-powered iot devices," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 7, pp. 1498-1510, 2020.
[26] M. Cheng, L. Xia, Z. Zhu, Y. Cai, Y. Xie, Y. Wang, and H. Yang, "Time: A training-in-memory architecture for memristor-based deep neural networks," in Proceedings of the 54th Annual Design Automation Conference 2017, p. 26, ACM, 2017.
[27] Y. Cai, T. Tang, L. Xia, M. Cheng, Z. Zhu, Y. Wang, and H. Yang, "Training low bitwidth convolutional neural network on rram," in Proceedings of the 23rd Asia and South Pacific Design Automation Conference, pp. 117-122, IEEE Press, 2018.
[28] Y. Cai, Y. Lin, L. Xia, X. Chen, S. Han, Y. Wang, and H. Yang, "Long live time: improving lifetime for training-in-memory engines by structured gradient sparsification," in Proceedings of the 55th Annual Design Automation Conference, p. 107, ACM, 2018.
[29] F. Chollet, "keras." https://github.com/fchollet/keras, 2015.
[30] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[31] X. Dong, C. Xu, N. Jouppi, and Y. Xie, "Nvsim: A circuit-level performance, energy, and area model for emerging non-volatile memory," in Emerging Memory Technologies, pp. 15-50, Springer, 2014.
[32] D. Compiler, R. User, and M. Guide, "Synopsys," Inc., see http://www.synopsys.com, 2000.
[33] N. Talati, S. Gupta, P. Mane, and S. Kvatinsky, "Logic design within memristive memories using memristor-aided logic (magic)," IEEE Transactions on Nanotechnology, vol. 15, no. 4, pp. 635-650, 2016.
[34] S. Kvatinsky et al., "Vteam: A general model for voltage-controlled memristors," IEEE TCAS II, vol. 62, no. 8, pp. 786-790, 2015.
[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems, pp. 1097-1105, 2012.
[36] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1-9, 2015.
[38] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size," arXiv preprint arXiv:1602.07360, 2016.
[39] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778, 2016.