SPINDLE: SPINtronic Deep Learning Engine for Large-scale Neuromorphic Computing ∗
Shankar Ganesh Ramasubramanian, Rangharajan Venkatesan, Mrigank Sharad,
Kaushik Roy and Anand Raghunathan
School of Electrical and Computer Engineering, Purdue University
West Lafayette, IN, USA
{sramasub,rvenkate,msharad,kaushik,raghunathan}@purdue.edu
ABSTRACT

Deep Learning Networks (DLNs) are bio-inspired large-scale neural networks that are widely used in emerging vision, analytics, and search applications. The high computation and storage requirements of DLNs have led to the exploration of various avenues for their efficient realization. Concurrently, the ability of emerging post-CMOS devices to efficiently mimic neurons and synapses has led to great interest in their use for neuromorphic computing.

We describe spindle, a programmable processor for deep learning based on spintronic devices. spindle exploits the unique ability of spintronic devices to realize highly dense and energy-efficient neurons and memory, which form the fundamental building blocks of DLNs. spindle consists of a three-tier hierarchy of processing elements to capture the nested parallelism present in DLNs, and a two-level memory hierarchy to facilitate data reuse. It can be programmed to execute DLNs with widely varying topologies for different applications. spindle employs techniques to limit the overheads of spin-to-charge conversion, and utilizes output and weight quantization to enhance the efficiency of spin-neurons. We evaluate spindle using a device-to-architecture modeling framework and a set of widely used DLN applications (handwriting recognition, face detection, and object recognition). Our results indicate that spindle achieves a 14.4X reduction in energy consumption and a 20.4X reduction in EDP over the CMOS baseline under iso-area conditions.

Categories and Subject Descriptors

C.1.3 [PROCESSOR ARCHITECTURES]: Other Architecture Styles (Neural Nets)

Keywords

Spintronics; Emerging Devices; Nanoelectronics; Post-CMOS; Neural Networks; Neuromorphic Computing

1. INTRODUCTION

Neuromorphic algorithms, which mimic the functionality of the human brain, are used for a wide class of applications involving classification, recognition, search, and inference. While the roots of neuromorphic algorithms lie in simple artificial neural networks, contemporary networks have grown to be much larger in scale and complexity. In this work, we focus on deep learning networks (DLNs) [1-3], an important class of large-scale neural networks that have shown state-of-the-art results on a range of problems in text, image, and video analysis. DLNs are currently used in real-world applications such as Google+ image search, Apple Siri voice recognition, house number recognition for Google Maps, etc. [4, 5]. In order to achieve high accuracy and robustness to variations in inputs, DLNs utilize several layers, with each layer consisting of a large number of neurons and varying interconnectivity patterns. For example, a DLN that recently won the ImageNet visual recognition challenge contains around 650,000 neurons and 60 million synapses, and requires compute power on the order of 2-4 GOPS per classification. The complexity of DLNs, together with the growth in the sizes of the data sets that they process, places high computational demands on the platforms that execute them. Several research efforts have been devoted to realizing efficient implementations of DLNs using multi-core processors, graphics processing units, and hardware accelerators [6-8]. All these approaches are limited by one common fundamental bottleneck: in effect, they emulate neuromorphic systems using primitives (instructions, digital arithmetic units, and Boolean gates) that are inherently mismatched with the constructs that they are used to realize (neurons and synapses).

A concurrent trend that has catalyzed the field of neuromorphic computing is the emergence of post-CMOS devices. Although a clear replacement for silicon CMOS is yet to be found, many emerging devices have unique characteristics and strengths that are different from those of CMOS. Among them, spintronics, which uses electron spin rather than charge to represent and process information, has attracted great interest. Spintronic memories promise high density, non-volatility, and near-zero leakage, leading to extensive research, industry prototypes, and early commercial offerings in recent years [9]. Recently, it has been demonstrated that spintronic devices can also be used to directly mimic the computations performed in neurons and synapses while operating at very low voltages [10, 11], leading to greatly reduced area and power over digital and analog CMOS implementations. These advances raise the prospect of realizing large-scale neuromorphic systems such as DLNs in an energy-efficient manner using spintronics. However, several challenges need to be addressed to realize this vision.

First, a direct mapping of a DLN into a network of spintronic neurons and synapses, as envisioned by previous work, leads to a large, inefficient design, especially for DLNs of high complexity (e.g., ~10^6 neurons and ~10^8 synapses). Second, for broader utility, it is desirable to have a programmable platform that can implement a wide range of networks of varying complexity and topology, as required by different applications. Third, the low spin diffusion length in most materials mandates the use of charge-based interconnects, imposing overheads for spin-to-charge and charge-to-spin conversion. Finally, the energy consumed by spin-neurons increases drastically with the precision at which their weights are stored and their outputs are computed. Therefore, careful architectural design is required in order to preserve the intrinsic efficiency of spintronic devices for large-scale neuromorphic computing.

∗ This work was supported in part by STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and in part by the National Science Foundation under grant no. 1320808.
In this work, we propose the SPINtronic Deep Learning Engine (spindle), an energy-efficient programmable architecture for Deep Learning Networks (DLNs). We consider various key characteristics of DLNs and spintronic devices in the design of spindle. Notably, DLNs exhibit multiple levels of nested parallelism and are comprised of a few recurring computation patterns. At the lowest level, they contain fine-grained data-parallel computations such as convolution and sub-sampling. These computations are in turn organized into layers, with task parallelism within each layer and producer-consumer parallelism across layers. DLNs exhibit significant data reuse across computations, and exploiting this reuse is critical to reducing off-chip memory accesses as well as limiting on-chip storage requirements. Considering these characteristics, we propose a hierarchical three-tiered architecture for spindle, consisting of Spin Neuromorphic Arrays (SNAs), Spin Neuromorphic Cores (SNCs), and SNC Clusters. spindle uses a two-level memory hierarchy, consisting of on-chip distributed scratchpad memories that are local to SNCs, and shared off-chip memory. We propose various techniques to enhance the efficiency of spindle, including intra- and inter-layer data reuse, and neuron output and weight quantization. We evaluate spindle using a hierarchical modeling framework, starting with physics-based device simulation and constructing circuit and architectural macro-models of spin-based neurons and memories. We compare spindle to a well-optimized CMOS baseline using three popular DLN applications: handwriting recognition, face detection, and object recognition. Our analysis shows that spindle achieves a 14.4X reduction in energy and a 20.4X reduction in energy-delay product under iso-area conditions, establishing the potential of spintronic devices for large-scale neuromorphic computing.

The rest of the paper is organized as follows. Section 2 provides the necessary background on DLNs. Section 3 presents the design of spin-based neurons and memory, which form the building blocks of spindle. Section 4 describes the spindle architecture and the techniques used to improve its energy efficiency. Section 5 details the experimental setup, including our device-to-architecture modeling framework. Section 6 presents results comparing spindle with a CMOS baseline. Section 7 provides an overview of prior efforts on hardware for neuromorphic computing, and Section 8 concludes the paper.

2. DEEP LEARNING NETWORKS

[Figure: overall structure of a DLN. The inputs are processed by a sequence of layers L1 ... LN, each containing multiple features; layers are connected through convolution-and-thresholding, subsampling-and-thresholding, and fully connected stages.]

A DLN is organized as a sequence of layers, each containing a number of features. The features of a layer are produced from those of the previous layer using two recurring computation patterns, described below, with fully connected layers typically used towards the output.

Convolution-and-Thresholding (C-T): Each input feature is convolved with a kernel (i.e., the dot-product of the kernel is computed with regions of the input feature in a sliding-window fashion), and the results are summed up to produce an intermediate output. A bias value is then added and a thresholding operation, also known as an activation function (typically tanh or sigmoid), is applied to produce a feature of the output layer. This process is repeated M times, where M is the number of features in the output layer.

Subsampling-and-Thresholding (S-T): Subsampling takes one input feature and produces one output feature, in which each value is produced by first computing the average of the neighboring values over a sliding window, adding a bias, and then performing a thresholding operation.
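To make the two patterns concrete, the following NumPy sketch is our own illustration (no code appears in the paper); the array shapes, the use of tanh, and the non-overlapping subsampling windows are assumptions made for simplicity.

```python
import numpy as np

def conv_threshold(in_feats, kernels, bias):
    """Convolution-and-Thresholding (C-T) for one output feature.

    in_feats: list of 2-D arrays (the input features)
    kernels:  list of 2-D (square) arrays, one kernel per input feature
    The sliding-window dot-products are summed across input features,
    a bias is added, and a tanh activation ("thresholding") is applied.
    """
    k = kernels[0].shape[0]
    H, W = in_feats[0].shape
    out = np.zeros((H - k + 1, W - k + 1))
    for x, ker in zip(in_feats, kernels):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] += np.sum(x[i:i + k, j:j + k] * ker)
    return np.tanh(out + bias)

def subsample_threshold(in_feat, win, bias):
    """Subsampling-and-Thresholding (S-T): average over win x win windows,
    add a bias, then apply the activation."""
    H, W = in_feat.shape
    out = in_feat[:H - H % win, :W - W % win] \
            .reshape(H // win, win, W // win, win).mean(axis=(1, 3))
    return np.tanh(out + bias)
```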
The connectivity between layers falls into two categories: (i) partially connected layers, where each feature is connected to a subset of the features from the previous layer, and (ii) fully connected layers, where each feature is connected to all features in the previous layer. Note that DLNs contain three levels of parallelism: producer-consumer parallelism across layers, task parallelism within a layer, and fine-grained data parallelism within each C-T or S-T operation. Furthermore, there exists significant data reuse across and within these operations.

The computational complexity of a DLN is determined by the number of layers, the number and sizes of network inputs and features in each layer, the connectivity between layers, and the kernel sizes. For example, CIFAR, an image classification DLN, takes an input of size 3x32x32 and has 685 features spread across 6 layers, requiring 3.77 million multiply-accumulate computations per classification. DLNs are also memory-intensive; for example, CIFAR requires memory accesses amounting to 7.5 MB per input.

3. SPINDLE BUILDING BLOCKS

The fundamental building blocks of spindle are spintronic neurons and memory. In this section, we provide a brief description of their structure and operation.

3.1 Spin-Neurons

[Figure 2: crossbar-based spin-neuron array. DTCS DACs drive the inputs IN1 ... INN (at voltages V, V+∆V) onto a resistive crossbar of memristors; each column current Ic1 ... IcM flows through a metal channel with spin-magnets and is digitized by an enhanced SAR-ADC with control logic and activation function units.]
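As a functional reference for the crossbar operation sketched in the figure, the snippet below is our own idealized model, not the authors' circuit model; it treats each memristor as a conductance, each DAC output as an input value in [0, 1], and each column current as the weighted sum that the spin-magnet and SAR-ADC stage digitizes. The bit-widths are placeholders.

```python
import numpy as np

def crossbar_weighted_sums(inputs, conductances, in_bits=4, out_bits=4):
    """Idealized model of the spin-neuron crossbar.

    inputs:        1-D array of N values in [0, 1] (driven through DACs)
    conductances:  N x M array; entry (n, m) is the memristor conductance
                   connecting input n to column (spin-neuron) m
    Each column current Ic_m = sum_n V_n * G_{n,m}; the SAR-ADC stage
    then quantizes it to 'out_bits' of precision.
    """
    # DACs: the digital inputs only have a small number of levels.
    v = np.round(inputs * (2 ** in_bits - 1)) / (2 ** in_bits - 1)
    ic = v @ conductances                  # analog column currents (weighted sums)
    # SAR-ADC: uniform quantization of each column current.
    lo, hi = ic.min(), ic.max()
    step = max((hi - lo) / (2 ** out_bits - 1), 1e-12)
    return np.round((ic - lo) / step)      # digital codes, one per column
```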
[…] energy-efficiency of the entire neuron, since it enables operation at very low voltages (~10s of mV).

[Figure: device-level view of the spin-neuron, showing the metal channel with spin-magnets and a tunneling barrier, select logic, bias current Ibias, column current Ic, sensing circuitry with load capacitor CL, a SAR unit, an inverse activation function lookup, and magnets M1-M4, biased at V and V+∆V.]

The cross-bar based spin-neuron performs a 120-input weighted summation ~117X and ~60X more energy efficiently as compared to state-of-the-art digital and analog CMOS (45 nm) implementations, respectively [10, 11]. The energy efficiency of the spin-neuron derives from two primary sources: (i) very low voltages compared to analog and digital CMOS implementations, and (ii) much lower circuit complexity (device count) compared to digital implementations that utilize adders and multipliers. Finally, we believe that the proposed design should be feasible from an integration perspective, since both spintronic and memristive devices have been […]
4. SPINDLE ARCHITECTURE

[…] Large DLNs are divided into partitions that are executed in a serial manner, with intermediate results stored in the memory hierarchy. This allows spindle to execute a wide range of DLNs and makes the architecture scalable and programmable. The components of spindle are described in greater detail in the following subsections.

[Figure 6: spindle architecture. spindle consists of SNC Clusters and a Global Control Unit with a memory-mapped host interface and control signals; each cluster contains multiple Spin Neuromorphic Cores (SNCs) connected by an intra-cluster interconnect; each SNC contains a dispatch unit, a scratchpad (spin memory), memristor programming row and column drivers, and multiple SNAs; each SNA contains an input selector (I1 ... In), a spin neuron array (outputs O1 ... Om), and control logic with activation function units.]

4.1 Spin Neuron Arrays

An SNA performs the smallest unit of computation, namely a convolution or subsampling followed by thresholding. It operates on a set of input features and produces an output feature. As shown in Figure 6, each SNA consists of a set of spin-neurons (Fig. 2), an input selector unit, and a memristor programming circuit.

The dispatch unit within each SNC streams the input features required by all SNAs. The input selector unit in each SNA selects the subset of inputs that the SNA uses, and feeds this data to the spin-neurons. The spin-neurons accept the inputs offered by the input selector unit, and perform weighted summations with the kernel (weight matrix) stored in the memristors. Each spin-neuron (column of the crossbar) produces one element of the output feature. The complete output feature is obtained over multiple cycles. The memristor programming circuit consists of row and column drivers. The drivers select a memristor located at a particular row and column (analogous to the decode circuits in RAM), and drive the appropriate current to program the memristor.

A key property of the SNA design is that all the neurons in the array share the same inputs and execute in parallel. This results in substantially lower control overheads, since these overheads are amortized across all spin-neurons. In addition, the computations that produce adjacent elements in an output feature share a large fraction of their inputs, and the crossbar structure is naturally suited for exploiting this fine-grained data reuse. However, each column requires only a subset of the inputs, and the memristor weights for the unused inputs are set to 0. Thus, increasing the data reuse by increasing the number of columns results in a larger fraction of the memristors being programmed to 0, leading to lower energy efficiency. We determine the number of columns in each SNA of spindle by balancing the benefits of data reuse against the overheads of programming zero weights to memristors.
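The sketch below is our own illustration (the kernel and dimensions are made up) of how one SNA maps a 1-D convolution onto the crossbar: all columns see the same streamed inputs, adjacent columns reuse most of them, and the positions a column does not need are simply programmed to zero weight, which is exactly the data-reuse versus zero-weight trade-off discussed above.

```python
import numpy as np

def build_sna_weight_matrix(kernel, num_cols, num_inputs):
    """Weights for one SNA computing 'num_cols' adjacent output elements
    of a 1-D convolution. Column c needs inputs [c, c + len(kernel));
    every other row of that column is programmed to zero."""
    W = np.zeros((num_inputs, num_cols))
    for c in range(num_cols):
        W[c:c + len(kernel), c] = kernel
    return W

kernel = np.array([0.5, -1.0, 0.25])
num_cols, num_inputs = 6, 8                   # 6 outputs share 8 streamed inputs
W = build_sna_weight_matrix(kernel, num_cols, num_inputs)

inputs = np.arange(num_inputs, dtype=float)   # streamed once, shared by all columns
outputs = inputs @ W                          # each column yields one output element
zero_fraction = np.mean(W == 0)               # grows as num_cols increases
print(outputs, zero_fraction)
```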
4.2 Spin Neuromorphic Cores

SNCs consist of a group of SNAs, local scratchpad memory, and a dispatch unit. Each SNC executes a set of convolution or sub-sampling operations from within a layer, while exploiting any available input feature re-use across these operations.

The scratchpad memory forms the highest level of the memory hierarchy and stores the input features used by the SNAs in the SNC as well as the output features that they generate. As described in Section 3.2, we utilize 1-bit DWM (domain wall memory) bit-cells for designing the scratchpad memory in spindle. The dispatch unit performs three key functions. First, it reads all the input features stored in the scratchpad memory and broadcasts them to all SNAs. The input selector in each SNA is programmed to pick the subset of inputs that it needs; this design choice amortizes control overheads across the SNAs, and enables data reuse across SNAs so that the read traffic to the scratchpad is minimized. Second, the dispatch unit controls the memristor programming circuitry, and supplies weights from the scratchpad memory to each SNA. Finally, the dispatch unit manages communication with the Global Control Unit (GCU).

As described in Section 3.1, the output precision and the number of memristor levels strongly influence the energy expended by SNAs. Therefore, each SNC has suitable tuning knobs, namely the number of successive approximations in the SAR-ADC and the precision of memristor programming. Fig. 7(a) shows the increase in energy required to program a memristor at increasing levels of precision, and Fig. 7(b) compares the energy consumption of a spin-neuron to a digital CMOS neuron for various values of output precision. The figures show that it is highly desirable to operate spin-neurons at the lowest possible precision in order to achieve higher energy efficiency.

[Figure 7: Effect of precision on energy. (a) Memristor programming energy vs. weight (memristor quantization) precision; (b) spin-neuron vs. CMOS neuron energy (log scale) vs. output precision.]

However, excessively reducing the precision of weights and outputs can lead to a reduction in output quality for the network. Therefore, it is necessary to determine the lowest weight and neuron output precisions that lead to acceptable output quality. For example, Fig. 8 shows the impact of weight and output quantization on the classification accuracy loss for the object recognition benchmark (CIFAR). For this benchmark, we choose a weight precision of 7 bits and an output precision of 4 bits, since they result in negligible impact on the output accuracy.

[Figure 8: Quantization vs. application accuracy. Accuracy loss as a function of weight quantization bits for 3- to 6-bit tanh outputs, compared to the original network.]
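A minimal sketch of the quantization just described (our own illustration; only the 7-bit weight and 4-bit output settings quoted above for CIFAR come from the text, everything else is assumed):

```python
import numpy as np

def quantize_weights(w, bits=7, w_max=1.0):
    """Snap weights to a uniform grid of 2**bits levels in [-w_max, w_max],
    mimicking the finite number of memristor programming levels."""
    step = 2 * w_max / (2 ** bits - 1)
    return np.clip(np.round(w / step) * step, -w_max, w_max)

def quantized_tanh(x, bits=4):
    """Activation computed at reduced output precision, as produced by an
    n-bit SAR-ADC followed by the activation-function lookup."""
    levels = 2 ** bits - 1
    y = np.tanh(x)                               # in (-1, 1)
    return np.round((y + 1) / 2 * levels) / levels * 2 - 1

w = np.random.uniform(-1, 1, size=(5, 5))        # toy kernel
x = np.random.randn(5)                           # toy inputs
y = quantized_tanh(quantize_weights(w) @ x, bits=4)
```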
4.3 SNC Cluster and Global Control Unit

The SNC Cluster consists of a group of SNCs that are connected through a high-bandwidth, low-latency local bus. SNC clusters also exploit intra-layer parallelism by generating different features of the same layer in parallel. The hierarchical bus architecture, if combined with a locality-aware mapping of the network to spindle, has the potential to reduce traffic on the high-energy global bus and off-chip memory accesses.
The Global Control Unit (GCU) orchestrates the overall execution of the DLN by triggering the execution of parts of the network on each SNC cluster, and the transfer of data between SNC Clusters, and to/from off-chip memory.
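The GCU's role can be pictured with a small scheduling sketch. This is our own illustration; the objects and method names (enqueue, trigger, and so on) are assumed hooks, not the authors' interface.

```python
def run_dln(gcu, layers, clusters):
    """Execute a DLN layer by layer: assign the features of each layer to
    SNC clusters (intra-layer parallelism), trigger execution, and then move
    intermediate features between clusters and to/from off-chip memory."""
    for layer in layers:
        # Distribute the features of this layer across the SNC clusters.
        for i, feature in enumerate(layer.output_features):
            clusters[i % len(clusters)].enqueue(feature)
        gcu.trigger(clusters)              # start all clusters on this layer
        gcu.wait_for_completion(clusters)
        gcu.transfer_results(clusters)     # inter-cluster / off-chip transfers
```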
5. EXPERIMENTAL METHODOLOGY

In this section, we provide a description of the experimental methodology used to evaluate spindle.

5.1 Modeling Framework

spindle differs from a traditional CMOS architecture in that it is composed of disparate technologies, namely spin-neurons, spin memory, and CMOS. We model each of these technologies across multiple levels of abstraction using the framework shown in Fig. 10.

[Figure 10: device-to-architecture modeling framework. Device level: coupled ion and electron transport modeling, spin diffusion modeling, and CMOS SPICE models feeding a hybrid SPICE simulator; circuit level: a library with spin-neuron and bit-cell models; architecture level: Design Compiler, an SNA simulator, and DWM-CACTI.]

[…] transistors to form a circuit model for spin-neurons. We use this circuit model to characterize SNAs.

In order to model spin memories, we evaluate the bit-cells using SPICE models and use a variant of CACTI [16] called DWM-CACTI [13, 17] to evaluate memory array characteristics. We use CACTI [16] to model the CMOS memories in the baseline design. The energy and timing for the control logic in spindle, as well as the entire CMOS baseline, are obtained by synthesizing RTL implementations using Synopsys Design Compiler to the 45 nm Nangate OpenCell FreePDK library.

The DLNs for the various benchmarks are built and trained using the Torch framework [18]. We train each application using the training input set and evaluate the application-level accuracy using an independent testing input set. We then quantize the weights and activation function to a lower precision and retrain the application until quality bounds (< 1.5% loss in accuracy compared to the original network) are met. The quantized DLNs are executed on the spindle architectural simulator to obtain their execution traces. These traces are analyzed with the appropriate energy and performance […]
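The quantize-and-retrain flow described above can be summarized by the following sketch. This is our own rendering; train, evaluate, and quantize are caller-supplied stand-ins for the corresponding Torch steps, and only the 1.5% quality bound is taken from the text.

```python
def quantize_until_quality_bound(train, evaluate, quantize, net,
                                 weight_bits, output_bits,
                                 max_loss=1.5, max_rounds=10):
    """Quantize weights and activation outputs, then retrain until the
    accuracy drop vs. the full-precision network is below max_loss (%)."""
    baseline = evaluate(net)                        # full-precision accuracy (%)
    q_net = quantize(net, weight_bits, output_bits)
    for _ in range(max_rounds):
        if baseline - evaluate(q_net) < max_loss:   # quality bound met
            break
        q_net = quantize(train(q_net), weight_bits, output_bits)
    return q_net
```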
5.2 DLN Benchmarks

To evaluate the benefits of spindle, we use three representative DLN benchmarks: (i) handwriting recognition on the MNIST database [1], (ii) face detection, and (iii) object classification on the CIFAR-10 dataset [19]. Relevant details of these three benchmarks are given in Table 1.

[Figure 9: Comparison of benefits for individual benchmarks. Normalized energy, EDP, and execution time of CMOS, SPINDLE-ScaleUp, and SPINDLE-DSE for (a) CIFAR, (b) MNIST, and (c) face detection.]
6. RESULTS

[…] spindle-ScaleUp achieves substantial energy and EDP improvements over the CMOS baseline. This demonstrates that the spindle architecture largely preserves the intrinsic benefits of spintronic neurons and memory. spindle-DSE further improves the benefits and achieves 14.4X energy and 20.4X EDP improvements, underscoring the value of architectural design space exploration.

6.2 Design Space Exploration

In this section, we study the impact of varying the architectural parameters of spindle on its energy and performance.

Impact of local memory per SNC: Varying the local memory per SNC results in an initial sharp decrease in overall system energy followed by a gradual increase, as shown in Fig. 11. […]

[Figure 11: Normalized energy and execution cycles (in thousands) as a function of the local memory per SNC.]

7. RELATED WORK

Several recent efforts have explored emerging post-CMOS devices as the fundamental building blocks of neuromorphic computing. In particular, PCRAM and memristors enable dense, crossbar memory designs that can store the weights compactly. Spin-based devices can match the basic computation patterns in neuromorphic algorithms while enabling very low-voltage operation [10].

While the above efforts have shown that emerging devices are promising for neuromorphic computing, they are primarily at the device level and form the motivation for our work, which focuses on the design of a large-scale, programmable hardware architecture based on these building blocks.

8. CONCLUSION

Spintronic devices have emerged as a promising technology that can realize the building blocks of neuromorphic computing […]