ISCA Tutorial
Speakers and Contributors
Outline
Participant Takeaways
• Understand the key design considerations for DNNs
• Be able to evaluate different implementations of DNNs with benchmarks and comparison metrics
• Understand the tradeoffs between various architectures and platforms
• Assess the utility of various optimization approaches
• Understand recent implementation trends and opportunities
Resources
Background of Deep Neural Networks
Artificial Intelligence
AI and Machine Learning
[Diagram: Machine Learning ⊂ Artificial Intelligence]
Brain-Inspired Machine Learning
[Diagram: Brain-Inspired ⊂ Machine Learning ⊂ Artificial Intelligence]
How Does the Brain Work?
[Diagram: Spiking ⊂ Brain-Inspired ⊂ Machine Learning ⊂ Artificial Intelligence]
Spiking Architecture
• Brain-inspired
• Integrate and fire
• Example: IBM TrueNorth
[Diagram: Spiking Neural Networks ⊂ Brain-Inspired ⊂ Machine Learning ⊂ Artificial Intelligence]
Neural Networks: Weighted Sum
[Diagram: Deep Learning ⊂ Neural Networks ⊂ Brain-Inspired ⊂ Machine Learning ⊂ Artificial Intelligence]
What is Deep Learning?
[Example: an input image is mapped to the label “Volvo XC90”]
• 350M images uploaded per day
• 2.5 petabytes of customer data hourly
• 300 hours of video uploaded every minute
ImageNet Challenge
ImageNet: Image Classification Task
[Plot: top-5 error rate (%) on the ImageNet classification task, 2010–2015, compared with human-level accuracy]
Established Applications
• Image
o Classification: image to object class
o Recognition: same as classification (except for faces)
o Detection: assigning bounding boxes to objects
o Segmentation: assigning object class to every pixel
• Speech & Language
o Speech Recognition: audio to text
o Translation
o Natural Language Processing: text to meaning
o Audio Generation: text to audio
• Games
Deep Learning on Games
Google DeepMind AlphaGo
Emerging Applications
• Medical (Cancer Detection, Pre-Natal)
http://www.nextplatform.com/2016/09/14/next-wave-deep-learning-applications/
Deep Learning for Self-driving Cars
Opportunities
Overview of Deep Neural Networks
DNN Timeline
• 1940s: Neural networks were proposed
• 1960s: Deep neural networks were proposed
• 1989: Neural network for recognizing digits (LeNet)
• 1990s: Hardware for shallow neural nets
– Example: Intel ETANN (1992)
• 2011: Breakthrough DNN-based speech recognition
– Microsoft real-time speech translation
• 2012: DNNs for vision supplanting traditional ML
– AlexNet for image classification
• 2014+: Rise of DNN accelerator research
– Examples: Neuflow, DianNao, etc.
Publications at Architecture Conferences
So Many Neural Networks!
http://www.asimovinstitute.org/neural-network-zoo/
DNN Terminology 101
[Diagrams: neurons and output activations (e.g., L2 output); sparsely-connected vs. fully-connected layers]
• Fully-Connected NN
– feed forward, a.k.a. multilayer perceptron (MLP)
• Convolutional NN (CNN)
– feed forward, sparsely-connected w/ weight sharing
• Recurrent NN (RNN)
– feedback
Inference vs. Training
Deep Convolutional Neural Networks
[Diagram: CONV layers extract low-level features → … → CONV layers extract high-level features → 1 – 3 FC layers → output classes]
• Each CONV layer: convolution followed by a non-linear activation
• Each FC layer: fully-connected matrix multiplication (×) followed by a non-linear activation
• Optional layers in between: normalization (NORM) and pooling (POOL), e.g., CONV → NORM → POOL → … → CONV → FC → classes
Convolution (CONV) Layer
• A filter of weights (R×S) is applied to a plane of input activations, a.k.a. input feature map (fmap), of size H×W
• Element-wise multiplication of the overlapping weights and activations is accumulated into a partial sum (psum)
• Sliding-window processing across the input produces the output fmap (E×F)
• Many input channels (C): each filter has C channels, and results are accumulated across channels
• Many filters (M): each of the M filters produces one of the M output channels
• Many input fmaps (N): a batch of N input fmaps yields N output fmaps
CNN Decoder Ring
• N: number of input fmaps (batch size)
• M: number of filters (output channels)
• C: number of input channels
• H/W: input fmap height/width
• R/S: filter height/width
• E/F: output fmap height/width
CONV Layer Tensor Computation
Tensors: output fmaps (O), input fmaps (I), filter weights (W), biases (B)
CONV Layer Implementation
Naïve 7-loop implementation (U is the stride):

for (n=0; n<N; n++) {                    // input fmaps (batch)
  for (m=0; m<M; m++) {                  // filters / output channels
    for (x=0; x<E; x++) {                // output fmap rows
      for (y=0; y<F; y++) {              // output fmap columns
        O[n][m][x][y] = B[m];
        for (i=0; i<R; i++) {            // filter rows
          for (j=0; j<S; j++) {          //   convolve
            for (k=0; k<C; k++) {        //   a window
              O[n][m][x][y] += I[n][k][Ux+i][Uy+j] × W[m][k][i][j];
            }
          }
        }
        O[n][m][x][y] = Activation(O[n][m][x][y]);   // apply activation
      }
    }
  }
}
Traditional Activation Functions
Sigmoid: y = 1 / (1 + e^-x)
Hyperbolic Tangent: y = (e^x − e^-x) / (e^x + e^-x)
[Plots: both curves over x ∈ [-1, 1]]
Image Source: Caffe Tutorial
Modern Activation Functions
Rectified Linear Unit (ReLU): y = max(0, x)
Leaky ReLU: y = max(αx, x), α = small const. (e.g. 0.1)
Exponential LU: y = x for x ≥ 0; y = α(e^x − 1) for x < 0
[Plots: all three curves over x ∈ [-1, 1]]
Image Source: Caffe Tutorial
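These curves reduce to one-liners; a minimal NumPy sketch (function and parameter names are ours, not from the tutorial):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)                      # (e^x - e^-x) / (e^x + e^-x)

def relu(x):
    return np.maximum(0.0, x)              # y = max(0, x)

def leaky_relu(x, alpha=0.1):
    return np.maximum(alpha * x, x)        # y = max(alpha*x, x)

def elu(x, alpha=0.1):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))
```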
Fully-Connected (FC) Layer
• Height and width of output fmaps are 1 (E = F = 1)
• Filters as large as input fmaps (R = H, S = W)
• Implementation: Matrix Multiplication
FC Layer – from CONV Layer POV
[Diagram: an FC layer viewed as a CONV layer: M filters, each of size H×W with C channels, exactly match the input fmap, so each filter produces a 1×1 output; a batch of N input fmaps yields N output fmaps]
Pooling (POOL) Layer
• Reduce resolution of each channel independently
• Overlapping or non-overlapping depending on stride
max = -Inf;
for (i=0; i<R; i++) {
  for (j=0; j<S; j++) {
    if (I[n][m][Ux+i][Uy+j] > max) {   // find the max within a window
      max = I[n][m][Ux+i][Uy+j];
    }
  }
}
O[n][m][x][y] = max;
Normalization (NORM) Layer
• Batch Normalization (BN)
– Normalize activations towards mean=0 and
std. dev.=1 based on the statistics of the training
dataset
– Placed between the CONV/FC layer and the activation function: [Diagram: CONV Layer → BN → Activation]
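In equation form, inference-time BN is a normalize-then-scale step per activation; a minimal sketch, assuming gamma/beta are the learned scale/shift and mu/var come from the training-set statistics:

```python
import numpy as np

def batch_norm(x, gamma, beta, mu, var, eps=1e-5):
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize toward mean=0, std. dev.=1
    return gamma * x_hat + beta             # learned scale and shift
```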
Normalization (NORM) Layer
• Local Response Normalization (LRN)
• Tries to mimic the inhibition scheme in the brain
Now deprecated!
Image Source: Caffe Tutorial
Relevant Components for Tutorial
Survey of DNN Development Resources
[Plot: ImageNet top-5 error by year, 2012–2015, vs. human; entrants include Clarifai]
LeNet-5: Digit Classification!
• CONV layers: 2
• Fully connected layers: 2
• Weights: 60k
• MACs: 341k
• Sigmoid used for non-linearity
AlexNet
• Input: 224×224 image; output: 1000 class scores
• 8 layers: 5 CONV layers (Conv 11×11, Conv 5×5, Conv 3×3, Conv 3×3, Conv 3×3) followed by 3 fully connected layers
• Max pooling after L1, L2, and L5; Norm (LRN) after L1 and L2; a non-linearity after every layer
• # of weights per layer: 35k, 307k, 885k, 664k, 442k, 37.7M, 16.8M, 4.1M
Large Sizes with Varying Shapes
AlexNet Convolutional Layer Configurations
Layer | Filter Size (R×S) | # Filters (M) | # Channels (C) | Stride
1 | 11×11 | 96 | 3 | 4
2 | 5×5 | 256 | 48 | 1
3 | 3×3 | 384 | 256 | 1
4 | 3×3 | 384 | 192 | 1
5 | 3×3 | 256 | 192 | 1
[Images: example filters from Layers 1–3]
GoogLeNet: Inception Module
• 1×1 ‘bottleneck’ to reduce number of weights
• 9 inception layers
ResNet: Learns the Residual
[Diagram: residual block: input x passes through 3×3 CONV and ReLU layers to produce F(x); a shortcut adds x, so H(x) = F(x) + x, i.e., the block learns the residual F(x) = H(x) − x and can easily learn the identity]
• 16 shortcut layers
Frameworks
• Rapid development
• Sharing models
• Workload profiling
• Network hardware co-design
Image Classification Datasets
• Image Classification/Recognition
  – Given an entire image, select 1 of N classes
  – No localization (detection)
MNIST (http://yann.lecun.com/exdb/mnist/)
• LeNet in 1998 (0.95% error)
• ICML 2013 (0.21% error)
ImageNet
• Object classification
• ~256×256 pixels (color)
• 1000 classes
• 1.3M training images
• 100,000 testing (50,000 validation)
Image Source: http://karpathy.github.io/
http://www.image-net.org/challenges/LSVRC/
ImageNet
• Fine-grained classes (e.g., 120 dog breeds)
• Winner 2016: 2.99% top-5 error
Image Source: http://karpathy.github.io/
http://www.image-net.org/challenges/LSVRC/
Image Classification Summary
 | MNIST | IMAGENET
Year | 1998 | 2012
Resolution | 28×28 | 256×256
Classes | 10 | 1000
Training | 60k | 1.3M
Testing | 10k | 100k
Accuracy | 0.21% error (ICML 2013) | 2.99% top-5 error (2016 winner)
http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html
Next Tasks: Localization and Detection
http://host.robots.ox.ac.uk/pascal/VOC/ http://mscoco.org/
Recently Introduced Datasets
Summary
Survey of DNN Hardware
Intel Xeon Phi (Knights Landing, 2016):
• 245W TDP
• 29 GFLOPS/W (FP32)
• 14nm process
Knights Mill: next gen Xeon Phi “optimized for deep learning”
Intel announced the addition of new vector instructions for deep learning
(AVX512-4VNNIW and AVX512-4FMAPS), October 2016
Source: Nvidia
GPUs Are Targeting Deep Learning
Nvidia VOLTA GV100 (2017)
• 15 TFLOPS FP32
• 16GB HBM2 – 900 GB/s
• 300W TDP
• 50 GFLOPS/W (FP32)
• 12nm process
• 300GB/s NV Link2
• Tensor Core….
Source: Nvidia
GV100 – “Tensor Core”
Tensor Core….
• 120 TFLOPS (FP16)
• 400 GFLOPS/W (FP16)
Systems for Deep Learning
Nvidia DGX-1 (2016)
• 170 TFLOPS
• 8× Tesla P100, Dual Xeon
• NVLink Hybrid Cube Mesh
• Optimized DL Software
• 7 TB SSD Cache
Source: Nvidia
Cloud Systems for Deep Learning
Facebook’s Deep Learning Machine
Source: Facebook
SOCs for Deep Learning Inference
Nvidia Tegra Parker
• CPU complex: ARM v8 (2× Denver 2 + 4× A57), coherent HMP
• GPU: 1.5 TFLOPS FP16
• 4GB LPDDR4 @ 25.6 GB/s
• 15 W TDP (1 W idle, <10 W typical)
• 100 GFLOPS/W (FP16)
• 16nm process
[Block diagram: 4K60 video encoder/decoder, audio engine, 2D engine, display engines, 128-bit LPDDR4, boot and PM processor, GigE + Ethernet MAC, image processor (ISP), security engines, safety engine, I/O]
Source: Nvidia
Mobile SOCs for Deep Learning
Samsung Exynos (ARM Mali)
• 14nm process
Source: Wikipedia
FPGAs for Deep Learning
Intel/Altera Stratix 10
• 10 TFLOPS FP32
• HBM2 integrated
• Up to 1 GHz
• 14nm process
• 80 GFLOPS/W
Xilinx Virtex UltraSCALE+
• DSP: up to 21.2 TMACS
• DSP: up to 890 MHz
• Up to 500Mb On-Chip Memory
• 16nm process
Kernel Computation
Fully-Connected (FC) Layer
• Matrix–Vector Multiply: multiply all inputs in all channels by a weight and sum
[Diagram: output (M × 1) = filter matrix (M × CHW) × input vector (CHW × 1)]
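A minimal NumPy sketch of the matrix–vector view (the toy sizes are ours, for illustration):

```python
import numpy as np

C, H, W, M = 3, 4, 4, 10                 # channels, height, width, filters
x = np.random.randn(C, H, W)             # input fmap
Wfc = np.random.randn(M, C * H * W)      # each row is one flattened filter
b = np.random.randn(M)

y = Wfc @ x.reshape(-1) + b              # output: M values, one per filter
```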
Convolution (CONV) Layer
• Convert to matrix multiplication using the Toeplitz matrix (w/ redundant data)
Example: a 2×2 filter (1 2; 3 4) convolved with a 3×3 fmap (1 2 3; 4 5 6; 7 8 9):

Matrix Mult:  [1 2 3 4] × | 1 2 4 5 |
                          | 2 3 5 6 |
                          | 4 5 7 8 |
                          | 5 6 8 9 |

(each column is one unrolled input window)
• Data is repeated: e.g., activations 2, 4, and 5 each appear in multiple columns
Convolution (CONV) Layer
• With multiple channels and filters: each filter is flattened into one row spanning all channels (Chnl 1 weights, then Chnl 2 weights), and the Toeplitz matrix (w/ redundant data) stacks the unrolled windows of Chnl 1 on top of those of Chnl 2:

Filter 1: [1 2 3 4 | 1 2 3 4]   ×   | 1 2 4 5 |  Chnl 1
Filter 2: [1 2 3 4 | 1 2 3 4]       | 2 3 5 6 |
                                    | 4 5 7 8 |
                                    | 5 6 8 9 |
                                    | 1 2 4 5 |  Chnl 2
                                    | 2 3 5 6 |
                                    | 4 5 7 8 |
                                    | 5 6 8 9 |
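A minimal im2col sketch reproducing the single-channel example above (the helper name is ours; real frameworks implement this far more efficiently):

```python
import numpy as np

def im2col(fmap, R, S):
    """Unroll each R x S window of fmap into one column (stride 1)."""
    H, W = fmap.shape
    E, F = H - R + 1, W - S + 1                 # output fmap size
    cols = np.empty((R * S, E * F))
    for x in range(E):
        for y in range(F):
            cols[:, x * F + y] = fmap[x:x+R, y:y+S].ravel()
    return cols

fmap = np.arange(1, 10).reshape(3, 3)           # the 3x3 example above
filt = np.array([[1, 2], [3, 4]])
out = filt.ravel() @ im2col(fmap, 2, 2)         # convolution as matrix mult
print(out.reshape(2, 2))                        # [[37 47] [67 77]]
```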
Computational Transforms
Gauss’s Multiplication Algorithm
Multiplying two complex numbers (a + bi)(c + di):
• Direct: (ac − bd) + (bc + ad)i → 4 multiplications + 3 additions
• Gauss: k1 = c·(a + b), k2 = a·(d − c), k3 = b·(c + d); real = k1 − k3, imaginary = k1 + k2 → 3 multiplications + 5 additions
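A sketch of the three-multiplication version (variable names are ours):

```python
def complex_mult_gauss(a, b, c, d):
    """(a + bi)(c + di) with 3 multiplications instead of 4."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2           # (real, imaginary)

print(complex_mult_gauss(1, 2, 3, 4)) # (1+2i)(3+4i) = (-5, 10)
```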
Strassen
Naïve 2×2 matrix multiply: 8 multiplications + 4 additions
Strassen: 7 multiplications + 18 additions
P1 = a(f − h)        P5 = (a + d)(e + h)
P2 = (a + b)h        P6 = (b − d)(g + h)
P3 = (c + d)e        P7 = (a − c)(e + f)
[Plot: number of multiplications vs. N, naïve vs. Strassen]
P4 = d(g − e)
Comes at the price of reduced numerical stability and requires significantly more memory
Image Source: http://www.stoimen.com/blog/2012/11/26/computer-algorithms-strassens-matrix-multiplication/
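A one-level sketch on 2×2 operands (scalars here; in practice the entries are sub-matrices and the recursion continues):

```python
def strassen_2x2(a, b, c, d, e, f, g, h):
    """[[a, b], [c, d]] @ [[e, f], [g, h]] with 7 multiplications."""
    p1 = a * (f - h); p2 = (a + b) * h
    p3 = (c + d) * e; p4 = d * (g - e)
    p5 = (a + d) * (e + h)
    p6 = (b - d) * (g + h)
    p7 = (a - c) * (e + f)
    return (p5 + p4 - p2 + p6, p1 + p2,      # top row of the product
            p3 + p4, p1 + p5 - p3 - p7)      # bottom row of the product

print(strassen_2x2(1, 2, 3, 4, 5, 6, 7, 8))  # (19, 22, 43, 50)
```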
Winograd 1D – F(2,3)
• Targeting convolutions instead of matrix multiply
• Notation: F(size of output, filter size)
• Direct computation of 2 outputs with a 3-tap filter: 6 multiplications + 4 additions; Winograd needs only 4 multiplications
• For a 2D tile, F(2×2, 3×3): Original: 36 multiplications; Winograd: 16 multiplications → 2.25× reduction
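A sketch of F(2,3) using the standard Winograd minimal-filtering identities (Lavin & Gray, cited in the references); the two filter-dependent terms can be precomputed:

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap convolution with 4 multiplications (m1..m4)."""
    d0, d1, d2, d3 = d                     # 4 input activations
    g0, g1, g2 = g                         # 3 filter taps
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2    # filter term can be precomputed
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2    # filter term can be precomputed
    m4 = (d1 - d3) * g2
    return m1 + m2 + m3, m2 - m3 - m4      # y0, y1

print(winograd_f23([1, 2, 3, 4], [1, 1, 1]))  # matches direct: (6.0, 9.0)
```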
Winograd Halos
• Winograd works on a small region of output at a time, and therefore uses inputs repeatedly
[Diagram: halo columns shared between adjacent input tiles]
Winograd Performance Varies
Source: Nvidia
Winograd Summary
Winograd as a Transform
• Transform the input tile d: Bᵀ d B; transform the filter g: G g Gᵀ (G g Gᵀ can be precomputed)
• Take the element-wise (dot) product of the transformed tiles
• Transform back to get the output: Y = Aᵀ [ (G g Gᵀ) ⊙ (Bᵀ d B) ] A
FFT Overview
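The idea behind the FFT approach: by the convolution theorem, convolution in the spatial domain becomes element-wise multiplication in the frequency domain, and a filter's FFT can be reused across the whole fmap. A minimal circular-convolution sketch (sizes are illustrative; real implementations pad and crop to obtain linear convolution):

```python
import numpy as np

H = 8
fmap = np.random.randn(H, H)
filt = np.zeros((H, H))
filt[:3, :3] = np.random.randn(3, 3)     # 3x3 filter, zero-padded to H x H

# FFT both operands, multiply pointwise, inverse-FFT the product.
out = np.real(np.fft.ifft2(np.fft.fft2(fmap) * np.fft.fft2(filt)))
```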
Optimization opportunities
cuDNN: Speed up with Transformations
Source: Nvidia
DNN Accelerator Architectures
Memory Access is the Bottleneck
• Each MAC* requires three memory reads (filter weight, fmap activation, partial sum) and one memory write (updated partial sum)
• Worst case: every read and write goes to DRAM
• Opportunity 1: insert local memories (Mem) between DRAM and the ALU so that most accesses are served cheaply
* multiply-and-accumulate
Types of Data Reuse in DNN
• Convolutional reuse: CONV layers only (sliding window); reuse of both activations and filter weights
• Fmap reuse: CONV and FC layers; multiple filters applied to the same input fmap reuse the activations
• Filter reuse: CONV and FC layers (batch size > 1); the same filter weights are applied to multiple input fmaps
Memory Access is the Bottleneck
• Opportunity 2: exploit data reuse and accumulate locally; partial sum accumulation does NOT have to access DRAM
• Example: DRAM accesses in AlexNet can be reduced from 2896M to 61M (best case)
Spatial Architecture for DNN
[Diagram: spatial architecture: DRAM feeding a global buffer and an array of processing elements (PEs)]
Low-Cost Local Data Access
[Diagram: memory hierarchy: DRAM → global buffer → inter-PE network → PE register file; fetch data as close to the ALU as possible to run a MAC there]
Weight Stationary (WS)
[Diagram: weights W0–W7 held stationary in the PE register files; activations and psums flow between the global buffer and the PE array]
WS Example: nn-X (NeuFlow)
[Diagram: a 3×3 2D convolution engine: weights stay in place while activations stream in and psums stream out]
Output Stationary (OS)
[Diagram: partial sums P0–P7 held stationary in the PE register files; weights and activations flow between the global buffer and the PE array]
OS Example: ShiDianNao
[Diagram: psums accumulated locally in each PE; weights and activations delivered from the global buffer and neighboring PEs]
No Local Reuse (NLR)
[Diagram: no PE-local storage; weights, activations, and psums all read from the global buffer]
NLR Example: UCLA
[Diagram: psums, activations, and weights streamed from the buffer]
NLR Example: TPU
[Diagram: weights, activations, and psums streamed through the array]
Energy Efficiency Comparison
• Same total area, 256 PEs
• AlexNet CONV layers, batch size = 16
[Bar chart: normalized energy/MAC (0–2) for the CNN dataflows WS, OSA, OSB, OSC (variants of OS), NLR, and Row Stationary]
[Chen et al., ISCA 2016]
Energy-Efficient Dataflow: Row Stationary (RS)
[Diagram: Filter * Input Fmap = Output Fmap]
1D Row Convolution in PE
• A filter row (a b c) is convolved with an input fmap row (a b c d e) to produce a row of partial sums (a b c)
• The PE’s register file (Reg File) holds the filter row stationary; the input row (e d c b a) slides past one activation per cycle, and each step produces one partial sum
[Diagram: animation steps of the sliding window generating partial sums a, b, c in turn]
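Functionally, each PE computes the plain 1D convolution below; the dataflow above is about where the operands live, not what is computed (sketch, names ours):

```python
def row_conv(filter_row, input_row):
    """One PE's work: slide the filter row across one input fmap row."""
    R = len(filter_row)
    return [sum(filter_row[i] * input_row[x + i] for i in range(R))
            for x in range(len(input_row) - R + 1)]

print(row_conv([1, 2, 3], [1, 2, 3, 4, 5]))  # 3 partial sums: [14, 20, 26]
```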
2D Convolution in PE Array
• Each PE performs one 1D row convolution; a column of PEs combines them into one output row: PE 1: Row 1 * Row 1, PE 2: Row 2 * Row 2, PE 3: Row 3 * Row 3 → output Row 1
• Further PE columns compute the other output rows: PEs 4–6: filter Rows 1–3 * fmap Rows 2–4 → output Row 2; PEs 7–9: filter Rows 1–3 * fmap Rows 3–5 → output Row 3
Convolutional Reuse Maximized
• Each filter row is reused horizontally across PEs (e.g., filter Row 1 in PEs 1, 4, 7)
• Each fmap row is reused diagonally across PEs (e.g., fmap Row 2 in PEs 2 and 4; fmap Row 3 in PEs 3, 5, 7)
Maximize 2D Accumulation in PE Array
• Partial sums accumulate vertically up each PE column to complete one output row
Dimensions Beyond 2D Convolution
1: Multiple Fmaps · 2: Multiple Filters · 3: Multiple Channels
Filter Reuse in PE (1: Multiple Fmaps)
• Filter 1, Channel 1, Row 1 is multiplied with Fmap 1 Row 1 (→ Psum 1 Row 1) and with Fmap 2 Row 1 (→ Psum 2 Row 1)
• Both computations share the same filter row
Fmap Reuse in PE (2: Multiple Filters)
• Fmap 1 Row 1 is multiplied with Filter 1, Channel 1, Row 1 (→ Psum 1 Row 1) and with Filter 2, Channel 1, Row 1 (→ Psum 2 Row 1)
• Both computations share the same fmap row
Channel Accumulation in PE (3: Multiple Channels)
• Filter 1 Rows from Channel 1 and Channel 2 are each multiplied with the corresponding fmap rows, and the resulting psums are accumulated: Row 1 + Row 1 = Row 1
DNN Processing – The Full Picture
[Diagram: the PE array combines all three: multiple fmaps (Filter 1 * Fmap 1 & 2 = Psum 1 & 2), multiple filters (Filter 1 & 2 * Fmap 1 = Psum 1 & 2), and multiple channels (accumulated psums)]
• Mapping the DNN shape and size onto the hardware is an optimization performed by a mapper, analogous to compilation:
  – DNN shape and size (Program) → Mapper (Compiler) → Mapping (Binary) → DNN Accelerator (Processor) → Processed Data
  – The mapper takes the dataflow (Architecture) and implementation details (µArch) as inputs; the accelerator executes the mapping on the input data
Evaluate Reuse in Different Dataflows
• Weight Stationary
– Minimize movement of filter weights
• Output Stationary
– Minimize movement of partial sums
• No Local Reuse
– No PE local storage. Maximize global buffer size.
• Row Stationary
Normalized Energy Cost*
ALU (computation) | 1× (reference)
RF → ALU | 1×
PE → ALU | 2×
Buffer → ALU | 6×
DRAM → ALU | 200×
Evaluation setup: same total area, 256 PEs, AlexNet, batch size = 16
Variants of Output Stationary
[Table: OS variants OSA, OSB, OSC differ in the shape of the parallel output region (E×E); some variants target CONV layers, others target FC layers]
Dataflow Comparison: CONV Layers
[Bar chart: normalized energy/MAC (0–2), broken down into psums, weights, and activations, for the CNN dataflows WS, OSA, OSB, OSC, NLR, RS]
RS optimizes for the best overall energy efficiency
[Chen et al., ISCA 2016]
Dataflow Comparison: CONV Layers
[Bar chart: normalized energy/MAC broken down by hardware level (ALU, RF, NoC, buffer, DRAM) for WS, OSA, OSB, OSC, NLR, RS]
RS uses 1.4× – 2.5× lower energy than other dataflows
[Chen et al., ISCA 2016]
Dataflow Comparison: FC Layers
[Bar chart: normalized energy/MAC (psums, weights, activations) for WS, OSA, OSB, OSC, NLR, RS]
RS uses at least 1.3× lower energy than other dataflows
[Chen et al., ISCA 2016]
Row Stationary: Layer Breakdown
[Bar chart: normalized energy (1 MAC = 1) for AlexNet layers L1–L8, broken down by ALU, RF, NoC, buffer, DRAM]
• CONV layers (L1–L5): RF access dominates; ~80% of total energy
• FC layers (L6–L8): DRAM access dominates; ~20% of total energy
[Chen et al., ISCA 2016]
[Diagram: Eyeriss chip overview: compression (Comp), ReLU, 108KB buffer, 64-bit interface to off-chip DRAM]
[Chen et al., ISSCC 2016]
Data Delivery with On-Chip Network
[Diagram: DCNN accelerator: separate link and core clock domains, 14×12 PE array, 108KB input image/filter/psum buffer (SRAM), decompression and compression units, ReLU; filter, fmap, and psum delivery patterns]
How to accommodate different shapes with a fixed PE array?
Logical to Physical Mappings
• Replication: AlexNet Layers 3–5 map as 13-wide × 3-tall logical PE groups; replicas fill the physical 14×12 PE array
• Folding: AlexNet Layer 2 maps as 27-wide × 5-tall; the 27 columns are folded (14 + 13) to fit the 14-column physical array
• Unused PEs are clock gated
Data Delivery with On-Chip Network
Compared to broadcast, multicast saves >80% of NoC energy
Chip Spec & Measurement Results
Technology | TSMC 65nm LP 1P9M
On-Chip Buffer | 108 KB
# of PEs | 168
Scratch Pad / PE | 0.5 KB
Core Frequency | 100 – 250 MHz
Peak Performance | 33.6 – 84.0 GOPS
Word Bit-width | 16-bit fixed-point
Natively Supported DNN Shapes | filter width 1 – 32, filter height 1 – 12, # filters 1 – 1024, # channels 1 – 1024, horz. stride 1 – 12, vert. stride 1, 2, 4
[Die photo: 4000 µm × 4000 µm; global buffer plus spatial array of 168 PEs]
To support 2.66 GMACs [8 billion 16-bit inputs (16GB) and 2.7 billion outputs (5.4GB)], only requires 208.5MB (buffer) and 15.4MB (DRAM)
[Chen et al., ISSCC 2016]
Summary of DNN Dataflows
• Weight Stationary
– Minimize movement of filter weights
– Popular with processing-in-memory architectures
• Output Stationary
– Minimize movement of partial sums
– Different variants optimized for CONV or FC layers
• No Local Reuse
– No PE local storage; maximize global buffer size
• Row Stationary
– Adapt to the NN shape and hardware constraints
– Optimized for overall system energy efficiency
Fused Layer
• Dataflow across multiple layers
eDRAM (DaDianNao)
• Advantages of eDRAM
– 2.85x higher density than SRAM
– 321x more energy-efficient than DRAM (DDR3)
• Store weights in eDRAM (36MB)
– Target fully connected layers since dominated by weights
• 16 parallel tiles
• Conductance = Weight (G1, G2, …)
• Voltage = Input (V1, V2, …)
• Current = Voltage × Conductance (I1 = V1 × G1, I2 = V2 × G2)
• Sum currents for addition: I = I1 + I2 = V1 × G1 + V2 × G2 → Output = Weight · Input
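In equation form, one crossbar column computes a dot product (Ohm's law per cell, Kirchhoff's current law on the bitline):

```latex
I \;=\; \sum_i V_i \, G_i \;=\; \sum_i \text{input}_i \times \text{weight}_i
```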
• Advantages
  – High density (< 10nm × 10nm cell size*): ~30× smaller than SRAM**, 1.5× smaller than DRAM**
  – Non-volatile
  – Operates at low voltage
  – Computation within memory (in situ): reduces data movement
[Diagram: ISAAC-style organization: ADC per crossbar, eight 128×128 arrays per IMA, 12 IMAs per tile]
Cost of Operations
Operation | Energy (pJ) | Area (µm²)
8b Add | 0.03 | 36
16b Add | 0.05 | 67
32b Add | 0.1 | 137
16b FP Add | 0.4 | 1360
32b FP Add | 0.9 | 4184
8b Mult | 0.2 | 282
32b Mult | 3.1 | 3495
16b FP Mult | 1.1 | 1640
32b FP Mult | 3.7 | 7700
32b SRAM Read (8KB) | 5 | N/A
32b DRAM Read | 640 | N/A
Number Representation
Format | Bits (sign / exponent / mantissa) | Range | Accuracy
FP32 | 1 / 8 / 23 | 10⁻³⁸ – 10³⁸ | .000006%
FP16 | 1 / 5 / 10 | 6×10⁻⁵ – 6×10⁴ | .05%
Int32 | 1 / – / 31 | 0 – 2×10⁹ | ½
Int16 | 1 / – / 15 | 0 – 6×10⁴ | ½
Int8 | 1 / – / 7 | 0 – 127 | ½
N-bit Precision
For no loss in precision, M is determined based on the largest filter size (in the range of 10 to 16 bits for popular DNNs)
[Diagram: an N×N multiply of weight (N bits) and activation (N bits) produces a 2N-bit product, accumulated at 2N+M bits, then quantized back to an N-bit output]
Dynamic Fixed Point
Floating point: sign (1 bit), exponent (8 bits), mantissa (23 bits)
• Example 32-bit float: -1.42122425 × 10⁻¹³ (s = 1, e = 70, m = 20482)
Fixed point: sign (1 bit), mantissa (7 bits), with f fractional bits
• 8-bit value 01100110 (s = 0, m = 102) with f = 3 → 12.75
• The same bits with f = 9 (dynamic fixed point) → 0.19921875
Allow f to vary based on data type and layer
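A minimal sketch of quantizing to 8-bit dynamic fixed point with a per-layer choice of f (names and rounding policy are ours):

```python
import numpy as np

def quantize(x, bits=8, f=3):
    """Round x to a signed fixed-point integer with f fractional bits."""
    scale = 2.0 ** f
    q = np.clip(np.round(x * scale), -2**(bits - 1), 2**(bits - 1) - 1)
    return q.astype(np.int32)

def dequantize(q, f=3):
    return q / 2.0 ** f

q = quantize(12.75, f=3)       # -> 102, as in the example above
print(q, dequantize(q, f=3))   # 102 12.75
```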
Impact on Accuracy
[Plot: top-1 accuracy of CaffeNet on ImageNet vs. bit-width]
Reduce bit-width → shorter critical path → reduce voltage: power reduction of 2.56× vs. 16-bit fixed on AlexNet Layer 2
• Scale factors (α, βᵢ) can change per layer or position in the filter
• Ternary weights {-w, 0, w} (first and last layers kept at 32-bit float); activations: 32-bit float; accuracy loss: 3.7% on AlexNet
• Trained Ternary Quantization (TTQ) [Zhu et al., ICLR 2017]: weights {-w1, 0, w2} (first and last layers kept at 32-bit float); activations: 32-bit float; accuracy loss: 0.6% on AlexNet
Non-Linear Quantization
• Precision refers to the number of levels
– Number of bits = log2 (number of levels)
Computed Non-linear Quantization
• Only activations in the log domain
• Both weights and activations in the log domain
Reduce Precision Overview
• Implement with a look-up table
• Additional Properties
  – Fixed or Variable (across data types, layers, channels, etc.)
Non-Linear Quantization Table Lookup
Trained Quantization: find U unique weights per layer via K-means clustering to reduce the number of unique weights (weight sharing)
Example: AlexNet (no accuracy loss): 256 unique weights per CONV layer, 16 unique weights per FC layer
• Smaller weight memory: store log₂U-bit weight indices (CRSM × log₂U bits) plus a U × 16b decoder/dequantizer table
• Overhead: the weight decoder/dequantizer; does not reduce the precision of the MAC itself (weights are expanded to 16 bits; input and output activations remain 16 bits)
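A minimal weight-sharing sketch using 1D k-means (initialization and iteration count are arbitrary choices of ours):

```python
import numpy as np

def weight_share(weights, U=16, iters=20):
    """Cluster weights into U centroids; return index map and codebook."""
    w = weights.ravel()
    centroids = np.linspace(w.min(), w.max(), U)       # linear init
    for _ in range(iters):                             # plain 1D k-means
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for k in range(U):
            if np.any(idx == k):
                centroids[k] = w[idx == k].mean()
    return idx.reshape(weights.shape), centroids

idx, codebook = weight_share(np.random.randn(64, 64), U=16)
approx = codebook[idx]          # dequantized weights: log2(16) = 4-bit indices
```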
Sparsity in Fmaps
Many zeros in output fmaps after ReLU:

 9 -1 -3            9 0 0
 1 -5  5  → ReLU →  1 0 5
-2  6 -1            0 6 0

[Diagram: Eyeriss applies ReLU and compresses fmaps before they cross the 64-bit interface to off-chip DRAM]
[Chen et al., ISSCC 2016]
Compression Reduces DRAM BW
[Bar chart: DRAM access (MB) for AlexNet CONV layers 1 – 5, uncompressed fmaps + weights vs. RLE-compressed fmaps + weights; per-layer reductions of 1.2× to 1.9×]
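A minimal sketch of the run-length idea on a ReLU output (the exact encoding format used by Eyeriss differs):

```python
def rle_encode(activations):
    """Store (zero_run, value) pairs instead of raw activations."""
    out, run = [], 0
    for a in activations:
        if a == 0:
            run += 1
        else:
            out.append((run, a))
            run = 0
    if run:
        out.append((run, None))   # trailing zeros
    return out

print(rle_encode([9, 0, 0, 1, 0, 5, 0, 6, 0]))
# [(0, 9), (2, 1), (1, 5), (1, 6), (1, None)]
```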
Prune Connections
[Diagram: pruning removes connections and neurons; the remaining weights are then retrained]
Example: AlexNet (batch size = 1)
• Weight reduction: CONV layers 2.7×, FC layers 9.9× (most reduction in the fully connected layers)
• Overall: 9× weight reduction, 3× MAC reduction
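A minimal sketch of the magnitude-based criterion (the retraining loop, which does most of the work in Han et al., is omitted):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights to hit a target sparsity."""
    thresh = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > thresh
    return weights * mask, mask    # mask keeps pruned weights at zero later

w = np.random.randn(1000)
w_pruned, mask = prune_by_magnitude(w, sparsity=0.9)
print(mask.mean())                 # ~0.1 of the weights remain
```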
Energy-Aware Pruning
• Energy estimation: E_data from the # of accesses at each memory level (L1, L2, L3, …); E_comp from the # of MACs
• Inputs: CNN weights and input data (e.g., [0.3, 0, -0.4, 0.7, 0, 0, 0.1, …]) → estimated CNN energy consumption
Evaluation tool available at http://eyeriss.mit.edu/energy.html
Key Observations
• Weights account for ~22% of the energy consumption of GoogLeNet
[Scatter plot: top-5 accuracy (77% – 89%) vs. normalized energy consumption (5E+08 – 5E+10) for AlexNet, SqueezeNet, and GoogLeNet; original DNNs and magnitude-based pruning [Han et al., NIPS 2015]]
Energy-Aware Pruning
[Scatter plot: top-5 accuracy (77% – 93%) vs. normalized energy consumption for AlexNet, SqueezeNet, GoogLeNet, VGG-16, and ResNet-50; original DNNs, magnitude-based pruning [Han et al., NIPS 2015], and energy-aware pruning (this work), which cuts AlexNet energy by 1.74×]
SCNN
[Diagram: SCNN PE: input activations (a, b) and weights (x, y, z) are all-to-all multiplied (a·x, a·y, a·z, b·x, b·y, b·z) in the PE frontend; a scatter network routes the products to accumulators in the PE backend]
Input Stationary Dataflow
[Parashar et al., ISCA 2017]
Structured/Coarse-Grained Pruning
• Scalpel
  – Prune to match the underlying data-parallel hardware organization for speedup
Network Architecture Design
Build network with a series of small filters:
• VGG-16: decompose a 5×5 filter into two 3×3 filters, applied sequentially
• GoogleNet/Inception v3: decompose a 5×5 filter into separable 5×1 and 1×5 filters, applied sequentially
Network Architecture Design
Reduce size and computation with 1×1 filters (bottleneck):
• GoogleNet: compress the number of channels with 1×1 filters before larger convolutions
• ResNet: compress with 1×1, apply the 3×3 convolution, then expand with 1×1
Figure Source: Stanford cs231n
SqueezeNet
• Reduce weights by reducing the number of input channels by “squeezing” with 1×1 filters (Fire module)
• 50× fewer weights than AlexNet (no accuracy loss)
[Scatter plot: top-5 accuracy vs. normalized energy consumption for the original AlexNet, SqueezeNet, and GoogLeNet]
[Kim et al., ICLR 2016]
Knowledge Distillation
[Diagram: complex teacher DNNs A and B each produce class scores that pass through a softmax to give probabilities; a simple student DNN is trained so that its softmax output matches the teachers’]
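A sketch of the matching objective in the usual Hinton-style form; the temperature T and the cross-entropy loss are typical choices, not taken from the slide:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T gives softer targets."""
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def distill_loss(student_scores, teacher_scores, T=4.0):
    p_teacher = softmax(teacher_scores, T)      # soft targets from teacher
    p_student = softmax(student_scores, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))  # cross-entropy
```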
• Input (http://image-net.org/)
  – Sample subset from the ImageNet validation dataset
Metrics for DNN Algorithm
• Accuracy
• Network Architecture
– # Layers, filter size, # of filters, # of channels
• # of Weights (storage capacity)
– Number of non-zero (NZ) weights
• # of MACs (operations)
– Number of non-zero (NZ) MACs
Metrics of DNN Algorithms
Metrics | AlexNet | VGG-16 | GoogLeNet (v1) | ResNet-50
Accuracy (top-5 error)* | 19.8 | 8.80 | 10.7 | 7.02
Input | 227×227 | 224×224 | 224×224 | 224×224
# of CONV Layers | 5 | 16 | 21 | 49
Filter Sizes | 3, 5, 11 | 3 | 1, 3, 5, 7 | 1, 3, 7
# of Channels | 3 – 256 | 3 – 512 | 3 – 1024 | 3 – 2048
# of Filters | 96 – 384 | 64 – 512 | 64 – 384 | 64 – 2048
Stride | 1, 4 | 1 | 1, 2 | 1, 2
# of CONV Weights | 2.3M | 14.7M | 6.0M | 23.5M
# of CONV MACs | 666M | 15.3G | 1.43G | 3.86G
# of FC Layers | 3 | 3 | 1 | 1
# of FC Weights | 58.6M | 124M | 1M | 2M
# of FC MACs | 58.6M | 124M | 1M | 2M
Total Weights | 61M | 138M | 7M | 25.5M
Total MACs | 724M | 15.5G | 1.43G | 3.9G
ASIC Specs
Process Technology | 65nm LP TSMC (1.0V)
Clock Frequency (MHz) | 200
Number of Multipliers | 168
Total core area (mm²) / total # of multipliers | 0.073
Total on-chip memory (kB) / total # of multipliers | 1.14
Measured or Simulated | Measured
If Simulated, Syn or PnR? Which corner? | n/a
ASIC Benchmark (e.g. Eyeriss)
Layer-by-layer breakdown for AlexNet CONV layers:
Metric | Units | L1 | L2 | L3 | L4 | L5 | Overall*
Batch Size | # | 4 (all layers)
Bit/Operand | # | 16 (all layers)
Energy / non-zero MAC (weight & act) | pJ/MAC | 16.5 | 18.2 | 29.5 | 41.6 | 32.3 | 21.7
DRAM access / non-zero MAC | Operands/MAC | 0.006 | 0.003 | 0.007 | 0.010 | 0.008 | 0.005
Runtime | ms | 20.9 | 41.9 | 23.6 | 18.4 | 10.5 | 115.3
Power | mW | 332 | 288 | 266 | 235 | 236 | 278
Example: FPGAs
Tutorial Summary
• DNNs are a critical component of the AI revolution, delivering record-breaking accuracy on many important AI tasks for a wide range of applications; however, this comes at the cost of high computational complexity
• Efficient processing of DNNs is an important area of research, with many promising opportunities for innovation at various levels of hardware design, including algorithm–hardware co-design
• When considering different DNN solutions, it is important to evaluate with the appropriate workload, in terms of both input and model, and to recognize that both are evolving rapidly
• It is important to consider a comprehensive set of metrics when evaluating different DNN solutions: accuracy, speed, energy, and cost
Resources
References
References (Alphabetical by Author)
• Dean, Jeffrey, et al., "Large Scale Distributed Deep Networks," NIPS, 2012
• Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for efficient
evaluation," NIPS, 2014.
• Dorrance, Richard, Fengbo Ren, and Dejan Marković. "A scalable sparse matrix-vector multiplication
kernel for energy-efficient sparse-blas on FPGAs." Proceedings of the 2014 ACM/SIGDA international
symposium on Field-programmable gate arrays. ACM, 2014.
• Du, Zidong, et al., "ShiDianNao: shifting vision processing closer to the sensor," ISCA, 2015
• Eryilmaz, Sukru Burc, et al. "Neuromorphic architectures with electronic synapses.” ISQED, 2016.
• Esser, Steven K., et al., "Convolutional networks for fast, energy-efficient neuromorphic computing,"
PNAS 2016
• Farabet, Clement, et al., "An FPGA-Based Stream Processor for Embedded Real-Time Vision with
Convolutional Networks," ICCV 2009
• Gokhale, Vinatak, et al., "A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks," CVPR Workshop,
2014
• Govoreanu, B., et al. "10×10nm² Hf/HfOx crossbar resistive RAM with excellent performance, reliability and low-energy operation," IEDM, 2011.
• Gupta, Suyog, et al., "Deep Learning with Limited Numerical Precision," ICML, 2015
• Gysel, Philipp, Mohammad Motamedi, and Soheil Ghiasi. "Hardware-oriented Approximation of
Convolutional Neural Networks." arXiv preprint arXiv:1604.03168 (2016).
• Han, Song, et al. "EIE: efficient inference engine on compressed deep neural network," ISCA,
2016.
References (Alphabetical by Author)
• Han, Song, et al. "Learning both weights and connections for efficient neural network,” NIPS, 2015.
• Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural network with
pruning, trained quantization and huffman coding," ICLR, 2016.
• He, Kaiming, et al. "Deep residual learning for image recognition," CVPR, 2016.
• Horowitz, Mark. “Computing's energy problem (and what we can do about it),” ISSCC, 2014.
• Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and< 1MB
model size," ICLR, 2017.
• Ioffe, Sergey, and Szegedy, Christian, "Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift," ICML, 2015
• Jermyn, Michael, et al., "Neural networks improve brain cancer detection with Raman spectroscopy in the
presence of operating room light artifacts," Journal of Biomedical Optics, 2016
• Jouppi, Norman P., et al. "In-datacenter performance analysis of a tensor processing unit,” ISCA, 2017.
• Judd, Patrick, et al. "Reduced-precision strategies for bounded memory in deep neural nets." arXiv
preprint arXiv:1511.05236 (2015).
• Judd, Patrick, Jorge Albericio, and Andreas Moshovos. "Stripes: Bit-serial deep neural network
computing." IEEE Computer Architecture Letters (2016).
• Kim, Duckhwan, et al. "Neurocube: a programmable digital neuromorphic architecture with high-density
3D memory." ISCA 2016
• Kim, Yong-Deok, et al. "Compression of deep convolutional neural networks for fast and low power mobile
applications." ICLR 2016
References (Alphabetical by Author)
• Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional
neural networks," NIPS, 2012.
• Lavin, Andrew, and Gray, Scott, "Fast Algorithms for Convolutional Neural Networks," arXiv preprint arXiv:
1509.09308 (2015)
• LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE
86.11 (1998): 2278-2324.
• LeCun, Yann, et al. "Optimal brain damage," NIPS, 1989.
• Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).
• Mathieu, Michael, Mikael Henaff, and Yann LeCun. "Fast training of convolutional networks through FFTs."
arXiv preprint arXiv:1312.5851 (2013).
• Merolla, Paul A., et al. "Artificial brains. A million spiking-neuron integrated circuit with a scalable communication network and interface," Science, 2014.
• Moons, Bert, and Marian Verhelst. "A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale
ConvNets." Symposium on VLSI Circuits (VLSI-Circuits), 2016.
• Moons, Bert, et al. "Energy-efficient ConvNets through approximate computing,” WACV, 2016.
• Parashar, Angshuman, et al. "SCNN: An Accelerator for Compressed-sparse Convolutional Neural
Networks." ISCA, 2017.
• Park, Seongwook, et al., "A 1.93TOPS/W Scalable Deep Learning/Inference Processor with Tetra-Parallel
MIMD Architecture for Big-Data Applications," ISSCC, 2015
• Peemen, Maurice, et al., "Memory-centric accelerator design for convolutional neural networks," ICCD,
2013
• Prezioso, Mirko, et al. "Training and operation of an integrated neuromorphic network based on metal-oxide
memristors." Nature 521.7550 (2015): 61-64.
References (Alphabetical by Author)
• Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural
Networks,” ECCV, 2016
• Reagen, Brandon, et al. "Minerva: Enabling low-power, highly-accurate deep neural network accelerators,”
ISCA, 2016.
• Rhu, Minsoo, et al., "vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network
Design," MICRO, 2016
• Russakovsky, Olga, et al. "Imagenet large scale visual recognition challenge." International Journal of
Computer Vision 115.3 (2015): 211-252.
• Sermanet, Pierre, et al. "Overfeat: Integrated recognition, localization and detection using convolutional
networks,” CVPR, 2014.
• Shafiee, Ali, et al. "ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic
in Crossbars." Proc. ISCA. 2016.
• Shen, Yichen, et al. "Deep learning with coherent nanophotonic circuits." Nature Photonics, 2017.
• Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image
recognition." arXiv preprint arXiv:1409.1556 (2014).
• Szegedy, Christian, et al. "Going deeper with convolutions,” CVPR, 2015.
• Vasilache, Nicolas, et al. "Fast convolutional nets with fbfft: A GPU performance evaluation." arXiv preprint
arXiv:1412.7580 (2014).
• Yang, Tien-Ju, et al. "Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware
Pruning," CVPR, 2017
• Yu, Jiecao, et al. "Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism." ISCA,
2017.
• Zhang, Chen, et al., "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," FPGA, 2015.