Agenda

Objectives

AI is Transforming Industries

The Flood of Data

Fast Evolution of Technology
Taxonomic Foundations

AI: Sense, learn, reason, act, and adapt to the real world without explicit programming

Data Analytics: Build a representation, query, or model that enables descriptive, interactive, or predictive analysis over any amount of diverse data

Perceptual Understanding: Detect patterns in audio or visual data with context

Machine Learning: Computational methods that use learning algorithms to build a model from data (in supervised, unsupervised, semi-supervised, or reinforcement mode)

Deep Learning: Algorithms inspired by neural networks with multiple layers of neurons that learn successively complex representations

Convolutional Neural Networks (CNN): DL topology particularly effective at image classification
Classical Machine Learning vs Deep Learning

Classic ML: using functions or algorithms (𝑓1, 𝑓2, …, 𝑓𝐾) to extract insights from new data, for inference or classification
▪ Random Forest
▪ Naïve Bayes
▪ Decision Trees
▪ Graph Analytics
▪ Regression
▪ Ensemble methods
▪ Support Vector Machines (SVM)
▪ More…

Deep Learning: using massive data sets to train deep (neural) graphs (CNN, RNN, RBM, etc.) that can extract insights from new data
▪ Step 1: Training (hours to days, in cloud): use a massive “known” dataset (e.g. 10M tagged images) to iteratively adjust the weighting of neural network connections, turning the untrained network into a trained one
▪ Step 2: Inference (real-time, at edge/cloud): form an inference about new input data (e.g. a photo) using the trained neural network

*Not all classical machine learning algorithms require separate training and new data sets
Training vs Inference

Step 1: Training (in data center, over hours/days/weeks): create a “deep neural net” math model and train it into a trained model; output classification, e.g. 97% person

Step 2: Inference (end point or data center, instantaneous): apply the trained neural network model to new data; output classification, e.g. 90% person, 8% traffic light
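The two-step flow above can be sketched in code. The example below is a hypothetical stand-in, not from the slides: it trains a single perceptron on the AND function (Step 1), then applies the trained weights to new inputs (Step 2).

```python
# Toy illustration of the training-vs-inference split: train a single
# perceptron on AND, then run inference with the trained weights.

def step(z):
    # Threshold activation: 1 if z > 0, else 0.
    return 1 if z > 0 else 0

# Step 1: Training -- iteratively adjust weights against a "known" dataset.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = [0.0, 0.0]
b = 0.0
lr = 0.1  # learning rate
for _ in range(20):
    for (x1, x2), target in data:
        pred = step(w[0] * x1 + w[1] * x2 + b)
        err = target - pred
        w[0] += lr * err * x1
        w[1] += lr * err * x2
        b += lr * err

# Step 2: Inference -- apply the trained weights to new input data.
print(step(w[0] * 1 + w[1] * 1 + b))  # AND(1, 1) -> 1
print(step(w[0] * 1 + w[1] * 0 + b))  # AND(1, 0) -> 0
```

Real DL training adjusts millions of weights over hours to days; the structure (fit once, then reuse the frozen model for fast predictions) is the same.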
Common Data Sets
ImageNet Classification Competition

Recent winners are all deep neural nets!
▪ 2012 Winner: AlexNet (University of Toronto), top-5 error rate of 15.3%, 5 convolution layers
▪ 2014 Winner: GoogLeNet, top-5 error rate of 6.67%, 22 layers in total
▪ 2015 Winner: Microsoft (ResNet), top-5 error rate of 3.6%, 152 layers
▪ 2016 Winner: 3rd Research Institute of Ministry of Public Security, China, with 2.991%
▪ 2017 Winner: WMW (researchers from Momenta and Oxford) with 2.251%

CPUs & FPGAs process thousands of frames of images per second.
Deep Learning Terminology

DL Network
▪ Layers and other hyperparameters
▪ e.g. AlexNet, GoogLeNet, VGG, ResNet, etc.

DL Frameworks
▪ High-level tools that make it easier to build deep learning algorithms
▪ Provide definitions and tools to define, train, and deploy your network
▪ e.g. Caffe, TensorFlow™, MxNet, Caffe2, Theano, Torch, etc.

DL Primitives Libraries
▪ Low-level, accelerator-specific libraries

(Example scene-segmentation labels from the slide: Sky, Clouds, Tree, Grass)
Perceptron

Diagram: inputs X0, X1, X2 feed a single neuron that produces one output.
Weights and Bias

output = 0 if Σᵢ Xᵢ·Wᵢ + B ≤ 0
output = 1 if Σᵢ Xᵢ·Wᵢ + B > 0
Example

Three inputs feed a perceptron: Am I hungry? (1), Is it a special occasion? (1), Can I expense the meal? (0). Each input is multiplied by its weight, the weighted sum is added to the bias, and the perceptron threshold rule determines the output.
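The threshold rule can be sketched directly in Python. The weights and bias below are hypothetical stand-ins chosen for illustration; the slide's own numbers belong to a diagram.

```python
# Perceptron decision rule: output 1 if sum(Xi * Wi) + B > 0, else 0.
# Weights and bias here are hypothetical, chosen only to illustrate.

def perceptron(inputs, weights, bias):
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0

inputs = [1, 1, 0]    # hungry? special occasion? can expense the meal?
weights = [3, 6, 2]   # hypothetical importance of each question
bias = -5             # hypothetical threshold

print(perceptron(inputs, weights, bias))  # 3 + 6 + 0 - 5 = 4 > 0 -> 1
```

Larger weights make a question matter more; the bias sets how easy it is to reach a "yes" overall.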
Sigmoid Neuron

Like the perceptron, a sigmoid neuron combines weighted inputs (X1, X2, …) into one output, but its activation is smooth:

f(z) = 1 / (1 + e⁻ᶻ)
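A minimal sketch of the sigmoid function, using only the standard library:

```python
import math

def sigmoid(z):
    # f(z) = 1 / (1 + e^-z): a smooth output between 0 and 1,
    # instead of the perceptron's hard 0/1 step.
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))            # 0.5 (the midpoint)
print(round(sigmoid(4), 3))  # close to 1 for large positive z
```

Because the output varies smoothly with the inputs, small weight changes cause small output changes, which is what makes gradient-based training work.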
Neural Networks

Network of interconnected neurons, arranged in an input layer, hidden layers, and an output layer.

Each neuron in the input layer relates to a piece of input data
▪ E.g.: a pixel, for image classification

Each neuron in the output layer relates to a piece of output data
▪ E.g.: the confidence that the image contains a car, for image classification

Diagram shows a fully connected network.
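A fully connected forward pass can be sketched in a few lines. The layer sizes, weights, and biases below are illustrative only, not from the slides:

```python
# Minimal forward pass through a fully connected network with one
# hidden layer. All weights/biases are arbitrary illustrative values.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    # Each output neuron takes a weighted sum of *every* input neuron
    # (that is what "fully connected" means), plus a bias.
    return [sigmoid(sum(x * w for x, w in zip(inputs, row)) + b)
            for row, b in zip(weights, biases)]

x = [0.5, -1.0, 0.25]                         # input layer: 3 neurons
hidden = dense(x, [[0.1, 0.4, -0.2],           # hidden layer: 2 neurons
                   [-0.3, 0.2, 0.5]], [0.0, 0.1])
output = dense(hidden, [[0.7, -0.6]], [0.0])   # output layer: 1 confidence
print(output)
```

Each `dense` call is one layer; stacking more calls gives a deeper network.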
AlexNet

Common benchmark used by the industry
Based upon a Convolutional Neural Network (CNN)
Convolution Layers

Inew(x, y) = Σ_{x′=−1..1} Σ_{y′=−1..1} Iold(x + x′, y + y′) × F(x′, y′)

The input feature map (a set of 2D images) is convolved with a filter (a 3D volume) to produce a new feature map.
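The formula above can be implemented directly. This sketch assumes a single 2D channel, a 3×3 filter, and zero padding at the borders (out-of-range pixels contribute 0):

```python
# Direct implementation of the slide's formula:
# Inew(x, y) = sum over x', y' in [-1, 1] of Iold(x+x', y+y') * F(x', y')
# One 2D channel, 3x3 filter, zero padding at the image borders.

def convolve(image, filt):
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    xx, yy = x + dx, y + dy
                    if 0 <= xx < w and 0 <= yy < h:  # zero padding
                        acc += image[yy][xx] * filt[dy + 1][dx + 1]
            out[y][x] = acc
    return out

identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # passes the image through
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(convolve(image, identity))  # unchanged image
```

A real convolution layer sums this computation over all input channels and repeats it for every filter, which is why convolution dominates CNN compute time.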
Additional Layers Used in AlexNet

AlexNet graph: Conv → ReLU → Norm → MaxPool → Fully Conn.
Rectified Linear Unit (ReLU)

Activation layer serving a similar purpose to the sigmoid function.
Applied to each neuron independently during the ReLU layer:

f(x) = max(0, x)
Normalization (Local Response Normalization)
Smoothing function applied through the depth of the image layers
▪ Reduces the relative peaks in neighbouring filters
▪ Padding used to preserve output layer depth
Pooling
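Pooling downsamples a feature map by keeping one value per window. A common configuration (assumed here, since the slide does not fix the window size) is 2×2 max pooling with stride 2:

```python
# Sketch of 2x2 max pooling with stride 2: each output pixel keeps
# only the largest value of its non-overlapping 2x2 input window,
# halving the feature map in each dimension.

def max_pool_2x2(image):
    return [[max(image[y][x], image[y][x + 1],
                 image[y + 1][x], image[y + 1][x + 1])
             for x in range(0, len(image[0]), 2)]
            for y in range(0, len(image), 2)]

image = [[1, 3, 2, 4],
         [5, 6, 7, 8],
         [9, 2, 1, 0],
         [3, 4, 5, 6]]
print(max_pool_2x2(image))  # [[6, 8], [9, 6]]
```

Pooling reduces the amount of downstream computation and makes the detected features somewhat insensitive to small shifts in position.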
FPGA Architecture

Diagram: FPGA fabric containing DSP blocks and memory blocks.
FPGA Architecture: Basic Elements

Each basic element pairs a 1-bit configurable operation with a 1-bit register (to store the result).
FPGA Architecture: Flexible Interconnect
FPGA Architecture: Custom Operations Using Basic Elements

Basic elements can be combined to build wider custom operations, e.g. a 16-bit add or a 32-bit sqrt.
FPGA Architecture: Memory Blocks

▪ Lots of smaller caches and a few larger caches
▪ Can be configured and grouped using the interconnect to create various cache architectures
FPGA Architecture: Floating Point Multiplier/Adder Blocks

Diagram: hardened floating-point blocks in the data path (data_in → data_out).
DSP Blocks

Thousands of Digital Signal Processing (DSP) blocks in modern FPGAs
▪ Configurable to support multiple features
– Variable-precision fixed-point multipliers
– Adders with accumulation register
– Internal coefficient register bank
– Rounding
– Pre-adder to form tap-delay line for filters
– Single-precision floating point multiplication, addition, accumulation
FPGA Architecture: Configurable Routing
FPGA Architecture: Configurable IO
FPGA I/Os and Interfaces

High-level code:
    Mem[100] += 42 * Mem[101]

CPU instructions:
    R0 ← Load Mem[100]
    R1 ← Load Mem[101]
    R2 ← Load #42
    R2 ← Mul R1, R2
    R0 ← Add R2, R0
    Store R0 → Mem[100]
First let’s take a look at execution on a simple CPU

Diagram: a simple CPU data path — PC and instruction fetch, load/store units (LdAddr/LdData, StAddr/StData), a register file (Aaddr/Baddr/Caddr, CData, CWriteEnable), and an ALU driven by the instruction’s Op fields.

Fixed and general architecture:
▪ General “cover-all-cases” data-paths
▪ Fixed data-widths
▪ Fixed operations
Looking at a Single Instruction

Diagram: plotting the loads, the multiply by 42, and the store over time against the CPU’s resources shows that each instruction exercises only a small part of the general data path at any moment; the rest sits idle.
Custom Data-Path on the FPGA Matches Your Algorithm!

High-level code:
    Mem[100] += 42 * Mem[101]

Custom data-path (load, load, multiply by 42, add, store): build exactly what you need:
▪ Operations
▪ Data widths
▪ Memory size & configuration

Efficiency: throughput / latency / power
Advantages of Custom Hardware with FPGAs

▪ Custom hardware!
▪ Efficient processing

Diagram: Intel® Stratix 10 FPGA (DSP blocks, transceiver channels) alongside an Intel® Xeon® processor.
Why Intel® FPGAs for Machine Learning?

Convolutional Neural Networks are compute intensive, highly parallel, and fine-grained, with low latency between compute and memory.

Feature / Benefit:
▪ Highly parallel architecture: facilitates efficient low-batch video stream processing and reduces latency
▪ Configurable distributed floating-point DSP blocks (FP32 9 Tflops, FP16, FP11): accelerates computation by tuning compute performance
▪ Tightly coupled high-bandwidth memory (>50 TB/s on-chip SRAM bandwidth, random access): reduces latency, minimizes external memory access
▪ Programmable data path: reduces unnecessary data movement, improving latency and efficiency

Diagram: FPGA with optional external memory attached.
Deterministic Latency Matters for Inference

Automotive example:
▪ Latency impacts response time and distance
▪ Factors that impact latency: batch size, IO latency
▪ Need to perform better than a human

FPGAs leverage parallelism across the entire chip to reduce compute latency.
FPGAs have flexible and customizable IOs with low, deterministic I/O latency.

System Latency = I/O Latency + Compute Latency

Diagram: compute latency compared between FPGA and Xeon for a compact network (input I × weights W = output O, with shared weights).
CNN Inference Implementation Requirements

High-bandwidth local storage for filter data and partial sums
▪ >58 TB/s internal memory bandwidth in Stratix 10
Summary

Deep Learning is a type of machine learning that extracts patterns from data using neural networks.

DL neural networks are built and trained using frameworks, by combining various layers.

FPGAs are made up of a variety of building blocks; FPGA development tools translate code into custom hardware built from those blocks.

FPGAs provide a flexible, deterministic, low-latency, high-throughput, and energy-efficient solution for accelerating the constantly changing networks and precisions used in DL inference.
Legal Disclaimers/Acknowledgements