
Hardware Acceleration of Machine Learning Algorithms

CONTENTS:
 Intro to Machine Learning and Deep Learning.
 Intro to neural networks.
 Difference between training and inference.
 Parallel Paradigms (Temporal and Spatial).
 Types of data reuse and dataflow architectures.
 Hardware-software co-design.
 MIT Eyeriss.
 Google TPU.
 Some FPGA Implementations in brief.
INTRODUCTION TO MACHINE LEARNING
MORE ON CONVOLUTIONAL NEURAL NETWORKS

A NEURON

A BASIC ARTIFICIAL NEURAL NETWORK
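A single artificial neuron computes a weighted sum of its inputs plus a bias and passes the result through a non-linear activation function. A minimal sketch of that computation (the sigmoid activation and the example values are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def sigmoid(z):
    # Non-linear activation used here as an example
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # Weighted sum of the inputs plus a bias, followed by the activation
    return sigmoid(np.dot(w, x) + b)

# Example: a neuron with three inputs
x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, -0.3])   # weights
b = 0.2                          # bias
print(neuron(x, w, b))
```

A basic artificial neural network is simply layers of such neurons, with the outputs of one layer feeding the inputs of the next.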


CNN IN-DEPTH

CONV OPERATION → NON-LINEARITY → MAX POOLING OPERATION
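To make the three operations concrete, here is a minimal NumPy sketch of one CNN block: a 2-D convolution (valid padding, stride 1), a ReLU non-linearity, and 2×2 max pooling. The shapes and random values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def conv2d(x, k):
    # Valid 2-D convolution (cross-correlation, as in most CNN frameworks), stride 1
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def relu(x):
    return np.maximum(x, 0.0)                    # element-wise non-linearity

def max_pool(x, p=2):
    # Keep the largest value in each p x p window
    H, W = x.shape
    return x[:H - H % p, :W - W % p].reshape(H // p, p, W // p, p).max(axis=(1, 3))

fmap = np.random.randn(6, 6)                     # input feature map
kernel = np.random.randn(3, 3)                   # filter weights
print(max_pool(relu(conv2d(fmap, kernel))))      # 2x2 pooled output
```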


DIFFERENT STAGES OF A NEURAL NETWORK
 TRAINING: The stage in which an untrained neural network is fed inputs from a labelled dataset and its output is compared with the correct answer/label. The error is 'back-propagated' through the network in order to modify the weights so that they predict the correct labels better.

 INFERENCE: The deployment phase, where only forward propagation takes place. Once the network has learned the right weights, it is ready for prediction on previously unseen input data.
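The two stages can be summarised in code: training runs a forward pass, measures the error against the label and back-propagates it to update the weights, while inference runs only the forward pass on unseen data. A minimal single-neuron sketch (the squared-error loss, sigmoid activation and plain gradient descent are illustrative assumptions):

```python
import numpy as np

def forward(x, w, b):
    # Forward propagation through one sigmoid neuron
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

def train_step(x, label, w, b, lr=0.1):
    y = forward(x, w, b)            # forward pass
    err = y - label                 # compare the output with the correct label
    grad = err * y * (1.0 - y)      # back-propagate the error through the sigmoid
    w -= lr * grad * x              # modify the weights to predict the label better
    b -= lr * grad
    return w, b

# TRAINING: repeated forward + backward passes over a labelled dataset
w, b = np.zeros(2), 0.0
dataset = [(np.array([0.0, 1.0]), 1.0), (np.array([1.0, 0.0]), 0.0)]
for x, label in dataset * 100:
    w, b = train_step(x, label, w, b)

# INFERENCE: forward propagation only, on previously unseen input data
print(forward(np.array([0.2, 0.9]), w, b))
```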
PARALLELIZATION OF CONV AND FC COMPUTATION
 TEMPORAL ARCHITECTURE (CPUs and GPUs)
• The ALUs are controlled by central control logic and cannot communicate with each other directly.
• Methods such as vectors (SIMD) and parallel threads (SIMT) are used to parallelize computation.
• CONV -> matrix-matrix multiply; FC -> matrix-vector multiply (see the sketch after this list).
• E.g. FFT, Winograd's algorithm, Strassen's algorithm, Toeplitz matrices.

 SPATIAL ARCHITECTURE (ASICs and FPGAs)
• Dataflow processing: the ALUs form a processing chain; they have their own local memory and can communicate with nearby ALUs.
• ALU with local memory = Processing Engine (PE).
• Different memory hierarchies are created in order to store the right data for reuse.
• Types of reuse:
  1. Convolutional reuse (activations and weights)
  2. Fmap reuse (activations)
  3. Filter reuse (filter weights)
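On temporal architectures the convolution is typically lowered to a matrix-matrix multiply by unrolling the input patches into rows of a Toeplitz-like matrix (often called im2col). A minimal sketch of that lowering, with illustrative shapes:

```python
import numpy as np

def im2col(x, kh, kw):
    # Unroll every kh x kw patch of the input into one row of a matrix
    H, W = x.shape
    rows = []
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            rows.append(x[i:i + kh, j:j + kw].ravel())
    return np.array(rows)

x = np.random.randn(5, 5)               # input feature map
filters = np.random.randn(4, 3 * 3)     # 4 filters of size 3x3, flattened

cols = im2col(x, 3, 3)                  # 9 patches x 9 values
out = cols @ filters.T                  # CONV expressed as a matrix-matrix multiply
print(out.reshape(3, 3, 4).shape)       # four 3x3 output feature maps
```

An FC layer needs no such lowering: its weights already form a matrix, so the layer is directly a matrix-vector multiply.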
TYPES OF DATAFLOW ARCHITECTURES

WEIGHT STATIONARY: The filter weights are copied from DRAM into the register file (RF) of each PE and stay there for further reuse. Input fmap activations are broadcast to all PEs and partial sums are accumulated.

OUTPUT STATIONARY: The partial sums for each output activation are stored in the RF. The global buffer streams the input activations and broadcasts the weights to all the PEs.
NO LOCAL REUSE: Because small per-PE register files are inefficient in terms of area, they are excluded from the design and all of their area is allotted to a larger global buffer from which all the data is fetched.

ROW STATIONARY: A hybrid of WS and OS in which the processing of a 1-D row convolution is assigned to each PE.

A complete row of filter weights is placed in the RF and the input activations are streamed to the PE. The input activations that overlap between convolutions can also be kept in the RF.
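The dataflows differ mainly in which operand stays resident in a PE's register file and in the loop order that results. A minimal 1-D convolution sketch contrasting the weight-stationary and output-stationary orderings (plain Python loops, only to show the structure; the real accelerators unroll these loops spatially across the PE array):

```python
# 1-D convolution: out[i] = sum_k w[k] * x[i + k]
x = [1, 2, 3, 4, 5, 6]          # input activations
w = [1, 0, -1]                  # filter weights
n_out = len(x) - len(w) + 1

# WEIGHT STATIONARY: each weight stays resident (as if held in a PE's RF)
# while the activations stream past it; partial sums are accumulated in place.
out_ws = [0] * n_out
for k in range(len(w)):             # outer loop: one resident weight at a time
    for i in range(n_out):          # activations streamed / broadcast
        out_ws[i] += w[k] * x[i + k]

# OUTPUT STATIONARY: each partial sum stays resident while the weights and
# the matching activations are streamed to it.
out_os = [0] * n_out
for i in range(n_out):              # outer loop: one resident partial sum at a time
    for k in range(len(w)):         # weights broadcast, activations streamed
        out_os[i] += w[k] * x[i + k]

assert out_ws == out_os             # same result, different reuse pattern
print(out_ws)
```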
EYERISS by MIT based on Row-Stationary Dataflow Architecture

Consists of a 12×14 PE array.

Convolutions of different sizes are mapped onto these processing elements using techniques like folding and replication.
HARDWARE SOFTWARE CO-DESIGN
• Several aspects of CNN design can make it easier for the hardware to process the data. This falls under the wider concept of model compression of deep neural networks.
• Data quantization (a sketch follows below).
• Exploiting sparsity.
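As a concrete example of data quantization, below is a minimal sketch of symmetric linear quantization of weights to 8-bit integers. The scheme, bit-width and helper names are illustrative assumptions, not the method of any particular accelerator.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric linear quantization: one scale factor maps floats to int8
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)    # original FP32 weights
q, scale = quantize_int8(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, scale))))
```

The hardware benefit is that the multiply-accumulate units, the register files and the memory traffic can all work on 8-bit integers instead of 32-bit floats.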
Tensor Processing Unit: an ASIC by Google to accelerate the inference of deep neural networks
SYSTOLIC ARRAY TPU ARCHITECTURE

IMPORTANT COMPUTATIONAL RESOURCES

1. Matrix multiply unit: 65,536 multiply-and-add units for matrix multiplications.
2. Unified buffer: 24 MB of SRAM that acts as registers.
3. Activation unit: hardwired activation functions.
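The matrix multiply unit is a weight-stationary systolic array: the weights are pre-loaded into the grid, activations flow in from one edge, and partial sums flow down, so every cell performs one multiply-accumulate per cycle without re-reading the unified buffer. A minimal cycle-by-cycle simulation of a tiny systolic matrix multiply (the scheduling shown is an illustrative simplification, not the TPU's exact pipeline):

```python
import numpy as np

def systolic_matmul(A, B):
    # Weight-stationary systolic multiply of A (M x K) by B (K x N):
    # B is pre-loaded into a K x N grid of PEs, rows of A enter from the
    # left edge with a one-cycle skew, and partial sums flow down and leave
    # the bottom row of the array.
    M, K = A.shape
    _, N = B.shape
    a = np.zeros((K, N))        # activation held by each PE this cycle
    p = np.zeros((K, N))        # partial sum arriving at each PE this cycle
    out = np.zeros((M, N))
    for t in range(M + N + K - 2):               # cycles until the array drains
        for k in range(K):                       # skewed injection from the left edge
            m = t - k
            a[k, 0] = A[m, k] if 0 <= m < M else 0.0
        mac = p + a * B                          # every PE does one multiply-accumulate
        for n in range(N):                       # finished sums exit the bottom row
            m = t - n - (K - 1)
            if 0 <= m < M:
                out[m, n] = mac[K - 1, n]
        a = np.roll(a, 1, axis=1)                # activations move one PE to the right
        p = np.vstack([np.zeros((1, N)), mac[:-1, :]])  # partial sums move one PE down
    return out

A = np.random.randn(3, 4)
B = np.random.randn(4, 2)
print(np.allclose(systolic_matmul(A, B), A @ B))   # True
```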
SOME FPGA IMPLEMENTATIONS IN BRIEF
• FPGAs have the advantage over ASICs that their time-to-market can be made considerably shorter. This is very important in the extremely fast-changing world of machine learning.
EXAMPLES:
 FPGA accelerator by Baidu:

• Xilinx K7 480t
• Floating-point matrix multiply
• Floating-point activation functions
• Customized FP MUL and ADD
• Buffers built from 512×512 BRAM
• One ALU per 32×32 tile
ESE: EFFICIENT SPEECH RECOGNITION ENGINE WITH SPARSE LSTM ON FPGA

• More application-specific, targeting LSTM networks.
• Focuses on quantization and very low bit-widths using table lookups, which are very efficient on FPGAs.
• Floating-point datatypes.
• Pipelined architecture.
• Model pruning (a sparse matrix-vector sketch follows below).
• Load-balancing scheduler to manage the complex data-dependency requirements of RNNs and LSTM networks.
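Pruning leaves most weights at zero, so only the remaining non-zeros need to be stored and multiplied. A minimal sketch of a sparse matrix-vector product using CSR storage, which illustrates the general idea behind exploiting a pruned model (not ESE's actual encoding, quantization or load-balancing scheduler):

```python
import numpy as np

def to_csr(W):
    # Store only the non-zero weights of a pruned matrix
    values, col_idx, row_ptr = [], [], [0]
    for row in W:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def sparse_matvec(values, col_idx, row_ptr, x):
    # Multiply-accumulate only over the stored (non-zero) weights
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

W = np.random.randn(4, 6)
W[np.abs(W) < 1.0] = 0.0          # crude magnitude "pruning" for the example
x = np.random.randn(6)
vals, cols, ptr = to_csr(W)
print(np.allclose(sparse_matvec(vals, cols, ptr, x), W @ x))   # True
```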
8-BIT DOT PRODUCT OPTIMIZATION FROM XILINX

In the work on 8-bit dot-product acceleration, Xilinx demonstrates a method of using cascaded DSPs to implement both the multiplication and addition parts of dot products using dedicated FPGA hardware.

The main FPGA features demonstrated here are the use of the DSP as an 8-bit SIMD multiply-accumulate unit and the use of the direct DSP-to-DSP cascade interconnect.
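The arithmetic idea behind the SIMD use of the DSP is that one wide multiplier can produce two 8-bit products at once when they share an operand: ((a << 18) + b) * c = (a*c << 18) + b*c, and the two products can be sliced back out of the wide result as long as they do not overlap. A minimal sketch with unsigned 8-bit values (the signed case needs correction terms that are omitted here; this illustrates the packing idea, not Xilinx's exact DSP configuration):

```python
def packed_dual_multiply(a, b, c):
    # Compute a*c and b*c with a single wide multiplication, the way one
    # DSP slice can be shared by two 8-bit multiplies with a common operand.
    # a, b, c are assumed to be unsigned 8-bit values, so b*c < 2**16 and the
    # two partial products never overlap across the 18-bit boundary.
    packed = (a << 18) + b             # a in the upper bits, b in the lower bits
    product = packed * c               # one wide multiplication
    upper = product >> 18              # = a * c
    lower = product & ((1 << 18) - 1)  # = b * c
    return upper, lower

# Example: two multiplications sharing the operand c
a, b, c = 200, 37, 151
assert packed_dual_multiply(a, b, c) == (a * c, b * c)
print(packed_dual_multiply(a, b, c))
```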
Types of FPGA accelerator architectures

Streaming architecture: one distinct hardware block for each layer of the CNN, where each block is optimised separately to exploit the parallelism of that layer. All of these blocks are then chained together to form a pipeline.
Examples: FPGACONVNET, DeepBurning, HADDOC2 (Caffe to VHDL)
Types of FPGA accelerator architectures
Single computation engine: consists of a single computation engine, in the form of a systolic array or a matrix-multiply unit, that executes the CNN layers sequentially. The control of the hardware and the scheduling of operations is done by the software. There are fixed architectural templates which are scaled depending on the CNN. The processor responds to certain customised microinstructions.

EXAMPLES:
Project Snowflake – converts Caffe models to custom instructions.
NeuFlow – has a hardware architecture and a compiler called LuaFlow.
DNNWeaver, Angel-Eye, Caffeine
PROJECT SNOWFLAKE
Xilinx Zynq XC7Z045 (only 1 compute cluster implemented)

1. Custom instruction set (13 instructions).
2. A complete compiler that translates high-level models into these custom instructions.

Compiler tasks:
-> Model parsing
   - Thnets is a library that converts Torch7 models into C data structures (in 5 steps).
-> Instruction generation
-> Deploying the instructions in hardware

SNOWFLAKE MICROARCHITECTURE:
-> Control core
   a. Instruction fetch (instruction cache and PC)
   b. Instruction decode (the decode stage of a traditional RISC pipeline)
   c. Instruction dispatch (reads source operands from a register file)
   d. ALU (a typical one, but it does not do the processing; the compute core does the processing)
   e. Register file
-> Compute core
   Systolic-array-based implementation.
• Xilinx SDAccel for CPU/GPU/FPGA integration: an OpenCL tool to map neural networks directly to hardware.

• Multi-Layer Perceptron completely in hardware (https://www.youtube.com/watch?v=FmTzJv18VGU)

• Deep Neural Network Hardware Accelerator (https://projects.digilentinc.com/SmarTech/deep-neural-network-hardware-accelerator-bda5e8?)
NEURON MACHINE ARCHITECTURE
ACCELERATOR – TIMING SUMMARY

ACCELERATOR – UTILIZATION REPORT


REFERENCES:

• Efficient Processing of Deep Neural Networks: A Tutorial and Survey.


• Hardware Acceleration for Machine Learning.

• Exploration and Tradeoffs of different Kernels in FPGA Deep Learning Applications.


• Accelerating CNN inference on FPGAs: A Survey.
• [DL] A Survey of FPGA Based Neural Network Accelerator.

• 8-Bit Dot-Product Acceleration.


Thank You!
