
Hardware Acceleration of Machine Learning Algorithms

CONTENTS:
 Intro to Machine Learning and Deep Learning.
 Intro to neural networks.
 Difference between training and inference.
 Parallel Paradigms (Temporal and Spatial).
 Types of data reuse and dataflow architectures.
 Hardware-software co-design.
 MIT Eyeriss.
 Google TPU.
 Some FPGA Implementations in brief.
INTRODUCTION TO MACHINE LEARNING
MORE ON CONVOLUTIONAL NEURAL NETWORKS

A NEURON

A BASIC ARTIFICIAL NEURAL NETWORK
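A single artificial neuron computes a weighted sum of its inputs plus a bias and passes the result through a non-linear activation function. A minimal sketch of that computation (the sigmoid activation and the example values are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def sigmoid(z):
    # Non-linear activation used here as an example
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # Weighted sum of the inputs plus a bias, followed by the activation
    return sigmoid(np.dot(w, x) + b)

# Example: a neuron with three inputs
x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, -0.3])   # weights
b = 0.2                          # bias
print(neuron(x, w, b))
```

A basic artificial neural network is simply layers of such neurons, with the outputs of one layer feeding the inputs of the next.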


CNN IN-DEPTH

CONV OPERATION → NON-LINEARITY → MAX POOLING OPERATION
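To make the three operations concrete, here is a minimal NumPy sketch of one CNN block: a 2-D convolution (valid padding, stride 1), a ReLU non-linearity, and 2×2 max pooling. The shapes and random values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def conv2d(x, k):
    # Valid 2-D convolution (cross-correlation, as in most CNN frameworks), stride 1
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def relu(x):
    return np.maximum(x, 0.0)                    # element-wise non-linearity

def max_pool(x, p=2):
    # Keep the largest value in each p x p window
    H, W = x.shape
    return x[:H - H % p, :W - W % p].reshape(H // p, p, W // p, p).max(axis=(1, 3))

fmap = np.random.randn(6, 6)                     # input feature map
kernel = np.random.randn(3, 3)                   # filter weights
print(max_pool(relu(conv2d(fmap, kernel))))      # 2x2 pooled output
```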


DIFFERENT STAGES OF A NEURAL NETWORK
 TRAINING: The stage in which an untrained neural network is fed inputs from a labelled dataset and its output is compared with the correct answer/label. The error is 'back-propagated' through the network in order to modify the weights so that they predict the correct labels better.

 INFERENCE: The deployment phase, where only forward propagation takes place. Once the network has learned the right weights, it is ready for prediction on previously unseen input data.
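The two stages can be summarised in code: training runs a forward pass, measures the error against the label and back-propagates it to update the weights, while inference runs only the forward pass on unseen data. A minimal single-neuron sketch (the squared-error loss, sigmoid activation and plain gradient descent are illustrative assumptions):

```python
import numpy as np

def forward(x, w, b):
    # Forward propagation through one sigmoid neuron
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

def train_step(x, label, w, b, lr=0.1):
    y = forward(x, w, b)            # forward pass
    err = y - label                 # compare the output with the correct label
    grad = err * y * (1.0 - y)      # back-propagate the error through the sigmoid
    w -= lr * grad * x              # modify the weights to predict the label better
    b -= lr * grad
    return w, b

# TRAINING: repeated forward + backward passes over a labelled dataset
w, b = np.zeros(2), 0.0
dataset = [(np.array([0.0, 1.0]), 1.0), (np.array([1.0, 0.0]), 0.0)]
for x, label in dataset * 100:
    w, b = train_step(x, label, w, b)

# INFERENCE: forward propagation only, on previously unseen input data
print(forward(np.array([0.2, 0.9]), w, b))
```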
PARALLELIZATION OF CONV AND FC COMPUTATION
 TEMPORAL ARCHITECTURE (CPUs and GPUs)
• The ALUs are controlled by central control logic and cannot communicate with each other directly.
• Methods such as vectors (SIMD) and parallel threads (SIMT) are used to parallelize computation.
• CONV -> matrix-matrix multiply; FC -> matrix-vector multiply (see the sketch after this list).
• E.g. FFT, Winograd's algorithm, Strassen's algorithm, Toeplitz matrices.

 SPATIAL ARCHITECTURE (ASICs and FPGAs)
• Dataflow processing: the ALUs form a processing chain; they have their own local memory and can communicate with nearby ALUs.
• ALU with local memory = Processing Engine (PE).
• Different memory hierarchies are created in order to store the right data for reuse.
• Types of reuse:
  1. Convolutional reuse (activations and weights)
  2. Fmap reuse (activations)
  3. Filter reuse (filter weights)
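On temporal architectures the convolution is typically lowered to a matrix-matrix multiply by unrolling the input patches into rows of a Toeplitz-like matrix (often called im2col). A minimal sketch of that lowering, with illustrative shapes:

```python
import numpy as np

def im2col(x, kh, kw):
    # Unroll every kh x kw patch of the input into one row of a matrix
    H, W = x.shape
    rows = []
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            rows.append(x[i:i + kh, j:j + kw].ravel())
    return np.array(rows)

x = np.random.randn(5, 5)               # input feature map
filters = np.random.randn(4, 3 * 3)     # 4 filters of size 3x3, flattened

cols = im2col(x, 3, 3)                  # 9 patches x 9 values
out = cols @ filters.T                  # CONV expressed as a matrix-matrix multiply
print(out.reshape(3, 3, 4).shape)       # four 3x3 output feature maps
```

An FC layer needs no such lowering: its weights already form a matrix, so the layer is directly a matrix-vector multiply.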
TYPES OF DATAFLOW ARCHITECTURES

WEIGHT STATIONARY: The filter weights are copied from DRAM into the register file (RF) of each PE and stay there for further reuse. Input fmap activations are broadcast to all PEs and partial sums are accumulated.

OUTPUT STATIONARY: The partial sums for each output activation are stored in the RF. The global buffer streams the input activations and broadcasts the weights to all the PEs.
NO LOCAL REUSE: Because small per-PE register files are inefficient in terms of area, they are excluded from the design and all of their area is allotted to a larger global buffer from which all the data is fetched.

ROW STATIONARY: A hybrid of WS and OS in which the processing of a 1-D row convolution is assigned to each PE.

A complete row of filter weights is placed in the RF and the input activations are streamed to the PE. The input activations that overlap between convolutions can also be kept in the RF.
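The dataflows differ mainly in which operand stays resident in a PE's register file and in the loop order that results. A minimal 1-D convolution sketch contrasting the weight-stationary and output-stationary orderings (plain Python loops, only to show the structure; the real accelerators unroll these loops spatially across the PE array):

```python
# 1-D convolution: out[i] = sum_k w[k] * x[i + k]
x = [1, 2, 3, 4, 5, 6]          # input activations
w = [1, 0, -1]                  # filter weights
n_out = len(x) - len(w) + 1

# WEIGHT STATIONARY: each weight stays resident (as if held in a PE's RF)
# while the activations stream past it; partial sums are accumulated in place.
out_ws = [0] * n_out
for k in range(len(w)):             # outer loop: one resident weight at a time
    for i in range(n_out):          # activations streamed / broadcast
        out_ws[i] += w[k] * x[i + k]

# OUTPUT STATIONARY: each partial sum stays resident while the weights and
# the matching activations are streamed to it.
out_os = [0] * n_out
for i in range(n_out):              # outer loop: one resident partial sum at a time
    for k in range(len(w)):         # weights broadcast, activations streamed
        out_os[i] += w[k] * x[i + k]

assert out_ws == out_os             # same result, different reuse pattern
print(out_ws)
```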
EYERISS by MIT based on Row-Stationary Dataflow Architecture

Consists of a 12×14 PE array.

Convolutions of different sizes are mapped onto these processing elements using techniques like folding and replication.
HARDWARE SOFTWARE CO-DESIGN
• Several aspects of CNN design can make it easier for the hardware to process the data. This falls under the wider concept of model compression of deep neural networks.
• Data quantization (a sketch follows below).
• Exploiting sparsity.
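As a concrete example of data quantization, below is a minimal sketch of symmetric linear quantization of weights to 8-bit integers. The scheme, bit-width and helper names are illustrative assumptions, not the method of any particular accelerator.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric linear quantization: one scale factor maps floats to int8
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)    # original FP32 weights
q, scale = quantize_int8(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, scale))))
```

The hardware benefit is that the multiply-accumulate units, the register files and the memory traffic can all work on 8-bit integers instead of 32-bit floats.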
Tensor Processing Unit: an ASIC by Google to accelerate the inference of deep neural networks
SYSTOLIC ARRAY TPU ARCHITECTURE

IMPORTANT COMPUTATIONAL RESOURCES

1. Matrix multiply unit: 65,536 multiply-and-add units for matrix multiplications.
2. Unified buffer: 24 MB of SRAM that acts as registers.
3. Activation unit: hardwired activation functions.
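The matrix multiply unit is a weight-stationary systolic array: the weights are pre-loaded into the grid, activations flow in from one edge, and partial sums flow down, so every cell performs one multiply-accumulate per cycle without re-reading the unified buffer. A minimal cycle-by-cycle simulation of a tiny systolic matrix multiply (the scheduling shown is an illustrative simplification, not the TPU's exact pipeline):

```python
import numpy as np

def systolic_matmul(A, B):
    # Weight-stationary systolic multiply of A (M x K) by B (K x N):
    # B is pre-loaded into a K x N grid of PEs, rows of A enter from the
    # left edge with a one-cycle skew, and partial sums flow down and leave
    # the bottom row of the array.
    M, K = A.shape
    _, N = B.shape
    a = np.zeros((K, N))        # activation held by each PE this cycle
    p = np.zeros((K, N))        # partial sum arriving at each PE this cycle
    out = np.zeros((M, N))
    for t in range(M + N + K - 2):               # cycles until the array drains
        for k in range(K):                       # skewed injection from the left edge
            m = t - k
            a[k, 0] = A[m, k] if 0 <= m < M else 0.0
        mac = p + a * B                          # every PE does one multiply-accumulate
        for n in range(N):                       # finished sums exit the bottom row
            m = t - n - (K - 1)
            if 0 <= m < M:
                out[m, n] = mac[K - 1, n]
        a = np.roll(a, 1, axis=1)                # activations move one PE to the right
        p = np.vstack([np.zeros((1, N)), mac[:-1, :]])  # partial sums move one PE down
    return out

A = np.random.randn(3, 4)
B = np.random.randn(4, 2)
print(np.allclose(systolic_matmul(A, B), A @ B))   # True
```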
SOME FPGA IMPLEMENTATIONS IN BRIEF
• FPGAs have the advantage over ASICs that their time-to-market can be made considerably shorter. This is very important in the extremely fast-changing world of machine learning.
EXAMPLES:
 FPGA accelerator by Baidu:

• Xilinx K7 480t
• Floating-point matrix multiply
• Floating-point activation functions
• Customized FP MUL and ADD
• Buffers built from 512×512 BRAM
• One ALU per 32×32 tile
ESE: EFFICIENT SPEECH RECOGNITION ENGINE WITH SPARSE LSTM ON FPGA

• More application-specific, targeting LSTM networks.
• Focuses on quantization and very low bit-widths using table lookups, which are very efficient on FPGAs.
• Floating-point datatypes.
• Pipelined architecture.
• Model pruning (a sparse matrix-vector sketch follows below).
• Load-balancing scheduler to manage the complex data-dependency requirements of RNNs and LSTM networks.
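Pruning leaves most weights at zero, so only the remaining non-zeros need to be stored and multiplied. A minimal sketch of a sparse matrix-vector product using CSR storage, which illustrates the general idea behind exploiting a pruned model (not ESE's actual encoding, quantization or load-balancing scheduler):

```python
import numpy as np

def to_csr(W):
    # Store only the non-zero weights of a pruned matrix
    values, col_idx, row_ptr = [], [], [0]
    for row in W:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def sparse_matvec(values, col_idx, row_ptr, x):
    # Multiply-accumulate only over the stored (non-zero) weights
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

W = np.random.randn(4, 6)
W[np.abs(W) < 1.0] = 0.0          # crude magnitude "pruning" for the example
x = np.random.randn(6)
vals, cols, ptr = to_csr(W)
print(np.allclose(sparse_matvec(vals, cols, ptr, x), W @ x))   # True
```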
8-BIT DOT PRODUCT OPTIMIZATION FROM XILINX

In the work on 8-bit dot-product acceleration, Xilinx demonstrates a method of using cascaded DSPs to implement both the multiplication and addition parts of dot products using dedicated FPGA hardware.

The main FPGA features demonstrated here are the use of the DSP as an 8-bit SIMD multiply-accumulate unit and the use of the direct DSP-to-DSP cascade interconnect.
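The arithmetic idea behind the SIMD use of the DSP is that one wide multiplier can produce two 8-bit products at once when they share an operand: ((a << 18) + b) * c = (a*c << 18) + b*c, and the two products can be sliced back out of the wide result as long as they do not overlap. A minimal sketch with unsigned 8-bit values (the signed case needs correction terms that are omitted here; this illustrates the packing idea, not Xilinx's exact DSP configuration):

```python
def packed_dual_multiply(a, b, c):
    # Compute a*c and b*c with a single wide multiplication, the way one
    # DSP slice can be shared by two 8-bit multiplies with a common operand.
    # a, b, c are assumed to be unsigned 8-bit values, so b*c < 2**16 and the
    # two partial products never overlap across the 18-bit boundary.
    packed = (a << 18) + b             # a in the upper bits, b in the lower bits
    product = packed * c               # one wide multiplication
    upper = product >> 18              # = a * c
    lower = product & ((1 << 18) - 1)  # = b * c
    return upper, lower

# Example: two multiplications sharing the operand c
a, b, c = 200, 37, 151
assert packed_dual_multiply(a, b, c) == (a * c, b * c)
print(packed_dual_multiply(a, b, c))
```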
Types of FPGA accelerator architectures

Streaming architecture: one distinct hardware block for each layer of the CNN, where each block is optimised separately to exploit the parallelism of that layer. All of these blocks are then chained together to form a pipeline.
Examples: FPGACONVNET, DeepBurning, HADDOC2 (Caffe to VHDL)
Types of FPGA accelerator architectures
Single computation engine: consists of a single computation engine, in the form of a systolic array or a matrix-multiply unit, that executes the CNN layers sequentially. The control of the hardware and the scheduling of operations is done by the software. There are fixed architectural templates which are scaled depending on the CNN. The processor responds to certain customised microinstructions.

EXAMPLES:
Project Snowflake – converts Caffe models to custom instructions.
NeuFlow – has a hardware architecture and a compiler called LuaFlow.
DNNWeaver, Angel-Eye, Caffeine
PROJECT SNOWFLAKE
Xilinx Zynq XC7Z045 (only 1 compute cluster implemented)

1. Custom instruction set (13 instructions).
2. A complete compiler that translates high-level models into these custom instructions.

Compiler tasks:
-> Model parsing
   - Thnets is a library that converts Torch7 models into C data structures (in 5 steps).
-> Instruction generation
-> Deploying the instructions in hardware

SNOWFLAKE MICROARCHITECTURE:
-> Control core
   a. Instruction fetch (instruction cache and PC)
   b. Instruction decode (the decode stage of a traditional RISC pipeline)
   c. Instruction dispatch (reads source operands from a register file)
   d. ALU (a typical one, but it does not do the processing; the compute core does the processing)
   e. Register file
-> Compute core
   Systolic-array-based implementation.
• Xilinx SDAccel for CPU/GPU/FPGA integration: an OpenCL tool to map neural networks directly to hardware.

• Multi-Layer Perceptron completely in hardware (https://www.youtube.com/watch?v=FmTzJv18VGU)

• Deep Neural Network Hardware Accelerator (https://projects.digilentinc.com/SmarTech/deep-neural-network-hardware-accelerator-bda5e8?)
NEURON MACHINE ARCHITECTURE
ACCELERATOR – TIMING SUMMARY

ACCELERATOR – UTILIZATION REPORT


REFERENCES:

• Efficient Processing of Deep Neural Networks: A Tutorial and Survey.


• Hardware Acceleration for Machine Learning.

• Exploration and Tradeoffs of different Kernels in FPGA Deep Learning Applications.


• Accelerating CNN inference on FPGAs: A Survey.
• [DL] A Survey of FPGA Based Neural Network Accelerator.

• 8-Bit Dot-Product Acceleration.


Thank You!
