CONTENTS:
Intro to Machine Learning and Deep Learning.
Intro to neural networks.
Difference between training and inference.
Parallel Paradigms (Temporal and Spatial).
Types of data reuse and Dataflow architectures.
Hardware-software co-design.
MIT Eyeriss.
Google TPU.
Some FPGA Implementations in brief.
INTRODUCTION TO MACHINE LEARNING
MORE ON CONVOLUTIONAL NEURAL NETWORKS
A NEURON
INFERENCE:
ROW STATIONARY: This is a hybrid of WS and OS in which the processing of a 1-D row convolution is
assigned to each PE.
A complete row of filter weights is placed in the RF and the input activations are streamed to the PE.
Input activations that overlap between consecutive convolution windows can also be kept in the RF.
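A minimal software sketch of this dataflow (illustrative Python, not Eyeriss's actual implementation) shows one PE holding a filter row stationary in its RF while activations stream past:

```python
# Sketch of one row-stationary PE: a full row of filter weights stays
# resident in the PE's register file (RF) while input activations are
# streamed in; overlapping activations are reused from a sliding window.

def row_stationary_pe(filter_row, input_row):
    """1-D convolution of one filter row over one input row."""
    rf_weights = list(filter_row)            # weights pinned in the RF
    rf_acts = []                             # overlapping activations kept in the RF
    psums = []
    for act in input_row:                    # activations stream in, one per step
        rf_acts.append(act)
        if len(rf_acts) > len(rf_weights):   # keep only the overlapping window
            rf_acts.pop(0)
        if len(rf_acts) == len(rf_weights):  # window full: emit one partial sum
            psums.append(sum(w * a for w, a in zip(rf_weights, rf_acts)))
    return psums

print(row_stationary_pe([1, 2, 3], [1, 0, 2, 1]))  # -> [7, 7]
```

In the full 2-D case, Eyeriss combines the partial sums from a column of such PEs to complete each output row.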
EYERISS by MIT based on Row-Stationary Dataflow Architecture
• Xilinx K7 480t
• Floating-point matrix multiply
• Floating-point activation functions
• Customized FP MUL and ADD
• Buffers made of 512x512 BRAM
• Each ALU handles a 32x32 tile
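The tiling above can be sketched in software. The 32x32 tile size is taken from the slide; the function name and loop order are illustrative assumptions, not the cited design's HDL:

```python
# Illustrative sketch: a matrix multiply partitioned into 32x32 tiles,
# one tile assigned to each ALU, with operand blocks staged in
# BRAM-like buffers (modelled here simply as the sliced sub-blocks).

T = 32  # tile dimension from the slide (each ALU handles a 32x32 tile)

def tiled_matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, m, T):
            for k0 in range(0, k, T):        # one pair of tiles per step
                for i in range(i0, min(i0 + T, n)):
                    for j in range(j0, min(j0 + T, m)):
                        acc = C[i][j]
                        for kk in range(k0, min(k0 + T, k)):
                            acc += A[i][kk] * B[kk][j]   # FP MUL and ADD
                        C[i][j] = acc
    return C
```

Tiling keeps each ALU's working set small enough to fit in on-chip BRAM, which is the point of the buffer sizing above.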
ESE: EFFICIENT SPEECH RECOGNITION ENGINE WITH SPARSE LSTM ON FPGA
The main FPGA features demonstrated here are the use of the DSP slice as an 8-bit SIMD
multiply-accumulate unit, and the use of the direct DSP-to-DSP cascaded interconnect.
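The SIMD-multiply idea can be illustrated in software: two narrow operands are packed into one wide word so a single hardware multiply yields two products. The 16-bit lane width below is an assumption for the unsigned case; a real DSP mapping must also handle signed operands and accumulation guard bits:

```python
# Sketch of an 8-bit SIMD multiply on one wide multiplier: two unsigned
# 8-bit activations share a single multiplication by a common weight.

LANE = 16  # an 8x8 product fits in 16 bits, so the two lanes never overlap

def simd_mac2(a_hi, a_lo, w):
    """Two unsigned 8-bit multiplies in one wide multiplication,
    roughly as a single DSP slice might perform them."""
    packed = (a_hi << LANE) | a_lo          # pack both activations into one word
    product = packed * w                    # the single wide multiply
    lo = product & ((1 << LANE) - 1)        # recover a_lo * w
    hi = product >> LANE                    # recover a_hi * w
    return hi, lo

print(simd_mac2(200, 100, 50))  # -> (10000, 5000)
```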
Types of FPGA accelerator architectures
Streaming Architecture - One distinct hardware block for each layer of the CNN, where each
block is optimised separately to exploit the parallelism of that layer. All these blocks are
then chained to form a pipeline.
Examples: FPGACONVNET, DeepBurning, HADDOC2 (Caffe to VHDL)
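A rough software analogue of the streaming idea, with one block per layer chained via Python generators. The layer functions are stand-ins, not any of the listed tools' real APIs:

```python
# Streaming-architecture sketch: one distinct block per CNN layer,
# chained back-to-back so data flows through the whole pipeline.

def conv_block(stream):
    for x in stream:
        yield x * 2        # stand-in for a per-layer-optimised conv engine

def relu_block(stream):
    for x in stream:
        yield max(0, x)    # stand-in activation block

def pipeline(stream, blocks):
    for block in blocks:   # chain the per-layer blocks into one pipeline
        stream = block(stream)
    return stream

print(list(pipeline(iter([-1, 0, 3]), [conv_block, relu_block])))  # -> [0, 0, 6]
```

In hardware, each block would be a separately sized circuit and all blocks run concurrently on different data, which is what makes the streaming style efficient for a fixed network.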
Single computation engine: consists of a single computation engine, in the form of a systolic array or a
matrix-multiply unit, that executes the CNN layers sequentially. The control of the hardware and the scheduling
of operations is done by software. There are fixed architectural templates which are scaled depending
on the CNN. The processor responds to certain customised microinstructions.
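The software-scheduled, single-engine model can be sketched as one shared compute function that every layer is issued to in turn (names and structure are illustrative):

```python
# Single-computation-engine sketch: one shared matrix-multiply engine,
# with software scheduling the CNN layers onto it one after another.

def engine_matmul(A, B):
    """The one shared compute engine: a plain matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def run_network(x, layer_weights):
    # Software control: issue each layer to the same engine sequentially.
    for W in layer_weights:
        x = engine_matmul(x, W)
    return x

print(run_network([[1, 2]], [[[1, 0], [0, 1]], [[2], [3]]]))  # -> [[8]]
```

Unlike the streaming style, the same silicon is reused for every layer, so the design generalises across CNNs at the cost of layer-level pipelining.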
SNOWFLAKE MICROARCHITECTURE:
-> control core
a. instruction fetch (instruction cache and PC)
b. instruction decode (decode stage of a traditional RISC pipeline)
c. instruction dispatch (reads source operands from a register file)
d. ALU (a typical scalar ALU; it does no tensor processing, which is left to the compute core)
e. register file
-> compute core
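The control-core stages listed above can be sketched as a toy interpreter, with "compute" instructions handed off to the compute core instead of the control ALU. The ISA here is invented purely for illustration:

```python
# Toy sketch of the Snowflake-style split: a simple control core does
# fetch / decode / dispatch against a register file, while heavyweight
# operations are dispatched to a separate compute core.

def run_control_core(program, compute_core):
    pc = 0                       # program counter (instruction fetch)
    regs = [0] * 8               # register file
    while pc < len(program):
        instr = program[pc]      # fetch from the instruction stream
        op, *args = instr        # decode
        if op == "addi":         # scalar op: handled by the control ALU
            rd, rs, imm = args
            regs[rd] = regs[rs] + imm
        elif op == "compute":    # dispatch: operands read from the register
            rd, rs1, rs2 = args  # file, processing done by the compute core
            regs[rd] = compute_core(regs[rs1], regs[rs2])
        pc += 1
    return regs

regs = run_control_core(
    [("addi", 1, 0, 6), ("addi", 2, 0, 7), ("compute", 3, 1, 2)],
    compute_core=lambda a, b: a * b)   # stand-in for the MAC array
print(regs[3])  # -> 42
```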
Systolic Array Based Implementation.
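A cycle-level sketch of a weight-stationary systolic array (the TPU-style organisation): weights are preloaded into the PE grid, activations are fed in skewed from the left, and partial sums flow downward. This is a generic software model of the technique, not any specific vendor's design:

```python
# Weight-stationary systolic array computing C = A x B.
# PE (r, c) holds weight B[r][c]; activations move left-to-right,
# partial sums move top-to-bottom, one PE per cycle.

def systolic_matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    W = B                                   # weights stay stationary in the PEs
    C = [[0] * m for _ in range(n)]
    a_reg = [[0] * m for _ in range(k)]     # activation latch in each PE
    p_reg = [[0] * m for _ in range(k)]     # partial-sum latch in each PE
    for t in range(n + k + m):              # run until the array drains
        # update back-to-front so data advances exactly one PE per cycle
        for r in reversed(range(k)):
            for c in reversed(range(m)):
                a_in = (a_reg[r][c - 1] if c > 0
                        else (A[t - r][r] if 0 <= t - r < n else 0))  # skewed feed
                p_in = p_reg[r - 1][c] if r > 0 else 0
                p_reg[r][c] = p_in + a_in * W[r][c]   # the MAC in each PE
                a_reg[r][c] = a_in
        for c in range(m):                  # results drip out of the bottom row
            i = t - (k - 1) - c             # which output row finished this cycle
            if 0 <= i < n:
                C[i][c] = p_reg[k - 1][c]
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # -> [[19, 22], [43, 50]]
```

The skewed input feed is what lets every PE do one multiply-accumulate per cycle with only nearest-neighbour wiring, which is the property these FPGA implementations exploit.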
• Xilinx SDAccel for CPU/GPU/FPGA integration
• An OpenCL-based tool to map neural networks directly to hardware