TABLE OF CONTENTS

Undertaking
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1
1.1 Overview
1.1.1 Convolutional Neural Network
1.1.2 Neuron
1.1.3 Bias
1.1.4 Layers
1.1.5 Fully Connected and Convolutional Layers
1.1.6 Weights
1.1.7 Activation Function
1.1.8 Hyper-Parameters
1.1.8.1 Kernel Size
1.1.8.2 Stride
1.1.8.3 Padding
1.1.9 Pooling Layer
1.1.10 SoftMax Layer
1.2 Objectives
1.3 Challenges in Implementation
1.4 Organization of Report
Chapter 2
2.1 Introduction
2.2 Literature Survey
Chapter 3
3.1 Proposed Methodology
3.2 Network Selection
3.3 Dataset
3.4 Software Modeling
3.4.1 MATLAB
a) Transfer Learning
b) Fixed Point Calculations
c) MATLAB Fixed Point Designer Tool
3.4.2 Vivado based Testing
3.4.2.1 Convolution Block
3.4.2.2 Max Pooling Block
3.4.2.3 Bias Addition and ReLU Activation Function
Chapter 4
4.1 Introduction
4.2 MATLAB based Results
4.3 Vivado Based Results
Conclusion
References
LIST OF FIGURES

Figure 1.1: Neuron
Figure 1.2: Deep Neural Network
Figure 1.3: CNN
Figure 1.4: Operation of Single Neuron
Figure 1.5: Convolution Layer
Figure 1.6: Convolution Window
Figure 1.7: Working of a Single Neuron
Figure 1.8: Comparison of ReLU and Sigmoid
Figure 1.9: Example of Stride = 1
Figure 1.10: Padding
Figure 1.11: Max Pooling
Figure 1.12: Softmax Layer
Figure 3.1: Block Diagram
Figure 3.2: VGG-16 Flattened Model
Figure 3.3: VGG-16 Model
Figure 3.4: Network Architecture for Digit Recognition
Figure 3.5: Example of Dataset
Figure 3.6: MATLAB Modelling
Figure 3.7: Fixed Point Representation
Figure 3.8: Fixed Point Model
Figure 3.9: Neural Network Implementation Block Diagram on FPGA
Figure 3.10: Schematic of Convolution
Figure 3.11: Schematic of Max Pooling Operation
Figure 4.1: Software Modeling Results
Figure 4.2: Confusion Matrix
Figure 4.3: Cross Validity Graph
Figure 4.4: Convolution Operation
Figure 4.5: Max Pooling
LIST OF TABLES

Table 1: Resource Utilization Table


CHAPTER 1

Introduction

1.1 Overview

Inspired by the way the brain processes information, scientists and engineers have been
researching neural networks (NNs) since the early 1940s. NNs are an information
processing paradigm modeled on biological nervous systems such as the brain. The key
element of this paradigm is the novel structure of the information processing system: it
is composed of many highly interconnected processing elements, neurons, working in
parallel to solve a specific problem. The biological neuron is made up of four main
parts: dendrites, synapses, an axon, and the cell body. A neuron is essentially a system
that accepts electrical currents arriving on its dendrites. It sums these and, if they
exceed a certain threshold, it issues a new pulse which propagates along the axon. The
information is transmitted from an axon to a dendrite via a synapse, by means of chemical
neurotransmitters crossing the synaptic membrane.

Figure 1.1: Neuron[1]

In recent studies, neural network-based classifiers have been widely used in a number of
classification problems related to speech and pattern recognition. To implement neural
networks for real-time applications, huge computational resources are required. The
viability of neural networks in commercial applications decreases due to the limited size
of neural network that can be realized on an FPGA chip. Moreover, embedding a neural
network in an FPGA becomes a difficult task as the number of constraints of the NN
classifier increases.

Deep neural networks are used in many applications, including image, audio, and video
processing and analysis, in many domains where they have been shown to outperform
conventional machine learning methods and human experts. There is a compelling need
for deep neural networks in mobile devices and embedded systems.

Figure 1.2: Deep Neural Network [2]

However, deep neural networks have high computational complexity, and therefore most
modern CPUs are not able to meet the speed requirements of real-time embedded
applications, such as video processing in autonomous cars or biomedical devices, which
demand high accuracy and real-time object recognition. GPUs can be used to speed up the
computations, but their energy efficiency is low compared to ASICs and FPGAs. FPGAs
offer speed comparable to dedicated and fixed hardware systems for parallel algorithm
acceleration while, as with a software implementation, retaining a high degree of
flexibility for device reconfiguration as the application demands.
This project aims to build a hardware architecture for a deep neural network on an FPGA
which is scalable and reconfigurable, to quickly adapt to the different configurations of
neural network architectures. Out of the different DNNs, such as recurrent neural
networks and feed-forward neural networks, we have decided to implement a deep
convolutional neural network due to its vast applications in computer vision, image
classification, and image recognition. Our application is handwritten digit recognition.
In this project, we present a hardware implementation of a multilayer convolutional
neural network (CNN) using reconfigurable field-programmable gate array (FPGA) chips.

1.1.1 Convolutional Neural Network

A convolutional neural network (CNN) contains one or more convolutional layers, pooling
layers, or fully connected layers, and uses a variation of multilayer perceptrons.
Multilayer perceptrons usually mean fully connected networks, that is, each neuron in one
layer is connected to all neurons in the next layer. Convolutional layers apply a
convolution operation to the input, passing the result to the next layer. This operation
allows the network to be deeper with far fewer parameters.

CNNs are regularized versions of multilayer perceptrons. The "fully-connectedness" of
these networks makes them prone to overfitting the data. Typical forms of regularization
include adding some form of magnitude measurement of the weights to the loss function.
However, CNNs take a different approach towards regularization: they take advantage of
the hierarchical pattern in data and assemble more complex patterns using smaller and
simpler patterns. Therefore, on the scale of connectedness and complexity, CNNs are at
the lower extreme.

A typical CNN is shown below:

Figure 1.3: CNN [3]

There are many types of CNN architectures. We have selected the VGG-16 model for
implementing our application.

1.1.2 Neuron

Much of the terminology used for neural networks is borrowed from the field of
neuroscience, which is no surprise at all, as the idea behind neural networks is to try to
solve complex problems by mimicking the structure of the human brain. The neuron is the
basic unit of the neural network. It receives a defined number of inputs plus a bias
value. Each arriving signal is multiplied by a weight value; if the neuron has n inputs,
it has n weight values that are adjustable during training. Obviously, even the most
straightforward brain, such as that of the modest fruit fly, still presents a complexity
far from being replicated by any artificial neural network created by humans these days.
z = x1·w1 + x2·w2 + … + xn·wn + b …………….(1)

ŷ = a_out = sigmoid(z) …………….(2)

sigmoid(z) = 1 / (1 + e^(−z)) …………….(3)
Figure 1.4: Operation of Single Neuron[4]

Living brain cells exhibit complex behaviors which are not yet completely understood, so
the concept of a neuron in the context of artificial neural networks is only an
approximation of the functionality of a real biological neuron. Thus, in this context, a
neuron is simply a computational node with at least one numerical input and a single
output. This output is transmitted onward, thereby permitting neurons to be
interconnected with other neurons into larger structures known as layers. Every neuron
performs a linear combination of its weighted inputs; that is, it sums all the inputs,
each previously multiplied by its corresponding constant.
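As an illustration of equations (1)-(3), the following short Python sketch computes the
output of a single neuron; the input, weight, and bias values are arbitrary examples, not
values from our network.

import math

def sigmoid(z):
    # Equation (3): squashes z into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # Equation (1): weighted sum of the inputs plus the bias
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Equation (2): apply the activation to get the neuron output
    return sigmoid(z)

# Example: a neuron with three inputs
y_hat = neuron([0.5, -1.0, 2.0], [0.4, 0.3, 0.1], bias=0.2)
print(y_hat)  # approximately 0.574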

1.1.3 Bias

The bias is an additional input to a neuron which always has a value of one (1) and has
its own connection weight. This ensures that even if all the other inputs to the network
are absent (all 0), the neuron can still be activated. A bias node is added to neural
networks to facilitate the learning of such patterns. The bias node acts like an input
node that always produces a constant value of 1 or another constant number; due to this
specific feature, it is not connected to the previous layer. Here the constant 1 is
called the bias activation. Bias neurons permit you to shift the output of the activation
function. This will be presented later in the context of activation functions.

1.1.4 Layers

In a neural network, neurons are assembled into greater structures called layers. Think
of a layer as a container of neurons: a layer groups a number of neurons together and is
used for holding a collection of neurons. Without layers, it becomes difficult to extract
specific features according to our application. By including layers which contain neurons
performing nonlinear functions, the network's problem-solving capacity is greatly
improved. With the correct sequence, neurons assembled in the same layer extract a
specific feature from the given data set. For example, in a neural network performing
image classification, each layer extracts specific features from the previous layer, such
as edges, shades, or colors. The degree to which it correctly classifies a specific
feature is determined during learning.

A neural network is built from three kinds of layers:

 Input layer — receives the initial data for the neural network.

 Hidden layers — intermediate layers between the input and output layers, where all
the computation is done.

 Output layer — produces the result for the given inputs.

1.1.5 Fully Connected and Convolutional Layers


Fully connected layers in a neural network are those layers where all the inputs from one
layer are connected to every activation unit of the next layer. In most popular machine
learning models, the last few layers are fully connected layers, which compile the data
extracted by previous layers to form the final output. The fully connected layer is the
second most time-consuming layer, after the convolution layer. It takes the output of the
previous layers, "flattens" it, and turns it into a single vector that can be an input
for the next stage.

Convolutional layers are the major building blocks used in convolutional neural
networks. A convolution is the simple application of a filter to an input that results in an
activation. Repeated application of the same filter to an input results in a map of
activations called a feature map, indicating the locations and strength of a detected
feature in an input, such as an image.

The innovation of convolutional neural networks is the ability to automatically learn a
large number of filters in parallel specific to a training dataset under the constraints of a
specific predictive modeling problem, such as image classification. The result is highly
specific features that can be detected anywhere on input images.

Figure 1.5: Convolution Layer[5]

As every connection has its own associated weight, the quantity of weights required by a
multilayer perceptron would grow enormously. A way to manage the network is by utilizing
convolutional layers, where every neuron is connected to a predetermined number of
neurons in the previous layer. In principle, a convolutional neural network ought to have
learning capacities similar to a fully connected one. The difference is that fully
connected layers perform a global operation, as they can capture any sort of dependency
present in the data, whereas convolutional layers perform a local operation, as every
neuron processes only a little piece of the data in the previous layer; that is the
reason they perform so well in image analysis applications. That little part of the data
being analyzed is otherwise known as the local receptive field or convolution window.

Figure 1.6: Convolution Window [6]

The arrangement of weights used to compute the weighted sum is known as the kernel or
filter. One can think of it as applying a filter to an image by sliding it across the
pixels. Each pixel of the output image is a linear combination of the values contained in
its corresponding local receptive field, which is formed by the current input pixel and
its neighboring pixels. Ordinarily, in convolutional neural networks, a fully connected
layer is placed at the last stage of the neural network as a classifier, to separate the
data into the different categories. Since every one of its neurons has connections to all
the elements in the previous layer, it can extract any sort of significant dependency
from the input data. The fully connected layer is the source of the high-level reasoning
in the neural network.

1.1.6 Weights
Weights (parameters) — a weight represents the strength of the connection between units.
If the weight from node 1 to node 2 has greater magnitude, it means that neuron 1 has
greater influence over neuron 2. A weight scales the importance of an input value:
weights near zero mean changing this input will not change the output, and negative
weights mean increasing this input will decrease the output. A weight decides how much
influence the input will have on the output.

Figure 1.7: Working of a Single Neuron

As an input enters the node, it gets multiplied by a weight value, and the resulting
output is either observed or passed to the next layer in the neural network. Often the
weights of a neural network are contained within the hidden layers of the network. The
neural network contains a series of hidden layers which apply transformations to the
data; it is inside the nodes of the hidden layers that the weights are applied. For
example, a single node may take the input data and multiply it by an assigned weight
value, then add a bias before passing the data to the following layer. The last layer of
the neural network is otherwise known as the output layer. The output layer often tunes
the contributions from the hidden layers to produce the desired numbers in a
predetermined range.

1.1.7 Activation Function


In a neural network, the activation function is responsible for transforming the summed
weighted input of the node into the activation of the node, or the output for that input.
Activation functions are used to introduce nonlinearity into neural networks; for
example, the sigmoid activation function squashes values into the range from 0 to 1.
There are several activation functions used in deep learning. Activation (or transfer)
functions decide the limits of a neuron's output values, and neural networks can use many
different ones. Choosing the activation function is a vital factor because it affects how
the input data is transformed.

We used ReLU as the activation function in our network. The rectified linear activation
function, or ReLU for short, is a piecewise linear function that will output the input
directly if it is positive; otherwise, it will output zero. It has become the default
activation function for many types of neural networks because a model that uses it is
easier to train and often achieves better performance. For a given node, the inputs are
multiplied by the weights in the node and added together. This value is referred to as
the summed activation of the node. The summed activation is then transformed via an
activation function, which defines the output or "activation" of the node.

The simplest activation function is referred to as the linear activation, where no
transformation is applied at all. A network containing only linear activation functions
is easy to train but cannot learn complex mapping functions. Linear activation functions
are still used in the output layer of networks that predict a quantity (for example,
regression problems).

Nonlinear activation is favored as it permits the nodes to learn more intricate
structures in the data. Traditionally, two widely used nonlinear activation functions
have been the sigmoid and hyperbolic tangent activation functions.
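For illustration, a minimal Python sketch of the sigmoid and ReLU functions compared in
Figure 1.8 (the sample inputs are arbitrary):

import numpy as np

def sigmoid(z):
    # Squashes any input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Outputs the input directly if positive, otherwise zero
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))  # [0.119 0.378 0.5 0.622 0.881]
print(relu(z))     # [0.  0.  0.  0.5 2. ]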

Figure 1.8: Comparison of ReLU and Sigmoid [7]

1.1.8 Hyper-Parameters
In machine learning, a hyper-parameter is a parameter whose value is used to control the
learning process. By contrast, the values of other parameters (typically node weights)
are derived via training. Hyper-parameters can be classified as model hyper-parameters,
which cannot be inferred while fitting the machine to the training set because they refer
to the model selection task, or algorithm hyper-parameters, which in principle have no
influence on the performance of the model but affect the speed and quality of the
learning process. An example of a model hyper-parameter is the topology and size of a
neural network. The fundamental hyper-parameters of a CNN are the size of the receptive
field, the kernel (filter) size, the padding, the stride length, and the dimensions of
the activation volumes. Some of these hyper-parameters have already been mentioned in the
previous sections, and the rest are described below. There are other hyper-parameters,
not listed here, that govern the behavior of the training algorithm and how it learns the
parameters from the data.

1.1.8.1 Kernel Size

A kernel is a small matrix used to apply effects to an image; in this case it generally
refers to convolution. The sizes of the kernels used for convolution vary; typical sizes
are 1x1, 3x3, 5x5, and 7x7. The rule for choosing a specific kernel size relies upon the
relative size of the feature one needs to capture: the smaller the feature to be
extracted, the smaller the filter. A common choice is to keep the kernel size at 3x3 or
5x5. The first convolutional layer is often kept larger.

1.1.8.2 Stride

Stride means "the logical memory address distance between two successive pixels of the
image on a given axis"; the term stride is used to refer to the length of this
displacement. It is also possible to use a shift bigger than one pixel, i.e., non-unity
strides, to reduce the dimensions of the activation volumes and the computational effort.
For instance, a stride of 2 will produce an output with half the dimensions of the
original.

Figure 1.9: Example of Stride = 1 [8]

1.1.8.3 Padding

Padding is a term relevant to convolutional neural networks, as it refers to the number
of pixels added to an image when it is being processed by the kernel of a CNN. For
example, if the padding in a CNN is set to zero, then every pixel value that is added
will be of value zero. If, however, the zero padding is set to one, there will be a
one-pixel border added to the image with a pixel value of zero.

This effect can be seen more plainly by considering the following example, where a 5x5
matrix is convolved with a 3x3 kernel. Without padding, the result is a 3x3 matrix, and
attempting to convolve that matrix with another 3x3 kernel will produce a 1x1 matrix.
However, if the original 5x5 matrix is padded with zeros all around the borders, the
result is another 5x5 matrix (the dimensions keep the same size), and by padding this
matrix again, one can chain as many 3x3 convolutions as desired.
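The size-preserving effect described above can be checked with a short NumPy sketch (the
matrix values are arbitrary):

import numpy as np

x = np.ones((5, 5))                      # a 5x5 input matrix
x_pad = np.pad(x, pad_width=1, mode='constant', constant_values=0)
print(x_pad.shape)                       # (7, 7)
# A "valid" 3x3 convolution shrinks each dimension by 2,
# so the padded 7x7 input yields a 5x5 output: size preserved.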

Figure 1.10: Padding[9]

1.1.9 Pooling Layer

A pooling layer is another building block of a CNN. Its function is to progressively
reduce the spatial size of the representation, to reduce the number of parameters and the
amount of computation in the network. The pooling layer operates on each feature map
independently. Pooling layers provide an approach to down-sampling feature maps by
summarizing the presence of features in patches of the feature map. Two common pooling
methods are average pooling and max pooling, which summarize the average presence of a
feature and the most activated presence of a feature, respectively. Maximum pooling, or
max pooling, is a pooling operation that calculates the maximum, or largest, value in
each patch of each feature map.

Figure 1.11: Max Pooling[10]
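As a simple illustration of 2x2 max pooling, the following NumPy sketch halves each
dimension of a small feature map by keeping only the largest value in every 2x2 patch
(the values are arbitrary):

import numpy as np

def max_pool_2x2(fmap):
    # Split the map into non-overlapping 2x2 patches and take each maximum
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 7, 8],
                 [9, 2, 1, 0],
                 [3, 4, 5, 6]], dtype=float)
print(max_pool_2x2(fmap))
# [[6. 8.]
#  [9. 6.]]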

1.1.10 SoftMax Layer

The SoftMax layer is typically the final output layer in a neural network that performs
multi-class classification (for example, object recognition). The name comes from the
SoftMax function, which takes a number of score values as input and squashes them into
values in the range between 0 and 1; therefore, they represent a true probability
distribution. SoftMax assigns decimal probabilities to each class in a multi-class
problem. Those decimal probabilities must add up to 1.0. This additional constraint helps
training converge more quickly than it otherwise would. The SoftMax layer must have the
same number of nodes as the output layer.
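A minimal sketch of the SoftMax function described above (the score values are
arbitrary):

import numpy as np

def softmax(scores):
    # Subtracting the maximum improves numerical stability
    # without changing the result
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)        # [0.659 0.242 0.099]
print(p.sum())  # 1.0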

Figure 1.12: Softmax Layer[11]

1.2 Objectives

After studying a previous work, we develop the following objective to be achieved in this
project. These objectives lead to build hardware architecture of deep neural network on
FPGA. We have chosen a VGG-16 deep neural network architecture to implement it and
our application is handwritten digits recognition.

1. To reduce computational resources during CNN implementation on FPGA, in order to
reduce cost.

2. To put maximum effort into designing an efficient neuron for the proper functioning
of the neural network. Based on its structure, a neuron can be divided into numerous
computational blocks; the integration of these separately implemented blocks leads to the
complete design of the neuron.

3. To increase efficiency by using an appropriate network (VGG-16) according to our
resources and application.

1.3 Challenges in Implementation

It is highly difficult to implement neural networks within the limited constraints of
computational resources because, in general, the inputs of the neural network are
normalized between −1 and +1. For this reason, the network involves signed floating-point
computations. Moreover, the nonlinear excitation function and the pooling layers require
huge hardware resources.

The other main challenges in realization of the CNN using FPGA are

1. Parallel or Sequential implementation

2. Word length/Bit precision

It is highly important to decide between parallel computation and sequential
implementation. Parallel computation demands larger resources; therefore, to lessen the
computational cost, sequential implementation is preferred, with a compromise on the
speed of computation. In view of the above issues in the implementation of neural
networks on FPGAs, it is highly challenging to design and implement a CNN with reduced
computational resources.

1.4 Organization of Report

The rest of the thesis is organized as follows.

In the very next chapter, background and related work are discussed. Published journal
articles, conference papers, and theses are reviewed, and their important key points are
mentioned. In total, more than 15 papers have been studied, and only the findings of the
most important papers are discussed here.

Machine learning algorithms and their implementation on FPGA are then discussed in
Chapter 3. The first step in our project is to select the dataset for our application: we
selected handwritten digit recognition using MNIST, and the weights were acquired. The
second step of the project is to decide on the architecture of the deep neural network
for implementation according to our available resources. In the third step, the
classifier is trained and implemented on FPGA.

At last, the analysis of the results is discussed in detail. Features are plotted in
MATLAB and analyzed graphically. Then the classifier accuracy is evaluated and compared,
the conclusion of the overall work is given, and future work is discussed.

CHAPTER 2

Literature Review

2.1 Introduction

This project aims to build hardware architecture of deep neural network on FPGA which
is scalable and reconfigurable to quickly adapt to the different configurations of neural
network architectures. This will allow the use of neural networks in Real life embedded
applications. Our application is the handwritten digit recognition using neural network
implementation on FPGA. We have chosen a VGG-16 deep neural network architecture
for implementation. For this purpose, a comprehensive study is made to investigate the
previous work in this area. The later part of this chapter discussed some of the most
useful, efficient, and comprehensive research studies made to implement neural network
on digital hardware (FPGA).

2.2 Literature Survey

An FPGA accelerator for deep CNNs using the roofline model was proposed by Chen Zhang,
Peng Li, and Guangyu Sun of the University of California in 2015 [12]. For any solution
of a CNN design, they quantitatively analyzed its computing throughput and required
memory bandwidth using various optimization techniques. Then, with the help of the
roofline model, they identified the solution with the best performance and lowest FPGA
resource requirement. As a case study, they implemented a CNN accelerator on a VC707 FPGA
board and compared it to previous approaches. Their implementation achieved a peak
performance of 61.62 GFLOPS (giga floating-point operations per second) at a 100 MHz
working frequency, which outperformed previous approaches significantly.

An optimized FPGA-based accelerator design targeting ImageNet classification was proposed
by Yonghemi Zou and Jingfei Jiang of the National University of Defense Technology,
China, in 2015 [13], which outperformed all previous works. Despite its stunning
performance, the design did not explore the parameter space of fixed-point precision,
though using fixed-point precision is considerably promising, as the design had pointed
out. They also inferred that the Xilinx HLS tool used was highly productive for
implementing a deep convolutional neural network. They designed and implemented a 5-layer
accelerator for the MNIST digit database using the HLS tool in the Vivado 2014.4 system
suite and compared the performance of their FPGA platform with that of the target CNN. In
terms of the running time of processing one input feature map, their work was 16.42 times
faster than the MATLAB/CPU code.

A dynamic-precision data quantization method and a convolver design efficient for all
layer types in a CNN were proposed by Jiantao Qiu, Jie Wang, and Song Yao of Tsinghua
University [14] to improve bandwidth and resource utilization. Results show that only a
0.4% accuracy loss was introduced by their data quantization flow for the very deep
VGG-16 model when 8/4-bit quantization was used. The system on a Xilinx Zynq ZC706 board
achieved a frame rate of 4.45 fps with a top-5 accuracy of 86.66% using 16-bit
quantization, which outperformed previous approaches significantly.

S. Coric et al. [15] proposed a design of a general-purpose neuron using the
back-propagation algorithm. Different functions, including the sigmoid activation
function, were also implemented. This neuron was then used in the NN. The hardware
implementation was done on a Xilinx FPGA.

Yufeng Hao et al. [16] worked on the implementation of a general architecture of neural
networks on FPGA. The general architecture was implemented on the Xilinx ZU9CG System on
Chip (SoC) platform. It is highly efficient and adaptable to the application, and it has
higher programmability due to the presence of a dual-core ARM Cortex-A53. Many general
DNN architectures have been implemented, but none of them were subjected to a specific
application, so no accuracy figures were reported.

Jin Hee Kim et al. [17] worked on an FPGA-based inference accelerator generated from
multi-threaded C software. The software implementation uses a producer/consumer model
with parallel threads interconnected by FIFO queues. An HLS tool synthesizes the threads
into parallel FPGA hardware. The complete system includes different layers which
implement the convolution, pooling, and padding functions. The whole system was
implemented on a mid-sized Intel Arria 10 SoC FPGA. The architecture selected was VGG-16.

Ke Xu et al. [18] proposed an FPGA-based accelerator for VGG-16. The network was
implemented in Python and C. A dynamic fixed-point strategy was used for the range of
weights. For data transfer between the FPGA and the HPS (Hard Processor System), SDRAM is
used as a bridge.

Roman A. Solovyev et al. [19] proposed a design and implementation of an FPGA-based CNN
with fixed-point calculations that allows the exact required performance to be achieved
depending upon the application. The application of digit recognition was selected, with
an implementation requirement of a minimum speed of 30 FPS. The VGG family of
architectures was used for the implementation. The FPGA kit used was the compact
development board DE0-Nano, due to its large number of resources.

Shepin Zhai et al. [20] proposed an FPGA-based accelerator for CNNs. The LeNet-5
architecture of CNN was used for the implementation. A network recognition accuracy of
97% was achieved, which is the same as that of a Core i5 CPU and an NVIDIA GeForce GTX
960 GPU, without affecting precision; thus FPGA-based accelerators are replacing GPUs,
which are expensive and require high power.

Min Zhang et al. [21] proposed optimized compression for implementing CNNs on FPGA;
otherwise, many resources, including high-performance GPUs, are required. Compression
strategies such as reversed-pruning, peak-pruning, and quantization were proposed to
compress AlexNet by a large amount, from 243 MB to 8.7 MB. Its effectiveness was verified
by an accelerator implemented on a Xilinx ZCU104 evaluation board.

In article [22], the writers present a hardware implementation of a fully connected
multi-layer ANN using an FPGA. Each node is implemented using two Xilinx XC3042 FPGA
circuits. Offline training takes place on a personal computer. The authors successfully
tested the network performance and presented a successful hardware implementation of a
simple ANN. The application can be extended to the implementation of more complex
networks. Reconfiguration and adjustment are the core features of the hardware: for a new
application, the weights, biases, and scaling parameters in the CLBs must be redefined
without altering the basic design. The network can easily be enlarged by inserting more
nodes into the same design. This will significantly reduce the size and also increase
speed by removing the delay between the I/O contacts of two FPGAs.

CHAPTER 3

Proposed Methodology

3.1 Proposed Methodology

An integral part of our project is to select the architecture of the neural network
according to our application, which is handwritten digit recognition, and then to
implement it. For this we must select a network and a dataset. The proposed methodology
is shown in the block diagram (Figure 3.1). After the architecture selection, the basic
steps to check the authenticity of our network are listed below.

1. Selecting a pre-trained neural network for our application

2. Extracting weights and biases from our pretrained neural network

3. MATLAB and Python Based Testing & Software Modeling of Neural Network

4. Verilog based Hardware Modeling & Testing of Neural Network

The software model corresponds to MATLAB and Python, whereas the hardware model
corresponds to Vivado.

In the MATLAB-based software modeling of the neural network, the dataset is first
prepared for training purposes. We downloaded the MNIST dataset and a pre-trained
network, gathered the values of the weights and biases using Python, and developed a
model of our network in MATLAB. Then we reproduced the VGG-16 model using Python,
including the convolution layers, fully connected layers, pooling layers, and activation
functions (ReLU and SoftMax). As said before, Keras is used for verification: we compare
our results with those of the Keras model to verify correctness.
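A hedged sketch of this extraction-and-verification step is shown below; the model file
name and the saved-file naming scheme are illustrative assumptions, not our exact
scripts.

import numpy as np
from tensorflow import keras

# Load the trained network (file name is illustrative)
model = keras.models.load_model('vgg16_digits.h5')

# Dump each layer's weights and biases for reuse in MATLAB/Verilog
for layer in model.layers:
    params = layer.get_weights()          # [weights, biases] when present
    if len(params) == 2:
        w, b = params
        np.savetxt(layer.name + '_weights.txt', w.ravel())
        np.savetxt(layer.name + '_biases.txt', b.ravel())

# Verification: run one input through Keras and compare its output
# against our own re-implemented forward pass
x = np.random.rand(1, 28, 28, 1).astype('float32')
reference = model.predict(x)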

In the Verilog-based testing of the network, digital hardware is designed to test the
pre-trained network. This testing process requires the trained weights and biases for the
implementation of the convolutional neural network on FPGA, which we already obtained
using Python. The implementation of all the above-mentioned blocks is carried out in
Vivado, which is one of the best platforms for the simulation of digital hardware
designs.

Figure 3.1: Block Diagram

3.2 Network Selection

We have selected the VGG-16 network for our application of handwritten digit recognition.
VGG-16 is a popular convolutional neural network structure with sixteen (16) layers. VGG
is a convolutional neural network architecture proposed by Karen Simonyan and Andrew
Zisserman of the University of Oxford in 2014. VGG was a breakthrough in the world of
convolutional neural networks and achieved great results on the ImageNet classification
dataset.
The flattened architecture of the basic VGG-16 model is shown in Figure 3.2.

Figure 3.2: VGG-16 Flattened Model[23]

Our VGG-16 model consists of 16 weight layers: thirteen convolutional layers with a
filter size of 3x3, and three fully connected layers. The configuration of the fully
connected layers in VGG-16 is the same as in AlexNet. The stride and padding of all
convolutional layers are fixed to 1 pixel. All convolutional layers are divided into 5
groups, and each group is followed by a max-pooling layer. Max pooling reduces the size
of the image by one half: the starting dimensions of the image are 28x28, or 784 pixels;
after the first convolution layer the size of the image is retained, while after the
first max-pooling layer the size of the image is reduced by half. Remaining within the
constraints of the VGG-16 model, we modified our network a bit according to the needs of
our application and dataset. We also kept our hardware resources in mind and modified the
network in such a way that maximum efficiency could be achieved.

Our proposed VGG-16 model is shown below.

Figure 3.3: VGG-16 Model

As our ultimate goal is to implement this network on an FPGA, we decided to modify the
network in such a way that it would be easy for us to implement on hardware at later
stages. Our next step was to devise an architecture for our network. As the model itself
illustrates, we have an input layer, then a convolution layer, then a max-pooling layer,
and this combination of convolution layer and max-pooling layer is repeated again and
again throughout the network. We decided to make our architecture reusable, so we built a
convolution layer and a max-pooling layer with a ReLU activation function between them.

Figure 3.4: Network Architecture for Digit Recognition

Our input images were grayscale images with a size of 28x28; these were flattened into a
total of 784 pixels. In this way it is easier for us to understand the working of the
neural network and how, during convolution, each pixel is accessed and convolved. We
performed a 2x2 zero padding so that we could retain every single feature and detail in
our input images. Then comes the first convolution layer; it has its unique set of
weights w1……….w64, as there are a total of 64 kernels in the first convolution layer.
Then there is a 2x2 max-pooling layer that halves the size of the input feature map by
picking only the feature with the highest value in its 2x2 window.

3.3 Dataset

We used the popular MNIST dataset for handwritten digit recognition. The dataset consists
of 60,000 training images of the digits 0 to 9 and 10,000 images for testing. The dataset
was in raw form, and Python and MATLAB were used to separate it into its 10 classes and
save it in .tiff format. It is a subset of a larger set available from NIST. The original
black and white (bi-level) images from MNIST were size-normalized to fit in a 20x20 pixel
box while preserving their aspect ratio. The images were then centered in a 28x28 image
by computing the center of mass of the pixels and translating the image so as to position
this point at the center of the 28x28 field.
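For reference, MNIST can be loaded directly through Keras; a minimal sketch (assuming a
TensorFlow/Keras installation):

from tensorflow import keras

# 60,000 training and 10,000 test images, each a 28x28 grayscale digit
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)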

Figure 3.5: Example of Dataset

3.4 Software Modeling

Firstly, we modeled our project in MATLAB/Python; then we moved on to hardware modeling.
We began by downloading a pre-trained VGG-16 neural net and then obtained its weights and
biases using Python. Then, for the purpose of resource modeling and parameter
calculation, we decided to implement our neural network in MATLAB. This gave us the
option to test and model our network according to our hardware resources as we went
along. As the weights and biases for our network contained both fractional and integer
values, we used MATLAB's Fixed Point Designer tool and Python's fixed-point library to
convert our computations into fixed point, to counter these fractional values before they
could cause us trouble in our hardware implementation. Thus, we tried to take into
account every single problem that could cause us trouble in the hardware implementation
of our network, and tried to counter that trouble at the software stage by rigorous
testing, using all the resources that we had at hand.

3.4.1 MATLAB

a) Transfer Learning

For the purpose of testing and resource modeling, we decided to implement our neural
network in MATLAB first. For this purpose, we used the transfer learning application of
MATLAB; it gave us the option of testing our neural network by transferring the weights
and biases of our pre-trained network to MATLAB's own VGG-16 model. We made small changes
in MATLAB's VGG-16 model, as it takes 224x224 RGB images as input. After making the
necessary adjustments, we used our obtained weights and biases to test our network and
achieved an accuracy of 90%. The network had no issue of overfitting to particular
classes and displayed almost the same accuracy across all classes. For this testing we
used an NVIDIA GeForce GTX 1660 Super.

Figure 3.6: MATLAB Modelling

b) Fixed Point Calculations

As stated earlier, the weights and biases we obtained in the first stage of our project
had many fractional values; this presents a problem very common in hardware
implementations. Normally we use binary numbers for the representation of integers in
computers and other hardware devices. Fractional numbers represented in binary have two
parts: the bits that represent the integer part (the part before the radix point) and the
bits that represent the fractional part (the part after the radix point). The question
is: what if we had only a limited number of binary bits in which to store our fractional
binary number, as is common in many modern computer systems? How would we know how many
bits to use for the integer part and how many bits to use for the fractional part?

To solve this problem, it is common practice to use a fixed- or floating-point
representation for fractional numbers. For our purposes we have selected the fixed-point
representation, as the major advantage of a fixed-point representation is performance.
Since the value stored in memory is an integer, the CPU can take advantage of many of the
optimizations that modern computers have for performing integer arithmetic, without
having to rely on additional hardware or software logic.

In the fixed-point representation, a fractional number has two parts: the integer part,
which is stored as a signed integer in two's complement format, and the fractional part,
which is represented by a fixed number of bits to the right of the radix point.

Figure 3.7: Fixed Point Representation

As almost all of the weights and biases are fractional numbers represented in fixed-point
notation, almost all of our arithmetic and computation is now in fixed-point form. With
fixed-point calculations and the limited width of weights and intermediate results,
rounding errors inevitably arise and accumulate from layer to layer, and can lead to
"inaccurate" predictions, which can be a problem for us. To resolve this problem, we had
to model our whole neural network's computations according to the fixed-point number
system. For this we used the MATLAB Fixed Point Designer tool and the Simple Python Fixed
Point Module (SPFPM) to model each computation according to our fixed-point numbers; that
means obtaining the number of bits for the fractional and integer parts of every
computation at every step, and then deciding the total numbers of bits for the fractional
and integer parts that can accommodate all computations and result in accurate
predictions. Our approach was to use a dynamic-range method to model our weights and
biases: we first obtained the range of our data, which in our case were the values of the
weights and biases; then, using these ranges, we determined absolute minimum and maximum
values.
We then modelled these absolute values using different tools, such as:

 MATLAB Fixed Point Designer Tool

 Simple Python Fixed Point Module (SPFPM)

Once these ranges were evaluated, they were modelled using the MATLAB Fixed Point
Designer tool. The most critical resource was the embedded multipliers, which performed
the fixed-point multiplications. Even with the high utilization percentage, the
availability of this component was enough to parallelize the whole model for the
experimental number of training points.

c) MATLAB Fixed Point Designer Tool

As the most computationally expensive operation of the whole neural network is the
convolution operation, which also makes use of the values of the weights and biases, our
main focus was on the convolution operation, and we modelled the convolution process. The
main component of the convolution operation is the dot product (pointwise multiplication
and addition). For this, a Simulink model was drawn for the dot product, which was then
analyzed using the MATLAB Fixed Point Designer tool.

Figure 3.8: Fixed Point Model

With the dynamic-range approach, decimal values are modeled as signed integer numbers
using 32 bits: the most significant bit is the sign bit, and the remaining bits
constitute an integer and a fractional component. The numbers of bits used to represent
the fractional and integer parts of the fixed-point numbers are 16 and 15, respectively.
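A minimal Python sketch of this 1-sign / 15-integer / 16-fraction format and a
fixed-point multiply is given below; it is a software illustration of the representation,
not the tool output (truncation is shown on the final shift for brevity):

FRAC_BITS = 16  # fractional bits; 15 integer bits plus a sign bit remain

def to_fixed(x):
    # Quantize a real value to the 32-bit fixed-point format
    return int(round(x * (1 << FRAC_BITS)))

def to_float(q):
    # Recover the real value from its fixed-point encoding
    return q / (1 << FRAC_BITS)

def fixed_mul(a, b):
    # The raw product carries 32 fractional bits; shift back to 16
    return (a * b) >> FRAC_BITS

w = to_fixed(0.75)
x = to_fixed(-1.5)
print(to_float(fixed_mul(w, x)))  # -1.125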

3.4.2 Vivado based Testing

In the Verilog-based testing of the network, digital hardware is designed to test the
trained network. This testing process requires the trained weights and biases for the
implementation of the convolutional neural network on FPGA. The building blocks of the
Verilog-based testing of the neural network consist of adders, multipliers, and look-up
tables (LUTs). The implementation of all the above-mentioned blocks is carried out in
Vivado, which is one of the best platforms for the simulation of digital hardware
designs.

A single neuron is the basic element of a CNN, so we must first implement a single neuron
on the FPGA; then we can implement the complete CNN. For the single-neuron implementation
we decided to design a block for each of the major operations. The first and main block
of the whole CNN is the convolution core; it consists of a single convolutional layer
that performs the convolution between the 3x3 filters and the input feature map. As there
are different convolutional layers with different numbers of filters throughout the net,
the dimensions of our convolution block can be changed, but the process and operation of
this block remain the same; only the input and output dimensions are changed. Similarly,
we have a max-pooling block and a ReLU activation block. All these blocks are used again
and again throughout the network.

Figure 3.9: Neural Network Implementation Block Diagram on FPGA

Here is how each of these blocks is designed and how it works.

3.4.2.1 Convolution Block

The inputs to the first layer (input layer) of our VGG-16 network were 28x28 grayscale
images. Each pixel of the image was considered a separate input, so we had 784 pixels, or
inputs, coming into our input layer. Next up is the convolutional layer: we have 64
kernels of size 3x3 in our first convolution layer, that is, 64 neurons, each with its
own weight values. These weights had already been obtained during our training process in
Python. During the convolution operation, each of these 3x3 filters of weight values is
convolved with the input image of 784 pixels (28x28). To maintain the input and output
dimensions of this convolution block, we used a padding of 2x2.

The formula for computing the output size of a convolution layer is

Output = [(W − K + 2P) / S] + 1

By taking a stride of 1, a padding of 2x2, and a kernel size of 3, we can calculate the
output of our first convolution layer to be of size 32x32, which is the same as our input
after performing the 2x2 padding.
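The formula can be checked quickly in Python (the parameter values below are
illustrative, not our layer settings):

def conv_output_size(W, K, P, S):
    # Output = [(W - K + 2P) / S] + 1
    return (W - K + 2 * P) // S + 1

# A 3x3 kernel with 1-pixel padding and stride 1 preserves the size:
print(conv_output_size(32, 3, 1, 1))  # 32
# A stride of 2 halves the spatial dimensions:
print(conv_output_size(32, 3, 1, 2))  # 16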
Due to memory constraints, we improvised by using a separate convolution block for our
network's convolution operation. The convolution block requires two matrices as inputs:
the first operand is an 8x8 pixel block loaded from the input feature map, and the second
operand is a 3x3 kernel loaded from the weight cache. The weights are written into a 3x3
register array, and the 8x8 pixel block is written into a 10x10 register array, both
implemented in the FPGA as distributed RAM memory (based on LUTs), which allows all the
individual registers to be accessed simultaneously in the same clock cycle. In the
remaining convolutional layers there are 128, 256, and 512 kernels. For each of these
convolutional layers this convolution block is reused by modifying its dimensions
according to the neuron counts of 128, 256, and 512, with the operation remaining the
same for each layer.

The convolution operation is done by processing the input feature maps block by block,
one at a time; hence the output feature maps are also produced one block at a time. The
use of a convolution block gives us the option of reusability, which can prove to be very
crucial when using a system with limited resources.

The convolution core is the module that performs the matrix convolutions iteratively,
aiming to keep latency as low as possible. For that purpose, the data blocks stored in
the block RAM are transferred to smaller caches made up of internal registers with much
faster access. The amount of latency needed to finish the whole convolution of an input
volume will depend on the number of channels and its size; therefore, it will vary from
one layer to another, for the layers have different hyper-parameters. The 2-D convolution
operation is performed by sliding the 3x3 kernel over all the pixels contained in the
current 8x8 block, which has previously been padded, one pixel at a time. The kernel is
centered over the pixel that is going to be processed, which, together with its
surrounding neighboring pixels, forms a pixel area termed the local receptive field
(LRF), also known as the convolution window. The compiler creates 8 different convolution
windows (each made of a 3x3 register array) and the same number of kernel register
arrays. In the same way, the compiler implements 8 different MAC units with 9 multipliers
and one adder tree each. As a result, eight output pixels can be processed in the same
clock cycle. Once both 3x3 register arrays are loaded (the convolution window and the
kernel), the convolution unit can perform a partial convolution computation, which is
done iteratively pixel by pixel.
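A behavioral Python sketch of this block convolution is given below; it models what the
hardware computes (one 8x8 output block from a padded 10x10 register array and a 3x3
kernel), not the Verilog itself.

import numpy as np

def convolve_block(block, kernel):
    # block: 10x10 register array (an 8x8 pixel block plus a
    # one-pixel halo), kernel: 3x3 weight register array
    out = np.zeros((8, 8))
    for i in range(8):        # in hardware, 8 MAC units work in parallel
        for j in range(8):
            window = block[i:i + 3, j:j + 3]       # local receptive field
            out[i, j] = np.sum(window * kernel)    # 9 multiplies + adder tree
    return out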

Figure 3.10: Schematic of Convolution

3.4.2.2 Max Pooling Block

In our network we used a max-pooling layer of size 2x2; it reduces the output by a factor
of 2. The output from the first convolutional layer was of size 32x32; after passing
through the first max-pooling layer it is reduced to 16x16, then similarly to 8x8, and so
on. Following our previous strategy, we also made a max-pooling block that is used again
and again throughout the network. Its output calculation is fairly simple, as the
relation between input and output is that each time an input is fed into a max-pooling
layer its dimensions are halved. So, for different max-pooling layers we change the
dimensions according to the input, and the rest of the operation remains the same. To
implement the max-pooling operation, we take all the resulting values from the
convolution operation and, using a comparator, compare them with each other by projecting
a 2x2 window onto the input. After each input value is compared with the others, the top
value in each 2x2 window is retained, thus reducing the size of the output by half. The
inputs are obtained from the memory block and the result is returned to the same block.

There would be no difference in terms of latency, though, if the blocks were accessed in
top-to-down precedence order.

Figure 3.11: Schematic of Max Pooling Operation

3.4.2.3 Bias Addition and ReLU Activation Function


The third block encompasses two main operations: one is the bias addition and the
second is the implementation of the ReLU activation function. Once the convolution
operation is completed for one block, the arithmetic logic unit partially holds the value
of the output. The arithmetic logic unit also adds the corresponding bias to the output
when all the input channels have been convolved, performing an element-wise addition
between the bias (a scalar) and an 8x8 matrix. After adding the bias, the ReLU activation
function is applied, which consists in zeroing all the negative values in the matrix. When
applying bit shifting, quantities are not truncated but rounded; the rounding method
applied is round-to-nearest, resolving ties toward positive infinity. The activation
function is then applied to each of these values and the output is stored in the memory
block.
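
A minimal Verilog sketch of this block, covering one matrix element, is given below. It
assumes one plausible ordering of the steps (rounding shift, then bias addition, then
ReLU), and the module name, signal names, and the parameters DW and SHIFT are
illustrative assumptions rather than the actual design.

module bias_relu #(
    parameter DW    = 16,   // assumed fixed-point word width
    parameter SHIFT = 8     // assumed number of fractional bits shifted out
) (
    input  wire signed [2*DW-1:0] acc,      // accumulated convolution result
    input  wire signed [DW-1:0]   bias,     // per-channel bias (a scalar)
    output wire signed [DW-1:0]   relu_out  // activated output element
);
    // Round to nearest, ties toward +infinity: add half an LSB of the
    // post-shift format, then shift right arithmetically.
    wire signed [2*DW-1:0] rounded = (acc + (1 <<< (SHIFT-1))) >>> SHIFT;
    // Element-wise bias addition (the same scalar is added to every
    // element of the 8x8 matrix, one element per cycle in this sketch).
    wire signed [2*DW-1:0] biased = rounded + bias;
    // ReLU: zero all negative values; truncation to DW bits assumes the
    // result fits (saturation is omitted in this sketch).
    assign relu_out = biased[2*DW-1] ? {DW{1'b0}} : biased[DW-1:0];
endmodule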

CHAPTER 4

Results

4.1 Introduction

In this chapter, the results obtained from the MATLAB-based implementation and the
Vivado-based digital hardware design are discussed together with the evaluation
parameters. Accuracy is assessed with a test dataset, and the obtained data is used to
build a confusion matrix and to calculate the recall and the precision of the model.
Finally, schematics and simulations produced during the exploration of the design space
are presented and reviewed.

4.2 MATLAB based Results

Using the VGG-16 model for handwritten digit recognition, a competitive accuracy of
above 92% is achieved.
Figure 4.1: Software Modeling Results

The difference must be attributed to the fact that the validation dataset and the test
dataset are not equal; nonetheless, a test accuracy of about 90% should be considered
rather satisfactory. Accuracy is a training-dependent quantity: provided the algorithm
has been described properly, it depends on the training methodology and on how good
the attained weights are.

Once the weights are hard-coded in the FPGA hardware, the accuracy of the accelerator
will be the same regardless of the design chosen to implement the algorithm.

Figure 4.2: Confusion Matrix

The terms on the matrix diagonal correspond to true positives, meaning the number of
times an actual class has been correctly inferred by the model, while the elements off
the diagonal are the false positives, i.e., the number of times an actual class has been
confused with another one. From the confusion matrix, two important metrics are
derived: the true positive rate (TPR, also termed recall or sensitivity) and the model
accuracy.
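
For reference, these metrics follow from the confusion-matrix counts in the usual way,
where TP, FP, and FN denote the true positives, false positives, and false negatives of a
given class:

TPR (recall) = TP / (TP + FN)
Precision = TP / (TP + FP)
Accuracy = (sum of diagonal entries) / (total number of test samples)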

Figure 4.3: Cross Validity Graph

In order to improve the TPR and the accuracy, different actions may be required before
retraining the model, such as increasing the number of images for a specific class or, on
the contrary, dropping some images that could be problematic for the training. For the
sake of the project, the refinement of the training dataset has not been taken further, and
the focus has been kept on the optimization of the hardware.

4.3 Vivado Based Results

The digital hardware was tested on a limited number of test images, and it provided
reasonable results even after conversion from the double-precision to the single-precision
design. The network showed an accuracy of 80%. Resource utilization is listed below.

Resources               Used/Available     Utilization

Logic Slices            2,694/15,850       22%

DSP Slices              86/240             36%

4-input LUTs            10,778/63,400      17%

Max Clock Frequency     16.376 MHz         ---

Table 1: Resource Utilization

These figures give the resource utilization of the convolution block, which was the most
computationally expensive block in the whole network. As stated earlier, the FPGA
board used was the Nexys 4 (Artix-7 XC7A100T-1CSG324C).

The following simulations were performed in Vivado on a limited set of test images, and
the results were satisfactory.

The simulations cover the convolution and max pooling operations.

Figure 4.4: Convolution Operation

Figure 4.5: Max Pooling

Following our earlier block strategy, these max pooling and convolution blocks are
reused again and again: at each new layer their dimensions are changed, with the rest of
the arithmetic operation remaining the same for each new max pooling and convolution
layer. Given our resource constraints, this strategy worked very well for us.

Conclusion

In this project we propose a design and an implementation of an FPGA-based CNN with
fixed-point arithmetic that accomplishes the recognition of handwritten digits. Because
of the reduced number of parameters, we avoid the usual memory-bandwidth issues. The
proposed technique can be implemented on basic FPGAs, yet it is also scalable to
FPGAs with a large number of logic cells. Furthermore, we show how existing open
datasets can be adapted to better suit real-world applications. A multilayer neural
network was implemented using a Field Programmable Gate Array (FPGA). For the
handwritten digit recognition framework, the overall accuracy is 92%, which is very
satisfactory. The digital hardware was tested on a limited number of test images and it
gave reasonable results even after conversion from the double-precision to the
single-precision design.

References

[1] L. Maguire, T. McGinnity, B. Glackin, A. Ghani, A. Belatreche and J. Harkin,
"Challenges for large-scale implementations of spiking neural networks on FPGAs",
Neurocomputing, vol. 71, no. 1-3, pp. 13-29, 2007. Available: 10.1016/j.neucom.2006.11.029.

[2] "Introduction to Neural Networks (Ruan Yifeng, repost)" (in Chinese), Zhihu Column.
[Online]. Available: https://zhuanlan.zhihu.com/p/37617671.

[3] "Neural networks for computer vision applications" (in Italian), Elettronica-plus.it.
[Online]. Available: http://elettronica-plus.it/reti-neurali-per-applicazioni-di-visione-
artificiale_92441/.

[4] K. Ahirwar, "Everything you need to know about Neural Networks | Hacker Noon",
Hackernoon.com, 2017. [Online]. Available: https://hackernoon.com/everything-
you-need-to-know-about-neural-networks-8988c3ee4491.

[5] S. H, "Neural Network 2: ReLU and Choosing Initial Values" (in Korean), Hodu
Coding, 2019. [Online]. Available: https://seoyoungh.github.io/deep-learning/zerotoall-9/.

[6] "3D Convolutions: Understanding + Use Case", Kaggle.com, 2018. [Online].
Available: https://www.kaggle.com/shivamb/3d-convolutions-understanding-use-case.

[7] S. Lee, S. Jung and J. Lee, "Prediction Model Based on an Artificial Neural Network
for User-Based Building Energy Consumption in South Korea", Energies, vol. 12, no. 4,
p. 608, 2019. Available: 10.3390/en12040608.

[8] M. Loey, "Convolutional Neural Network Models - Deep Learning", Slideshare.net,
2017. [Online]. Available: https://www.slideshare.net/mohamedloey/convolutional-
neural-network-models-deep-learning.

[9] A. Dertat, "Applied Deep Learning - Part 4: Convolutional Neural Networks",
Medium, 2017. [Online]. Available: https://towardsdatascience.com/applied-
deep-learning-part-4-convolutional-neural-networks-584bc134c1e2.

[10] S. Sudhakar, "Convolution Neural Network", Medium, 2017. [Online]. Available:
https://towardsdatascience.com/convolution-neural-network-e9b864ac1e6c?gi=9ff60fa1ac38.

[11] A. Poernomo and D. Kang, "Content-Aware Convolutional Neural Network for
Object Recognition Task", International Journal of Advanced Smart Convergence, vol. 5,
no. 3, pp. 1-7, 2016. Available: 10.7236/ijasc.2016.5.3.1.

[12] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan and J. Cong, "Caffeine: Toward
Uniformed Representation and Acceleration for Deep Convolutional Neural Networks",
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
vol. 38, no. 11, pp. 2072-2085, 2019. Available: 10.1109/tcad.2017.2785257.

[13] Y. Zhou and J. Jiang, "An FPGA-based accelerator implementation for deep
convolutional neural networks", 2015 4th International Conference on Computer Science
and Network Technology (ICCSNT), 2015. Available: 10.1109/iccsnt.2015.7490869.

[14] J. Qiu et al., "Going Deeper with Embedded FPGA Platform for Convolutional
Neural Network", Proceedings of the 2016 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, 2016. Available: 10.1145/2847263.2847265.

[15] S. Coric, I. Latinovic and A. Pavasovic, "A neural network FPGA implementation",
Proceedings of the 5th Seminar on Neural Network Applications in Electrical
Engineering, NEUREL 2000 (IEEE Cat. No.00EX287), 2002. Available:
10.1109/neurel.2000.902397.

[16] Y. Hao, "A General Neural Network Hardware Architecture on FPGA", arXiv.org,
2017. [Online]. Available: https://arxiv.org/abs/1711.05860.

[17] J. Kim, B. Grady, R. Lian, J. Brothers and J. Anderson, "FPGA-based CNN
inference accelerator synthesized from multi-threaded C software", 2017 30th IEEE
International System-on-Chip Conference (SOCC), 2017. Available:
10.1109/socc.2017.8226056.

[18] D. Wang, J. An and K. Xu, "PipeCNN: An OpenCL-Based FPGA Accelerator for
Large-Scale Convolution Neuron Networks", arXiv.org, 2016. [Online]. Available:
https://arxiv.org/abs/1611.02450.

[19] R. Solovyev, A. Kustov, D. Telpukhov, V. Rukhlov and A. Kalinin, "Fixed-Point
Convolutional Neural Network for Real-Time Video Processing in FPGA", 2018.

[20] S. Zhai, C. Qiu, Y. Yang, J. Li and Y. Cui, "Design of Convolutional Neural
Network Based on FPGA", Journal of Physics: Conference Series, vol. 1168, no. 6,
2019. Available: 10.1088/1742-6596/1168/6/062016.

[21] M. Zhang, L. Li, H. Wang, Y. Liu, H. Qin and W. Zhao, "Optimized Compression
for Implementing Convolutional Neural Networks on FPGA", Electronics, vol. 8, no. 3,
2019. Available: 10.3390/electronics8030295.

[22] N. Botros and M. Abdul-Aziz, "Hardware implementation of an artificial neural
network using field programmable gate arrays (FPGA's)", IEEE Transactions on
Industrial Electronics, vol. 41, no. 6, pp. 665-667, 1994. Available: 10.1109/41.334585.

[23] V. Khandelwal, "The Architecture and Implementation of VGG-16", Medium, 2017.
[Online]. Available: https://medium.com/towards-artificial-intelligence/the-architecture-
and-implementation-of-vgg-16-b050e5a5920b.
