Fundamentals of Neural Networks

1. INTRODUCTION
Deep learning is a subfield of artificial intelligence and machine learning that is
inspired by the structure of the human brain.
Deep learning algorithms attempt to draw conclusions similar to those a human would reach by
continually analyzing data with a given logical structure called a neural network. These
neural networks are designed to simulate the human brain's ability to learn and make
decisions [2].
Figure 1: Basic Structure of Deep Learning.[2]
1.1. Neural networks

Neural networks are a fundamental concept in deep learning, a subset of machine
learning. They are inspired by the structure and function of the human brain and consist
of interconnected nodes, or neurons, that process and transmit information. In deep
learning, neural networks are used to model and solve complex problems by learning
from large amounts of data [2].
1.1.1. Neuron Model

Structure: - The neuron model used in artificial neural networks consists of three main
components:
a. Dendrites: - These are the input channels through which the neuron receives
signals from other neurons or external stimuli.
b. Cell Body (Soma): - This is the main processing unit of the neuron, where the
received signals are integrated and transformed into an output signal.
c. Axon: - This is the output channel through which the neuron sends its output
signal to other neurons or target cells.
Function: - The neuron model operates based on the following principles:
a. Signal Integration: The neuron integrates the incoming signals from the
dendrites, taking into account their respective strengths, or weights.
b. Activation Function: The integrated signal is then passed through an activation
function, which determines the output signal of the neuron.
c. Signal Transmission: If the output signal exceeds a certain threshold, it is
transmitted through the axon to other neurons or target cells [2].
Figure 2: Neuron Structure.[2]
1.1.2. Components of Neural Network
Neuron/Node: - The basic computational unit of an ANN, which processes input data
and produces an output.
Input layer: - The initial layer of neurons, which receives the input data (also called
features).
Hidden layers: - The intermediate layers of neurons that lie between the input
and output layers.
Output layer: - The final layer of neurons, which produces the network's output.
Weights: - Each connection between nodes is assigned a weight, which determines the
strength of the connection and its impact on the final output.
Activation Function: - An activation function is applied to the weighted sum of the inputs at
each node, introducing non-linearity and enabling the network to learn complex
patterns [2].
Figure 3: Layers of Neural Network.[2]
1.2. Types of Neural Network

Different types of neural networks are used in deep learning, such as convolutional neural
networks (CNN), recurrent neural networks (RNN), artificial neural networks (ANN), and
feedforward neural networks (FNN). They are changing the way we interact with the
world.
Artificial Neural Networks: Artificial neural networks are computational models inspired
by the structure and function of the human brain. They consist of interconnected nodes,
known as artificial neurons, which work together to process and analyse
complex patterns and data.
Convolutional Neural Networks: Convolutional neural networks are used for image
recognition and processing, and are particularly useful for finding patterns in images to
recognize objects, classes, and categories. They use a mathematical
operation called convolution to filter the input data and produce a feature map.
Feedforward Neural Networks: Feedforward neural networks are a type of artificial
neural network commonly used in deep learning. They are designed to process input
data and produce output predictions without any feedback connections.
Recurrent Neural Networks: Recurrent neural networks are a type of deep learning
neural network designed to process sequential data. Unlike feedforward neural
networks, which process data in a strictly forward direction, RNNs have recurrent
connections that allow them to retain information from previous steps in the sequence.
Generative Adversarial Networks: Generative adversarial networks (GANs) are a
class of deep learning neural networks that are used for generating new data samples
that resemble a given training dataset. GANs consist of two main components: the
generator network and the discriminator network. The generator network generates new
samples, while the discriminator network evaluates the generated samples and tries to
distinguish them from real samples. The two networks are trained together in a
competitive process, where the generator tries to fool the discriminator, and the
discriminator tries to accurately classify the samples as real or generated [2].
1.3. Weights and Biases

Weights and biases (commonly referred to as w and b) are the learnable parameters of
some machine learning models, including neural networks.
Neurons are the basic units of a neural network. In an ANN, each neuron in a layer is
connected to some or all of the neurons in the next layer. When the inputs are
transmitted between neurons, the weights are applied to the inputs along with the bias.
z = \sum_{i} w_i x_i + b    (1)
Weights control the signal (or the strength of the connection) between two neurons.
In other words, a weight decides how much influence the input will have on the output.
A bias can be viewed as an additional input into the next layer that always has the constant
value 1. Bias units are not influenced by the previous layer (they do not
have any incoming connections), but they do have outgoing connections with their own
weights. The bias unit allows a neuron to produce a non-zero activation even when all
the inputs are zero [3].
Figure 4: Mathematical Block of Weights and Bias.[3]
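The role of the weights and the bias can be illustrated with a short sketch. It assumes the usual form of a neuron's net input, the sum of each input multiplied by its weight, plus the bias. The function name and the numeric values are illustrative and not taken from the cited source.

```python
def neuron_pre_activation(inputs, weights, bias):
    """Return the weighted sum of the inputs plus the bias (the net input)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Illustrative values only.
weights = [0.5, -1.0, 2.0]
bias = 0.25

# Each input contributes in proportion to its weight.
print(neuron_pre_activation([1.0, 2.0, 3.0], weights, bias))

# With all inputs at zero, only the bias remains.
print(neuron_pre_activation([0.0, 0.0, 0.0], weights, bias))
```

The second call shows why the bias matters: the weights have no effect on an all-zero input, so the bias alone decides the net input.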
1.4. Linear Regression

Linear regression is a type of supervised machine learning algorithm that computes the
linear relationship between a dependent variable and one or more independent
features. When there is only one independent feature, it is known as
univariate linear regression; when there is more than one feature, it is known as
multivariate linear regression.
1.4.1. Simple Linear Regression

This is the simplest form of linear regression, and it involves only one independent variable and
one dependent variable. The equation for simple linear regression is:
Y = \beta_0 + \beta_1 X    (2)
Where:
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope
1.4.2. Multiple Linear Regression

This involves more than one independent variable and one dependent variable. The equation for
multiple linear regression is:
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p    (3)
Where:
Y is the dependent variable
X1, X2, …, Xp are the independent variables
β0 is the intercept
β1, β2, …, βp are the slopes
The goal of the algorithm is to find the best-fit line equation that can predict the values based on
the independent variables. In regression, a set of records with X and Y values is available, and these
values are used to learn a function, so that Y can be predicted for a previously unseen X with the
learned function. Because the value of Y to be found is continuous, the function must
predict a continuous Y given X as independent features [8].
Best-Fit Line
The best-fit line equation provides a straight line that represents the relationship between the
dependent and independent variables. The slope of the line indicates how much the dependent
variable changes for a unit change in the independent variable(s).
Figure 5: Linear Regression.[8]
Here Y is called the dependent or target variable, and X is called the independent variable, also
known as the predictor of Y. Many types of functions or models can be used for
regression; a linear function is the simplest. Here, X may be a single feature or
multiple features representing the problem.
Linear regression predicts the value of a dependent variable (y) from a given
independent variable (x) by assuming a linear relationship between them; hence the name
linear regression. In the figure above, X (input) is
the work experience and Y (output) is the salary of a person. The regression line is the best-fit
line for our model.
Different values for the weights, that is, the coefficients of the line, result in different
regression lines, so we use a cost function to compute the values that give the best-fit line [8].
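The link between the cost function and the best-fit line can be sketched in a few lines. The example assumes a mean squared error cost and uses the closed-form least-squares solution for a single feature. The text does not say which fitting method the cited source uses, so this is one possible approach rather than the source's method. The experience and salary figures are made up.

```python
def fit_line(xs, ys):
    """Closed-form least-squares estimates of the intercept and slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


def mse_cost(xs, ys, beta0, beta1):
    """Mean squared error of the line beta0 + beta1 * x on the data."""
    return sum((beta0 + beta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: years of experience versus salary (in thousands).
experience = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [35.0, 40.0, 45.0, 50.0, 55.0]

beta0, beta1 = fit_line(experience, salary)
print("intercept:", beta0, "slope:", beta1)
print("cost of fitted line:", mse_cost(experience, salary, beta0, beta1))
print("cost of another line:", mse_cost(experience, salary, 30.0, 4.0))
```

Any line other than the fitted one gives a higher cost on this data, which is the sense in which the fitted line is the best fit.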
1.5. Perceptron
A perceptron is a fundamental building block of artificial neural networks. It is a type of
artificial neuron that takes multiple inputs, applies weights to them, and produces an
output based on a specified activation function.
How Does the Perceptron Algorithm Work?
The perceptron algorithm is used to train the weights of a perceptron in order to make
accurate predictions. It follows these steps:
1. Initialize the weights and bias to random values.
2. For each training example, calculate the weighted sum of the inputs.
3. Apply the activation function to the weighted sum to get the output.
4. Compare the output with the target output and adjust the weights and bias
accordingly.
5. Repeat steps 2-4 until the desired level of accuracy is achieved or a maximum
number of iterations is reached.
Figure 6: Block Diagram of Perceptron.[4]
Consider a step function defined in the following way:
y = 1 if x > 0
y = 0 if x ≤ 0 (4)
The net input z to a perceptron is the sum of the inputs multiplied by their respective weights,
plus the bias:
z = w_1 x_1 + w_2 x_2 + \dots + w_m x_m + b    (5)
In vector form, we can write z as the dot product between the input vector
x = (x1, …, xm)ᵗ and the weight vector w = (w1, …, wm)ᵗ plus the bias:
z = wᵗx + b (6)
An activation function f(z) is then applied to the net input to generate a binary output
(0/1 or -1/+1). We can write this entire computation in one equation:
o = f(wᵗx + b) (7)
Where f is the chosen activation function and o is the output of the perceptron [4].
1.5.1. Perceptron as Linear Classifiers

The perceptron is a type of linear classifier, since it divides the input space into two
areas separated by a hyperplane. The equation of the separating hyperplane
can be written as:
wᵗx + b = 0 (8)
The weight vector w is orthogonal to this hyperplane, and thus determines its
orientation, while the bias b determines its offset from the origin.
Every example above the hyperplane (wᵗx + b > 0) is classified by the perceptron as a
positive example, while every example below the hyperplane (wᵗx + b < 0) is classified
as a negative example.
Figure 7: Perceptron as a linear classifier.[4]
Linear classifiers are capable of learning only linearly separable problems, i.e.,
problems where the decision boundary between the positive and the negative examples
is a linear surface [4].
For example, the following data set is not linearly separable, and therefore a perceptron
cannot correctly classify all the examples in this data set:
Figure 8: Non-linearly separable data set.[4]
1.5.2. Perceptron Learning Algorithm
Figure 9: Snapshot of Perceptron Algorithm.
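Since the algorithm is only given as a figure, the following is a minimal sketch of the classic perceptron learning rule, which is assumed here and may differ in detail from the figure. Each weight is moved by the learning rate times the error times the corresponding input. The weights start at zero rather than at random values so that the run is reproducible. The logical AND data set, the learning rate, and the function names are illustrative choices.

```python
def step(z):
    """Step activation: 1 if z > 0, otherwise 0."""
    return 1 if z > 0 else 0


def predict(weights, bias, x):
    """Perceptron output for one input vector."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return step(z)


def train_perceptron(samples, targets, learning_rate=0.5, max_epochs=100):
    """Train with the classic perceptron rule (assumed, see the lead-in)."""
    weights = [0.0] * len(samples[0])  # zeros for reproducibility
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(samples, targets):
            error = target - predict(weights, bias, x)
            if error != 0:
                errors += 1
                # Move the hyperplane towards the misclassified example.
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * error
        if errors == 0:  # every example is classified correctly
            break
    return weights, bias


# Logical AND is linearly separable, so the rule can converge on it.
and_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_targets = [0, 0, 0, 1]

weights, bias = train_perceptron(and_inputs, and_targets)
predictions = [predict(weights, bias, x) for x in and_inputs]
print(weights, bias, predictions)
```

On a data set that is not linearly separable, such as XOR, the loop would stop at `max_epochs` without reaching zero errors.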
1.6. Multi-Layer Perceptron

A Multi-Layer Perceptron (MLP) is a type of artificial neural network characterized by its
multiple layers of nodes (neurons). Multi-Layer Perceptron notation refers to the
symbolic representation used to illustrate the architecture and connections within a
neural network. In an MLP, nodes (neurons) are organized into layers, including an input
layer, hidden layers, and an output layer. The notation visually represents the flow of
information between layers through weights and activation functions. Each connection
between nodes is associated with a weight, and each node applies an activation
function to the weighted sum of its inputs. MLPs are trained using algorithms like
backpropagation to learn and make predictions [5].
Figure 10: Multi-Layer Perceptron.[5]
Like the perceptron, the MLP tries to minimize the error, but it does so through a different
process called backpropagation:
1. Select the number of training instances the network will process at a time.
2. Pass the training instances through the input layer → hidden layer → output layer.
3. Compute an output error based on the output from the output layer.
4. Go through the network in reverse order to measure how each connection contributes
to the output error.
5. Update the weights to reduce the output error.
6. Repeat steps 2 to 5 until every training instance has been processed, which completes
one loop over the entire data set.
7. Repeat step 6 for further loops over the data set as training requires [5].
2. NEURAL NETWORK DESIGN
2.1. Activation Function

In a neural network, an activation function transforms the weighted input of a neuron and
produces an output which is then passed forward into the subsequent layer. Activation functions
add non-linearity to the output, which enables neural networks to solve non-linear problems. In
other words, a neural network without an activation function is essentially just a linear
regression model.
Activation Function Types: Common activation functions include Sigmoid, Tanh, ReLU,
PReLU/Leaky ReLU and ELU, but there are many others.
2.1.1. Sigmoid Function
The sigmoid function is one of the most commonly used activation functions. Its mathematical
representation is
\sigma(x) = \frac{1}{1 + e^{-x}}    (9)
Sigmoid functions, commonly used in logistic regression and basic
neural networks, serve as introductory activation units in machine learning. However,
they are less suitable for deep neural networks due to drawbacks such as the
vanishing gradient problem. The derivative of the sigmoid is small (at most 0.25) and
approaches zero for inputs of large magnitude, so the gradient signal shrinks at each layer
and the loss of information compounds in deeper networks. The output of the sigmoid function
is also always positive rather than zero-centered, which further hampers gradient-based
training and makes it less suitable for early layers. It can still be used in the last layer [7].
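The vanishing gradient behaviour of the sigmoid can be checked numerically. This sketch relies on the standard identity that the derivative of the sigmoid is σ(x)(1 − σ(x)). The sample inputs and the depth of ten layers are arbitrary illustrative choices.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The derivative peaks at x = 0 and decays quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid(x), sigmoid_derivative(x))

# Even in the best case, ten stacked sigmoid layers scale the gradient
# by at most 0.25 ** 10.
print(0.25 ** 10)
```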
2.1.2. Tanh Function

To avoid the disadvantages of the sigmoid function, researchers have designed many
other kinds of activation functions, including the tanh function. The tanh function is defined
as
\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\sigma(2x) - 1    (10)
The graph of the tanh function is shown in Figure 11. Compared to the sigmoid function, the tanh
function is zero-centered, as it transforms the input into the symmetric range (−1, 1).
Therefore, the tanh function solves the non-zero-centered problem of the sigmoid function.
However, when the input is very large or very small, the output of the tanh function saturates
and its gradient is small. This is not conducive to the weight update. Thus, the tanh function does
not solve the vanishing gradient problem either [6].
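The relationship between tanh and the sigmoid, and the saturation of tanh, can be verified with a small sketch. It uses `math.tanh` from the Python standard library as the reference and the standard identity that the derivative of tanh is 1 − tanh²(x). The helper names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x):
    """tanh expressed through the sigmoid: 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_derivative(x):
    """Derivative of tanh: 1 - tanh(x) ** 2."""
    return 1.0 - math.tanh(x) ** 2


for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    # The two columns agree up to floating-point rounding.
    print(x, math.tanh(x), tanh_via_sigmoid(x))

# The gradient is 1 at the origin but close to zero once tanh saturates.
print(tanh_derivative(0.0), tanh_derivative(5.0))
```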
2.1.3. ReLU Function
The Rectified Linear Unit (ReLU) was first applied to the restricted Boltzmann machine
and is now one of the most widely used activation functions. When the input is negative, the
output of the ReLU function is 0; otherwise, the output is equal to the input. Its formal
definition is
f(x) = max(0, x) (11)

The computation of the ReLU function is particularly simple, since ReLU does not involve
an exponential operation, in contrast to the tanh and sigmoid functions. The ReLU
function only needs to return the maximum of 0 and x, which can be
implemented with one computer instruction. Moreover, when x > 0, the gradient of ReLU
does not decay, as shown in Figure 11, thus alleviating the vanishing gradient problem.
Therefore, in deep learning, especially in neural networks with hundreds of layers
(e.g., ResNet), activation functions like ReLU are commonly used [6].
Figure 11: Types of Activation Functions Graphical Representation.[7]
2.1.4. PReLU/Leaky ReLU function

Since a ReLU neuron may die when x < 0 (its output and gradient are both zero, so its weights
stop updating), many improved versions of ReLU have appeared,
including Leaky ReLU and Parametric ReLU (PReLU).
The Leaky ReLU function is defined as
f(x) = max(αx, x) (12)
where the parameter α is a small constant in the range (0, 1). When x < 0, the Leaky ReLU
function has a very small slope α, as shown in Figure 12, which prevents the neuron from dying.
The definition of the PReLU function is similar to that of the Leaky ReLU, the only difference
being that α is a learnable parameter. Each channel has one parameter α, which is obtained through
back-propagation training [6].
Figure 12: Leaky ReLU function.[1]
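The difference between ReLU and Leaky ReLU on negative inputs can be seen in a short sketch. The value α = 0.01 is a common illustrative choice, not one given in the text. The gradient at exactly x = 0 is taken to be the negative-side value, which is a convention assumed here.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: max(alpha * x, x) for a small alpha in (0, 1)."""
    return max(alpha * x, x)


def relu_grad(x):
    """Gradient of ReLU (0 is used at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, alpha=0.01):
    """Gradient of Leaky ReLU (alpha is used at x = 0 by convention)."""
    return 1.0 if x > 0 else alpha


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), leaky_relu(x), relu_grad(x), leaky_relu_grad(x))
```

For x = -2.0 the ReLU gradient is exactly zero, so no update reaches the weights, while Leaky ReLU still passes a small gradient.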
2.1.5. ELU function

The Exponential Linear Unit (ELU) function [67] combines the sigmoid and ReLU functions. It
is defined as
f(x) = \begin{cases} x, & x > 0 \\ \alpha (e^{x} - 1), & x \le 0 \end{cases}    (13)
where α is a tunable coefficient that controls the saturation value of the ELU in the
negative domain. The mean value of the ELU output shifts towards zero, which speeds up model
convergence. When x > 0, the ELU function outputs y = x, which avoids the vanishing gradient
problem. When x ≤ 0, ELU is left soft saturated, as shown in Figure 13, which prevents neurons from
dying. The disadvantage of ELU is that it involves exponential computation, so its
computational complexity is relatively high [6].
Figure 13: ELU function.[1]
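A small sketch of the ELU, assuming its standard definition: f(x) = x for x > 0 and α(e^x − 1) otherwise. It shows that the output saturates towards −α for large negative inputs. The default α = 1.0 is an illustrative choice.

```python
import math


def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


for x in (-10.0, -1.0, 0.0, 2.0):
    print(x, elu(x))

# The saturation level in the negative domain is set by alpha.
print(elu(-10.0, alpha=2.0))
```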
2.2. Loss Function

A loss function is a function that measures how well a deep learning model’s predicted
outputs match the true output labels. The loss function is used to optimize the model by
minimizing the loss, which means that the model makes fewer mistakes on the training
data. The loss function is an important part of the machine learning process because it
provides a way to evaluate the model’s performance and guide the optimization process
by indicating how the model’s predictions differ from the true output labels.
Several different loss functions in neural networks exist, such as mean squared error,
binary cross-entropy loss and hinge loss. Mean squared error is often used for
regression tasks, while binary cross-entropy loss is commonly used for classification
tasks [7].
2.2.1. Mean squared error (MSE) loss
Mean squared error loss is a common loss function used for regression tasks. It is
calculated by taking the average of the squared differences between the predicted and true
values over all the samples in the dataset. The MSE loss is used to evaluate the
performance of a model on a regression task and is often used as an optimization
objective when training a model.
The following equation gives the MSE loss:
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{pred,i} - y_{true,i})^2    (14)
Where n is the number of samples and y_pred and y_true are the predicted and true output
values, respectively [7].
Figure 14: Graphical Representation of Mean Square error loss.[7]
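As a worked example, the MSE can be computed directly from its definition, the average of the squared differences. The sample values below are made up for illustration.

```python
def mse_loss(y_pred, y_true):
    """Mean of the squared differences between predictions and targets."""
    n = len(y_true)
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Squared errors are 0.25, 0.25, 0.0 and 1.0, so the mean is 1.5 / 4.
print(mse_loss(y_pred, y_true))
```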
2.2.2. Binary cross-entropy (BCE) loss
Binary cross-entropy loss is a loss function used for binary classification tasks. It is
defined as the negative log of the predicted probability of the true class, averaged over
the samples.
The mathematical formulation for BCE loss is:
\mathrm{BCE} = -\frac{1}{n} \sum_{i=1}^{n} [ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) ]    (15)
Where n is the number of samples, y_i is the true label (0 or 1), and p_i is the predicted
probability that sample i belongs to class 1 (a value between 0 and 1) [7].
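The following sketch computes the binary cross-entropy in its standard form, where each prediction is the probability of class 1 and the loss is averaged over the samples. Clipping the probabilities with a small `eps` is an added safeguard against log(0) and is not part of the definition.

```python
import math


def bce_loss(y_true, p_pred, eps=1e-12):
    """Average binary cross-entropy; p_pred holds probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


labels = [1, 0]
print(bce_loss(labels, [0.9, 0.1]))  # confident and correct: small loss
print(bce_loss(labels, [0.5, 0.5]))  # undecided: loss equals log(2)
print(bce_loss(labels, [0.1, 0.9]))  # confident and wrong: large loss
```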
2.2.3. Hinge loss
Hinge loss is a loss function used to train margin-based linear classifiers, such as support
vector machines (SVMs). It is the maximum of zero and one minus the product of the true label
and the predicted margin. The loss is therefore zero when an example lies on the correct side
of the decision boundary with a margin of at least 1, and it grows linearly as the example
moves towards or across the boundary.
The following equation gives the hinge loss:
L = \max(0, 1 - y_{true} \cdot y_{pred})    (16)
Where y_pred is the predicted margin (the signed score produced by the classifier) and y_true
is the true label (-1 or 1) [7].
Figure 15: Graphical Representation of Hinge Loss.[7]
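A minimal sketch of the hinge loss in its standard form, max(0, 1 − y_true · y_pred), where the label is −1 or +1 and the prediction is a signed score. Averaging over the samples in `mean_hinge` is an assumption made here for illustration.

```python
def hinge(y_true, score):
    """Hinge loss for one example with label -1 or +1 and a signed score."""
    return max(0.0, 1.0 - y_true * score)


def mean_hinge(labels, scores):
    """Average hinge loss over a set of examples (assumed aggregation)."""
    return sum(hinge(t, s) for t, s in zip(labels, scores)) / len(labels)


print(hinge(1, 2.0))   # correct with a margin above 1: no loss
print(hinge(1, 0.5))   # correct but inside the margin: small loss
print(hinge(-1, 0.5))  # wrong side of the boundary: larger loss
print(mean_hinge([1, 1, -1], [2.0, 0.5, 0.5]))
```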
3. NEURAL NETWORK TRAINING
Neural networks are trained to minimize the difference between predicted and true values
through forward and backward propagation. Forward propagation computes hidden layer outputs
iteratively from input vectors, weights, and activation functions, extracting features. Backward
propagation calculates loss based on forward results, and then adjusts weights and biases using
gradient descent and the chain rule to minimize the loss function [6].
Figure 16: Neural Network Training.[7]
3.1. Forward Propagation

The forward propagation of each neural network layer consists of two steps: first, the
transpose of the weight matrix is multiplied by the input vector. Then, the products pass
through the nonlinear activation function to get the output vector.
The input layer of the neural network has three neurons, denoted as x = [x1; x2; x3]. The
hidden layer contains three neurons, denoted as h = [h1; h2; h3]. The output layer
contains two neurons, denoted as ŷ = [ŷ1; ŷ2]. The corresponding bias vector of the
connections between the input and the hidden layer is b^{(1)}, and the weight matrix is
W^{(1)} = \begin{bmatrix} w^{(1)}_{1,1} & w^{(1)}_{1,2} & w^{(1)}_{1,3} \\ w^{(1)}_{2,1} & w^{(1)}_{2,2} & w^{(1)}_{2,3} \\ w^{(1)}_{3,1} & w^{(1)}_{3,2} & w^{(1)}_{3,3} \end{bmatrix}    (17)
The corresponding bias vector of the connections between the hidden layer and the output layer
is b^{(2)}, and the weight matrix is
W^{(2)} = \begin{bmatrix} w^{(2)}_{1,1} & w^{(2)}_{1,2} \\ w^{(2)}_{2,1} & w^{(2)}_{2,2} \\ w^{(2)}_{3,1} & w^{(2)}_{3,2} \end{bmatrix}    (18)
In this neural network, the sigmoid function is used as the activation function,
\sigma(x) = \frac{1}{1 + e^{-x}}    (19)
In the forward propagation process of computing the hidden layer given the input, the transpose
of the weight matrix W^{(1)} is multiplied by the input vector x, and then the bias vector b^{(1)} is added:
v = (W^{(1)})^{T} x + b^{(1)}    (20)
With the sigmoid activation function, applied element-wise, the output of the hidden layer is
h = \frac{1}{1 + e^{-v}}    (21)
The forward propagation process of computing the output layer given the hidden layer is similar
[1].
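The forward pass of the three-input, three-hidden, two-output network described above can be sketched as follows. The indexing convention is an assumption: `weights[i][j]` is taken to connect node i of the previous layer to node j of the next, so each output uses a column of the matrix, which matches multiplying by the transpose. All weight, bias, and input values are illustrative.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(weights, bias, inputs):
    """Return sigmoid(W^T x + b) for one fully connected layer."""
    outputs = []
    for j in range(len(bias)):
        # Column j of the weight matrix feeds output node j.
        v_j = sum(weights[i][j] * inputs[i] for i in range(len(inputs))) + bias[j]
        outputs.append(sigmoid(v_j))
    return outputs


# Illustrative parameters: W1 is 3x3, W2 is 3x2.
x = [1.0, 0.5, -1.0]
W1 = [[0.1, 0.2, 0.3],
      [0.4, 0.5, 0.6],
      [0.7, 0.8, 0.9]]
b1 = [0.0, 0.0, 0.0]
W2 = [[0.1, 0.2],
      [0.3, 0.4],
      [0.5, 0.6]]
b2 = [0.0, 0.0]

h = layer_forward(W1, b1, x)      # hidden layer output
y_hat = layer_forward(W2, b2, h)  # network output
print(h)
print(y_hat)
```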
3.2. Backward Propagation

For backward propagation, first, the loss function is calculated from the difference between the
neural network output and the true value. Then, partial derivatives of the loss function with
respect to every weight and bias are computed. Finally, the weights and biases are updated. The
mean squared error (MSE) is used as the loss function in the example of the previous subsection.
The value of the loss function on the sample (x, y) is
(22)
As the weights W are randomly initialized, the value of the loss function is initially large.
To measure the impact of W on the loss function, we take the weight between the second node of
the hidden layer and the first node of the output layer, w^{(2)}_{2,1} (denoted as ω), as an
example, and we compute the partial derivative of the loss function L(W) with respect to ω
using the chain rule. First, the partial derivative of the loss L(W) with respect to ŷ1 is
computed, then the partial derivative of ŷ1 with respect to z1, and then the partial derivative
of z1 with respect to ω. Finally, the three derivatives are multiplied together [1]:
\frac{\partial L(W)}{\partial \omega} = \frac{\partial L(W)}{\partial \hat{y}_1} \cdot \frac{\partial \hat{y}_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial \omega}    (23)
Combined with the example in the previous subsection, the partial derivative of the loss function
with respect to ω is calculated. The overall loss function is computed as
(24)
Where z1 is calculated by multiplying the outputs of the hidden layer h1, h2, and h3 by
w^{(2)}_{1,1}, ω, and w^{(2)}_{3,1}, respectively, and then summing the products with the bias b^{(2)}_1:
z_1 = w^{(2)}_{1,1} h_1 + \omega h_2 + w^{(2)}_{3,1} h_3 + b^{(2)}_1    (25)
Similarly, we can update the other weights in W^{(2)}.

The above process is the first step of back-propagation. The remaining weights from the
input to the hidden layer and from the hidden layer to the output layer can be calculated
and updated using the same chain rule.
Back-propagation is a crucial technique in neural networks that helps reduce the
output error by propagating it backwards, layer by layer. To do this, the impact of each
weight (ω) on the overall loss is calculated through partial derivatives. The calculated
derivative is then multiplied by a step size and subtracted from the weight, and this is done
for the weight matrices of the entire network. After completing one round of back-propagation,
all the parameters of the model have been updated. The process continues by feeding a new
input sample, computing the model error, and updating the model. This iterative process helps
minimize the difference between the predicted and true values. Training is completed when the
error falls below a predefined threshold [1].
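The chain-rule computation for a single weight can be sketched and checked against a numerical derivative. The example follows the setting above: ω connects the second hidden node to the first output node, and the output uses a sigmoid. The exact form of the loss is assumed to be L = ½(y1 − ŷ1)² for this one output. The hidden outputs, the other weights, the target, and the step size are illustrative values.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def output_1(omega, h, w11, w31, b1):
    """First output: sigmoid of z1 = w11*h1 + omega*h2 + w31*h3 + b1."""
    z1 = w11 * h[0] + omega * h[1] + w31 * h[2] + b1
    return sigmoid(z1)


def loss(omega, h, w11, w31, b1, y1):
    """Assumed loss for this sketch: half the squared error on output 1."""
    return 0.5 * (y1 - output_1(omega, h, w11, w31, b1)) ** 2


def grad_omega(omega, h, w11, w31, b1, y1):
    """Chain rule: dL/domega = dL/dy_hat1 * dy_hat1/dz1 * dz1/domega."""
    y_hat1 = output_1(omega, h, w11, w31, b1)
    dL_dyhat = y_hat1 - y1               # derivative of the assumed loss
    dyhat_dz = y_hat1 * (1.0 - y_hat1)   # derivative of the sigmoid
    dz_domega = h[1]                     # omega multiplies h2
    return dL_dyhat * dyhat_dz * dz_domega


# Illustrative values only.
h = [0.6, 0.7, 0.8]
w11, w31, b1 = 0.1, 0.5, 0.0
y1 = 1.0
omega = 0.3

analytic = grad_omega(omega, h, w11, w31, b1, y1)

# Central finite difference as an independent check of the chain rule.
eps = 1e-6
numeric = (loss(omega + eps, h, w11, w31, b1, y1)
           - loss(omega - eps, h, w11, w31, b1, y1)) / (2 * eps)

# One gradient-descent update: subtract step size times the derivative.
learning_rate = 0.5
new_omega = omega - learning_rate * analytic

print(analytic, numeric)
print(loss(omega, h, w11, w31, b1, y1), loss(new_omega, h, w11, w31, b1, y1))
```

The loss printed after the update is lower than before it, which is the effect each back-propagation step aims for.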
4. OVERFITTING AND REGULARIZATION
4.1. Overfitting
Overfitting occurs when a model closely fits the training data but fails to generalize well
to new data, resulting in low training error but high test error. This is common in neural
networks with excessive layers and parameters. Underfitting, conversely, happens when
the model is too simple to capture the underlying structure of the data, leading to high
training error due to insufficient training features. Underfitting can be addressed by
increasing training samples or using a more sophisticated model. The examples in Figure 17
illustrate appropriate fitting (an arc with small error), underfitting (a simple straight line
with large training error), and overfitting (an irregular curve).
Figure 17: Types of Overfitting.[1]

The examples demonstrate that with three variables, samples can be fitted with a
quadratic curve. However, fitting a quartic curve may lower the training error but fail to
generalize to new samples. Regularization techniques can help mitigate the impact of
higher-order terms in models, such as third- and fourth-order terms, improving
generalization performance [1].
Figure 18: Different fitting functions.[1]
4.2. Regularization
4.2.1. Parameter norm penalty
Parameter norm penalty, a regularization technique also known as weight decay, involves adding
a penalty term to the loss function during training to discourage large parameter values.
This penalty helps prevent overfitting by encouraging the model to learn simpler
patterns, and it reduces the model's sensitivity to noise in the training data. Commonly
used parameter norm penalties include L1 regularization (which adds the sum of the absolute
values of the weights to the loss) and L2 regularization (which adds the sum of the squared
values of the weights to the loss) [1].
Figure 19: Solving overfitting problem by regularization.[1]
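The two penalties can be illustrated with a short sketch. The coefficient `lam`, which scales the penalty before it is added to the data loss, is an assumed hyperparameter that the text does not name. The weights and loss values are made up.

```python
def l1_penalty(weights):
    """Sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of the squared values of the weights."""
    return sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam, kind="l2"):
    """Data loss plus lam times the chosen norm penalty."""
    penalty = l2_penalty(weights) if kind == "l2" else l1_penalty(weights)
    return data_loss + lam * penalty


weights = [0.5, -2.0, 0.0, 1.5]
print(l1_penalty(weights), l2_penalty(weights))
print(regularized_loss(1.0, weights, lam=0.5, kind="l1"))
print(regularized_loss(1.0, weights, lam=0.5, kind="l2"))
```

The squared penalty weighs the large weight (-2.0) much more heavily than the small one (0.5), which is why L2 regularization mainly shrinks large weights.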
4.2.2. Sparsification
Sparsification involves making many weights or neurons in a neural network become
zero during training, with some techniques achieving up to 90% sparsity. This reduces
computation during forward propagation, as calculations on zero weights or neurons
can be skipped. Sparsification is typically achieved by adding penalty terms to the
training process [1].
4.2.3. Bagging
Bagging (bootstrap aggregating) involves training multiple models that jointly vote on the output
for test samples, aiming to improve performance. It constructs k different datasets by sampling
with replacement from the original dataset, each with the same size as the original. Bagging
allows using the same or different models, training algorithms, and objective functions. For
example, if one pretrained neural network model struggles with cat recognition, two other models
with different parameters and network topologies can be built. These models can also employ
different machine learning methods, such as neural networks, support vector machines, or
decision trees. The final recognition result can be obtained by averaging the outputs or by
training an additional classifier to choose among the outputs, ultimately reducing recognition
error [1].
Figure 20: Bagging.[1]
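The two mechanical parts of bagging, resampling and voting, can be sketched without training any model. The data set is a placeholder list of sample indices, the votes stand in for the outputs of three hypothetical models, and the seed and k = 3 are arbitrary.

```python
import random
from collections import Counter


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items with replacement from the dataset."""
    return [rng.choice(dataset) for _ in dataset]


def majority_vote(predictions):
    """Return the label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)        # fixed seed so the run is repeatable
dataset = list(range(10))     # placeholder for ten training samples
k = 3

bootstrap_sets = [bootstrap_sample(dataset, rng) for _ in range(k)]
for sample in bootstrap_sets:
    # Some items repeat and others are left out in each resampled set.
    print(sorted(sample))

# Votes from three hypothetical models on one test image.
print(majority_vote(["cat", "dog", "cat"]))
```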
4.2.4. Dropout
Dropout randomly removes nodes in hidden layers during training to prevent overfitting, unlike
L1 and L2 regularization, which penalize high-order terms. This technique creates different
sub-networks by removing hidden nodes. A mask vector μ is set, where each element corresponds to
an input or a hidden node. Random sampling of μ is performed, and each node is multiplied by
the corresponding mask during forward propagation. Typically, input nodes are kept with a
sampling probability of 0.8, while hidden nodes are kept with a probability of 0.5, dropping
about half of them. This improves generalization, and during testing, all nodes are used with
weights adjusted by their sampling probabilities.
Figure 21: An example of dropout.[1]
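The masking step can be sketched as follows, with the sampling probability treated as the probability of keeping a node. At test time the sketch scales the activations by that probability, which is assumed here to be equivalent to adjusting the outgoing weights as described above. The activation values and the seed are illustrative.

```python
import random


def sample_mask(n, keep_prob, rng):
    """Sample a 0/1 mask; each node is kept with probability keep_prob."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n)]


def apply_dropout(activations, mask):
    """Multiply each node's activation by its mask entry."""
    return [a * m for a, m in zip(activations, mask)]


def scale_for_test(activations, keep_prob):
    """At test time keep every node and scale by the keep probability."""
    return [a * keep_prob for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
hidden = [0.2, 1.5, 0.7, 0.9, 1.1, 0.4]

mask = sample_mask(len(hidden), keep_prob=0.5, rng=rng)
print(mask)
print(apply_dropout(hidden, mask))            # training: some nodes zeroed
print(scale_for_test(hidden, keep_prob=0.5))  # testing: all nodes, scaled
```

A new mask is sampled for every training step, so each step trains a different sub-network.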
5. CONCLUSION
This report has given a brief introduction to the fundamentals of neural networks. In the
limited space it is not possible to discuss all the topics in which neural networks
have been applied to control system problems.
Section 1 first introduced the basic concepts of deep learning, followed by
the structure and components of neural networks, the types of neural
network, and the concept of weights and biases. It then introduced a basic learning
method, linear regression, and its training process. At the end of the section, the
principle of the simplest neural network, the perceptron, was illustrated and then expanded
to multilayer neural networks.
Section 2 introduced activation functions and their types, and then
discussed loss functions and their types.
Section 3 introduced the calculation process of forward propagation and
back-propagation in neural network training. After understanding these techniques, readers
can carry out some neural network experiments.
Section 4 introduced the concept of overfitting and
how it can be mitigated with the help of regularization.
This report only outlines the basic content related to neural networks. Interested readers
can consult papers in related fields for more specific knowledge,
including regularization, loss functions, etc.