1. INTRODUCTION
a. Dendrites: - These are the input channels through which the neuron receives
signals from other neurons or external stimuli.
FUNDAMENTALS OF NEURAL NETWORKS
b. Cell Body (Soma): - This is the main processing unit of the neuron, where the
received signals are integrated and transformed into an output signal.
c. Axon: - This is the output channel through which the neuron sends its output
signal to other neurons or target cells.
a. Signal Integration: The neuron integrates the incoming signals from the
dendrites, taking into account their respective strengths (weights).
b. Activation Function: The integrated signal is then passed through an activation
function, which determines the output signal of the neuron.
c. Signal Transmission: If the output signal exceeds a certain threshold, it is
transmitted through the axon to other neurons or target cells [2].
Neuron/Node: - The basic computational unit of an ANN, which processes input data
and produces an output.
Input layer: - The initial layer of neurons, which receives the input data (also called
features).
Hidden layers: - Intermediate layers of neurons that lie between the input
and output layers.
Output layer: - The final layer of neurons, which produces the network’s output.
Weights: - Each connection between nodes is assigned a weight, which determines the
strength of the connection and its impact on the final output.
Artificial Neural Networks: Artificial Neural Networks (ANNs) are computational models inspired
by the structure and function of the human brain. They consist of interconnected nodes,
known as artificial neurons, which work together to process and analyse
complex patterns and data.
Convolutional Neural Networks: Convolutional Neural Networks (CNNs) are used for image
recognition and processing, and are particularly useful for finding patterns in images to
recognize objects, classes, and categories. They use a mathematical
operation called convolution to filter the input data and produce a feature map.
Recurrent Neural Networks: Recurrent Neural Networks (RNNs) are a type of deep learning
neural network designed to process sequential data. Unlike feedforward neural
networks, which process data in a strictly forward direction, RNNs have recurrent
connections that allow them to retain information from previous steps in the sequence.
Neurons are the basic units of a neural network. In an ANN, each neuron in a layer is
connected to some or all of the neurons in the next layer. When the inputs are
transmitted between neurons, the weights are applied to the inputs along with the bias.
(1)
Weights control the signal (or the strength of the connection) between two neurons.
In other words, a weight decides how much influence the input will have on the output.
A bias is a constant additional input into the next layer that always
has the value of 1. Bias units are not influenced by the previous layer (they do not
have any incoming connections), but they do have outgoing connections with their own
weights. The bias unit guarantees that even when all the inputs are zeros there will still
be activation in the neuron [3].
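The role of the weights and the bias can be illustrated with a minimal Python sketch. The function name `neuron_output` and all numeric values are hypothetical and chosen only for illustration; the sketch assumes the usual weighted-sum form, in which each input is multiplied by its weight and the bias is added.

```python
def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias (before any activation)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Illustrative values only; they do not come from the text.
weights = [0.4, 0.3, 0.1]
z = neuron_output([0.5, -1.0, 2.0], weights, bias=0.2)

# With all inputs at zero, only the bias contributes to the result.
z_zero = neuron_output([0.0, 0.0, 0.0], weights, bias=0.2)
```

In this sketch a larger weight gives its input more influence on `z`, and `z_zero` equals the bias, which matches the statement that the bias keeps the neuron active when every input is zero.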
$$Y = \beta_0 + \beta_1 X \qquad (2)$$
Where:
β0 is the intercept
β1 is the slope
(3)
Where:
β0 is the intercept
The goal of the algorithm is to find the best-fit line equation that can predict the values of the
dependent variable from the independent variables. In regression, a set of records with known X
and Y values is used to learn a function, and this learned function can then predict Y for a
previously unseen X. Because regression predicts a continuous Y, the function must map the
independent features X to a continuous value [8].
The best-fit line equation provides a straight line that represents the relationship between the
dependent and independent variables. The slope of the line indicates how much the dependent
variable changes for a unit change in the independent variable(s).
Here Y is called the dependent or target variable, and X is called the independent variable, also
known as the predictor of Y. Many types of functions or models can be used for regression; a
linear function is the simplest. X may be a single feature or multiple features representing the
problem.
Linear regression predicts the value of a dependent variable (y) from a given independent
variable (x) by fitting a linear relationship between them, hence the name linear regression. In
the figure above, X (input) is the work experience and Y (output) is the salary of a person. The
regression line is the best-fit line for our model.
Different values of the weights (the coefficients of the line) produce different regression lines,
so a cost function is used to compute the values that give the best-fit line [8].
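As a rough illustration of how a cost function selects the best-fit line, the sketch below fits a line to a few experience/salary pairs. The text does not state which cost function or fitting method it uses, so the mean squared error cost and the closed-form least-squares solution are assumptions here. The names `fit_line` and `cost` and the data values are also hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least-squares estimates of intercept (beta0) and slope (beta1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


def cost(xs, ys, beta0, beta1):
    """Mean squared error of the line beta0 + beta1 * x on the data."""
    return sum((beta0 + beta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: years of experience and salary (arbitrary units).
experience = [1.0, 2.0, 3.0, 4.0]
salary = [30.0, 35.0, 40.0, 45.0]

beta0, beta1 = fit_line(experience, salary)
best_cost = cost(experience, salary, beta0, beta1)
# Any other line gives a higher cost on the same data.
other_cost = cost(experience, salary, beta0, beta1 + 1.0)
```

Comparing `best_cost` with `other_cost` shows the point made in the text: different coefficients give different lines, and the cost function ranks them.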
1.5. Perceptron
A perceptron is a fundamental building block of artificial neural networks. It is a type of
artificial neuron that takes multiple inputs, applies weights to them, and produces an
output based on a specified activation function.
The perceptron algorithm is used to train the weights of a perceptron in order to make
accurate predictions. It follows these steps:
$$z = w_1 x_1 + w_2 x_2 + \dots + w_m x_m + b \qquad (5)$$
In vector form, we can write z as the dot product between the input vector
x = (x1, …, xm)ᵗ and the weight vector w = (w1, …, wm)ᵗ, plus the bias:
z = wᵗx + b (6)
An activation function f(z) is then applied to the net input to generate a binary output
(0/1 or -1/+1). We can write this entire computation in one equation:
o = f(wᵗx + b) (7)
where f is the chosen activation function and o is the output of the perceptron [4].
wᵗx + b = 0 (8)
The weight vector w is orthogonal to this hyperplane, and thus determines its
orientation, while the bias b defines its distance from the origin.
Every example above the hyperplane (wᵗx + b > 0) is classified by the perceptron as a
positive example, while every example below the hyperplane (wᵗx + b < 0) is classified
as a negative example.
Linear classifiers are capable of learning only linearly separable problems, i.e.,
problems where the decision boundary between the positive and the negative examples
is a linear surface [4].
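The decision rule and the limits of linear separability can be sketched in Python. The training steps of the perceptron algorithm are not reproduced in this text, so the update used below is the classic perceptron learning rule and should be read as an assumption. The function names, the learning rate, and the epoch count are likewise hypothetical.

```python
def predict(weights, bias, x):
    """Return 1 if the point lies above the hyperplane w.x + b = 0, else 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0


def train_perceptron(samples, labels, lr=1.0, epochs=20):
    """Classic perceptron rule: nudge w and b by the error on each sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


points = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Logical AND is linearly separable, so training finds a separating line.
and_w, and_b = train_perceptron(points, [0, 0, 0, 1])
and_predictions = [predict(and_w, and_b, p) for p in points]

# Logical XOR is not linearly separable, so some point is always misclassified.
xor_w, xor_b = train_perceptron(points, [0, 1, 1, 0])
xor_predictions = [predict(xor_w, xor_b, p) for p in points]
```

On the AND labels the learned weights classify all four points correctly, while on the XOR labels no choice of weights and bias can do so, whatever the number of epochs.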
For example, the following data set is not linearly separable, so a perceptron
cannot correctly classify all of its examples:
It tries to minimize the error. But the MLP has a different process called
backpropagation.
2. Pass the training instances through the input layer → hidden layer → output layer
3. Compute an output error based on the output from the output layer
4. Go through the network in a reverse order to measure how each connection is related
to the output error
7. Repeat step 6 until one pass over the entire data set is complete [5].
The sigmoid function is one of the most commonly used activation functions. Its mathematical
representation is
$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (9)$$
Sigmoid functions, commonly used in logistic regression and basic
neural networks, serve as introductory activation units in machine learning. However,
they are less suitable for deep neural networks because of drawbacks such as the
vanishing gradient problem. The derivative of the sigmoid is at most 0.25 and approaches
zero for inputs of large magnitude, so the gradient signal shrinks at each layer it passes
through, and this loss compounds in deeper networks. The output of the sigmoid is also
always positive (not zero-centered), which further hampers gradient-based training and
makes it less suitable for early layers. It can still be used in the last layer [7].
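A short Python sketch makes the vanishing gradient argument concrete. The function names are hypothetical, and the ten-layer product is only a rough illustration that ignores the weights, not a full backpropagation.

```python
import math


def sigmoid(x):
    """Sigmoid activation: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid, sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The derivative peaks at x = 0 and is already tiny at x = 10.
peak = sigmoid_derivative(0.0)
saturated = sigmoid_derivative(10.0)

# Even in the best case, ten chained sigmoid derivatives shrink the gradient a lot.
best_case_after_10_layers = peak ** 10
```

Because `peak` is the largest value the derivative can take, `best_case_after_10_layers` is an upper bound on the activation part of the gradient after ten sigmoid layers.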
The graph of the tanh function is shown in Fig. 5. Compared to the sigmoid function, the tanh
function is zero-centered, as it transforms the input into the symmetric range (−1, 1).
Therefore, the tanh function solves the non-zero-centered problem of the sigmoid function.
However, when the input is very large or very small, the output of the tanh function saturates
and its gradient becomes very small, which is not conducive to the weight update. Thus, the tanh
function does not solve the vanishing gradient problem either [6].
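Both properties of tanh, the zero-centered output and the saturating gradient, can be checked with a few lines of Python. This is a minimal sketch using the standard library `math.tanh`; the helper name `tanh_derivative` is hypothetical.

```python
import math


def tanh_derivative(x):
    """Derivative of tanh, 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


# Zero-centered: outputs are symmetric around 0.
left, centre, right = math.tanh(-2.0), math.tanh(0.0), math.tanh(2.0)

# Saturation: the gradient is largest at 0 and very small for large |x|.
grad_at_zero = tanh_derivative(0.0)
grad_at_five = tanh_derivative(5.0)
```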
The Rectified Linear Unit (ReLU) was first applied to the restricted Boltzmann machine
and is currently one of the most widely used activation functions. When the input is
negative, the output of the ReLU function is 0; otherwise, the output is equal to the
input. Its formal definition is
$$f(x) = \max(0, x)$$
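$$f(x) = \begin{cases} x, & x > 0 \\ \alpha \left( e^{x} - 1 \right), & x \le 0 \end{cases} \qquad (13)$$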
where α is a tunable coefficient that controls the saturation position of the ELU in the
negative domain. The mean value of the ELU output shifts towards zero, which speeds up model
convergence. When x > 0, the ELU function outputs y = x, which avoids the vanishing gradient
problem. When x ≤ 0, the ELU is left soft saturated, as shown in Fig. 7, which prevents neurons
from dying. The disadvantage of the ELU is that it involves exponential computation, so its
computational cost is relatively high [6].
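The behaviour of ReLU and ELU described above can be compared in a small Python sketch. The ELU form used here, x for positive inputs and α(e^x − 1) otherwise, is the standard definition and is assumed to match the report's equation. The function names and the default `alpha=1.0` are hypothetical.

```python
import math


def relu(x):
    """ReLU: 0 for negative inputs, the input itself otherwise."""
    return x if x > 0 else 0.0


def elu(x, alpha=1.0):
    """ELU: identity for x > 0, alpha * (e^x - 1) for x <= 0."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


# For positive inputs both functions pass the value through unchanged.
positive = (relu(2.0), elu(2.0))

# For negative inputs ReLU outputs exactly 0, while ELU saturates softly at -alpha.
negative = (relu(-50.0), elu(-50.0, alpha=1.0))
```

The negative branch of `elu` is where the exponential is evaluated, which is the extra computational cost mentioned in the text.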
Several different loss functions in neural networks exist, such as mean squared error,
binary cross-entropy loss and hinge loss. Mean squared error is often used for
regression tasks, while binary cross-entropy loss is commonly used for classification
tasks [7].
Mean squared error (MSE) loss is a common loss function for regression tasks. It is
calculated as the average of the squared differences between the predicted and true
values over all samples in the dataset. The MSE loss is used to evaluate the
performance of a model on a regression task and is often used as the optimization
objective when training a model.
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_{\mathrm{pred},i} - y_{\mathrm{true},i} \right)^{2} \qquad (14)$$
where n is the number of samples and ypred and ytrue are the predicted and true output
values, respectively [7].
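The MSE computation described above can be sketched in a few lines of Python. The function name `mse` and the sample values are hypothetical.

```python
def mse(y_pred, y_true):
    """Mean of the squared differences between predictions and true values."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    n = len(y_true)
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n


# Illustrative values: errors of 1 and 2 give squared errors of 1 and 4.
loss = mse([2.0, 4.0], [1.0, 2.0])
perfect = mse([1.0, 2.0], [1.0, 2.0])
```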
Binary cross-entropy (BCE) loss is a loss function used for binary classification tasks. It
is defined as the negative log of the probability that the model assigns to the true class,
averaged over the samples.
$$\mathrm{BCE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] \qquad (15)$$
where yi is the true label (0 or 1) and pi is the predicted probability of the positive class
(a value between 0 and 1) [7].
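A minimal Python sketch of binary cross-entropy follows. It assumes the standard form in which each prediction is the probability of the positive class (label 1). The function name and the `eps` clipping, added only to avoid taking log(0), are hypothetical choices.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


labels = [1, 0]
confident_and_right = binary_cross_entropy([0.9, 0.1], labels)
undecided = binary_cross_entropy([0.5, 0.5], labels)
confident_and_wrong = binary_cross_entropy([0.1, 0.9], labels)
```

The three results increase in that order: the loss is small when high probability is given to the true class and grows quickly when the model is confidently wrong.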
Hinge loss is a loss function used to train linear classifiers, such as support vector
machines (SVMs). It is the maximum of zero and one minus the product of the true label
and the predicted margin. The loss is therefore zero for examples that are classified
correctly with a margin of at least 1, and it grows linearly as a prediction moves to the
wrong side of that margin.
$$L_{\mathrm{hinge}} = \max \left( 0,\; 1 - y_{\mathrm{true}} \cdot y_{\mathrm{pred}} \right) \qquad (16)$$
where ypred is the predicted margin (the signed score output by the classifier) and ytrue is
the true label (-1 or 1) [7].
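The hinge loss can be sketched in Python as follows. The sketch assumes the standard form max(0, 1 − ytrue · ypred), averaged over the samples. The function name and the scores are hypothetical.

```python
def hinge_loss(scores, labels):
    """Mean of max(0, 1 - y_true * y_pred) over all samples; labels are -1 or +1."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    return sum(max(0.0, 1.0 - y * s) for s, y in zip(scores, labels)) / len(labels)


# Correct with a margin of at least 1: no loss.
safe = hinge_loss([2.0], [1])
# Correct but inside the margin: small loss.
inside_margin = hinge_loss([0.5], [1])
# Wrong side of the decision boundary: larger loss.
wrong_side = hinge_loss([0.5], [-1])
```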
Neural networks are trained to minimize the difference between predicted and true values
through forward and backward propagation. Forward propagation computes hidden layer outputs
iteratively from input vectors, weights, and activation functions, extracting features. Backward
propagation calculates loss based on forward results, and then adjusts weights and biases using
gradient descent and the chain rule to minimize the loss function [6].
(17)
The corresponding bias vector of the connections between the hidden layer and the output layer
is b(2), and the weight matrix is
(18)
In this neural network, the sigmoid function is used as the activation function,
$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (19)$$
In the forward propagation process of computing the hidden layer given the input, the transpose
of the weight matrix w(1) is multiplied by the input vector x, and then the bias vector b(1) is added:
$$v = \left( w^{(1)} \right)^{T} x + b^{(1)} \qquad (20)$$
With the sigmoid activation function, the output of the hidden layer is
$$h(x) = \frac{1}{1 + e^{-v}} \qquad (21)$$
The forward propagation process of computing the output layer given the hidden layer is similar
[1].
(22)
As the weights W are randomly initialized, the value of the loss function is large.
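The forward pass described above can be sketched in plain Python. The network size (2 inputs, 3 hidden nodes, 2 outputs) and every numeric value below are hypothetical, because the report's actual weight matrices are not reproduced in this text. The sketch assumes each layer computes the sigmoid of the transposed weight matrix times its input, plus the bias.

```python
import math


def sigmoid(v):
    """Sigmoid activation: 1 / (1 + e^(-v))."""
    return 1.0 / (1.0 + math.exp(-v))


def layer_forward(weights, bias, inputs):
    """Compute sigmoid(W^T x + b); weights[i][j] links input i to output j."""
    outputs = []
    for j in range(len(bias)):
        v = sum(weights[i][j] * inputs[i] for i in range(len(inputs))) + bias[j]
        outputs.append(sigmoid(v))
    return outputs


# Hypothetical parameters, not the values used in the report.
x = [1.0, 0.0]
W1 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]    # 2 inputs -> 3 hidden nodes
b1 = [0.0, 0.0, 0.0]
W2 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 hidden nodes -> 2 outputs
b2 = [0.0, 0.0]

h = layer_forward(W1, b1, x)       # hidden layer output
y_hat = layer_forward(W2, b2, h)   # output layer, computed the same way
```

The output layer reuses the same function as the hidden layer, which mirrors the remark that the two computations are similar.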
To measure the impact of W on the loss function, we take the weight between the second node of
the hidden layer and the first node of the output layer, $w^{(2)}_{2,1}$ (denoted as ω), as an
example, and we compute the partial derivative of the loss function L(W) with respect to ω using
the chain rule. First, the partial derivative of the loss L(W) with respect to ŷ1 is computed,
then the partial derivative of ŷ1 with respect to z1, and then the partial derivative of z1 with
respect to ω. Finally, the three derivatives are multiplied together [1]:
$$\frac{\partial L(W)}{\partial \omega} = \frac{\partial L(W)}{\partial \hat{y}_1} \cdot \frac{\partial \hat{y}_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial \omega} \qquad (23)$$
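The three-factor chain rule can be checked numerically with a small Python sketch. The report's own loss function and numeric values are not reproduced in this text, so the sketch assumes a squared-error loss, 0.5 · (ŷ1 − y1)², and a sigmoid output node. All names and numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output_and_loss(omega, h, w_other, bias, target):
    """Return (y_hat_1, loss) for output node 1, where omega multiplies h[1]."""
    z1 = w_other[0] * h[0] + omega * h[1] + w_other[1] * h[2] + bias
    y_hat = sigmoid(z1)
    return y_hat, 0.5 * (y_hat - target) ** 2  # assumed squared-error loss


def grad_omega(omega, h, w_other, bias, target):
    """Chain rule: dL/d(omega) = dL/dy_hat * dy_hat/dz1 * dz1/d(omega)."""
    y_hat, _ = output_and_loss(omega, h, w_other, bias, target)
    dL_dy = y_hat - target           # derivative of the assumed loss
    dy_dz = y_hat * (1.0 - y_hat)    # derivative of the sigmoid
    dz_dw = h[1]                     # z1 is linear in omega
    return dL_dy * dy_dz * dz_dw


# Hypothetical hidden outputs and parameters.
h = [0.6, 0.7, 0.8]
w_other = [0.1, -0.2]
omega, bias, target = 0.3, 0.05, 1.0

analytic = grad_omega(omega, h, w_other, bias, target)

# Central finite difference as an independent check of the chain rule result.
eps = 1e-6
loss_plus = output_and_loss(omega + eps, h, w_other, bias, target)[1]
loss_minus = output_and_loss(omega - eps, h, w_other, bias, target)[1]
numeric = (loss_plus - loss_minus) / (2 * eps)
```

Under these assumptions the analytic and numeric gradients agree closely, and a gradient descent step would move ω in the direction opposite to `analytic`.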
Combined with the example in the previous subsection, the partial derivative of the loss function
with respect to ω is calculated. The overall loss function is computed as
(24)
4.1. Overfitting
Overfitting occurs when a model fits the training data closely but fails to generalize
to new data, resulting in low training error but high test error. This is common in deep
learning models with excessive layers and parameters. Underfitting, conversely, happens
when the model is too simple, or the training features too limited, to capture the
underlying structure of the data, leading to high training error. Underfitting can be
addressed by adding more informative features or using a more sophisticated model.
Examples illustrate appropriate fitting (an arc with small error), underfitting (a simple
straight line with large training error), and overfitting (an irregular curve).
4.2. Regularization
4.2.1. Parameter norm penalty
A parameter norm penalty is a form of regularization that adds a penalty term to the
loss function during training to discourage large parameter values; in its L2 form it is
also known as weight decay. This penalty helps prevent overfitting by encouraging the
model to learn simpler patterns, and it reduces the model's sensitivity to noise in the
training data. Commonly used parameter norm penalties include L1 regularization (which
adds the sum of the absolute values of the weights to the loss) and L2 regularization
(which adds the sum of the squared values of the weights to the loss) [1].
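The two penalties can be written as short Python functions. The regularization strength `lam` and the function names are hypothetical, and some formulations scale the L2 term by an extra factor of 1/2, which is omitted in this sketch.

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam, kind="l2"):
    """Add the chosen penalty to the loss computed on the training data."""
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return data_loss + penalty


weights = [1.0, -2.0, 3.0]
with_l1 = regularized_loss(0.5, weights, lam=0.1, kind="l1")  # 0.5 + 0.1 * 6
with_l2 = regularized_loss(0.5, weights, lam=0.1, kind="l2")  # 0.5 + 0.1 * 14
```

Since larger weights raise the total loss, minimizing it pushes the optimizer towards smaller parameter values.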
4.2.2. Sparsification
Sparsification involves making many weights or neurons in a neural network become
zero during training, with some techniques achieving up to 90% sparsity. This reduces
computation during forward propagation, as calculations on zero weights or neurons
can be skipped. Sparsification is typically achieved by adding penalty terms to the
training process [1].
4.2.3. Bagging
Bagging (bootstrap aggregating) involves training multiple models that jointly vote on the output
for test samples, with the aim of improving performance. It constructs k different datasets by
sampling with replacement from the original dataset, each of the same size. Bagging allows the
use of the same or different models, training algorithms, and objective functions. For example,
if one pretrained neural network model struggles with cat recognition, two other models with
different parameters and network topologies can be built. These models can also employ different
machine learning methods, such as neural networks, support vector machines, or decision trees.
The final recognition result can be obtained by averaging the outputs or by training an
additional classifier to choose among them, ultimately reducing recognition error [1].
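The two mechanical parts of bagging, building the k datasets and combining the model outputs, can be sketched in Python. The sketch assumes sampling with replacement and a simple majority vote. The function names, the seed, and the example predictions are hypothetical, and the training of the individual models is omitted.

```python
import random
from collections import Counter


def bootstrap_samples(dataset, k, seed=0):
    """Build k datasets by sampling with replacement, each as large as the original."""
    rng = random.Random(seed)
    n = len(dataset)
    return [[dataset[rng.randrange(n)] for _ in range(n)] for _ in range(k)]


def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]


dataset = list(range(10))                # stand-in for real training records
samples = bootstrap_samples(dataset, k=3)

# Hypothetical outputs of three separately trained models for one test image.
final_label = majority_vote(["cat", "dog", "cat"])
```

Averaging numeric outputs instead of voting, as the text also suggests, would replace `majority_vote` with a mean over the model outputs.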
4.2.4. Dropout
Dropout randomly removes nodes in hidden layers during training to prevent overfitting, unlike
L1 and L2 regularization, which penalize large weights. This technique creates different
subnetworks by removing hidden nodes. A mask vector μ is set, where each element corresponds to
an input or a hidden node. μ is sampled randomly, and each node is multiplied by the
corresponding mask element during forward propagation. Typically, input nodes are kept with a
sampling probability of 0.8 and hidden nodes with 0.5, so about half of the hidden nodes are
dropped. This improves the performance of the trained model, and during testing all nodes are
used, with weights scaled by their sampling probabilities.
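The masking step and the test-time scaling can be sketched in Python. The sketch reads the sampling probability as the probability of keeping a node, and it scales activations at test time, which has the same effect as scaling the outgoing weights. The function names and activation values are hypothetical.

```python
import random


def sample_mask(n, keep_prob, rng):
    """Sample a 0/1 mask; each node is kept with probability keep_prob."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n)]


def dropout_train(activations, mask):
    """Training: multiply each node by its mask element (0 drops the node)."""
    return [a * m for a, m in zip(activations, mask)]


def dropout_test(activations, keep_prob):
    """Testing: keep every node, scaled by the keep probability."""
    return [a * keep_prob for a in activations]


rng = random.Random(0)
hidden = [0.2, 0.9, 0.4, 0.7]            # illustrative hidden activations

mask = sample_mask(len(hidden), keep_prob=0.5, rng=rng)
train_out = dropout_train(hidden, mask)  # a random subnetwork
test_out = dropout_test(hidden, keep_prob=0.5)
```

Each new call to `sample_mask` selects a different subnetwork, while `dropout_test` approximates the average over those subnetworks.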
5. CONCLUSION
This report has given a brief introduction to the fundamentals of neural networks. In the
limited space it is not possible to discuss all the topics in which neural networks
have been applied to control system problems.
Section 1 first introduces the easily confused concepts of deep learning, then the
structure and components of neural networks, the types of neural network, and the
concept of weights and biases. It then introduces the basic deep learning method,
linear regression, and its training process. At the end of the section, the principle of
the simplest neural network, the perceptron, is illustrated and then expanded to
two-layer and multilayer deep neural networks.
Section 2 introduces activation functions and their types. At the end of the section,
the concept of loss functions and their types is discussed.
Section 3 introduces the calculation process of forward propagation and
backpropagation in neural network training. After understanding these techniques,
readers can carry out some neural network experiments.
Section 4 introduces the concept of overfitting and, finally, how overfitting can be
addressed with the help of regularization.
This report only outlines the basic content related to neural networks. Interested readers
can consult papers in related fields for more specific knowledge, including
regularization, loss functions, etc.