

Unit –I
INTRODUCTION
Introduction to machine learning- Linear models (SVMs and Perceptron, logistic regression)-
Intro to Neural Nets: What a shallow network computes- Training a network: loss functions,
back propagation and stochastic gradient descent- Neural networks as universal function
approximators.

Introduction to machine learning

Machine learning is a growing technology that enables computers to learn automatically
from past data. Machine learning uses various algorithms for building mathematical models
and making predictions using historical data or information. Currently, it is being used
for various tasks such as image recognition, speech recognition, email filtering, Facebook
auto-tagging, recommender systems, and many more.

This machine learning tutorial gives you an introduction to machine learning along with a
wide range of machine learning techniques such as Supervised, Unsupervised,
and Reinforcement learning. You will learn about regression and classification models,
clustering methods, hidden Markov models, and various sequential models.

Working of Machine Learning

A Machine Learning system learns from historical data, builds prediction models,
and, whenever it receives new data, predicts the output for it. The accuracy of the predicted
output depends upon the amount of data, as a larger amount of data helps build a better
model that predicts the output more accurately.

Suppose we have a complex problem where we need to make some predictions. Instead
of writing code for it, we just need to feed the data to generic algorithms, and with the help
of these algorithms, the machine builds the logic as per the data and predicts the output. Machine
learning has changed our way of thinking about such problems. The block diagram below
explains the working of a Machine Learning algorithm:

Features of Machine Learning:

o Machine learning uses data to detect various patterns in a given dataset.


o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining, as both deal with huge amounts of data.

Need for Machine Learning

The need for machine learning is increasing day by day. The reason is that machine
learning is capable of doing tasks that are too complex for a person to implement directly.
As humans, we have limitations: we cannot access huge amounts of data manually. For this
we need computer systems, and machine learning makes things easy for us.

We can train machine learning algorithms by providing them huge amounts of data, letting
them explore the data, construct models, and predict the required output automatically.
The performance of a machine learning algorithm depends on the amount of data, and it can
be measured by the cost function. With the help of machine learning, we can save both time
and money.

The importance of machine learning can be easily understood by its use cases. Currently,
machine learning is used in self-driving cars, cyber fraud detection, face recognition,
friend suggestion by Facebook, and so on. Various top companies such as Netflix and
Amazon have built machine learning models that use a vast amount of data to analyze
user interest and recommend products accordingly.

Following are some key points which show the importance of Machine Learning:

o Rapid increase in the production of data
o Solving complex problems which are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data

Classification of Machine Learning

At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning


1) Supervised Learning

Supervised learning is a type of machine learning method in which we provide sample
labeled data to the machine learning system in order to train it, and on that basis, it predicts
the output.

The system creates a model using labeled data to understand the datasets and learn about each
data point. Once the training and processing are done, we test the model by providing sample
data to check whether it predicts the correct output.

The goal of supervised learning is to map input data to the output data. Supervised
learning is based on supervision, and it is the same as when a student learns under the
supervision of a teacher. An example of supervised learning is spam filtering.

Supervised learning can be grouped further in two categories of algorithms:

o Classification
o Regression

2) Unsupervised Learning

Unsupervised learning is a learning method in which a machine learns without any


supervision.

The training is provided to the machine with the set of data that has not been labelled,
classified, or categorized, and the algorithm needs to act on that data without any supervision.
The goal of unsupervised learning is to restructure the input data into new features or a group
of objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The machine tries to find
useful insights from a huge amount of data. It can be further classified into two categories
of algorithms:

o Clustering
o Association

3) Reinforcement Learning

Reinforcement learning is a feedback-based learning method, in which a learning agent gets a
reward for each right action and a penalty for each wrong action. The agent learns
automatically from this feedback and improves its performance. In reinforcement learning,
the agent interacts with the environment and explores it. The goal of the agent is to collect the
most reward points, and in doing so it improves its performance.

A robotic dog that automatically learns the movement of its arms is an example of
Reinforcement learning.


Linear models:

Support Vector Machine

Support Vector Machine, or SVM, is one of the most popular Supervised Learning
algorithms, used for Classification as well as Regression problems. However, it is
primarily used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is called
a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed a
Support Vector Machine. Consider the diagram below, in which two different
categories are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs. If we want a model
that can accurately identify whether it is a cat or a dog, such a model can be created by
using the SVM algorithm. We will first train our model with lots of images of cats and dogs
so that it can learn their different features, and then we test it with this strange creature.
The SVM creates a decision boundary between these two classes (cat and dog) and chooses
the extreme cases (support vectors) of each. On the basis of the support vectors, it will
classify the creature as a cat. Consider the diagram below:


The SVM algorithm can be used for face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then such data
is termed linearly separable data, and the classifier used is called a Linear SVM
classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find the best decision boundary that helps to classify
the data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features present in the dataset:
if there are 2 features (as shown in the image), then the hyperplane will be a straight line,
and if there are 3 features, then the hyperplane will be a 2-dimensional plane.

We always create the hyperplane that has a maximum margin, which means the maximum
distance between the hyperplane and the nearest data points.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect the position
of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane,
they are called support vectors.


How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2.
We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue.
Consider the image below:

Since it is a 2-d space, just by using a straight line we can easily separate these two classes.
But there can be multiple lines that can separate these classes. Consider the image below:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the
lines from both classes. These points are called support vectors. The distance between the
vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this
margin. The hyperplane with maximum margin is called the optimal hyperplane. A minimal
code sketch of such a classifier follows.
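Here is a minimal sketch of a linear SVM using scikit-learn (the use of scikit-learn, the toy
data, and the variable names are all assumptions made for illustration):

import numpy as np
from sklearn.svm import SVC

# Toy 2-D dataset: two tags (blue = 0, green = 1) with features x1 and x2.
X = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 1.0],   # class 0
              [4.0, 5.0], [5.0, 4.5], [4.5, 5.5]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# kernel='linear' fits a maximum-margin hyperplane (a straight line in 2-D).
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.support_vectors_)        # the extreme points that define the margin
print(clf.predict([[3.0, 3.0]]))   # classify a new data point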


Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have
used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can
be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as shown in the image below:


So now, SVM will divide the datasets into classes in the following way. Consider the below
image:

Since we are in 3-d space, the separator looks like a plane parallel to the x-axis. If we convert
it back to 2-d space with z = 1, the boundary becomes a circle. A small numeric sketch of this
lift follows.
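The sketch below (illustrative data only) shows how the lift z = x² + y² makes two
circularly arranged classes linearly separable:

import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 50)

inner = np.c_[0.5 * np.cos(theta), 0.5 * np.sin(theta)]  # class 0: small circle
outer = np.c_[2.0 * np.cos(theta), 2.0 * np.sin(theta)]  # class 1: large circle

def lift(points):
    # Append the third dimension z = x^2 + y^2 to each (x, y) point.
    z = points[:, 0] ** 2 + points[:, 1] ** 2
    return np.c_[points, z]

# After the lift, the plane z = 1 separates the two classes perfectly.
print(lift(inner)[:, 2].max())  # 0.25, below z = 1
print(lift(outer)[:, 2].min())  # 4.0,  above z = 1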


Perceptron
It is the primary step in learning Machine Learning and Deep Learning technologies and
consists of a set of weights, input values or scores, and a threshold. The Perceptron is a
building block of an Artificial Neural Network.

The Perceptron is a Machine Learning algorithm for the supervised learning of various binary
classification tasks. Further, a Perceptron can also be understood as an artificial neuron or
neural network unit that helps to detect certain input data computations in business
intelligence.

The perceptron model is also treated as one of the best and simplest types of Artificial Neural
Networks. It is a supervised learning algorithm for binary classifiers. Hence, we can
consider it a single-layer neural network with four main parameters: input values,
weights and bias, net sum, and an activation function.

Binary classifier in Machine Learning

In Machine Learning, binary classifiers are defined as functions that decide whether an
input, represented as a vector of numbers, belongs to some specific class.

Binary classifiers can be considered linear classifiers. In simple words, we can understand
a binary classifier as a classification algorithm whose prediction is based on a linear
predictor function combining a weight vector with the feature vector.

Basic Components of Perceptron

Frank Rosenblatt invented the perceptron model as a binary classifier, which contains
three main components. These are as follows:

o Input Nodes or Input Layer:

This is the primary component of Perceptron which accepts the initial data into the system for
further processing. Each input node contains a real numerical value.

o Weight and Bias:


The weight parameter represents the strength of the connection between units. This is another
important parameter of the Perceptron. A weight is directly proportional to the strength of
the associated input neuron in deciding the output. Further, the bias can be considered as
the intercept in a linear equation.

o Activation Function:

This is the final and most important component, which helps to determine whether the neuron
will fire or not. The activation function can be considered primarily as a step function.

Types of Activation functions:

o Sign function
o Step function, and
o Sigmoid function

The data scientist uses the activation function to make a decision based on the problem
statement and the desired outputs. The activation function may differ (e.g., Sign, Step, or
Sigmoid) between perceptron models, depending on whether the learning process is slow or
has vanishing or exploding gradients.

How does Perceptron work?

In Machine Learning, the Perceptron is considered a single-layer neural network that consists
of four main parameters: input values (input nodes), weights and bias, net sum, and an
activation function. The perceptron model begins by multiplying all input values by their
weights and adding these values together to create the weighted sum. This weighted sum is
then applied to the activation function 'f' to obtain the desired output. This activation
function is also known as the step function and is represented by 'f'.


This step function or Activation function plays a vital role in ensuring that output is mapped
between required values (0,1) or (-1,1). It is important to note that the weight of input is
indicative of the strength of a node. Similarly, an input's bias value gives the ability to shift
the activation function curve up or down.

Perceptron model works in two important steps as follows:

Step-1

In the first step, multiply all input values by their corresponding weight values and then add
them to determine the weighted sum. Mathematically, we can calculate the weighted sum as
follows:

∑wi*xi = x1*w1 + x2*w2 + … + xn*wn

Add a special term called bias 'b' to this weighted sum to improve the model's performance.

∑wi*xi + b

Step-2

In the second step, an activation function is applied to the above-mentioned weighted sum,
which gives us an output either in binary form or as a continuous value, as follows:

Y = f(∑wi*xi + b)
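As a minimal Python sketch of these two steps (the function names and the simple step
activation are illustrative assumptions):

def step(value, threshold=0.0):
    # Step activation: fire (1) if the value exceeds the threshold, else 0.
    return 1 if value > threshold else 0

def perceptron_output(inputs, weights, bias):
    # Step 1: weighted sum  sum(w_i * x_i) + b
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: apply the activation function f
    return step(weighted_sum)

print(perceptron_output([1, 0, 1], [0.7, 0.6, 0.5], bias=-1.0))  # -> 1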

Types of Perceptron Models

Based on the layers, Perceptron models are divided into two types. These are as follows:

1. Single-layer Perceptron Model
2. Multi-layer Perceptron model

Single Layer Perceptron Model:

This is one of the easiest types of Artificial Neural Networks (ANNs). A single-layered
perceptron model consists of a feed-forward network and includes a threshold transfer
function inside the model. The main objective of the single-layer perceptron model is to
analyze linearly separable objects with binary outcomes.

In a single-layer perceptron model, the algorithm has no prior recorded data, so it begins
with randomly allocated weight parameters. It then sums up all the weighted inputs. If the
total sum of all inputs is more than a pre-determined value, the model is activated and
shows the output value as +1.

If the outcome matches the pre-determined threshold value, the performance of this
model is stated as satisfied, and the weights do not change. However, this model
shows discrepancies when multiple weighted input values are fed into it. Hence, to find
the desired output and minimize errors, some changes to the weights may be necessary.

"Single-layer perceptron can learn only linearly separable patterns."

Multi-Layered Perceptron Model:

Like a single-layer perceptron model, a multi-layer perceptron model has the same
structure but a greater number of hidden layers.

The multi-layer perceptron model is also known as the Backpropagation algorithm, which
executes in two stages as follows:

o Forward Stage: Activations start from the input layer in the forward stage
and terminate at the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per
the model's requirement. In this stage, the error between the actual and desired output
is propagated backward, starting from the output layer and ending at the input layer.

Hence, a multi-layered perceptron model can be considered as multiple artificial neural
network layers in which the activation function does not remain linear, unlike in a single-
layer perceptron model. Instead of linear, the activation function can be sigmoid,
TanH, ReLU, etc., for deployment.

A multi-layer perceptron model has greater processing power and can process linear and non-
linear patterns. Further, it can also implement logic gates such as AND, OR, XOR, NAND,
NOT, XNOR, NOR.

Advantages of Multi-Layer Perceptron:

o A multi-layered perceptron model can be used to solve complex non-linear problems.
o It works well with both small and large input data.
o It helps us to obtain quick predictions after training.
o It helps to obtain the same accuracy ratio with large as well as small data.


Disadvantages of Multi-Layer Perceptron:

o In a multi-layer perceptron, computations are difficult and time-consuming.
o In a multi-layer perceptron, it is difficult to predict how much each independent
variable affects the dependent variable.
o The model's functioning depends on the quality of the training.

Perceptron Function

The perceptron function ''f(x)'' is obtained by multiplying the input vector 'x' with the
learned weight coefficients 'w' and adding the bias 'b'.

Mathematically, we can express it as follows:

f(x) = 1 if w·x + b > 0

f(x) = 0 otherwise

o 'w' represents the real-valued weight vector
o 'b' represents the bias
o 'x' represents the vector of input values.

Characteristics of Perceptron

The perceptron model has the following characteristics.

1. Perceptron is a machine learning algorithm for the supervised learning of binary
classifiers.
2. In a Perceptron, the weight coefficients are automatically learned.
3. Initially, weights are multiplied with input features, and then the decision is made
whether the neuron fires or not.
4. The activation function applies a step rule to check whether the weighted sum is
greater than zero.
5. A linear decision boundary is drawn, enabling the distinction between the two
linearly separable classes +1 and -1.
6. If the added sum of all input values is more than the threshold value, the neuron
produces an output signal; otherwise, no output is shown.

Limitations of Perceptron Model

A perceptron model has limitations as follows:


o The output of a perceptron can only be a binary number (0 or 1) due to the hard limit
transfer function.
o Perceptron can only be used to classify the linearly separable sets of input vectors. If
input vectors are non-linear, it is not easy to classify them properly.

Perceptron Example

Imagine a perceptron (in your brain).

The perceptron tries to decide if you should go to a concert.

Is the artist good? Is the weather good?

What weights should these facts have?

Criteria             Input           Weight

Artist is Good       x1 = 0 or 1     w1 = 0.7
Weather is Good      x2 = 0 or 1     w2 = 0.6
Friend will Come     x3 = 0 or 1     w3 = 0.5
Food is Served       x4 = 0 or 1     w4 = 0.3
Alcohol is Served    x5 = 0 or 1     w5 = 0.4

The Perceptron Algorithm

Frank Rosenblatt suggested this algorithm:

1. Set a threshold value
2. Multiply all inputs with their weights
3. Sum all the results
4. Activate the output


1. Set a threshold value:

 Threshold = 1.5

2. Multiply all inputs with their weights:

 x1 * w1 = 1 * 0.7 = 0.7
 x2 * w2 = 0 * 0.6 = 0
 x3 * w3 = 1 * 0.5 = 0.5
 x4 * w4 = 0 * 0.3 = 0
 x5 * w5 = 1 * 0.4 = 0.4

3. Sum all the results:

 0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 (The Weighted Sum)

4. Activate the Output:

 Return true if the sum > 1.5 ("Yes I will go to the Concert")
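The same four steps, worked out in Python with the weights and inputs from the table above
(a small illustrative script):

inputs  = [1, 0, 1, 0, 1]        # artist good, weather bad, friend comes, no food, alcohol served
weights = [0.7, 0.6, 0.5, 0.3, 0.4]
threshold = 1.5

weighted_sum = sum(x * w for x, w in zip(inputs, weights))  # 0.7 + 0.5 + 0.4
go_to_concert = weighted_sum > threshold

print(round(weighted_sum, 2))   # 1.6 (the weighted sum)
print(go_to_concert)            # True -> "Yes I will go to the Concert"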

Logistic Regression

o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore,
the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1,
True or False, etc.; but instead of giving the exact values 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
o Logistic Regression is much like Linear Regression, except in how it is
used. Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight,
etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.

15
Unit-I INTRODUCTION Lecture Notes

o Logistic Regression can be used to classify observations using different types of
data and can easily determine the most effective variables for the classification.
The image below shows the logistic function:

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within the range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, and since it cannot go
beyond this limit, it forms a curve like the letter "S". This S-form curve is called the
sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the
probability of either 0 or 1. Values above the threshold tend to 1, and values below
the threshold tend to 0.
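A minimal Python sketch of the sigmoid and the threshold rule (the 0.5 threshold is an
illustrative assumption; in practice it can be tuned):

import math

def sigmoid(z):
    # Map any real value into the (0, 1) range.
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    probability = sigmoid(z)
    return 1 if probability >= threshold else 0

print(sigmoid(0.0))    # 0.5 -- the midpoint of the S-curve
print(classify(2.0))   # 1   -- probability ~0.88 is above the threshold
print(classify(-2.0))  # 0   -- probability ~0.12 is below the threshold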

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.
o The independent variables should not have multi-collinearity.

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:

o We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + … + bnxn

o In Logistic Regression, y can be between 0 and 1 only, so let's divide the
above equation by (1-y):

y/(1-y); 0 for y = 0, and infinity for y = 1

o But we need a range between -[infinity] and +[infinity], so we take the logarithm of the
equation, and it becomes:

log[y/(1-y)] = b0 + b1x1 + b2x2 + … + bnxn

The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cats", "dogs", or "sheep".
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as "low", "Medium", or "High".
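For instance, a binomial logistic regression can be fitted in a few lines with scikit-learn
(the use of scikit-learn and the tiny dataset are assumptions made for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])  # one independent variable
y = np.array([0, 0, 0, 1, 1, 1])                          # categorical outcome: 0 or 1

model = LogisticRegression()
model.fit(X, y)

print(model.predict([[2.0]]))        # predicted class (0 or 1)
print(model.predict_proba([[2.0]]))  # probabilistic values between 0 and 1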

Intro to Neural Nets: What a shallow network computes

o Neural networks are used to mimic the basic functioning of the human brain and are
inspired by how the human brain interprets information. They are used to solve various
real-time tasks because of their ability to perform computations quickly and their fast
responses.


An Artificial Neural Network model contains various components that are inspired by the
biological nervous system.

An Artificial Neural Network has a huge number of interconnected processing elements, also
known as nodes. These nodes are connected to other nodes using connection links. Each
connection link carries a weight, and these weights contain the information about the input
signal. Each iteration and input in turn leads to the updating of these weights. After inputting
all the data instances from the training data set, the final weights of the Neural Network,
along with its architecture, are known as the Trained Neural Network. This process is called
Training of Neural Networks. The trained neural network is then used to solve the specific
problem defined in the problem statement.

Types of tasks that can be solved using an artificial neural network include classification
problems, pattern matching, data clustering, etc.

Some real-life applications of neural networks include Air Traffic Control, Optical Character
Recognition as used by some scanning apps like Google Lens, Voice Recognition, etc.

Types of Neural Networks

(i) ANN – Also known as an Artificial Neural Network, it is a feed-forward neural network
because the inputs are processed in the forward direction. It can also contain hidden layers,
which can make the model even denser. It has a fixed length as specified by the programmer.
It is used for textual or tabular data. A widely used real-life application is facial
recognition. It is comparatively less powerful than CNNs and RNNs.

(ii) CNN – Also known as a Convolutional Neural Network, it is mainly used for image
data and computer vision. Some real-life applications are object detection in
autonomous vehicles. It contains a combination of convolutional layers and neurons. It is
more powerful than both ANNs and RNNs.


(iii) RNN – Also known as a Recurrent Neural Network, it is used to process and interpret
time-series data. In this type of model, the output from a processing node is fed back into
nodes in the same or previous layers. The best-known type of RNN is the LSTM (Long Short
Term Memory) network.

Now that we know the basics of Neural Networks, note that it is their learning capability
that makes them interesting. There are 3 types of learning in Neural Networks, namely:

1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning

Supervised Learning: As the name suggests, it is a type of learning that is looked after by a
supervisor. It is like learning with a teacher. There are input training pairs that contain a set
of inputs and the desired output. Here the output from the model is compared with the
desired output and an error is calculated; this error signal is sent back into the network to
adjust the weights. This adjustment is continued until no more adjustments can be made and
the output of the model matches the desired output. In this, there is feedback from the
environment to the model.



Unsupervised Learning: Unlike supervised learning, there is no supervisor or teacher here.
In this type of learning, there is no feedback from the environment and no desired output,
and the model learns on its own. During the training phase, the inputs are formed into classes
that define the similarity of the members. Each class contains similar input patterns. On
inputting a new pattern, the network can predict to which class that input belongs based on
similarity with other patterns. If there is no such class, a new class is formed.

Reinforcement Learning: It gets the best of both worlds, that is, the best of both supervised
learning and unsupervised learning. It is like learning with a critic. Here there is no exact
feedback from the environment; rather, there is critique feedback. The critic tells how close
our solution is. Hence the model learns on its own based on the critique information. It is
similar to supervised learning in that it receives feedback from the environment, but it is
different in that it does not receive the desired output information; rather, it receives critique
information.

Working of Neural Network

Artificial neurons or perceptrons consist of:

 Input
 Weight
 Bias
 Activation Function
 Output


The neurons receive many inputs and process a single output.

Neural networks are comprised of layers of neurons.

These layers consist of the following:

 Input layer
 Multiple hidden layers
 Output layer

The input layer receives data represented by a numeric value. Hidden layers perform the most
computations required by the network. Finally, the output layer predicts the output.

In a neural network, each layer is made of neurons, and the neurons of one layer feed the
next. Once the input layer receives data, it is redirected to the hidden layer. Each input is
assigned a weight.

A weight is a value in a neural network that converts input data within the network’s
hidden layers. The network starts at the input layer, takes the input data, and multiplies it by
the weight values.

It then passes the result to the first hidden layer. The hidden layers transform the input data
and pass it to the next layer. The output layer produces the desired output.


The inputs and weights are multiplied, and their sum is sent to neurons in the hidden
layer. A bias is applied to each neuron. Each neuron adds the inputs it receives to get the
sum. This value then passes through the activation function.

The activation function outcome then decides if a neuron is activated or not. An activated
neuron transfers information to the next layers. In this way, the data propagates forward
through the network until it reaches the output layer.

Another name for this is forward propagation. Feed-forward propagation is the process of
inputting data into an input node and getting the output through the output node. (We’ll
discuss feed-forward propagation a bit more in the section below).

Feed-forward propagation takes place when the hidden layer accepts the input data,
processes it as per the activation function, and passes it to the output layer. The neuron in the
output layer with the highest probability then projects the result.

If the output is wrong, backpropagation takes place. While designing a neural network,
weights are initialized to each input. Backpropagation means re-adjusting each input’s
weights to minimize the errors, thus resulting in a more accurate output.

What a shallow network computes


A shallow neural network, which consists of only 1 or 2 hidden layers, gives us a
basic idea of what a deep neural network does. Understanding a shallow neural network
gives us insight into what exactly is going on inside a deep neural network. A neural
network is built using various hidden layers. Now that we know the computations that
occur in a particular layer, let us understand how the whole neural network computes
the output for a given input X. These can also be called the forward-
propagation equations.

1. The first equation calculates the intermediate output Z[1] of the first hidden layer.
2. The second equation calculates the final output A[1] of the first hidden layer.
3. The third equation calculates the intermediate output Z[2] of the output layer.


4. The fourth equation calculates the final output A[2] of the output layer, which is also the
final output of the whole neural network.

Neural Networks Overview

In logistic regression, to calculate the output (y = a), we used the below computation graph:

In case of a neural network with a single hidden layer, the structure will look like:

And the computation graph to calculate the output will be:

X1 \
X2  => z1 = W1x + B1 => a1 = Sigmoid(z1) => z2 = W2a1 + B2 => a2 = Sigmoid(z2) => L(a2, Y)
X3 /

 Neural Network Representation

Consider the following representation of a neural network:


Can you identify the number of layers in the above neural network? Remember that while
counting the number of layers in a NN, we do not count the input layer. So, there are 2 layers
in the NN shown above, i.e., one hidden layer and one output layer.

The input layer is referred to as a[0], the hidden layer as a[1], and the output layer as a[2].
Here ‘a’ stands for activations, which are the values that the different layers of a neural
network pass on to the next layer. The corresponding parameters are w[1], b[1] and w[2], b[2]:

This is how a neural network is represented. Next, we will look at how to compute the output
from a neural network.

 Computing a Neural Network’s Output

Let’s look in detail at how each neuron of a neural network works. Each neuron takes an
input, performs some operation on it (calculates z = wᵀx + b), and then applies the
sigmoid function:


This step is performed by each neuron. The equations for the first hidden layer with four
neurons are computed in this way, neuron by neuron. So, for a given input x, the layer
outputs will be:

z[1] = W[1]x + b[1]

a[1] = 𝛔(z[1])

z[2] = W[2]a[1] + b[2]

a[2] = 𝛔(z[2])

To compute these outputs, we need to run a for loop which will calculate these values

individually for each neuron. But recall that using a for loop will make the computations very

slow, and hence we should optimize the code to get rid of this for loop and run it faster.

 Vectorizing across multiple examples

The non-vectorized form of computing the output from a neural network is:

for i=1 to m:


z[1](i) = W[1]x(i) + b[1]

a[1](i) = 𝛔(z[1](i))

z[2](i) = W[2]a[1](i) + b[2]

a[2](i) = 𝛔(z[2](i))

Using this for loop, we are calculating z and a value for each training example separately.

Now we will look at how it can be vectorized. All the training examples will be merged in a

single matrix X:

Here, nx is the number of features and m is the number of training examples. The vectorized

form for calculating the output will be:

Z[1] = W[1]X + b[1]

A[1] = 𝛔(Z[1])

Z[2] = W[2]A[1] + b[2]

A[2] = 𝛔(Z[2])

This will reduce the computation time (significantly in most cases).
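A minimal NumPy sketch of this vectorized forward pass (the layer sizes -- 3 inputs, 4
hidden units, 1 output -- are illustrative assumptions):

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

m, nx = 5, 3                       # m training examples, nx features
X = np.random.randn(nx, m)         # each column is one training example

W1, b1 = np.random.randn(4, nx) * 0.01, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.01, np.zeros((1, 1))

Z1 = W1 @ X + b1                   # (4, m) -- all examples at once, no for loop
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2                  # (1, m)
A2 = sigmoid(Z2)

print(A2.shape)                    # (1, 5): one prediction per training example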

 Activation Function

While calculating the output, an activation function is applied. The choice of activation
function highly affects the performance of the model. So far, we have used the sigmoid
activation function:


However, this might not be the best option in some cases. Why? Because at the extreme ends
of the graph the derivative is close to zero, and hence gradient descent updates the
parameters very slowly.

There are other functions which can replace this activation function:

 tanh:

 ReLU (already covered earlier):

Activation Function | Pros                                                                         | Cons

Sigmoid             | Output ranges from 0 to 1; used in the output layer for binary classification | Updates parameters slowly when points are at extreme ends
tanh                | Better than sigmoid                                                           | Updates parameters slowly when points are at extreme ends
ReLU                | Updates parameters faster, as the slope is 1 when x > 0                       | Zero slope when x < 0

 We can choose different activation functions depending on the problem we’re trying to
solve; minimal definitions of the three functions above are sketched below.
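The three activation functions, sketched in NumPy (illustrative values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # output in (0, 1)

def tanh(z):
    return np.tanh(z)                 # output in (-1, 1), zero-centred

def relu(z):
    return np.maximum(0, z)           # slope 1 for z > 0, zero slope for z < 0

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # [0.119 0.5   0.881]
print(tanh(z))     # [-0.964  0.     0.964]
print(relu(z))     # [0. 0. 2.]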


 Why do we need non-linear activation functions?

If we use linear activation functions on the outputs of the layers, the network computes the
output as a linear function of the input features. We first calculate the Z value as:

Z = WX + b

In the case of a linear activation function, the output is equal to Z (instead of applying any
non-linear activation):

A = Z

Using a linear activation is essentially pointless. The composition of two linear functions is
itself a linear function, so unless we use some non-linear activations, we are not computing
more interesting functions. That’s why most experts stick to using non-linear activation
functions.

There is only one scenario where we tend to use a linear activation function. Suppose we
want to predict the price of a house (which can be any positive real number). If we use a
sigmoid or tanh function, the output will be restricted to (0,1) or (-1,1) respectively, but the
price can be greater than 1 as well. In this case, we use a linear activation function at the
output layer.

Once we have the outputs, what’s the next step? We want to perform backpropagation in

order to update the parameters using gradient descent.

 Gradient Descent for Neural Networks

The parameters which we have to update in a two-layer neural network are w[1], b[1], w[2]
and b[2], and the cost function which we will be minimizing is the average loss over the m
training examples:

J(w[1], b[1], w[2], b[2]) = (1/m) ∑ L(a[2](i), y(i))


The gradient descent steps can be summarized as:

Repeat:
Compute predictions (y'(i), i = 1,...m)
    Get derivatives: dW[1], db[1], dW[2], db[2]
    Update: W[1] = W[1] - ⍺ * dW[1]
            b[1] = b[1] - ⍺ * db[1]
            W[2] = W[2] - ⍺ * dW[2]
            b[2] = b[2] - ⍺ * db[2]

Let’s quickly look at the forward and backpropagation steps for a two-layer neural network.

Forward propagation:

Z[1] = W[1]*A[0] + b[1]    # A[0] is X
A[1] = g[1](Z[1])
Z[2] = W[2]*A[1] + b[2]
A[2] = g[2](Z[2])

Backpropagation:

dZ[2] = A[2] - Y   
dW[2] = (dZ[2] * A[1].T) / m
db[2] = Sum(dZ[2]) / m
dZ[1] = (W[2].T * dZ[2]) * g'[1](Z[1])  # element wise product (*)
dW[1] = (dZ[1] * A[0].T) / m   # A[0] = X
db[1] = Sum(dZ[1]) / m
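Putting the pieces together, here is a runnable NumPy sketch of these forward propagation,
backpropagation, and gradient descent steps for a two-layer network (sigmoid in both
layers; the toy OR-gate data, layer sizes, learning rate, and iteration count are all
illustrative assumptions):

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

np.random.seed(1)
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]])            # shape (2, 4): 4 training examples
Y = np.array([[0, 1, 1, 1]])            # OR-gate targets, shape (1, 4)
m = X.shape[1]

W1, b1 = np.random.randn(3, 2) * 0.01, np.zeros((3, 1))
W2, b2 = np.random.randn(1, 3) * 0.01, np.zeros((1, 1))
alpha = 1.0                             # learning rate

for _ in range(10000):
    # Forward propagation (A0 is X)
    Z1 = W1 @ X + b1
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)

    # Backpropagation; for sigmoid, g'[1](Z1) = A1 * (1 - A1)
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)  # element-wise product
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient descent updates
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2

print(np.round(A2, 2))                  # predictions should move toward [[0, 1, 1, 1]]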

These are the complete steps a neural network performs to generate outputs. Note that we

have to initialize the weights (W) in the beginning which are then updated in the

backpropagation step. So let’s look at how these weights should be initialized.

 Random Initialization

We have previously seen that the weights are initialized to 0 in case of a logistic regression

algorithm. But should we initialize the weights of a neural network to 0? It’s a pertinent

question. Let’s consider the example shown below:


If the weights are initialized to 0, the W matrix will be:

Using these weights: 

And finally at the backpropagation step:

No matter how many units we use in a layer, we are always getting the same output which is

similar to that of using a single unit. So, instead of initializing the weights to 0, we randomly

initialize them using the following code:

w[1] = np.random.randn(2, 2) * 0.01
b[1] = np.zeros((2, 1))

We multiply the weights by 0.01 to initialize small weights. If we initialize large weights,
the activations will be large, resulting in a near-zero slope (in the case of sigmoid and tanh
activation functions). Hence, learning will be slow. So we generally initialize small weights
randomly.

 Training a network: loss functions, back propagation and stochastic gradient descent:

Loss Functions

The loss function in a neural network quantifies the difference between the expected outcome
and the outcome produced by the machine learning model.

Loss functions are mainly classified into two different categories: Classification loss and
Regression loss. Classification loss is used where the aim is to predict the output from
different categorical values; for example, if we have a dataset of handwritten images and the
digit to be predicted lies between 0 and 9, classification loss is used in these kinds of
scenarios.

If the problem is regression, i.e., predicting continuous values, for example predicting the
weather conditions or the prices of houses on the basis of some features, Regression loss
is used.

Most widely used loss functions in Neural networks:

 Mean Absolute Error (L1 Loss)

 Mean Squared Error (L2 Loss)

 Huber Loss

 Cross-Entropy (a.k.a. Log loss)

 Relative Entropy (a.k.a. Kullback–Leibler divergence)

 Squared Hinge

Mean Absolute Error (MAE)

Mean absolute error (MAE), also called L1 Loss, is a loss function used for regression
problems. It represents the difference between the original and predicted values, computed
by averaging the absolute difference over the data set.
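In its standard form, for n samples with true values yi and predictions ŷi:

MAE = (1/n) ∑ |yi - ŷi|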

MAE is not sensitive towards outliers: given several examples with the same input
feature values, the optimal prediction is their median target value. This should be
compared with Mean Squared Error, where the optimal prediction is the mean. A
disadvantage of MAE is that the gradient magnitude does not depend on the error size, only
on the sign of (y - ŷ), so the gradient magnitude will be large even when the
error is small, which in turn can lead to convergence problems.


Use Mean absolute error when you are doing regression and don’t want outliers to play a
big role. It can also be useful if you know that your distribution is multimodal, and it’s
desirable to have predictions at one of the modes, rather than at the mean of them.

Example: When doing image reconstruction, MAE encourages less blurry images compared
to MSE. This is used for example in the paper Image-to-Image Translation with Conditional
Adversarial Networks by Isola et al.

Mean Squared Error (MSE)

Mean Squared Error (MSE), also called L2 Loss, is also a loss function used for regression. It
represents the difference between the original and predicted values, computed by averaging
the squared difference over the data set.
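In its standard form:

MSE = (1/n) ∑ (yi - ŷi)²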

MSE is sensitive towards outliers: given several examples with the same input feature
values, the optimal prediction is their mean target value. This should be compared with
Mean Absolute Error, where the optimal prediction is the median. MSE is thus good to use if
you believe that your target data, conditioned on the input, is normally distributed around a
mean value, and when it’s important to penalize outliers extra much.

Use MSE when doing regression, believing that your target, conditioned on the input, is
normally distributed, and want large errors to be significantly (quadratically) more penalized
than small ones.

Example: You want to predict future house prices. The price is a continuous value, and
therefore we want to do regression. MSE can here be used as the loss function.

Huber Loss

Huber Loss is typically used in regression problems. It’s less sensitive to outliers than the
MSE, as it treats the error as squared only inside an interval.

Consider an example where we have a dataset of 100 values we would like our model to be
trained to predict. Out of all that data, 25% of the expected values are 5 while the other 75%
are 10.

An MSE loss wouldn’t quite do the trick, since we don’t really have “outliers”; 25% is by no
means a small fraction. On the other hand, we don’t necessarily want to weigh that 25% too
low with an MAE. Those values of 5 aren’t close to the median (10 — since 75% of the points
have a value of 10), but they’re also not really outliers.

This is where the Huber Loss Function comes into play.


The Huber Loss offers the best of both worlds by balancing MSE and MAE together. We
can define it using the following piecewise function:
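In its standard piecewise form:

L𝛿(y, ŷ) = ½ (y - ŷ)²           for |y - ŷ| ≤ 𝛿
L𝛿(y, ŷ) = 𝛿 |y - ŷ| - ½ 𝛿²     otherwise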

Here, 𝛿 (delta) is a hyperparameter that defines the boundary between the MAE and MSE regimes.

In simple terms, what the above basically says is: for errors smaller than 𝛿, use the
MSE; for errors larger than 𝛿, use the MAE. In this way, Huber loss combines the
best of both MAE and MSE.
As we already know, Huber loss contains both MAE and MSE. So when we think higher
weightage should not be given to outliers, we set our loss function to Huber loss. What we
need to manually define is the 𝛿 (delta) value. Generally, some iterations are needed with the
respective algorithm to find the correct delta value.

Cross-Entropy Loss(a.k.a Log loss)

The concept of cross-entropy traces back to the field of Information Theory, where Claude
Shannon introduced the concept of entropy in 1948. Before diving into the cross-entropy
loss function, let us talk about entropy.

Entropy has roots in physics — it is a measure of disorder, or unpredictability, in a system.

For instance, consider two gases in a box: initially, the system has low entropy, in that the
two gases are completely separable (a skewed distribution); after some time, however, the
gases blend (a distribution where events have equal probability), so the system’s entropy
increases. It is said that in an isolated system the entropy never decreases — the chaos never
dims down without external influence.


Entropy

For a probability distribution p(x) and a random variable X, entropy is defined as follows:
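In its standard (Shannon) form, with the sum running over the values x of X:

H(X) = - ∑ p(x) log p(x)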

Reason for the negative sign: log(p(x)) < 0 for all p(x) in (0,1), since p(x) is a probability
distribution and its values must range between 0 and 1; for such values, log(p(x)) is negative.

Cross-entropy loss is also called logarithmic loss, log loss, or logistic loss. Each predicted
class probability is compared to the actual class’s desired output (0 or 1), and a score/loss is
calculated that penalizes the probability based on how far it is from the actual expected value.
The penalty is logarithmic in nature, yielding a large score for large differences close to 1 and
a small score for small differences tending to 0.

Cross-entropy is expressed by the equation:
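In its standard form:

H(p, q) = - ∑ p(x) log q(x)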

where x represents the predicted results of the ML algorithm, p(x) is the probability
distribution of the "true" label from the training samples, and q(x) depicts the estimation of
the ML algorithm.

Cross-entropy loss measures the performance of a classification model whose output is a
probability value between 0 and 1. Cross-entropy loss increases as the predicted probability
diverges from the actual label. So predicting a probability of .012 when the actual observation
label is 1 would be bad and result in a high loss value. A perfect model would have a log loss
of 0.


The graph above shows the range of possible loss values given a true observation. As the
predicted probability approaches 1, log loss slowly decreases. As the predicted probability
decreases, however, the log loss increases rapidly. Log loss penalizes both types of errors, but
especially those predictions that are confident and wrong!
(As an aside, the cross-entropy method is also a Monte Carlo technique for importance
sampling and optimization.)
Binary Cross-Entropy
Binary cross-entropy is a loss function that is used in binary classification tasks. These are
tasks that answer a question with only two choices (yes or no, A or B, 0 or 1, left or right).
In binary classification, where the number of classes M equals 2, cross-entropy can be
calculated as:
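In its standard form, over N examples with true labels yi and predicted probabilities ŷi:

BCE = - (1/N) ∑ [ yi log(ŷi) + (1 - yi) log(1 - ŷi) ]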

Sigmoid is the only activation function compatible with the binary cross-entropy loss
function. You must use it on the last layer before the output.

The binary cross-entropy needs to compute the logarithms of Ŷi and (1-Ŷi), which only exist
if Ŷi is between 0 and 1. The sigmoid activation function is the only one to guarantee that the
output is within this range.

Categorical Cross-Entropy

Categorical cross-entropy is a loss function that is used in multi-class classification tasks.


These are tasks where an example can only belong to one out of many possible categories,
and the model must decide which one.
Formally, it is designed to quantify the difference between two probability distributions.
If 𝑀>2 (i.e. multiclass classification), we calculate a separate loss for each class label per
observation and sum the result.
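In its standard form (the symbols are defined below):

Loss = - ∑c=1..M y(o,c) log(p(o,c))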


 M — number of classes (dog, cat, fish)
 log — the natural log
 y — binary indicator (0 or 1) of whether class label c is the correct classification for
observation o
 p — predicted probability that observation o is of class c

Softmax is the only activation function recommended for use with the categorical cross-
entropy loss function.
Strictly speaking, the output of the model only needs to be positive so that the logarithm of
every output value Ŷi exists. However, the main appeal of this loss function is for comparing
two probability distributions, and the softmax activation rescales the model output so that it
has the right properties.

Sparse Categorical Cross-Entropy

Sparse categorical cross-entropy has the same loss function as categorical cross-entropy,
which we have mentioned above. The only difference is the format in which we provide
Yi (i.e., the true labels).

If your Yi’s are one-hot encoded, use categorical_crossentropy. Examples for a 3-class
classification: [1,0,0], [0,1,0], [0,0,1]

But if your Yi’s are integers, use sparse_categorical_crossentropy. Examples for the above 3-
class classification problem: [1], [2], [3]

The usage entirely depends on how you load your dataset. One advantage of using sparse
categorical cross-entropy is that it saves memory as well as computation time, because it
simply uses a single integer for a class rather than a whole vector.
Relative Entropy (Kullback–Leibler divergence)

Relative entropy (also called Kullback–Leibler divergence) is a method for measuring
the similarity between two probability distributions. It was refined by Solomon Kullback
and Richard Leibler for public release in 1951. KL-Divergence aims to identify the
divergence (separation or bifurcation) of a probability distribution given a baseline
distribution. That is, for a target distribution P, we compare a competing distribution Q by
computing the expected value of the log-odds of the two distributions:

For distributions P and Q of a continuous random variable, the Kullback-Leibler divergence
is computed as an integral:
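In its standard form, with densities p and q:

DKL(P || Q) = ∫ p(x) log[ p(x) / q(x) ] dx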


If P and Q represent the probability distribution of a discrete random variable, the Kullback-
Leibler divergence is calculated as a summation:
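In its standard form:

DKL(P || Q) = ∑ P(x) log[ P(x) / Q(x) ]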

Also, with a little bit of work, we can show that the KL-Divergence is non-negative. This
means that the smallest possible value is zero (when the distributions are equal) and the
maximum value is infinity. We get infinity when P is defined in a region where Q can never
exist. Therefore, it is a common assumption that both distributions exist on the same support.

The closer two distributions get to each other, the lower the loss becomes. Imagine a blue
distribution trying to model a green distribution: as the blue distribution comes closer and
closer to the green one, the KL divergence loss gets closer to zero. The lower the KL
divergence value, the better we have matched the true distribution with our approximation.

The applications of KL-Divergence:

1. Primarily, it is used in Variational Autoencoders. These autoencoders learn to
encode samples into a latent probability distribution, and from this latent distribution
a sample can be drawn and fed to a decoder which outputs, e.g., an image.
2. KL divergence can also be used in multiclass classification scenarios. These
problems, which traditionally use the softmax function and one-hot encoded
target data, are naturally suited to KL divergence, since softmax "normalizes data
into a probability distribution consisting of K probabilities proportional to the
exponentials of the input numbers".
3. Delineating the relative (Shannon) entropy in information systems.
4. Measuring randomness in continuous time-series.
Squared Hinge
The squared hinge loss is a loss function used for "maximum margin" binary classification
problems. Mathematically it is defined as:
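In its standard form, with y ∈ {-1, 1}:

L(y, ŷ) = max(0, 1 - y·ŷ)²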


where ŷ is the predicted value and y is either 1 or -1.

Thus, the squared hinge loss → 0 when the true and predicted labels are the same and
ŷ ≥ 1 (an indication that the classifier is sure it’s the correct label).
The squared hinge loss increases quadratically with the error when the true and predicted
labels are not the same, or when the labels are the same but ŷ < 1 (an indication that the
classifier is not sure it’s the correct label).
Compared to the traditional hinge loss (used in SVMs), larger errors are punished more
significantly, whereas smaller errors are punished slightly more lightly.

(Figure: comparison between hinge and squared hinge loss.)

Use the squared hinge loss function on problems involving yes/no (binary) decisions,
especially when you’re not interested in knowing how certain the classifier is about the
classification, i.e., when you don’t care about the classification probabilities. Use it in
combination with the tanh() activation function in the last layer of the neural network.

A typical application is classifying email into ‘spam’ and ‘not spam’, where you’re only
interested in the classification accuracy.
Back Propagation

Backpropagation is the essence of neural network training. It is the method of fine-tuning
the weights of a neural network based on the error rate obtained in the previous epoch (i.e.,
iteration). Proper tuning of the weights reduces error rates and makes the model
more reliable by increasing its generalization.


"Backpropagation" in a neural network is short for "backward propagation of errors." It is
a standard method of training artificial neural networks. This method helps calculate the
gradient of a loss function with respect to all the weights in the network.

How the Backpropagation Algorithm Works

The backpropagation algorithm computes the gradient of the loss function
for a single weight by the chain rule. It efficiently computes one layer at a time, unlike a
naive direct computation. It computes the gradient, but it does not define how the gradient is
used. It generalizes the computation in the delta rule.

Consider the following Back propagation neural network example diagram to understand:

1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually randomly selected.
3. Calculate the output for every neuron from the input layer, through the hidden layers, to
the output layer.
4. Calculate the error in the outputs:

Error = Actual Output – Desired Output

5. Travel back from the output layer to the hidden layer to adjust the weights such that
the error is decreased.

Keep repeating the process until the desired output is achieved.

The most prominent advantages of backpropagation are:

 Backpropagation is fast, simple, and easy to program
 It has no parameters to tune apart from the number of inputs
 It is a flexible method, as it does not require prior knowledge about the network
 It is a standard method that generally works well
 It does not need any special mention of the features of the function to be learned

Feed Forward Network


A feedforward neural network is an artificial neural network where the nodes never form a
cycle. This kind of neural network has an input layer, hidden layers, and an output layer. It is
the first and simplest type of artificial neural network.

Types of Backpropagation Networks


Two Types of Backpropagation Networks are:

 Static Back-propagation
 Recurrent Backpropagation

Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input to a
static output. It is useful for solving static classification problems like optical character
recognition.

Recurrent backpropagation:
In recurrent backpropagation, activity is fed forward until a fixed value is achieved.
After that, the error is computed and propagated backward.

The main difference between the two methods is that the mapping is rapid in static
back-propagation, while it is non-static in recurrent backpropagation.

Example:

Input values

X1=0.05
X2=0.10


Initial weights

w1=0.15     w5=0.40
w2=0.20     w6=0.45
w3=0.25     w7=0.50
w4=0.30     w8=0.55

Bias Values

b1=0.35     b2=0.60

Target Values

T1=0.01
T2=0.99

Now, we first calculate the values of H1 and H2 by a forward pass.

Forward Pass

To find the net input H1, we multiply the input values by the corresponding weights and add the bias:

                              H1=x1×w1+x2×w2+b1
                        H1=0.05×0.15+0.10×0.20+0.35
                                    H1=0.3775

To calculate the final output of H1, we apply the sigmoid function:

                        H1final=1/(1+e^(−0.3775))=0.593269992

We calculate the value of H2 in the same way as H1:

                              H2=x1×w3+x2×w4+b1
                        H2=0.05×0.25+0.10×0.30+0.35
                                    H2=0.3925

To calculate the final output of H2, we apply the sigmoid function:

                        H2final=1/(1+e^(−0.3925))=0.596884378


Now, we calculate the net inputs y1 and y2 in the same way as H1 and H2.

To find the value of y1, we multiply the outputs of H1 and H2 by the corresponding weights
and add the bias:

                              y1=H1final×w5+H2final×w6+b2
                        y1=0.593269992×0.40+0.596884378×0.45+0.60
                                    y1=1.10590597

To calculate the final output of y1, we apply the sigmoid function:

                        y1final=1/(1+e^(−1.10590597))=0.75136507

We calculate the value of y2 in the same way as y1:

                              y2=H1final×w7+H2final×w8+b2
                        y2=0.593269992×0.50+0.596884378×0.55+0.60
                                    y2=1.2249214

To calculate the final output of y2, we apply the sigmoid function:

                        y2final=1/(1+e^(−1.2249214))=0.772928465

Our target values are 0.01 and 0.99. The computed values of y1final and y2final do not
match the target values T1 and T2.
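
The forward pass is easy to verify numerically; here is a small sketch reproducing the numbers above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60

H1 = sigmoid(x1 * w1 + x2 * w2 + b1)   # sigmoid(0.3775)     = 0.593269992
H2 = sigmoid(x1 * w3 + x2 * w4 + b1)   # sigmoid(0.3925)     = 0.596884378
y1 = sigmoid(H1 * w5 + H2 * w6 + b2)   # sigmoid(1.10590597) = 0.75136507
y2 = sigmoid(H1 * w7 + H2 * w8 + b2)   # sigmoid(1.2249214)  = 0.772928465
print(H1, H2, y1, y2)
```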


Now, we will find the total error, which is the sum of squared differences between the
actual outputs and the target outputs:

                        Etotal=Σ ½(target−output)²

So, the total error is

                        E1=½(T1−y1final)²=½(0.01−0.75136507)²=0.274811083
                        E2=½(T2−y2final)²=½(0.99−0.772928465)²=0.023560026
                        Etotal=E1+E2=0.274811083+0.023560026=0.298371109

Now, we will backpropagate this error to update the weights using a backward pass.

Backward pass at the output layer

To update a weight, we calculate the error corresponding to that weight from the total
error. The error on weight w is calculated by differentiating the total error with respect to w.

We perform the backward process, so first consider the last-layer weight w5. Etotal does
not contain w5 directly, so we cannot differentiate it with respect to w5 immediately.
Instead, we split the derivative into multiple terms by the chain rule so that each can be
differentiated easily:

                        ∂Etotal/∂w5=(∂Etotal/∂y1final)×(∂y1final/∂y1)×(∂y1/∂w5)

Now, we calculate each term one by one.

Differentiating Etotal with respect to y1final:

                        ∂Etotal/∂y1final=−(T1−y1final)=−(0.01−0.75136507)=0.74136507

Since y1final is the sigmoid of y1, its derivative is y1final×(1−y1final):

                        ∂y1final/∂y1=0.75136507×(1−0.75136507)=0.186815602

Since y1=H1final×w5+H2final×w6+b2, the last term is

                        ∂y1/∂w5=H1final=0.593269992

So, we put these values together to find the final result:

                        ∂Etotal/∂w5=0.74136507×0.186815602×0.593269992=0.082167041


Now, we will calculate the updated weight w5new with the gradient descent update formula
(the learning rate η=0.5 is implied by the updated values below):

                        w5new=w5−η×∂Etotal/∂w5=0.40−0.5×0.082167041=0.35891648

In the same way, we calculate w6new, w7new, and w8new, which gives us the following values:

                        w5new=0.35891648
                        w6new=0.408666186
                        w7new=0.511301270
                        w8new=0.561370121

Backward pass at the hidden layer

Now, we will backpropagate to the hidden layer and update the weights w1, w2, w3, and w4,
as we did with the weights w5, w6, w7, and w8.

We will calculate the error at w1. As before, Etotal does not contain w1 directly, so we
split the derivative into multiple terms by the chain rule so that each can be
differentiated easily:

                        ∂Etotal/∂w1=(∂Etotal/∂H1final)×(∂H1final/∂H1)×(∂H1/∂w1)

Now, we calculate each term one by one.

Etotal does not contain an H1final term directly, but both E1 and E2 depend on H1final
(through y1 and y2), so we split the first term as

                        ∂Etotal/∂H1final=∂E1/∂H1final+∂E2/∂H1final

E1 and E2 contain no H1final term either, so we split again through the net inputs y1 and y2:

                        ∂E1/∂H1final=(∂E1/∂y1final)×(∂y1final/∂y1)×(∂y1/∂H1final)
                        ∂E2/∂H1final=(∂E2/∂y2final)×(∂y2final/∂y2)×(∂y2/∂H1final)

For the first part, we already computed the output-layer terms above, and since
y1=H1final×w5+H2final×w6+b2, we have ∂y1/∂H1final=w5:

                        ∂E1/∂H1final=0.74136507×0.186815602×0.40=0.055399425

For the second part:

                        ∂E2/∂y2final=−(T2−y2final)=−(0.99−0.772928465)=−0.217071535
                        ∂y2final/∂y2=y2final×(1−y2final)=0.772928465×0.227071535=0.175510053
                        ∂y2/∂H1final=w7=0.50

                        ∂E2/∂H1final=−0.217071535×0.175510053×0.50=−0.019049119

Adding the two parts:

                        ∂Etotal/∂H1final=0.055399425−0.019049119=0.036350306

Next, since H1final is the sigmoid of H1:

                        ∂H1final/∂H1=H1final×(1−H1final)=0.593269992×0.406730008=0.241300709

We calculate the partial derivative of the total net input H1 with respect to w1 the same
way as we did for the output neuron. Since H1=x1×w1+x2×w2+b1:

                        ∂H1/∂w1=x1=0.05

So, we put these values together to find the final result:

                        ∂Etotal/∂w1=0.036350306×0.241300709×0.05=0.000438568


Now, we will calculate the updated weight w1new with the same update formula:

                        w1new=w1−η×∂Etotal/∂w1=0.15−0.5×0.000438568=0.149780716

In the same way, we calculate w2new, w3new, and w4new, which gives us the following values:

                        w1new=0.149780716
                        w2new=0.19956143
                        w3new=0.24975114
                        w4new=0.29950229

We have now updated all the weights. The error on the network was 0.298371109 when we fed
forward the inputs 0.05 and 0.1. After the first round of backpropagation, the total error
is down to 0.291027924. After repeating this process 10,000 times, the total error drops to
0.0000351085. At this point, the output neurons generate 0.015912196 and 0.984065734,
i.e., close to our target values, when we feed forward the inputs 0.05 and 0.1.
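
The entire worked example can be checked numerically. The sketch below (assuming, as above, a learning rate of 0.5) reproduces the forward pass, the backward pass, and the 10,000-iteration training run:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Constants from the example
x1, x2 = 0.05, 0.10
b1, b2 = 0.35, 0.60
T1, T2 = 0.01, 0.99
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
lr = 0.5  # learning rate implied by the weight updates above

for it in range(10_000):
    # Forward pass
    H1 = sigmoid(x1 * w1 + x2 * w2 + b1)
    H2 = sigmoid(x1 * w3 + x2 * w4 + b1)
    y1 = sigmoid(H1 * w5 + H2 * w6 + b2)
    y2 = sigmoid(H1 * w7 + H2 * w8 + b2)
    E = 0.5 * (T1 - y1) ** 2 + 0.5 * (T2 - y2) ** 2  # 0.298371109 at it == 0

    # Output-layer deltas: dE/dnet = -(target - out) * out * (1 - out)
    d1 = -(T1 - y1) * y1 * (1 - y1)
    d2 = -(T2 - y2) * y2 * (1 - y2)

    # Hidden-layer deltas (chain rule through both output neurons)
    dH1 = (d1 * w5 + d2 * w7) * H1 * (1 - H1)
    dH2 = (d1 * w6 + d2 * w8) * H2 * (1 - H2)

    # Gradient descent updates: w_new = w - lr * dE/dw
    w5, w6 = w5 - lr * d1 * H1, w6 - lr * d1 * H2
    w7, w8 = w7 - lr * d2 * H1, w8 - lr * d2 * H2
    w1, w2 = w1 - lr * dH1 * x1, w2 - lr * dH1 * x2
    w3, w4 = w3 - lr * dH2 * x1, w4 - lr * dH2 * x2

print(E, y1, y2)  # total error ~0.000035; outputs near 0.01 and 0.99
```

The first iteration of this loop reproduces w5new=0.35891648 and w1new=0.149780716 from the hand calculation above.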

Disadvantages of using Backpropagation

 The actual performance of backpropagation on a specific problem depends on the input
   data.
 Backpropagation can be quite sensitive to noisy data.
 You need to use a matrix-based approach for backpropagation rather than a mini-batch
   approach.

Stochastic Gradient Descent (SGD)

Gradient Descent is an iterative optimization process that searches for an objective
function's optimum value (minimum/maximum). It is one of the most widely used methods for
changing a model's parameters in order to reduce a cost function in machine learning
projects.
The primary goal of gradient descent is to identify the model parameters that provide
the maximum accuracy on both training and test datasets. The gradient is a vector pointing
in the direction of the function's steepest ascent at a particular point. By moving in the
opposite direction of the gradient, the algorithm gradually descends towards lower values
of the function until it reaches the minimum.
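Written as an update rule, one step of gradient descent moves the parameters θ against the gradient of the cost function J, scaled by a learning rate α:

```latex
\theta \leftarrow \theta - \alpha \, \nabla_{\theta} J(\theta)
```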
Types of Gradient Descent: 
Typically, there are three types of Gradient Descent:  
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent


Stochastic Gradient Descent (SGD):

Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that is
used for optimizing machine learning models. It addresses the computational inefficiency
of traditional Gradient Descent methods when dealing with large datasets in machine
learning projects.
In SGD, instead of using the entire dataset for each iteration, only a single random training
example (or a small batch) is selected to calculate the gradient and update the model
parameters. This random selection introduces randomness into the optimization process,
hence the term “stochastic” in Stochastic Gradient Descent.
The advantage of using SGD is its computational efficiency, especially when dealing with
large datasets. By using a single example or a small batch, the computational cost per
iteration is significantly reduced compared to traditional Gradient Descent methods that
require processing the entire dataset.
Stochastic Gradient Descent Algorithm
 Initialization: Randomly initialize the parameters of the model.
 Set Parameters: Determine the number of iterations and the learning rate (alpha) for
   updating the parameters.
 Stochastic Gradient Descent Loop: Repeat the following steps until the model
   converges or reaches the maximum number of iterations:
        a. Shuffle the training dataset to introduce randomness.
        b. Iterate over each training example (or a small batch) in the shuffled order.
        c. Compute the gradient of the cost function with respect to the model
           parameters using the current training example (or batch).
        d. Update the model parameters by taking a step in the direction of the negative
           gradient, scaled by the learning rate.
        e. Evaluate the convergence criteria, such as the change in the cost function
           between iterations.
 Return Optimized Parameters: Once the convergence criteria are met or the
   maximum number of iterations is reached, return the optimized model parameters
   (see the sketch after this list).
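
Here is a minimal sketch of these steps for a least-squares linear model; the synthetic data, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: y = 3x + 2 plus a little noise
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X + 2.0 + 0.1 * rng.normal(size=200)

w, b = 0.0, 0.0          # initialization
lr, n_epochs = 0.1, 50   # set parameters: learning rate and iterations

for epoch in range(n_epochs):
    for i in rng.permutation(len(X)):   # a. shuffle, b. one example at a time
        pred = w * X[i] + b
        # c. gradient of the squared error 0.5*(pred - y)^2
        grad_w = (pred - y[i]) * X[i]
        grad_b = (pred - y[i])
        # d. step in the direction of the negative gradient
        w -= lr * grad_w
        b -= lr * grad_b

print(w, b)  # approaches 3.0 and 2.0
```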

In SGD, since only one sample from the dataset is chosen at random for each iteration, the
path taken by the algorithm to reach the minimum is usually noisier than that of the typical
Gradient Descent algorithm. But that does not matter much, because the path taken by the
algorithm is irrelevant as long as we reach the minimum, and with a significantly shorter
training time.
The path taken by Batch Gradient Descent is shown below:


The path taken by Stochastic Gradient Descent looks as follows:

One thing to note is that, because SGD is generally noisier than typical Gradient Descent, it
usually takes a higher number of iterations to reach the minimum, owing to the randomness in
its descent. Even though it requires more iterations to reach the minimum than typical
Gradient Descent, each iteration is computationally much less expensive. Hence, in most
scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning algorithm.

Advantages of Stochastic Gradient Descent

Speed: SGD is faster than other variants of Gradient Descent, such as Batch Gradient
Descent and Mini-Batch Gradient Descent, since it uses only one example per parameter
update.
Memory Efficiency: Since SGD updates the parameters using one training example at a
time, it is memory-efficient and can handle large datasets that cannot fit into memory.


Avoidance of Local Minima: Due to its noisy updates, SGD has the ability to escape
from local minima and converge towards a global minimum.

Disadvantages of Stochastic Gradient Descent

Noisy Updates: The updates in SGD are noisy and have high variance, which can make
the optimization process less stable and lead to oscillations around the minimum.
Slow Convergence: SGD may require more iterations to converge to the minimum, since it
updates the parameters one training example at a time.
Sensitivity to Learning Rate: The choice of learning rate is critical in SGD: a high
learning rate can cause the algorithm to overshoot the minimum, while a low learning rate
makes convergence slow.
Less Accurate: Due to the noisy updates, SGD may not converge to the exact global
minimum and can result in a suboptimal solution. This can be mitigated by techniques
such as learning rate scheduling and momentum-based updates.

Neural networks as universal function approximators

The universal approximation theorem states that any continuous function f: [0, 1]^n → [0, 1]
can be approximated arbitrarily well by a neural network with at least one hidden layer and a
finite number of weights, which is what we are going to illustrate in the next subsections.

Visual proof of Universal Approximation

Say we want to approximate a function with 1 input and 1 output like so:

We will first consider a simple NN with 2 hidden neurons that have a sigmoid activation
function, and for now the output neuron will just be linear.

Step 1: Make a step function with one of the neurons.


Let’s focus on the top hidden neuron first. By using a large weight w on the top neuron, we
can approximate a step function with a sigmoid arbitrarily well, and by adjusting the bias b
we can place the step anywhere. In this toy example, we are not interested in changing the
weights of the first layer; they just have to be large enough, so we will consider them
constant. Additionally, to make the plots clearer, we will display the position of the step
instead of the bias, which is easily computed as s=−b/w.
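
A quick numerical sketch (the weight size, step positions, and bump height below are illustrative assumptions) shows how one large-weight sigmoid approximates a step, and how two of them combine into a localized “bump”, the building block for approximating an arbitrary function:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(0.0, 1.0, 11)

# One hidden neuron with a large weight w approximates a step at s = -b/w.
w = 200.0
s = 0.4                      # desired step position
b = -w * s
step = sigmoid(w * x + b)    # ~0 for x < 0.4, ~1 for x > 0.4
print(np.round(step, 2))

# Two such neurons with opposite-sign output weights make a bump of
# height h between s1 and s2.
h, s1, s2 = 0.8, 0.3, 0.6
bump = h * (sigmoid(w * (x - s1)) - sigmoid(w * (x - s2)))
print(np.round(bump, 2))     # ~0.8 strictly inside (0.3, 0.6), ~0 well outside
```

Summing many such bumps of varying heights and positions approximates the target function arbitrarily well, which is the intuition behind the universal approximation theorem.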
