Unit –I
INTRODUCTION
Introduction to machine learning- Linear models (SVMs and Perceptron, logistic regression)-
Intro to Neural Nets: What a shallow network computes- Training a network: loss functions,
back propagation and stochastic gradient descent- Neural networks as universal function
approximators.
This machine learning tutorial gives you an introduction to machine learning along with the
wide range of machine learning techniques such as Supervised, Unsupervised,
and Reinforcement learning. You will learn about regression and classification models,
clustering methods, hidden Markov models, and various sequential models.
A machine learning system learns from historical data, builds prediction models, and predicts
the output whenever it receives new data. The accuracy of the predicted output depends in part
on the amount of data: more data generally helps to build a model that predicts the output
more accurately.
Suppose we have a complex problem in which we need to make predictions. Instead of writing
code for it by hand, we feed the data to generic algorithms; with their help, the machine builds
the logic from the data and predicts the output. Machine learning has changed the way we think
about such problems. The block diagram below explains the working of a machine learning
algorithm:
Unit-I INTRODUCTION Lecture Notes
The need for machine learning is increasing day by day, because it can perform tasks that are
too complex for a person to implement directly. As humans, we cannot process huge amounts of
data manually, so we need computer systems, and machine learning makes this easier for us.
We can train machine learning algorithms by providing them with a large amount of data and
letting them explore the data, construct models, and predict the required output automatically.
The performance of a machine learning algorithm depends on the amount of data, and it can be
measured by a cost function. With the help of machine learning, we can save both time and
money.
The importance of machine learning is easily understood from its use cases. Currently, machine
learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions
on Facebook, and more. Top companies such as Netflix and Amazon have built machine learning
models that use vast amounts of data to analyze user interest and recommend products
accordingly.
At a broad level, machine learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
In supervised learning, the system builds a model from labeled data in order to understand the
dataset and learn about each data point. Once training and processing are done, we test the
model on sample data to check whether it predicts the correct output.
The goal of supervised learning is to map input data to output data. It is based on supervision,
in the same way that a student learns under the supervision of a teacher. An example of
supervised learning is spam filtering. Supervised learning can be further grouped into two
categories of algorithms:
o Classification
o Regression
2) Unsupervised Learning
In unsupervised learning, the machine is trained on data that has not been labelled, classified,
or categorized, and the algorithm must act on that data without any supervision. The goal of
unsupervised learning is to restructure the input data into new features or into groups of
objects with similar patterns.
In unsupervised learning, we do not have a predetermined result. The machine tries to find
useful insights from a large amount of data. It can be further classified into two categories of
algorithms:
o Clustering
o Association
3) Reinforcement Learning
A robotic dog that automatically learns the movement of its limbs is an example of
reinforcement learning.
Linear models:
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is called
a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
Consider the diagram below, in which two different categories are classified using a decision
boundary or hyperplane:
Example: SVM can be understood with the example that we used for the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, and we want a model that can
accurately identify whether it is a cat or a dog. Such a model can be created using the SVM
algorithm. We first train the model with many images of cats and dogs so that it can learn their
different features, and then we test it with this strange creature. The SVM draws a decision
boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors)
of each class. On the basis of the support vectors, it will classify the creature as a cat. Consider
the diagram below:
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be
classified into two classes by a single straight line, the data is termed linearly
separable, and the classifier used is called a linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for data that is not linearly separable. If a
dataset cannot be classified by a straight line, the data is termed non-linear, and the
classifier used is called a non-linear SVM classifier.
The dimension of the hyperplane depends on the number of features in the dataset. If there are
2 features (as shown in the image), the hyperplane is a straight line; if there are 3 features, the
hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, that is, the maximum distance
between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect its position are
termed support vectors. Since these vectors support the hyperplane, they are called support
vectors.
Linear SVM:
The working of the SVM algorithm can be understood with an example. Suppose we have a
dataset with two tags (green and blue) and two features, x1 and x2. We want a classifier that
can classify a pair (x1, x2) of coordinates as either green or blue. Consider the image below:
Since this is a 2-d space, we can easily separate the two classes with a straight line. But
multiple lines can separate these classes. Consider the image below:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
is called a hyperplane. The SVM algorithm finds the points of both classes that lie closest to
the line. These points are called support vectors. The distance between the support vectors and
the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The
hyperplane with the maximum margin is called the optimal hyperplane.
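The idea of preferring the line with the larger margin can be illustrated with a small sketch. It does not solve the SVM optimization problem; it only compares two hand-picked candidate lines. The points, the candidate lines A and B, and the helper names `margin` and `separates` are assumptions made for this illustration.

```python
import math

def margin(w, b, points):
    """Smallest distance from any point to the line w.x + b = 0."""
    norm = math.hypot(*w)
    return min(abs(w[0] * x1 + w[1] * x2 + b) / norm for x1, x2 in points)

def separates(w, b, green, blue):
    """True if all green points lie on the negative side and all blue on the positive side."""
    side = lambda p: w[0] * p[0] + w[1] * p[1] + b
    return all(side(p) < 0 for p in green) and all(side(p) > 0 for p in blue)

# Hypothetical two-feature dataset with two tags.
green = [(1.0, 1.0), (2.0, 1.0)]
blue = [(3.0, 3.0), (4.0, 3.0)]

# Two hypothetical separating lines, each given as (w, b).
candidates = {
    "A": ((1.0, 1.0), -4.5),   # x1 + x2 - 4.5 = 0
    "B": ((1.0, 0.0), -2.5),   # x1 - 2.5 = 0
}

# Keep only lines that really separate the classes, then compare margins.
margins = {
    name: margin(w, b, green + blue)
    for name, (w, b) in candidates.items()
    if separates(w, b, green, blue)
}
best = max(margins, key=margins.get)
print(margins, "-> preferred line:", best)
```

Both lines classify the training points correctly, but line A keeps the nearest points further away, which is the property an SVM optimizes for.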
Non-Linear SVM:
If data is linearly arranged, we can separate it with a straight line, but for non-linear data we
cannot draw a single straight line. Consider the image below:
To separate these data points, we need to add one more dimension. For linear data we used
two dimensions, x and y, so for non-linear data we add a third dimension, z. It can be
calculated as:
z = x^2 + y^2
By adding the third dimension, the sample space becomes as shown in the image below:
SVM will now divide the dataset into classes in the following way. Consider the image below:
Since we are in 3-d space, the boundary looks like a plane parallel to the x-axis. If we convert
it back to 2-d space with z = 1, it becomes the circle x^2 + y^2 = 1.
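A minimal sketch of the added dimension, assuming two hypothetical classes that lie on circles of radius 0.5 and 2 around the origin. After computing z = x^2 + y^2, a single threshold on z (here z = 1) separates them, even though no straight line in the (x, y) plane does. The names `lift` and `classify` are illustrative only.

```python
import math

def lift(x, y):
    """Third dimension used in the text: z = x^2 + y^2."""
    return x * x + y * y

# Hypothetical data: eight points on an inner circle and eight on an outer circle.
angles = [k * math.pi / 4 for k in range(8)]
inner = [(0.5 * math.cos(a), 0.5 * math.sin(a)) for a in angles]
outer = [(2.0 * math.cos(a), 2.0 * math.sin(a)) for a in angles]

def classify(x, y, threshold=1.0):
    """A plane z = threshold in 3-d corresponds to the circle x^2 + y^2 = threshold in 2-d."""
    return "inner" if lift(x, y) < threshold else "outer"

print([classify(x, y) for x, y in inner[:2]], [classify(x, y) for x, y in outer[:2]])
```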
Perceptron
The perceptron is a primary step in learning Machine Learning and Deep Learning technologies.
It consists of a set of weights, input values or scores, and a threshold. The perceptron is a
building block of an Artificial Neural Network.
The perceptron model is also treated as one of the best and simplest types of Artificial Neural
Networks. It is a supervised learning algorithm for binary classifiers. Hence, we can consider it
a single-layer neural network with four main parameters, i.e., input values, weights and bias,
net sum, and an activation function.
In Machine Learning, a binary classifier is a function that decides whether an input, represented
as a vector of numbers, belongs to some specific class.
The perceptron is a linear binary classifier. In simple words, it is a classification algorithm that
makes its prediction using a linear predictor function that combines a set of weights with the
feature vector.
Frank Rosenblatt invented the perceptron model as a binary classifier that contains three main
components. These are as follows:
o Input Nodes or Input Layer:
This is the primary component of the perceptron, which accepts the initial data into the system
for further processing. Each input node contains a real numerical value.
o Weight and Bias:
The weight parameter represents the strength of the connection between units. This is another
important parameter of the perceptron. The weight is directly proportional to the strength of
the associated input neuron in deciding the output. Further, the bias can be considered as the
intercept in a linear equation.
o Activation Function:
This is the final and important component, which helps to determine whether the neuron will
fire or not. The activation function can be considered primarily as a step function.
Common types of activation functions are:
o Sign function
o Step function, and
o Sigmoid function
The data scientist chooses the activation function based on the problem statement and the
desired outputs. The activation function may differ (e.g., Sign, Step, and Sigmoid) between
perceptron models, depending on whether the learning process is slow or has vanishing or
exploding gradients.
The step function or activation function plays a vital role in ensuring that the output is mapped
to the required range, such as (0, 1) or (-1, 1). Note that the weight of an input indicates the
strength of that input. Similarly, the bias value gives the ability to shift the activation function
curve up or down.
Step-1
In the first step, multiply all input values by their corresponding weight values and then add
them to determine the weighted sum. Mathematically, we can calculate the weighted sum as
follows:
∑wi*xi = x1*w1 + x2*w2 + … + xn*wn
Add a special term called the bias 'b' to this weighted sum to improve the model's performance:
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted sum,
which gives us output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
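The two steps can be sketched in a few lines of Python. The input values, weights, and bias below are hypothetical numbers chosen so the arithmetic is easy to follow; `step` gives a binary output and `sigmoid` a continuous one, matching the two kinds of output mentioned in Step-2.

```python
import math

def weighted_sum(inputs, weights, bias):
    """Step-1: sum of w_i * x_i over all inputs, plus the bias b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def step(z):
    """Binary activation: fires (1) only when z is positive."""
    return 1 if z > 0 else 0

def sigmoid(z):
    """Continuous activation with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical example values.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.25

z = weighted_sum(x, w, b)          # 0.5*1.0 + (-0.25)*2.0 + 0.25
print("weighted sum:", z)
print("Y with step activation:", step(z))
print("Y with sigmoid activation:", sigmoid(z))
```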
Based on the layers, perceptron models are divided into two types: the single-layer perceptron
model and the multi-layer perceptron model.
Single-layer perceptron model:
This is one of the easiest types of artificial neural network (ANN). A single-layer perceptron
model consists of a feed-forward network and also includes a threshold transfer function inside
the model. The main objective of the single-layer perceptron model is to analyze linearly
separable objects with binary outcomes.
In a single-layer perceptron model, the algorithm has no recorded data to start from, so it
begins with randomly allocated values for the weight parameters. It then sums up all the
weighted inputs. If the total sum is more than a pre-determined value (the threshold), the model
gets activated and shows the output value as +1.
If the outcome is the same as the desired value, the performance of the model is considered
satisfactory, and the weights do not change. However, discrepancies can arise when multiple
weighted input values are fed into the model. Hence, to find the desired output and minimize
errors, the weights must be adjusted.
Multi-layer perceptron model:
Like a single-layer perceptron model, a multi-layer perceptron model has the same model
structure but has a greater number of hidden layers.
The multi-layer perceptron model is trained with the backpropagation algorithm, which executes
in two stages as follows:
o Forward Stage: In the forward stage, activations are computed starting from the input layer
and terminating at the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per
the model's requirement. The error between the actual output and the desired output is
propagated backward, starting at the output layer and ending at the input layer.
Hence, a multi-layer perceptron model can be considered as a neural network with multiple
layers in which the activation function does not remain linear, unlike in a single-layer
perceptron model. Instead of a linear function, the activation function can be sigmoid, TanH,
ReLU, etc.
A multi-layer perceptron model has greater processing power and can process linear and
non-linear patterns. Further, it can also implement logic gates such as AND, OR, XOR, NAND,
NOT, XNOR, NOR.
Perceptron Function
The perceptron function f(x) is obtained by multiplying the input 'x' by the learned weight
coefficient 'w' and adding the bias 'b':
f(x) = 1 if w.x + b > 0
otherwise, f(x) = 0
Characteristics of Perceptron
o The output of a perceptron can only be a binary number (0 or 1) due to the hard limit
transfer function.
o A perceptron can only be used to classify linearly separable sets of input vectors. If
the input vectors are not linearly separable, it cannot classify them properly.
Perceptron Example
Threshold = 1.5
x1 * w1 = 1 * 0.7 = 0.7
x2 * w2 = 0 * 0.6 = 0
x3 * w3 = 1 * 0.5 = 0.5
x4 * w4 = 0 * 0.3 = 0
x5 * w5 = 1 * 0.4 = 0.4
Return true if the sum > 1.5 ("Yes I will go to the Concert")
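The arithmetic of this example can be checked with a short sketch. The inputs, weights, and threshold are taken from the lines above; the variable names are illustrative.

```python
# Inputs x1..x5 and weights w1..w5 from the example above.
inputs = [1, 0, 1, 0, 1]
weights = [0.7, 0.6, 0.5, 0.3, 0.4]
threshold = 1.5

# Weighted sum: 0.7 + 0 + 0.5 + 0 + 0.4, which is about 1.6.
total = sum(x * w for x, w in zip(inputs, weights))

# The perceptron fires when the sum exceeds the threshold.
go_to_concert = total > threshold
print(round(total, 2), go_to_concert)
```

Since the sum of about 1.6 is greater than the threshold of 1.5, the perceptron returns true.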
Logistic Regression
o Logistic regression is one of the most popular Machine Learning algorithms and comes
under the Supervised Learning technique. It is used for predicting a categorical
dependent variable from a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore,
the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1,
True or False, etc., but instead of giving the exact value 0 or 1, it gives
probabilistic values that lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression except in how it is used.
Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
o In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function, whose output is bounded by two extreme values (0 and 1).
o The curve from the logistic function indicates the likelihood of something, such as
whether cells are cancerous or not, or whether a mouse is obese or not based on its
weight.
o Logistic Regression is a significant machine learning algorithm because it can
provide probabilities and classify new data using continuous and discrete
datasets.
o Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
The below image is showing the logistic function:
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value to a value between 0 and 1.
o The output of logistic regression must lie between 0 and 1 and cannot go beyond
this limit, so it forms an "S"-shaped curve. The S-shaped curve is called the
sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which decides
between the outputs 0 and 1: values above the threshold are mapped to 1, and
values below the threshold are mapped to 0.
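A minimal sketch of the sigmoid function and the threshold rule, assuming a single feature with a hypothetical weight of 0.8 and bias of -4.0 (these numbers are not from the notes). A probability exactly equal to the threshold is mapped to 1 here, which is a design choice of this sketch.

```python
import math

def sigmoid(z):
    """Map any real value into (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(x, weights, bias):
    """Probability of class 1 for feature vector x."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

def predict(x, weights, bias, threshold=0.5):
    """Apply the threshold: probabilities at or above it map to 1, others to 0."""
    return 1 if predict_proba(x, weights, bias) >= threshold else 0

# Hypothetical fitted parameters for a one-feature model.
weights, bias = [0.8], -4.0
for value in (2.0, 5.0, 8.0):
    print(value, round(predict_proba([value], weights, bias), 3), predict([value], weights, bias))
```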
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o In Logistic Regression, y can only be between 0 and 1, so we divide y by (1 - y); the
result is 0 for y = 0 and tends to infinity as y approaches 1:
o But we need a range from -∞ to +∞, so we take the logarithm of the ratio, and the
equation becomes:
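The equations themselves did not survive in these notes. The steps described above are commonly written as follows; this is a reconstruction under assumed notation (b_0 for the intercept, b_1 ... b_n for the coefficients, x_1 ... x_n for the features), not necessarily the exact form used in the original figures.

```latex
% Assumed notation: b_0 is the intercept, b_1..b_n the coefficients, x_1..x_n the features.
\begin{align*}
  % Start from the straight-line (linear regression) equation:
  y &= b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \\
  % For 0 <= y < 1 the ratio y / (1 - y) ranges over [0, infinity):
  \frac{y}{1-y} &\in [0, \infty) \\
  % Taking the logarithm extends the range to (-infinity, +infinity):
  \log\!\left(\frac{y}{1-y}\right) &= b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \\
  % Solving for y gives the sigmoid (logistic) function:
  y &= \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \dots + b_n x_n)}}
\end{align*}
```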
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial logistic regression, the dependent variable can have only two
possible types, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial logistic regression, the dependent variable can have 3 or more
possible unordered types, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal logistic regression, the dependent variable can have 3 or more possible
ordered types, such as "Low", "Medium", or "High".
Neural networks mimic the basic functioning of the human brain and are inspired by how the
human brain interprets information. They are used to solve various real-time tasks because of
their ability to perform computations quickly and respond fast.
An Artificial Neural Network model contains various components that are inspired by the
An Artificial Neural Network has a huge number of interconnected processing elements, also
known as nodes. These nodes are connected to other nodes by connection links. Each
connection link carries a weight, and these weights hold the information about the input
signal. Each iteration and each input in turn leads to an update of these weights. After all
the data instances from the training data set have been fed in, the final weights of the neural
network, along with its architecture, are known as the trained neural network. This process is called Training
of Neural Networks. This trained neural network is used to solve specific problems as defined
Types of tasks that can be solved using an artificial neural network include Classification
Some real-life applications of neural networks include Air Traffic Control, Optical Character
Recognition as used by some scanning apps like Google Lens, Voice Recognition, etc.
(i) ANN– It is also known as an artificial neural network. It is a feed-forward neural network
because the inputs are sent in the forward direction. It can also contain hidden layers which
can make the model even denser. They have a fixed length as specified by the programmer. It
is used for Textual Data or Tabular Data. A widely used real-life application is Facial
(ii) CNN– It is also known as Convolutional Neural Networks. It is mainly used for Image
Data. It is used for Computer Vision. Some of the real-life applications are object detection in
(iii) RNN– It is also known as a Recurrent Neural Network. It is used to process and interpret
time series data. In this type of model, the output from a processing node is fed back into nodes
in the same or previous layers. The best-known type of RNN is the LSTM (Long Short-Term
Memory) network.
Now that we know the basics of neural networks, we can turn to their learning capability,
which is what makes them interesting. There are 3 types of learning in neural networks,
namely:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Supervised Learning: As the name suggests, it is a type of learning that is looked after by a
supervisor. It is like learning with a teacher. There are training pairs that each contain an
input and the desired output. The output from the model is compared with the desired output
and an error is calculated; this error signal is sent back into the network to adjust the weights.
The adjustment is repeated until no more adjustments can be made and the output of the model
matches the desired output. In this type of learning, there is feedback from the environment to
the model.
Image Source: https://bigdata-madesimple.com/machine-learning-explained-understanding-supervised-unsupervised-and-reinforcement-learning/
Unsupervised Learning: In this type of learning, there is no feedback from the environment and
no desired output; the model learns on its own. During the training phase, the inputs are formed
into classes that define the similarity of their members. Each class contains similar input patterns. On
inputting a new pattern, it can predict to which class that input belongs based on similarity
Reinforcement Learning: It gets the best of both worlds, that is, the best of both supervised
learning and unsupervised learning. It is like learning with a critic. Here there is no exact
feedback from the environment; rather, there is critic feedback. The critic tells how close
our solution is. Hence the model learns on its own based on the critic information. It is
similar to supervised learning in that it receives feedback from the environment, but it is
different in that it does not receive the desired output information; rather, it receives critic
information.
Input layer
Multiple hidden layers
Output layer
The input layer receives data represented by numeric values. The hidden layers perform most
of the computations required by the network. Finally, the output layer predicts the output.
In a neural network, each layer is made of neurons, and the neurons of one layer feed those of
the next. Once the input layer receives data, it is passed on to the hidden layer. Each input is
assigned a weight.
A weight is a value in a neural network that scales input data on its way into the network’s
hidden layers: the input layer takes the input data, and each input is multiplied by its weight
value.
This produces a value for the first hidden layer. The hidden layers transform the input data
and pass it on to the next layer. The output layer produces the desired output.
The inputs and weights are multiplied, and their sum is sent to the neurons in the hidden
layer. A bias is applied to each neuron. Each neuron adds the inputs it receives to get the sum.
This value then passes through the activation function.
The outcome of the activation function then decides whether a neuron is activated or not. An
activated neuron transfers information to the following layers. In this way, the data propagates
through the network until it reaches the output layer.
Another name for this is forward propagation. Feed-forward propagation is the process of
inputting data into an input node and getting the output through the output node. (We’ll
discuss feed-forward propagation a bit more in the section below).
Feed-forward propagation takes place when the hidden layer accepts the input data, processes
it as per the activation function, and passes it to the output. The neuron in the output layer
with the highest probability then determines the result.
If the output is wrong, backpropagation takes place. While designing a neural network,
weights are initialized for each input. Backpropagation means re-adjusting each input’s
weights to minimize the errors, thus resulting in a more accurate output.
1. The first equation calculates the intermediate output Z[1] of the first hidden layer.
2. The second equation calculates the final output A[1] of the first hidden layer.
3. The third equation calculates the intermediate output Z[2] of the output layer.
4. The fourth equation calculates the final output A[2] of the output layer, which is also the
final output of the whole neural network.
In logistic regression, to calculate the output (y = a), we used the below computation graph:
In case of a neural network with a single hidden layer, the structure will look like:
X1 \
X2 => z1 = XW1 + B1 => a1 = Sigmoid(z1) => z2 = a1W2 + B2 => a2 = Sigmoid(z2) =>
l(a2,Y)
X3 /
Can you identify the number of layers in the above neural network? Remember that while
counting the number of layers in a NN, we do not count the input layer. So, there are 2 layers
in the NN shown above, i.e., one hidden layer and one output layer.
The input layer is referred to as a[0], the hidden layer as a[1], and the output layer as a[2]. Here ‘a’
stands for activations, which are the values that the different layers of a neural network pass on
to the next layer. The corresponding parameters are w[1], b[1] and w[2], b[2]:
This is how a neural network is represented. Next we will look at how to compute the output
Let’s look in detail at how each neuron of a neural network works. Each neuron takes an
input, performs some operation on it (calculates z = w^T x + b), and then applies the
sigmoid function:
This step is performed by each neuron. The equations for the first hidden layer with four
So, for a given input x, the outputs for each layer will be:
z[1] = W[1]x + b[1]
a[1] = 𝛔(z[1])
z[2] = W[2]a[1] + b[2]
a[2] = 𝛔(z[2])
To compute these outputs, we need to run a for loop which will calculate these values
individually for each neuron. But recall that using a for loop will make the computations very
slow, and hence we should optimize the code to get rid of this for loop and run it faster.
The non-vectorized form of computing the output from a neural network is:
for i = 1 to m:
    z[1](i) = W[1]x(i) + b[1]
    a[1](i) = 𝛔(z[1](i))
    z[2](i) = W[2]a[1](i) + b[2]
    a[2](i) = 𝛔(z[2](i))
Using this for loop, we are calculating z and a value for each training example separately.
Now we will look at how it can be vectorized. All the training examples will be merged in a
single matrix X:
Here, nx is the number of features and m is the number of training examples. The vectorized
form is:
Z[1] = W[1]X + b[1]
A[1] = 𝛔(Z[1])
Z[2] = W[2]A[1] + b[2]
A[2] = 𝛔(Z[2])
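The equivalence of the loop and the vectorized form can be checked with a NumPy sketch. It assumes the column convention in which X has shape (nx, m), a hidden layer of four units, and randomly generated data and parameters; all sizes and values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
nx, n_h, m = 3, 4, 5                      # features, hidden units, training examples (assumed sizes)
X = rng.standard_normal((nx, m))          # one training example per column
W1 = rng.standard_normal((n_h, nx)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.01
b2 = np.zeros((1, 1))

# Non-vectorized: one training example at a time.
A2_loop = np.zeros((1, m))
for i in range(m):
    x_i = X[:, i:i + 1]                   # column vector of shape (nx, 1)
    z1 = W1 @ x_i + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    A2_loop[:, i:i + 1] = sigmoid(z2)

# Vectorized: all m examples in one pass; b1 and b2 are broadcast across columns.
Z1 = W1 @ X + b1
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2
A2_vec = sigmoid(Z2)

print(np.allclose(A2_loop, A2_vec))
```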
Activation Function
While calculating the output, an activation function is applied. The choice of an activation
function highly affects the performance of the model. So far, we have used the sigmoid
activation function:
However, this might not be the best option in some cases. Why? Because at the extreme ends of
the graph, the derivative will be close to zero and hence the gradient descent will update the
There are other functions which can replace this activation function:
tanh:
We can choose different activation functions depending on the problem we’re trying to
solve.
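The point about small derivatives at the extreme ends of the curve can be made concrete with a short sketch. It compares the slopes of sigmoid, tanh, and ReLU (ReLU is mentioned earlier in these notes) at a few hypothetical input values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    """Derivative of sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

def d_relu(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

# Slopes at the centre of the curve and further out towards the flat region.
slopes = {z: (d_sigmoid(z), d_tanh(z), d_relu(z)) for z in (0.0, 2.0, 10.0)}
for z, (ds, dt, dr) in slopes.items():
    print(f"z={z:5.1f}  sigmoid'={ds:.2e}  tanh'={dt:.2e}  relu'={dr:.1f}")
```

For large inputs the sigmoid and tanh slopes shrink towards zero, so gradient descent updates become very small, while the ReLU slope stays at 1 for positive inputs.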
If we use linear activation functions on the output of the layers, it will compute the output as
Z = WX + b
In case of linear activation functions, the output will be equal to Z (instead of calculating any
non-linear activation):
A=Z
Using linear activation is essentially pointless. The composition of two linear functions is
itself a linear function, and unless we use some non-linear activations, we are not computing
more interesting functions. That’s why most experts stick to using non-linear activation
functions.
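That two stacked linear layers collapse into one can be verified numerically. In this sketch the layer sizes and the random values are assumptions; `W_eq` and `b_eq` are the weights and bias of the equivalent single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))            # 3 features, 5 examples (assumed sizes)
W1 = rng.standard_normal((4, 3))
b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((1, 4))
b2 = rng.standard_normal((1, 1))

# Two layers with linear activation (A = Z).
A1 = W1 @ X + b1
A2 = W2 @ A1 + b2

# The same mapping written as a single linear layer:
# W2 (W1 X + b1) + b2 = (W2 W1) X + (W2 b1 + b2)
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
A2_single = W_eq @ X + b_eq

print(np.allclose(A2, A2_single))
```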
There is only one scenario where we tend to use a linear activation function. Suppose we
want to predict the price of a house (which can be any positive real number). If we use a
sigmoid or tanh function, the output will range over (0,1) or (-1,1), respectively. But the
price can be greater than 1 as well. In this case, we will use a linear activation function at the
output layer.
Once we have the outputs, what’s the next step? We want to perform backpropagation in
The parameters which we have to update in a two-layer neural network are: w[1], b[1], w[2]
and b[2], and the cost function which we will be minimizing is:
Repeat:
    Compute predictions (y'(i), i = 1,...m)
    Get derivatives: dW[1], db[1], dW[2], db[2]
    Update: W[1] = W[1] - ⍺ * dW[1]
            b[1] = b[1] - ⍺ * db[1]
            W[2] = W[2] - ⍺ * dW[2]
            b[2] = b[2] - ⍺ * db[2]
Let’s quickly look at the forward and backpropagation steps for a two-layer neural network.
Forward propagation:
Z[1] = W[1] * A[0] + b[1] # A[0] = X
A[1] = g[1](Z[1])
Z[2] = W[2] * A[1] + b[2]
A[2] = g[2](Z[2])
Backpropagation:
dZ[2] = A[2] - Y
dW[2] = (dZ[2] * A[1].T) / m
db[2] = Sum(dZ[2]) / m
dZ[1] = (W[2].T * dZ[2]) * g'[1](Z[1]) # element wise product (*)
dW[1] = (dZ[1] * A[0].T) / m # A[0] = X
db[1] = Sum(dZ[1]) / m
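The update loop and the backpropagation equations above can be put together in a NumPy sketch. Several choices here are assumptions not stated in the notes: tanh as the hidden activation g[1], sigmoid at the output, binary cross-entropy as the cost (the choice for which dZ[2] = A[2] - Y holds), a tiny OR-style dataset, four hidden units, and a learning rate of 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, p):
    Z1 = p["W1"] @ X + p["b1"]
    A1 = np.tanh(Z1)                      # assumed hidden activation g[1]
    Z2 = p["W2"] @ A1 + p["b2"]
    A2 = sigmoid(Z2)                      # assumed output activation
    return A1, A2

def backward(X, Y, p, A1, A2):
    m = X.shape[1]
    dZ2 = A2 - Y                          # valid for sigmoid output + cross-entropy cost
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (p["W2"].T @ dZ2) * (1.0 - A1 ** 2)   # element-wise product with g'[1](Z1) for tanh
    dW1 = dZ1 @ X.T / m                   # A[0] = X
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

def cost(A2, Y, eps=1e-12):
    """Binary cross-entropy averaged over the m examples."""
    return float(-np.mean(Y * np.log(A2 + eps) + (1 - Y) * np.log(1 - A2 + eps)))

# Hypothetical data: the logical OR of two binary inputs, one example per column.
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0.0, 1.0, 1.0, 1.0]])

rng = np.random.default_rng(0)
n_h = 4
params = {
    "W1": rng.standard_normal((n_h, 2)) * 0.01, "b1": np.zeros((n_h, 1)),
    "W2": rng.standard_normal((1, n_h)) * 0.01, "b2": np.zeros((1, 1)),
}

alpha = 0.5                               # learning rate (assumed)
costs = []
for _ in range(500):                      # "Repeat:"
    A1, A2 = forward(X, params)           # compute predictions
    costs.append(cost(A2, Y))
    grads = backward(X, Y, params, A1, A2)          # get derivatives
    for name in params:                   # update each parameter
        params[name] = params[name] - alpha * grads[name]

print(f"cost: {costs[0]:.4f} -> {costs[-1]:.4f}")
```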
These are the complete steps a neural network performs to generate outputs. Note that we
have to initialize the weights (W) in the beginning which are then updated in the
Random Initialization
We have previously seen that the weights are initialized to 0 in case of a logistic regression
algorithm. But should we initialize the weights of a neural network to 0? It’s a pertinent
No matter how many units we use in a layer, we are always getting the same output which is
similar to that of using a single unit. So, instead of initializing the weights to 0, we randomly
We multiply the weights by 0.01 to initialize small weights. If we initialize large weights,
the activations will be large, resulting in a slope close to zero (in the case of the sigmoid and
tanh activation functions). Hence, learning will be slow. So we generally initialize small
weights randomly.
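A small NumPy sketch of the two initializations, assuming a hidden layer of four tanh units and three input features (sizes and data are illustrative). It only inspects the first forward pass: with zero weights every hidden unit produces the same output, while small random weights make the units differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h = 3, 4                           # assumed layer sizes
X = rng.standard_normal((n_x, 5))         # 5 hypothetical examples

W1_zero = np.zeros((n_h, n_x))                        # zero initialization
W1_rand = rng.standard_normal((n_h, n_x)) * 0.01      # small random initialization
b1 = np.zeros((n_h, 1))                               # biases can start at zero

A1_zero = np.tanh(W1_zero @ X + b1)
A1_rand = np.tanh(W1_rand @ X + b1)

# Compare every hidden unit (row) with the first one.
units_identical_zero = bool(np.allclose(A1_zero, A1_zero[0]))
units_identical_rand = bool(np.allclose(A1_rand, A1_rand[0]))
print(units_identical_zero, units_identical_rand)
```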
Training a network: loss functions, back propagation and stochastic gradient descent:
Loss Functions
The loss function in a neural network quantifies the difference between the expected outcome
Loss functions are mainly classified into two categories: classification loss and regression
loss. Classification loss applies when the aim is to predict the output from a set of categorical
values. For example, if we have a dataset of handwritten images and the digit to be predicted
lies between 0 and 9, classification loss is used.
Whereas if the problem is regression like predicting the continuous values for example, if
need to predict the weather conditions or predicting the prices of houses on the basis of some
Huber Loss
Squared Hinge
Mean absolute error (MAE), also called L1 loss, is a loss function used for regression
problems. It represents the difference between the original and predicted values, obtained by
averaging the absolute differences over the data set.
MAE is not sensitive to outliers. Given several examples with the same input feature values,
the optimal prediction is their median target value. This should be compared with Mean
Squared Error, where the optimal prediction is the mean. A disadvantage of MAE is that the
gradient magnitude does not depend on the error size, only on the sign of y - ŷ. As a result,
the gradient magnitude is large even when the error is small, which in turn can lead to
convergence problems.
Use Mean absolute error when you are doing regression and don’t want outliers to play a
big role. It can also be useful if you know that your distribution is multimodal, and it’s
desirable to have predictions at one of the modes, rather than at the mean of them.
Example: When doing image reconstruction, MAE encourages less blurry images compared
to MSE. This is used for example in the paper Image-to-Image Translation with Conditional
Adversarial Networks by Isola et al.
Mean Squared Error (MSE), also called L2 loss, is also a loss function used for regression. It is the average of the squared differences between the original and predicted values over the data set.
MSE is sensitive to outliers. Given several examples with the same input feature values, the optimal prediction is their mean target value. This should be compared with Mean Absolute Error, where the optimal prediction is the median. MSE is thus a good choice if you believe that your target data, conditioned on the input, is normally distributed around a mean value, and when it is important to penalize outliers heavily.
Use MSE when you are doing regression, believe that your target, conditioned on the input, is normally distributed, and want large errors to be penalized significantly (quadratically) more than small ones.
Example: You want to predict future house prices. The price is a continuous value, and
therefore we want to do regression. MSE can here be used as the loss function.
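The median-versus-mean contrast between MAE and MSE described above can be checked numerically. The following is a minimal sketch in plain Python; the target values (including the outlier 100.0) and the helper names `mae` and `mse` are illustrative assumptions, not part of the notes.

```python
from statistics import mean, median

def mae(targets, prediction):
    """Mean absolute error of one constant prediction for all targets."""
    return sum(abs(t - prediction) for t in targets) / len(targets)

def mse(targets, prediction):
    """Mean squared error of one constant prediction for all targets."""
    return sum((t - prediction) ** 2 for t in targets) / len(targets)

# Hypothetical targets that share the same input features; 100.0 is an outlier.
targets = [1.0, 2.0, 3.0, 4.0, 100.0]

# Try constant predictions on a 0.1 grid between 0 and 100.
candidates = [i / 10 for i in range(0, 1001)]
best_mae = min(candidates, key=lambda c: mae(targets, c))
best_mse = min(candidates, key=lambda c: mse(targets, c))

print("MAE minimiser:", best_mae, "median:", median(targets))
print("MSE minimiser:", best_mse, "mean:", mean(targets))
```

The MAE minimiser stays with the bulk of the data, while the MSE minimiser is pulled toward the outlier, which is the sensitivity difference the text describes.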
Huber Loss
Huber Loss is typically used in regression problems. It’s less sensitive to outliers than the
MSE as it treats error as square only inside an interval.
Consider an example where we have a dataset of 100 values we would like our model to be
trained to predict. Out of all that data, 25% of the expected values are 5 while the other 75%
are 10.
An MSE loss wouldn’t quite do the trick, since we don’t really have “outliers”; 25% is by no
means a small fraction. On the other hand, we don’t necessarily want to weigh that 25% too
low with an MAE. Those values of 5 aren’t close to the median (10 — since 75% of the points
have a value of 10), but they’re also not really outliers.
The Huber Loss offers the best of both worlds by balancing the MSE and MAE together. We can define it using the following piecewise function:
$$L_\delta(y, \hat{y}) = \begin{cases} \tfrac{1}{2}\,(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta \\ \delta\,|y - \hat{y}| - \tfrac{1}{2}\,\delta^2 & \text{otherwise} \end{cases}$$
Here, delta (𝛿) is a hyperparameter that defines the ranges in which the loss behaves like MSE and like MAE.
In simple terms, the function says: for errors smaller than delta (𝛿), use the MSE; for errors greater than delta, use the MAE. This way Huber loss provides the best of both MAE and MSE.
Since Huber loss combines MAE and MSE, use it as your loss function when you think outliers should not be given a high weight. The delta (𝛿) value has to be chosen manually. Generally, some iterations with the algorithm in use are needed to find a suitable delta value.
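As a rough illustration of the piecewise behaviour, the sketch below implements the standard Huber loss for a single prediction. The function name `huber` and the default `delta=1.0` are assumptions chosen for the example.

```python
def huber(y_true, y_pred, delta=1.0):
    """Huber loss for one example: quadratic near zero, linear in the tails."""
    error = abs(y_true - y_pred)
    if error <= delta:
        # MSE-like region: small errors are squared.
        return 0.5 * error ** 2
    # MAE-like region: large errors grow only linearly.
    return delta * (error - 0.5 * delta)

small = huber(0.0, 0.5)        # inside the quadratic region
large = huber(0.0, 3.0)        # inside the linear region
at_boundary = huber(0.0, 1.0)  # both branches agree at |error| == delta
print(small, large, at_boundary)
```

The two branches meet at |error| = delta, so the loss is continuous there; changing `delta` moves the point at which outliers stop being penalized quadratically.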
The concept of cross-entropy traces back to the field of Information Theory, where Claude Shannon introduced the concept of entropy in 1948. Before diving into the Cross-Entropy loss function, let us talk about entropy.
For instance, consider two gases in a box: initially, the system has low entropy, in that the two gases are completely separable (a skewed distribution); after some time, however, the gases blend (a distribution where events have equal probability), so the system’s entropy increases. It is said that in an isolated system the entropy never decreases: the chaos never dims down without external influence.
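Entropy
The entropy of a random variable X with probability distribution p(x) is defined as
$$H(X) = -\sum_{x} p(x)\,\log p(x)$$
Reason for the negative sign: log(p(x)) < 0 for all p(x) in (0,1). p(x) is a probability distribution, and therefore its values must range between 0 and 1.
Figure: a plot of log(x). For x values between 0 and 1, log(x) < 0 (is negative).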
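To connect the gas example with numbers, the following sketch computes Shannon entropy for a skewed and for a uniform distribution. The base-2 logarithm and the two example distributions are assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

skewed = [0.99, 0.01]   # nearly certain outcome: low entropy
uniform = [0.5, 0.5]    # equally likely outcomes: maximal entropy for two events

print("skewed :", entropy(skewed))
print("uniform:", entropy(uniform))
```

The leading minus sign makes the result non-negative, because every `log2(p)` term is negative for p between 0 and 1.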
Cross-Entropy loss is also called logarithmic loss, log loss, or logistic loss. Each predicted class probability is compared to the actual class desired output, 0 or 1, and a score/loss is calculated that penalizes the probability based on how far it is from the actual expected value. The penalty is logarithmic in nature, yielding a large score for large differences close to 1 and a small score for small differences tending to 0.
$$H(p, q) = -\sum_{x} p(x)\,\log q(x)$$
Here x ranges over the possible classes, p(x) is the probability distribution of the “true” label from the training samples, and q(x) is the distribution estimated by the ML algorithm.
The graph above shows the range of possible loss values given a true observation. As the
predicted probability approaches 1, log loss slowly decreases. As the predicted probability
decreases, however, the log loss increases rapidly. Log loss penalizes both types of errors, but
especially those predictions that are confident and wrong!
The cross-entropy method is a Monte Carlo technique for importance sampling and optimization. Despite the similar name, it is a separate technique from the cross-entropy loss described here.
Binary Cross-Entropy
Binary cross-entropy is a loss function that is used in binary classification tasks. These are tasks that answer a question with only two choices (yes or no, A or B, 0 or 1, left or right).
In binary classification, where the number of classes M equals 2, cross-entropy can be calculated as:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\,Y_i \log \hat{Y}_i + (1 - Y_i)\,\log(1 - \hat{Y}_i)\,\right]$$
Sigmoid is the activation function normally paired with the binary cross-entropy loss function. Use it in the last layer of the network, directly before the loss.
The binary cross-entropy needs to compute the logarithms of Ŷi and (1-Ŷi), which only exist if Ŷi is between 0 and 1. The sigmoid activation function guarantees that each output lies within this range.
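The interplay between the sigmoid output and the logarithms in binary cross-entropy can be sketched as follows. The logits, labels and function names are hypothetical; this is an illustration in plain Python, not any particular library's implementation.

```python
import math

def sigmoid(z):
    """Squash a raw score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob):
    """Average binary cross-entropy over a batch of (label, probability) pairs."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob))
    return -total / len(y_true)

logits = [2.0, -1.0, 0.5]   # hypothetical raw network outputs
labels = [1, 0, 1]          # hypothetical true classes
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(labels, probs)

# A confident wrong prediction is penalized far more than a mildly wrong one.
confident_wrong = binary_cross_entropy([1], [0.01])
mildly_wrong = binary_cross_entropy([1], [0.4])
print(loss, confident_wrong, mildly_wrong)
```

Because every probability comes out of the sigmoid, both `log(p)` and `log(1 - p)` are defined for these inputs.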
Categorical Cross-Entropy
Softmax is the only activation function recommended to use with the categorical cross-
entropy loss function.
Strictly speaking, the output of the model only needs to be positive so that the logarithm of
every output value Ŷi exists. However, the main appeal of this loss function is for comparing
two probability distributions. The softmax activation rescales the model output so that it has
the right properties.
Sparse categorical cross-entropy has the same loss function as the categorical cross-entropy mentioned above. The only difference is the format in which the true labels 𝑌𝑖 are given.
If your Yi’s are one-hot encoded, use categorical_crossentropy. Examples for a 3-class classification: [1,0,0], [0,1,0], [0,0,1]
But if your Yi’s are integers, use sparse_categorical_crossentropy. Examples for the above 3-class classification problem, using zero-based class indices: [0], [1], [2]
The usage entirely depends on how you load your dataset. One advantage of using sparse categorical cross-entropy is that it saves memory as well as computation, because it uses a single integer for a class rather than a whole vector.
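That the two label formats lead to the same loss value can be illustrated with a small sketch. The helper functions below are hypothetical plain-Python stand-ins written for this example (they are not the Keras functions), and the integer label is assumed to be a zero-based class index.

```python
import math

def softmax(logits):
    """Rescale raw scores into a probability distribution."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_ce(one_hot, probs):
    """Cross-entropy with a one-hot encoded true label."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))

def sparse_categorical_ce(class_index, probs):
    """Cross-entropy with the true label given as a zero-based integer."""
    return -math.log(probs[class_index])

probs = softmax([2.0, 1.0, 0.1])          # hypothetical logits for 3 classes
loss_one_hot = categorical_ce([0, 1, 0], probs)
loss_sparse = sparse_categorical_ce(1, probs)
print(loss_one_hot, loss_sparse)
```

Only the term of the true class survives in the one-hot sum, which is why the integer form can skip the vector entirely.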
Relative Entropy (Kullback–Leibler divergence)
The relative entropy (also called Kullback–Leibler divergence) is a measure of how different two probability distributions are. It was introduced by Solomon Kullback and Richard Leibler in a 1951 paper. KL divergence measures how far a probability distribution diverges from a baseline distribution. That is, for a target distribution P, we compare a competing distribution Q by computing the expected value, under P, of the log-ratio of the two distributions:
$$D_{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$$
If P and Q represent the probability distributions of a discrete random variable, the Kullback–Leibler divergence is calculated as a summation:
$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x)\,\log \frac{P(x)}{Q(x)}$$
Also, with a little bit of work, we can show that the KL divergence is non-negative. This means that the smallest possible value is zero (the distributions are equal) and the maximum value is infinity. We obtain infinity when P assigns positive probability to a region where Q has zero probability. Therefore, it is a common assumption that both distributions are defined on the same support.
The closer two distributions get to each other, the lower the loss becomes. In the following graph, the blue distribution is trying to model the green distribution. As the blue distribution comes closer and closer to the green one, the KL divergence loss gets closer to zero.
The lower the KL divergence value, the better we have matched the true distribution with our approximation.
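The properties listed above (zero for identical distributions, smaller values for closer distributions, infinity when the supports differ) can be checked with a short sketch of the discrete sum. The three example distributions are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D(P || Q) using the natural logarithm."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # terms with P(x) = 0 contribute nothing
        if qi == 0:
            return math.inf   # P puts mass where Q has none
        total += pi * math.log(pi / qi)
    return total

target = [0.1, 0.4, 0.5]      # hypothetical "true" distribution P
far = [0.8, 0.1, 0.1]         # poor approximation Q
near = [0.15, 0.35, 0.5]      # better approximation Q

print(kl_divergence(target, far))
print(kl_divergence(target, near))
print(kl_divergence(target, target))
```

Note that the measure is not symmetric: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.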
Squared Hinge
For a true label y in {−1, +1} and a predicted score ŷ, the squared hinge loss is
$$L(y, \hat{y}) = \max(0,\; 1 - y\,\hat{y})^2$$
Thus, the squared hinge loss → 0 when the true and predicted labels are the same and y·ŷ ≥ 1 (which is an indication that the classifier is sure that it’s the correct label).
The squared hinge loss increases quadratically with the error when the true and predicted labels are not the same, or when y·ŷ < 1 even though the true and predicted labels are the same (which is an indication that the classifier is not sure that it’s the correct label).
Compared to the traditional hinge loss (used in SVMs), larger errors are punished more significantly, whereas smaller errors are punished slightly more lightly.
Use the Squared Hinge loss function on problems involving yes/no (binary) decisions, especially when you’re not interested in knowing how certain the classifier is about the classification, namely when you don’t care about the classification probabilities. Use it in combination with the tanh() activation function in the last layer of the neural network.
A typical application is classifying email into ‘spam’ and ‘not spam’ when you’re only interested in the classification accuracy.
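The comparison with the plain hinge loss can be made concrete with a small sketch. It assumes labels encoded as -1 and +1 and a raw score as the prediction (for example a tanh output); the function names are illustrative.

```python
def hinge(y_true, y_pred):
    """Plain hinge loss, as used in SVMs; y_true is -1 or +1."""
    return max(0.0, 1.0 - y_true * y_pred)

def squared_hinge(y_true, y_pred):
    """Squared hinge loss: the hinge loss squared."""
    return hinge(y_true, y_pred) ** 2

# Correct and confident: no loss.
print(squared_hinge(+1, 1.5))
# Correct but unsure (margin below 1): a small loss, lighter than plain hinge.
print(hinge(+1, 0.5), squared_hinge(+1, 0.5))
# Wrong sign: a large loss, heavier than plain hinge.
print(hinge(+1, -1.0), squared_hinge(+1, -1.0))
```

Squaring shrinks losses below 1 and amplifies losses above 1, which is the "larger errors punished more, smaller errors punished less" behaviour described above.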
Back Propagation
Consider the following Back propagation neural network example diagram to understand:
5. Travel back from the output layer to the hidden layer to adjust the weights such that
the error is decreased.
A feedforward neural network is an artificial neural network where the nodes never form a
cycle. This kind of neural network has an input layer, hidden layers, and an output layer. It is
the first and simplest type of artificial neural network.
The two types of backpropagation networks are:
Static Back-propagation
Recurrent Backpropagation
Static back-propagation:
This kind of backpropagation network produces a mapping of a static input to a static output. It is useful for solving static classification problems such as optical character recognition.
Recurrent Backpropagation:
In recurrent backpropagation, the activations are fed forward until a fixed value is reached. After that, the error is computed and propagated backward.
The main difference between the two methods is that the mapping is rapid in static back-propagation, while it is nonstatic in recurrent backpropagation.
Example:
Input values
x1=0.05
x2=0.10
Initial weights
w1=0.15 w5=0.40
w2=0.20 w6=0.45
w3=0.25 w7=0.50
w4=0.30 w8=0.55
Bias values
b1=0.35 b2=0.60
Target values
T1=0.01
T2=0.99
Forward Pass
To find the value of H1, we multiply the input values by the weights and add the bias:
H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775
H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925
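Passing H1 and H2 through the sigmoid activation gives the hidden-layer outputs 0.593269992 and 0.596884378, which are the values used in the next step.
Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2.
To find the value of y1, we multiply the hidden-layer outputs, i.e., the sigmoid outputs of H1 and H2, by the weights and add the bias: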
y1=H1×w5+H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597
y2=H1×w7+H2×w8+b2
y2=0.593269992×0.50+0.596884378×0.55+0.60
y2=1.2249214
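Passing y1 and y2 through the sigmoid activation gives the network outputs, approximately 0.75136507 and 0.77292847. Our target values are 0.01 and 0.99, so the outputs do not match our target values T1 and T2.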
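The forward pass above can be reproduced with a few lines of Python. This is a minimal sketch: it assumes a sigmoid activation on both layers (which is consistent with the intermediate values 0.593269992 and 0.596884378 used in the notes) and a total error of half the summed squared differences; the variable names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values taken from the example above.
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

# Hidden layer: weighted sum, then sigmoid (assumed activation).
h1 = x1 * w1 + x2 * w2 + b1
h2 = x1 * w3 + x2 * w4 + b1
h1_out, h2_out = sigmoid(h1), sigmoid(h2)

# Output layer: weighted sum of hidden outputs, then sigmoid.
y1 = h1_out * w5 + h2_out * w6 + b2
y2 = h1_out * w7 + h2_out * w8 + b2
y1_out, y2_out = sigmoid(y1), sigmoid(y2)

# Total error: half the sum of squared differences (assumed form).
total_error = 0.5 * (t1 - y1_out) ** 2 + 0.5 * (t2 - y2_out) ** 2

print(h1, h2)          # compare with H1 and H2 above
print(y1, y2)          # compare with y1 and y2 above
print(total_error)     # compare with the total error quoted in the notes
```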
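Now, we will find the total error, which is half the sum of the squared differences between the target outputs and the network outputs. With σ denoting the sigmoid function, the total error is calculated as
$$E_{total} = \tfrac{1}{2}\,(T_1 - \sigma(y_1))^2 + \tfrac{1}{2}\,(T_2 - \sigma(y_2))^2 = 0.298371109$$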
Now, we will backpropagate this error to update the weights using a backward pass.
To update a weight, we calculate the error corresponding to that weight from the total error. The error on weight w is calculated by differentiating the total error with respect to w.
From equation two, it is clear that we cannot differentiate it partially with respect to w5 directly, because w5 does not appear in it. We split equation one into multiple terms using the chain rule, so that we can differentiate it with respect to w5 (σ(y1) denotes the sigmoid output of y1):
$$\frac{\partial E_{total}}{\partial w_5} = \frac{\partial E_{total}}{\partial \sigma(y_1)} \cdot \frac{\partial \sigma(y_1)}{\partial y_1} \cdot \frac{\partial y_1}{\partial w_5}$$
So, we put the values of these factors in equation (3) to find the final result.
Now, we will calculate the updated weight w5new with the help of the following formula, where η is the learning rate:
$$w_5^{new} = w_5 - \eta \cdot \frac{\partial E_{total}}{\partial w_5}$$
In the same way, we calculate w6new, w7new, and w8new, and this will give us the following values
w5new=0.35891648
w6new=0.408666186
w7new=0.511301270
w8new=0.561370121
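The output-layer update can be sketched as below. The notes do not state the learning rate; 0.5 is assumed here because it reproduces the listed updated weights (w6new comes out as approximately 0.408666186). Sigmoid activations and the squared-error loss are likewise assumptions consistent with the numbers in the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [0.05, 0.10]
hidden_w = [[0.15, 0.20], [0.25, 0.30]]   # [w1, w2], [w3, w4]
output_w = [[0.40, 0.45], [0.50, 0.55]]   # [w5, w6], [w7, w8]
b1, b2 = 0.35, 0.60
targets = [0.01, 0.99]
learning_rate = 0.5                        # assumed, not given in the notes

# Forward pass.
hidden_out = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b1) for ws in hidden_w]
outputs = [sigmoid(sum(w * h for w, h in zip(ws, hidden_out)) + b2) for ws in output_w]

# Backward pass for the output layer (chain rule):
# dE/dw = (out - target) * out * (1 - out) * hidden_output
new_output_w = []
for ws, out, t in zip(output_w, outputs, targets):
    delta = (out - t) * out * (1.0 - out)
    new_output_w.append([w - learning_rate * delta * h for w, h in zip(ws, hidden_out)])

w5_new, w6_new = new_output_w[0]
w7_new, w8_new = new_output_w[1]
print(w5_new, w6_new, w7_new, w8_new)
```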
Now, we will backpropagate to our hidden layer and update the weights w1, w2, w3, and w4 as we have done with the weights w5, w6, w7, and w8.
From equation (2), it is clear that we cannot differentiate it partially with respect to w1 directly, because w1 does not appear in it. We split equation (1) into multiple terms so that we can easily differentiate it with respect to w1 as
We again split both terms, because y1 and y2 do not appear directly in E1 and E2. We split them as
Now, we find the value by putting values in equations (18) and (19) as
We calculate the partial derivative of the total net input to H1 with respect to w1 in the same way as we did for the output neuron:
So, we put these values in equation (13) to find the final result.
Now, we will calculate the updated weight w1new with the help of the following formula, where η is the learning rate:
$$w_1^{new} = w_1 - \eta \cdot \frac{\partial E_{total}}{\partial w_1}$$
In the same way, we calculate w2new, w3new, and w4new, and this will give us the following values
w1new=0.149780716
w2new=0.19956143
w3new=0.24975114
w4new=0.29950229
We have updated all the weights. We found the error 0.298371109 on the network when we fed forward the inputs 0.05 and 0.1. After the first round of backpropagation, the total error is down to 0.291027924. After repeating this process 10,000 times, the total error is down to 0.0000351085. At this point, the output neurons generate 0.015912196 and 0.984065734, i.e., values close to our targets of 0.01 and 0.99, when we feed forward 0.05 and 0.1.
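The whole procedure (forward pass, backward pass, weight update, repeat) can be sketched as a short training loop. Several things here are assumptions rather than statements from the notes: a learning rate of 0.5, sigmoid activations, a squared-error loss, and biases that are kept fixed while only the weights are updated.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_w, output_w, b1, b2):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b1) for ws in hidden_w]
    outputs = [sigmoid(sum(w * h for w, h in zip(ws, hidden)) + b2) for ws in output_w]
    return hidden, outputs

def total_error(outputs, targets):
    return sum(0.5 * (t - o) ** 2 for o, t in zip(outputs, targets))

def train_step(x, targets, hidden_w, output_w, b1, b2, lr):
    hidden, outputs = forward(x, hidden_w, output_w, b1, b2)
    # Error signal at each output neuron.
    out_delta = [(o - t) * o * (1.0 - o) for o, t in zip(outputs, targets)]
    # Error signal at each hidden neuron, using the old output weights.
    hid_delta = [
        sum(out_delta[k] * output_w[k][j] for k in range(len(outputs)))
        * hidden[j] * (1.0 - hidden[j])
        for j in range(len(hidden))
    ]
    new_output_w = [[w - lr * out_delta[k] * hidden[j] for j, w in enumerate(ws)]
                    for k, ws in enumerate(output_w)]
    new_hidden_w = [[w - lr * hid_delta[j] * x[i] for i, w in enumerate(ws)]
                    for j, ws in enumerate(hidden_w)]
    return new_hidden_w, new_output_w

x, targets = [0.05, 0.10], [0.01, 0.99]
hidden_w = [[0.15, 0.20], [0.25, 0.30]]   # [w1, w2], [w3, w4]
output_w = [[0.40, 0.45], [0.50, 0.55]]   # [w5, w6], [w7, w8]
b1, b2, lr = 0.35, 0.60, 0.5              # lr is an assumption

errors = []
first_step_hidden_w = None
for step in range(10000):
    _, outputs = forward(x, hidden_w, output_w, b1, b2)
    errors.append(total_error(outputs, targets))
    hidden_w, output_w = train_step(x, targets, hidden_w, output_w, b1, b2, lr)
    if step == 0:
        first_step_hidden_w = hidden_w    # weights after one update

_, final_outputs = forward(x, hidden_w, output_w, b1, b2)
final_error = total_error(final_outputs, targets)
print(errors[0], errors[1], final_error)
print(final_outputs)
```

Under these assumptions the first update reproduces the w1new to w4new values listed above, and the error shrinks steadily over the iterations.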
In SGD, since only one sample from the dataset is chosen at random for each iteration, the
path taken by the algorithm to reach the minima is usually noisier than your typical
Gradient Descent algorithm. But that doesn’t matter all that much because the path taken by
the algorithm does not matter, as long as we reach the minimum and with a significantly
shorter training time.
The path taken by Batch Gradient Descent is shown below:
One thing to be noted is that, as SGD is generally noisier than typical Gradient Descent, it usually takes a higher number of iterations to reach the minima, because of the randomness in its descent. Even though it requires a higher number of iterations to reach the minima than typical Gradient Descent, each iteration is computationally much cheaper. Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning algorithm.
Avoidance of Local Minima: Due to the noisy updates in SGD, it can escape from shallow local minima, although convergence to a global minimum is not guaranteed.
Disadvantages of Stochastic Gradient Descent
Noisy updates: The updates in SGD are noisy and have a high variance, which can make
the optimization process less stable and lead to oscillations around the minimum.
Slow Convergence: SGD may require more iterations to converge to the minimum since it
updates the parameters for each training example one at a time.
Sensitivity to Learning Rate: The choice of learning rate can be critical in SGD since
using a high learning rate can cause the algorithm to overshoot the minimum, while a low
learning rate can make the algorithm converge slowly.
Less Accurate: Due to the noisy updates, SGD may not converge to the exact global minimum and can result in a suboptimal solution. This can be mitigated by using techniques such as learning rate scheduling and momentum-based updates.
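The trade-off between the two methods can be seen on a toy problem. The sketch below fits a single weight w in the model y = w * x with batch gradient descent and with SGD; the synthetic data, the learning rates, the step counts and the random seed are all assumptions picked for illustration.

```python
import random

random.seed(0)

# Hypothetical data: y is roughly 3 * x with a little Gaussian noise.
xs = [i / 20 for i in range(1, 21)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]

# Closed-form least-squares slope for y = w * x, used as the reference.
w_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def batch_gd(steps=200, lr=0.5):
    """Each update uses the gradient averaged over the whole data set."""
    w = 0.0
    for _ in range(steps):
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

def sgd(steps=2000, lr=0.05):
    """Each update uses the gradient of one randomly chosen sample."""
    w = 0.0
    for _ in range(steps):
        i = random.randrange(len(xs))
        grad = (w * xs[i] - ys[i]) * xs[i]
        w -= lr * grad
    return w

w_batch = batch_gd()
w_sgd = sgd()
print(f"least squares: {w_star:.4f}")
print(f"batch GD     : {w_batch:.4f}")
print(f"SGD          : {w_sgd:.4f}")
```

Batch gradient descent settles on the least-squares value, while SGD with a constant learning rate keeps fluctuating in a small neighbourhood of it; each SGD update, however, touches one sample instead of all twenty.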
The universal approximation theorem states that any continuous function f: [0, 1]^n → [0, 1] can be approximated arbitrarily well by a neural network with at least one hidden layer and a finite number of weights, which is what we are going to illustrate in the next subsections.
Say we want to approximate a function with 1 input and 1 output.
We will first consider a simple NN with 2 hidden neurons that have a sigmoid activation function, and for now the output neuron will just be linear.
Let’s focus on the top hidden neuron first. By using a large weight on the top neuron we can approximate a step function with a sigmoid arbitrarily well, and by adjusting the bias we can place the step anywhere. In this toy example, we won’t be interested in changing the weights of the first layer; they just have to be high enough, so we will consider them to be constant. Additionally, to make the plots clearer, we will display the position of the step instead of the bias, which is easily computed as s = −b/w, where w is the weight and b the bias of the hidden neuron. With these changes, the plot above is expressed in terms of the step position.
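The step construction described above can be sketched numerically. The weight of 1000, the step positions and the bump height below are assumptions chosen for illustration; the two-neuron "bump" is one common way to use the 2-hidden-neuron network with a linear output, not necessarily the exact construction in the original figures.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid for large positive or negative inputs."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def hidden_neuron(x, w, b):
    return sigmoid(w * x + b)

# A large weight makes the sigmoid look like a step located at s = -b / w.
w = 1000.0                  # assumed "high enough" first-layer weight
b = -400.0
step_position = -b / w

# Two hidden neurons and a linear output: a bump between s1 and s2.
def bump(x, s1=0.4, s2=0.6, height=0.8):
    top = hidden_neuron(x, w, -s1 * w)      # steps up at s1
    bottom = hidden_neuron(x, w, -s2 * w)   # steps up at s2
    return height * top - height * bottom   # linear output neuron

print(step_position)
print(hidden_neuron(0.39, w, b), hidden_neuron(0.41, w, b))
print(bump(0.2), bump(0.5), bump(0.8))
```

Summing many such bumps of different heights and positions gives a piecewise-constant approximation of a one-dimensional function, which is the intuition behind the theorem.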