
Unit No. 2
Unit 2: Supervised Learning: Regression

Prof. Sachin Sambhaji Patil
D. Y. Patil University Ambi, Pune
Supervised Machine Learning

• Supervised learning is a type of machine learning in which machines are trained using well-labelled training data; on the basis of that data, the machines predict the output.
• Labelled data means input data that is already tagged with the correct output.
Supervised Machine Learning

• In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly.

• It applies the same concept as a student learning under the supervision of a teacher.


Supervised Machine Learning

• Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).


Supervised Machine Learning

• In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.


How Supervised Learning Works?

• In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on test data (a held-out subset of the data), and then it predicts the output.



Supervised Machine Learning
• Suppose we have a dataset of different types of shapes, including squares, rectangles, triangles, and polygons. The first step is to train the model on each shape:

• If the given shape has four sides, and all the sides are equal, it will be labelled as a Square.

• If the given shape has three sides, it will be labelled as a Triangle.

• If the given shape has six equal sides, it will be labelled as a Hexagon.
Supervised Machine Learning
• Now, after training, we test our model using the test set, and the task of the model is to identify the shape.

• The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides and predicts the output.


Supervised Machine Learning
• Steps Involved in Supervised Learning:
1. First, determine the type of training dataset.
2. Collect/gather the labelled training data.
3. Split the dataset into a training dataset, a test dataset, and a validation dataset.
4. Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
5. Determine a suitable algorithm for the model, such as a support vector machine, decision tree, etc.
6. Execute the algorithm on the training dataset. Sometimes we need validation sets as control parameters; these are subsets of the training dataset.
7. Evaluate the accuracy of the model by providing the test set. If the model predicts the correct output, the model is accurate.
Types of Supervised Machine learning Algorithms:



Types of Supervised Machine learning Algorithms:
• Regression
• Regression algorithms are used if there is a relationship between the input variable and the output variable.
• It is used for the prediction of continuous variables, such as weather forecasting, market trends, etc.
• Below are some popular Regression algorithms which come under supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
Types of Supervised Machine learning Algorithms:
• Classification
• Classification algorithms are used when the output variable is categorical, meaning there are two classes, such as Yes-No, Male-Female, True-False, etc. Example: spam filtering.
• Below are some popular Classification algorithms which come under supervised learning:
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
Types of Supervised Machine learning Algorithms:
• Supervised learning has two types:
• Classification: predicts the class of the dataset based on the independent input variables. The class is a categorical or discrete value, e.g., whether the image of an animal is a cat or a dog.

• Regression: predicts continuous output variables based on the independent input variables, e.g., the prediction of house prices based on parameters such as house age, distance from the main road, location, area, etc.
Types of Supervised Machine learning Algorithms:
• Linear Regression
• Linear regression is a type of supervised machine learning algorithm that computes the linear relationship between a dependent variable and one or more independent features.

• When the number of independent features is 1, it is known as Univariate (Simple) Linear Regression; with more than one feature, it is known as Multiple (Multivariate) Linear Regression.


Types of Supervised Machine learning Algorithms:
• Linear Regression
• The goal of the algorithm is to find the best linear equation that can predict the value of the dependent variable based on the independent variables.

• The equation provides a straight line that represents the relationship between the dependent and independent variables.

• The slope of the line indicates how much the dependent variable changes for a unit change in the independent variable.
Linear Regression: Linear Models

• Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x); hence the name Linear Regression.
• For example, X (input) may be the work experience and Y (output) the salary of a person.
• The regression line is the best-fit line for our model.
Linear Regression: Linear Models

• Our independent feature is the experience, X, and the respective salary, Y, is the dependent variable. If we assume a linear relationship between X and Y, the salary can be predicted using:

  y = θ1 + θ2·x


Linear Regression: Linear Models

• The model gets the best regression fit line by finding the
best θ1 and θ2 values.
• θ1: intercept
• θ2: coefficient of x
• Once we find the best θ1 and θ2 values, we get the best-fit
line. So when we are finally using our model for prediction,
it will predict the value of y for the input value of x.
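As a quick illustration, here is a minimal sketch of fitting a simple linear regression with scikit-learn; the experience/salary numbers are made up for the example:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience (x) and salary (y)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([30000, 35000, 41000, 44000, 50000])

model = LinearRegression().fit(X, y)
print('theta1 (intercept):', model.intercept_)
print('theta2 (coefficient of x):', model.coef_[0])
print('prediction for x = 6:', model.predict([[6]])[0])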
Linear Regression: Cost Function
• The cost function or the loss function is nothing but the error or
difference between the predicted value and the true value Y.

• It is the Mean Squared Error (MSE) between the predicted value and
the true value.

• The cost function (J) can be written as:

  J(θ1, θ2) = (1/n) · Σᵢ (predᵢ − yᵢ)²,  where predᵢ = θ1 + θ2·xᵢ
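As a quick check of this formula (with made-up numbers), the MSE cost can be computed directly in NumPy:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0])   # true values
y_pred = np.array([2.5, 5.5, 6.0])   # predicted values

# J = (1/n) * sum((pred_i - y_i)^2)
J = np.mean((y_pred - y_true) ** 2)
print('MSE cost J:', J)   # 0.5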


How to update θ1 and θ2 values to get the best-fit line?

• To achieve the best-fit regression line, the model aims to predict the
target value such that the error difference between the predicted value
and the true value Y is minimum.

• So, it is very important to update the θ1 and θ2 values, to reach the best
value that minimizes the error between the predicted y value (pred) and
the true y value (y).



Gradient Descent
• A linear regression model can be trained using the optimization algorithm gradient descent, which iteratively modifies the model's parameters to reduce the mean squared error (MSE) of the model on a training dataset.

• To update the θ1 and θ2 values in order to reduce the cost function (minimizing the MSE) and achieve the best-fit line, the model uses Gradient Descent: start with random θ1 and θ2 values and then iteratively update them, moving toward the minimum cost.

• A gradient is nothing but a derivative that describes the effect on the output of the function of a small variation in its inputs.
Multidimensional Scaling (MDS)
• Used for dimensionality reduction when the input data is not linearly arranged, or when it is not known whether a linear relationship exists.

• MDS is a non-linear technique for embedding data in a lower-dimensional space.

• MDS (multidimensional scaling) is an algorithm that transforms a dataset into another dataset, usually with lower dimensions, keeping the same Euclidean distances between the points.

• It can also be used to detect outliers in a multivariate distribution.
Multidimensional Scaling (MDS)

• The main objective of MDS is to represent dissimilarities as distances between points in a low-dimensional space such that the distances correspond as closely as possible to the dissimilarities.

• It is a non-linear method to project into lower dimensions while preserving pairwise distances.


Multidimensional Scaling (MDS)

• Metric MDS calculates the distance between each pair of points in the original high-dimensional space and then maps the points to a lower-dimensional space while preserving those distances as well as possible.
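A minimal sketch of metric MDS with scikit-learn; the dataset is synthetic and the parameter choices are illustrative, not prescriptive:

import numpy as np
from sklearn.manifold import MDS

# Hypothetical data: 6 points in a 5-dimensional space
rng = np.random.RandomState(0)
X = rng.rand(6, 5)

# Embed into 2 dimensions while preserving pairwise distances
mds = MDS(n_components=2, random_state=0)
X_low = mds.fit_transform(X)

print(X_low.shape)   # (6, 2)
print(mds.stress_)   # lower stress = better distance preservation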



Differentiate the cost function (J)

• Taking partial derivatives of J with respect to the parameters gives the gradients:

  ∂J/∂θ1 = (2/n) · Σᵢ (predᵢ − yᵢ)
  ∂J/∂θ2 = (2/n) · Σᵢ (predᵢ − yᵢ) · xᵢ


Linear Regression Model
• Finding the coefficients of a linear equation that best fits the training data is the objective of linear regression.

• The coefficients can be changed by moving in the direction of the negative gradient of the Mean Squared Error with respect to the coefficients.

• If alpha (α) is the learning rate, the respective intercept and coefficient of X are updated as:

  θ1 := θ1 − α · ∂J/∂θ1
  θ2 := θ2 − α · ∂J/∂θ2
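A plain-NumPy sketch of these updates for the simple one-feature case; the data, learning rate, and iteration count are illustrative assumptions:

import numpy as np

# Hypothetical data roughly following y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

theta1, theta2 = 0.0, 0.0   # start from arbitrary values
alpha = 0.01                # learning rate
n = len(x)

for _ in range(5000):
    pred = theta1 + theta2 * x
    # Gradients of J = (1/n) * sum((pred - y)^2)
    d_theta1 = (2 / n) * np.sum(pred - y)
    d_theta2 = (2 / n) * np.sum((pred - y) * x)
    # Step against the gradient, scaled by the learning rate
    theta1 -= alpha * d_theta1
    theta2 -= alpha * d_theta2

print(theta1, theta2)   # approximately 1.0 and 2.0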




Bias-Variance Trade-Off


Model Complexity

• As model complexity (which in the case of linear regression can be thought of as the number of predictors) increases, the variance of the estimates also increases, but the bias decreases.


Use of Regularization

• We use regularization, a technique that decreases this variance at the cost of introducing some bias.

• Finding a good bias-variance trade-off allows us to minimize the model's total error.


Types of Regularization Techniques
• There are three popular regularization techniques, each of them aiming at decreasing the size of the coefficients:

1. Ridge Regression, which penalizes the sum of squared coefficients (L2 penalty).

2. Lasso Regression, which penalizes the sum of absolute values of the coefficients (L1 penalty).

3. Elastic Net, a convex combination of Ridge and Lasso (L1 + L2).


Types of Regularization Techniques
• L2 regularization takes the square of the weights, so the cost of outliers present in the data increases quadratically.

• L1 regularization takes the absolute values of the weights, so the cost only increases linearly.


Lasso Regression
• Lasso, or Least Absolute Shrinkage and Selection Operator, is quite
similar conceptually to ridge regression.

• It also adds a penalty for non-zero coefficients, but unlike ridge regression, which penalizes the sum of squared coefficients (the so-called L2 penalty), lasso penalizes the sum of their absolute values (L1 penalty).

• As a result, for high values of λ, many coefficients are exactly zeroed under lasso, which is never the case in ridge regression.
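A small sketch on synthetic data showing this contrast; the alpha values are illustrative choices, not tuned ones:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
# Only the first two features actually drive y
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.randn(100)

print('ridge:', Ridge(alpha=10.0).fit(X, y).coef_)   # all shrunk, none exactly zero
print('lasso:', Lasso(alpha=1.0).fit(X, y).coef_)    # irrelevant coefficients exactly zero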



Lasso: the loss is defined as

  L = Σᵢ (yᵢ − ŷᵢ)² + λ · Σⱼ |βⱼ|

• In lasso, one of a group of correlated predictors gets a larger coefficient, while the rest are (nearly) zeroed.
• Lasso tends to do well if there is a small number of significant parameters and the others are close to zero.
Elastic Net Regularization
• Elastic Net first emerged as a result of criticism of lasso, whose variable selection can be too dependent on the data and thus unstable.

• The solution is to combine the penalties of ridge regression and lasso to get the best of both worlds.

• Elastic Net aims at minimizing the following loss function:

  L = Σᵢ (yᵢ − ŷᵢ)² / (2n) + λ · ( (1 − α)/2 · Σⱼ βⱼ² + α · Σⱼ |βⱼ| )




Elastic Net Regularization

• where α is the mixing parameter between ridge (α = 0) and lasso (α = 1).


Elastic Net Regularization

• The elastic net penalty is a weighted sum of the L1 and L2 penalties.

• The mixing parameter, alpha (α), controls the weight of the L1 penalty relative to the L2 penalty.

• When alpha = 1, the penalty reduces to the L1 penalty (Lasso regression), and when alpha = 0, the penalty reduces to the L2 penalty (Ridge regression).


Elastic Net Regularization

• Elastic net regression is a linear regression technique that uses a penalty term to shrink the coefficients of the predictors.

• The penalty term is a combination of the l1-norm (absolute value) and the l2-norm (square) of the coefficients, weighted by a parameter called alpha.


Elastic Net Regularization

• Now, there are two parameters to tune: λ and α.

• The R package glmnet allows tuning λ via cross-validation for a fixed α, but it does not support α-tuning, so we turn to caret for that job.
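In Python, a comparable job can be done with scikit-learn's ElasticNetCV; note the naming swap: scikit-learn calls the penalty strength λ "alpha" and the mixing parameter α "l1_ratio". A hedged sketch on synthetic data:

import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.RandomState(0)
X = rng.randn(200, 8)
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(200)

# Cross-validate the penalty strength for each candidate mixing value
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
print('best mixing parameter (α in the text):', model.l1_ratio_)
print('best penalty strength (λ in the text):', model.alpha_)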



Polynomial Regression
• Polynomial Regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial. The Polynomial Regression equation is given below:

  y = b0 + b1x1 + b2x1² + b3x1³ + ...... + bnx1ⁿ


Polynomial Regression
• It is also called a special case of Multiple Linear Regression in ML, because we add some polynomial terms to the Multiple Linear Regression equation to convert it into Polynomial Regression.

• It is a linear model with some modification in order to increase the accuracy.

• The dataset used in Polynomial Regression for training is of non-linear nature.

• It makes use of a linear regression model to fit complicated and non-linear functions and datasets.
Polynomial Regression

• "In Polynomial regression, the original features are converted into


Polynomial features of required degree (2,3,..,n) and then modeled
using a linear model."
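A minimal sketch of that recipe in scikit-learn, using synthetic quadratic data and degree 2 as an illustrative choice:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 3 * x.ravel() ** 2   # underlying quadratic

# Convert x into degree-2 polynomial features, then fit a linear model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[2.0]]))   # approximately 1 + 2*2 + 3*4 = 17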





Regression Equations
• Simple Linear Regression equation:
  y = b0 + b1x .........(a)

• Multiple Linear Regression equation:
  y = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn .........(b)

• Polynomial Regression equation:
  y = b0 + b1x + b2x² + b3x³ + .... + bnxⁿ .........(c)


Linear Regression equation
• When we compare the above three equations, we can clearly see that all three are polynomial equations, differing only in the degree of the variables.

• The Simple and Multiple Linear equations are polynomial equations of degree one, and the Polynomial Regression equation is a linear equation of nth degree.

• So if we add a degree to our linear equations, they are converted into Polynomial Regression equations.
Isotonic Regression

• 'Iso' means equal and 'tonic' means stretching.

• In terms of machine learning algorithms, isotonic regression can therefore be understood as equal stretching along the linear regression line.

• It works on top of a linear regression model.


Isotonic Regression

• An isotonic regression fit must be non-decreasing (its slope is never negative), whereas a linear regression fit can be decreasing.

• This means every point in isotonic regression should be at least as high as the previous point.

• The isotonic fit can be free-form, but linear regression must produce a straight line.




Isotonic Regression
• Isotonic regression can be formulated as an optimization problem in which the goal is to find a monotonic function that minimizes the sum of the squared errors between the predicted and observed values of the target variable.

• The optimization problem can be written as follows:

  minimize Σᵢ (yᵢ − f(xᵢ))²  subject to  f(xᵢ) ≤ f(xⱼ) whenever xᵢ ≤ xⱼ


Isotonic Regression

• where xᵢ and yᵢ are the predictor and target values for the i-th data point, respectively, and f is the monotonic function being fit to the data.

• The constraint ensures that the fitted function is monotonic.


Applications of Isotonic Regression

1. Calibration of predicted probabilities: Isotonic regression can be used to adjust the predicted probabilities produced by a classifier so that they are more accurately calibrated to the true probabilities.

2. Ordinal regression: Isotonic regression can be used to model ordinal variables, which are variables that can be ranked in order (e.g., "low," "medium," and "high").


Applications of Isotonic Regression
3. Non-parametric regression: Because isotonic regression does not make any assumptions about the functional form of the relationship between the predictor and target variables, it can be used as a non-parametric regression method.

4. Imputing missing values: Isotonic regression can be used to impute missing values in a dataset by predicting them from the surrounding non-missing values.


Applications of Isotonic Regression

5. Outlier detection: Isotonic regression can be used to identify outliers in a dataset by identifying points that are significantly different from the overall trend of the data.


Isotonic Regression

• In scikit-learn, isotonic regression can be performed using the IsotonicRegression class. This class implements the isotonic regression algorithm, which fits a non-decreasing piecewise-constant function to the data.

• Below is an example of how to use the IsotonicRegression class in scikit-learn to perform isotonic regression:


Isotonic Regression

import numpy as np
from sklearn.isotonic import IsotonicRegression

# Sample data (x must be 1-D for IsotonicRegression)
x = np.arange(10)
y = np.array([1, 2, 1, 3, 4, 3, 5, 6, 6, 7])

# Create an instance of the IsotonicRegression class
ir = IsotonicRegression()

# Fit the model and transform the data in one step
y_ir = ir.fit_transform(x, y)

print('Isotonic Regression Predictions :\n', y_ir)
Logistic Regression
• Logistic regression is a supervised machine learning algorithm mainly used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class.

• Although it is used for classification, it is called logistic regression because it takes the output of a linear regression function as input and uses a sigmoid function to estimate the probability of the given class.
Logistic Regression
• The difference between linear regression and logistic regression is that linear regression output is a continuous value that can be anything, while logistic regression predicts the probability that an instance belongs to a given class or not.


Logistic Regression
• It is used for predicting the categorical dependent variable using a given set of independent variables.

• Logistic regression predicts the output of a categorical dependent variable; therefore the outcome must be a categorical or discrete value.

• It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
Logistic Regression
• Logistic Regression is much like Linear Regression except in how it is used.

• Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.

• In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).
Logistic Regression

• The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight, etc.

• Logistic Regression is a significant machine learning algorithm because it has the ability to provide probabilities and classify new data using continuous and discrete datasets.


Logistic Regression

• Logistic Regression can be used to classify observations using different types of data and can easily determine the most effective variables for the classification.


Logistic Regression : Logistic Function (Sigmoid Function)
• Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map predicted values to probabilities.
• It maps any real value to a value within the range 0 to 1.
• The value of the logistic regression must be between 0 and 1; it cannot go beyond this limit, so it forms a curve like the "S" form.
• The S-form curve is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0.
Type of Logistic Regression
• On the basis of the categories, Logistic Regression can be classified into three types:

• Binomial: In binomial Logistic Regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.

• Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".

• Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
Linear Regression vs. Logistic Regression

1. Linear regression is used to predict the continuous dependent variable using a given set of independent variables; logistic regression is used to predict the categorical dependent variable using a given set of independent variables.
2. Linear regression is used for solving regression problems; logistic regression is used for solving classification problems.
3. In linear regression we predict the values of continuous variables; in logistic regression we predict the values of categorical variables.
4. In linear regression we find the best-fit line; in logistic regression we find the S-curve.
5. Least squares estimation is used for estimation of accuracy in linear regression; maximum likelihood estimation is used in logistic regression.
6. Linear regression output must be a continuous value, such as price, age, etc.; logistic regression output must be a categorical value, such as 0 or 1, Yes or No, etc.
7. Linear regression requires a linear relationship between dependent and independent variables; logistic regression does not require a linear relationship.
8. In linear regression there may be collinearity between the independent variables; in logistic regression there should not be collinearity between the independent variables.
Logistic Regression: Sigmoid Function

• The sigmoid function takes an input z (the linear score) and returns the probability between 0 and 1, i.e., the predicted y:

  σ(z) = 1 / (1 + e^(−z))
Logistic Regression
• Logistic Regression Equation
• The odds are the ratio of something occurring to something not occurring. This differs from probability, which is the ratio of something occurring to everything that could possibly occur. So the odds are:

  odds = p / (1 − p)

• In scikit-learn: from sklearn.linear_model import LogisticRegression
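A minimal sketch of that class in use; the one-feature data is made up for the example:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: feature value (x) and binary class label (y)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5]]))         # predicted class
print(clf.predict_proba([[2.5]]))   # sigmoid-derived probabilities for classes 0 and 1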
Types of Gradient Descent:
• Typically, there are three types of Gradient Descent:
• Batch Gradient Descent

• Stochastic Gradient Descent

• Mini-batch Gradient Descent



Gradient Descent

• Gradient descent is an optimization algorithm that is used when training a machine learning model.

• It is based on a convex function and tweaks its parameters iteratively to minimize a given function to its local minimum.


WHAT IS GRADIENT DESCENT IN MACHINE LEARNING?
• Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function.

• Gradient descent in machine learning is simply used to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible.


ROLE OF GRADIENT DESCENT
• Starting from initial parameter values, the gradient descent algorithm uses calculus to iteratively adjust those values so that they minimize the given cost function.


ROLE OF GRADIENT DESCENT
• Stochastic gradient descent (SGD) does this for each training example within the dataset, meaning it updates the parameters for each training example one by one.

• Depending on the problem, this can make SGD faster than batch gradient descent.

• One advantage is that the frequent updates give us a fairly detailed rate of improvement.


ROLE OF GRADIENT DESCENT
• The frequent updates, however, are more computationally expensive than the batch gradient descent approach.

• Additionally, the frequency of those updates can result in noisy gradients, which may cause the error rate to jump around instead of slowly decreasing.


Stochastic Gradient Descent Algorithms
• The gradient descent algorithm is an approximate and iterative method for mathematical optimization.

• You can use it to approach the minimum of any differentiable function.


Stochastic Gradient Descent Algorithms
• Gradient Descent is an iterative optimization process that searches for an objective function's optimum value (minimum/maximum).

• It is one of the most used methods for changing a model's parameters in order to reduce a cost function in machine learning projects.


Stochastic Gradient Descent Algorithms
• The primary goal of gradient descent is to identify the model parameters that provide the maximum accuracy on both training and test datasets.

• In gradient descent, the gradient is a vector pointing in the general direction of the function's steepest rise at a particular point.

• The algorithm can gradually move towards lower values of the function by moving in the opposite direction of the gradient, until it reaches the minimum of the function.
Stochastic Gradient Descent
• Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that is used for optimizing machine learning models.

• It addresses the computational inefficiency of traditional Gradient Descent methods when dealing with large datasets in machine learning projects.


Stochastic Gradient Descent
• In SGD, instead of using the entire dataset for each iteration, only a single random training example (or a small batch) is selected to calculate the gradient and update the model parameters.

• This random selection introduces randomness into the optimization process, hence the term "stochastic" in Stochastic Gradient Descent.


Stochastic Gradient Descent
• The advantage of using SGD is its computational efficiency, especially when dealing with large datasets.

• By using a single example or a small batch, the computational cost per iteration is significantly reduced compared to traditional Gradient Descent methods that require processing the entire dataset.


Stochastic Gradient Descent
• Stochastic Gradient Descent Algorithm
• Initialization: Randomly initialize the parameters of the model.

• Set Parameters: Determine the number of iterations and the learning rate (alpha) for updating the parameters.


Stochastic Gradient Descent
• Stochastic Gradient Descent Loop: Repeat the following steps until the model converges or reaches the maximum number of iterations:

a. Shuffle the training dataset to introduce randomness.

b. Iterate over each training example (or a small batch) in the shuffled order.

c. Compute the gradient of the cost function with respect to the model parameters using the current training example (or batch).

d. Update the model parameters by taking a step in the direction of the negative gradient, scaled by the learning rate.

e. Evaluate the convergence criteria, such as the difference in the cost function between iterations.
Stochastic Gradient Descent
• Return Optimized Parameters: Once the convergence criteria are met or
the maximum number of iterations is reached, return the optimized model
parameters.

• In SGD, since only one sample from the dataset is chosen at random for
each iteration, the path taken by the algorithm to reach the minima is
usually noisier than your typical Gradient Descent algorithm.

• But that doesn’t matter all that much because the path taken by the
algorithm does not matter, as long as we reach the minimum and with a
significantly shorter training time.
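A compact sketch of this loop for simple linear regression in plain NumPy; the learning rate, epoch count, and data are illustrative assumptions:

import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(100)
y = 4 + 3 * x + 0.1 * rng.randn(100)   # underlying line: y = 4 + 3x

theta1, theta2 = 0.0, 0.0   # initialization
alpha = 0.1                 # learning rate

for epoch in range(50):
    for i in rng.permutation(len(x)):            # a. shuffle each epoch
        error = (theta1 + theta2 * x[i]) - y[i]  # c. gradient from ONE example
        theta1 -= alpha * error                  # d. step against the gradient
        theta2 -= alpha * error * x[i]

print(theta1, theta2)   # approximately 4 and 3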
The path taken by Batch Gradient Descent is shown below:

(Figure: batch gradient descent takes a smooth, direct path to the minimum.)


A path taken by Stochastic Gradient Descent looks as follows:

• One thing to be noted is that, as SGD is generally noisier than typical Gradient Descent, it usually takes a higher number of iterations to reach the minima, because of the randomness in its descent.

• Even though it requires a higher number of iterations to reach the minima than typical Gradient Descent, it is still computationally much less expensive.

• Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning algorithm.
Advantages of Stochastic Gradient Descent
• Speed: SGD is faster than other variants of Gradient Descent such as
Batch Gradient Descent and Mini-Batch Gradient Descent since it uses
only one example to update the parameters.

• Memory Efficiency: Since SGD updates the parameters for each training
example one at a time, it is memory-efficient and can handle large
datasets that cannot fit into memory.

• Avoidance of Local Minima: Due to the noisy updates in SGD, it has the
ability to escape from local minima and converges to a global minimum.
Disadvantages of Stochastic Gradient Descent
• Noisy updates: The updates in SGD are noisy and have a high variance,
which can make the optimization process less stable and lead to
oscillations around the minimum.
• Slow Convergence: SGD may require more iterations to converge to the
minimum since it updates the parameters for each training example one at
a time.
• Sensitivity to Learning Rate: The choice of learning rate can be critical in
SGD since using a high learning rate can cause the algorithm to overshoot
the minimum, while a low learning rate can make the algorithm converge
slowly.
• Less Accurate: Due to the noisy updates, SGD may not converge to the exact global minimum and can result in a suboptimal solution. This can be mitigated by using techniques such as learning rate scheduling and momentum-based updates.


Confusion Matrix
• The target variable has two values: Positive or Negative.

• The columns represent the actual values of the target variable.

• The rows represent the predicted values of the target variable.
Confusion Matrix

• The classification matrix is a standard tool for evaluation of statistical models and is sometimes referred to as a confusion matrix.

• A confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model.


Confusion Matrix
• A good model is one which has high TP and TN rates, and low FP and FN rates.

• A confusion matrix is a tabular summary of the number of correct and incorrect predictions made by a classifier.

• It is used to measure the performance of a classification model.

• It can be used to evaluate the performance of a classification model through the calculation of performance metrics like accuracy, precision, recall, and F1-score.
Confusion Matrix
• True Positives (TP): when the actual value is Positive and the predicted value is also Positive.
• True Negatives (TN): when the actual value is Negative and the prediction is also Negative.
• False Positives (FP): when the actual value is Negative but the prediction is Positive. Also known as a Type 1 error.
• False Negatives (FN): when the actual value is Positive but the prediction is Negative. Also known as a Type 2 error.


Classification Measure
• Classification Measure
• Basically, it is an extended version of the confusion matrix. There are
measures other than the confusion matrix which can help achieve
better understanding and analysis of our model and its
performance.



Classification Measure
a. Accuracy

b. Precision

c. Recall (TPR, Sensitivity)

d. F1-Score

e. FPR (Type I Error)

f. FNR (Type II Error)


Classification Measure: a. Accuracy

• Accuracy simply measures how often the classifier makes the correct prediction.
• It is the ratio between the number of correct predictions and the total number of predictions:

  Accuracy = (TP + TN) / (TP + TN + FP + FN)
Classification Measure

• In a two-class problem, we are often looking to discriminate observations with a specific outcome from normal observations.

• "True positive" for correctly predicted event values.

• "False positive" for incorrectly predicted event values.

• "True negative" for correctly predicted no-event values.

• "False negative" for incorrectly predicted no-event values.


Confusion Matrix

# Example of a confusion matrix in Python
from sklearn.metrics import confusion_matrix

expected = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
results = confusion_matrix(expected, predicted)
print(results)

Output:
[[4 2]
 [1 3]]
Calculate Accuracy, Error, Precision, Recall and F1 Score
for the following Confusion Matrix
                     Actual Positive   Actual Negative
Predicted Positive         10                10
Predicted Negative         25                55


Solution:
1. Accuracy = (TP + TN) / (TP + TN + FP + FN)
   = (10 + 55) / (10 + 55 + 10 + 25) = 65 / 100 = 0.65
2. Error = 1 − Accuracy = 1 − 0.65 = 0.35
3. Precision = TP / (TP + FP) = 10 / (10 + 10) = 0.5
4. Recall (Sensitivity) = TP / (TP + FN) = 10 / (10 + 25) = 0.2857
5. F1 Score = 2 × Precision × Recall / (Precision + Recall)
   = 2 × 0.5 × 0.2857 / (0.5 + 0.2857) = 0.3636
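The same numbers can be verified in a few lines of Python (plain arithmetic, no libraries needed):

TP, FP, FN, TN = 10, 10, 25, 55

accuracy  = (TP + TN) / (TP + TN + FP + FN)          # 0.65
precision = TP / (TP + FP)                           # 0.5
recall    = TP / (TP + FN)                           # 0.2857
f1 = 2 * precision * recall / (precision + recall)   # 0.3636
print(accuracy, precision, round(recall, 4), round(f1, 4))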
ROC Curve

• An ROC curve (receiver operating characteristic curve) is a graph


showing the performance of a classification model at all
classification thresholds.

• This curve plots two parameters: True Positive Rate and False
Positive Rate.



ROC Curve
• True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

  TPR = TP / (TP + FN)

• False Positive Rate (FPR) is defined as follows:

  FPR = FP / (FP + TN)

• An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
ROC Curve
• The following figure shows a typical ROC curve.



ROC Curve

• With a ROC curve, you're trying to find a good model that optimizes the trade-off between the False Positive Rate (FPR) and True Positive Rate (TPR). What counts here is how much area is under the curve (Area Under the Curve, AUC).

• An ideal curve fills in 100% of the area, which means the model can distinguish between negative results and positive results 100% of the time (which is almost impossible in real life).




ROC Curve
• A Receiver Operating Characteristic (ROC) curve is a graphical plot used to show the diagnostic ability of binary classifiers.

• A ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR).

• The true positive rate is the proportion of observations that were correctly predicted to be positive out of all positive observations (TP/(TP + FN)).


Plot ROC-AUC Curve for binary classification problem
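A hedged sketch of such a plot with scikit-learn and matplotlib, on a synthetic binary classification problem (both libraries are assumed to be installed):

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]    # probability of the positive class

fpr, tpr, _ = roc_curve(y_te, scores)     # TPR vs. FPR at each threshold
plt.plot(fpr, tpr, label='AUC = %.3f' % roc_auc_score(y_te, scores))
plt.plot([0, 1], [0, 1], linestyle='--')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()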



Thank You