
Unit No. 2
Unit 2: Supervised Learning: Regression

Prof. Sachin Sambhaji Patil
D. Y. Patil University Ambi, Pune
Supervised Machine Learning

• Supervised learning is a type of machine learning in which machines are trained using well-labelled training data; on the basis of that data, the machines predict the output.
• Labelled data means input data that is already tagged with the correct output.
Supervised Machine Learning

• In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly.

• It applies the same concept as a student learning under the supervision of a teacher.


Supervised Machine Learning

• Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).


Supervised Machine Learning

• In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.


How Supervised Learning Works?

• In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on test data (a held-out subset of the data), and then it predicts the output.



Supervised Machine Learning
• Suppose we have a dataset of different types of shapes, including squares, rectangles, triangles, and polygons. The first step is to train the model on each shape:

• If the given shape has four sides, and all the sides are equal, it will be labelled as a Square.

• If the given shape has three sides, it will be labelled as a Triangle.

• If the given shape has six equal sides, it will be labelled as a Hexagon.
Supervised Machine Learning
• Now, after training, we test our model using the test set, and the task of the model is to identify the shape.

• The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides and predicts the output.


Supervised Machine Learning
• Steps Involved in Supervised Learning:
1. First, determine the type of training dataset.
2. Collect/gather the labelled training data.
3. Split the dataset into a training dataset, a test dataset, and a validation dataset.
4. Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
5. Determine a suitable algorithm for the model, such as a support vector machine, decision tree, etc.
6. Execute the algorithm on the training dataset. Sometimes we need validation sets as control parameters; these are subsets of the training dataset.
7. Evaluate the accuracy of the model by providing the test set. If the model predicts the correct output, the model is accurate.
Types of Supervised Machine learning Algorithms:



Types of Supervised Machine learning Algorithms:
• Regression
• Regression algorithms are used if there is a relationship between the input variable and the output variable.
• It is used for the prediction of continuous variables, such as weather forecasting, market trends, etc.
• Below are some popular Regression algorithms which come under supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
Types of Supervised Machine learning Algorithms:
• Classification
• Classification algorithms are used when the output variable is categorical, meaning there are two classes, such as Yes-No, Male-Female, True-False, etc. Example: spam filtering.
• Below are some popular Classification algorithms which come under supervised learning:
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
Types of Supervised Machine learning Algorithms:
• Supervised learning has two types:
• Classification: predicts the class of the dataset based on the independent input variables. The class is a categorical or discrete value, e.g., whether the image of an animal is a cat or a dog.

• Regression: predicts continuous output variables based on the independent input variables, e.g., the prediction of house prices based on parameters such as house age, distance from the main road, location, area, etc.
Types of Supervised Machine learning Algorithms:
• Linear Regression
• Linear regression is a type of supervised machine learning algorithm that computes the linear relationship between a dependent variable and one or more independent features.

• When the number of independent features is 1, it is known as Univariate (Simple) Linear Regression; with more than one feature, it is known as Multiple (Multivariate) Linear Regression.


Types of Supervised Machine learning Algorithms:
• Linear Regression
• The goal of the algorithm is to find the best linear equation that can predict the value of the dependent variable based on the independent variables.

• The equation provides a straight line that represents the relationship between the dependent and independent variables.

• The slope of the line indicates how much the dependent variable changes for a unit change in the independent variable.
Linear Regression: Linear Models

• Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x); hence the name Linear Regression.
• For example, X (input) may be the work experience and Y (output) the salary of a person.
• The regression line is the best-fit line for our model.
Linear Regression: Linear Models

• Our independent feature is the experience, X, and the respective salary, Y, is the dependent variable. If we assume a linear relationship between X and Y, the salary can be predicted using:

  y = θ1 + θ2·x


Linear Regression: Linear Models

• The model gets the best regression fit line by finding the
best θ1 and θ2 values.
• θ1: intercept
• θ2: coefficient of x
• Once we find the best θ1 and θ2 values, we get the best-fit
line. So when we are finally using our model for prediction,
it will predict the value of y for the input value of x.
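As a quick illustration, here is a minimal sketch of fitting a simple linear regression with scikit-learn; the experience/salary numbers are made up for the example:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience (x) and salary (y)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([30000, 35000, 41000, 44000, 50000])

model = LinearRegression().fit(X, y)
print('theta1 (intercept):', model.intercept_)
print('theta2 (coefficient of x):', model.coef_[0])
print('prediction for x = 6:', model.predict([[6]])[0])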
Linear Regression: Cost Function
• The cost function or the loss function is nothing but the error or
difference between the predicted value and the true value Y.

• It is the Mean Squared Error (MSE) between the predicted value and
the true value.

• The cost function (J) can be written as:

  J(θ1, θ2) = (1/n) · Σᵢ (predᵢ − yᵢ)²,  where predᵢ = θ1 + θ2·xᵢ
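As a quick check of this formula (with made-up numbers), the MSE cost can be computed directly in NumPy:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0])   # true values
y_pred = np.array([2.5, 5.5, 6.0])   # predicted values

# J = (1/n) * sum((pred_i - y_i)^2)
J = np.mean((y_pred - y_true) ** 2)
print('MSE cost J:', J)   # 0.5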


How to update θ1 and θ2 values to get the best-fit line?

• To achieve the best-fit regression line, the model aims to predict the
target value such that the error difference between the predicted value
and the true value Y is minimum.

• So, it is very important to update the θ1 and θ2 values, to reach the best
value that minimizes the error between the predicted y value (pred) and
the true y value (y).



Gradient Descent
• A linear regression model can be trained using the optimization algorithm gradient descent, which iteratively modifies the model's parameters to reduce the mean squared error (MSE) of the model on a training dataset.

• To update the θ1 and θ2 values in order to reduce the cost function (minimizing the MSE) and achieve the best-fit line, the model uses Gradient Descent: start with random θ1 and θ2 values and then iteratively update them, moving toward the minimum cost.

• A gradient is nothing but a derivative that describes the effect on the output of the function of a small variation in its inputs.
Multidimensional Scaling (MDS)
• Used for dimensionality reduction when the input data is not linearly arranged, or when it is not known whether a linear relationship exists.

• MDS is a non-linear technique for embedding data in a lower-dimensional space.

• MDS (multidimensional scaling) is an algorithm that transforms a dataset into another dataset, usually with lower dimensions, keeping the same Euclidean distances between the points.

• It can also be used to detect outliers in a multivariate distribution.
Multidimensional Scaling (MDS)

• The main objective of MDS is to represent dissimilarities as distances between points in a low-dimensional space such that the distances correspond as closely as possible to the dissimilarities.

• It is a non-linear method to project into lower dimensions while preserving pairwise distances.


Multidimensional Scaling (MDS)

• Metric MDS calculates the distance between each pair of points in the original high-dimensional space and then maps the points to a lower-dimensional space while preserving those distances as well as possible.
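A minimal sketch of metric MDS with scikit-learn; the dataset is synthetic and the parameter choices are illustrative, not prescriptive:

import numpy as np
from sklearn.manifold import MDS

# Hypothetical data: 6 points in a 5-dimensional space
rng = np.random.RandomState(0)
X = rng.rand(6, 5)

# Embed into 2 dimensions while preserving pairwise distances
mds = MDS(n_components=2, random_state=0)
X_low = mds.fit_transform(X)

print(X_low.shape)   # (6, 2)
print(mds.stress_)   # lower stress = better distance preservation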



Differentiate the cost function (J)

• Taking partial derivatives of J with respect to the parameters gives the gradients:

  ∂J/∂θ1 = (2/n) · Σᵢ (predᵢ − yᵢ)
  ∂J/∂θ2 = (2/n) · Σᵢ (predᵢ − yᵢ) · xᵢ


Linear Regression Model
• Finding the coefficients of a linear equation that best fits the training data is the objective of linear regression.

• The coefficients can be changed by moving in the direction of the negative gradient of the Mean Squared Error with respect to the coefficients.

• If alpha (α) is the learning rate, the respective intercept and coefficient of X are updated as:

  θ1 := θ1 − α · ∂J/∂θ1
  θ2 := θ2 − α · ∂J/∂θ2
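A plain-NumPy sketch of these updates for the simple one-feature case; the data, learning rate, and iteration count are illustrative assumptions:

import numpy as np

# Hypothetical data roughly following y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

theta1, theta2 = 0.0, 0.0   # start from arbitrary values
alpha = 0.01                # learning rate
n = len(x)

for _ in range(5000):
    pred = theta1 + theta2 * x
    # Gradients of J = (1/n) * sum((pred - y)^2)
    d_theta1 = (2 / n) * np.sum(pred - y)
    d_theta2 = (2 / n) * np.sum((pred - y) * x)
    # Step against the gradient, scaled by the learning rate
    theta1 -= alpha * d_theta1
    theta2 -= alpha * d_theta2

print(theta1, theta2)   # approximately 1.0 and 2.0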




Bias-Variance Trade-Off


Model Complexity

• As model complexity (which in the case of linear regression can be thought of as the number of predictors) increases, the variance of the estimates also increases, but the bias decreases.


Use of Regularization

• We use regularization, a technique that decreases this variance at the cost of introducing some bias.

• Finding a good bias-variance trade-off allows us to minimize the model's total error.


Types of Regularization Techniques
• There are three popular regularization techniques, each of them aiming at decreasing the size of the coefficients:

1. Ridge Regression, which penalizes the sum of squared coefficients (L2 penalty).

2. Lasso Regression, which penalizes the sum of absolute values of the coefficients (L1 penalty).

3. Elastic Net, a convex combination of Ridge and Lasso (L1 + L2).


Types of Regularization Techniques
• L2 regularization takes the square of the weights, so the cost of outliers present in the data increases quadratically.

• L1 regularization takes the absolute values of the weights, so the cost only increases linearly.


Lasso Regression
• Lasso, or Least Absolute Shrinkage and Selection Operator, is quite
similar conceptually to ridge regression.

• It also adds a penalty for non-zero coefficients, but unlike ridge regression, which penalizes the sum of squared coefficients (the so-called L2 penalty), lasso penalizes the sum of their absolute values (L1 penalty).

• As a result, for high values of λ, many coefficients are exactly zeroed under lasso, which is never the case in ridge regression.
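A small sketch on synthetic data showing this contrast; the alpha values are illustrative choices, not tuned ones:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
# Only the first two features actually drive y
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.randn(100)

print('ridge:', Ridge(alpha=10.0).fit(X, y).coef_)   # all shrunk, none exactly zero
print('lasso:', Lasso(alpha=1.0).fit(X, y).coef_)    # irrelevant coefficients exactly zero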



Lasso: the loss is defined as

  L = Σᵢ (yᵢ − ŷᵢ)² + λ · Σⱼ |βⱼ|

• In lasso, one of a group of correlated predictors gets a larger coefficient, while the rest are (nearly) zeroed.
• Lasso tends to do well if there is a small number of significant parameters and the others are close to zero.
Elastic Net Regularization
• Elastic Net first emerged as a result of criticism of lasso, whose variable selection can be too dependent on the data and thus unstable.

• The solution is to combine the penalties of ridge regression and lasso to get the best of both worlds.

• Elastic Net aims at minimizing the following loss function:

  L = Σᵢ (yᵢ − ŷᵢ)² / (2n) + λ · ( (1 − α)/2 · Σⱼ βⱼ² + α · Σⱼ |βⱼ| )




Elastic Net Regularization

• where α is the mixing parameter between ridge (α = 0) and lasso (α = 1).


Elastic Net Regularization

• The elastic net penalty is a weighted sum of the L1 and L2 penalties.

• The mixing parameter, alpha (α), controls the weight of the L1 penalty relative to the L2 penalty.

• When alpha = 1, the penalty reduces to the L1 penalty (Lasso regression), and when alpha = 0, the penalty reduces to the L2 penalty (Ridge regression).


Elastic Net Regularization

• Elastic net regression is a linear regression technique that uses a penalty term to shrink the coefficients of the predictors.

• The penalty term is a combination of the l1-norm (absolute value) and the l2-norm (square) of the coefficients, weighted by a parameter called alpha.


Elastic Net Regularization

• Now, there are two parameters to tune: λ and α.

• The R package glmnet allows tuning λ via cross-validation for a fixed α, but it does not support α-tuning, so we turn to caret for that job.
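In Python, a comparable job can be done with scikit-learn's ElasticNetCV; note the naming swap: scikit-learn calls the penalty strength λ "alpha" and the mixing parameter α "l1_ratio". A hedged sketch on synthetic data:

import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.RandomState(0)
X = rng.randn(200, 8)
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(200)

# Cross-validate the penalty strength for each candidate mixing value
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
print('best mixing parameter (α in the text):', model.l1_ratio_)
print('best penalty strength (λ in the text):', model.alpha_)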



Polynomial Regression
• Polynomial Regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial. The Polynomial Regression equation is given below:

  y = b0 + b1x1 + b2x1² + b3x1³ + ...... + bnx1ⁿ


Polynomial Regression
• It is also called a special case of Multiple Linear Regression in ML, because we add some polynomial terms to the Multiple Linear Regression equation to convert it into Polynomial Regression.

• It is a linear model with some modification in order to increase the accuracy.

• The dataset used in Polynomial Regression for training is of non-linear nature.

• It makes use of a linear regression model to fit complicated and non-linear functions and datasets.
Polynomial Regression

• "In Polynomial regression, the original features are converted into


Polynomial features of required degree (2,3,..,n) and then modeled
using a linear model."
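A minimal sketch of that recipe in scikit-learn, using synthetic quadratic data and degree 2 as an illustrative choice:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 3 * x.ravel() ** 2   # underlying quadratic

# Convert x into degree-2 polynomial features, then fit a linear model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[2.0]]))   # approximately 1 + 2*2 + 3*4 = 17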





Regression Equations
• Simple Linear Regression equation:
  y = b0 + b1x .........(a)

• Multiple Linear Regression equation:
  y = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn .........(b)

• Polynomial Regression equation:
  y = b0 + b1x + b2x² + b3x³ + .... + bnxⁿ .........(c)


Linear Regression equation
• When we compare the above three equations, we can clearly see that all three are polynomial equations, differing only in the degree of the variables.

• The Simple and Multiple Linear equations are polynomial equations of degree one, and the Polynomial Regression equation is a linear equation of nth degree.

• So if we add a degree to our linear equations, they are converted into Polynomial Regression equations.
Isotonic Regression

• 'Iso' means equal and 'tonic' means stretching.

• In terms of machine learning algorithms, isotonic regression can therefore be understood as equal stretching along the linear regression line.

• It works on top of a linear regression model.


Isotonic Regression

• An isotonic regression fit must be non-decreasing (its slope is never negative), whereas a linear regression fit can be decreasing.

• This means every point in isotonic regression should be at least as high as the previous point.

• The isotonic fit can be free-form, but linear regression must produce a straight line.




Isotonic Regression
• Isotonic regression can be formulated as an optimization problem in which the goal is to find a monotonic function that minimizes the sum of the squared errors between the predicted and observed values of the target variable.

• The optimization problem can be written as follows:

  minimize Σᵢ (yᵢ − f(xᵢ))²  subject to  f(xᵢ) ≤ f(xⱼ) whenever xᵢ ≤ xⱼ


Isotonic Regression

• where xᵢ and yᵢ are the predictor and target values for the i-th data point, respectively, and f is the monotonic function being fit to the data.

• The constraint ensures that the fitted function is monotonic.


Applications of Isotonic Regression

1. Calibration of predicted probabilities: Isotonic regression can be used to adjust the predicted probabilities produced by a classifier so that they are more accurately calibrated to the true probabilities.

2. Ordinal regression: Isotonic regression can be used to model ordinal variables, which are variables that can be ranked in order (e.g., "low," "medium," and "high").


Applications of Isotonic Regression
3. Non-parametric regression: Because isotonic regression does not make any assumptions about the functional form of the relationship between the predictor and target variables, it can be used as a non-parametric regression method.

4. Imputing missing values: Isotonic regression can be used to impute missing values in a dataset by predicting them from the surrounding non-missing values.


Applications of Isotonic Regression

5. Outlier detection: Isotonic regression can be used to identify outliers in a dataset by identifying points that are significantly different from the overall trend of the data.


Isotonic Regression

• In scikit-learn, isotonic regression can be performed using the IsotonicRegression class. This class implements the isotonic regression algorithm, which fits a non-decreasing piecewise-constant function to the data.

• Below is an example of how to use the IsotonicRegression class in scikit-learn to perform isotonic regression:


Isotonic Regression

import numpy as np
from sklearn.isotonic import IsotonicRegression

# Sample data (x must be 1-D for IsotonicRegression)
x = np.arange(10)
y = np.array([1, 2, 1, 3, 4, 3, 5, 6, 6, 7])

# Create an instance of the IsotonicRegression class
ir = IsotonicRegression()

# Fit the model and transform the data in one step
y_ir = ir.fit_transform(x, y)

print('Isotonic Regression Predictions :\n', y_ir)
Logistic Regression
• Logistic regression is a supervised machine learning algorithm mainly used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class.

• Although it is used for classification, it is called logistic regression because it takes the output of a linear regression function as input and uses a sigmoid function to estimate the probability of the given class.
Logistic Regression
• The difference between linear regression and logistic regression is that linear regression output is a continuous value that can be anything, while logistic regression predicts the probability that an instance belongs to a given class or not.


Logistic Regression
• It is used for predicting the categorical dependent variable using a given set of independent variables.

• Logistic regression predicts the output of a categorical dependent variable; therefore the outcome must be a categorical or discrete value.

• It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
Logistic Regression
• Logistic Regression is much like Linear Regression except in how it is used.

• Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.

• In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).
Logistic Regression

• The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight, etc.

• Logistic Regression is a significant machine learning algorithm because it has the ability to provide probabilities and classify new data using continuous and discrete datasets.


Logistic Regression

• Logistic Regression can be used to classify observations using different types of data and can easily determine the most effective variables for the classification.


Logistic Regression : Logistic Function (Sigmoid Function)
• Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map predicted values to probabilities.
• It maps any real value to a value within the range 0 to 1.
• The value of the logistic regression must be between 0 and 1; it cannot go beyond this limit, so it forms a curve like the "S" form.
• The S-form curve is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0.
Type of Logistic Regression
• On the basis of the categories, Logistic Regression can be classified into three types:

• Binomial: In binomial Logistic Regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.

• Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".

• Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
Linear Regression vs. Logistic Regression

1. Linear regression is used to predict the continuous dependent variable using a given set of independent variables; logistic regression is used to predict the categorical dependent variable using a given set of independent variables.
2. Linear regression is used for solving regression problems; logistic regression is used for solving classification problems.
3. In linear regression we predict the values of continuous variables; in logistic regression we predict the values of categorical variables.
4. In linear regression we find the best-fit line; in logistic regression we find the S-curve.
5. Least squares estimation is used for estimation of accuracy in linear regression; maximum likelihood estimation is used in logistic regression.
6. Linear regression output must be a continuous value, such as price, age, etc.; logistic regression output must be a categorical value, such as 0 or 1, Yes or No, etc.
7. Linear regression requires a linear relationship between dependent and independent variables; logistic regression does not require a linear relationship.
8. In linear regression there may be collinearity between the independent variables; in logistic regression there should not be collinearity between the independent variables.
Logistic Regression: Sigmoid Function

• The sigmoid function takes an input z (the linear score) and returns the probability between 0 and 1, i.e., the predicted y:

  σ(z) = 1 / (1 + e^(−z))
Logistic Regression
• Logistic Regression Equation
• The odds are the ratio of something occurring to something not occurring. This differs from probability, which is the ratio of something occurring to everything that could possibly occur. So the odds are:

  odds = p / (1 − p)

• In scikit-learn: from sklearn.linear_model import LogisticRegression
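A minimal sketch of that class in use; the one-feature data is made up for the example:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: feature value (x) and binary class label (y)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5]]))         # predicted class
print(clf.predict_proba([[2.5]]))   # sigmoid-derived probabilities for classes 0 and 1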
Types of Gradient Descent:
• Typically, there are three types of Gradient Descent:
• Batch Gradient Descent

• Stochastic Gradient Descent

• Mini-batch Gradient Descent



Gradient Descent

• Gradient descent is an optimization algorithm that is used when training a machine learning model.

• It is based on a convex function and tweaks its parameters iteratively to minimize a given function to its local minimum.


WHAT IS GRADIENT DESCENT IN MACHINE LEARNING?
• Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function.

• Gradient descent in machine learning is simply used to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible.


ROLE OF GRADIENT DESCENT
• Starting from initial parameter values, the gradient descent algorithm uses calculus to iteratively adjust those values so that they minimize the given cost function.


ROLE OF GRADIENT DESCENT
• Stochastic gradient descent (SGD) does this for each training example within the dataset, meaning it updates the parameters for each training example one by one.

• Depending on the problem, this can make SGD faster than batch gradient descent.

• One advantage is that the frequent updates give us a fairly detailed rate of improvement.


ROLE OF GRADIENT DESCENT
• The frequent updates, however, are more computationally expensive than the batch gradient descent approach.

• Additionally, the frequency of those updates can result in noisy gradients, which may cause the error rate to jump around instead of slowly decreasing.


Stochastic Gradient Descent Algorithms
• The gradient descent algorithm is an approximate and iterative method for mathematical optimization.

• You can use it to approach the minimum of any differentiable function.


Stochastic Gradient Descent Algorithms
• Gradient Descent is an iterative optimization process that searches for an objective function's optimum value (minimum/maximum).

• It is one of the most used methods for changing a model's parameters in order to reduce a cost function in machine learning projects.


Stochastic Gradient Descent Algorithms
• The primary goal of gradient descent is to identify the model parameters that provide the maximum accuracy on both training and test datasets.

• In gradient descent, the gradient is a vector pointing in the general direction of the function's steepest rise at a particular point.

• The algorithm can gradually move towards lower values of the function by moving in the opposite direction of the gradient, until it reaches the minimum of the function.
Stochastic Gradient Descent
• Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that is used for optimizing machine learning models.

• It addresses the computational inefficiency of traditional Gradient Descent methods when dealing with large datasets in machine learning projects.


Stochastic Gradient Descent
• In SGD, instead of using the entire dataset for each iteration, only a single random training example (or a small batch) is selected to calculate the gradient and update the model parameters.

• This random selection introduces randomness into the optimization process, hence the term "stochastic" in Stochastic Gradient Descent.


Stochastic Gradient Descent
• The advantage of using SGD is its computational efficiency, especially when dealing with large datasets.

• By using a single example or a small batch, the computational cost per iteration is significantly reduced compared to traditional Gradient Descent methods that require processing the entire dataset.


Stochastic Gradient Descent
• Stochastic Gradient Descent Algorithm
• Initialization: Randomly initialize the parameters of the model.

• Set Parameters: Determine the number of iterations and the learning rate (alpha) for updating the parameters.


Stochastic Gradient Descent
• Stochastic Gradient Descent Loop: Repeat the following steps until the model converges or reaches the maximum number of iterations:

a. Shuffle the training dataset to introduce randomness.

b. Iterate over each training example (or a small batch) in the shuffled order.

c. Compute the gradient of the cost function with respect to the model parameters using the current training example (or batch).

d. Update the model parameters by taking a step in the direction of the negative gradient, scaled by the learning rate.

e. Evaluate the convergence criteria, such as the difference in the cost function between iterations.
Stochastic Gradient Descent
• Return Optimized Parameters: Once the convergence criteria are met or
the maximum number of iterations is reached, return the optimized model
parameters.

• In SGD, since only one sample from the dataset is chosen at random for
each iteration, the path taken by the algorithm to reach the minima is
usually noisier than your typical Gradient Descent algorithm.

• But that doesn’t matter all that much because the path taken by the
algorithm does not matter, as long as we reach the minimum and with a
significantly shorter training time.
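A compact sketch of this loop for simple linear regression in plain NumPy; the learning rate, epoch count, and data are illustrative assumptions:

import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(100)
y = 4 + 3 * x + 0.1 * rng.randn(100)   # underlying line: y = 4 + 3x

theta1, theta2 = 0.0, 0.0   # initialization
alpha = 0.1                 # learning rate

for epoch in range(50):
    for i in rng.permutation(len(x)):            # a. shuffle each epoch
        error = (theta1 + theta2 * x[i]) - y[i]  # c. gradient from ONE example
        theta1 -= alpha * error                  # d. step against the gradient
        theta2 -= alpha * error * x[i]

print(theta1, theta2)   # approximately 4 and 3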
The path taken by Batch Gradient Descent is shown below:

(Figure: batch gradient descent takes a smooth, direct path to the minimum.)


A path taken by Stochastic Gradient Descent looks as follows:

• One thing to be noted is that, as SGD is generally noisier than typical Gradient Descent, it usually takes a higher number of iterations to reach the minima, because of the randomness in its descent.

• Even though it requires a higher number of iterations to reach the minima than typical Gradient Descent, it is still computationally much less expensive.

• Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning algorithm.
Advantages of Stochastic Gradient Descent
• Speed: SGD is faster than other variants of Gradient Descent such as
Batch Gradient Descent and Mini-Batch Gradient Descent since it uses
only one example to update the parameters.

• Memory Efficiency: Since SGD updates the parameters for each training
example one at a time, it is memory-efficient and can handle large
datasets that cannot fit into memory.

• Avoidance of Local Minima: Due to the noisy updates in SGD, it has the
ability to escape from local minima and converges to a global minimum.
Disadvantages of Stochastic Gradient Descent
• Noisy updates: The updates in SGD are noisy and have a high variance,
which can make the optimization process less stable and lead to
oscillations around the minimum.
• Slow Convergence: SGD may require more iterations to converge to the
minimum since it updates the parameters for each training example one at
a time.
• Sensitivity to Learning Rate: The choice of learning rate can be critical in
SGD since using a high learning rate can cause the algorithm to overshoot
the minimum, while a low learning rate can make the algorithm converge
slowly.
• Less Accurate: Due to the noisy updates, SGD may not converge to the exact global minimum and can result in a suboptimal solution. This can be mitigated by using techniques such as learning rate scheduling and momentum-based updates.


Confusion Matrix
• The target variable has two values: Positive or Negative.

• The columns represent the actual values of the target variable.

• The rows represent the predicted values of the target variable.
Confusion Matrix

• The classification matrix is a standard tool for evaluation of statistical models and is sometimes referred to as a confusion matrix.

• A confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model.


Confusion Matrix
• A good model is one which has high TP and TN rates, and low FP and FN rates.

• A confusion matrix is a tabular summary of the number of correct and incorrect predictions made by a classifier.

• It is used to measure the performance of a classification model.

• It can be used to evaluate the performance of a classification model through the calculation of performance metrics like accuracy, precision, recall, and F1-score.
Confusion Matrix
• True Positives (TP): when the actual value is Positive and the predicted value is also Positive.
• True Negatives (TN): when the actual value is Negative and the prediction is also Negative.
• False Positives (FP): when the actual value is Negative but the prediction is Positive. Also known as a Type 1 error.
• False Negatives (FN): when the actual value is Positive but the prediction is Negative. Also known as a Type 2 error.


Classification Measure
• Classification Measure
• Basically, it is an extended version of the confusion matrix. There are
measures other than the confusion matrix which can help achieve
better understanding and analysis of our model and its
performance.



Classification Measure
a. Accuracy

b. Precision

c. Recall (TPR, Sensitivity)

d. F1-Score

e. FPR (Type I Error)

f. FNR (Type II Error)


Classification Measure: a. Accuracy

• Accuracy simply measures how often the classifier makes the correct prediction.
• It is the ratio between the number of correct predictions and the total number of predictions:

  Accuracy = (TP + TN) / (TP + TN + FP + FN)
Classification Measure

• In a two-class problem, we are often looking to discriminate observations with a specific outcome from normal observations.

• "True positive" for correctly predicted event values.

• "False positive" for incorrectly predicted event values.

• "True negative" for correctly predicted no-event values.

• "False negative" for incorrectly predicted no-event values.


Confusion Matrix

# Example of a confusion matrix in Python
from sklearn.metrics import confusion_matrix

expected = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
results = confusion_matrix(expected, predicted)
print(results)

Output:
[[4 2]
 [1 3]]
Calculate Accuracy, Error, Precision, Recall and F1 Score
for the following Confusion Matrix
                     Actual Positive   Actual Negative
Predicted Positive         10                10
Predicted Negative         25                55


Solution:
1. Accuracy = (TP + TN) / (TP + TN + FP + FN)
   = (10 + 55) / (10 + 55 + 10 + 25) = 65 / 100 = 0.65
2. Error = 1 − Accuracy = 1 − 0.65 = 0.35
3. Precision = TP / (TP + FP) = 10 / (10 + 10) = 0.5
4. Recall (Sensitivity) = TP / (TP + FN) = 10 / (10 + 25) = 0.2857
5. F1 Score = 2 × Precision × Recall / (Precision + Recall)
   = 2 × 0.5 × 0.2857 / (0.5 + 0.2857) = 0.3636
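The same numbers can be verified in a few lines of Python (plain arithmetic, no libraries needed):

TP, FP, FN, TN = 10, 10, 25, 55

accuracy  = (TP + TN) / (TP + TN + FP + FN)          # 0.65
precision = TP / (TP + FP)                           # 0.5
recall    = TP / (TP + FN)                           # 0.2857
f1 = 2 * precision * recall / (precision + recall)   # 0.3636
print(accuracy, precision, round(recall, 4), round(f1, 4))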
ROC Curve

• An ROC curve (receiver operating characteristic curve) is a graph


showing the performance of a classification model at all
classification thresholds.

• This curve plots two parameters: True Positive Rate and False
Positive Rate.



ROC Curve
• True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

  TPR = TP / (TP + FN)

• False Positive Rate (FPR) is defined as follows:

  FPR = FP / (FP + TN)

• An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
ROC Curve
• The following figure shows a typical ROC curve.



ROC Curve

• With a ROC curve, you're trying to find a good model that optimizes the trade-off between the False Positive Rate (FPR) and True Positive Rate (TPR). What counts here is how much area is under the curve (Area Under the Curve, AUC).

• An ideal curve fills in 100% of the area, which means the model can distinguish between negative results and positive results 100% of the time (which is almost impossible in real life).




ROC Curve
• A Receiver Operating Characteristic (ROC) curve is a graphical plot used to show the diagnostic ability of binary classifiers.

• A ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR).

• The true positive rate is the proportion of observations that were correctly predicted to be positive out of all positive observations (TP/(TP + FN)).


Plot ROC-AUC Curve for binary classification problem
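A hedged sketch of such a plot with scikit-learn and matplotlib, on a synthetic binary classification problem (both libraries are assumed to be installed):

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]    # probability of the positive class

fpr, tpr, _ = roc_curve(y_te, scores)     # TPR vs. FPR at each threshold
plt.plot(fpr, tpr, label='AUC = %.3f' % roc_auc_score(y_te, scores))
plt.plot([0, 1], [0, 1], linestyle='--')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()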



Thank You