Logistic Regression
Dr. D. Harimurugan
Logistic regression
Binary classification (0/1)
Multiclass classification (0, 1, 2)
g(z) → 1 as z → ∞;  g(z) → 0 as z → −∞
hθ(x) represents the estimated probability that the data point belongs to class 1.
ML Dr. D. Harimurugan, EE - NITJ
Outline: Logistic regression (Introduction · Sigmoid function · Decision Boundary · Cost Function) · Multiclass classification · Evaluation Metrics · Naive Bayes' Algorithm
hθ(x) = g(X·θ) = 1 / (1 + e^(−X·θ))
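A minimal NumPy sketch of this function (the names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Works element-wise on arrays as well as scalars; g(0) = 0.5
print(sigmoid(0.0))                          # 0.5
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```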
z ≥ 0 ⇒ g(z) ≥ 0.5 ⇒ hθ(x) ≥ 0.5 ⇒ Class 1
z < 0 ⇒ g(z) < 0.5 ⇒ hθ(x) < 0.5 ⇒ Class 0
X·θ ≥ 0 ⇒ g(X·θ) ≥ 0.5 ⇒ Class 1
X·θ < 0 ⇒ g(X·θ) < 0.5 ⇒ Class 0
Predicting the probability of y belonging to class 1 or class 0 is therefore equivalent to predicting whether X·θ is greater than or less than zero. Based on the value of h, we divide the dataset into classes; the separating boundary is called the “decision boundary”.
Decision Boundary

hθ(x) = g(θ0 + θ1 x1 + θ2 x2)

For θ = [−4, 1, 1]ᵀ the boundary is
x1 + x2 = 4, i.e. x1 + x2 − 4 = 0

Predict y = 1 if x1 + x2 ≥ 4
Predict y = 0 if x1 + x2 < 4

On the boundary x1 + x2 = 4, hθ(x) = 0.5.
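A short sketch of this rule, assuming θ = [−4, 1, 1]ᵀ and two hypothetical points:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-4.0, 1.0, 1.0])      # [theta0, theta1, theta2]
X = np.array([[1.0, 3.0, 3.0],          # x1 + x2 = 6 >= 4 -> predict 1
              [1.0, 1.0, 2.0]])         # x1 + x2 = 3 <  4 -> predict 0
h = sigmoid(X @ theta)                  # h >= 0.5 exactly when X.theta >= 0
y_pred = (h >= 0.5).astype(int)
print(y_pred)                           # [1 0]
```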
Decision Boundary (non-linear)

The decision boundary is
x1² + x2² = 1
x1² + x2² − 1 ≥ 0 ⇒ y = 1
x1² + x2² − 1 < 0 ⇒ y = 0

For θ = [−1, 0, 0, 1, 1]ᵀ over the features [1, x1, x2, x1², x2²]:
hθ(x) = g(x1² + x2² − 1)
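The circular boundary can be sketched the same way, assuming the feature vector [1, x1, x2, x1², x2²]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])   # weights for [1, x1, x2, x1^2, x2^2]

def predict(x1, x2):
    features = np.array([1.0, x1, x2, x1**2, x2**2])
    # g(x1^2 + x2^2 - 1) >= 0.5 exactly when the point is on or outside the unit circle
    return int(sigmoid(features @ theta) >= 0.5)

print(predict(2.0, 2.0))   # outside the circle -> 1
print(predict(0.1, 0.1))   # inside the circle  -> 0
```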
cost(hθ(x), y) = −log(hθ(x))        if y = 1
cost(hθ(x), y) = −log(1 − hθ(x))    if y = 0
The combined cost, cost(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x)), recovers both cases:

y = 1:
cost(hθ(x), y) = −1·log(hθ(x)) − (1 − 1)·log(1 − hθ(x))
cost(hθ(x), y) = −log(hθ(x))

y = 0:
cost(hθ(x), y) = −0·log(hθ(x)) − (1 − 0)·log(1 − hθ(x))
cost(hθ(x), y) = −log(1 − hθ(x))
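A small sketch of this per-example cost (the h values below are hypothetical):

```python
import numpy as np

def cost(h, y):
    """Logistic cost: -y*log(h) - (1 - y)*log(1 - h)."""
    return -y * np.log(h) - (1.0 - y) * np.log(1.0 - h)

# y = 1: the penalty shrinks as the predicted probability h approaches 1
print(cost(0.9, 1))   # small penalty
print(cost(0.1, 1))   # large penalty
# y = 0: the mirror image, small when h is near 0
print(cost(0.1, 0))
```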
hθ(x) = 1 / (1 + e^(−X·θ))
Find the probability from each model; the test point belongs to the class whose model gives the highest probability.
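One-vs-all prediction can be sketched as below; the per-class θ vectors are made-up stand-ins for trained models:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One theta per class, as if each were trained on "this class vs the rest"
thetas = {0: np.array([ 2.0, -1.0]),
          1: np.array([-1.0,  0.5]),
          2: np.array([-3.0,  1.2])}

def predict(x):
    # Score the point with every per-class model, pick the highest probability
    probs = {c: sigmoid(x @ th) for c, th in thetas.items()}
    return max(probs, key=probs.get)

x = np.array([1.0, 4.0])     # [bias term, feature]
print(predict(x))            # class 2 wins here
```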
Accuracy
Confusion matrix
Precision and recall
F1-score
AUC-ROC
Log loss
Gini coefficient
True positive fraction (sensitivity): TPF = TP / (TP + FN)
False positive fraction: FPF = FP / (TN + FP)
False negative fraction: FNF = FN / (TP + FN)
True negative fraction (specificity): TNF = TN / (TN + FP)
Positive predicted value: PPV = TP / (TP + FP)
Negative predicted value: NPV = TN / (TN + FN)
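These fractions can be computed directly from confusion-matrix counts (the counts below are hypothetical):

```python
def confusion_fractions(tp, fp, fn, tn):
    """Fractions derived from a binary confusion matrix."""
    return {
        "TPF (sensitivity)": tp / (tp + fn),
        "FPF":               fp / (tn + fp),
        "FNF":               fn / (tp + fn),
        "TNF (specificity)": tn / (tn + fp),
        "PPV":               tp / (tp + fp),
        "NPV":               tn / (tn + fn),
    }

m = confusion_fractions(tp=40, fp=10, fn=5, tn=45)
print(m["TPF (sensitivity)"])   # 40 / 45
print(m["TNF (specificity)"])   # 45 / 55
```

Note that TPF + FNF = 1 by construction, a quick sanity check on any reported table.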
P(No | sunny, hot, normal, false) = [P(sunny|No)·P(hot|No)·P(normal|No)·P(false|No)]·P(No) / [P(sunny)·P(hot)·P(normal)·P(false)]
P(sunny|yes) = 3/9
P(hot|yes) = 2/9
P(normal|yes) = 6/9
P(false|yes) = 6/9
P(yes) = 9/14

P(yes|sunny, hot, normal, false) = [P(sunny|yes)·P(hot|yes)·P(normal|yes)·P(false|yes)]·P(yes)
= (3/9)·(2/9)·(6/9)·(6/9)·(9/14) = 0.0211
P(No|sunny, hot, normal, false) = (2/5)·(2/5)·(1/5)·(2/5)·(5/14) = 0.0046
Since P(yes|today) > P(no|today), the test data belongs to class “Yes”.
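The worked comparison above can be reproduced with the slide's own fractions; since the denominator P(sunny)·P(hot)·P(normal)·P(false) is common to both classes, comparing the numerators is enough:

```python
# Numerators of Bayes' rule for each class, using the fractions from the slide
p_yes = (3/9) * (2/9) * (6/9) * (6/9) * (9/14)   # ~ 0.0211
p_no  = (2/5) * (2/5) * (1/5) * (2/5) * (5/14)

# The class with the larger unnormalised score wins
print("yes" if p_yes > p_no else "no")
```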
Bayes' theorem:
P(Y|x) = P(x|Y)·P(Y) / P(x)