
MACHINE LEARNING


LECTURE 9-12: LOGISTIC REGRESSION MODEL DIAGNOSTICS AND GOODNESS OF FIT
LEARNING OBJECTIVES

• Logistic Regression Model Diagnostics


• Logistic Regression Goodness of Fit
LOGISTIC REGRESSION MODEL DIAGNOSTICS

1. Omnibus test: Checks whether the explained variation in the model is significantly higher than the unexplained variation. For example, in the MLR model, the F-test is an omnibus test.
2. Wald’s test: Used for checking whether an individual explanatory variable is statistically significant. Wald’s test is a chi-square test.
3. Hosmer-Lemeshow test: A chi-square goodness of fit test for binary logistic regression.
4. Pseudo R2: A measure of goodness of fit of the model. It is called pseudo R2 because it does not have the same interpretation as R2 in the MLR model.
WALD’S TEST
Wald’s test is used for checking the statistical significance of individual predictor variables (equivalent to the t-test in the MLR model). The null and alternative hypotheses for Wald’s test are:
H0: βi = 0 (the explanatory variable Xi is not statistically significant)
H1: βi ≠ 0 (the explanatory variable Xi is statistically significant)
The Wald statistic is W = (β̂i / SE(β̂i))², which follows a chi-square distribution with one degree of freedom.

For the Challenger O-ring data, the Wald test statistic for launch temperature is 4.832 and the corresponding p-value is 0.028. Since the p-value is less than 0.05, we reject the null hypothesis; that is, the variable ‘launch temperature’ has a statistically significant relationship with O-ring failure.
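
As a minimal sketch (not from the lecture), the Wald chi-square values can be read off a statsmodels logit fit as the squared ratio of each coefficient to its standard error. The launch data below are synthetic stand-ins, used only for illustration.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

# Synthetic stand-in for the launch data (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
temperature = rng.uniform(50, 80, size=24)
failure = (rng.uniform(size=24) < 1 / (1 + np.exp(0.2 * (temperature - 65)))).astype(int)

X = sm.add_constant(pd.DataFrame({"temperature": temperature}))
result = sm.Logit(failure, X).fit(disp=False)

# Wald chi-square = (coefficient / standard error)^2, with 1 degree of freedom
wald_chi2 = (result.params / result.bse) ** 2
p_values = pd.Series(stats.chi2.sf(wald_chi2, df=1), index=wald_chi2.index)
print(pd.DataFrame({"Wald chi-square": wald_chi2, "p-value": p_values}))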
HOSMER-LEMESHOW GOODNESS OF FIT TEST
Hosmer-Lemeshow (H-L) is a chi-square goodness of fit test.
The H-L test is constructed by dividing the data set into 10 groups
(deciles). The H-L test checks whether the observed and expected
frequencies in each group are equal. The null and alternative
hypotheses in H-L test are
H0: The logistic regression model fits the data
H1: The logistic regression model does not fit the data
Since the p-value (0.1411) is greater than 0.05, we fail to reject the null hypothesis that the logistic regression model fits the data; hence we claim that the logistic regression model is appropriate.
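
A minimal sketch of the H-L computation (assuming a 0/1 array y of observed outcomes and an array p of predicted probabilities from the fitted model; the statistic is referred to a chi-square distribution with g − 2 degrees of freedom, where g is the number of groups).

import pandas as pd
from scipy import stats

def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow chi-square statistic and p-value."""
    data = pd.DataFrame({"y": y, "p": p})
    # Divide observations into (up to) 10 groups based on deciles of predicted probability
    data["group"] = pd.qcut(data["p"], q=groups, duplicates="drop")
    grouped = data.groupby("group", observed=True)
    observed = grouped["y"].sum()   # observed positives in each group
    expected = grouped["p"].sum()   # expected positives in each group
    size = grouped["y"].count()     # number of observations in each group
    hl_stat = (((observed - expected) ** 2) / (expected * (1 - expected / size))).sum()
    p_value = stats.chi2.sf(hl_stat, df=len(size) - 2)
    return hl_stat, p_value

# Example use: hl, pval = hosmer_lemeshow(y, model.predict(X))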
PSEUDO R2
It is not possible to calculate R2 in a logistic regression model as in the case of a continuous dependent variable. However, several pseudo R2 measures are used, which compare the intercept-only model to the model with the independent variables.

• Cox and Snell R2

Cox and Snell R2 is given by

Cox and Snell R2 = 1 − [ L(Intercept-only model) / L(Full model) ]^(2/N)

where N is the sample size, L(Intercept-only model) is the likelihood of the model with only the intercept, and L(Full model) is the likelihood of the full model, which is used for calculating the class probability.

• Nagelkerke’s R2

Nagelkerke R2 is an adjustment of Cox and Snell R2 so that the maximum value of the pseudo R2 is 1. Nagelkerke R2 is given by

Nagelkerke R2 = Cox and Snell R2 / ( 1 − [ L(Intercept-only model) ]^(2/N) )
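
A minimal sketch that computes both pseudo R2 values from a fitted statsmodels logit result, using the full-model and intercept-only log-likelihoods that statsmodels exposes as llf and llnull.

import numpy as np

def pseudo_r2(result, n):
    """Cox and Snell and Nagelkerke R^2 from a fitted statsmodels Logit result."""
    ll_full = result.llf      # log-likelihood of the full model
    ll_null = result.llnull   # log-likelihood of the intercept-only model
    cox_snell = 1 - np.exp((2.0 / n) * (ll_null - ll_full))
    max_cox_snell = 1 - np.exp((2.0 / n) * ll_null)   # maximum attainable Cox and Snell R^2
    nagelkerke = cox_snell / max_cox_snell
    return cox_snell, nagelkerke

# Example use: pseudo_r2(result, n=len(failure))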
LOGISTIC REGRESSION GOODNESS OF FIT

Goodness of Fit Metrics

Based on the probability output:
1. Concordance, Discordance, Somers’ D, Gamma
2. Area Under the Curve (AUC)

Based on the predicted categories:
1. Confusion Matrix
2. Classification report:
• Accuracy
• Sensitivity
• Specificity
• Precision
• Recall
• F1 score
CLASSIFICATION TABLE

To classify the observations, the decision maker has to first decide the
classification cut-off probability Pc.

1. Whenever the predicted probability of an observation, P(Yi = 1), is less than the classification cut-off probability Pc, the observation is classified as negative (Yi = 0); if the predicted probability is greater than or equal to Pc, the observation is classified as positive (Yi = 1). That is,
Yi = 1 if P(Yi = 1) ≥ Pc, and Yi = 0 if P(Yi = 1) < Pc

2. The default classification cut-off probability is 0.5.

The classification table in a logistic regression model output reports the accuracy of the model (the accuracy of classifying positives and negatives) for a chosen classification cut-off probability.
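
A minimal sketch that builds the classification table for a chosen cut-off probability (assuming a 0/1 array y of actual classes and an array p of predicted probabilities).

import numpy as np
from sklearn.metrics import confusion_matrix

def classification_table(y, p, cutoff=0.5):
    """Counts of true/false positives and negatives, plus overall accuracy."""
    predicted = (np.asarray(p) >= cutoff).astype(int)   # >= cut-off is classified as positive
    tn, fp, fn, tp = confusion_matrix(y, predicted, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn, "overall accuracy": accuracy}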
CLASSIFICATION TABLE

• For the classification cut-off probability value of 0.5, the model has classified all 17 negatives (coded as 0) as negatives, 4 positives (coded as 1) as positives, and the remaining 3 positives as negatives.
• The accuracy of classifying negatives is 100%, whereas the accuracy of classifying
positives is 57.1%.
• The overall accuracy of the logistic regression model is 87.5% (3 observations are
misclassified out of 24 observations).
CLASSIFICATION TABLE
The classification accuracy depends on the cut-off probability. The classification table for a cut-off probability of 0.2 gives the following results:

• The overall accuracy decreases to 62.5%; however, the accuracy of predicting positives increases from 57.1% to 85.7%.
• The accuracy of predicting negatives decreases from 100% to 52.9%.
ACCURACY PARADOX

• The accuracy paradox in classification problems states that a model with higher overall accuracy may not be a better model.
• For example, in the case of the Challenger data, when the classification cut-off probability is 0.5 the overall accuracy is 87.5%, and when the classification cut-off probability is 0.2 the overall accuracy is only 62.5%.
• However, given the context, we need higher accuracy in predicting positive classes (Yi = 1) than negative classes (Yi = 0).
• The accuracy in predicting positive classes is 57.1% when the classification cut-off is 0.5, whereas the corresponding accuracy for a classification cut-off probability of 0.2 is 85.7%.
• In this case, the classification cut-off of 0.2 is better than 0.5, although the overall classification accuracy is lower for the cut-off probability of 0.2. In classification problems, model selection cannot be based on overall accuracy alone.
SENSITIVITY, SPECIFICITY, AND PRECISION
The abilities of the model to correctly classify positives and negatives are called sensitivity and specificity, respectively.
In medical diagnostics, sensitivity (also known as the true positive rate) measures the ability of a diagnostic test to identify the disease when it is present in a patient (test positive). That is,
Sensitivity = P(diagnostic test is positive | patient has disease)
In the generic case:
Sensitivity = P(model classifies Yi as positive | Yi is positive)
Sensitivity is calculated using the following equation:
Sensitivity = TP / (TP + FN)
where
• True Positive (TP) is the number of positives correctly classified as positives by the model, and
• False Negative (FN) is the number of positives misclassified as negatives by the model.

Sensitivity is also called recall.


SPECIFICITY
Specificity is the ability of the diagnostic test to correctly classify the test as negative when
the disease is not present. That is:
Specificity = P(diagnostic test is negative | patient has no disease)
In general:
Specificity = P(model classifies Yi as negative | Yi is negative)

Specificity can be calculated using the following equation:
Specificity = TN / (TN + FP)
where
• True Negative (TN) is the number of negatives correctly classified as negatives by the model, and
• False Positive (FP) is the number of negatives misclassified as positives by the model.
When the cut-off probability is 0.5, the sensitivity is 57.1% and the specificity is 100% for the Challenger data.
PRECISION AND F-SCORE (F-MEASURE)
Precision measures the accuracy of the positives classified by the model:
Precision = TP / (TP + FP)

F-Score (F-Measure) is another measure used in binary logistic regression that combines both precision and recall (it is the harmonic mean of precision and recall) and is given by
F-Score = 2 × Precision × Recall / (Precision + Recall)
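
A minimal sketch computing these measures from the confusion-matrix counts for a chosen cut-off (same assumed y and p as before).

import numpy as np
from sklearn.metrics import confusion_matrix

def classification_metrics(y, p, cutoff=0.5):
    """Sensitivity (recall), specificity, precision, and F-score for a cut-off."""
    predicted = (np.asarray(p) >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y, predicted, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "F-score": f_score}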
CONCORDANT AND DISCORDANT PAIRS
Concordant and discordant pairs are measures used to assess logistic regression model performance.

Consider two observations: STS 1, for which Y = 0 and the predicted probability is 0.43, and STS 2, for which Y = 1 and the predicted probability is 0.23. Given these two observations, there is no cut-off probability that can classify both of them correctly.
• For example, if we choose a cut-off probability which is less than 0.23, then STS 1 will
be misclassified since predicted probability of STS 1 is 0.43.
• If we choose a classification cut-off probability between 0.23 and 0.43, then both of
them will be misclassified, since probability of STS 2 is 0.23 (whereas the corresponding
Y = 1) and probability of STS 1 is 0.43 (the value of Y = 0).
• If the cut-off probability is greater than 0.43, then STS 2 will be misclassified. Thus, if we
select STS 1 (for which Y = 0) and STS 2 (for which Y = 1), there is no cut-off
probability that can classify both positive (Y = 1) and negative (Y = 0) correctly.
• Such pairs are called Discordant Pairs.
CONCORDANT PAIRS
A pair of positive and negative observations for which there exists a cut-off probability that classifies both of them correctly is called a concordant pair.

• Consider observations STS 9 (launch temperature is 70°F and Y = 0) and STS 41B (launch
temperature is 57°F and Y = 1). Predicted probability of damage to O-ring for STS 9 is 0.23
and STS 41B is 0.86.
• If we use a classification cut-off probability between 0.23 and 0.86, then we will classify
STS 9 (Y = 0) and STS 41B (Y = 1) correctly. Such pairs are called Concordant Pairs.

A logistic regression model with high proportion of concordant pairs is preferred.
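
A minimal sketch that counts concordant and discordant pairs directly, by comparing the predicted probability of every positive observation with that of every negative observation (assuming y and p as before; this brute-force version is fine for small data sets).

import numpy as np

def concordance(y, p):
    """Proportions of concordant, discordant, and tied positive-negative pairs."""
    y, p = np.asarray(y), np.asarray(p)
    p_pos = p[y == 1]                       # predicted probabilities of actual positives
    p_neg = p[y == 0]                       # predicted probabilities of actual negatives
    diff = p_pos[:, None] - p_neg[None, :]  # all pairwise differences
    total = diff.size
    return {"concordant": (diff > 0).sum() / total,
            "discordant": (diff < 0).sum() / total,
            "tied": (diff == 0).sum() / total}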


RECEIVER OPERATING CHARACTERISTICS (ROC) CURVE
• The ROC curve is a plot of sensitivity (true positive rate) on the vertical axis against 1 − specificity (false positive rate) on the horizontal axis.
• When the classification cut-off probability is changed, the sensitivity and specificity are likely to change. The ROC curve for the Challenger crash data is shown in the figure.

• The diagonal line represents the case of not using a model (no discrimination between positives and negatives).
• The AUC is the proportion of concordant pairs in the data.
• A model with higher AUC is preferred, and AUC is frequently used for model selection.
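
A minimal sketch of the ROC curve and AUC with scikit-learn and matplotlib (assuming y and p as before).

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y, p)   # 1 - specificity and sensitivity at each cut-off
auc = roc_auc_score(y, p)

plt.plot(fpr, tpr, label="Logistic regression (AUC = %.3f)" % auc)
plt.plot([0, 1], [0, 1], linestyle="--", label="No discrimination")
plt.xlabel("1 - Specificity (false positive rate)")
plt.ylabel("Sensitivity (true positive rate)")
plt.legend()
plt.show()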
AREA UNDER THE CURVE(AUC)

For the Challenger crash data, the AUC is 0.794. The area under the ROC curve can be interpreted as follows:
• If we use the logistic regression model, then there will be 79.4% concordant pairs and
20.6% discordant pairs.
• For a randomly selected pair of positive and negative observations, probability of correctly
classifying them is 0.794.
• For a randomly selected positive and negative observation pair, the following relationship holds:
P[ P(Positive) > P(Negative) ] = 0.794

As a rule of thumb, an AUC of at least 0.7 is required for practical application of the model.
An AUC greater than 0.9 implies an outstanding model.
LORENZ CURVE AND GINI COEFFICIENT
Max O. Lorenz, an American economist, developed a wealth distribution plot to quantify the discrimination (inequality) in wealth among the citizens of a country based on the total wealth in the society.
[Lorenz curve: proportion of wealth on the vertical axis versus proportion of population on the horizontal axis; the diagonal line represents no wealth discrimination]

Curve A indicates the observed wealth distribution as a function of the population proportion. As the size of area A in the Lorenz curve increases, the discrimination increases; similarly, as the size of area B increases, the wealth discrimination decreases.
The discrimination is usually measured using the Gini coefficient, which is given by
Gini coefficient = Area A / (Area A + Area B)
GINI COEFFICIENT
Since the region below the diagonal line is a right-angled triangle with both sides of length 1,
Area A + Area B = 1/2
Substituting this value in the Gini coefficient equation, we get
Gini coefficient = A / (1/2) = 2A

We can connect the area under the ROC curve (AUC) with the Gini coefficient. In the ROC curve, area A is the region above the diagonal line, so
AUC = 1/2 + Area A
that is, 2 AUC = 1 + 2 Area A = 1 + Gini coefficient, and therefore
Gini coefficient = 2 AUC − 1

• The Gini coefficient is useful in logistic regression model building for variable selection, since a variable with a high Gini coefficient is able to discriminate the values of the response variable well.
• When there are a large number of independent variables, the Gini coefficient may be used as a strategy to shortlist (or rank) the variables for LR model building.
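
A minimal sketch of this screening idea (assuming a pandas DataFrame df whose hypothetical column target holds the 0/1 response): each candidate variable is used alone in a univariate logistic regression and ranked by its Gini coefficient, 2·AUC − 1.

import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def gini_ranking(df, target="target"):
    """Rank candidate predictors by the Gini coefficient of univariate logit models."""
    y = df[target]
    ginis = {}
    for col in df.columns.drop(target):
        X = sm.add_constant(df[[col]])
        p = sm.Logit(y, X).fit(disp=False).predict(X)
        ginis[col] = 2 * roc_auc_score(y, p) - 1
    return sorted(ginis.items(), key=lambda item: item[1], reverse=True)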
OPTIMAL CUT-OFF PROBABILITY
While using a logistic regression model, one of the decisions a data scientist has to make is the choice of the classification cut-off probability (Pc).

The following three methods are used for selecting the cut-off
probability.
1. Classification plot
2. Youden’s Index
3. Cost based approach
CLASSIFICATION PLOT
A classification plot is a plot of the predicted probability on the horizontal axis against the corresponding frequencies of the observations from the LR model, shown separately for actual positives and negatives.

We can see that there is no negative (coded as Y = 0) beyond a predicted probability of 0.5, whereas there are positives even at a probability value of approximately 0.08. So, if we have to ensure that all the positives are correctly identified, then we have to set a classification cut-off probability that is less than 0.08.
CLASSIFICATION PLOT

At a cut-off value of 0.05, we are able to classify all positives (Y = 1) correctly (sensitivity = 100%), whereas only 4 out of 17 negatives (Y = 0) are correctly classified, that is, the specificity is only 23.5%.
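
A minimal sketch of a classification plot (assuming y and p as before): overlaid histograms of the predicted probabilities, drawn separately for actual negatives and actual positives.

import numpy as np
import matplotlib.pyplot as plt

y, p = np.asarray(y), np.asarray(p)
bins = np.linspace(0, 1, 21)
plt.hist(p[y == 0], bins=bins, alpha=0.6, label="Actual negatives (Y = 0)")
plt.hist(p[y == 1], bins=bins, alpha=0.6, label="Actual positives (Y = 1)")
plt.xlabel("Predicted probability")
plt.ylabel("Frequency of observations")
plt.legend()
plt.show()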
YOUDEN’S INDEX FOR OPTIMAL CUT-OFF PROBABILITY

• Consider the ROC curve: the coordinate (0, 1) corresponds to sensitivity = specificity = 1, which is the ideal model that we would like to use.
• The ROC curve provides information on how sensitivity and specificity change when the classification cut-off probability changes.
• The point on the ROC curve at minimum distance from the coordinate (0, 1) (or, equivalently, at maximum distance from the diagonal line) gives the best cut-off probability.
• Youden’s Index is the classification cut-off probability value for which the distance from the diagonal line to the ROC curve is maximum, that is, the value that maximizes J = sensitivity + specificity − 1.
• Note that the maximum distance is attained where a line parallel to the diagonal is tangent to the ROC curve.
YOUDEN’S INDEX FOR OPTIMAL CUT-OFF PROBABILITY
We can calculate Youden’s Index by incrementally changing the cut-off probability and calculating the corresponding sensitivity + specificity − 1. Youden’s Index values for the Challenger crash data are shown in the table.

The maximum value of Youden’s Index is 0.571, which occurs for cut-off probability values of 0.5 and 0.6 (in fact, for all values between 0.5 and 0.6). But given the context, one would expect 100% sensitivity in this case. We use Youden’s Index only when sensitivity and specificity are equally important.
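
A minimal sketch of finding the cut-off that maximizes Youden’s Index from the ROC thresholds (assuming y and p as before); since J = sensitivity + specificity − 1, it equals the true positive rate minus the false positive rate.

import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y, p)
youden = tpr - fpr                 # sensitivity + specificity - 1 at each cut-off
best = np.argmax(youden)
print("Best cut-off probability:", thresholds[best])
print("Maximum Youden's Index:", youden[best])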
COST-BASED CUT-OFF PROBABILITY
In the cost-based approach, misclassification costs are assigned to false positives and false negatives, and the cut-off probability is chosen so that the total expected cost of misclassification is minimized.
VARIABLE SELECTION IN LOGISTIC REGRESSION
Variable selection ensures that only variables that are statistically significant at a significance level of α are included in the model.

Model selection criteria such as AIC and BIC can also be used to compare candidate models.


VARIABLE SELECTION IN LOGISTIC REGRESSION
Forward LR (Likelihood Ratio): variables are entered into the model one at a time, with entry decided by the likelihood ratio test.
PREDICT BANKRUPTCY USING LOGISTIC REGRESSION AND
VARIABLE SELECTION METHODS

The bankruptcy data set considered here has 5436 observations and 13 variables. The variable of interest is DLRSN(D), which is coded as 0/1, where 1 means bankruptcy and 0 means non-bankruptcy. There are 10 variables that help us predict whether there is a chance of bankruptcy. In this example, we use logistic regression along with different variable selection methods and cross-validation to obtain good out-of-sample accuracy.
APPLICATION OF LOGISTIC REGRESSION IN CREDIT RATING
We will be using sample data (file name: German Credit Rating.xlsx; the data set has 800 observations and 13 attributes) taken from the ‘German Credit Data’ available at the University of California, Irvine machine learning repository. Note that the original data set has 20 attributes (predictors) with over 40,000 observations. The response variable Y takes the value 1 (bad credit) or 0 (good credit). The predictors used in developing the credit rating model are listed in the table.
