
Logistic Regression – Inference and Diagnostics
U Dinesh Kumar
Objective Fitness Test

[Figure: the observed value of Y compared with the predicted value of Y without the variable in the model and the predicted value of Y with the variable in the model]
Lecture Outline

• Testing individual regression parameters (Wald's test).
• Deviance (deviation from a perfect model).
• Null deviance (-2LL0) and model deviance (-2LL or -2LLM).
• Likelihood ratio test.
• R² in logistic regression.
• Confidence intervals for parameters and probabilities.

Significance of Individual Parameters – Wald's Test

• The Wald test is used to check the significance of individual explanatory variables (similar to the t-statistic in linear regression).
• The Wald test statistic is given by:

$$ W = \left( \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)} \right)^2 $$

• Under the null hypothesis, W follows a chi-square distribution with 1 degree of freedom (see the sketch below).
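As a quick illustration, here is a minimal Python sketch of the Wald computation. The 0.036 and 0.006 are the rounded coefficient and standard error from the German credit example that follows; the slide's value of 33.066 comes from unrounded estimates, while these rounded inputs give 36.0.

```python
from scipy import stats

beta_hat = 0.036  # rounded coefficient estimate (from the slides)
se_beta = 0.006   # rounded standard error (from the slides)

# Wald statistic: W = (beta / SE(beta))^2, chi-square with 1 df under H0
W = (beta_hat / se_beta) ** 2
p_value = stats.chi2.sf(W, df=1)  # survival function = 1 - CDF

# Rounded inputs give W = 36.0; the slide's 33.066 uses unrounded estimates
print(f"W = {W:.3f}, p-value = {p_value:.3g}")
```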
Wald Test Hypotheses

• Null hypothesis H0: β1 = 0
• Alternative hypothesis H1: β1 ≠ 0


Wald Test – German Credit Rating

Wald = (0.036/0.006)² = 33.066 (the displayed coefficient and standard error are rounded; the unrounded estimates yield 33.066). The p-value is less than 0.05, so the variable is statistically significant.
Wald Test – Challenger Data

For a statistically significant variable, the confidence interval for Exp(B) will not contain the value 1.
Deviance

Deviance (goodness-of-fit test)

• Deviance measures the deviation from the perfect model (also known as the saturated model).
• The larger the value of deviance, the worse the fit.
• Null deviance is similar to the total sum of squares (SST) in linear regression, and model deviance is similar to the sum of squared errors (SSE) in linear regression.
Null Deviance (-2LL0)

• Null deviance (-2LL0, i.e., -2 times the log-likelihood of the model with no predictors) is the deviance value when no predictor variables are added to the model.
• Null deviance = prediction without using any features.
• It is similar to the total sum of squares (SST) in multiple linear regression.
Likelihood, Log-Likelihood, and -2LL

The likelihood function for logistic regression is given by:

$$ L(\beta_0, \beta_1) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} $$

$$ \ln L(\beta_0, \beta_1) = LL(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i \ln(\pi_i) + (1 - y_i) \ln(1 - \pi_i) \right] $$

$$ -2LL(\beta_0, \beta_1) = -2 \sum_{i=1}^{n} \left[ y_i \ln(\pi_i) + (1 - y_i) \ln(1 - \pi_i) \right] $$

-2LL is used for calculating deviance.
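The following minimal Python sketch (the helper name is illustrative) computes -2LL directly from the formula above, given observed outcomes and predicted probabilities:

```python
import numpy as np

def neg2_log_likelihood(y, pi):
    """Return -2LL for binary outcomes y (0/1) and predicted probabilities pi."""
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return -2.0 * np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

# Small worked example with three observations
print(neg2_log_likelihood([1, 0, 1], [0.9, 0.1, 0.8]))  # ~0.868
```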


Null Model

$$ -2LL_0 = -2\left[ N_0 \ln\left(\frac{N_0}{N}\right) + N_1 \ln\left(\frac{N_1}{N}\right) \right] $$

where N0 and N1 are the numbers of 0s and 1s in the data and N = N0 + N1.

In the training sample there are 561 good credits (N0) and 239 bad credits (N1):

-2LL0 = -2 [561 × ln(561/800) + 239 × ln(239/800)] = 975.682
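A quick check of this arithmetic in Python:

```python
import numpy as np

N0, N1 = 561, 239               # good credits (0s) and bad credits (1s)
N = N0 + N1                     # 800 observations in the training sample
null_deviance = -2 * (N0 * np.log(N0 / N) + N1 * np.log(N1 / N))
print(round(null_deviance, 3))  # 975.682
```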


Model Deviance (-2LL)

• Model deviance is the value of deviance after adding features (predictors) to the model.

$$ -2LL(\beta_0, \beta_1) = -2 \sum_{i=1}^{n} \left[ y_i \ln(\pi_i) + (1 - y_i) \ln(1 - \pi_i) \right] $$

where

$$ \pi_i = \frac{e^{-1.635 + 0.036 X_i}}{1 + e^{-1.635 + 0.036 X_i}} $$

-2LL before and after adding the variable "duration":

-2LL0 − (-2LL) = 975.682 − 941.754 = 33.928


Likelihood Ratio Test

• H0: β1 = β2 = … = βk = 0
• HA: Not all βs are zero

$$ G = \chi^2 = -2 \ln\left( \frac{L(\text{Null Model})}{L(\text{Given Model})} \right) = (-2LL_0) - (-2LL) $$

where the null model is the model without any predictor variables and the given model is the model with predictor variables.
Model Chi-Square

-2LL0 − (-2LL) = 975.682 − 941.754 = 33.928

CHIDIST(33.928, 1) = 5.71 × 10⁻⁹
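The same model chi-square and its p-value can be reproduced in Python (Excel's CHIDIST corresponds to the chi-square survival function):

```python
from scipy import stats

null_dev, model_dev = 975.682, 941.754  # -2LL0 and -2LL from the slides
G = null_dev - model_dev                # model chi-square; 1 df for one predictor
print(G, stats.chi2.sf(G, df=1))        # 33.928, ~5.7e-09
```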


Feature Selection in Logistic Regression

• The change in the value of deviance is used for selecting features (predictors).
• The variable that results in the maximum reduction in deviance will be chosen for adding to the model (provided the variable is statistically significant).

$$ G = -2 \ln\left( \frac{\text{likelihood without the variable}}{\text{likelihood with the variable}} \right) $$

G gives the decrease in deviance after adding a variable.
German Credit Rating – Final Model

Variables in the Equation (Step 14)

| Variable | B | S.E. | Wald | df | Sig. | Exp(B) |
|---|---|---|---|---|---|---|
| Duration | .029 | .010 | 9.079 | 1 | .003 | 1.030 |
| @0DM | 1.851 | .247 | 56.230 | 1 | .000 | 6.366 |
| lessthan200DM | 1.551 | .245 | 39.995 | 1 | .000 | 4.715 |
| over200DM | .843 | .401 | 4.431 | 1 | .035 | 2.324 |
| critical | -.781 | .248 | 9.905 | 1 | .002 | .458 |
| Bankpaid | 1.001 | .394 | 6.466 | 1 | .011 | 2.722 |
| CreditAmount | .000 | .000 | 4.724 | 1 | .030 | 1.000 |
| lessthan100 | .801 | .225 | 12.697 | 1 | .000 | 2.228 |
| less500 | .630 | .325 | 3.762 | 1 | .052 | 1.878 |
| SevenYears | -.708 | .258 | 7.559 | 1 | .006 | .492 |
| Install_rate | .338 | .090 | 14.038 | 1 | .000 | 1.402 |
| MaritalStatusSM | -.708 | .187 | 14.388 | 1 | .000 | .493 |
| CoapplicantGaurantor | -1.245 | .442 | 7.924 | 1 | .005 | .288 |
| Num_Credits | .359 | .180 | 3.971 | 1 | .046 | 1.432 |
| Constant | -4.349 | .487 | 79.694 | 1 | .000 | .013 |
Classification Table (Step 14)

| Observed | Predicted: 0 (Negative) | Predicted: 1 (Positive) | Total | Percentage Correct |
|---|---|---|---|---|
| Credit Rating = 0 (Negative) | 507 (TN) | 54 (FP) | 561 | 90.4 |
| Credit Rating = 1 (Positive) | 124 (FN) | 115 (TP) | 239 | 48.1 |
| Overall Percentage | | | | 77.8 |

$$ \text{Sensitivity} = \frac{TP}{TP + FN} = \frac{115}{115 + 124} = 48.1\% $$

$$ \text{Specificity} = \frac{TN}{TN + FP} = \frac{507}{507 + 54} = 90.4\% $$
Sensitivity & Specificity

$$ \text{Sensitivity} = \frac{\text{Number of true positives}}{\text{Number of true positives} + \text{Number of false negatives}} $$

Sensitivity is the conditional probability that the predicted value of y is 1 given that the observed value is 1 (also known as recall).

$$ \text{Specificity} = \frac{\text{Number of true negatives}}{\text{Number of true negatives} + \text{Number of false positives}} $$

Specificity is the conditional probability that the predicted value of y is 0 given that the observed value is 0.
Precision

• Precision measures the ratio of true positives among the cases that are classified as positive:

Precision = True Positives / (True Positives + False Positives)


Precision Is a Useful Measure for an Imbalanced Data Set

• Assume that a test can identify the presence of a disease with 95% accuracy.
• In the population, only 2% are known to be affected by the disease.
• If a population of 1000 people is tested using this test, then the number of false positives will far exceed the number of true positives.
Precision

| Observed | Predicted: 0 (Negative) | Predicted: 1 (Positive) | Total | Percentage Correct |
|---|---|---|---|---|
| Disease = 0 (Negative) | 931 (TN) | 49 (FP) | 980 | 95.0% |
| Disease = 1 (Positive) | 1 (FN) | 19 (TP) | 20 | 95.0% |
| Overall Percentage | | | | 95.0% |

$$ \text{Precision} = \frac{TP}{TP + FP} = \frac{19}{19 + 49} = 27.94\% $$

Overall accuracy is 95%; however, precision is only 27.94%.

F-Score (or F1 Score)

• The F-score is a measure that combines precision and recall (it is the harmonic mean of precision and recall) and is given by:

$$ F\text{-}Score = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
Measures of Classification

| Measure | Interpretation |
|---|---|
| Sensitivity (aka Recall) | P(predicted class is positive given class is positive) |
| Specificity | P(predicted class is negative given class is negative) |
| Precision | P(actual class is positive given predicted class is positive) |
| F-Score | Harmonic mean of precision and recall |

Values of Various Measures – German Credit Rating Example

| Measure | Value* |
|---|---|
| Sensitivity (aka Recall) | 48.10% |
| Specificity | 90.4% |
| Precision | 68.04% |
| F-Score | 56.35% |

\* cut-off = 0.50
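All of these measures can be reproduced from the confusion-matrix counts in the classification table above; a minimal Python sketch follows (its outputs differ from the slide values only by rounding):

```python
# Confusion-matrix counts from the German credit classification table
TN, FP, FN, TP = 507, 54, 124, 115

sensitivity = TP / (TP + FN)                # recall
specificity = TN / (TN + FP)
precision = TP / (TP + FP)
f_score = 2 * precision * sensitivity / (precision + sensitivity)
accuracy = (TP + TN) / (TP + TN + FP + FN)

for name, value in [("Sensitivity", sensitivity), ("Specificity", specificity),
                    ("Precision", precision), ("F-score", f_score),
                    ("Accuracy", accuracy)]:
    print(f"{name}: {value:.2%}")  # ~48.12%, 90.37%, 68.05%, 56.37%, 77.75%
```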
Receiver Operating Characteristic (ROC) Curve

• The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) and compares the model with random classification.
• The higher the area under the ROC curve, the better the prediction ability.
Concordant and Discordant Pairs

• Divide the dataset into positives (y = 1) and negatives (y = 0).
• For a randomly chosen positive and negative, if the predicted probability of the positive (obtained using the logistic regression model) is greater than the predicted probability of the negative, the pair is called a concordant pair.
• For a randomly chosen positive and negative, if the predicted probability of the positive is less than that of the negative, the pair is called a discordant pair.
• The area under the ROC curve is the proportion of concordant pairs in the dataset (see the sketch below).
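Here is a minimal Python sketch of this pairwise (concordance) view of the AUC, checked against scikit-learn's roc_auc_score on toy data; the function name is illustrative and scikit-learn is assumed to be installed:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_concordance(y, p_hat):
    """AUC as the proportion of concordant (positive, negative) pairs;
    ties count as half-concordant."""
    y, p_hat = np.asarray(y), np.asarray(p_hat)
    p_pos = p_hat[y == 1]                   # predicted probabilities of positives
    p_neg = p_hat[y == 0]                   # predicted probabilities of negatives
    diff = p_pos[:, None] - p_neg[None, :]  # all n1 x n2 pairwise differences
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

y = np.array([0, 0, 1, 1, 0, 1])
p = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
print(auc_concordance(y, p), roc_auc_score(y, p))  # both ~0.889
```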
ROC and Area Under ROC

[Figure: ROC curve for German Credit Rating with Duration as the only covariate, AUC = 0.629]

[Figure: ROC curve for German Credit Rating after inclusion of all variables, AUC = 0.801]
Area Under the ROC Curve

• The area under the ROC curve (AUC) is interpreted as the probability that the model will rank a randomly chosen positive higher than a randomly chosen negative.
• If n1 is the number of positives (1s) and n2 is the number of negatives (0s), then the AUC is the proportion of the n1 × n2 possible (positive, negative) pairs in which the positive has the higher predicted probability.

$$ AUC = P\left[ \hat{P}(\text{random positive}) > \hat{P}(\text{random negative}) \right] $$

• The AUC is thus a measure of the ability of the logistic regression model to discriminate positives and negatives correctly.
ROC Curve

• General rule for acceptance of the model, based on the area under the ROC curve:
  • ROC area = 0.5: no discrimination
  • 0.7 ≤ ROC area < 0.8: acceptable discrimination
  • 0.8 ≤ ROC area < 0.9: excellent discrimination
  • ROC area ≥ 0.9: outstanding discrimination

Optimal Cut-off Probabilities

• Using classification plots.
• Youden's index.
• Cost-based optimization.
Classification Plots

Classification Plot: Challenger Crash

[Figure: SPSS classification plot (step 1) showing the frequency of observed groups (0s and 1s) against predicted probabilities. Observed 0s cluster at low predicted probabilities and observed 1s at high predicted probabilities; observations that fall on the wrong side of the cut-off are misclassifications.]
Youden's Index

• Youden's index is a measure of diagnostic accuracy. It is also a global measure of test performance, used to evaluate the overall discriminative power of a diagnostic procedure.
• Youden's index is calculated by deducting 1 from the sum of the test's sensitivity and specificity (a cut-off search is sketched below):

$$ \text{Youden's Index: } J(p) = \text{Sensitivity}(p) + \text{Specificity}(p) - 1 $$
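A minimal Python sketch (the helper name and cut-off grid are illustrative) that scans cut-offs and returns the one maximizing Youden's index, assuming arrays y (observed 0/1 outcomes) and p_hat (model probabilities):

```python
import numpy as np

def best_youden_cutoff(y, p_hat, cutoffs=np.arange(0.05, 1.0, 0.05)):
    """Return the cut-off with maximum J(p) = sensitivity + specificity - 1."""
    y, p_hat = np.asarray(y), np.asarray(p_hat)
    best_c, best_j = None, -1.0
    for c in cutoffs:
        pred = (p_hat >= c).astype(int)
        tp = np.sum((pred == 1) & (y == 1)); fn = np.sum((pred == 0) & (y == 1))
        tn = np.sum((pred == 0) & (y == 0)); fp = np.sum((pred == 1) & (y == 0))
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_c, best_j = c, j
    return best_c, best_j
```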


Cost-Based Model for Optimal Cut-off

| Observed | Predicted: 0 | Predicted: 1 |
|---|---|---|
| 0 | N00 | N01 |
| 1 | N10 | N11 |

C00 = cost of classifying 0 as 0
C01 = cost of classifying 0 as 1
C10 = cost of classifying 1 as 0
C11 = cost of classifying 1 as 1

The optimal cut-off probability solves:

$$ \min_{p} \left[ C_{01} N_{01}(p) + C_{10} N_{10}(p) \right] $$
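A minimal Python sketch of this cost minimization over a grid of cut-offs (the helper name and grid are illustrative; C00 and C11 are taken as zero, as in the example below):

```python
import numpy as np

def min_cost_cutoff(y, p_hat, c01, c10, cutoffs=np.arange(0.05, 1.0, 0.01)):
    """Cut-off minimizing C01*N01(p) + C10*N10(p); C00 = C11 = 0."""
    y, p_hat = np.asarray(y), np.asarray(p_hat)
    costs = []
    for c in cutoffs:
        pred = (p_hat >= c).astype(int)
        n01 = np.sum((y == 0) & (pred == 1))  # 0s classified as 1 (false positives)
        n10 = np.sum((y == 1) & (pred == 0))  # 1s classified as 0 (false negatives)
        costs.append(c01 * n01 + c10 * n10)
    i = int(np.argmin(costs))
    return cutoffs[i], costs[i]
```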
German Credit Rating – Cost-Based Cut-off

C00 = cost of classifying 0 as 0 = 0
C01 = cost of classifying 0 as 1 = 100
C10 = cost of classifying 1 as 0 = 200
C11 = cost of classifying 1 as 1 = 0
Optimal Cut-off Probability

| Cut-off Probability | P00 | P01 | P10 | P11 | C01 | C10 | Cost | Youden's Index |
|---|---|---|---|---|---|---|---|---|
| 0.05 | 0.15 | 0.85 | 0.01 | 0.99 | 100.00 | 200.00 | 88.00 | 0.13 |
| 0.10 | 0.32 | 0.68 | 0.07 | 0.93 | 100.00 | 200.00 | 81.80 | 0.25 |
| 0.15 | 0.48 | 0.52 | 0.11 | 0.89 | 100.00 | 200.00 | 74.20 | 0.37 |
| 0.20 | 0.56 | 0.44 | 0.14 | 0.86 | 100.00 | 200.00 | 72.3 | 0.42 |
| 0.25 | 0.63 | 0.37 | 0.21 | 0.80 | 100.00 | 200.00 | 78.1 | 0.42 |
| 0.28 | 0.67 | 0.33 | 0.23 | 0.77 | 100.00 | 200.00 | 77.8 | 0.45 |
| 0.30 | 0.71 | 0.29 | 0.24 | 0.76 | 100.00 | 200.00 | 77.3 | 0.47 |
| 0.35 | 0.76 | 0.24 | 0.76 | 0.70 | 100.00 | 200.00 | 175.3 | 0.46 |

The minimum-cost cut-off is 0.20 (cost = 72.3); the maximum Youden's index among these cut-offs occurs at 0.30 (J = 0.47).
Bank Term Deposit Data – Target Marketing

Total data = 4521; number of positives = 521; number of negatives = 4000.

Conversion rate = 521/4521 = 11.5%

Gain and Lift

$$ \text{Gain} = \frac{\text{Cumulative number of positive observations up to decile } i}{\text{Total number of positive observations in the data}} $$

$$ \text{Lift} = \frac{\text{Cumulative gain using the LR model}}{\text{Cumulative gain using the random model}} $$
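A minimal pandas sketch (the helper name is illustrative) that builds such a gain/lift table by sorting observations by predicted probability and splitting them into deciles, assuming arrays y (0/1 outcomes) and p_hat (predicted probabilities):

```python
import numpy as np
import pandas as pd

def gain_lift_table(y, p_hat, n_deciles=10):
    """Gain and lift by decile, sorting by predicted probability (descending)."""
    df = pd.DataFrame({"y": y, "p": p_hat}).sort_values("p", ascending=False)
    # Decile 1 holds the top 10% of observations by predicted probability
    df["decile"] = pd.qcut(np.arange(len(df)), n_deciles, labels=False) + 1
    tab = df.groupby("decile")["y"].sum().to_frame("positives")
    tab["cum_positives"] = tab["positives"].cumsum()
    tab["gain"] = tab["cum_positives"] / df["y"].sum()
    # A random model captures decile/n_deciles of the positives cumulatively
    tab["lift"] = tab["gain"] / (tab.index / n_deciles)
    return tab
```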
Gain

| Decile | Number of Observations | Positives Without Model | Positives Using Model | Cumulative Positives | Gain |
|---|---|---|---|---|---|
| 1 | 452.1 | 52.1 | 223 | 223 | 0.4280 |
| 2 | 904.2 | 104.2 | 122 | 345 | 0.6622 |
| 3 | 1356.3 | 156.3 | 74 | 419 | 0.8042 |
| 4 | 1808.4 | 208.4 | 38 | 457 | 0.8772 |
| 5 | 2260.5 | 260.5 | 27 | 484 | 0.9290 |
| 6 | 2712.6 | 312.6 | 11 | 495 | 0.9501 |
| 7 | 3164.7 | 364.7 | 18 | 513 | 0.9846 |
| 8 | 3616.8 | 416.8 | 3 | 516 | 0.9904 |
| 9 | 4068.9 | 468.9 | 4 | 520 | 0.9981 |
| 10 | 4521 | 521 | 1 | 521 | 1.0000 |
Gain Chart

[Figure: gain chart comparing the logistic regression model with the random model]
Lift

| Decile | Number of Observations | Positives Without Model | Positives Using Model | Cumulative Positives | Gain | Lift |
|---|---|---|---|---|---|---|
| 1 | 452.1 | 52.1 | 223 | 223 | 0.4280 | 4.280 |
| 2 | 904.2 | 104.2 | 122 | 345 | 0.6622 | 3.311 |
| 3 | 1356.3 | 156.3 | 74 | 419 | 0.8042 | 2.681 |
| 4 | 1808.4 | 208.4 | 38 | 457 | 0.8772 | 2.193 |
| 5 | 2260.5 | 260.5 | 27 | 484 | 0.9290 | 1.858 |
| 6 | 2712.6 | 312.6 | 11 | 495 | 0.9501 | 1.583 |
| 7 | 3164.7 | 364.7 | 18 | 513 | 0.9846 | 1.407 |
| 8 | 3616.8 | 416.8 | 3 | 516 | 0.9904 | 1.238 |
| 9 | 4068.9 | 468.9 | 4 | 520 | 0.9981 | 1.109 |
| 10 | 4521 | 521 | 1 | 521 | 1.0000 | 1.000 |
Lift Chart

[Figure: lift by decile for the logistic regression model versus the random model]
R² in Logistic Regression

• In linear regression, R² is the proportion of variation explained by the regression model.
• It is not possible to develop an R²-type measure for logistic regression in the same way, since the variance of the error term is not constant.
• Many pseudo-R² values are used in logistic regression. A pseudo-R² is an indicator of the strength of the relationship.
R² in Logistic Regression

• R-squared is a measure of improvement from the null model to the fitted model. The denominator of the ratio can be thought of as the sum of squared errors from the null model, a model that predicts the dependent variable without any independent variables.
• In the null model, each y value is predicted to be the mean of the y values.
McFadden's R²

$$ \text{McFadden's } R^2 = 1 - \frac{LL(\text{Model with predictors})}{LL(\text{Intercept-only model})} $$

A value above 0.2 is generally considered acceptable.

Cox and Snell R²

• Based on the likelihood ratio:

$$ R^2 = 1 - \left( \frac{L(\text{Null Model})}{L(\text{Full Model})} \right)^{2/n} $$

where L denotes the likelihood (not the log-likelihood), the null model is the model without predictors, the full model is the model with predictors, and n is the number of observations.

Nagelkerke R²

• The maximum value of the Cox and Snell R² may not be 1. Nagelkerke modified the Cox and Snell R² so that the maximum value equals 1:

$$ \text{Nagelkerke } R^2 = \frac{1 - \left( \dfrac{L(\text{Null Model})}{L(\text{Full Model})} \right)^{2/n}}{1 - L(\text{Null Model})^{2/n}} $$
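The three pseudo-R² measures above can be computed together from the null- and full-model log-likelihoods; a minimal sketch (the helper name is illustrative):

```python
import numpy as np

def pseudo_r2(ll_null, ll_full, n):
    """McFadden, Cox & Snell, and Nagelkerke pseudo R-squared values,
    given the log-likelihoods of the null and full models and sample size n."""
    mcfadden = 1 - ll_full / ll_null
    # Cox & Snell: 1 - (L0/L1)^(2/n) = 1 - exp((2/n) * (LL0 - LL1))
    cox_snell = 1 - np.exp((2.0 / n) * (ll_null - ll_full))
    # Nagelkerke rescales Cox & Snell by its maximum, 1 - L0^(2/n)
    nagelkerke = cox_snell / (1 - np.exp((2.0 / n) * ll_null))
    return mcfadden, cox_snell, nagelkerke
```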
R-Square for Challenger Data

Model Summary

| Step | -2 Log Likelihood | Cox & Snell R Square | Nagelkerke R Square |
|---|---|---|---|
| 1 | 20.371ᵃ | .301 | .430 |

a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.
Confidence Intervals for Beta Values

The 100(1 − α)% confidence interval for β1 and β0 is given by:

$$ \hat{\beta}_1 \pm z_{1-\alpha/2} \, SE(\hat{\beta}_1) $$

$$ \hat{\beta}_0 \pm z_{1-\alpha/2} \, SE(\hat{\beta}_0) $$
CI for Challenger Beta Value

0.036 − 1.96 × 0.006 ≤ β1 ≤ 0.036 + 1.96 × 0.006

0.02424 ≤ β1 ≤ 0.04776
Confidence Intervals for Exp(β1)

• The confidence interval for the odds ratio, exp(β1), can be obtained by transforming the confidence interval for β1 (see the sketch below).
• If β1 is significant, the confidence interval will NOT contain the value 1.

$$ \left( e^{\hat{\beta}_1 - z \, SE(\hat{\beta}_1)}, \; e^{\hat{\beta}_1 + z \, SE(\hat{\beta}_1)} \right) = (1.024, 1.050) $$
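A minimal Python sketch reproducing these intervals; small differences from the slide's (1.024, 1.050) are due to rounding of the displayed coefficient and standard error:

```python
import numpy as np
from scipy import stats

beta_hat, se = 0.036, 0.006  # estimate and standard error from the slides
z = stats.norm.ppf(0.975)    # ~1.96 for a 95% interval
lo, hi = beta_hat - z * se, beta_hat + z * se

print(f"95% CI for beta:   ({lo:.5f}, {hi:.5f})")                  # (0.02424, 0.04776)
# Slide reports (1.024, 1.050) from unrounded inputs; rounding gives ~(1.025, 1.049)
print(f"95% CI for exp(b): ({np.exp(lo):.3f}, {np.exp(hi):.3f})")
```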
Influential Observations

• Cook's distance should be less than 1 (otherwise the observation is classified as an influential observation).
• Leverage values should be less than 3 times the average leverage value.
• The absolute value of the standardized residual should be less than 3.


DFBeta

• DFBeta is the change in the estimated beta value when an observation is removed from the sample:

DFBeta = Beta (with the observation in the sample) − Beta (without the observation in the sample)
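A minimal leave-one-out sketch of DFBeta using statsmodels (the helper name is illustrative), assuming y and X are NumPy arrays and X already includes a constant column:

```python
import numpy as np
import statsmodels.api as sm

def dfbeta(y, X, i):
    """DFBeta for observation i: beta (full sample) - beta (sample without i)."""
    full = sm.Logit(y, X).fit(disp=0).params        # fit on the full sample
    mask = np.ones(len(y), dtype=bool)
    mask[i] = False                                  # drop observation i
    reduced = sm.Logit(y[mask], X[mask]).fit(disp=0).params
    return full - reduced
```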
Recommended Readings

• D. W. Hosmer and S. Lemeshow, Applied Logistic Regression, John Wiley, 2000.
• Thomas P. Ryan, Modern Regression Methods, John Wiley, 2009.
LR Resources on the Web

• http://faculty.chass.ncsu.edu/garson/PA765/logistic.htm
• http://www.ats.ucla.edu/stat/Spss/topics/logistic_regression.htm
