The Wald test statistic value is 4.832 and the corresponding p-value is 0.028. Since the p-value is less than 0.05, we reject the null hypothesis. That is, the variable ‘launch temperature’ has a statistically significant relationship with O-ring failure.
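The Wald statistic can be reproduced from a coefficient estimate and its standard error. A minimal sketch in Python; the coefficient and standard error below are illustrative assumptions (not values quoted in the text), chosen so the computation lands near the reported statistic:

```python
from scipy import stats

# Hypothetical values for the 'launch temperature' coefficient; both numbers
# are assumptions for illustration, not taken from the text.
beta_hat = -0.2322   # assumed coefficient estimate
se_beta = 0.1056     # assumed standard error

wald = (beta_hat / se_beta) ** 2       # Wald chi-square statistic
p_value = stats.chi2.sf(wald, df=1)    # upper-tail chi-square probability, 1 df

print(f"Wald = {wald:.3f}, p-value = {p_value:.3f}")  # ~4.83 and ~0.028
```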
HOSMER-LEMESHOW GOODNESS OF FIT TEST
The Hosmer-Lemeshow (H-L) test is a chi-square goodness-of-fit test. The H-L test is constructed by dividing the data set into 10 groups (deciles) and checking whether the observed and expected frequencies in each group are equal. The null and alternative hypotheses in the H-L test are
H0: The logistic regression model fits the data
H1: The logistic regression model does not fit the data
Since the p-value (0.1411) is greater than 0.05, we fail to reject the null hypothesis that the logistic regression model fits the data, and we claim that the logistic regression is appropriate.
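A minimal sketch of how the H-L statistic could be computed, assuming y holds the observed 0/1 outcomes and p the predicted probabilities; the grouping and the g − 2 degrees of freedom follow the standard construction:

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow(y, p, g=10):
    """Hosmer-Lemeshow chi-square goodness-of-fit test (minimal sketch)."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    order = np.argsort(p)
    groups = np.array_split(order, g)          # deciles of predicted risk
    chi2 = 0.0
    for idx in groups:
        exp1 = p[idx].sum()                    # expected events in the group
        obs1 = y[idx].sum()                    # observed events in the group
        exp0 = len(idx) - exp1                 # expected non-events
        obs0 = len(idx) - obs1                 # observed non-events
        chi2 += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return chi2, stats.chi2.sf(chi2, df=g - 2)  # df = g - 2
```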
PSEUDO R2
It is not possible to calculate R² in a logistic regression model as in the case of a continuous dependent variable. However, several pseudo R² measures are used which compare the intercept-only model to the model with independent variables.
• Cox and Snell R²: R²_CS = 1 − (L₀ / L_M)^(2/n), where L₀ and L_M are the likelihoods of the intercept-only and fitted models and n is the number of observations.
• Nagelkerke’s R²: an adjustment over Cox and Snell R², so that the maximum value of the pseudo R² is 1. Nagelkerke R² is given by R²_N = R²_CS / (1 − L₀^(2/n)). A computation sketch follows.
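Both measures depend only on the log-likelihoods of the intercept-only and fitted models. A minimal sketch; with statsmodels, for example, a fitted Logit result exposes these as llnull and llf:

```python
import numpy as np

def pseudo_r2(llf, llnull, n):
    """Cox-Snell and Nagelkerke pseudo R-squared from log-likelihoods.

    llf: log-likelihood of the fitted model; llnull: log-likelihood of the
    intercept-only model; n: number of observations.
    """
    r2_cs = 1.0 - np.exp(2.0 * (llnull - llf) / n)   # Cox and Snell R^2
    r2_max = 1.0 - np.exp(2.0 * llnull / n)          # maximum attainable value
    return r2_cs, r2_cs / r2_max                     # Nagelkerke rescales so max is 1
```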
LOGISTIC REGRESSION GOODNESS OF FIT
To classify the observations, the decision maker has to first decide the
classification cut-off probability Pc.
• For a classification cut-off probability of 0.5, the model has classified all 17 negatives (coded as 0) as negatives, 4 positives (coded as 1) as positives, and the remaining 3 positives as negatives.
• The accuracy of classifying negatives is 100%, whereas the accuracy of classifying
positives is 57.1%.
• The overall accuracy of the logistic regression model is 87.5% (3 observations are
misclassified out of 24 observations).
CLASSIFICATION TABLE
The classification accuracy depends on the cut-off probability. Classification table for the cut-off probability of 0.2:
• The overall accuracy decreases to 62.5%; however, the accuracy of predicting positives increases from 57.1% to 85.7%.
• The accuracy of predicting negatives decreases from 100% to 52.9%. A tabulation sketch follows.
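A minimal sketch of how such a classification table could be tabulated at any cut-off; y_actual and p_hat are assumed to be the observed outcomes and predicted probabilities from the fitted model:

```python
import numpy as np

def classification_table(y, p, cutoff):
    """Confusion-matrix counts and overall accuracy at a cut-off (sketch)."""
    y = np.asarray(y)
    pred = (np.asarray(p) >= cutoff).astype(int)   # positive if p >= cutoff
    tp = int(np.sum((pred == 1) & (y == 1)))
    tn = int(np.sum((pred == 0) & (y == 0)))
    fp = int(np.sum((pred == 1) & (y == 0)))
    fn = int(np.sum((pred == 0) & (y == 1)))
    return tp, tn, fp, fn, (tp + tn) / len(y)

# Comparing the two cut-offs discussed above (y_actual and p_hat assumed):
# for c in (0.5, 0.2):
#     print(c, classification_table(y_actual, p_hat, c))
```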
ACCURACY PARADOX
Overall accuracy alone can be misleading when one class dominates the data; hence we also examine sensitivity and specificity.
Sensitivity = TP / (TP + FN)
where
• True Positive (TP) is the number of positives correctly classified as positives by the model and
• False Negative (FN) is the number of positives misclassified as negatives by the model.
Specificity = TN / (TN + FP)
where
• True Negative (TN) is the number of negatives correctly classified as negatives by the model and
• False Positive (FP) is the number of negatives misclassified as positives by the model.
When the cut-off probability is 0.5, the sensitivity and specificity are given by
Sensitivity = 4 / (4 + 3) = 57.1% and Specificity = 17 / (17 + 0) = 100%.
PRECISION AND F-SCORE (F-MEASURE)
Precision measures the accuracy of the positives classified by the model: Precision = TP / (TP + FP). The F-score is the harmonic mean of precision and sensitivity (recall): F = 2 × Precision × Sensitivity / (Precision + Sensitivity). A small computation sketch follows.
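A minimal sketch computing sensitivity, specificity, precision and F-score from the four confusion-matrix counts defined above:

```python
def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, precision and F-score from confusion counts."""
    sensitivity = tp / (tp + fn)   # share of actual positives identified
    specificity = tn / (tn + fp)   # share of actual negatives identified
    precision = tp / (tp + fp)     # share of predicted positives that are correct
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, precision, f_score

# At the 0.5 cut-off above (TP = 4, FN = 3, TN = 17, FP = 0):
# sensitivity = 4/7 = 57.1% and specificity = 17/17 = 100%.
```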
Consider two observations, STS 1 (Y = 0, predicted probability 0.43) and STS 2 (Y = 1, predicted probability 0.23). There is no cut-off probability that can classify both of them correctly.
• For example, if we choose a cut-off probability less than 0.23, then STS 1 will be misclassified, since its predicted probability of 0.43 exceeds the cut-off and it will be classified as positive although Y = 0.
• If we choose a classification cut-off probability between 0.23 and 0.43, then both of them will be misclassified, since the probability of STS 2 is 0.23 (whereas the corresponding Y = 1) and the probability of STS 1 is 0.43 (whereas the value of Y = 0).
• If the cut-off probability is greater than 0.43, then STS 2 will be misclassified. Thus, if we select STS 1 (for which Y = 0) and STS 2 (for which Y = 1), there is no cut-off probability that can classify both the positive (Y = 1) and the negative (Y = 0) correctly.
• Such pairs are called Discordant Pairs.
CONCORDANT PAIRS
A pair of positive and negative observations for which there exists a cut-off probability that classifies both of them correctly is called a concordant pair.
• Consider observations STS 9 (launch temperature 70°F and Y = 0) and STS 41B (launch temperature 57°F and Y = 1). The predicted probability of O-ring damage is 0.23 for STS 9 and 0.86 for STS 41B.
• If we use a classification cut-off probability between 0.23 and 0.86, then we will classify both STS 9 (Y = 0) and STS 41B (Y = 1) correctly. Such pairs are called Concordant Pairs. A sketch for counting concordant and discordant pairs follows.
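A minimal sketch of how the shares of concordant and discordant pairs could be counted, assuming y holds the 0/1 outcomes and p the predicted probabilities:

```python
import numpy as np

def concordance(y, p):
    """Shares of concordant, discordant and tied positive-negative pairs (sketch)."""
    y, p = np.asarray(y), np.asarray(p)
    pos = p[y == 1]                        # predicted probabilities of positives
    neg = p[y == 0]                        # predicted probabilities of negatives
    diff = pos[:, None] - neg[None, :]     # one entry per positive-negative pair
    return (float(np.mean(diff > 0)),      # concordant: positive ranked higher
            float(np.mean(diff < 0)),      # discordant: negative ranked higher
            float(np.mean(diff == 0)))     # tied
```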
For the Challenger crash data, the AUC is 0.794. The area under the ROC curve can be interpreted as follows:
• If we use the logistic regression model, then there will be 79.4% concordant pairs and
20.6% discordant pairs.
• For a randomly selected pair of positive and negative observations, the probability that the model ranks the positive above the negative is 0.794.
• For randomly selected positive and negative observations, the following relationship is valid:
P[ P(Positive) > P(Negative) ] = 0.794
As a rule of thumb, an AUC of at least 0.7 is required for practical application of the model. An AUC greater than 0.9 implies an outstanding model. A small AUC sketch follows.
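A minimal sketch; scikit-learn's roc_auc_score computes the AUC directly from observed outcomes and predicted probabilities (y_actual and p_hat are assumed inputs from the fitted model):

```python
from sklearn.metrics import roc_auc_score

def model_auc(y_actual, p_hat):
    """AUC of predicted probabilities against observed 0/1 outcomes (sketch)."""
    return roc_auc_score(y_actual, p_hat)

# For the Challenger data the text reports an AUC of about 0.794: a randomly
# chosen positive is ranked above a randomly chosen negative with probability 0.794.
```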
LORENZ CURVE AND GINI COEFFICIENT
Max O. Lorenz, an American economist, developed a wealth distribution plot to quantify the wealth discrimination among the citizens of a country, based on each citizen's share of the total wealth in the society.
[Figure: Lorenz curve, plotting the cumulative proportion of wealth against the cumulative proportion of population; the 45° diagonal represents no wealth discrimination.]
We can connect the area under the ROC curve (AUC) with the Gini coefficient. In the ROC curve, the area A is the area above the diagonal line, and the Gini coefficient equals 2 × Area A:
AUC = (1/2) + Area A
2 AUC = 1 + 2 Area A = 1 + Gini coefficient
Gini coefficient = 2 AUC − 1
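A trivial sketch of the relationship just derived:

```python
def gini_from_auc(auc):
    """Gini coefficient implied by an AUC value (Gini = 2 AUC - 1)."""
    return 2.0 * auc - 1.0

# gini_from_auc(0.794) = 0.588 for the Challenger model reported above.
```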
The following three methods are used for selecting the cut-off
probability.
1. Classification plot
2. Youden’s Index
3. Cost based approach
CLASSIFICATION PLOT
A classification plot is a plot of the predicted probability from the logistic regression (LR) model on the horizontal axis against the corresponding frequencies of the observations on the vertical axis.
We can see that there is no negative (coded as Y = 0) beyond a probability of 0.5, whereas there are positives even at a probability value of approximately 0.08. So, if we have to ensure that all the positives are correctly identified, then we have to set a classification cut-off probability of less than 0.08.
At a cut-off value of 0.05, we are able to classify all positives (Y = 1) correctly (sensitivity = 100%), whereas only 4 out of 17 negatives (Y = 0) are correctly classified; that is, the specificity is only 23.5%. A plotting sketch follows.
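A minimal sketch of a classification plot using matplotlib, assuming y and p hold the observed 0/1 outcomes and predicted probabilities:

```python
import matplotlib.pyplot as plt

def classification_plot(y, p, cutoff=None):
    """Histogram of predicted probabilities by actual class (minimal sketch)."""
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    plt.hist([neg, pos], bins=10, range=(0, 1), stacked=True,
             label=["negatives (Y = 0)", "positives (Y = 1)"])
    if cutoff is not None:
        plt.axvline(cutoff, linestyle="--", label=f"cut-off = {cutoff}")
    plt.xlabel("predicted probability")
    plt.ylabel("frequency")
    plt.legend()
    plt.show()
```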
YOUDEN’S INDEX FOR OPTIMAL CUT-OFF PROBABILITY
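Youden's index is J = sensitivity + specificity − 1, and the optimal cut-off is the probability value that maximizes J. A minimal sketch, assuming y and p hold the 0/1 outcomes and predicted probabilities:

```python
import numpy as np

def youden_optimal_cutoff(y, p):
    """Cut-off maximizing Youden's index J = sensitivity + specificity - 1."""
    y, p = np.asarray(y), np.asarray(p)
    best_j, best_cutoff = -1.0, None
    for c in np.unique(p):                  # candidate cut-offs: observed probabilities
        pred = (p >= c).astype(int)
        sens = np.mean(pred[y == 1] == 1)   # sensitivity at this cut-off
        spec = np.mean(pred[y == 0] == 0)   # specificity at this cut-off
        j = sens + spec - 1.0
        if j > best_j:
            best_j, best_cutoff = j, c
    return best_cutoff, best_j
```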
The bankruptcy data we are dealing with here has 5,436 observations and 13 variables. The variable of interest is DLRSN, which is coded as 0/1, where 1 means bankruptcy and 0 means non-bankruptcy. There are 10 variables which will help us predict whether there is a chance of bankruptcy or not. In this project we use logistic regression along with different variable selection methods and cross-validation to achieve good out-of-sample accuracy; a minimal modeling sketch follows.
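A sketch under stated assumptions: the file name bankruptcy.csv is a placeholder, only the DLRSN column name comes from the description above, and five-fold cross-validated AUC stands in for the out-of-sample accuracy check:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("bankruptcy.csv")     # hypothetical file name
X = df.drop(columns=["DLRSN"])         # candidate predictor variables
y = df["DLRSN"]                        # 1 = bankruptcy, 0 = non-bankruptcy

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")  # out-of-sample AUC
print(scores.mean())
```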
APPLICATION OF LOGISTIC REGRESSION IN CREDIT RATING
We will be using sample data (file name: German Credit Rating.xlsx; the data set has 800 observations and 13 attributes) taken from the ‘German Credit Data’ available at the University of California, Irvine machine learning repository. Note that the original data set has 20 attributes (predictors) with 1,000 observations. The response variable Y takes the value 1 (bad credit) or 0 (good credit). Predictors used in developing the credit rating model are listed in the accompanying table.