Steps in Logistic Regression

1. Univariate analysis of the variables
   a. Check for outliers and cap them
   b. Replace missing values with the 95th percentile or µ+3σ (see the SAS sketch below)
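
A minimal SAS sketch of the capping and imputation above, assuming a development dataset work.dev with a numeric predictor x1 (the dataset and variable names are hypothetical):

      /* Percentiles, mean and std of x1 for capping and imputation */
      proc univariate data=work.dev noprint;
         var x1;
         output out=work.stats pctlpts=95 99 pctlpre=p mean=mu std=sigma;
      run;

      /* Cap outliers at the 99th percentile; impute missing values with
         the 95th percentile (mu + 3*sigma is the alternative noted above) */
      data work.dev_clean;
         if _n_ = 1 then set work.stats;
         set work.dev;
         if not missing(x1) and x1 > p99 then x1 = p99;  /* cap outliers   */
         if missing(x1) then x1 = p95;                   /* impute missing */
         drop p95 p99 mu sigma;
      run;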

2. Bivariate analysis
   a. Bivariate analysis of the dependent variable (% of 1s) against the independent variables
   b. Check the relationship of each independent variable with the dependent variable (the sketch below shows one way to profile event rates)
   c. Apply transformations wherever necessary
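
One simple way to profile the event rate against a predictor, carrying over work.dev_clean, x1 and a 0/1 target b_status from above (a sketch, not the only approach):

      /* Mean of a 0/1 target by decile of x1 = % of 1s per decile */
      proc rank data=work.dev_clean groups=10 out=work.ranked;
         var x1;
         ranks x1_decile;
      run;

      proc means data=work.ranked mean n;
         class x1_decile;
         var b_status;
      run;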

3. Correlation check of the independent variables
   a. Use proc corr to check the correlations among the independent variables
   b. Use proc reg with b_status as the DV and the IVs as predictors to check the VIF values, and remove variables with high multicollinearity (see the sketch below)
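
A sketch of both checks, with x1-x3 standing in for the candidate IVs:

      /* Pairwise correlations among the IVs */
      proc corr data=work.dev_clean;
         var x1 x2 x3;
      run;

      /* proc reg with b_status as DV, used only to obtain VIFs for the IVs */
      proc reg data=work.dev_clean;
         model b_status = x1 x2 x3 / vif;
      run;
      quit;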

4. Stepwise/backward logistic regression as a variable-reduction technique, followed by scoring of the covariates
   a. Use p-value = 0.05 as both the entry and the exit criterion for the stepwise/backward logistic regression
   b. Score the covariates to obtain the propensity scores (see the sketch below)
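
A sketch of the stepwise selection and scoring; slentry/slstay implement the 0.05 entry and exit criteria, and x1-x3 are again placeholder predictors:

      proc logistic data=work.dev_clean;
         model b_status(event='1') = x1 x2 x3
               / selection=stepwise slentry=0.05 slstay=0.05;
         output out=work.scored p=propensity;  /* propensity score per record */
      run;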

5. Model Fit Statistics

   a. Akaike Information Criterion (AIC) and Schwarz Criterion (SC) - Both are derived from negative two times the log-likelihood (-2 Log L), penalizing the log-likelihood for the number of predictors in the model. The model with the smallest AIC and SC is considered the best, although the values by themselves are not meaningful.
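
      Concretely, for a model with k estimated parameters fitted on n observations:

         AIC = -2 Log L + 2k
         SC  = -2 Log L + k * ln(n)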

   b. -2 Log L - The -2 Log L is used in hypothesis tests for nested models; its value by itself is not meaningful.

   c. Test (Testing the Global Null Hypothesis) - These are three asymptotically equivalent chi-square tests. They test the null hypothesis that all of the predictors' regression coefficients are equal to zero against the alternative that at least one of them is not.

   d. Likelihood Ratio - The likelihood-ratio statistic is minus two times the log of the ratio of the likelihood functions of two models, L reduced and L full: LR = -2 ln(L reduced / L full) = 2 (ln L full - ln L reduced). L full is the model containing all the explanatory variables; L reduced is the model containing all of them except the explanatory variable of interest. The fit of the full model is always at least as good as the fit of the reduced model. The likelihood-ratio test is generally preferred over the Wald test because its p-values are more reliable.
   e. Score - This is the score chi-square test of the same null hypothesis: that all of the predictors' regression coefficients are equal to zero.

   f. Wald - This is the Wald chi-square test of the same null hypothesis.

6. Model Diagnostics

   a. Association of Predicted Probabilities and Observed Responses:

      i. Concordance - A measure of the association between the predicted and the actual values. Consider a set of 100 individuals, of which 10 are responders (denoted by 1) and 90 are non-responders (denoted by 0). Pair each responder with every non-responder, giving 10*90 = 900 pairs. Using the model under development, calculate the predicted response probability for the responder and the non-responder in every pair. If the responder's predicted probability is greater than the non-responder's, the pair is concordant; if it is lower, the pair is discordant; if the two are equal, the pair is tied. For a good model, the percentage of concordant pairs lies above 65%.

      ii. c-statistic - The area under the ROC curve, a measure of the discrimination of the model. It is defined as

             c = (# concordant pairs + 0.5 * # tied pairs) / (# concordant pairs + # discordant pairs + # tied pairs)

          It is the probability that a subject with a positive outcome has a higher predicted event probability than a subject with a negative outcome. For an acceptable model the c-statistic typically lies between 0.6 and 0.8 (a value of 0.5 means no discrimination).
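
          For example, continuing the 900-pair illustration above: if 600 pairs are concordant, 250 discordant and 50 tied, then c = (600 + 0.5*50) / 900 = 625/900 ≈ 0.69.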

   b. ROC (Receiver Operating Characteristic) Curve - The ROC curve is a plot of sensitivity vs. 1-specificity.

      Sensitivity - the ability to correctly classify an event:
         Sensitivity = true positives / (true positives + false negatives)
      Specificity - the ability to correctly classify a non-event:
         Specificity = true negatives / (true negatives + false positives)

      The curve plots the probability of correctly classifying a positive subject (sensitivity) against the probability of incorrectly classifying a negative subject (one minus specificity) over the entire range of possible cutoff points. The area under the ROC curve, which ranges from zero to one, provides a measure of the model's ability to discriminate: the larger the area under the ROC curve, the better the model discriminates. To choose an optimal cutoff point for classification, one might select the point that maximizes both sensitivity and specificity; this choice is facilitated by the ROC curve, as the best cutoff lies approximately where the curve starts bending. (A SAS sketch for producing the curve follows.)
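
A sketch for producing the ROC curve in SAS with ODS graphics; outroc= saves the sensitivity/1-specificity coordinates for custom plotting:

      ods graphics on;
      proc logistic data=work.dev_clean plots(only)=roc;
         model b_status(event='1') = x1 x2 x3 / outroc=work.roc_data;
      run;
      ods graphics off;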

   c. Classification table - The classification table helps identify the cutoff probability that maximizes both sensitivity and specificity (see the sketch below).
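
In proc logistic the table is requested with the ctable option, evaluated over a grid of cutoffs (the grid here is illustrative):

      proc logistic data=work.dev_clean;
         model b_status(event='1') = x1 x2 x3
               / ctable pprob=(0.05 to 0.95 by 0.05);
      run;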

   d. Goodness of Fit - The Hosmer and Lemeshow (H-L) goodness-of-fit test divides subjects into deciles based on predicted probabilities, then computes a chi-square statistic from the observed and expected frequencies. A probability (p) value is then computed from the chi-square distribution to test the fit of the logistic model. If the p-value of the H-L test is greater than .05, as we want for well-fitting models, we fail to reject the null hypothesis that there is no difference between observed and model-predicted values, implying that the model's estimates fit the data at an acceptable level. That is, well-fitting models show non-significance on the goodness-of-fit test, indicating model predictions that are not significantly different from the observed values.
      Like other significance tests, it only tells us whether the model fits or not; it does not tell us anything about the extent of the fit. Similarly, like other significance tests, it is strongly influenced by sample size: as the sample size gets large, the H-L statistic can flag smaller and smaller differences between observed and model-predicted values as significant. (A sketch of requesting the test follows.)
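
In proc logistic the H-L test is requested with the lackfit option:

      proc logistic data=work.dev_clean;
         model b_status(event='1') = x1 x2 x3 / lackfit;
      run;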

   e. Confusion Matrix - The confusion matrix contains information about the actual and predicted classifications produced by a classification system.

                               Actual
                         Positive   Negative
         Predicted
           Positive         TP         FP
           Negative         FN         TN

      Precision = TP / (TP + FP)
      Recall    = TP / (TP + FN)

      We need to form a confusion matrix from the actual data and the propensity scores to check the predictive ability of the model (see the sketch below).
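
A sketch using the scored data from step 4; the 0.5 cutoff is an assumption, and in practice it would come from the classification table or ROC analysis:

      data work.classified;
         set work.scored;
         predicted = (propensity >= 0.5);  /* 1 = predicted positive */
      run;

      /* Rows: predicted class; columns: actual b_status */
      proc freq data=work.classified;
         tables predicted * b_status / norow nocol nopercent;
      run;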

   f. Lift Chart & Gains Chart -

      • Lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the model.
      • Cumulative gains and lift charts are visual aids for measuring model performance.
      • Both charts consist of a lift curve and a baseline.
      • The greater the area between the lift curve and the baseline, the better the model.

      We need to build the lift curve and gains chart to check how well the model predicts. The process for making a lift chart is as follows (a SAS sketch follows the list):

      • First arrange the members in descending order of their propensity scores.
      • Group the members into deciles and calculate the cumulative number of responders for each bucket.
      • Calculate the cumulative % of responders for each bucket.
      • The difference between the cumulative % for each bucket and the % predicted with no model is the lift for that bucket.
      • Plot the cumulative % of responders against the buckets (x-axis), together with a 45-degree baseline representing the no-model prediction.

      A good model should have more than 50% of the responders in the first 4 buckets.
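
A sketch of the decile preparation for the lift/gains chart, using the scored data from step 4:

      /* Bucket 0 holds the highest propensity scores */
      proc rank data=work.scored groups=10 descending out=work.deciled;
         var propensity;
         ranks bucket;
      run;

      /* Responders per bucket; the cumulative % can then be computed in a
         data step and plotted against the 45-degree baseline */
      proc means data=work.deciled noprint;
         class bucket;
         var b_status;
         output out=work.gains sum=responders;
      run;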

   g. K-S Chart - It captures the maximum difference between the cumulative % of 1s and the cumulative % of 0s across the score range.

After checking the above statistics (though all of them might not be applicable), we have an idea of the goodness of fit and the discriminative power of the model.

Now we need to validate the model on different samples.

7. Model Validation
   a. Data Splitting - We use bootstrapping as the data-splitting process. It is a method of internal validation which involves taking a large number of simple random samples with replacement from the original data and fitting the model on each of these samples.
      i. We can use proc surveyselect with the same sample size and different seed values to generate the samples (200-500) from the dataset, as sketched below.
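
A sketch using proc surveyselect; method=urs samples with replacement, and reps= draws all the bootstrap samples in a single call as an alternative to varying the seed per call (the seed and sample count are illustrative):

      proc surveyselect data=work.dev_clean out=work.boot
                        method=urs samprate=1 outhits
                        reps=200 seed=12345;
      run;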
   b. Building models for all the samples - Run proc logistic on each sample with the same set of variables used for the training dataset, and obtain the parameter estimates, model fit statistics and diagnostics (see the sketch below).
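
A sketch that fits the same model to every bootstrap sample; the Replicate variable is created by proc surveyselect above, and outest= collects one row of coefficients per sample:

      proc logistic data=work.boot outest=work.est noprint;
         by Replicate;
         model b_status(event='1') = x1 x2 x3;
      run;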
   c. Coefficient stability - Coefficient stability is checked across the development and validation samples. Once the model performs satisfactorily on the development sample, we use the same set of variables to model the validation samples. A robust model should perform equally well on the validation samples, so the coefficients should lie in a close range and carry the same sign. Also check whether the signs of the coefficients make business sense. Count how many times each variable deviates from the training estimate and make sure that the count is as small as possible; otherwise, remove the variable and try further iterations (the summary sketch below helps here).
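
The spread and sign of each coefficient across the bootstrap models can then be summarized from the outest dataset:

      proc means data=work.est min max mean std;
         var Intercept x1 x2 x3;
      run;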
   d. Comparison of training and validation models - Check the model diagnostics of the validation models (lift curve, confusion matrix, ROC curve, c-statistic, etc.) and compare them with those of the training model. The lift values of the training model and the validation models should be similar.
