2. Bivariate analysis
a. Bivariate analysis of the dependent variable (% of 1s) against the independent variables
b. Check the relationship of the independent variables with the dependent
variable
c. Do transformations wherever necessary
b. -2 Log L - The -2 Log L is used in hypothesis tests for nested models; the
value in itself is not meaningful
c. Test - These are three asymptotically equivalent Chi-Square tests (Likelihood
Ratio, Score and Wald). They test the null hypothesis that all of the predictors'
regression coefficients are equal to zero, against the alternative that at least
one of them is non-zero
d. Likelihood Ratio - The likelihood ratio statistic is two times the difference
between the log likelihood (model fit) functions of the two models, L full and L
reduced. L full is the model having all the explanatory variables and L reduced
is the model with all the explanatory variables except the explanatory variable
of interest. The log likelihood of the full model is always at least as high as
that of the reduced model (equivalently, the reduced model's -2 Log L is higher).
The likelihood ratio test is generally preferred to the Wald test because its
p-value is more reliable, particularly in small samples.
e. Score - This is the Score Chi-Square Test of the same null hypothesis that all
of the predictors' regression coefficients are equal to zero.
f. Wald - This is the Wald Chi-Square Test of the same null hypothesis that all
of the predictors' regression coefficients are equal to zero.
6. Model Diagnostics
c. Classification table - The classification table is used to find a cut-off
probability that jointly maximizes sensitivity and specificity.
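One common way to pick such a cut-off is to scan candidate probabilities and keep the one maximizing sensitivity + specificity (Youden's index). A small Python sketch with made-up scores (not from the document):

```python
# Hypothetical predicted probabilities and actual outcomes (1 = event).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
actual = [1,   1,   0,   1,   1,    0,   0,   1,   0,   0]

def sens_spec(cutoff, scores, actual):
    """Sensitivity and specificity when classifying score >= cutoff as positive."""
    tp = sum(1 for s, a in zip(scores, actual) if s >= cutoff and a == 1)
    fn = sum(1 for s, a in zip(scores, actual) if s <  cutoff and a == 1)
    tn = sum(1 for s, a in zip(scores, actual) if s <  cutoff and a == 0)
    fp = sum(1 for s, a in zip(scores, actual) if s >= cutoff and a == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Scan the observed scores as candidate cut-offs; keep the one with the
# largest sensitivity + specificity (Youden's J statistic).
best = max(set(scores), key=lambda c: sum(sens_spec(c, scores, actual)))
sens, spec = sens_spec(best, scores, actual)
print(f"cutoff={best}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```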
d. Goodness of Fit - The Hosmer and Lemeshow's (H-L) goodness of fit test
divides subjects into deciles based on predicted probabilities, then computes a
chi-square from observed and expected frequencies. Then a probability (p) value
is computed from the chi-square distribution to test the fit of the logistic model. If
the p-value of the H-L goodness-of-fit test is greater than .05, as we want for
well-fitting models, we fail to reject the null hypothesis that there is no
difference between observed and model-predicted values, implying that the
model's estimates fit the data at an acceptable level. That is, well-fitting
models show nonsignificance on the goodness-of-fit test, indicating model
predictions that are not significantly different from observed values.
Like other significance tests, it only tells us whether the model fits or not;
it does not tell us anything about the extent of the fit. Similarly, like other
significance tests, it is strongly influenced by the sample size. As the sample
size gets large, the H-L statistic can find smaller and smaller differences
between observed and model-predicted values to be significant.
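The H-L computation itself is simple enough to sketch. A hedged pure-Python illustration on simulated data (the probabilities and outcomes are invented, and by construction the "model" is well calibrated):

```python
import random

random.seed(0)

# Hypothetical scored data: predicted probabilities plus outcomes that are
# simulated from those same probabilities (a well-calibrated model).
n = 500  # chosen divisible by 10 so the deciles split evenly
pred = [random.random() for _ in range(n)]
obs = [1 if random.random() < p else 0 for p in pred]

# Hosmer-Lemeshow: sort by predicted probability, split into deciles, then
# compare observed vs expected event counts within each decile.
pairs = sorted(zip(pred, obs))
g = 10
size = n // g
hl = 0.0
for i in range(g):
    chunk = pairs[i * size:(i + 1) * size]
    e1 = sum(p for p, _ in chunk)   # expected events in this decile
    o1 = sum(y for _, y in chunk)   # observed events in this decile
    e0 = len(chunk) - e1            # expected non-events
    o0 = len(chunk) - o1            # observed non-events
    hl += (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0

print(f"H-L statistic = {hl:.2f} on {g - 2} df")
```

The statistic is referred to a chi-square distribution with g - 2 degrees of freedom to get the p-value discussed above.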
e. Confusion Matrix - The confusion matrix contains information about actual and
predicted classifications done by a classification system.
                          Actual
                    Positive   Negative
Predicted Positive     TP         FP
          Negative     FN         TN

Precision = TP / (TP + FP)
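From the four cells of the confusion matrix the usual diagnostics follow directly. A quick sketch with hypothetical counts:

```python
# Hypothetical counts read off a confusion matrix
# (rows = predicted, columns = actual).
TP, FP, FN, TN = 60, 15, 10, 115

precision = TP / (TP + FP)             # of predicted positives, how many are real
sensitivity = TP / (TP + FN)           # a.k.a. recall / true positive rate
specificity = TN / (TN + FP)           # true negative rate
accuracy = (TP + TN) / (TP + FP + FN + TN)

print(f"precision={precision:.2f}, sensitivity={sensitivity:.3f}, "
      f"specificity={specificity:.3f}, accuracy={accuracy:.3f}")
```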
We need to form a confusion matrix using the actual data and the propensity scores to
check the predictive ability of the model.
We need to make the lift curve and gains chart to check how well the model
predicts. The process for making a lift chart is as follows: sort the
observations in descending order of propensity score, divide them into ten
equal buckets (deciles), and compute the cumulative percentage of responders
captured in each bucket.
A good model should capture more than 50% of the responders in the first 4 buckets.
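The decile steps above can be sketched in Python on simulated scores (invented data; in practice the scored file would come from the SAS model):

```python
import random

random.seed(1)

# Hypothetical scored dataset: (propensity score, responded?), built so
# that higher scores really do respond more often.
n = 1000
scored = []
for _ in range(n):
    s = random.random()
    scored.append((s, 1 if random.random() < s else 0))

# Gains table: sort by score descending, cut into 10 equal buckets, and
# track the cumulative share of all responders captured.
scored.sort(reverse=True)
total_resp = sum(r for _, r in scored)
bucket = n // 10
cum = 0
gains = []
for i in range(10):
    cum += sum(r for _, r in scored[i * bucket:(i + 1) * bucket])
    gains.append(cum / total_resp)

print([round(g, 2) for g in gains])
```

With an informative score, the cumulative gains in the first four buckets comfortably exceed the 50% rule of thumb above.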
After checking the above statistics (though all of them might not be applicable), we have an
idea of the goodness of fit and the discriminative power of the model.
7. Model Validation
a. Data Splitting - We would be using bootstrapping as the process for data splitting.
It is a method of internal validation which involves taking a large number of
simple random samples with replacement from the original dataset and fitting the
model on each of these samples.
i. We can use proc surveyselect with the same sample size and different
seed values to generate different samples (200-500) from the dataset.
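A language-neutral sketch of that resampling step (in SAS it is the proc surveyselect call described above with varying seed= values; the Python names here are invented):

```python
import random

random.seed(42)

original = list(range(100))   # stand-in for the modelling dataset
n_samples = 200               # lower end of the 200-500 samples suggested

# Each bootstrap sample is the same size as the original and is drawn
# with replacement, so records can repeat within a sample.
bootstrap_samples = [
    [random.choice(original) for _ in original]
    for _ in range(n_samples)
]

print(len(bootstrap_samples), len(bootstrap_samples[0]))
```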
b. Building models for all the samples – Run proc logistic on each sample with
the same set of variables used for the training dataset, and obtain the
parameter estimates, model fit statistics and diagnostics.
c. Co-efficient stability – Coefficient stability is checked across the
development and validation samples. Once the model performs satisfactorily on
the development sample, we use the same set of variables to model the
validation sample. A robust model should perform equally well on the validation
sample too. Hence, the coefficients should lie in a close range and have the
same sign. Do check whether the signs of the coefficients make business sense.
Check the number of times each variable's coefficient deviates from the
training estimate and make sure that the count is as small as possible.
Otherwise, we would have to remove the variable and try different iterations.
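The sign-stability count can be made concrete with a small sketch (the coefficient values below are invented):

```python
# Hypothetical coefficient estimates for one variable across bootstrap
# samples, compared against the training-model estimate.
train_coef = 0.42
boot_coefs = [0.40, 0.45, 0.38, 0.51, 0.43, -0.02, 0.39, 0.47, 0.41, 0.44]

# Count samples where the coefficient flips sign (or vanishes) relative
# to the training estimate; a stable variable should flip rarely.
sign_flips = sum(1 for b in boot_coefs if b * train_coef <= 0)
deviation_rate = sign_flips / len(boot_coefs)

print(f"sign flips: {sign_flips}/{len(boot_coefs)} ({deviation_rate:.0%})")
```

A variable that flips sign in a large share of samples would be a candidate for removal, as the text describes.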
d. Comparison of training model and validation models - Check the model
diagnostics of the validation models (lift curve, confusion matrix, ROC curve,
c-statistic etc.) and compare them with those of the training dataset. The lift
values of the training model and the validation models should be similar, i.e.
the differences between them should be small.
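Of the diagnostics listed, the c-statistic is straightforward to compute on both datasets for comparison: it is the proportion of (event, non-event) pairs in which the event receives the higher predicted probability. A pure-Python sketch with toy numbers:

```python
def c_statistic(scores, actual):
    """Share of (event, non-event) pairs ranked correctly; ties count half."""
    events = [s for s, a in zip(scores, actual) if a == 1]
    non_events = [s for s, a in zip(scores, actual) if a == 0]
    concordant = 0.0
    for e in events:
        for ne in non_events:
            if e > ne:
                concordant += 1.0
            elif e == ne:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))

# Toy scored data; in practice this would be run on both the training and
# the validation datasets and the two values compared.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
actual = [1,   1,   0,   1,   0,   0,   1,   0]
c = c_statistic(scores, actual)
print(f"c-statistic = {c}")
```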