When the dependent variable has only two groups, logistic regression may be
preferred for two reasons:
• Even if the assumptions of discriminant analysis are met, some researchers prefer
logistic regression because it is similar to multiple regression. It has
straightforward statistical tests, similar approaches to incorporating metric and
nonmetric variables and nonlinear effects, and a wide range of diagnostics.
DECISION PROCESS FOR LOGISTIC REGRESSION
Explanation
Classification
Two Research Objectives of Logistic Regression
Explanation
• Identifying the independent variables that impact group membership in the
dependent variable.
• Estimating the importance of each independent variable in explaining group
membership.
Classification
• Establishing a classification system based on the logistic model for determining
group membership.
• Performance measured by predictive accuracy of group membership, with
possibly associated costs of misclassification.
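The classification objective can be illustrated with a minimal sketch. It assumes predicted probabilities are already available from a fitted logistic model; the data, the 0.5 cutoff, and the cost values are hypothetical and chosen only for illustration.

```python
def classify(probabilities, cutoff=0.5):
    """Assign group 1 when the predicted probability reaches the cutoff, else group 0."""
    return [1 if p >= cutoff else 0 for p in probabilities]


def hit_ratio(actual, predicted):
    """Share of observations whose predicted group matches the actual group."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)


def misclassification_cost(actual, predicted, cost_false_pos=1.0, cost_false_neg=1.0):
    """Total cost when the two kinds of error are not equally expensive."""
    cost = 0.0
    for a, p in zip(actual, predicted):
        if a == 0 and p == 1:
            cost += cost_false_pos
        elif a == 1 and p == 0:
            cost += cost_false_neg
    return cost


# Hypothetical group memberships and predicted probabilities (not from the text).
actual = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.1]

predicted = classify(probs)
accuracy = hit_ratio(actual, predicted)  # 6 of 8 correct -> 0.75
# Assumed costs: a false negative is five times as costly as a false positive.
total_cost = misclassification_cost(actual, predicted, 1.0, 5.0)  # 1*1.0 + 1*5.0 = 6.0
```

Changing the cutoff trades one kind of error for the other, which is why the cost of misclassification matters when choosing it.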
STAGE 2: RESEARCH DESIGN FOR LOGISTIC
REGRESSION
Representation of the Binary Dependent Variable
Sample Size
Use of Aggregated Data
Representation of the Binary Dependent Variable
• An inherent assumption that should be addressed with the Box–Tidwell test is the
linearity of the relationship between the independent variables, especially
continuous variables, and the logit of the outcome.
STAGE 4: ESTIMATION OF THE LOGISTIC
REGRESSION MODEL AND ASSESSING OVERALL FIT
Pseudo R² Measures:
• Interpreted in a manner similar to the coefficient of determination in multiple
regression.
• Different pseudo R² measures vary widely in terms of magnitude and no one
version has been deemed most preferred.
• For all of the pseudo R² measures, however, the values tend to be much lower
than for multiple regression models.
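To show how different pseudo R² measures can disagree in magnitude, the sketch below computes three commonly used versions (McFadden, Cox and Snell, Nagelkerke) from the log-likelihoods of a null model and a fitted model. The text does not name specific measures, so this selection is an assumption, and the outcomes and probabilities are hypothetical stand-ins for a fitted model's output.

```python
import math


def log_likelihood(y, p):
    """Log-likelihood of binary outcomes y given predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi) for yi, pi in zip(y, p))


def pseudo_r2(y, p):
    """Return three pseudo R-squared measures for predicted probabilities p."""
    n = len(y)
    base_rate = sum(y) / n
    # Null model: every observation gets the overall proportion of 1s.
    ll_null = log_likelihood(y, [base_rate] * n)
    ll_model = log_likelihood(y, p)
    mcfadden = 1.0 - ll_model / ll_null
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    # Nagelkerke rescales Cox and Snell so that its maximum is 1.
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll_null / n))
    return {"mcfadden": mcfadden, "cox_snell": cox_snell, "nagelkerke": nagelkerke}


# Hypothetical outcomes and predicted probabilities (not from the text).
y = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.1]
r2 = pseudo_r2(y, probs)
for name, value in r2.items():
    print(f"{name}: {value:.3f}")
```

The three values differ for the same model, which is consistent with the point that no single version is preferred.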
Issues in Model Estimation . . .
Most Problematic Issues:
• Small sample sizes
  • Difficult to accurately estimate coefficients and standard errors.
• Complete Separation
  • Dependent variable is perfectly predicted by an independent variable.
  • Problem is that logit values are not defined for probabilities of one and
    zero, thus no values are available for estimation.
• Quasi-Complete Separation (Zero Cells Effect)
  • Most frequently encountered.
  • One or more of the groups defined by the
ROC curve
• Graphical portrayal of the trade-off of sensitivity and specificity values across the
entire range of cut-off values.
• A larger AUC (area under the curve) indicates better fit.
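As a rough sketch of how the ROC curve is traced, the code below computes sensitivity and specificity at each cutoff and integrates the resulting curve with the trapezoid rule to obtain the AUC. The data are hypothetical, and the function names are introduced here for illustration only.

```python
def sensitivity_specificity(actual, probs, cutoff):
    """Sensitivity (true positive rate) and specificity (true negative rate) at one cutoff."""
    tp = sum(1 for a, p in zip(actual, probs) if a == 1 and p >= cutoff)
    fn = sum(1 for a, p in zip(actual, probs) if a == 1 and p < cutoff)
    tn = sum(1 for a, p in zip(actual, probs) if a == 0 and p < cutoff)
    fp = sum(1 for a, p in zip(actual, probs) if a == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(actual, probs):
    """ROC points as (1 - specificity, sensitivity), from the strictest cutoff down."""
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(probs), reverse=True):
        sens, spec = sensitivity_specificity(actual, probs, cutoff)
        points.append((1.0 - spec, sens))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical group memberships and predicted probabilities (not from the text).
actual = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.1]

points = roc_points(actual, probs)
auc = area_under_curve(points)  # 0.875 for this data
```

An AUC of 0.5 corresponds to chance-level separation of the two groups and 1.0 to perfect separation.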
• Residuals
Both residuals (Pearson and deviance) reflect standardized differences
between the predicted probabilities and the observed outcome values (0 or 1).
Absolute values above 2 merit further attention.
• Influential Observations
Influence measures reflect impact on model fit and estimated coefficients if
an observation is deleted from the analysis.
Comparable to those measures found in multiple regression.
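The Pearson and deviance residuals described above can be computed directly from the observed outcomes and predicted probabilities. The sketch below uses their standard definitions for ungrouped binary data and flags observations whose residuals exceed 2 in absolute value; the data are hypothetical.

```python
import math


def pearson_residual(y, p):
    """(observed - predicted) divided by the binomial standard deviation."""
    return (y - p) / math.sqrt(p * (1.0 - p))


def deviance_residual(y, p):
    """Signed square root of the observation's contribution to the deviance."""
    contribution = -2.0 * (math.log(p) if y == 1 else math.log(1.0 - p))
    sign = 1.0 if y == 1 else -1.0
    return sign * math.sqrt(contribution)


# Hypothetical outcomes and predicted probabilities (not from the text).
outcomes = [1, 0, 1, 0]
predicted = [0.1, 0.2, 0.7, 0.95]

flagged = []
for i, (y, p) in enumerate(zip(outcomes, predicted)):
    r_pearson = pearson_residual(y, p)
    r_deviance = deviance_residual(y, p)
    # Flag the observation if either residual exceeds 2 in absolute value.
    if abs(r_pearson) > 2.0 or abs(r_deviance) > 2.0:
        flagged.append(i)
    print(f"obs {i}: Pearson {r_pearson:.3f}, deviance {r_deviance:.3f}")
```

Here the first and last observations are flagged: each belongs to the group the model considered very unlikely.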
STAGE 5: INTERPRETATION OF THE RESULTS
Wald Statistic
• Measure of statistical significance for each estimated coefficient so that
hypothesis testing can occur just as it does in multiple regression.
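As a minimal sketch, the Wald statistic is commonly computed as the squared ratio of a coefficient to its standard error and compared with a chi-square distribution with one degree of freedom. The coefficient and standard error below are hypothetical, since the text reports no standard errors.

```python
import math


def wald_test(coefficient, standard_error):
    """Return the Wald statistic and its p-value (chi-square, 1 degree of freedom)."""
    wald = (coefficient / standard_error) ** 2
    # For 1 degree of freedom, P(chi-square > w) equals erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value


# Hypothetical estimate and standard error (not from the text).
wald, p_value = wald_test(1.2, 0.4)
print(f"Wald = {wald:.2f}, p = {p_value:.4f}")
```

A small p-value leads to rejecting the hypothesis that the coefficient is zero, in the same way a t test is used in multiple regression.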
Interpreting the Coefficients
Direction
• Best assessed in the original coefficients (positive or negative signs).
• Indirectly assessed in the exponentiated coefficients (values less than 1 indicate
a negative relationship, values greater than 1 a positive relationship).
Magnitude
• Best assessed by the exponentiated coefficient, with the percentage change in the
odds of the dependent variable shown by:
Percentage change in odds = (Exponentiated coefficient − 1.0) × 100
• Note that the impact is multiplicative, so an exponentiated coefficient of 1.0
denotes no change (1.0 times the odds = no change).
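The percentage-change formula follows from the form of the logistic model. As a short worked derivation, with b the logistic coefficient of an independent variable x and b_0 the intercept (this notation is assumed here for illustration):

```latex
\[
\frac{\text{odds}(x+1)}{\text{odds}(x)}
  = \frac{e^{\,b_0 + b(x+1)}}{e^{\,b_0 + bx}}
  = e^{\,b},
\qquad
\text{Percentage change in odds} = \left(e^{\,b} - 1\right) \times 100 .
\]
```

When b = 0 the exponentiated coefficient is 1 and the percentage change is zero; a negative b gives an exponentiated coefficient below 1 and a negative percentage change.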
Other Issues in Interpreting Coefficients . . .
Involves ensuring both the internal and external validity of the results.
Cross-validation . . .
• Typically achieved with a jackknife or “leave-one-out” process of calculating the
hit ratio.
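A leave-one-out hit ratio can be sketched as follows: each observation is held out in turn, the model is refit on the rest, and the held-out case is classified. To keep the example self-contained it fits a one-predictor logistic model by plain gradient ascent, which is a stand-in assumed here for the maximum likelihood routine of a statistics package; the data and function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, steps=2000, rate=0.1):
    """Estimate intercept b0 and slope b1 by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            error = y - sigmoid(b0 + b1 * x)
            g0 += error
            g1 += error * x
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return b0, b1


def leave_one_out_hit_ratio(xs, ys, cutoff=0.5):
    """Refit without each observation in turn, then classify the held-out case."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        b0, b1 = fit_logistic(train_x, train_y)
        predicted = 1 if sigmoid(b0 + b1 * xs[i]) >= cutoff else 0
        hits += int(predicted == ys[i])
    return hits / len(xs)


# Hypothetical predictor values and group memberships (not from the text).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
loo_hit_ratio = leave_one_out_hit_ratio(xs, ys)
print(f"Leave-one-out hit ratio: {loo_hit_ratio:.3f}")
```

Because every prediction is made for a case the model did not see, this hit ratio is usually a more conservative estimate than the one computed on the estimation sample.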
AN ILLUSTRATIVE EXAMPLE OF LOGISTIC
REGRESSION
Research Objectives, Research Design and Assumptions
Objective:
• Identical to two-group discriminant analysis.
• Assess differences in perceptions between customers located in, and served by,
the US-based salesforce and those outside the United States.
Variables In The Analysis
• Dependent variable – Region (X4)
• Independent variables – HBAT perceptions (X6 to X18)
Sample
• Division into estimation (60 observations) and holdout (40 observations) subsamples.
• Adequate sample sizes for both outcomes (36 and 24).
Assumptions
• Minimal assumptions (independence of observations) met.
Descriptive Measures of Group Statistics
Final Model:
• Two significant variables (X13 and X17).
• Overall model statistically significant.
• Substantive pseudo R² values (ranging from .505 to .677).
• Overall Predictive Accuracy:
  • Over 80% for the overall and group-specific values in the estimation sample.
  • Only USA/North America lower (69.2%) for the holdout sample.
Additional Measures of Predictive Accuracy
Directionality
• Both variables have a positive relationship with the outcome (logistic coefficients
of 1.079 and 1.844).
Magnitude
• Exponentiated coefficients identify X17 as markedly more impactful (6.321)
than X13 (2.942).
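The link between the logistic coefficients and the exponentiated coefficients reported above can be checked with a few lines of arithmetic. The sketch below exponentiates the two coefficients and applies the percentage-change formula; small differences from the reported values are expected because the coefficients are rounded to three decimals.

```python
import math

# Logistic coefficients reported for the final model.
coefficients = {"X13": 1.079, "X17": 1.844}

odds_ratios = {}
percent_change = {}
for name, b in coefficients.items():
    odds_ratios[name] = math.exp(b)  # exponentiated coefficient
    percent_change[name] = (odds_ratios[name] - 1.0) * 100.0  # change in odds per unit
    print(f"{name}: exp(b) = {odds_ratios[name]:.2f}, "
          f"change in odds = {percent_change[name]:.0f}%")
```

Under these figures, a one-unit increase in X13 roughly triples the odds, while a one-unit increase in X17 multiplies them by more than six.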
Testing For Nonlinear Effects of the Independent Variables
Results:
• Two of the independent variables exhibit potential nonlinear effects.
• Since neither of these variables was included in the stepwise solution, their
nonlinear effects (interaction terms) were added to the model.
• Only X16 demonstrated potential for model improvement.
STAGE 6: VALIDATION OF THE RESULTS