
Chapter 8

Logistic Regression: Regression with a Binary Dependent Variable
Learning Objectives

Upon completing this chapter, you should be able to do the following:


• State the circumstances under which logistic regression should be used instead of
multiple regression or discriminant analysis.
• Identify the types of variables used for dependent and independent variables in
the application of logistic regression.
• Describe the method used to transform binary measures into the likelihood and
probability measures used in logistic regression.
• Interpret the results of a logistic regression analysis and assess predictive
accuracy, with comparisons to both multiple regression and discriminant analysis.
• Understand the strengths and weaknesses of logistic regression compared to
discriminant analysis and multiple regression.
Overview

What Is Logistic Regression?


The Decision Process for Logistic Regression
• Stage 1: Objectives of Logistic Regression
• Stage 2: Research Design for Logistic Regression
• Stage 3: Assumptions of Logistic Regression
• Stage 4: Estimation of the Logistic Regression Model and Assessing Overall Fit
• Stage 5: Interpretation of the Results
• Stage 6: Validation of the Results
An Illustrative Example of Logistic Regression
WHAT IS LOGISTIC REGRESSION?
Logistic Regression

A specialized form of regression that is designed to predict and explain a
binary (two-group) categorical variable rather than a metric dependent
measure.

Unique Characteristics of Logistic Regression

• Its variate is similar to the variate of regular regression and is made up of
metric independent variables.

• It is less affected than discriminant analysis when the basic assumptions,
particularly normality of the independent variables, are not met.
When Logistic Regression is Preferred

When the dependent variable has only two groups, logistic regression may be
preferred for two reasons:

• Discriminant analysis assumes multivariate normality and equal variance–covariance
matrices across groups, and these assumptions are often not met.
Logistic regression does not face these strict assumptions and is much more
robust when these assumptions are not met, making its application appropriate in
many situations.

• Even if the assumptions are met, some researchers prefer logistic regression
because it is similar to multiple regression. It has straightforward statistical tests,
similar approaches to incorporating metric and nonmetric variables and nonlinear
effects, and a wide range of diagnostics.
DECISION PROCESS FOR LOGISTIC REGRESSION

Stage 1: Objectives of Logistic Regression


Stage 2: Research Design for Logistic Regression
Stage 3: Assumptions of Logistic Regression
Stage 4: Estimation of the Logistic Regression Model and
Assessing Overall Fit
Stage 5: Interpretation of the Results
Stage 6: Validation of the Results
STAGE 1: OBJECTIVES OF LOGISTIC REGRESSION

 Explanation
 Classification
Two Research Objectives of Logistic Regression

Explanation
• Identifying the independent variables that impact group membership in the
dependent variable.
• Estimating the importance of each independent variable in explaining group
membership.

Classification
• Establishing a classification system based on the logistic model for determining
group membership.
• Performance measured by predictive accuracy of group membership, with
possibly associated costs of misclassification.
STAGE 2: RESEARCH DESIGN FOR LOGISTIC
REGRESSION
 Representation of the Binary Dependent Variable
 Sample Size
 Use of Aggregated Data
Representation of the Binary Dependent Variable

Dependent variable is a binary measure:


• Represents discrete events (e.g., purchase versus non-purchase) or objects in only
two classes (e.g., males versus females).
Use of the Logistic Curve:
• Since the dependent variable can take on only two values (1 or 0), a nonlinear
relationship form is needed so that the predicted values never fall below 0 or
above 1.
• This is not possible with multiple regression, so the logistic curve is used instead.
• A binary outcome also violates the assumptions of multiple regression, again
requiring a different model form.
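The bounding behavior of the logistic curve can be sketched in a few lines of Python (an illustration added here, not part of the original slides):

```python
import math

def logistic(z):
    """Logistic (S-shaped) function: maps any real-valued predictor
    score z into a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# A linear predictor can take any real value, but the logistic curve
# keeps every predicted probability inside the 0-1 range.
predictions = [logistic(z) for z in (-10, -2, 0, 2, 10)]
```

A straight-line (multiple regression) prediction would instead run below 0 or above 1 for extreme predictor values.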
Sample Size

Overall sample size


• Should be 400 to achieve best results with maximum likelihood estimation.
• With smaller sample sizes, the method may be less efficient in model estimation.
Greater focus is on the size of each outcome group, each of which should have 10
times the number of estimated model coefficients:
• Particularly problematic are situations where the actual frequency of one outcome
group is small (i.e., below 20). The actual size of the small group is more
problematic than a low rate of occurrence.
• Several approaches have been proposed for addressing what may be termed
“rare event” situations.
Sample size requirements should be met in both the analysis and the
holdout samples.
Use of Aggregated Data

While normally associated with data at the individual level, logistic
regression can analyze aggregated data if all the independent variables
are nonmetric.

The dataset has one observation for each combination of the nonmetric
independent variables:
• Example: Three binary variables result in 8 combinations (2 x 2 x 2). For each
combination, the number of observations falling in that combination is recorded.
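The aggregation step above can be sketched as follows (the individual-level records are hypothetical, invented for illustration):

```python
from itertools import product
from collections import Counter

# Hypothetical individual-level records on three binary predictors.
records = [(0, 1, 1), (0, 1, 1), (1, 0, 0), (1, 1, 1), (0, 0, 0)]

# All 2 x 2 x 2 = 8 possible combinations of the nonmetric variables.
combos = list(product([0, 1], repeat=3))

# Aggregated dataset: one row per combination with its observed count.
counts = Counter(records)
aggregated = {c: counts.get(c, 0) for c in combos}
```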
STAGE 3: ASSUMPTIONS OF LOGISTIC REGRESSION
The advantages of logistic regression are primarily the result of the general
lack of assumptions (i.e., normality and homoscedasticity)

• The primary assumption is the independence of observations, which, if violated,
requires some form of hierarchical/nested model approach.

• An inherent assumption that should be addressed with the Box–Tidwell test is the
linearity of the independent variables, especially continuous variables, with the
outcome.
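The Box–Tidwell procedure tests linearity in the logit by adding an auxiliary x·ln(x) term for each continuous predictor and checking its significance after refitting. A minimal sketch of the term construction (the significance test itself would come from the refitted model):

```python
import math

def box_tidwell_term(x):
    """Auxiliary Box-Tidwell term x * ln(x) for a positive continuous
    predictor; a significant coefficient on this added term signals
    nonlinearity of the predictor in the logit."""
    if x <= 0:
        raise ValueError("Box-Tidwell terms require x > 0")
    return x * math.log(x)
```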
STAGE 4: ESTIMATION OF THE LOGISTIC
REGRESSION MODEL AND ASSESSING OVERALL FIT

 Estimating the Logistic Regression Model
 Assessing the Goodness-of-Fit of the Estimated Model
 Measures of Predictive Accuracy
 Casewise Diagnostics
Fitting the Logistic Curve to Sample Data
Uses the specific form of the logistic curve, which is S-shaped, to stay within
the range of 0 to 1.
To estimate a logistic regression model, this curve of predicted values is
fitted to the actual data, just as was done with a linear relationship in
multiple regression.
Estimating the Logistic Regression Model

Two basic steps:


• Transforming a probability into odds and logit values.
• Model estimation using a maximum likelihood approach, not least squares as in
regular multiple regression.
The logistic transformation has two basic steps:
• Restating a probability as odds, then
• Calculating the logit values.

Unlike probabilities, which are bounded by zero and one, logit values are
unbounded, ranging from minus infinity to plus infinity.
Model Estimation, Fit, and Between-Model Comparisons

Maximum Likelihood Estimation:


• Maximizes the likelihood that an event will occur – the event being a respondent is
assigned to one group versus another.
• The basic measure of how well the maximum likelihood estimation procedure fits
is the likelihood value.

Comparisons of the likelihood values follow three steps:


• Estimate a Null Model – which acts as the “baseline” for making comparisons of
improvement in model fit.
• Estimate Proposed Model – the model containing the independent variables to be
included in the logistic regression.
• Assess the −2LL difference between the two models.
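The three steps can be sketched with a small worked example (the outcome values and predicted probabilities below are invented for illustration):

```python
import math

def neg2LL(y, p):
    """-2 log-likelihood for binary outcomes y (0/1) and predicted
    probabilities p; lower values indicate better fit."""
    ll = sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
             for yi, pi in zip(y, p))
    return -2.0 * ll

y = [1, 1, 1, 0, 0]
# Null (baseline) model: predicts the base rate (3/5) for everyone.
null_p = [0.6] * 5
# Hypothetical proposed model: separates the two groups better.
model_p = [0.9, 0.8, 0.7, 0.2, 0.3]

# Improvement in fit = the -2LL difference between the two models.
chi_sq = neg2LL(y, null_p) - neg2LL(y, model_p)
```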
Measures of Model Fit

Global Null Hypothesis Test:


• Test for the significance of any of the estimated coefficients.

Pseudo R2 Measures:
• Interpreted in a manner similar to the coefficient of determination in multiple
regression.
• Different pseudo R2 measures vary widely in terms of magnitude and no one
version has been deemed most preferred.
• For all of the pseudo R2 measures, however, the values tend to be much lower
than for multiple regression models.
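Three widely used pseudo R2 measures can be computed directly from the null and proposed model log-likelihoods (formulas for McFadden, Cox–Snell, and Nagelkerke versions; the example values in the test are invented):

```python
import math

def pseudo_r2(ll_null, ll_model, n):
    """Three common pseudo R2 measures from the log-likelihoods of the
    null and proposed models and sample size n. Interpreted loosely
    like R2 in multiple regression, but typically much lower."""
    mcfadden = 1.0 - ll_model / ll_null
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    # Nagelkerke rescales Cox-Snell so its maximum is 1.0.
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll_null / n))
    return mcfadden, cox_snell, nagelkerke
```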
Issues in Model Estimation . . .
Most Problematic Issues:
• Small sample sizes
 Difficult to accurately estimate coefficients and standard errors.
• Complete Separation
 Dependent variable is perfectly predicted by an independent variable.
 The problem is that probabilities of one and zero are not defined for logit
values, thus no values are available for estimation.
• Quasi-Complete Separation (Zero Cells Effect)
 Most frequently encountered.
 One or more of the groups defined by the nonmetric independent variable have
counts of zero.
 Either use specialized methods or collapse categories.
Measuring Predictive Accuracy

Predictive accuracy – ability to classify observations into correct outcome group.


• All predictive accuracy measures are based on the cut-off value selected for
classification.
• The final cut-off value selected should be based on comparison of predictive accuracy
measures across cut-off values. While .5 is generally the default cut-off, other values
may substantially improve predictive accuracy.
Generates Four Outcomes:
• True Negatives
• True Positives
• False Negatives
• False Positives
Overall and Outcome-Specific Measures

Overall Measures of Predictive Accuracy:


• Accuracy (aka, Hit Ratio): predictive accuracy of positives and negatives combined.
• Youden index – combination of the True Positive rate (Sensitivity) and the True
Negative rate (Specificity) minus 1.

Predictive Accuracy of Actual Outcomes


• Sensitivity: true positive rate – percentage of positive outcomes correctly predicted.
• Specificity: true negative rate – percentage of negative outcomes correctly predicted.

Predictive Accuracy of Predicted Outcomes


• PPV (positive predictive value): percentage of positive predictions that are correct.
• NPV (negative predictive value): percentage of negative predictions that are correct.
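All of these measures derive from the four outcome counts at a chosen cut-off. A compact sketch (the outcome and probability values used in the test are invented; denominators are assumed nonzero for this illustration):

```python
def classification_measures(y, p, cutoff=0.5):
    """Classify at a chosen cut-off, then derive the predictive
    accuracy measures from the four outcome counts."""
    tp = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi >= cutoff)
    tn = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi < cutoff)
    fp = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi >= cutoff)
    fn = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi < cutoff)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "accuracy": (tp + tn) / len(y),            # hit ratio
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),                     # positive predictive value
        "npv": tn / (tn + fn),                     # negative predictive value
        "youden": sensitivity + specificity - 1.0,
    }
```

Re-running this function across a range of cut-off values is how the best cut-off is chosen in practice.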
Other Measures of Predictive Accuracy . . .

ROC curve
• Graphical portrayal of the trade-off of sensitivity and specificity values across the
entire range of cut-off values.
• A larger AUC (area under the curve) indicates better fit.
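The AUC can be computed without drawing the curve, via its rank-based (Mann–Whitney) interpretation: the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. A sketch with invented values:

```python
def auc(y, p):
    """Area under the ROC curve via the Mann-Whitney formulation.
    Ties between a positive and a negative case count as half a win."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))
```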

Hosmer and Lemeshow


• The only statistical test of predictive accuracy.
• Non-significance indicates a well-fitting model.
Casewise Diagnostics

Two Types of Casewise Diagnostics Similar to Multiple Regression

• Residuals
 Both residuals (Pearson and deviance) reflect standardized differences
between predicted probabilities and the outcome values (0 and 1). Values above
±2 merit further attention.

• Influential Observations
 Influence measures reflect the impact on model fit and estimated coefficients if
an observation is deleted from the analysis.
 Comparable to the measures found in multiple regression.
STAGE 5: INTERPRETATION OF THE RESULTS

 Testing for Significance of the Coefficients
 Interpreting the Coefficients
Testing for Significance of the Coefficients

Similar to regression: testing the coefficient value to determine whether it
differs from zero.
• However, in logistic regression a zero coefficient corresponds to odds of 1.0 –
no change.

Wald Statistic
• Measure of statistical significance for each estimated coefficient so that
hypothesis testing can occur just as it does in multiple regression.
Interpreting the Coefficients

Two forms of coefficients for each independent variable


• Original logistic coefficients
 The estimated parameter from the logistic model that reflects the change in the
logged odds value (logit) for a one-unit change in the independent variable.
 Similar to a regression weight or discriminant coefficient.

• Exponentiated logistic coefficient
 The antilog of the logistic coefficient, used for interpretation purposes in
logistic regression.
 The exponentiated coefficient minus 1.0 equals the percentage change in the odds.
 For example, an exponentiated coefficient of .20 represents a negative 80 percent
change in the odds (.20 - 1.0 = -.80) for each unit change in the independent
variable (the same as if the odds were multiplied by .20).
 Thus, a value of 1.0 equates to no change in the odds, and values above 1.0
represent increases in the predicted odds.
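The exponentiation and percentage-change arithmetic can be verified in one line of Python (a generic sketch, not tied to the chapter's estimates):

```python
import math

def pct_change_in_odds(b):
    """Percentage change in the odds for a one-unit increase in the
    predictor, given the original logistic coefficient b:
    (exp(b) - 1.0) * 100."""
    return (math.exp(b) - 1.0) * 100.0

# b = 0 gives exp(b) = 1.0: the odds are unchanged.
# A negative b shrinks the odds; a positive b inflates them.
```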
Assessing Directionality Versus Magnitude of the Relationship

Direction
• Best assessed in the original coefficients (positive or negative signs).
• Indirectly assessed in the exponentiated coefficients (less than 1 are negative,
greater than 1 are positive).

Magnitude
• Best assessed by the exponentiated coefficient, with the percentage change in the
odds shown by:
Percentage change in odds = (Exponentiated coefficient - 1.0) x 100
• Note that the impact is multiplicative and a coefficient of 1.0 denotes no change
(the odds are multiplied by 1.0, leaving them unchanged).
Other Issues in Interpreting Coefficients . . .

Evaluating a Coefficient for a Binary Variable


• Comparable to other linear models, the coefficient represents the effect relative to
the reference category.
• Exponentiated coefficient represents the relative level of the dependent variable
for the represented group versus the omitted group.

Measures of Variable Importance


• Relative Importance
 Specific measures can be calculated where the weights across all variables sum
to the model R2.
• Weight of Evidence/Information Value
 Indicates how well each category of the variable distinguishes among the
outcomes. The weight of evidence values of the categories are combined to provide
an overall measure of discrimination for an independent variable, termed the
information value.
STAGE 6: VALIDATION OF THE RESULTS
Validating the Logistic Regression Model

Involves ensuring both the internal and external validity of the results.

Holdout versus Estimation Samples . . .


• The most common form of assessing external validity is the creation of a holdout
or validation sample and calculation of its hit ratio.

Cross-validation . . .
• Typically achieved with a jackknife or “leave-one-out” process of calculating the
hit ratio.
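The leave-one-out procedure generalizes to any estimation routine. A sketch where `fit` and `classify` are assumed callables standing in for a logistic regression fit and its classification rule (the test uses a trivial majority-class stand-in):

```python
def jackknife_hit_ratio(data, fit, classify):
    """Leave-one-out cross-validation: refit the model n times, each
    time holding out one observation, classify the held-out case, and
    return the proportion classified correctly (the hit ratio)."""
    hits = 0
    for i, (x_i, y_i) in enumerate(data):
        train = data[:i] + data[i + 1:]   # all observations except i
        model = fit(train)
        hits += int(classify(model, x_i) == y_i)
    return hits / len(data)
```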
AN ILLUSTRATIVE EXAMPLE OF LOGISTIC
REGRESSION
Research Objectives, Research Design and Assumptions

Objective:
• Identical to two-group discriminant analysis.
• Assess differences in perceptions between customers located in and served by the
US-based salesforce versus those outside the United States.
Variables In The Analysis
• Dependent variable – Region (X4)
• Independent variables – HBAT perceptions (X6 to X18)
Sample
• Division into estimation (60 observations) and holdout (40 observations) subsamples.
• Adequate sample sizes for both outcomes (36 and 24).
Assumptions
• Minimal assumptions (independence of observations) met.
Descriptive Measures of Group Statistics

Five of the independent variables show significant differences between the
groups . . .
Stepwise Model Estimation Results

Final Model:
• Two significant variables (X13 and
X17).
• Overall model statistically
significant.
• Substantive Pseudo-R2 values
(ranging from .505 to .677).
• Overall Predictive Accuracy:
• Over 80% for overall and group-specific values in the estimation sample.
• Only USA/North America is lower (69.2%) in the holdout sample.
Additional Measures of Predictive Accuracy

Cut-off values in the range of .45 to .55 give the highest levels of
predictive accuracy.
Measures of Overall Predictive Accuracy
ROC Curve
Hosmer and Lemeshow Test and
Concordance Measure
Stage 5: Interpreting The Logistic Coefficients

Variables In The Model


• Two significant variables (X13 and X17).
• Comparable to discriminant results – three significant variables (X13, X17 and X11).

Directionality
• Both variables have a positive relationship to the outcome (logistic coefficients
of 1.079 and 1.844, both positive).

Magnitude
• Exponentiated coefficients identify X17 as markedly more impactful (6.321)
than X13 (2.942).
Testing For Nonlinear Effects of the Independent Variables

Testing For Nonlinear Effects:


• Assess the significance test of the interaction term, and
• test the incremental model fit from adding the interaction term.

Results:
• Two of the independent variables exhibit potential nonlinear effects.
• Since neither of these variables were included in the stepwise solution, their
nonlinear effects (interaction term) were added to the model.
• Only X16 demonstrated a potential for increased model improvement.
Stage 6: Validation of the Results

All measures of predictive accuracy demonstrated strong model performance for
the holdout sample.
Learning Checkpoints

1. When is logistic regression the appropriate model for modeling non-metric
outcomes?
2. In what ways is logistic regression comparable to multiple
regression? How does it differ?
3. What are the ways in which model fit is assessed in logistic
regression?
4. Why are there two forms of logistic coefficients (original and
exponentiated)?
