Professional Documents
Culture Documents
26
MULTINOMIAL
LOGISTIC REGRESSION
C AROLYN J. A NDERSON
L ESLIE RUTKOWSKI
C
hapter 24 presented logistic regression Many of the concepts used in binary logistic
models for dichotomous response vari- regression, such as the interpretation of parame-
ables; however, many discrete response ters in terms of odds ratios and modeling prob-
variables have three or more categories (e.g., abilities, carry over to multicategory logistic
political view, candidate voted for in an elec- regression models; however, two major modifica-
tion, preferred mode of transportation, or tions are needed to deal with multiple categories
response options on survey items). Multicate- of the response variable. One difference is that
gory response variables are found in a wide with three or more levels of the response variable,
range of experiments and studies in a variety of there are multiple ways to dichotomize the
different fields. A detailed example presented in response variable. If J equals the number of cate-
this chapter uses data from 600 students from gories of the response variable, then J(J – 1)/2 dif-
the High School and Beyond study (Tatsuoka & ferent ways exist to dichotomize the categories.
Lohnes, 1988) to look at the differences among In the High School and Beyond study, the three
high school students who attended academic, program types can be dichotomized into pairs
general, or vocational programs. The students’ of programs (i.e., academic and general, voca-
socioeconomic status (ordinal), achievement tional and general, and academic and vocational).
test scores (numerical), and type of school How the response variable is dichotomized
(nominal) are all examined as possible explana- depends, in part, on the nature of the variable. If
tory variables. An example left to the interested there is a baseline or control category, then the
reader using the same data set is to model the analysis could focus on comparing each of the
students’ intended career where the possibilities other categories to the baseline. With three or
consist of 15 general job types (e.g., school, more categories, whether the response variable
manager, clerical, sales, military, service, etc.). is nominal or ordinal is an important consider-
Possible explanatory variables include gender, ation. Since models for nominal responses
achievement test scores, and other variables in can be applied to both nominal and ordinal
the data set. response variables, the emphasis in this chapter
390
26-Osborne (Best)-45409.qxd 10/9/2007 5:52 PM Page 391
Table 26.1 Estimated Parameters (and Standard Errors) From Separate Binary Logistic Regressions and
From the Simultaneously Estimated Baseline Model
each (numerical) explanatory variable in the This odds ratio is interpreted as follows: For
model. a one-unit increase in achievement, the odds of
Turning to interpretation, the regression coeffi- a student attending an academic versus a gen-
cients provide estimates of odds ratios. Using the eral program are 1.12 times larger. For
parameter estimates of the baseline model (col- example, the odds of a student with x = 50
umn 5 of Table 26.1), the estimated odds that a attending an academic program versus a gen-
student is from an academic program versus a gen- eral one is 1.12 times the odds for a student
eral program given achievement score x equals with x = 49. Given
_ the scale of the achievement
variable (i.e., x = 51.99, s = 8.09, min = 32.94,
P̂(Yi = academic|x) and max = 70.00), it may be advantageous to
= exp[−5.0391 + (7) report the odds ratio for an increase of one
P̂(Yi = general|x) standard deviation of the explanatory variable
0.1099x], rather than a one-unit increase. Generally
speaking, exp(βc), where c is a constant, equals
and the estimated odds of an academic versus a the odds ratio for an increase of c units. For
general program for a student with achievement example, for an increase of one standard
score x + 1 equals deviation in mean achievement, the odds
ratio for academic versus general equals
P̂(Yi = academic|x + 1) exp(0.1099(8.09)) = 2.42. Likewise, for a one
= exp[−5.0391 + (8) standard deviation increase in achievement, the
P̂(Yi = general|x + 1)
odds of an academic versus a vocational pro-
0.1099(x + 1)]. gram are exp(0.1698(8.09)) = 3.95 times larger,
but the odds of a vocational program versus a
The ratio of the two odds in Equations 8 and general program are only exp(–0.0599(8.09)) =
7 is an odds ratio, which equals 0.62 times as large.
Odds ratios convey the multiple one odds
P̂(Yi = academic|x + 1) P̂(Yi = academic|x) is relative to another odds. The parameters
also provide information about the probabili-
P̂(Yi = general|x + 1) P̂(Yi = general|x)
ties; however, the effects of explanatory vari-
exp[−5.0391 + 0.1099(x + 1)] ables on probabilities are not necessarily
=
exp[−5.0391 + 0.1099x] straightforward, as illustrated in the “Multiple
Explanatory Variables” section.
= exp(0.1099) = 1.12.
26-Osborne (Best)-45409.qxd 10/9/2007 5:52 PM Page 394
exp[−5.0391 + 0.1099xi ]
P̂(Yi = academic) =
1 + exp[−5.0391 + 0.1099xi ] + exp[2.8996 − 0.0599xi ]
1
P̂(Yi = general) =
1 + exp[−5.0391 + 0.1099xi ] + exp[2.8996 − 0.0599xi ]
26-Osborne (Best)-45409.qxd 10/9/2007 5:52 PM Page 395
1.0
0.9
0.8
Estimated Probability
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0.0
30 40 50 60 70
Mean Achievement Test Score
Figure 26.1 Estimated probabilities of attending different high school programs as a function of mean
achievement.
Table 26.2 Estimated Parameters, Standard Errors, and Wald Test Statistics for All Main Effects Model
achievement, private schools, and higher SES will also be parallel; however, this is not true for
levels. It appears that the odds of vocational ver- the other categories. In fact, the curves for
sus general programs are larger for public responses whose probabilities increase and
schools and lower SES levels; however, this con- then decrease are not parallel and may even
clusion is not warranted. These parameters are cross within the range for which there are data.
not significantly different from zero (see the last To illustrate this, the probabilities of students
two columns of Table 26.2). Statistical inference attending a general program for private schools
and nonsignificant parameters are issues that are plotted against mean achievement scores
are returned to later in this chapter. with a separate curve for each SES level at the
To illustrate the effect of the extra variables top of Figure 26.3. The curves for different SES
on probabilities, the estimated probabilities of levels cross. In normal linear regression, this
attending an academic program are plotted would indicate an interaction between achieve-
against mean achievement scores in Figure 26.2 ment and SES; however, this is not the case in
with a separate curve for each combination of logistic regression. Crossing curves can occur
SES. The figure on the left is for students at in the baseline model with only main effects
public schools, and the figure on the right is for because the relative size of the numerator and
private schools. The discrete variables shift the denominator changes as values of the explana-
curves horizontally. In particular, the horizon- tory variables change. If there are no interac-
tal distance between the curves for high and tions in the model, then curves of odds ratios
middle SES for either public or private schools and logarithms of odds ratios will not cross.
is the same regardless of the mean achievement This is illustrated at the bottom of Figure 26.3,
value. where the logarithms of the odds are plotted
When there is no interaction, the curves for against mean achievement. The logarithms of
the estimated probabilities of the response cat- the odds are linear, and the lines are parallel.
egory that is monotonically increasing (e.g., Although probabilities are more intuitively
academic programs) will be parallel; that is, understandable, figures of estimated probabili-
they will have the same shape and are just ties may be misleading to researchers unfamil-
shifted horizontally. The curves for the iar with multicategory logistic regression
response category where the probabilities models. Care must be taken when presenting
monotonically decrease (e.g., vocational programs) results such as these.
26-Osborne (Best)-45409.qxd 10/9/2007 5:52 PM Page 397
Estimated P(Academic)
1.0 1.0
High High
0.8 0.8
0.0 0.0
30 40 50 60 70 30 40 50 60 70
Mean Achievement Test Score Mean Achievement Test Score
Figure 26.2 Estimated probabilities of attending an academic high school program as a function of mean
achievement with a separate curve for each SES level.
P(General)
1.0
Estimated P(General)
0.8
0.6
Low
0.4
Middle
0.2
High
0.0
30 40 50 60 70
Mean Achievement Test Score
Log[P(Academic)/P(General)] Log[P(Vocational)/P(General)]
3 3
Log [P(Academic)/
High
Log [P(Vocation)/
2 2
P(General)]
P(General)]
1 1
Middle Middle
0 0
Low High
−1 Low −1
−2 −2
−3 −3
30 40 50 60 70 30 40 50 60 70
Mean Achievement Test Score Mean Achievement Test Score
Figure 26.3 At the top is a plot of estimated probabilities of attending a general high school program for
private schools as a function of mean achievement with a separate curve for each SES level
from the model with only main effects. The two lower figures are plots of the estimated
logarithm of odds ratios using parameters from the same model.
With ordinal explanatory variables such as βs for the variable. In our example, suppose we
SES, one way to use the ordinal information is assign 1 to low SES, 2 to middle SES, and 3 to
by assigning scores or numbers to the categories high SES and refit the model. In this model,
and treating the variables as numerical variables only one β parameter is estimated for SES
in the model (e.g., like mean achievement). rather than two for each of the (J – 1) odds.
Often, equally spaced integers are used, which Using SES as a numerical variable with equally
amounts to putting equality restrictions on the spaced scores in the model imposes restrictions
26-Osborne (Best)-45409.qxd 10/9/2007 5:52 PM Page 398
on the βs. In our example, βj3 = βj4 for the odds incorporated into the model, which permits sta-
for j = 1 and 2 (i.e., academic and vocational, tistical testing of these hypotheses.
respectively). The parameter estimates for the
model where SES is treated numerically are Conditional Multinomial
reported in Table 26.3. Logistic Regression
Placing the restrictions on the βs for ordinal
variables is often a good way to reduce the com- The conjectures described above regarding
plexity of a model. For example, the estimated possible equalities, nonsignificant effects, and
odds ratio of academic versus general for middle restrictions on parameters can be imposed (and
versus low SES equals exp(β̂13(2−1)) = exp(β̂13) = tested) by reexpressing the baseline model as a
1.70, which is the same as the odds ratio of high conditional multinomial logistic regression
versus middle SES, exp(β̂13(3−2)) = 1.70. model, which is a more general model. The
In our example, putting in equally spaced conditional multinomial logistic regression
scores for SES is not warranted and is misleading. model is also known as the “discrete choice
When no restrictions were imposed on the para- model,” “McFadden’s model,” and “Luce’s choice
meters (see Figure 26.3 and Table 26.2), the order model.” Some sources for more complete intro-
of the SES levels for the odds of academic (versus ductions to this model include Long (1997),
general) schools is in the expected order (i.e., the Agresti (2002), and Powers and Xie (2000).
odds of an academic program are larger the Unlike the baseline model, the conditional
higher the student’s SES level), and the parameter model permits explanatory variables that are
estimates are approximately equally spaced. On attributes of response categories.
the other hand, the parameter estimates of SES The general form of the conditional multi-
for odds of vocational schools do not follow the nomial logistic model is
natural ordering of low to high, are relatively
close together, and are not significantly different
from zero. The numerical scores could be used for exp(β∗ x∗ij )
P(Yi = j|x ) =
∗
, (15)
the SES effect on the odds of academic programs ij
exp(β∗ xih∗ )
h
but the scores are inappropriate for the odds of
vocational programs. There may not even be a
difference between vocational and general pro- where β* is a vector of coefficients and x*ij is a
grams in terms of SES. Furthermore, there may vector of explanatory variables. The explana-
not be a difference between students who tory variables x*ij may depend on the attributes
attended vocational and general programs with of an individual (i), the response category (j), or
respect to school type (Wald = 0.27, df = 1, p = .60). both, but the coefficients β* do not depend on
In the following section, these conjectures are the response categories. To reexpress the baseline
Table 26.3 Estimated Parameters, Standard Errors, and Wald Statistics for All Main Effects Model
NOTE: SES is treated as a numerical variable with scores of 1 = low, 2 = middle, and 3 = high.
26-Osborne (Best)-45409.qxd 10/9/2007 5:52 PM Page 399
model in the general form of the conditional For general programs, di1 = di2 = 0.
logistic model given in Equation 15, data need For our simple baseline model with only
to be reformatted from having one line in the achievement test scores in the model, β*′ = (α1,
data file for each individual to multiple lines of α2, β11, β21) and x*ij ′ = (di1, di2, di1xi, di2xi). Using
data for each individual. When fitting condi- the definitions of dij, β*, and x*ij, we obtain our
tional multinomial models, the data must have familiar form of the baseline model,
one line of data for each response category for
each individual to incorporate explanatory
variables that are characteristics of the response P(Yi = j) =
categories.
exp[αj + βj1 xi ] (16)
The format of the data file for the baseline ,
model where mean achievement is the only (1 + exp[α1 + β11 xi ]
explanatory variable in the model is given in + exp[α2 + β21 xi ])
Table 26.4. In the data, Y is the response variable
that indicates a student’s high school program.
which, for our example, corresponds to
The indicators, dij, are the key to putting the
baseline model into the form of the conditional
multinomial model, as well as to specifying dif-
P(Yi = academic) =
ferent models for the different program types
(i.e., levels of the response variable). In our exp[α1 + β11 xi ]
, (17)
example, the two indicator variables, di1 and di2, (1 + exp[α1 + β11 xi ]
indicate the category of the response variable + exp[α2 + β21 xi ])
corresponding to particular line in the data file.
They are defined as
1 when program type is academic P(Yi = vocational) =
di1 = 0 otherwise exp[α2 + β21 xi ]
, and (18)
1 when program type is vocational (1 + exp[α1 + β11 xi ]
di2 = 0 otherwise + exp[α2 + β21 xi ])
Table 26.4 The Format of the Data File Needed to Fit the Baseline Multinomial Model as Conditional
Multinomial Model With Mean Achievement as the Explanatory Variable
1 32.94 General 0 0 0 0 0
1 32.94 Academic 0 1 0 32.94 0
1 32.94 Vocational 1 0 1 0 32.94
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
β*′ = (α1, α2, β11, β21, β12, β22, β13, β23, β14, β24) P(Yi = general) =
1
and .
1 + exp[α1 + β11 xi + β12 pi + β13 si ]
+ exp[α2 + β21 xi ]
x*ij′ = (di1, di2, di1xi, di2xi, di1pi,
di2pi, di1s1i, di2s1i, di1s2i, di2s2i).
The parameter estimates for this final model
Using these β* and x*ij in Equation 15, the complex are given in Table 26.5 and are similar to those in
baseline model with main effects for achievement the baseline model with no restrictions on the
(xi), school type (pi), and SES (s1i, s2i) expressed as parameters (i.e., Table 26.2). The model with
a conditional multinomial model is restrictions fit as a conditional multinomial
model is more parsimonious than the baseline
model (i.e., 6 vs. 10 nonredundant parameters).
P(Yi = j) = exp[αj dij + βj1 dij xi The models for the probabilities of different
+ βj2 dij pi + βj3 dij s1i high school programs do not have the same
effects. As stated earlier, by fitting all of the odds
+ βj4 dij s2i ]ki ,
(20) simultaneously, the logical restrictions on the
where ki = exp[αh dih + βh1 dih xi parameters are maintained, and the standard
h
errors of the parameters are smaller (i.e., there is
+ βj2 dih pi + βh3 dih s1i + βh4 dih s2i ].
greater precision).
Switching to the conditional multinomial
model emphasizes that users of multinomial logis-
When reexpressing the baseline model as a con- tic regression are not restricted to the standard
ditional multinomial model, a good practice is models that computer programs fit by default,
to refit the baseline models that have already which often fail to address specific research ques-
been fit to the data as conditional models. If tions. By creating new variables, using indicator
the data matrix and conditional models have variables, and using a slightly more general model,
been correctly specified, then the results researchers can tailor their models to best match
obtained from the baseline and conditional their research questions. Imposing restrictions on
models will be the same (i.e., parameters esti- parameters when they are warranted can greatly
mates, standard errors, etc.). simplify a complex model. This was illustrated in
One simplification of Equation 20 is to treat our example. Compare Tables 26.2 and 26.5. In
SES numerically for academic programs (i.e., use the model reported in Table 26.2, there were 10
26-Osborne (Best)-45409.qxd 10/9/2007 5:52 PM Page 401
Table 26.5 Estimated Parameters From the Conditional Multinomial Logistic Regression Model
nonredundant parameters, some of which are not for assessing whether β = 0 are the Wald statistic
significant and others that are. An additional rea- and the likelihood ratio statistic.
son for going the extra step of using the condi- Maximum likelihood estimates of the model
tional model is that novice users often succumb to parameters are asymptotically normal with
the temptation of interpreting nonsignificant mean equal to the value of the parameter in the
parameters such as those in Table 26.2. The better population—that is, the sampling distribution
approach is to avoid the temptation. Alternatively, of β̂ ∼ N (β, σ 2β). This fact can be used to test the
researchers may simply not report the nonsignifi- hypothesis β = 0 and to form confidence inter-
cant effects even though they were in the model. vals for β and odds ratios.
This also is a poor (and misleading) practice. Table If the null hypothesis H0 : β = 0 is true, then
26.5 contains only 6 nonredundant parameters, all the statistic
of which are significant. What is statistically
important stands out, and the interpretation is β̂
much simpler. In our example, it is readily appar- z= ,
SE
ent that students in general and vocational pro-
grams only differ with respect to achievement test
where SE is the estimate of σ 2β, has an approxi-
scores, whereas students in academic programs
mate standard normal distribution. This statistic
and those in one of the other programs differ
can be used for directional or nondirectional
with respect to achievement test scores, SES, and
tests. Often, the statistic is squared, X2 = z2, and
school type.
is known as a Wald statistic, which has a sam-
pling distribution that is approximately chi-
square with 1 degree of freedom.
STATISTICAL INFERENCE When reporting results in papers, many jour-
nals require confidence intervals for effects.
Simple statistical tests were used informally in Although many statistical software packages
the previous section to examine the significance automatically provide confidence intervals, we
of effects in the models discussed. The topic of show where these intervals come from, which
statistical inference is explicitly taken up in points to the relationship between the confi-
detail in this section. There are two basic types dence intervals and the Wald tests. A (1 − α)%
of statistical inference that are of interest: the Wald confidence interval for β is
effect of explanatory variables and whether
response categories are indistinguishable with β̂ ± z(1 − α)/2(SE),
respect to the explanatory variables.
where z(1 − α)/2 is the (1 − α)/2th percentile of the
standard normal distribution. A (1 − α)% confi-
Tests of Effects
dence interval for the odds ratio is found by expo-
If β = 0, then an effect is unrelated to the nentiating the end points of the interval for β.
response variable, given all the other effects in As an example, consider the parameter esti-
the model. The two test statistics discussed here mates given in Table 26.5 of our final model from
26-Osborne (Best)-45409.qxd 10/9/2007 5:52 PM Page 402
the previous section. The Wald statistics and cor- example, in the baseline model with all main
responding p values are reported in the seventh effects (nominal SES), we could set all of the
and eighth columns. For example, the Wald statis- parameters for SES and school type equal to
tic for achievement for academic programs equals zero; that is, βj2 = βj3 = βj4 = 0 for j = 1 and 2. The
complex model is
2
0.0978
X2 = = 39.81. P(Yi = j)
0.0155 = exp[αj + βj1 xi + βj2 pi + βj3 s1i + βj4 s2i ],
P(Yi = J )
The 95% confidence interval of β for achieve- and the simpler model is
ment equals
P(Yi = j)
0.0987 ± 1.96(0.0155)→(0.06742, 0.12818), = exp[αj + βj1 xi ].
P(Yi = J )
and the 95% confidence interval for the odds
ratio equals The maximum of the likelihoods of all
the models estimated for this chapter are
(exp(0.06742), exp(0.12818))→(1.07, 1.14). reported in Table 26.6. The difference between
the likelihood for these two models equals
The ratio of a parameter estimate to its stan-
dard errors provides a test for single parameters; G 2 = –2(–541.8917 + 520.26080) = 43.26.
however, Wald statistics can be used for testing
whether any and multiple linear combinations of The degrees of freedom for this test equal 6,
the βs equal zero (Agresti, 2002; Long, 1997). Such because 6 parameters have been set equal to
Wald statistics can be used to test simultaneously zero. Comparing 43.26 to the chi-square distri-
whether an explanatory variable or variables have bution with 6 degrees of freedom, the hypothe-
an effect on any of the models for the odds. Rather sis that SES and/or school type have significant
than present the general formula for Wald tests effects is supported.
(Agresti, 2002; Long, 1997), the likelihood ratio Equality restrictions on parameters can also
statistic provides a more powerful alternative to be tested using the likelihood ratio test. Such
test more complex hypotheses. tests include determining whether an ordinal
The likelihood ratio test statistic compares the explanatory variable can be treated numerically.
maximum of the likelihood functions of two mod- For example, when SES was treated as a nominal
els, a complex and a simpler one. The simpler variable in the baseline model, SES was repre-
model must be a special case of the complex model sented in the model by βj3s1i + βj4s2i, where s1i and
where restrictions have been placed on the para- s2i were effect codes. When SES was treated as a
meters of the more complex model. Let ln(L(M0)) numeric variable (i.e., si = 1, 2, 3 for low, middle,
and ln(L(M1)) equal the natural logarithms of the high), implicitly the restriction that βj3 = βj4 was
maximum of the likelihood functions for imposed, and SES was represented in the model
the simple and the complex models, respectively. by βj3si. The likelihood ratio test statistic equals
The likelihood ratio test statistic equals 3.99, with df = 2 and p = .14 (see Table 26.6).
Even though SES should not be treated numeri-
G 2 = –2(ln(L(M0)) − ln(L(M1))). cally for vocational and general programs, the
test indicates otherwise. It is best to treat cate-
Computer programs typically provide either the gorical explanatory variables nominally, exam-
value of the maximum of the likelihood func- ine the results, and, if warranted, test whether
tion, the logarithm of the maximum, or –2 times they can be used as numerical variables.
the logarithm of the maximum. If the null
hypothesis is true, then G 2 has an approximate Tests Over Regressions
chi-square distribution with degrees of freedom
equal to the difference in the number of param- A second type of test that is of interest in
eters between the two models. multinomial logistic regression modeling is
The most common type of restriction on whether two or more of the response categories
parameters is setting them equal to zero. For are indistinguishable in the sense that they have
26-Osborne (Best)-45409.qxd 10/9/2007 5:52 PM Page 403
Table 26.6 Summary of All Models Fit to the High School and Beyond Data Where Mean Achievement,
School Type, and SES Are Explanatory Variables
Number of Models
Model Parameters Ln(Likelihood) Compared G2 df p AIC
the same parameter values. If there is no differ- main effects, we fit the conditional multinomial
ence between two responses—say, j and j*—then model with β*′ = (α1, α2, β11, β12, β13, β14) and x*ij′ =
(di1, di2, di1xi, di1pi, di1s1i, di1s2i); that is, the terms
(βj1 − βj*1) = . . . = (βjK − βj*K) = 0, (21) with di2 indicators were all dropped from the
model except for the intercept α2. The reduced
where K equals the number of explanatory vari- model is M4 in Table 26.6, and G2 = 16.22, df = 4,
ables. If two response levels are indistinguish- and p < .01. The null hypothesis is again
able, they can be combined (Long, 1997). In our rejected, and the general and vocational pro-
example, the parameter estimates for the two grams are distinguishable on at least one of the
main effect models have nonsignificant odds for three explanatory variables.
vocational versus general (see Tables 26.2 and
26.3). This suggests that vocational and general
programs may be indistinguishable. MODEL ASSESSMENT
The indistinguishability hypothesis repre-
sented in Equation 21 can be tested in two ways. Before drawing conclusions, the adequacy of
The simple method is to create a data set that only the model or subset of models must be assessed.
includes the two response variables, fit a binary Goodness of fit, model comparisons, and regres-
logistic regression model to the data, and use a sion diagnostics are discussed below.
likelihood ratio test to assess whether the explana-
tory variables are significant (Long, 1997). In our
Goodness of Fit
example, the model in Equation 13 was fit to the
subset of data that only includes students from Two typical tests of goodness of fit are
general and vocational programs. According to Pearson’s chi-square statistic and the likelihood
the likelihood ratio test, the explanatory variables ratio chi-square statistic; however, for tests of
are significant (G2 = 17.53, df = 4, p < .01). goodness of fit to be valid, these statistics should
The second and preferable way to test the have approximate chi-square distributions. For
indistinguishability hypothesis makes use of the the sampling distribution of the goodness-of-fit
fact that the baseline model is a special case of statistics to be chi-square requires two condi-
the conditional model. Besides using all of the tions (Agresti, 2002, 2007). First, most fitted val-
data, this second method has the advantage ues of “cells” of the cross-classification of the
that it can be used to simultaneously test response variable by all of the explanatory vari-
whether three or more of the responses are ables should be greater than 5. Second, as more
indistinguishable or to place restrictions on sub- observations are added to the data set, the size
sets of parameters. To test the indistinguishabil- of the cross-classification does not increase. In
ity hypotheses for vocational and general other words, as observations are added, the
programs relative to the baseline model with all number of observations per cell gets larger.
26-Osborne (Best)-45409.qxd 10/9/2007 5:52 PM Page 404
The High School and Beyond data set, even the Hosmer-Lemeshow statistic for each of them
with 600 students, fails both of the require- (Hosmer & Lemeshow, 2000).
ments. Consider the model with only achieve- The Hosmer-Lemeshow statistic is basically
ment. The cross-classification of students by Pearson’s chi-square statistic,
high school program type and achievement is
given on the right side of Table 26.7. There are (observed − expected)2
545 unique values of the mean achievement X2 = .
scores, so most cells equal 0. If a new student is cells
expected
added to the data set, it is possible that his or
her mean achievement level will be different The observed and expected values are frequen-
from the 545 levels already in the data set, and cies found by ordering the predicted values from a
adding a new student would add another row to binary logistic regression model from smallest to
Table 26.7. Even if only school type and SES are largest and then partitioning the cases into
included in the model, the closeness of the approximately equal groups. Both the data and
approximation of the sampling distribution of the expected values from the regression are cross-
goodness-of-fit statistics is uncertain. The classified into tables of group by response cate-
cross-classification of programs by SES by gory. Even though the sampling distribution of
school type is given on the left side of Table the Hosmer-Lemeshow statistic is not chi-square,
26.7, and 5 of the 18 (28%) of the cells are less comparing the Hosmer-Lemeshow statistic to a
than 5. The sampling distributions of Pearson’s chi-square distribution performs reasonably well.
chi-square and the likelihood ratio statistics A dichotomous logistic regression model for
may not be close to chi-square. The solution is academic versus general was fit with achieve-
not to throw away information by collapsing ment, school type, and SES as an ordinal vari-
data. Alternatives exist. able and yielded a Hosmer-Lemeshow statistic =
In the case of binary logistic regression, 8.84, df = 8, and p = .36. For the binary logistic
the Hosmer-Lemeshow statistic is often used regression model of vocational and general pro-
to assess model goodness of fit (Hosmer grams with only achievement as an explanatory
& Lemeshow, 2000); however, no such statistic variable, the model had a Hosmer-Lemeshow
is readily computed for the multinomial case. statistic = 9.00, df = 8, and p = .34. In both cases,
Goodness-of-fit statistics for large, sparse con- the models appear adequate.
tingency tables, including multinomial logistic
models, is an active area of research (e.g.,
Model Comparisons
Maydeu-Olivares & Joe, 2005). Until more suit-
able procedures become available in standard When a subset of models is available and a
statistical packages, one suggestion is to perform user wishes to select the “best” one to report
dichotomous logistic regressions and compute and interpret, various measures and statistics are
Table 26.7 Cross-Classifications of the 600 Students in the High School and Beyond Data Set by School
Type, SES, and High School Program (Left) and by Mean Achievement and High School
Program Type (Right)
...
...
...
High 24 82 20
...
...
...
Middle 7 36 7
High 1 35 0 70.00 0 1 0
26-Osborne (Best)-45409.qxd 10/9/2007 5:52 PM Page 405
available for model comparisons. If the models fit binary logistic regression models, and use
are nested, then the conditional likelihood ratio the regression diagnostics that are available
tests discussed in the “Statistical Inference” sec- for binary logistic regression (Hosmer &
tion are possible. Although the goodness-of-fit Lemeshow, 2000).
statistics may not have good approximations by Observations may have too much influence
chi-square distributions, the conditional likeli- on the parameter estimates and the goodness of
hood ratio tests often are well approximated by fit of the model to data. Influential observations
chi-square distributions. One strategy is to start tend to be those that are extreme in terms of
with an overly complex model that gives an ade- their values on the explanatory variables. Most
quate representation of the data. Various effects diagnostics for logistic regression are general-
could be considered for removal by performing izations of those for normal linear regression
conditional likelihood ratio tests following the (Pregibon, 1981; see also Agresti, 2002, 2007;
procedure described earlier for testing effects in Hosmer & Lemeshow, 2000), but there are some
the model. exceptions (Fahrmeir & Tutz, 2001; Fay, 2002).
When the subset of models is not nested, One exception is “range-of-influence” statistics.
information criteria can help to choose the Rather than focusing on whether observations
“best” model in the set. Information criteria are outliers in the design or exert great influence
weight goodness of model fit and complexity. on results, range-of-influence statistics are
A common choice is Akaike’s information crite- designed to check for possible misclassifications
rion (AIC), which equals of a binary response (Fay, 2002). They are par-
ticularly useful if the correctness of classifica-
–2(maximum log likelihood – tions into the response categories is either very
number of parameters) costly or impossible to check.
Table 26.8 Statistical Test for Effects When There Are Only Main Effects in the Model, All
Two-Way Interactions (Highly Correlated Effects) and All Two-Way Interactions
Where Mean Achievement and SES Are Standardized
achievement, SES, school type, and interactions using the model to classify individuals into
containing them are reported in Table 26.9. Four response levels performs perfectly for one or
of the nine correlations are greater than 0.70, and more of the response categories. In a very simple
two of these are greater than 0.90. case, if only men attend vocational programs,
Dealing with multicollinearity is the same as then including gender in the model would lead
that for normal multiple linear regression. If to separation. Typically, the situation is not quite
correlations between main effects and interac- so simple when specific combinations among
tions lead to multicollinearity, then the variables the explanatory variables occur such that one or
can be standardized. Table 26.9 also contains the more categories of the response variable can be
correlations between the standardized explana- predicted perfectly. When data exhibit quasi-
tory variables, and these are all quite small. In complete or complete separation, maximum
our example, standardization solved the multi- likelihood estimates of the parameters may not
collinearity problem. As can be seen in the exist. Some computer programs will issue a
right side of Table 26.8, the main effects are sig- warning message that quasi-complete separa-
nificant when the explanatory variables are tion has occurred, and others may not. The esti-
standardized. mated standard errors for the parameter
A problem more unique to multinomial estimates get very large or “blow up.” For
logistic regression modeling is “quasi-complete example, separation is a problem in the High
separation” or “complete separation of data School and Beyond data set if the variable race is
points.” Separation means that the pattern in the added to the model that includes all two-way
data is such that there is no overlap between two interactions for achievement, SES, and school
or more response categories for some pattern(s) type. Most of the standard errors are less than 1;
of the values on the explanatory variables however, there are many that are between 10 and
(Albert & Anderson, 1984). In other words, 64. Separation occurs when the model is too
Unstandardized Standardized
complex for the data. The solution is to get more MANOVA, DISCRIMINANT
data or simplify the model. ANALYSIS, AND LOGISTIC REGRESSION
Before concluding this chapter, a discussion of
SOFTWARE the relationship between multinomial logistic
regression models, multivariate analysis of vari-
An incomplete list of programs that can fit the ance (MANOVA), and linear discriminant
models reported in this chapter is given here. analysis is warranted. As an alternative to mul-
The models were fit to data for this chapter tinomial logistic regression with multiple
using SAS Version 9.1 (SAS Institute, Inc., explanatory variables, a researcher may choose
2003). The baseline models were fit using MANOVA to test for group differences or linear
PROC LOGISTIC under the STAT package, and discriminant analysis for either classification
the conditional multinomial models were fit into groups or description of group differences.
using PROC MDC in the econometrics pack- The relationship between MANOVA and linear
age, ETS. Input files for all analyses reported in discriminant analysis is well documented (e.g.,
this chapter as well as how to compute regres- Dillon & Goldstein, 1984; Johnson & Wichern,
sion diagnostics are available from the author’s 1998); however, these models are in fact very
Web site at http://faculty.ed.uiuc.edu/cja/Best closely related to logistic regression models.
Practices/index.html. Also available from this MANOVA, discriminant analysis, and multi-
Web site is a SAS MACRO that will compute nomial logistic regression all are applicable to
range-of-influence statistics. Another commer- the situation where there is a single discrete vari-
cial package that can fit both kinds of models is able Y with J categories (e.g., high school pro-
STATA (StataCorp LP, 2007; see Long, 1997). In gram type) and a set of K random continuous
the program R (R Development Core Team, variables denoted by X = (X1, X2, . . . , XK)′ (e.g.,
2007), which is an open-source version of achievement test scores on different subjects).
SPLUS (Insightful Corp., 2007), the baseline Before performing discriminant analysis, it is
model can be fit to data using the glm function, the recommended practice to first perform a
and the conditional multinomial models can be MANOVA to test whether differences over the
fit using the coxph function in the survival pack- J categories or groups exist (i.e., H0 : µ1 =
age. Finally, SPSS (SPSS, 2006) can fit the base- µ2 = . . . = µJ, where µj is a vector of means on the
line model via the multinomial logistic K variables). The assumptions for MANOVA are
regression function, and the conditional multi- that vectors of random variables from each
nomial model can be fit using COXREG. group follow a multivariate normal distribution
with mean equal to µj for group j, the covariance
matrices for the groups are all equal, and obser-
MODELS FOR ORDINAL RESPONSES vations over groups are independent (i.e.,
Xj ~ NK (µj, ∑) and independent).
Ordinal response variables are often found in If the assumptions for MANOVA are met,
survey questions with response options such as then a multinomial logistic regression model
strongly agree, agree, disagree, and strongly dis- must necessarily fit the data. This result is based
agree (or never, sometimes, often, all the time). on statistical graphical models for discrete and
A number of extensions of the binary logistic continuous variables (Laurizten, 1996; Laurizten
regression model exist for ordinal variables, & Wermuth, 1989). Logistic regression is just the
the most common being the proportional odds “flip side of the same coin.” If the assumptions
model (also known as the cumulative logit for MANOVA are not met, then the statistical
model), the continuation ratios model, and tests performed in MANOVA and/or discrimi-
the adjacent categories model. Each of these is a nant analysis are not valid; however, statistical
bit different in terms of how the response cate- tests in logistic regression will likely still be valid.
gories are dichotomized as well as other specifics. In the example in this chapter, SES and school
Descriptions of these models can be found in type are discrete and clearly are not normal. This
Agresti (2002, 2007), Fahrmeir and Tutz (2001), poses no problem for logistic regression.
Hosmer and Lemeshow (2000), Long (1997), A slight advantage of discriminant analysis and
Powers and Xie (2000), and elsewhere. MANOVA over multinomial logistic regression
26-Osborne (Best)-45409.qxd 10/9/2007 5:52 PM Page 408
may be a greater familiarity with these methods The differences between students in academic
by both researchers and readers and the ease of programs and either a general or vocational pro-
implementation of these methods. However, gram were slightly more complex but still
when the goal is classification or description, these straightforward to describe.
advantages may come at the expense of classifica-
tion accuracy, invalid statistical tests, and difficulty
describing differences between groups. As an illus- EXERCISES
tration of classification using the High School and
Beyond data, discriminant analysis with achieve-
1. Use the High School and Beyond data set
ment as the only feature was used to classify
(Tatsuoka & Lohnes, 1988) to model students’
students, as well as a discriminant analysis with all
career choice as the response variable. Consider
main effects. Frequencies of students in each pro-
gender, achievement test scores, and self-
gram type and the predicted frequencies under
concept as possible explanatory variables. Do
discriminant analysis and multinomial logistic
any of the careers have the same regression para-
regression are located in Table 26.10. In this
meters? Note that some careers have very low
example, logistic regression performed better than
numbers of observations; these may have to be
discriminant analysis in terms of overall classifica-
deleted from the analysis.
tion accuracy. The correct classification rates from
the multinomial logistic regression models were 2. The English as a second language (ESL)
.58 (achievement only) and .60 (main effects of data come from a study that investigated the
achievement, SES, and school type), and those for validity and generalizability of an ESL place-
linear discriminant analysis were .52 in both cases. ment test (Lee & Anderson, in press). The test
If a researcher’s interest is in describing dif- is administrated to international students at a
ferences between groups, multinomial logistic large midwestern university, and the results are
regression is superior for a number of reasons. used to place students in an appropriate ESL
The flexibility of logistic regression allows for a course sequence. The data set contains the test
number of different models. In our example, we results (scores of 2, 3, 4), self-reported Test of
were able to test whether SES should be treated English as a Foreign Language (TOEFL) scores,
as an ordinal or nominal variable. Furthermore, field of study (business, humanities, technology,
we were also able to discern that students who life science), and topic used in the placement
attended general and vocational programs were exam (language acquisition, ethics, trade barri-
indistinguishable in terms of SES and school ers) for 1,125 international students. Controlling
type, but students who attended academic pro- for general English-language ability as measured
grams versus one of the other programs were by the TOEFL, use this data set to investigate
distinguishable with respect to achievement test whether the topic of the placement test influ-
scores, SES, and school type. We were also able ences the test results. In particular, do students
to describe the nature and amount of differences who receive a topic in their major field of study
between students who attended general and have an advantage over others? Majors in busi-
vocational programs in a relatively simple way. ness were thought to have an advantage when
Table 26.10 Predicted Frequencies of Program Types for Linear Discriminant Analysis (DA) and
Multinomial Logistic Regression (LR)
Predicted Frequency
Program Type Observed Frequency DA—Achieve DA—All Main LR—Achieve LR—All Main
the topic was trade, and those in humanities Lauritzen, S. L., & Wermuth, N. (1989). Graphical
might have had an advantage when the topic models for associations between variables,
was language acquisition or ethics. Create a vari- some of which are qualitative and some of
able that tests this specific hypothesis. which are quantitative. Annals of Statistics,
17, 31–57.
Lee, H. K., & Anderson, C. J. (in press). Validity and
topic generality of a writing performance test.
NOTE Language Testing.
Lesaffe, E., & Albert, A. (1989). Multiple-group
1. See Chapter 21 (this volume) for details on logistic regression diagnostics. Applied Statistics,
Poisson regression. 38, 425–440.
Long, J. S. (1997). Regression models for categorical
and limited dependent variables. Thousand
REFERENCES Oaks, CA: Sage.
Maydeu-Olivares, A., & Joe, J. (2005). Limited and
Agresti, A. (2002). Categorical data analysis (2nd full-information estimation and goodness-of-fit
ed.). New York: John Wiley. testing in 2n contingency tables: A unified
Agresti, A. (2007). An introduction to categorical data framework. Journal of the American Statistical
analysis (2nd ed.). New York: John Wiley. Association, 100, 1009–1020.
Albert, A., & Anderson, J. A. (1984). On the existence Powers, D. A., & Xie, Y. (2000). Statistical methods for
of maximum likelihood estimates in logistic categorical data analysis. San Diego: Academic
regression models. Biometrika, 71, 1–10. Press.
Dillon, W. R., & Goldstein, M. (1984). Multivariate Pregibon, D. (1981). Logistic regression diagnostics.
analysis: Methods and applications. New York: Annals of Statistics, 9, 705–724.
John Wiley. R Development Core Team. (2007). R: A language
Fahrmeir, L., & Tutz, G. (2001). Multivariate and environment for statistical computing
statistical modeling based on generalized linear [Computer software]. Vienna, Austria: Author.
models. New York: Springer. Available from http://www.R-project.org
Fay, M. P. (2002). Measuring a binary response’s SAS Institute, Inc. (2003). SAS/STAT software,
range of influence in logistic regression. Version 9.1 [Computer software]. Cary, NC:
The American Statistician, 56, 5–9. Author. Available from http://www.sas.com
Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic SPSS. (2006). SPSS for Windows, Release 15
regression (2nd ed.). New York: John Wiley. [Computer software]. Chicago: Author.
Insightful Corp. (2007). S-Plus, Version 8 [Computer Available from http://www.spss.com
software]. Seattle, WA: Author. Available from StataCorp LP. (2007). Stata Version 10 [Computer
http://www.insightful.com software]. College Station, TX: Author.
Johnson, R. A., & Wichern, D. W. (1998). Applied Available from http://www.stata.com/products
multivariate statistical analysis (4th ed.). Tatsuoka, M. M., & Lohnes, P. R. (1988). Multivariate
Englewood Cliffs, NJ: Prentice Hall. analysis: Techniques for educational and
Lauritzen, S. L. (1996). Graphical models. New York: psychological research (2nd ed.). New York:
Oxford University Press. Macmillan.