Scott Long
Advanced Quantitative Techniques
in the Social Sciences
All rights reserved. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.

For information address:

Long, J. Scott
Regression models for categorical and limited dependent variables / author, J. Scott Long.
List of Figures xi
List of Tables xv
Series Editor's Introduction xix
Preface xxiii
Acknowledgments xxv
Abbreviations and Notation xxvii
1. Introduction 1
1.1. Linear and Nonlinear Models 3
1.2. Organization 6
1.3. Orientation 9
1.4. Bibliographic Notes 10
2. Continuous Outcomes: The Linear Regression Model 11
2.1. The Linear Regression Model 11
2.2. Interpreting Regression Coefficients 14
2.3. Estimation by Ordinary Least Squares 18
2.4. Nonlinear Linear Regression Models 20
2.5. Violations of the Assumptions 22
2.6. Maximum Likelihood Estimation 25
2.7. Conclusions 33
2.8. Bibliographic Notes 33
3. Binary Outcomes: The Linear Probability, Probit, and Logit Models 34
3.1. The Linear Probability Model 35
3.2. A Latent Variable Model for Binary Variables 40
3.3. Identification 47
3.4. A Nonlinear Probability Model 50
3.5. ML Estimation 52
3.6. Numerical Methods for ML Estimation 54
3.7. Interpretation 61
3.8. Interpretation Using Odds Ratios 79
3.9. Conclusions 83
3.10. Bibliographic Notes 83
4. Hypothesis Testing and Goodness of Fit 85
4.1. Hypothesis Testing 85
4.2. Residuals and Influence 98
4.3. Scalar Measures of Fit 102
4.4. Conclusions 112
4.5. Bibliographic Notes 113
5. Ordinal Outcomes: Ordered Logit and Ordered Probit Analysis 114
5.1. A Latent Variable Model for Ordinal Variables 116
5.2. Identification 122
5.3. Estimation 123
5.4. Interpretation 127
5.5. The Parallel Regression Assumption 140
5.6. Related Models for Ordinal Data 145
5.7. Conclusions 146
5.8. Bibliographic Notes 147
6. Nominal Outcomes: Multinomial Logit and Related Models 148
6.1. Introduction to the Multinomial Logit Model 149
6.2. The Multinomial Logit Model 151
6.3. ML Estimation 156
6.4. Computing and Testing Other Contrasts 158
6.5. Two Useful Tests 160
6.6. Interpretation 164
6.7. The Conditional Logit Model 178
6.8. Independence of Irrelevant Alternatives 182
6.9. Related Models 184
6.10. Conclusions 185
6.11. Bibliographic Notes 186
7. Limited Outcomes: The Tobit Model 187
7.1. The Problem of Censoring 188
7.2. Truncated and Censored Distributions 192
7.3. The Tobit Model for Censored Outcomes 196
7.4. Estimation 204
7.5. Interpretation 206
7.6. Extensions 211
7.7. Conclusions 216
7.8. Bibliographic Notes 216
8. Count Outcomes: Regression Models for Counts 217
8.1. The Poisson Distribution 218
8.2. The Poisson Regression Model 221
8.3. The Negative Binomial Regression Model 230
8.4. Models for Truncated Counts 239
8.5. Zero Modified Count Models 242
8.6. Comparisons Among Count Models 247
8.7. Conclusions 249
8.8. Bibliographic Notes 249
9. Conclusions 251
9.1. Links Using Latent Variable Models 252
9.2. The Generalized Linear Model 257
9.3. Similarities Among Probability Models 258
9.4. Event History Analysis 258
9.5. Log-Linear Models 259
A. Answers to Exercises 264
References 274
Author Index 283
Subject Index 287
About the Author 297
List of Figures

3.11 Probability of Labor Force Participation by Age and Family Income for Women Without Some College Education 68
3.12 Marginal Effect in the Binary Response Model 73
3.13 Marginal Versus Discrete Change in Nonlinear Models 76
4.1 Sampling Distribution for a z-Statistic 86
4.2 Wald, Likelihood Ratio, and Lagrange Multiplier Tests 88
4.3 Sampling Distribution of a Chi-Square Statistic with 5 Degrees of Freedom 89
4.4 Index Plot of Standardized Pearson Residuals 100
4.5 Index Plot of Cook's Influence Statistics 101
5.1 Regression of a Latent Variable Compared to the Regression of the Observed Variable y 118
5.2 Distribution of y* Given x for the Ordered Regression Model 120
5.3 Predicted and Cumulative Probabilities for Women in 1989 132
5.4 Illustration of the Parallel Regression Assumption 141
6.1 Discrete Change Plot for the Multinomial Logit Model of Occupational Attainment. Control Variables Are Held at Their Means. Jobs Are Classified as: M = Menial; C = Craft; B = Blue Collar; W = White Collar; and P = Professional 168
6.2 Odds Ratio Plot for a Hypothetical Binary Logit Model 172
6.3 Odds Ratio Plot of Coefficients for a Hypothetical Multinomial Model With Three Outcomes 174
6.4 Odds Ratio Plot for a Multinomial Logit Model of Occupational Attainment. Jobs Are Classified as: M = Menial; C = Craft; B = Blue Collar; W = White Collar; and P = Professional 175
6.5 Enhanced Odds Ratio Plot With the Size of Letters Corresponding to the Magnitude of the Discrete Change in the Probability. Discrete Changes Are Computed With All Variables Held at Their Means. Jobs Are Classified as: M = Menial; C = Craft; B = Blue Collar; W = White Collar; and P = Professional 176
6.6 Enhanced Odds Ratio Plot for the Multinomial Logit Model of Attitudes Toward Working Mothers. Discrete Changes Were Computed With All Variables Held at Their Means. Categories Are: 1 = Strongly Disagree; 2 = Disagree; 3 = Agree; 4 = Strongly Agree 178
7.1 Censored and Truncated Variables 188
7.2 Regression Model With and Without Censoring and Truncation 190
7.3 Normal Distribution With Truncation and Censoring 192
7.4 Inverse Mills Ratio 195
7.5 Probability of Being Censored in the Tobit Model 198
7.6 Probability of Being Censored by Gender, Fellowship Status, and Prestige of Doctoral Department 200
7.7 Expected Values of y*, y | y > τ, and y in the Tobit Model 202
7.8 Maximum Likelihood Estimation for the Tobit Model 204
8.1 Poisson Probability Distribution 219
8.2 Distribution of Observed and Predicted Counts of Articles 220
8.3 Distribution of Counts for the Poisson Regression Model 222
8.4 Comparisons of the Mean Predicted Probabilities From the Poisson and Negative Binomial Regression Models 229
8.5 Probability Density Function for the Gamma Distribution 232
8.6 Comparisons of the Negative Binomial and Poisson Distributions 234
8.7 Distribution of Counts for the Negative Binomial Regression Model 235
8.8 Probability of 0's From the Poisson and Negative Binomial Regression Models 238
8.9 Comparison of the Predictions From Four Count Models 248
9.1 Similarities Between the Tobit and Probit Models 253
9.2 Similarities Among the Ordinal Regression, Two-Limit Tobit, and Grouped Regression Models 255
List of Tables
distribution, usually not the normal. For example, the log of the odds of some binary outcome is regressed on the usual linear combination of explanatory variables, with the underlying conditional distribution of the binary outcomes taken to be binomial.

In this volume, Scott Long addresses these and related kinds of statistical models. I am very pleased to add Scott Long's Regression Models for Categorical and Limited Dependent Variables to the series. The topics are of both practical and theoretical importance, and Professor Long has done an excellent job of exposition. The book is well suited as a text for graduate students in the social and biomedical sciences. It will also serve as a wonderful reference.

Finally, a word about software. For most of the procedures discussed in this book there exist statistical routines in all of the major statistical packages. This is both a blessing and a curse. The blessing is that minimal computer skills are required. The curse is that minimal computer skills are required. Right answers and wrong answers are easy to obtain. With this in mind, Professor Long discusses some of the most popular software. This too deserves serious study.

RICHARD BERK

Preface
This book is about regression models that are appropriate when the
dependent variable is binary, ordinal, nominal, censored, truncated, or
counted. 1 refer to these outcomes as categorical and limited dependent
variables (CLDVs, for short). Within the last decade, advances in sta-
tistical software and increases in computing power have made it nearly
as easy to estimate models for CLDVs as the linear regression model.
This is reflected in the rapidly increasing use of these models. Nearly ev-
ery issue of major journals in the social sciences contains examples of
models such as logit, probit, or negative binomial regression. While com-
putational problems have largely be en eliminated, the models are more
difficult to learn and to use. There are two quite different reasons for
this. First, the models are nonlinear. As readers willlearn well, the non-
linearity of many models for CLDVs makes interpretation of the results
more difficult. With the linear regression model, most of the work is done
when the estimates are obtained. With models for CLDVs, the task of
interpretatíon is just beginning. Unfortunately, all too often when these
models are used, the substantive meaning of the parameters is incom-
pletely explained, incorrectly explained, or simply ignored. Sometimes
only the statistical significance or possibly the sign ís mentioned. A sec-
ond reason that these models are difficult to learn is that while models
for CLDVs are more complicated than the linear regression model, most
Abbreviations and Notation
The following abbreviations and notation are used throughout the book. While I have tried to use consistent notation and to avoid using the same symbol for more than one purpose, there are a few exceptions, such as λ being used as the inverse Mills ratio and the logistic distribution.

Abbreviations

∂y/∂x = the partial change in y with respect to x, holding other variables constant; also called the marginal effect
ε = the error in an equation (e.g., y* = α + βx + ε)
θ = a vector of parameters [e.g., θ = (α β σ)′]
λ, Λ = the pdf and cdf for the standard logistic distribution with mean 0 and variance π²/3
λ, Λ = the pdf and cdf for the standardized logistic distribution with variance 1
1 Introduction
agree, agree, and strongly disagree. Items asking the frequency of occurrence use the categories often, occasionally, seldom, and never. Political orientation may be classified as radical, liberal, and conservative. Educational attainment can be measured in terms of the highest degree received, with the ordinal categories of less than high school, high school, and graduate school. Military rank and civil service grade are also ordinal.

• Nominal variables occur when there are multiple outcomes that cannot be ordered. Occupations can be grouped as manual, trade, blue collar, white collar, and professional. Marital status might be coded as single, married, divorced, and widowed. Political parties in European countries can be considered nominal classifications. Studies of brand preference may include choices among unordered alternatives.

• Censored variables occur when the value of a variable is unknown over some range of the variable. The classic example is expenditures for durable goods: individuals with less income than the price of the cheapest durable good will have zero expenditure. Measures of workers' wages are restricted on the lower end by the minimum wage rate. Variables measuring percentage, such as the percentage of homes damaged in a natural disaster, are censored below at 0 and above at 100. Censoring can also occur for other reasons. In the 1990 Census, all salaries greater than $140,000 were recorded as $140,000 to ensure confidentiality.

• Count variables indicate the number of times that some event has occurred. How often did a person visit the doctor last year? How many jobs did someone have? How many strikes occurred? How many articles did a scientist publish? How many demonstrations occurred? How many children did a couple have? How many years of formal education were completed? How many newspapers were founded during a given period?

The level of measurement of a variable is not always clear or unambiguous. Indeed, you might disagree with some of the examples given above. Carter notes that "... statements about levels of measurement of a [variable] cannot be sensibly made in isolation from the theoretical and substantive context in which the [variable] is to be used. Claims that a variable is somehow 'intrinsically' interval (ordinal, nominal) are analytically misleading." Education is a good example. Education can be measured as a binary variable that distinguishes those with a high school education or less from others. Or, it could be ordinal, depending on the highest degree received: junior high, high school, college, or graduate. Or, it can be a count variable indicating the number of years of school completed. Each of these is reasonable and appropriate depending on the substantive purpose of the analysis.

Once the level of the dependent variable is determined, it is important to match the model used to the level of measurement. If the model chosen assumes the wrong level of measurement, the estimator could be biased, inefficient, or simply inappropriate. Fortunately, there are a large number of models specifically designed for CLDVs. Binary logit and probit are appropriate for binary outcomes. The ordered logit and probit models explicitly deal with the ordered nature of the dependent variable. Multinomial logit is appropriate for nominal outcomes. The tobit model is designed for censored outcomes. Furthermore, a variety of models such as Poisson and negative binomial regression can be used for count outcomes. These and related models are the subject of this book.

Until recently, the greatest obstacle in using models for CLDVs was the lack of software that was flexible, stable, and easy to use. This limitation no longer applies since these models can be estimated routinely with standard software. Now, the greatest impediment is the complexity of the models and the difficulty in interpreting the results. The difficulties arise because most models for CLDVs are nonlinear.

1.1. Linear and Nonlinear Models

The linear regression model is linear, while most models for CLDVs are nonlinear. This difference is so basic for understanding the materials in later chapters that I begin with a general overview of the implications of nonlinearity for interpreting the effects of independent variables. Just as the nonlinearities introduced by relativity theory made physical models substantially more complicated than their Newtonian counterparts, the use of nonlinear statistical models has added new complications for the data analyst.

Figure 1.1 shows a linear and a nonlinear model predicting the dependent variable y. Each model has two independent variables: x is continuous and d is dichotomous with values 0 and 1. To keep the example simple, I assume that there is no random error. Panel A plots the linear model

y = α + βx + δd    [1.1]

The solid line beginning at α plots y as x changes when d = 0: y = α + βx. The dashed line beginning at α + δ plots y as x changes when d = 1.
[Figure 1.1, Panel A: Linear Model]

The partial derivative, often called the marginal effect, is the ratio of the change in y to the change in x, when the change in x is infinitely small, holding d constant. In a linear model, the partial derivative is the same at all values of x and d. Consequently, when x increases by one unit, y increases by β units regardless of the current level of x or d. This is shown in panel A by the four small triangles with bases of length 1 and heights of length β.

The effect of d cannot be computed by taking the partial derivative since d is not continuous. Instead, we measure the discrete change in y as d changes from 0 to 1, holding x constant:

Δy/Δd = (α + βx + δ·1) − (α + βx + δ·0) = δ

Panel B plots the nonlinear model

y = g(x, d)    [1.2]

where g is a nonlinear function. For example, for the logit model of Chapter 3, Equation 1.2 becomes

ln[y/(1 − y)] = α* + β*x + δ*d    [1.3]
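The contrast between Equations 1.1 and 1.3 can be sketched numerically. The following Python fragment uses made-up coefficient values (not estimates from the book): in the linear model a unit increase in x changes y by the same amount at every starting point, while in the logit model the change in the probability depends on where x starts.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
alpha, beta, delta = -2.0, 0.5, 1.0

def linear(x, d):
    # Equation 1.1: y = alpha + beta*x + delta*d
    return alpha + beta * x + delta * d

def logit_prob(x, d):
    # Equation 1.3 solved for the probability:
    # Pr(y=1) = 1 / (1 + exp(-(alpha + beta*x + delta*d)))
    return 1.0 / (1.0 + np.exp(-(alpha + beta * x + delta * d)))

for x in (0.0, 4.0):
    dy_lin = linear(x + 1, 0) - linear(x, 0)          # always equals beta
    dy_log = logit_prob(x + 1, 0) - logit_prob(x, 0)  # depends on x
    print(f"x={x}: linear change={dy_lin:.3f}, logit change={dy_log:.3f}")
```

The linear change is 0.5 in both cases; the logit change differs, which is the nonlinearity the text describes.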
(Show this.¹) The dependent variable is now ln[y/(1 − y)], a quantity known as the logit. The logit increases by β* units for every unit increase in x, holding d constant. As with Equation 1.1, this is true regardless of the level of x or d. The problem is that it is often unclear what a unit increase in the logit means. For example, an increase of β* in the logit is meaningless to most people.

One of the difficulties in effectively using models for CLDVs is interpreting the nonlinear effects of the independent variables. An all too common, albeit unnecessary, solution is to talk only about the statistical significance of coefficients without indicating how these parameters relate to changes in the outcome of interest. A key objective of this book is to show how models for CLDVs can be effectively interpreted.

Throughout the book, I use the term "effect" to refer to a change in an outcome for a change in an independent variable, holding all other variables constant. For example, in the probit model the effect of education on labor force participation might be described as: for an additional year of education, the probability of being in the labor force increases, holding all other variables at their means. Or, for count models we might conclude: for each increase in income of $1000, the expected number of children in the family decreases by 5%, holding all other variables constant. The interpretation of an "effect" as causal depends on the nature of the problem being analyzed and the assumptions that a researcher is willing to make. For a detailed discussion of the issues involved in causal inferences, see Sobel (1995) and the literature cited therein.

1.2. Organization

Chapter 2 reviews the linear regression model to highlight issues that are important for the models in later chapters. Maximum likelihood estimation is introduced within this familiar context to make it easier to understand how to apply this method to the models in later chapters. Chapter 3 presents models for binary outcomes. I begin with regression of a binary variable to illustrate how CLDVs can cause violations of the assumptions of the linear regression model. Binary probit and logit are first derived using an unobserved or latent dependent variable. I then show how the same model can be understood as a nonlinear probability model without appealing to a latent variable. Issues of identification are introduced to explain the apparent differences in results from the logit and probit models. Since numerical methods are often necessary for estimating these models, as well as later models, these methods are discussed in some detail. I also introduce a variety of approaches for interpreting the results from nonlinear models. These techniques are the basis for interpreting all of the models in later chapters. Chapter 4 reviews standard statistical tests associated with maximum likelihood estimation, and considers a variety of measures for assessing the fit of a model. Chapter 5 extends the binary logit and probit models to ordered outcomes. While the resulting ordered logit and probit models are simple extensions of their binary counterparts, having additional outcome categories makes interpretation more complex. Chapter 6 presents the multinomial and conditional logit models for nominal outcomes. The greatest difficulty in using these models is the large number of parameters required and the corresponding problems of interpretation. Chapter 7 considers models with censored and truncated dependent variables, with a focus on the tobit model. The tobit model is developed in terms of a latent variable that is mapped to the observed, censored outcome. The chapter ends by considering a number of related models, including models for sample selection bias. Chapter 8 presents models for count outcomes, beginning with the Poisson regression model. Negative binomial regression and zero modified models are considered as alternatives that allow for overdispersion or heteroscedasticity in the data. Chapter 9 compares and contrasts the models from earlier chapters, and discusses the links between these models and models not discussed in the book, such as log-linear and event history models.

The material in this book can be learned most effectively by reading the chapters in order, but it is possible to skip some chapters or to change the order in which others are read. Everyone should read Chapter 2 to learn the basic terminology and notation. Chapter 3 is essential for all that follows since it introduces key concepts, such as latent variables, and methods of interpretation, such as discrete change. Those who are familiar with Wald and likelihood ratio tests can skip that section of Chapter 4. The discussion of assessing fit in Chapter 4 is not needed for later chapters. Chapter 5 on ordinal outcomes can be read after Chapter 6 on nominal outcomes. Chapter 8 on count models builds on the results for truncated distributions in Chapter 7 to develop the zero modified models. However, most of Chapter 8 is accessible without reading Chapter 7.

¹ Exercises for the reader are in italics. Solutions are found in the Appendix.
While each model studied has unique characteristics, there are important similarities among the models that are exploited. First, each model has the same systematic component (McCullagh and Nelder, 1989): each model enters the independent variables as a linear combination: β0 + β1x1 + ... + βKxK. Consequently, in specifying your model you can use all of the "tricks" that you know for entering variables in the linear regression model: nominal variables can be coded as a set of binary variables; nonlinearities can be introduced by transforming the variables; the effects of an independent variable can differ by group by adding interaction variables; and so on. Second, each model is estimated by maximum likelihood. Once the general characteristics of maximum likelihood are understood and the associated statistical tests are learned, these can be applied to all of the models. Third, the same ideas are used for interpreting each model. Expected values and discrete changes are computed at interesting values of the independent variables and are presented in plots or tables. Finally, whenever possible the mathematical tools used for one model are carried over in the presentation of later models.

Many of these models can be derived in different ways. For example, the logit model can be developed as a latent variable model in which the observed variable is an imperfect measurement of an underlying latent variable. Or, the model can be derived as a discrete choice model in which an individual chooses the outcome that provides the maximum utility. Or, the model can be viewed as a probability model with the characteristic S-shaped relationship between independent variables and the probability of an event. Each of these approaches results in the same formula relating the independent variables to the probability. I show alternative derivations of some models in order to highlight different characteristics of the models. This also serves to link my presentation to the diverse literature in which these models were developed.

Models for CLDVs were often developed independently in different fields, such as engineering, statistics, and econometrics, with very little contact across the fields. Consequently, there is no universally accepted notation or terminology. For example, the ordered logit model of Chapter 5 is also known as the ordinal logit model, the proportional odds model, the parallel regression model, and the grouped continuous model. I have tried to use what appears to be emerging as standard notation within the social sciences. Every effort has been made to keep the notation consistent across chapters. On rare occasions, this has resulted in notation that is different from that commonly used in the literature. To help you keep the notation clear, a table of notation is given on pages xxvii to xxx.

1.3. Orientation

Before ending this chapter, a few words about the general orientation of this book are in order. This is a book about data analysis rather than about statistical theory. The mathematics has been kept as simple as possible without oversimplifying the models in ways that could result in misuse or misunderstanding. The mathematics that is used, however, is essential for understanding the correct application of these models. To master the methods, it is important to work with the equations and to try some derivations on your own. To help you do this, I have included exercises in italics at various points. In the long run, it will be worth your while to think about each of these questions before proceeding. Brief answers to the exercises are given in the Appendix.

Seeing how these models can be applied in substantive research is also important for understanding the models. Accordingly, each chapter includes a substantive example that is used to illustrate the interpretation of each model. You are also encouraged to apply these models to your own data while you are reading. To this end, comments are given about four statistical packages for estimating models for CLDVs: LIMDEP Version 7 (Greene, 1995), Markov Version 2 (Long, 1993), SAS Version 6 (SAS Institute, 1990a), and Stata Version 5 (Stata Corporation, 1997). These comments are not designed to teach you how to use these packages, but rather are general comments about difficulties that might be encountered with any statistical package. While nearly all of the analyses in the book were done with my program Markov (Long, 1993) written in GAUSS (Aptech Systems Inc., 1996), any of these four packages could have been used for most analyses. To help you use these methods, I have placed the data sets, programs, and output for the examples on my homepage (http://www.indiana.edu/~jsl650); you can also access the Sage Website http://www.sagepub.com/sagepage/authors.HTM for information.

While this book contains what I believe are the most basic and useful methods for the analysis of CLDVs, a number of important topics were excluded due to limitations of space. Topics that have not been discussed include: robust and nonparametric methods of estimation, specification tests (Davidson & MacKinnon, 1993, pp. 522-528; Greene, 1993, pp. 648-650), complex sampling, multiple equation systems (see Browne & Arminger, 1995, for a review), and hierarchical models (Longford). Additional citations are given in later chapters.
2 Continuous Outcomes: The Linear Regression Model

y_i = β0 + β1 x_i1 + ... + βK x_iK + ε_i    [2.1]

where y is the dependent variable, the x's are independent variables, and ε is a stochastic error. The subscript i is the observation number from N random observations. β1 through βK are parameters that indicate the effect of a given x on y. β0 is the intercept, which indicates the expected value of y when all of the x's are 0. The model can be written in matrix form as y = Xβ + ε.
[Figure 2.1. Simple Linear Regression Model With the Distribution of y Given x]
The taller the curve, the more likely it is to have an error of that value. The errors are homoscedastic since the variance of the error distribution is the same for each x. While the curves are drawn as normal, normality is not required for the errors to be homoscedastic.

In Chapter 1, derivatives and discrete change were used to describe the effects of an independent variable on the dependent outcome. Even though these two measures of change give identical answers for the LRM, I consider both in order to introduce ideas that are critical in later chapters. The subscript i is dropped to simplify the notation. The derivative of y with respect to xk is

∂y/∂xk = βk

This means that when xk increases by one unit, y is expected to change by βk units, holding other x's constant. In the LRM:

• Having characteristic xk (as opposed to not having the characteristic) results in an expected change of βk in y, holding all other variables constant.

• For a standard deviation increase in xk, y is expected to change by βk^Sx units, holding all other variables constant.

The slope coefficient is represented in Figure 2.1 by small triangles. The base of each triangle is one unit long, with the rise in the triangle equal to β. Thus, for a unit increase in x, whether starting at x2, x3, or any other value of x, y is expected to increase by β units.

The interpretation of the coefficients in the LRM differs in two important respects from the nonlinear models in later chapters. First, in
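The claim that the derivative and the discrete change coincide in the LRM is easy to check numerically. A minimal sketch, with made-up coefficients rather than any of the book's estimates:

```python
# Hypothetical LRM with no error term, for illustration only:
# y = b0 + b1*x1 + b2*x2
b0, b1, b2 = 1.0, 0.25, -0.4

def y(x1, x2):
    return b0 + b1 * x1 + b2 * x2

x1, x2 = 3.0, 7.0
h = 1e-6
marginal = (y(x1 + h, x2) - y(x1, x2)) / h   # numerical dy/dx1
discrete = y(x1 + 1, x2) - y(x1, x2)         # change for a unit increase in x1

# Both equal b1 = 0.25, at any values of x1 and x2
print(marginal, discrete)
```

In the nonlinear models of later chapters these two quantities are no longer equal, which is why the book treats them as distinct measures.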
nonlinear models, ∂E(·)/∂xk depends on the value of xk and on the values of the other x's in the model. Second, in nonlinear models, ∂E(·)/∂xk does not equal ΔE(·)/Δxk. It is extremely important to avoid generalizing the interpretation of the LRM to the models in later chapters.

2.3. Estimation by Ordinary Least Squares

Ordinary least squares (OLS) is the most frequently used method of estimation for the LRM. The OLS estimator of β is that value β̂ that minimizes the sum of the squared residuals: Σ_{i=1}^{N} (y_i − x_i β̂)². The resulting estimator is

β̂ = (X′X)⁻¹ X′y

with the covariance matrix:

           | Var(β̂0)       Cov(β̂0, β̂1)  ...  Cov(β̂0, β̂K) |
Var(β̂) =  | Cov(β̂1, β̂0)  Var(β̂1)       ...  Cov(β̂1, β̂K) |
           | ...                                            |
           | Cov(β̂K, β̂0)  Cov(β̂K, β̂1)  ...  Var(β̂K)      |

If the assumptions of the model hold, the OLS estimator is the best linear unbiased estimator. This means that if the assumptions hold, the OLS estimator is an unbiased estimator [i.e., E(β̂) = β] that has the minimum variance among all linear estimators.

To estimate Var(β̂), we need an estimate of the variance of the errors, σ². Defining the residual as e_i = y_i − x_i β̂, we can use the unbiased estimator:

s² = [1/(N − K − 1)] Σ_{i=1}^{N} e_i²

where K is the number of independent variables. This allows us to estimate the covariance matrix as Var̂(β̂) = s²(X′X)⁻¹. If the errors are normal, then

t_k = (β̂k − β*) / σ̂(β̂k)

has a t-distribution with N − K − 1 degrees of freedom and can be used to test the hypothesis that H0: βk = β*. Without assuming normality, t_k has a t-distribution as the sample becomes infinitely large (Greene, 1993, pp. 299-301). Issues involved in testing hypotheses are discussed in Chapter 4.

Example of the LRM: Prestige of the First Job

Long et al. (1980) examined factors that affect the prestige of a scientist's first academic job for a sample of male biochemists. Their primary interest was whether characteristics associated with scientific productivity were more important than characteristics associated with educational background. Here I extend those analyses to include information on female scientists.

The dependent variable is the prestige of the first job (JOB). Prestige is rated on a continuous scale from 1.00 to 5.00, with schools from 1.00 to 1.99 classified as adequate, those from 2.00 to 2.99 as good, those from 3.00 to 3.99 as strong, and those above 3.99 as distinguished. Graduate programs rated below adequate or departments without graduate programs were coded as 1.00. The implications of this decision are considered in Chapter 7 when this example is used to illustrate the tobit model. The independent variables are described in Table 2.1. Our regression model is

JOB = β0 + β1 FEM + β2 PHD + β3 MENT + β4 FEL + β5 ART + β6 CIT + ε

Table 2.2 presents the estimates of the unstandardized and standardized coefficients. t-values are also presented, but are not discussed until later.

TABLE 2.1 Descriptive Statistics for the First Academic Job Example

Name   Mean    Standard Deviation  Minimum  Maximum  Description
JOB    2.23    0.97                1.00     4.80     Prestige of job (from 1 to 5)
FEM    0.39    0.49                0.00     1.00     1 if female; 0 if male
PHD    3.20    0.95                1.00     4.80     Prestige of Ph.D. department
MENT   45.47   65.53               0.00     532.00   Citations received by mentor
FEL    0.62    0.49                0.00     1.00     1 if held fellowship; else 0
ART    2.28    2.26                0.00     18.00    Number of articles published
CIT    21.72   33.06               0.00     203.00   Number of citations received

NOTE: N = 408.
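The OLS formulas above translate directly into matrix operations. The sketch below uses simulated data (not the biochemist sample) to compute β̂ = (X′X)⁻¹X′y, the error variance estimate s², the covariance matrix s²(X′X)⁻¹, and the resulting t-statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 2
# Design matrix: a column of 1's for the intercept plus K regressors
X = np.column_stack([np.ones(N), rng.normal(size=(N, K))])
beta_true = np.array([1.0, 0.5, -0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# beta-hat = (X'X)^{-1} X'y, computed via a linear solve for stability
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# s^2 = sum(e_i^2) / (N - K - 1), with residuals e = y - X beta-hat
e = y - X @ beta_hat
s2 = e @ e / (N - K - 1)

# Estimated covariance matrix and t-statistics for H0: beta_k = 0
cov = s2 * np.linalg.inv(X.T @ X)
t = beta_hat / np.sqrt(np.diag(cov))
print(beta_hat, s2, t)
```

With this seed the estimates land close to the true coefficients, and s² is close to the true error variance of 0.25.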
20 REGRESSION MODELS Continuous Outcomes 21
βˢˣ is an x-standardized coefficient; βˢʸ is a y-standardized coefficient; t is a t-test of β. The coefficients for FEM and CIT can be used to illustrate the interpretation of coefficients.

• Unstandardized coefficients: Being a female scientist decreases the expected prestige of the first job by .14 points on a five-point scale, holding all other variables constant. For every additional citation, the prestige of the first job is expected to increase .004 units, holding all other variables constant. (The effect is small due to the large standard deviation in CIT.)

• x-standardized coefficients: For every standard deviation increase in citations, the prestige of the first job is expected to increase by .15 units, holding all other variables constant.

• y-standardized coefficients: Being a woman decreases the expected prestige of the first job by .14 standard deviations, holding all other variables constant. For every additional citation, the prestige of the first job is expected to increase .005 standard deviations, holding all other variables constant. (The unstandardized and y-standardized coefficients are nearly identical since the variance of y is about 1.)

• Fully standardized coefficients: For every standard deviation increase in citations, the prestige of the first job is expected to increase by .15 standard deviations, holding all other variables constant.

Fully standardized and y-standardized coefficients are used to interpret many of the models in later chapters.

Taking the log of Equation 2.2, the resulting equation is linear in ln(z) even though it is nonlinear in z. Accordingly, the slope β₁ can be interpreted as discussed above: for a unit increase in x₁, ln(z) is expected to increase by β₁ units, holding x₂ constant. Note, however, that a β₁ unit increase in ln(z) from 1 to 1 + β₁ involves a different change in z than a change in ln(z) from, say, 2 to 2 + β₁. This can be seen by taking the derivative of z with respect to x₁:

∂z/∂x₁ = ∂ exp(β₀ + β₁x₁ + β₂x₂ + ε)/∂x₁ = exp(β₀ + β₁x₁ + β₂x₂ + ε) β₁ = zβ₁

Thus, even though the expected change in y = ln(z) is the same regardless of the current levels of x₁ and x₂, the change in z [not ln(z)] depends on the level of z.

Equation 2.2 is an example of a class of nonlinear models known as log-linear models: while z is nonlinearly related to the x's, the log of z is linearly related to the x's. Since the logit models of Chapters 3, 4, and 6 and the count models of Chapter 8 are log-linear models, it is worth considering a simple method of interpretation that can be used for any log-linear model.

Since exp(a + b) = exp(a) exp(b), Equation 2.2 can be written as

z(x₁) = exp(β₀) exp(β₁x₁) exp(β₂x₂) exp(ε)

where z(x₁) indicates the value of z when x₁ has a given value. Consider what happens when x₁ increases by 1, from x₁ to x₁ + 1. Then exp(β₁) is the multiplicative factor change in z for a unit increase in x₁:

z(x₁ + 1)/z(x₁) = exp(β₁)

This leads to the following interpretation:

• For a unit increase in x₁, z is expected to change by the factor exp(β₁), holding all other variables constant.

The percentage change in z for a unit change in x₁ can be computed as

100 × [z(x₁ + 1) − z(x₁)]/z(x₁) = 100[exp(β₁) − 1]

This can be interpreted as:

• For a unit increase in x₁, z is expected to change by 100[exp(β₁) − 1]%, holding all other variables constant.

Note that other nonlinear models do not have this simple interpretation in terms of a factor or a percentage change.
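The factor and percentage change interpretations are easy to sketch numerically (Python; the example value of β₁ is ours, not from the text):

```python
import math

def factor_change(b1):
    # Multiplicative change in z for a unit increase in x1: exp(b1)
    return math.exp(b1)

def percent_change(b1):
    # Percentage change in z for a unit increase in x1: 100[exp(b1) - 1]
    return 100.0 * (math.exp(b1) - 1.0)

# e.g., b1 = 0.2 multiplies z by about 1.22, a roughly 22% increase
```

A coefficient of 0 yields a factor of exp(0) = 1, i.e., no change in z, as the interpretation requires.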
In Equation 2.3,

y = β₀ + β₁x₁ + ··· + βKxK + ε   [2.3]

we assume that E(ε | x) = 0. Consider a simple modification where we now assume that E(ε | x) = δ, where δ is an unknown, nonzero constant. We can modify Equation 2.3 so that the new error will have a zero mean:

y = (β₀ + δ) + β₁x₁ + ··· + βKxK + (ε − δ)
  = β₀* + β₁x₁ + ··· + βKxK + ε*

We have subtracted the mean of ε from ε to create a new error ε* with a zero mean. (Show that the mean of ε* is 0.) To maintain the equality, we also added δ, which is combined with β₀ and relabeled as β₀*. The resulting equation has all of the properties of the LRM, including a mean of 0 for the error ε*. Consequently, we can use OLS to obtain best, linear, unbiased estimates of β₀* (not β₀) and the βk's. The expected value of β̂₀* is a combination of the intercept β₀ and the mean of ε: E(β̂₀*) = β₀ + δ. No matter how large the sample, it is impossible to disentangle estimates of β₀ and δ. More formally, β₀ and δ are not identified individually, although their sum β₀ + δ is identified.

Since the idea of identification is essential for understanding models for CLDVs, it is worth reinforcing the key ideas with Figure 2.2. Assume that the sample data, which are indicated by the dots, are generated by the model y = α + βx + ε, where ε is normally distributed with mean δ. The solid line represents E(y | x) = α + βx.

Figure 2.2. Identification of the Intercept in the Linear Regression Model

As would be expected, the
observed data are located approximately δ = E(ε | x) units above the E(y | x) line. The OLS estimate of the regression line is the dashed line that runs through the observations, with intercept α̂* and slope β̂. The estimate of the slope appears unaffected by the nonzero mean of the errors, and is approximately equal to β. Consistent with our earlier argument, the estimated intercept is about δ units above the true α as a consequence of the nonzero mean of the errors. While neither α nor δ is identified, the sum α + δ is identified and can be estimated by α̂*.

This example illustrates a number of critical ideas related to the notion of identification. First, a parameter is unidentified when it is impossible to estimate the parameter regardless of the data available. Identification is a limitation of the model that cannot be remedied by increasing the sample size. Second, models become identified by adding assumptions. The intercept is identified if we assume that E(ε | x) = 0; without this assumption it is unidentified. Third, it is possible for some parameters to be identified while others are not. Thus, while β₀ is not identified unless the value of E(ε | x) is assumed, β₁ through βK are identified without this assumption. Finally, while individual parameters may not be identified, combinations of those parameters may be identified. Thus, while neither δ nor β₀ is identified, the sum β₀ + δ is identified. These ideas are important for understanding how we identify the models in later chapters.

2.5.2. The x's and ε Are Correlated

The assumption E(ε | x) = 0 implies that the x's and ε are uncorrelated. In practice, there are several reasons why the x's might be correlated with the errors, including reciprocal effects among variables, measurement error, incorrect functional form, and β's that differ across observations (Kmenta, 1986, pp. 334-350). Here I consider the effect of excluding a variable, since this will help us understand the tobit model in Chapter 7.

If we estimate a model that excludes an independent variable which is correlated with included independent variables, the OLS estimates are biased and inconsistent. Kmenta (1986, pp. 443-446) shows that this is due to the correlation between the error and the independent variables in the model. To see why they are correlated, assume that y is generated by the model

y = β₀ + β₁x₁ + β₂x₂ + ε   [2.4]

but that we have estimated the model

y = β₀ + β₁x₁ + υ   [2.5]

The error υ absorbs the excluded variable x₂ and the original ε:

υ = β₂x₂ + ε

If x₁ and x₂ are correlated, then υ and x₁ must be correlated. (Why must this be the case?) Consequently, the OLS estimate of β₁ in Equation 2.5 is a biased and inconsistent estimate of β₁ in Equation 2.4.

2.6. Maximum Likelihood Estimation

If we assume that the errors are normally distributed, the LRM can be estimated by maximum likelihood (ML). While the OLS and ML estimators of β are identical for the LRM, I introduce ML estimation within the familiar context of regression to make it easier to understand the application of ML to the models in later chapters.

2.6.1. Introduction to ML Estimation

Consider the problem of estimating the probability of having a given number of men in your sample. The binomial formula computes the probability of having s men in a sample of size N with the population parameter π indicating the probability of being male:

Pr(s | π, N) = [N!/(s!(N − s)!)] πˢ(1 − π)^(N−s)   [2.6]

where k! = k·(k − 1)···2·1. For example, the probability of having three men in a sample of 10 with the probability of being a male equal to .5 is

Pr(s = 3 | π = .5, N = 10) = [10!/(3!7!)] .5³(1 − .5)⁷ = 0.117

This is a typical problem in probability. We know the formula for the probability distribution and the values of the parameters π and N. We want to know the probability of a particular outcome s. In statistics, we know s and N, but want to estimate π from the sample information. The ML estimate is that value of the parameter that makes the observed data most likely.
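Equation 2.6 and the ML logic can both be checked numerically (Python sketch; the grid search over π is our illustrative device, not the text's method):

```python
import math

def binom_prob(s, n, pi):
    # Equation 2.6: Pr(s | pi, N)
    return math.comb(n, s) * pi ** s * (1 - pi) ** (n - s)

# Probability of 3 men in a sample of 10 with pi = .5
p = binom_prob(3, 10, 0.5)

# ML estimate of pi given s = 3, N = 10: the value of pi that makes
# the observed data most likely, found here by a crude grid search
grid = [i / 1000 for i in range(1, 1000)]
pi_hat = max(grid, key=lambda v: binom_prob(3, 10, v))
```

The grid maximum lands at π̂ = .3 = s/N, matching the value read off Figure 2.3.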
To continue our example, assume that we know that s = 3 and N = 10, but do not know π. What value of π is most likely to have generated the observed data? Figure 2.3 plots the probability of observing three men in 10 tries for all values of π. The resulting curve shows that the maximum probability occurs at π = .3. This is our ML estimate. At the maximum, the derivative of the likelihood function is zero:

∂L(π | s, N)/∂π = 0

This is represented in Figure 2.3 by the line with slope 0 located at π = .3.

Figure 2.3. The Probability of s = 3 for Different Values of π

The value that maximizes the likelihood function also maximizes the log of the likelihood. Since it is easier to solve the derivative of the log likelihood than the derivative of the likelihood itself, the ML estimate is generally computed by solving the equation:

∂ln L(π | s, N)/∂π = 0

For our example,²

∂ln L(π | s = 3, N = 10)/∂π = ∂ln[10!/(3!7!)]/∂π + ∂(3 ln π)/∂π + ∂(7 ln(1 − π))/∂π = 3/π − 7/(1 − π)

Setting this derivative equal to 0 and solving for π results in the ML estimate π̂ = .3.

Since μ is unknown, we write the likelihood function as L(μ | yᵢ, σ = 1) = f(yᵢ | μ, σ = 1). For three independent observations, the likelihood is the product of the individual likelihoods:

L(μ | y, σ = 1) = ∏ᵢ₌₁³ f(yᵢ | μ, σ = 1)

and the log likelihood is

ln L(μ | y, σ = 1) = Σᵢ₌₁³ ln f(yᵢ | μ, σ = 1)

The ML estimate is the value μ̂ that maximizes this equation.

To get a better sense of how the ML estimate is determined, consider Figure 2.4. Suppose that there are three observations with values 0, 1, and 2. These are represented in the figure as solid circles. The four panels correspond to a sequence of guesses for the value of μ that maximizes the likelihood. In panel A, the normal curve is centered on μ = 2. The likelihood of each point is indicated by a vertical line, with the overall likelihood equal to the product of the lengths of the lines: L(μ = 2 | y) = .005. Panel B computes the likelihood for μ = −1, resulting in L(μ = −1 | y) = .0001. To increase the likelihood, we need a value of μ somewhere between 2 and −1. Panel C shows μ = 1, resulting in L(μ = 1 | y) = .023. When we increase the mean slightly to 1.2 in panel D, the likelihood is reduced to .022. Of our four tries, μ = 1 produces the largest likelihood. Tentatively, we conclude that μ̂ML = 1.

In practice, ML estimation is more complicated. First, we would usually have more observations. Second, we would often be estimating more than one parameter (e.g., μ and σ). Finally, we would have to consider all possible values of the parameters being estimated, not just the four values in our figure. Still, the general ideas are the same.
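The four panel likelihoods in Figure 2.4 can be reproduced directly (Python sketch; the function names are ours):

```python
import math

Y = [0.0, 1.0, 2.0]  # the three observations

def phi(z):
    # standard normal pdf
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def lik(mu):
    # L(mu | y, sigma = 1): product of the individual likelihoods
    prod = 1.0
    for yi in Y:
        prod *= phi(yi - mu)
    return prod

# Panels A-D: mu = 2, -1, 1, 1.2
values = {mu: lik(mu) for mu in (2.0, -1.0, 1.0, 1.2)}
```

Rounding the results reproduces .005, .0001, .023, and .022, with μ = 1 (the sample mean) giving the largest likelihood.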
Panel A: L(μ = 2 | y) = .005.  Panel B: L(μ = −1 | y) = .0001.  Panel C: L(μ = 1 | y) = .023.  Panel D: L(μ = 1.2 | y) = .022.

Figure 2.4. Maximum Likelihood Estimation of μ From a Normal Distribution

2.6.4. ML Estimation for Regression

Maximum likelihood for the LRM is a direct extension of fitting a normal distribution to a set of points. Consider estimating the simple regression y = α + βx + ε using three observations: (x₁, y₁), (x₂, y₂), and (x₃, y₃). Panels A and B of Figure 2.5 compare the likelihoods for two sets of possible estimates. The observed data are indicated by circles. The assumed distribution of y conditional on x is represented by the normal curves, which should be visualized as coming out of the page into a third dimension. The likelihood of an observation for a given pair α and β is indicated by the length of the line from an observation, indicated by a circle, to the normal curve. In panel A, for α_a and β_a, we find that (x₃, y₃) is very unlikely, while (x₁, y₁) is quite likely. The likelihood of α_a and β_a is the product of the three lines in panel A. Clearly, α_a and β_a are not the ML estimates since it is easy to find other estimates that increase the likelihood, such as α_b and β_b in panel B. The ML estimates are those values α̂ and β̂ that make the likelihood as large as possible.

Mathematically, we can develop the ML estimator for the LRM as follows. Since y conditional on x is distributed normally with mean α + βx and variance σ², the pdf for an observation is

f(yᵢ | α + βxᵢ, σ) = [1/(σ√(2π))] exp( −(yᵢ − α − βxᵢ)²/(2σ²) )   [2.7]
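Equation 2.7 makes the comparison in Figure 2.5 concrete. A short sketch with three made-up points, σ fixed at 1, and two candidate lines of our own choosing:

```python
import math

DATA = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # (x_i, y_i), made up

def log_lik(a, b, sigma=1.0):
    # Sum of ln f(y_i | a + b*x_i, sigma) using the pdf in Equation 2.7
    total = 0.0
    for x, y in DATA:
        r = y - a - b * x
        total += -math.log(sigma * math.sqrt(2 * math.pi)) - r * r / (2 * sigma ** 2)
    return total

good = log_lik(0.0, 1.0)  # a line passing through all three points
poor = log_lik(1.0, 0.0)  # a flat line: larger residuals, smaller likelihood
```

The line with smaller residuals has the larger log likelihood, exactly the comparison drawn between panels A and B.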
Taking logs results in the log likelihood

ln L(α, β, σ | y, X) = Σᵢ₌₁ᴺ ln f(yᵢ | α + βxᵢ, σ)

The covariance matrix of the ML estimates is Var(θ̂) = −E[H(θ)]⁻¹. For our example, with θ = (α, β, σ)′,

−E[H(θ)] =
  ( −E(∂²ln L(θ)/∂α∂α)   −E(∂²ln L(θ)/∂α∂β)   −E(∂²ln L(θ)/∂α∂σ) )
  ( −E(∂²ln L(θ)/∂β∂α)   −E(∂²ln L(θ)/∂β∂β)   −E(∂²ln L(θ)/∂β∂σ) )
  ( −E(∂²ln L(θ)/∂σ∂α)   −E(∂²ln L(θ)/∂σ∂β)   −E(∂²ln L(θ)/∂σ∂σ) )

Methods for estimating Var(θ̂) are considered in Chapter 3.

2.8. Bibliographic Notes

There are hundreds of texts dealing with the linear regression model. In order of increasing difficulty, I recommend Griffiths et al. (1993) for an introductory text; Kmenta (1986), Greene (1993), and Theil (1971) as intermediate texts; and Amemiya (1985) for an advanced treatment. Manski (1995) provides a detailed discussion of the identification problem. Four recommended sources on maximum likelihood, in order of increasing difficulty, are: Eliason (1993), Cramer (1986), Greene (1993, Chapter 12), and Davidson and MacKinnon (1993, Chapter 8).
3. Binary Outcomes: The Linear Probability, Probit, and Logit Models

…able. While I do not recommend the LPM, the model illustrates the problems resulting from a binary dependent variable, and motivates our discussion of the logit and probit models. The probit and logit models are developed first in terms of the regression of a latent variable. The latent variable is related to the observed, binary variable in a simple way: if the latent variable is greater than some value, the observed variable is 1; otherwise it is 0. This model is linear in the latent variable, but results in a nonlinear, S-shaped model relating the independent variables to the probability that an event has occurred. Given the great similarity between the logit and probit models, I refer to them jointly as the binary response model, abbreviated as BRM. The BRM is also developed as a nonlinear probability model. Within this context, the complementary log-log model is introduced as an asymmetric alternative to the logit and probit models.
TABLE 3.1 Descriptive Statistics for the Labor Force Participation Example

Name   Mean    Standard Deviation   Minimum   Maximum   Description
LFP     0.57    0.50     0.00    1.00    1 if wife is in the paid labor force; else 0
K5      0.24    0.52     0.00    3.00    Number of children ages 5 and younger
K618    1.35    1.32     0.00    8.00    Number of children ages 6 to 18
AGE    42.54    8.07    30.00   60.00    Wife's age in years
WC      0.28    0.45     0.00    1.00    1 if wife attended college; else 0
HC      0.39    0.49     0.00    1.00    1 if husband attended college; else 0
LWG     1.10    0.59    -2.05    3.22    Log of wife's estimated wage rate
INC    20.13   11.63    -0.03   96.00    Family income excluding wife's wages
TABLE 3.2 Linear Probability Model of Labor Force Participation

Variable    β         βˢˣ       t
Constant    1.144                9.00
K5         -0.295    -0.154    -8.21
K618       -0.011    -0.015    -0.80
AGE        -0.013    -0.103    -5.02
WC          0.164               3.57
HC          0.019               0.45
LWG                   0.072     4.07
NOTE: β is an unstandardized coefficient; βˢˣ is an x-standardized coefficient; t is the t-test of β.

There are several things to note about these interpretations. First, the effect of a variable is the same regardless of the values of the other variables. Second, the effect of a unit change in a variable is the same regardless of the current value of that variable. For example, if a woman has four young children, compared to no young children, her predicted probability of employment decreases by 1.18 (= 4 × −.295), which is impossible. This problem is considered in the next section. Finally, fully standardized and y-standardized coefficients are inappropriate for the binary outcome, and x-standardized coefficients are inappropriate for binary independent variables.

3.1.1. Problems With the LPM

While estimation of the parameters is unaffected by having a binary outcome, several assumptions of the LRM are necessarily violated.

Heteroscedasticity. If a binary random variable has mean μ, then its variance is μ(1 − μ). (Prove this.) Since the expected value of y given x is xβ, the conditional variance of y depends on x according to

Var(y | x) = Pr(y = 1 | x)[1 − Pr(y = 1 | x)] = xβ(1 − xβ)

which shows that the variance of the errors depends on the x's and is not constant. (Plot Var(y | x) as xβ ranges from −.2 to 1.2.) Since the errors of the LPM are heteroscedastic, the OLS estimator of β is inefficient and the standard errors are biased, resulting in incorrect test statistics.

Goldberger (1964, pp. 248-250) suggested that the LPM could be corrected for heteroscedasticity with a two-step estimator. In the first step, y is estimated by OLS. In the second step, the model is estimated with generalized least squares using Var(e) = ŷ(1 − ŷ) to correct for heteroscedasticity. While this approach increases the efficiency of the estimates, it does not correct for other problems with the LPM. Further, for ŷ < 0 or ŷ > 1, the estimated variance is negative and ad hoc adjustments are required.
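The suggested plot of Var(y | x) is easy to sketch numerically (Python; the grid of xβ values is ours), and it makes the problem visible: the implied variance is negative whenever xβ falls outside the unit interval.

```python
def lpm_var(xb):
    # Var(y | x) = xB(1 - xB) under the LPM
    return xb * (1 - xb)

# Heteroscedasticity: the variance changes with xB and peaks at xB = .5
inside = [lpm_var(v / 10) for v in range(0, 11)]

# For xB outside [0, 1], the "variance" is negative, an impossibility
neg = lpm_var(1.2)
```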
Normality. Consider a specific value of x, say x*. In Figure 3.1, E(y | x*) is represented by a diamond on the regression line. ε is the distance from E(y | x*) to the observed value. Since y can only have the values 0 and 1, which are indicated by the open circles, the error must equal either ε₁ = 1 − E(y | x*) or ε₀ = 0 − E(y | x*). Clearly, the errors cannot be normally distributed. Recall that normality is not required for the OLS estimates to be unbiased.

Nonsensical Predictions. The LPM predicts values of y that are negative or greater than 1. Given our interpretation of E(y | x) as Pr(y = 1 | x), this leads to nonsensical predictions for the probabilities. For example, using the means in Table 3.1 and the LPM estimates in Table 3.2, we find that a 35-year-old woman with four young children, who did not attend college nor did her husband, and who is average on the other variables, has a predicted probability of being employed of −.48. (Verify this result.) While unreasonable predictions are sometimes used to dismiss the LPM, such predictions at extreme values of the independent variables are also common in regressions with continuous outcomes.

Functional Form. Since the model is linear, a unit increase in xₖ results in a constant change of βₖ in the probability of an event, holding all other variables constant. The increase is the same regardless of the current value of x. In many applications, this is unrealistic. For example, with the LPM each additional young child decreases the probability of being employed by .295, which implies that a woman with four young children has a probability that is 1.18 less than that of a woman without young children, all other variables being held constant. More realistically, each additional child would have a diminishing effect on the probability. While the first child might decrease the probability by .3, the second child might only decrease the probability by a smaller amount, and so on. That is to say, the model should be nonlinear. In general, when the outcome is a probability, it is often substantively reasonable that the effects
of independent variables will have diminishing returns as the predicted probability approaches 0 or 1. In my opinion, the most serious problem with the LPM is its functional form.

The binary response model has an S-shaped relationship between the independent variables and the probability of an event, which addresses the problem with the functional form in the LPM. In the following section I develop this model in terms of a latent dependent variable. Section 3.4 shows how the logit and probit models can also be thought of as nonlinear probability models without appealing to a latent variable. And, in Chapter 6, the models are derived as discrete choice models in which an individual chooses the option that maximizes her utility.

3.2. A Latent Variable Model for Binary Variables

As with the LPM, we have an observed binary variable y. Suppose that there is an unobserved or latent variable y* ranging from −∞ to ∞ that generates the observed y's. Those who have larger values of y* are observed as y = 1, while those with smaller values of y* are observed as y = 0.

Since the notion of a latent variable is central to this approach to deriving the BRM, it is important to understand what is meant by a latent variable. Consider a woman's labor force participation as the observed y. The variable y can only be observed in two states: a woman is in the labor force, or she is not. However, not all women in the labor force are there with the same certainty. One woman might be very close to the decision of leaving the labor force, while another woman could be very firm in her decision. In both cases, we observe the same y = 1. The idea of a latent y* is that there is an underlying propensity to work that generates the observed state. While we cannot directly observe y*, at some point a change in y* results in a change in what we observe, namely, whether a woman is in the labor force. For example, as the number of young children in the family increases, it is reasonable that a woman's propensity to be in the labor force (as opposed to working at home) would decrease. At some point, the propensity would cross a threshold that would result in a decision to leave the labor force.

Can all binary outcomes be viewed as manifestations of a latent variable? Some researchers argue that invoking a latent variable is usually inappropriate, others believe that an underlying latent variable is perfectly reasonable in all cases, while most seem to take a middle ground. Regardless of your assessment of the use of a latent variable, it is important to realize that the derivation and application of the BRM is not dependent on your acceptance of the notion of a latent variable. Section 3.4 shows that the same model can be derived as a nonlinear probability model, without invoking the idea of a latent variable.

The latent y* is assumed to be linearly related to the observed x's through the structural model:

y*ᵢ = xᵢβ + εᵢ

The latent variable y* is linked to the observed binary variable y by the measurement equation:

yᵢ = 1 if y*ᵢ > τ;  yᵢ = 0 if y*ᵢ ≤ τ   [3.1]

where τ is the threshold or cutpoint. If y* ≤ τ, then y = 0. If y* crosses the threshold (i.e., y* > τ), then y = 1. For now, we assume that τ = 0. Section 5.2 (p. 122) discusses this assumption in detail.

The link between the latent y* and the observed y is illustrated in Figure 3.2 for the model y* = α + βx + ε. In this figure, y* is on the vertical axis, with the threshold τ indicated by a horizontal dashed line. The distribution of y* is shown for two values of x, which should be thought of as coming out of the figure into a third dimension. When y* is larger than τ, indicated by the shaded region, we observe y = 1.

Figure 3.2. The Distribution of y* Given x in the Binary Response Model
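The structural model and the measurement equation (3.1) can be sketched together as follows (Python; logistic errors, α = 0, and β = 1 are our illustrative assumptions):

```python
import math

TAU = 0.0  # threshold, assumed to be 0 as in the text

def observe(y_star):
    # Measurement equation 3.1: y = 1 if y* > tau, else y = 0
    return 1 if y_star > TAU else 0

def pr_y1(x, a=0.0, b=1.0):
    # With y* = a + b*x + e and standard logistic errors,
    # Pr(y = 1 | x) = Pr(e > -(a + b*x)), the logistic cdf at a + b*x
    return 1.0 / (1.0 + math.exp(-(a + b * x)))
```

The probability is monotonic in x but S-shaped: equal unit changes in x produce smaller changes in Pr(y = 1 | x) as the probability approaches 0 or 1.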
For example, at x₁ only a small proportion of the cases are expected to have y equal 1, while at x₂ nearly 90% are.

Since y* is continuous, the model avoids the problems encountered with the LPM. However, since the dependent variable is unobserved, the model cannot be estimated with OLS. Instead, we use ML estimation, which requires assumptions about the distribution of the errors. Most often, the choice is between normal errors, which result in the probit model, and logistic errors, which result in the logit model. For the probit model, the errors are assumed to be normal with cdf

Φ(ε) = ∫₋∞^ε (1/√(2π)) exp(−t²/2) dt

The cdf indicates the probability that a random variable is less than or equal to a given value. For example, Φ(0) = Pr(ε ≤ 0) = .5. (Find this point in Figure 3.3.)

For the logit model, the errors are assumed to have a standard logistic distribution with mean 0 and variance π²/3. This unusual variance is used because it results in a particularly simple equation for the pdf:

λ(ε) = exp(ε)/[1 + exp(ε)]²

and an even simpler equation for the cdf:

Λ(ε) = exp(ε)/[1 + exp(ε)]

These distributions are drawn with long dot-dashes in Figure 3.3 (panel A: pdf's for the logistic and normal distributions). The standard logistic pdf is flatter than the normal distribution since it has a larger variance. If we rescale the logistic distribution to have a unit variance, known as the standardized (not standard) logistic distribution, the logistic and normal cdf's are nearly identical, as shown in panel B of Figure 3.3. However, the pdf and cdf for the standardized logistic distribution with a unit variance are more complicated:

λ(ε) = γ exp(γε)/[1 + exp(γε)]²  and  Λ(ε) = exp(γε)/[1 + exp(γε)]   [3.2]

where γ = π/√3. Because of the simpler equations for the standard (not standardized) logistic distribution, it is generally used for deriving the logit model.
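The closeness of the standardized logistic and normal cdf's can be checked numerically (Python; Φ is computed through the error function, and the grid over [−3, 3] is ours):

```python
import math

GAMMA = math.pi / math.sqrt(3)  # rescaling that gives the logistic a unit variance

def std_logistic_cdf(e):
    # standardized (unit-variance) logistic cdf, Equation 3.2
    return math.exp(GAMMA * e) / (1 + math.exp(GAMMA * e))

def normal_cdf(e):
    # standard normal cdf via the error function
    return 0.5 * (1 + math.erf(e / math.sqrt(2)))

# The two unit-variance cdf's stay within a few hundredths of each other
max_gap = max(abs(std_logistic_cdf(v / 100) - normal_cdf(v / 100))
              for v in range(-300, 301))
```

The gap is small but not zero, which is why the variance-based scaling factor of 1.81 is only an approximation; the text's discussion of the constants 1.6 and 1.7 in the next section takes this up.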
3.3. Identification

Consider the structural models for the logit and probit models,

y*ᵢ = xᵢβ_L + ε_L  and  y*ᵢ = xᵢβ_P + ε_P

where L designates the logit model and P the probit model. Since y* is latent, it is impossible to determine their variances from the observed data. Accordingly, β_L and β_P are unidentified. For both models, identification is obtained by assuming the variance of the error ε. Since Var(ε_P | x) = 1 and Var(ε_L | x) = π²/3 (Why?), it follows that β_L ≈ 1.81 β_P, where 1.81 = π/√3. Dividing by 1.81 converts a logit coefficient to a probit coefficient. This transformation can be used to compare coefficients from a published analysis to comparable coefficients from an analysis using the other model, and vice versa.

The approximation β_L ≈ 1.81 β_P is based on equating the variances of the logistic and normal distributions. Amemiya (1981) suggested making the cdf's of the logistic and normal distributions as close as possible, not making their variances equal, which led to his approximation β_L ≈ 1.6 β_P. My own calculations indicate that the cdf's are closest when β_L ≈ 1.7 β_P, which corresponds to the results in the example I now consider.

Logit and Probit: Labor Force Participation

Although we have not considered estimation, it is useful to examine the logit and probit estimates from our model of labor force participation. The model is

Pr(LFP = 1) = F(β₀ + β₁K5 + β₂K618 + β₃AGE + β₄WC + β₅HC + β₆LWG + β₇INC)

Estimates are presented in Table 3.3.

TABLE 3.3 Logit and Probit Analyses of Labor Force Participation

               Logit               Probit
Variable     β        z         β        z        Ratio β    Ratio z
Constant     3.182    4.94      1.918    5.04      1.66       0.98
K5          -1.463   -7.43     -0.875   -7.70      1.67       0.96
K618        -0.065   -0.95     -0.039   -0.95      1.67       1.00
AGE         -0.063   -4.92     -0.038   -4.97      1.66       0.99
WC           0.807    3.51      0.488    3.60      1.65       0.97
HC           0.112    0.54      0.057    0.46      1.95
LWG          0.605    4.01      0.366    4.17      1.65       0.96
INC                  -4.20     -0.021   -4.30      1.68       0.98
-2 ln L              905.39
NOTE: N = 753. β is an unstandardized coefficient; z is the z-test for β.

The first thing to notice is that the log likelihoods and z-tests are nearly identical. This reflects the basic similarity in the structure of the logit and probit models, and the fact that these statistics are unaffected by the assumed variance of the error. The effects of the identifying assumptions about Var(ε) are seen by taking the ratio of the logit coefficients to the probit coefficients, contained in the column labeled "Ratio." The logit coefficients are about 1.7 times larger than the corresponding probit coefficients, with the exception of the coefficient for HC, which is the least statistically significant parameter. Clearly, interpretation of the β's must take the effects of the identifying assumptions into account. This issue is now considered.
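The ratio column can be recomputed directly from the coefficients transcribed from Table 3.3 (Python sketch):

```python
# Logit and probit coefficients as reported in Table 3.3
logit  = {"K5": -1.463, "K618": -0.065, "AGE": -0.063,
          "WC": 0.807, "HC": 0.112, "LWG": 0.605}
probit = {"K5": -0.875, "K618": -0.039, "AGE": -0.038,
          "WC": 0.488, "HC": 0.057, "LWG": 0.366}

ratios = {k: logit[k] / probit[k] for k in logit}
# Every ratio except HC falls near 1.7; HC, the least significant
# coefficient, is the one exception
```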
3.3.1. The Identification of Probabilities

Since the β's are unidentified without assumptions about the mean and variance of ε, the β's are arbitrary in this sense: if we change the identifying assumption regarding Var(ε | x), the β's also change. Consequently, the β's cannot be interpreted directly since they reflect both: (1) the relationship between the x's and y*; and (2) the identifying assumptions. While the identifying assumptions affect the β's, they do not affect Pr(y = 1 | x). More technically, Pr(y = 1 | x) is an estimable function. An estimable function is a function of the parameters that is invariant to the identifying assumptions (Searle, 1971, pp. 180-188).

Consider the logit model where

Pr(yᵢ = 1 | xᵢ) = exp(xᵢβ)/[1 + exp(xᵢβ)] = 1/[1 + exp(−xᵢβ)]

(Prove the last equality.) The right-hand side is the cdf for the logistic distribution with variance σ² = π²/3. We can standardize ε to have a
unit variance by dividing the structural model by σ:

y*ᵢ/σ = xᵢ(β/σ) + εᵢ/σ

Then εᵢ/σ has a standardized logistic distribution with cdf given by Equation 3.2. It follows that the probabilities are estimable functions. That is to say, the probabilities are identified. Further, any function of the probabilities is also identified. Most importantly, we can interpret changes in probabilities and odds ratios, which are ratios of probabilities. This is done in Section 3.7, but first we consider an alternative method of deriving the logit and probit models.

The logit model can also be derived by transforming the probability into the odds:

Ω(x) = Pr(y = 1 | x)/Pr(y = 0 | x) = Pr(y = 1 | x)/[1 − Pr(y = 1 | x)]

The odds indicate how often something (e.g., y = 1) happens relative to how often it does not happen (e.g., y = 0), and range from 0 when Pr(y = 1 | x) = 0 to ∞ when Pr(y = 1 | x) = 1. The log of the odds, known as the logit, ranges from −∞ to ∞. This suggests a model that is linear in the logit:

ln Ω(x) = xβ   [3.6]

Another example is the complementary log-log model (Agresti, 1990, pp. 104-107; McCullagh & Nelder, 1989, p. 108), defined by

ln(−ln[1 − Pr(y = 1 | x)]) = xβ

or, equivalently,

Pr(y = 1 | x) = 1 − exp[−exp(xβ)]
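These transformations are easy to compute (Python sketch; the example probability .75 is ours):

```python
import math

def odds(p):
    # Pr(y=1|x) / [1 - Pr(y=1|x)]
    return p / (1 - p)

def logit(p):
    # the log of the odds, which is linear in x under Equation 3.6
    return math.log(odds(p))

def inv_logit(xb):
    # probability implied by a linear predictor xb under the logit model
    return 1 / (1 + math.exp(-xb))

def inv_cloglog(xb):
    # probability implied by xb under the complementary log-log model
    return 1 - math.exp(-math.exp(xb))
```

A probability of .75 corresponds to odds of 3 and a logit of ln 3. Note the asymmetry of the complementary log-log model: unlike the logit curve, it does not treat probabilities above and below .5 symmetrically.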
L(β | y, X) = ∏ᵢ₌₁ᴺ pᵢ   [3.9]

The β's are incorporated into the likelihood equation by substituting the right-hand side of Equation 3.3:

L(β | y, X) = ∏_{y=1} F(xᵢβ) ∏_{y=0} [1 − F(xᵢβ)]

where the index for multiplication indicates that the product is taken over only those cases where y = 1 and y = 0, respectively.
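Equation 3.9 can be sketched for a tiny made-up data set (Python; the logistic F, the values of x and y, and β = 1 are all ours for illustration):

```python
import math

def F(z):
    # logistic cdf
    return 1 / (1 + math.exp(-z))

def log_lik(beta, xs, ys):
    # ln L(B | y, X): sum of ln F(x_i B) over cases with y = 1
    # plus sum of ln[1 - F(x_i B)] over cases with y = 0
    total = 0.0
    for x, y in zip(xs, ys):
        p = F(x * beta)
        total += math.log(p) if y == 1 else math.log(1 - p)
    return total

xs, ys = [-1.0, 0.0, 1.0], [0, 1, 1]  # made-up sample
ll = log_lik(1.0, xs, ys)
```

A β whose sign matches the pattern in the data produces a larger log likelihood than one that contradicts it, which is what ML exploits.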
When there is more than one observation for each combination of values of the independent variables, minimum chi-square estimation can be used. Since this requirement is rarely satisfied in social science research, I do not consider the method here; see Hanushek and Jackson (1977) or Maddala (1983).

It seems necessary to add a cautionary note, since it is easy to get the impression that ML estimation works well with any sample size. For example, the 32 observations from a study by Spector and Mazzeo (1980) are used frequently to illustrate the logit and probit models, yet 32 is too small a sample to justify the use of ML. The following guidelines are not hard and fast. They are based on my experience of when the models seem to produce reasonable and robust results, and on my discussions with other researchers who use these methods.
It is risky to use ML with samples smaller than 100, while samples over 500 seem adequate. These values should be raised depending on characteristics of the model and the data. First, if there are a lot of parameters in the model, more observations are needed. In the literature on the covariance structure model, the rule of at least five observations per parameter is often given. A rule of at least 10 observations per parameter seems reasonable for the models in this book. This rule does not imply that the minimum of 100 is not needed if you have only a few parameters. Second, if the data are ill conditioned (e.g., the independent variables are highly collinear) or if there is little variation in the dependent variable (e.g., nearly all of the outcomes are 1), a larger sample is required. Third, some models seem to require more observations; the ordinal regression model of Chapter 5 is an example.

In discussing the use of ML for small samples, Allison (1995, p. 80) makes a useful point. While the standard advice is that with small samples you should accept larger p-values as evidence against the null hypothesis, given that the degree to which ML estimates are normally distributed in small samples is unknown, it is more reasonable to require smaller p-values in small samples.

For the LRM, ML estimates are obtained by setting the gradient of the likelihood to 0 and solving for the parameters using algebra. No such explicit solutions are possible with nonlinear models. Consequently, numerical methods are used to find the estimates that maximize the likelihood function. Numerical methods start with a guess of the values of the parameters and iterate to improve on that guess. While it is tempting to dismiss numerical methods as an esoteric topic of no practical concern, programs using numerical methods for estimation sometimes produce incorrect estimates or fail to provide any estimates. To detect and correct such problems, an elementary understanding of numerical methods is useful. I begin with an introduction to numerical methods, followed by advice on using these methods.

3.6.1. Iterative Solutions

Assume that we are trying to estimate the vector of parameters θ. We begin with an initial guess θ₀, called start values, and attempt to improve on this guess by adding a vector ζ₀ of adjustments:

θ₁ = θ₀ + ζ₀

We proceed by updating the previous iteration according to the equation:

θₙ₊₁ = θₙ + ζₙ

Iterations continue until there is convergence. Roughly, convergence occurs when the gradient of the log likelihood is close to 0 or the estimates do not change from one step to the next. Convergence must occur to obtain the ML estimator θ̂.

The problem is to find a ζₙ that moves the process rapidly toward a solution. It is useful to think of ζₙ as consisting of two parts: ζₙ = Dₙγₙ. γₙ is the gradient vector, defined as ∂ln L/∂θₙ, which indicates the direction of the change in the log likelihood for a change in the parameters. Dₙ is a direction matrix that reflects the curvature of the log likelihood function; that is, it indicates how rapidly the gradient is changing. A clearer understanding of these components is gained by examining the simplest methods of maximization.

In the method of steepest ascent, the direction matrix is simply the identity matrix. An estimate increases if the gradient is positive, and it decreases if the gradient is negative. Iterations stop when the derivative becomes nearly 0. The problem with this approach is that it considers the slope of ln L, but not how quickly the slope is changing. To see why this is a problem, consider two log likelihood functions with the same gradient at a given point, but with one function changing shape more quickly than the other. (Sketch these functions.) You should move more gradually for the function that is changing quickly, in order to avoid moving too far. Steepest ascent tends to work poorly since it treats both functions in the same way.

The next three commonly used methods address this problem by adding a direction matrix that assesses how quickly the log likelihood function is changing. They differ in their choice of a direction matrix. In all cases, it takes longer to compute the direction matrix than the identity matrix used with the method of steepest ascent. Usually, the additional computational costs are made up for by the fewer iterations that are required to reach convergence.
REGRESSION MODELS
Billa/y Outcomes 57
No one method works best all of the time. An algorithm applied to one set of data may fail to converge, while another algorithm applied to the same data may converge rapidly. For a different set of data, the opposite may occur. In practice, the algorithm used in commercial software reflects the preferences of the programmer and the ease with which an algorithm can be programmed for a given model.

The Newton-Raphson algorithm uses the Hessian, the matrix of second derivatives of the log likelihood. For a model with parameters α and β, the Hessian is

    [ ∂²lnL/∂α∂α   ∂²lnL/∂α∂β ]
    [ ∂²lnL/∂β∂α   ∂²lnL/∂β∂β ]

If ∂²lnL/∂α∂α is large relative to ∂²lnL/∂β∂β, the gradient is changing more rapidly as α changes than as β changes. Thus, smaller adjustments to the estimate of α would be indicated. The Newton-Raphson algorithm translates this idea into the equation:

    θₙ₊₁ = θₙ − (∂²lnL/∂θₙ∂θₙ′)⁻¹ ∂lnL/∂θₙ

(Why do we use the inverse of the Hessian?)

The method of scoring replaces the Hessian with its expectation, known as the information matrix. In some cases, the expectation of the Hessian can be easier to compute than the Hessian itself. The method of scoring uses the information matrix as the direction matrix, which results in

    θₙ₊₁ = θₙ + [−E(∂²lnL/∂θₙ∂θₙ′)]⁻¹ ∂lnL/∂θₙ

A third method, the BHHH algorithm, is simple to compute since it does not require evaluation of the second derivatives. It approximates the information matrix with the sum of the outer products of the gradients for the individual observations:

    Σᵢ (∂lnLᵢ/∂θ)(∂lnLᵢ/∂θ′)

where lnLᵢ is the value of the likelihood function evaluated for the ith observation. This approximation is often simpler to compute since only the gradient needs to be evaluated. Iterations proceed according to the same updating equation, with this matrix used to form the direction matrix.

If analytic formulas are not available for the gradient or the Hessian, numerical methods can be used to estimate them. For example, consider a log likelihood based on a single parameter θ. The gradient is approximated by computing the slope of the change in ln L when θ changes by a small amount. If δ is a small number relative to θ,

    ∂lnL/∂θ ≈ [lnL(θ + δ) − lnL(θ)] / δ

Using numerical estimates can greatly increase the time and number of iterations needed, and results can be sensitive to the choice of δ. Further, different start values can result in different estimates of the Hessian at convergence, which translates into different estimates of the standard errors. Programs that use numerical methods for computing derivatives should only be used if no alternatives are available. When they must be used, you should experiment with different starting values to make sure that the estimates that you obtain are stable.

3.6.2. The Variance of the ML Estimator

The covariance matrix of the ML estimator is estimated by evaluating the inverse of the negative expected Hessian at the ML estimates. It is often written in an equivalent form using the second derivatives for the individual observations:

    Var(θ̂) = ( −Σᵢ E[∂²lnLᵢ/∂θ̂∂θ̂′] )⁻¹        [3.12]

If you use the Newton-Raphson algorithm, Equation 3.12 shows the link between the curvature of the likelihood function and the variance of the estimator. The size of the variance is inversely related to the second derivative: the smaller the second derivative, the larger the variance. When the second derivative is smaller, the likelihood function is flatter. If the likelihood function is very flat, the variance will be large. This should match your intuition that the flatter the likelihood function, the harder it will be to find the maximum of the likelihood, and the less certainty (i.e., the more variance) you should have in the solution you obtain.

A third estimator, which is related to the BHHH algorithm, is simple to compute since it does not require evaluation of the second derivatives:

    Var(θ̂) = ( Σᵢ (∂lnLᵢ/∂θ̂)(∂lnLᵢ/∂θ̂′) )⁻¹

While these estimates of the covariance matrix are asymptotically equivalent, in practice they sometimes provide very different estimates when the sample is small or the data are ill conditioned. Consequently, if you estimate the same model with the same data using two programs that use different estimators of the covariance matrix, you can get different results.

3.6.3. Problems With Numerical Methods and Possible Solutions

Numerical methods work best when the log likelihood is globally concave, since then there is only one solution, and that solution is a maximum. This is the case for most of the models considered in this book. However, even when the log likelihood is globally concave, it is possible to have false convergence. This can occur when the function is very flat and the precision of the estimates of the gradient is insufficient. This is common when numerical gradients are used and can also be caused by problems with scaling (discussed below). Finally, in some cases, ML estimates do not exist for a particular pattern of data. For example, with a binary outcome and a single binary independent variable, ML estimates are not possible if there is no variation in the independent variable for one of the outcomes. You can try estimating a probit model using: y′ = (0 0 1 1 1) and x′ = (1 0 1 1 0). This works fine, since there are x's equal to 0 and 1 both for y = 1 and for y = 0. However, now try to estimate the model for: y′ = (0 0 1 1) and x′ = (1 0 1 1). Your program will "crash" since whenever y = 1, all x's are 1's.

When you cannot get a solution or appear to get the wrong solution, the first thing to check is that the software is estimating the model that you want to estimate. It is easy to make an error in specifying the commands to estimate your model. If the model and commands are correct, there may be problems with the data.

Incorrect Variables. Most simply, you may have constructed a variable incorrectly. Be sure to check the descriptive statistics for all variables. My experience suggests that most problems with numerical methods are due to data that have not been "cleaned."
Convergence generally occurs more rapidly when there are more observations and when the ratio of the number of observations to the number of variables is larger. While there is generally little you can do about sample size, it can explain why you are having trouble getting your models to converge.

Scaling of Variables. Scaling is a very common cause of problems with numerical methods. The larger the ratio between the largest standard deviation and the smallest standard deviation, the more problems you will have with numerical methods. For example, if you have income measured in dollars, it may have a very large standard deviation relative to other variables. Rescaling income to thousands of dollars may solve the problem. My experience suggests that problems are much more likely when the ratio between the largest and smallest standard deviation exceeds 10.

Distribution of Outcomes. If a large proportion of cases are censored in the tobit model, or if one of the categories of a categorical variable has very few cases, convergence may be difficult. There is little that can be done with such data limitations.

Numerical methods for ML estimation tend to work well when your model is appropriate for your data. In such cases, convergence generally occurs quickly, often within five iterations. If you have too few observations or a poor model, convergence may be a problem. In such cases, respecifying your model or rescaling your data can solve the problem. If that fails, you can try using a program that uses a different numerical method. A problem that may be very difficult for one algorithm may work well for another.

While numerical methods generally work well, I heartily endorse the following advice: check the data, check their transfer into the computer, check the actual computations (preferably by repeating the analysis with a rival program), and always remain suspicious of the results, regardless of their appeal.

3.6.4. Software Issues

There are several issues related to software for logit and probit that should be kept in mind.

The Method of Maximization. Different programs use different methods of numerical maximization. In most cases, estimates of the parameters from different programs are identical to at least four decimal digits. Estimates of the standard errors and the z-values may differ at the first decimal digit as a result of the different methods used to estimate Var(β̂).

Parameterizations of the Model. A more basic difference is found in the outcome being modeled. While most programs model the probability of a 1, some programs (e.g., SAS) model the probability of a 0. This is a trivial difference if you are aware of what the program is doing. For the BRM,

    Pr(yᵢ = 0 | xᵢ) = 1 − Pr(yᵢ = 1 | xᵢ) = 1 − F(xᵢβ) = F(−xᵢβ)

where the last equality follows from the symmetry of the pdf for the logit and probit models. Thus, all coefficients will have the opposite sign. Note that this will not be the case for the complementary log-log model since it is asymmetric.

With estimates in hand, we can consider the interpretation of the binary response model.

3.7. Interpretation

In this section, I present four methods of interpretation, each of which is generalized to other models in later chapters. First, I show how to present predicted probabilities using graphs and tables. Second, I examine the partial change in y* and in the probability. Third, I use discrete change in the probability to summarize the effects of each variable. Finally, for the logit model, I derive a simple transformation of the parameters that indicates the effect of a variable on the odds that the event occurred.

Since the BRM is nonlinear, no single approach to interpretation can fully describe the relationship between a variable and the outcome probability. You should search for an elegant and concise way to summarize the results that does justice to the complexities of the nonlinear model. For any given application, you may need to try each method before a final approach is determined. For example, you might have to construct a plot of the predicted probabilities before realizing that a single measure of discrete change is sufficient to summarize the effect of a variable. I illustrate these methods with the data on the labor force participation of women. You should be able to replicate many of the results using Tables 3.1 and 3.3, although your answers may differ slightly due to rounding error.
Understanding how the intercept and the slope affect the curve relating a variable to the probability of an event is fundamental to each method of interpretation.

3.7.1. The Effects of the Parameters

Consider the BRM with a single x:

    Pr(y = 1 | x) = F(α + βx)

Panel A of Figure 3.8 shows the effect of the intercept on the probability curve. At α = 0, shown by the short dashed line, the curve passes through the point (0, .5); changing the intercept shifts the curve along the x-axis without changing its shape. Panel B shows the effect of the slope. At α = 0, the curves go through the point (0, .5). The smaller the β, the more stretched out the curve. At the β shown by the solid line, the curve rises sharply around x = 0. For a negative β, the curve would be near 1 at x = −20 and would decrease toward 0 at x = 20.

Figure 3.8. Effects of Changing the Slope and Intercept on the Binary Response Model: Pr(y = 1 | x) = F(α + βx). Panel A: Effects of Changing α.

These ideas also help in understanding how the probability curve generalizes to more than one variable. Figure 3.9 plots the probit model:

    Pr(y = 1 | x, z) = Φ(1 + 1x + .75z)

Similar results hold for the logit model. The surface begins near zero when x = −4 and z = −8. If we fix z = −8, then

    Pr(y = 1 | x, z = −8) = Φ(1 + 1x + [.75 × −8]) = Φ(−5.0 + 1x)

Increasing z causes the curve to shift to the left along the x-axis, since z only affects the intercept. If we increase z by 1, which corresponds to the next curve back along the z-axis,

    Pr(y = 1 | x, z = −7) = Φ(1 + 1x + [.75 × −7]) = Φ(−4.25 + 1x)

The level of z affects the intercept of the curve for x, but does not affect the slope. Conversely, controlling for x only affects the intercept of the curve for z, but not the slope. With these ideas in mind, we can consider specific methods for interpreting the binary response model.
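The claim that z shifts the probit curve along the x-axis without changing its slope can be verified numerically. A minimal check using only the standard library (the surface Φ(1 + 1x + .75z) is the one from the text; everything else is scaffolding):

```python
# Check that raising z by 1 in Pr(y=1|x,z) = Phi(1 + 1*x + 0.75*z) is the same
# as evaluating the original curve at x + 0.75: a pure intercept shift.
import math

def Phi(v):                                   # standard normal cdf
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def pr(x, z):                                 # probit surface from the text
    return Phi(1.0 + 1.0 * x + 0.75 * z)

print(round(pr(-4.0, -8.0), 4))               # deep in the lower tail, near 0
# Raising z from -8 to -7 is identical to moving 0.75 along the x-axis:
checks = [abs(pr(x, -7.0) - pr(x + 0.75, -8.0)) < 1e-12 for x in range(-10, 11)]
print(all(checks))
```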
Figure 3.9. Plot of Probit Model: Pr(y = 1 | x, z) = Φ(1.0 + 1.0x + 0.75z)

3.7.2. Interpretation Using Predicted Probabilities

The most direct approach for interpretation is to examine the predicted probabilities of an event for different values of the independent variables. When there are more than two variables, it is no longer possible to plot the entire probability surface, and a decision must be made about which probabilities to compute and how to present them. A useful first step is to compute the range of predicted probabilities within the sample and the degree to which each variable affects the probabilities. If the range of probabilities is between .2 and .8 (or, more conservatively, between .3 and .7), the relationship between the x's and the probability is approximately linear, and simple measures can be used to summarize the results. Or, if the range of the probability is small, the relationship between the x's and the probability will also be approximately linear. For example, the segment of the probability curve between .05 and .10 is nearly linear. These points are illustrated below.

The predicted probability of an event given x for the ith individual is

    Pr(y = 1 | xᵢ) = F(xᵢβ̂)

The minimum and maximum probabilities in the sample are defined as

    min Pr(y = 1 | x) = minᵢ F(xᵢβ̂)        max Pr(y = 1 | x) = maxᵢ F(xᵢβ̂)

where minᵢ indicates taking the minimum value over all observations, and similarly for maxᵢ. In our example, the predicted probabilities from the probit model range from .01 to .97, which indicates that the nonlinearities that occur below .2 and above .8 need to be taken into account. If the coefficients from the logit model are used, the predicted probabilities range from .01 to .96. This illustrates the similarity between the predictions of the logit and probit models, even for observations that fall in the tail of the distribution. Consequently, in the remainder of this section, only the results from the probit analysis are shown.

Computing the minimum and maximum predicted probabilities requires your software to save each observation's predicted probability for further analysis. If this is not possible, or if you are doing a meta-analysis, the minimum and maximum can be approximated by using the estimated β's and the descriptive statistics. The lower extreme of the variables is defined by setting each variable associated with a positive β to its minimum and each variable associated with a negative β to its maximum. In our example, this involves taking the maximum number of young children (since K5 has a negative effect), the minimum anticipated wage (since LWG has a positive effect), and so on. Formally, let

    x̃ₖᴸ = min xᵢₖ  if βₖ > 0;    x̃ₖᴸ = max xᵢₖ  if βₖ < 0

and let x̃ᴸ be the vector whose kth element is x̃ₖᴸ. The upper extreme x̃ᵁ can be defined in a corresponding way. The minimum and maximum probabilities are computed as

    Pr(y = 1 | x̃ᴸ) = F(x̃ᴸβ̂)  and  Pr(y = 1 | x̃ᵁ) = F(x̃ᵁβ̂)

In our example, the computed probability at the lower extreme is less than .01 and at the upper extreme is .99. While these values are quite close to the minimum and maximum predicted probabilities for the sample, x̃ᴸ and x̃ᵁ are constructs that do not necessarily approximate any member of the sample. If they differ substantially from any xᵢ in the sample, then Pr(y = 1 | x̃ᴸ) and Pr(y = 1 | x̃ᵁ) will be poor approximations of the probabilities min Pr(y = 1 | x) and max Pr(y = 1 | x).
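The lower- and upper-extreme construction can be sketched in a few lines. All coefficients and variable ranges below are made up for illustration; only the rule (each variable with a positive β at its minimum for the lower extreme, and so on) comes from the text.

```python
# Sketch of the lower/upper-extreme bound on predicted probabilities:
# positive-beta variables at their minimum (maximum) and negative-beta
# variables at their maximum (minimum). All numbers are hypothetical.
import math

def Phi(v):
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

beta = {"const": 0.5, "kids": -0.9, "wage": 0.4}       # hypothetical probit betas
lo   = {"const": 1.0, "kids": 0.0,  "wage": -2.0}      # variable minimums
hi   = {"const": 1.0, "kids": 3.0,  "wage": 3.0}       # variable maximums

def extreme(low_side):
    # low_side=True builds the lower-extreme vector: minimize x*beta term by term
    xb = 0.0
    for k, b in beta.items():
        val = (lo[k] if b > 0 else hi[k]) if low_side else (hi[k] if b > 0 else lo[k])
        xb += b * val
    return Phi(xb)

p_min, p_max = extreme(True), extreme(False)
print(round(p_min, 4), round(p_max, 4))
# Any actual observation's probability must fall inside [p_min, p_max]:
x_obs = {"const": 1.0, "kids": 1.0, "wage": 0.5}
p_obs = Phi(sum(beta[k] * x_obs[k] for k in beta))
print(p_min <= p_obs <= p_max)
```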
Warning on the Use of Minimums and Maximums. The use of the minimum or maximum value of a variable can be misleading if there are extreme values in the sample. For example, if our sample includes an extremely wealthy person, the change in the probability when we move
from the minimum to the maximum income would be unrealistically large. Before using the minimum and maximum, you should examine the distribution of each variable. If extreme values are present, you should consider using the 5th percentile and the 95th percentile, for example, rather than the minimum and maximum.

The Effect of Each Variable on the Predicted Probability. The next step is to determine the extent to which change in a variable affects the probability. One way to do this is to allow one variable to vary from its minimum to its maximum, while all other variables are fixed at their means. Let Pr(y = 1 | x̄, xₖ) be the probability when all variables except xₖ are set equal to their means, and xₖ equals some value. For example, Pr(y = 1 | x̄, min xₖ) is the probability when xₖ equals its minimum. The predicted change in the probability as xₖ changes from its minimum to its maximum equals

    Pr(y = 1 | x̄, max xₖ) − Pr(y = 1 | x̄, min xₖ)

For our example, the results are in Table 3.4. The range of predicted probabilities can be used to guide further analysis. For example, there is little to be learned by analyzing variables whose range of probabilities is small, such as HC. For variables that have a larger range, the ends of the range affect how interpretation should proceed. For example, the probabilities for AGE range from .75 when age is 30 to .32 when age is 60, which is a region where the probability curve is approximately linear. The range for INC, however, is from .09 to .73, where nonlinearities are substantial. The implications of these differences are shown in the next section.

TABLE 3.4 Probabilities of Labor Force Participation Over the Range of Each Independent Variable for the Probit Model

    Variable   Pr at Minimum   Pr at Maximum   Range
    K5         0.66            0.01            0.64
    K618       0.60            0.48            0.12
    AGE        0.75            0.32            0.43
    WC         0.52            0.70            0.18
    HC         0.57            0.59            0.02
    LWG        0.17            0.83            0.66
    INC        0.73            0.09            0.64

Plotting Probabilities Over the Range of a Variable. When there are more than two independent variables, we must examine the effects of one or two variables while the remaining variables are held constant. For example, consider the effects of age and the wife attending college on labor force participation. The effects of both variables can be plotted by holding all other variables at their means and allowing age and college status to vary. To do this, let x₀ contain the mean of all variables, except let WC = 0 and allow AGE to vary. x₁ is defined similarly for WC = 1. Then

    Pr(LFP = 1 | AGE, WC = 0) = Φ(x₀β̂)

is the predicted probability of being in the labor force for women of a given age who did not attend college and who are average on all other characteristics. Pr(LFP = 1 | AGE, WC = 1) can be computed similarly. These probabilities are plotted in Figure 3.10. As suggested by Table 3.4, the relationship between age and the probability of being employed is approximately linear. This allows a very simple interpretation:

• Attending college increases the probability of being employed by about .18 for women of all ages, holding all other variables at their means.

• For each additional 10 years of age, the probability of being employed decreases by about .13, holding all other variables at their means.

Figure 3.10. Probability of Labor Force Participation by Age and Wife's Education
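Figure 3.10's calculation can be approximated with the probit coefficients for AGE (−0.038) and WC (0.488) from Table 3.6. The constant 1.80, standing in for the intercept plus all other variables held at their means, is back-solved from the probabilities reported for ages 30 and 60; it is an assumption, not a value from the text.

```python
# Approximate reconstruction of Figure 3.10: Pr(LFP=1) against AGE for WC = 0
# and WC = 1. B_AGE and B_WC come from Table 3.6; BASE is a back-solved
# stand-in for the intercept plus other variables at their means (assumption).
import math

def Phi(v):
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

BASE, B_AGE, B_WC = 1.80, -0.038, 0.488

def pr_lfp(age, wc):
    return Phi(BASE + B_AGE * age + B_WC * wc)

for age in range(30, 61, 10):
    p0, p1 = pr_lfp(age, 0), pr_lfp(age, 1)
    print(age, round(p0, 2), round(p1, 2), round(p1 - p0, 2))
```

The last column stays in a narrow band (roughly .13 to .19 here), echoing the roughly constant college effect of about .18 described above.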
TABLE 3.5 Predicted Probabilities of Labor Force Participation by College Attendance and the Number of Young Children for the Probit Model

                                  Predicted Probability
    Number of
    Young Children   Did Not Attend   Attended College   Difference
    0                0.61             0.78               0.17
    1                0.27             0.45               0.18
    2                0.07             0.16               0.09
    3                0.01             0.03               0.02

Tables of Probabilities at Selected Values. You can also use tables to present predicted probabilities. For example, the effects of young children and the wife's education on the probability of employment are shown in Table 3.5. The strong, nonlinear effect of the number of young children is clearly evident. The table also shows that the effect of attending college declines as the number of young children increases.

The Partial Change in y*. For the latent variable model y* = xβ + ε, the partial change in y* with respect to xₖ is

    ∂y*/∂xₖ = βₖ

Since the model is linear in y*, the partial derivative can be interpreted as:

• For a unit change in xₖ, y* is expected to change by βₖ units, holding all other variables constant.
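The entries of Table 3.5 can be approximated with the probit coefficients for K5 (−0.875) and WC (0.488) from Table 3.6; the constant 0.28 is again a back-solved stand-in for the intercept plus the remaining variables at their means (an assumption, not a value from the text).

```python
# Approximate reconstruction of Table 3.5 from the probit coefficients for K5
# and WC (Table 3.6). BASE is a back-solved assumption, not a value from the
# text; entries agree with the table to within rounding.
import math

def Phi(v):
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

BASE, B_K5, B_WC = 0.28, -0.875, 0.488

for kids in range(4):
    p_no = Phi(BASE + B_K5 * kids)             # did not attend college
    p_wc = Phi(BASE + B_K5 * kids + B_WC)      # attended college
    print(kids, round(p_no, 2), round(p_wc, 2), round(p_wc - p_no, 2))
```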
A problem with this interpretation is that the variance of y* is unidentified, so the meaning of a change of βₖ units in y* is unclear. This issue is discussed by Winship and Mare (1984, p. 517) and McKelvey and Zavoina (1975, pp. 114-116) for the ordinal regression model, but their concerns apply equally to the BRM. Since the variance of y* changes when variables are added to the model, the magnitudes of all β's will change even if the added variable is uncorrelated with the original variables. This makes it misleading to compare coefficients from equations with different sets of the independent variables. (Why is this not a problem with the LRM?) To compare coefficients across equations, Winship and Mare proposed fully standardized coefficients, while McKelvey and Zavoina proposed y*-standardized coefficients.

If σ is the unconditional standard deviation of y*, then the y*-standardized coefficient for xₖ is

    βₖ(Sy*) = βₖ / σ

which can be interpreted as:

• For a unit increase in xₖ, y* is expected to increase by βₖ(Sy*) standard deviations, holding all other variables constant.

y*-standardized coefficients indicate the effect of an independent variable in its own unit of measurement. This is sometimes preferable for substantive reasons and is appropriate for binary independent variables. Fully standardized coefficients also standardize the independent variables. If σₖ is the standard deviation of xₖ, then the fully standardized coefficient for xₖ is

    βₖ(S) = σₖ βₖ / σ

The variance of y* is estimated as

    Var(y*) = β̂′ Var(x) β̂ + Var(ε)

where Var(x) is the covariance matrix for the x's computed from the observed data; β̂ contains ML estimates; and Var(ε) = 1 in the probit model and Var(ε) = π²/3 in the logit model.

TABLE 3.6 Standardized and Unstandardized Probit Coefficients for Labor Force Participation

    Variable   β        β(Sy*)   β(S)     z
    K5         -0.875   -0.759   -0.398
    K618       -0.039   -0.033   -0.044   -0.95
    AGE        -0.038   -0.033   -0.265
    WC          0.488    0.424    0.191    3.60
    HC          0.057    0.050    0.024    0.46
    LWG         0.366    0.317    0.186    4.17
    INC        -0.021   -0.018   -0.207   -4.30
    Var(y*)     1.328

    NOTE: N = 753. β is an unstandardized coefficient; β(Sy*) is a y*-standardized coefficient; β(S) is a fully standardized coefficient; z is the z-test.

If you accept the notion that it is meaningful to discuss the latent propensity to work, the fully standardized and y*-standardized coefficients in Table 3.6 can be interpreted just as their counterparts for the LRM.³ For example,

• Each additional young child decreases the mother's propensity to enter the labor market by .76 standard deviations, holding all other variables constant.

• A standard deviation increase in age decreases a woman's propensity to enter the labor market by .27 standard deviations, holding all other variables constant.

The Partial Change in the Probability. The marginal effect of xₖ on the probability is computed with the chain rule:

    ∂Pr(y = 1 | x)/∂xₖ = ∂F(xβ)/∂xₖ = f(xβ) βₖ        [3.14]

The marginal effect is the slope of the probability curve relating xₖ to Pr(y = 1 | x), holding all other variables constant. The sign of the marginal effect is determined by the sign of βₖ, since f(xβ) is always positive. The magnitude of the change depends on the magnitude of βₖ and the value of xβ. This is shown in Figure 3.12, where the solid line graphs Pr(y = 1 | x) and the dashed line graphs the marginal effect. The marginal effect is largest at the value of x which corresponds to Pr(y = 1 | x) = .5, and it is symmetric around that point, reflecting the symmetry of f.

The value of the effect depends on the values of the other variables and their coefficients, since f is computed at xβ. Consequently, the marginal effect depends on the β's for all variables and the levels of all x's. To see how the value of the marginal effect of xₖ depends on the level of other variables, consider Figure 3.9, which plots Pr(y = 1 | x, z) for values of x and z. Pick a point (x, z), which corresponds to the intersection of lines within the figure. The partial ∂Pr(y = 1 | x, z)/∂x is the slope of the line parallel to the x-axis at the point (x, z); ∂Pr(y = 1 | x, z)/∂z is the slope of the line parallel to the z-axis at the point (x, z). For example, at (−4, −8), the slope with respect to x is nearly 0. As z increases, the slope with respect to x increases steadily. At (−4, 0), where Pr(y = 1 | x, z) is about .5, the slope is near its maximum. As z continues to increase, the slope gradually decreases. Hanushek and Jackson (1977, p. 189) show this relationship for the logit model by taking the second derivative:

    ∂²Pr(y = 1 | x)/∂xₖ∂xₗ = βₖβₗ Pr(y = 1 | x)[1 − Pr(y = 1 | x)][1 − 2Pr(y = 1 | x)]

The β's can also be used to assess the relative magnitudes of the marginal effects for two variables. From Equation 3.14, the ratio of marginal effects for xₖ and xₗ is

    [∂Pr(y = 1 | x)/∂xₖ] / [∂Pr(y = 1 | x)/∂xₗ] = f(xβ)βₖ / [f(xβ)βₗ] = βₖ / βₗ

Thus, while the β's are only identified up to a scale factor, their ratio is identified and can be used to compare the effects of independent variables.
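Equation 3.14 and the ratio result can be checked numerically for a probit with two regressors. The coefficients below are invented; the check compares a central-difference derivative of the probability with f(xβ)βₖ and confirms that the ratio of marginal effects equals β₁/β₂ at every point.

```python
# Numerical check of Equation 3.14 for a probit with two regressors
# (hypothetical coefficients): dPr/dx_k = phi(x*beta)*beta_k, and the ratio of
# two marginal effects is beta_1/beta_2 regardless of where it is evaluated.
import math

b0, b1, b2 = 0.3, 0.8, -0.5                    # made-up probit coefficients

def Phi(v):
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def phi(v):                                     # standard normal pdf
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def pr(x1, x2):
    return Phi(b0 + b1 * x1 + b2 * x2)

def num_deriv(f, h=1e-6):                       # central difference
    return (f(h) - f(-h)) / (2 * h)

for x1, x2 in [(0.0, 0.0), (1.5, -2.0), (-1.0, 3.0)]:
    xb = b0 + b1 * x1 + b2 * x2
    me1 = num_deriv(lambda d: pr(x1 + d, x2))   # dPr/dx1, numerically
    me2 = num_deriv(lambda d: pr(x1, x2 + d))   # dPr/dx2
    assert abs(me1 - phi(xb) * b1) < 1e-6       # matches f(x*beta)*beta_k
    assert abs(me1 / me2 - b1 / b2) < 1e-4      # ratio is beta_1/beta_2 everywhere
print("checks passed")
```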
Since the value of the marginal effect depends on the levels of all variables, we must decide which values of the variables to use when computing the effect. One method is to average over all observations:

    mean ∂Pr(y = 1 | x)/∂xₖ = (1/N) Σᵢ f(xᵢβ) βₖ

A second method is to compute the marginal effect at the mean of the independent variables:

    ∂Pr(y = 1 | x̄)/∂xₖ = f(x̄β) βₖ

The two measures of change can be quite different. Note also that the marginal effect at the mean for AGE approximates the slope of the lines in Figure 3.10. If an independent variable varies over a region of the probability curve that is nearly linear, the marginal effect can be used to summarize the effect of a unit change in the variable on the probability of an event. However, if the range of an independent variable corresponds to a region of the probability curve that is nonlinear, the marginal effect cannot be used to assess the overall effect of the variable.

A discrete change compares the probability at two values of a variable:

    Pr(y = 1 | x̄, xₖ + δ) − Pr(y = 1 | x̄, xₖ)

where δ is often 1 or the standard deviation sₖ of xₖ.

Discrete Change From 0 to 1: Dummy Variables. When computing a discrete change, make certain that the change in the variable does not result in values that exceed the variable's range. For example, if xₖ is a dummy variable with mean x̄ₖ, the value x̄ₖ + 1/2 will exceed 1 or x̄ₖ − 1/2 will fall below 0 (unless x̄ₖ = 1/2). Consequently, a preferred measure of discrete change for dummy variables is

    Pr(y = 1 | x̄, xₖ = 1) − Pr(y = 1 | x̄, xₖ = 0)

This is the change as xₖ goes from 0 to 1, holding all other variables at their means.

The idea of discrete change can be extended in many ways depending on the application. A change of a specific amount other than 1 or sₖ may be of substantive interest, such as the addition of four years of schooling.

Discrete Change for Labor Force Participation. Table 3.8 contains measures of discrete change for the probit model of women's labor force participation. Some of the effects can be interpreted as follows:

• Having an additional young child decreases the probability of employment by .33, holding all other variables at their means.

• A standard deviation change in age centered around the mean will decrease the probability of working by .12, holding all other variables at their means.

• If a woman attends college, her probability of being in the labor force is .18 greater than a woman who does not attend college, holding all other variables at their means.

TABLE 3.8 Discrete Change in the Probability of Employment for the Probit Model

Notice that the discrete change from 0 to 1 for WC and HC is nearly identical to the effect of a unit change. This is a consequence of the near linearity of the probability curve over the range of these variables, and will not necessarily be true in other examples.

3.8. Interpretation Using Odds Ratios

Our final method of interpretation takes advantage of the tractable form of the logit model. A simple transformation of the β's in the logit model indicates the factor change in the odds of an event occurring. There is no corresponding transformation of the parameters of the probit model.

From Equation 3.6, the logit model can be written as the log-linear model:

    ln Ω(x) = xβ        [3.15]

where

    Ω(x) = Pr(y = 1 | x) / Pr(y = 0 | x)        [3.16]

is the odds of the event given x. ln Ω(x) is the log of the odds, known as the logit. Equation 3.15 shows that the logit model is linear in the logit. Consequently,

    ∂ ln Ω(x)/∂xₖ = βₖ

Since the model is linear, βₖ can be interpreted as:

• For a unit change in xₖ, we expect the logit to change by βₖ, holding all other variables constant.
This interpretation is simple since the effect of a unit change in xₖ on the logit does not depend on the level of xₖ or on the level of any other variable. Unfortunately, most of us do not have an intuitive understanding of what a change in the logit means. This requires transforming the model to the odds. Taking the exponential of Equation 3.15,

    Ω(x) = exp(xβ)

so that increasing xₖ by δ changes the odds by the factor exp(βₖ × δ). Accordingly, the parameters can be interpreted in terms of odds ratios:

• Factor change. For a unit change in xₖ, the odds are expected to change by a factor of exp(βₖ), holding all other variables constant.

If βₖ is negative, the factor is less than 1 and the odds become "exp(βₖ) times smaller." For δ = sₖ, we have:

• Standardized factor change. For a standard deviation change in xₖ, the odds are expected to change by a factor of exp(βₖ × sₖ), holding all other variables constant.

Notice that the effect of a change in xₖ does not depend on the level of xₖ or on the level of any other variable.

TABLE 3.9 Factor Change Coefficients for Labor Force Participation for the Logit Model

    Variable   Logit Coefficient   Factor Change   Standardized Factor Change   z-value
The very early history of these models begins in the 1860s and is discussed by Finney (1971, pp. 38-41). The more recent history of the probit model involves attempts to model the effects of toxins on insects. Work by Gaddum (1933) and Bliss (1934) was codified in Finney's influential Probit Analysis (1971), whose first edition appeared in 1947. The logit model was championed by Berkson (1944, 1951) in the 1940s as an alternative to the probit model. Cox's (1970) The Analysis of Binary Data was highly influential in the acceptance of the logit model. Applications of the logit and probit models appeared in economics in the 1950s (Cramer, 1991, p. 41). Goldberger's (1964, pp. 248-251) Econometric Theory was important in establishing these models as standard tools in economics, while Hanushek and Jackson's (1977) Statistical Methods for Social Scientists was important in disseminating these models to areas outside of economics.
Several texts develop the logit and probit models, along with several alternatives, within the framework of the latent variable model. Pudney (1989, Chapter 3) derives these models from assumptions associated with utility maximization. Other treatments present both models with special attention to logit and log-linear models for categorical data. While the interpretation of the results of these models has often been neglected, each of the methods of interpretation considered in this chapter can be found in one form or another in earlier work. Recent treatments that focus on interpretation include Hanushek and Jackson (1977), Liao (1994), and Long (1987).

4. Hypothesis Testing and Goodness of Fit
where Var(β̂) is the estimated covariance matrix for β̂. Because the statistic is only asymptotically normal, in finite samples either the normal or the t distribution is used. Accordingly, some programs label this statistic a z-test, while other programs label it a t-test.

Example of the z-Test: Labor Force Participation. For the logit model of labor force participation, the z-statistic for the effect of young children (K5) is about −7.4; its square is the Wald statistic W = 55.14 computed below, so the effect is significant at the .01 level.

4.1.2. The Wald Test

The hypothesis Qβ = r can be tested with the Wald statistic:

    W = [Qβ̂ − r]′ [Q Var(β̂) Q′]⁻¹ [Qβ̂ − r]        [4.3]

W is distributed as chi-square with degrees of freedom equal to the number of constraints (i.e., the number of rows of Q). The Wald statistic consists of two components. Qβ̂ − r at each end of the formula measures the distance between the estimated and hypothesized values. Q Var(β̂) Q′ reflects the variability in the estimator, or, alternatively, the curvature of the likelihood function. To see this more clearly, consider a simple example. For the model Pr(y = 1 | x) = Φ(β₀ + β₁x₁ + β₂x₂) with H₀: β₁ = β*, Qβ̂ − r can be written as

    Qβ̂ − r = (0 1 0)(β̂₀ β̂₁ β̂₂)′ − β* = β̂₁ − β*        [4.4]

The Wald statistic is then

    W = (β̂₁ − β*)² / Var(β̂₁)

which squares the distance between the estimate and the hypothesized value and standardizes it by the variance of the estimate. W is distributed as chi-square with 1 degree of freedom if H₀ is true. Notice that W is the square of the z-statistic in Equation 4.1, which corresponds to a chi-square variable with 1 degree of freedom being equal to the square of a normal variable. Some programs, such as SAS, present a single degree of freedom chi-square statistic for individual coefficients, rather than the z-statistic.

The same ideas apply to more complex hypotheses. Consider H₀: β₁ = β₂ = 0, which can be written as

    H₀: ( 0 1 0 0 ) (β₀ β₁ β₂ β₃)′ = ( 0 )
        ( 0 0 1 0 )                  ( 0 )

Qβ̂ − r is simply (β̂₁ β̂₂)′. The middle portion of the Wald formula standardizes these distances by their covariance matrix. To keep the example simple, assume that the estimates are uncorrelated. (In practice, the estimates will be correlated.) Then

    W = β̂₁²/Var(β̂₁) + β̂₂²/Var(β̂₂)

With uncorrelated estimates, the Wald statistic is the sum of squared z's. Recall that a chi-square distribution with J degrees of freedom is defined as the sum of J independent, squared normal random variables. When the estimates are correlated, which is normally the case, the resulting formula is more complicated, but the general ideas are the same.

Example of the Wald Test: Labor Force Participation

To illustrate the Wald test, consider the logit model:

    Pr(LFP = 1) = Λ(β₀ + β₁K5 + β₂K618 + β₃AGE + β₄WC + β₅HC + β₆LWG + β₇INC)        [4.5]

Wald Test That a Single Coefficient Is 0. To test H₀: β₁ = 0, let

    Q = (0 1 0 0 0 0 0 0)  and  r = (0)

Then W = 55.14, which is the square of the z-statistic for K5 in Table 3.3. We describe the result as:

• The effect of young children on the probability of entering the labor force is significant at the .01 level (W = 55.14, df = 1, p < .01).

The z-test is often used rather than W since, for a single coefficient, W is simply the square of the z-statistic and has a chi-square distribution with 1 degree of freedom.

Wald Test That Two Coefficients Are 0. The hypothesis that the effects of the husband's and wife's education are simultaneously 0 can be written as H₀: β₄ = β₅ = 0. To test this hypothesis, let

    Q = ( 0 0 0 0 1 0 0 0 )  and  r = ( 0 )
        ( 0 0 0 0 0 1 0 0 )           ( 0 )

Then W = 17.66 with 2 degrees of freedom. We conclude:

• The hypothesis that the effects of the husband's and wife's education are simultaneously equal to 0 can be rejected at the .01 level (X² = 17.66, df = 2, p < .01).

How would you define Q and r to test the hypothesis that all of the coefficients except the intercept are 0?

Wald Test That Two Coefficients Are Equal. To test that the effect of the husband's education equals the effect of the wife's education, define

    Q = (0 0 0 0 1 −1 0 0)  and  r = (0)

Substituting these matrices into Equation 4.3 and simplifying results in the usual formula for testing the equality of two coefficients. Then W = 3.54 with 1 degree of freedom. There is 1 degree of freedom since there is a single restriction, even though that restriction involves two parameters. We conclude:

• The hypothesis that the effects of the husband's and wife's education are equal is marginally significant at the .05 level (W = 3.54, df = 1, p = .06).
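Equation 4.3 reduces to simple matrix arithmetic once β̂ and Var(β̂) are in hand. The sketch below (every number is hypothetical) computes W for a two-constraint hypothesis H₀: β₁ = β₂ = 0 with an explicit 2 × 2 inverse.

```python
# Sketch of Equation 4.3 for two constraints, H0: beta_1 = beta_2 = 0, with a
# made-up estimate vector and covariance matrix (all numbers hypothetical).
# W = (Q b - r)' [Q V Q']^{-1} (Q b - r), chi-square with rows(Q) df under H0.
b = [0.2, 1.1, -0.6]                            # (b0, b1, b2)
V = [[0.050, 0.010, 0.000],
     [0.010, 0.040, 0.005],
     [0.000, 0.005, 0.030]]
Q = [[0, 1, 0],
     [0, 0, 1]]
r = [0.0, 0.0]

d = [sum(q[j] * b[j] for j in range(3)) - ri for q, ri in zip(Q, r)]  # Qb - r
# M = Q V Q' (2x2), then invert it explicitly
M = [[sum(Q[a][i] * V[i][j] * Q[c][j] for i in range(3) for j in range(3))
      for c in range(2)] for a in range(2)]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
Minv = [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]
W = sum(d[a] * Minv[a][c] * d[c] for a in range(2) for c in range(2))
print(round(W, 3))    # compare with the chi-square(2) critical value of 5.99
```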
4.1.3. The Likelihood Ratio Test
Then W which is the square of the z-statistic for K5 in Thble
3.3. We describe the result as: The LR test can also be used to test constraints on a model. While in
its most general form these constraints can be eomplex and nonlinear, 1
• The effee! of young ehildren on the probability of enteríng the labor only consider constraints that involve eliminating one or more regressors
force is s¡grul1eant at the .01 level 55.14, 1, P < .01). from the model. For example, consider the logit models:
The is ofren used rather than W since the Wald statistic has MI: Pr(y = 11 x) = A(/3o + /31 Xl + /32 X2)
a dístribution.
M2 : Pr(y = 11 x) = A(/3o + /3¡x¡ + /32 x 2 + (33 X 3)
Wald Test That Two Are O. The hypothesis that the effects M 3: Pr(y = 11 x) A(/3o + (3¡x¡ + (32 x 2 + /34 X 4)
of the husband's and wife's education are simultaneously Ocan be written
M4 : Pr(y = 11 x) = A(/3o + /3¡x¡ + + /33 X 3 + /34 X 4)
as: = O. 1b test this hypothesis, let
Model MI is formcd from M2 by imposing the constraint (33 = O, and
Q
'00001000)
( 00000100 and r = (~) M is formed from M3 by imposing the constraint /34 = O. When one
m~del can be obtained from another model by imposing constraints, the
constrained model is said to be nested in the unconstrained model. Thus,
Then W 17.66 with 2 of freedom. We conclude: M¡ is nested in M2 and in M3 . However, M2 is not nested in M3 , nor is
fwnoth,'~í< that the effeets of the husband's and wife's education are
M3 nested in M2 . (Which models are nested in M4?) .
sIrrmll'an,em¡slv enmll to Ocan be at the .00level (X 2 := 17.66, df = 2, The LR test is defined as follows. The constrained model Me wlth
p parameters ~e is nested in the unconstrained model M u with parameters
~u. The nuIl hypothesis is that the constraints ímposed to create Me are
Q and r to test the hypothesis that al! of the coefficients except the true. Let L(Mu) be the value of the likelihood function evaluated at the
inf.,Yrrpnf are ML estimates for the uneonstrained model, and let L(Me) be the value
95
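As a minimal numerical sketch of the Wald machinery above, the one-restriction cases reduce to scalar formulas. The coefficient estimates, variances, and covariance below are hypothetical illustrations, not the book's values:

```python
def wald_single(beta_hat, var_beta):
    # W = (Qb - r)' [Q V Q']^{-1} (Qb - r) with Q selecting one coefficient
    # reduces to beta^2 / Var(beta), the square of the z-statistic.
    return beta_hat ** 2 / var_beta

def wald_equality(b_j, b_k, var_j, var_k, cov_jk):
    # One restriction H0: b_j = b_k, i.e., Q = (... 1 -1 ...), r = (0):
    # W = (b_j - b_k)^2 / [Var(b_j) + Var(b_k) - 2 Cov(b_j, b_k)].
    return (b_j - b_k) ** 2 / (var_j + var_k - 2 * cov_jk)

# Hypothetical estimates and (co)variances:
print(wald_single(-1.46, 0.20 ** 2))    # compare to the chi2(1) critical value 3.84
print(wald_equality(0.80, 0.55, 0.23 ** 2, 0.22 ** 2, 0.01))
```

Each statistic is referred to a chi-square distribution with degrees of freedom equal to the number of restrictions.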
REGRESSION MODELS    Hypothesis Testing and Goodness of Fit
An LR Test That a Single Coefficient Is 0. To test H₀: β₁ = 0, the model M[K5] is estimated, where the subscript indicates that K5 is excluded from the unconstrained model. The LR chi-square and deviance for the constrained model are

G²(M[K5]) = 58.00  and  D(M[K5]) = 971.75

The test statistic is

G² = G²(Mu) − G²(M[K5]) = 66.48
   = D(M[K5]) − D(Mu) = 66.48

We conclude:

• The effect of young children is significant at the .01 level (LRX² = 66.5, df = 1, p < .01).

Note that I have used LRX² rather than G² in presenting the result. This makes it clear that a likelihood ratio test is being reported.

An LR Test That Two Coefficients Are 0. To test the hypothesis that the effects of the husband's and wife's education are simultaneously 0, H₀: β₄ = β₅ = 0, the model M[WC,HC] is estimated, resulting in

G²(M[WC,HC]) = 105.98  and  D(M[WC,HC]) = 923.76

The test statistic is

G² = G²(Mu) − G²(M[WC,HC]) = 18.50
   = D(M[WC,HC]) − D(Mu) = 18.50

We conclude:

• The hypothesis that the effects of the husband's and wife's education are simultaneously equal to 0 can be rejected at the .01 level (LRX² = 18.5, df = 2, p < .01).

An LR Test That All Coefficients Are 0. G²(Mu) = G²(Mα|Mu) can be used to test the hypothesis that none of the regressors affects the probability of entering the labor force. Formally, H₀: β₁ = β₂ = ⋯ = β₇ = 0. We conclude:

• We reject the hypothesis that all coefficients except the intercept are 0 (LRX² = 124.5, df = 7, p < .01).

While a Wald test could be used to test this hypothesis, the LR test is more commonly used.

TABLE 4.1 Comparing Results From the LR and Wald Tests

                          LR Test                Wald Test
Hypothesis          df    G²        p            W        p
β₁ = 0               1    66.5      < 0.01       55.1     < 0.01
β₄ = β₅ = 0          2    18.5      < 0.01       17.7     < 0.01
All slopes = 0       7    124.5     < 0.01       95.0     < 0.01

4.1.4. Comparing the LR and Wald Tests

Even though the LR and Wald tests are asymptotically equivalent, in finite samples they give different answers, particularly for small samples. In general, it is unclear whether one test is to be preferred to the other. Rothenberg (1984) suggests that neither test is uniformly superior, while Hauck and Donner (1977) suggest that the Wald test is less powerful than the LR test. In practice, the choice of which test to use is often determined by convenience. While the LR test requires the estimation of two models, the computation of the test only involves subtraction. The Wald test only requires estimation of a single model, but the computation of the test involves matrix manipulations. Which test is more convenient depends on the software being used.

Table 4.1 compares the results of the LR and Wald tests for our example based on a sample of 753. For all hypotheses, the conclusions from both tests are the same. Note, however, that the values of the LR statistics are larger than the corresponding Wald statistics.

4.1.5. Computational Issues

There are two important computational considerations that must be taken into account when computing Wald and LR tests. If they are not, you run the risk of drawing the wrong conclusions from your tests.

Computing the LR Test

The LR test requires using the same sample for all models being compared. Since ML estimation excludes cases with missing data, it is common for the sample size to change when a variable has been excluded. For example, if x₁ has three missing observations that are not missing for any other variables, the usable sample increases by 3 when x₁ is excluded from the model. To ensure that the sample size does not change, you should construct a data set that excludes every observation that has missing values for any of the variables used in any of the models being tested. Alternatively, missing values can be imputed using methods discussed in Little and Rubin (1987).

Computing the Wald Test

The matrix computations for the Wald test can accumulate appreciable rounding error if you do not use the full precision of the estimated coefficients and covariance matrix. Practically speaking, this means that you should use a program in which the estimates can be stored at full precision and then used to compute the test; using the rounded values listed in the output can result in incorrect values for the test statistic.

4.2. Residuals and Influence

When assessing a model, it is useful to consider how well the model fits each case and how much influence each case has on the estimates of the parameters. Residuals measure the difference between the model's prediction for a case and the observed value for that case, with cases that fit poorly thought of as outliers. Influence is the effect of an observation on the estimates of the model's parameters or measures of fit. The analysis of residuals and influence is well developed for the LRM, and I assume that you have some familiarity with this material (see … , 1980, Chapter 5, for good introductions). This section considers Pregibon's (1981) extensions of these methods to the binary response model.

The Pearson residual divides the raw residual by its approximate standard deviation, rᵢ = (yᵢ − π̂ᵢ)/√(π̂ᵢ(1 − π̂ᵢ)), and summing the squared residuals produces the Pearson statistic X² = Σᵢ rᵢ². While X² is sometimes reported as having a chi-square distribution, McCullagh (1986) demonstrated that when the data are sparse (e.g., when there are continuous independent variables), X² has an asymptotic normal distribution with a mean and variance that are difficult to compute. McCullagh and Nelder (1989, pp. 112-122) recommended that X² not be used as an absolute measure of fit. Hosmer and Lemeshow (1989, pp. 140-145) propose an alternative test constructed by grouping data that can be used with sparse data.

While Var(yᵢ − πᵢ) = πᵢ(1 − πᵢ), Var(yᵢ − π̂ᵢ) ≠ π̂ᵢ(1 − π̂ᵢ). Consequently, the variance of rᵢ is not 1. To compute the variance of the estimated residuals, we need what is known as the hat matrix, so named because it transforms the observed y into ŷ in the LRM. For the BRM, Pregibon (1981) derived the hat matrix:

H = V^(1/2) X (X′VX)⁻¹ X′ V^(1/2)

where V is a diagonal matrix with π̂ᵢ(1 − π̂ᵢ) on the diagonal. Since only the diagonal of H is needed, we can use the computationally simpler formula:

hᵢᵢ = π̂ᵢ(1 − π̂ᵢ) xᵢ V̂ar(β̂) xᵢ′

where xᵢ is a row vector with values of the independent variables for the ith observation and V̂ar(β̂) is the estimated covariance of the ML estimator β̂. Using 1 − hᵢᵢ to estimate the variance of rᵢ, the standardized Pearson residual is

rᵢˢᵗᵈ = rᵢ / √(1 − hᵢᵢ)

A large value of DFBETAᵢₖ indicates that the ith observation has a large influence on the estimate of βₖ. A second measure summarizes the effect of removing the ith observation.
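Pregibon's diagonal formula and the standardized Pearson residual can be computed directly. The fitted probability, covariate row, and covariance matrix below are made-up values for illustration:

```python
import math

def pearson_residual(y, p):
    # r_i = (y_i - pi_i) / sqrt(pi_i (1 - pi_i))
    return (y - p) / math.sqrt(p * (1 - p))

def hat_diag(p, x_row, cov_beta):
    # h_ii = pi(1 - pi) * x_i Var(beta) x_i'
    k = len(x_row)
    quad = sum(x_row[j] * cov_beta[j][l] * x_row[l]
               for j in range(k) for l in range(k))
    return p * (1 - p) * quad

def std_pearson(y, p, h):
    # r_i / sqrt(1 - h_ii)
    return pearson_residual(y, p) / math.sqrt(1 - h)

p, x_row = 0.8, [1.0, 2.0]              # hypothetical fitted probability and x_i
cov_beta = [[0.5, -0.1], [-0.1, 0.2]]   # hypothetical Var(beta-hat)
h = hat_diag(p, x_row, cov_beta)
print(h, std_pearson(1, p, h))
```

In practice the covariance matrix comes from the ML fit; the point of the sketch is only the arithmetic of the diagonal formula.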
4.3. Scalar Measures of Fit

In addition to assessing the fit of individual observations, it is sometimes useful to have a single number to summarize the overall goodness of fit of a model. Such a measure might aid in comparing competing models and in selecting a final model. Within a substantive area, measures of fit can provide a rough index of whether a model is adequate. For example, if prior models of labor force participation routinely have values of .4 for a given measure of fit, you would expect that a new analysis with a different sample and perhaps with revised measures of the variables would result in a similar value for that measure. Substantially larger or smaller values would suggest the need to reassess the choices made in the new study.

While the desirability of a scalar measure of fit is clear, in practice their use is problematic. First, I am unaware of convincing evidence that selecting a model that maximizes the value of a given measure of fit results in a model that is optimal in any sense other than having a larger value of that measure. While measures of fit provide some information, it is only partial information that must be assessed within the context of the theory motivating the analysis, past research, and the estimated parameters of the model being considered. Second, while in the LRM the coefficient of determination R² is the standard measure of fit, there is no clear choice for models with categorical outcomes. There have been numerous attempts to construct a counterpart to R² in the LRM, but no one measure is clearly superior and none has the advantage of a clear interpretation in terms of explained variation. Other measures have been constructed based on the ability of a model to predict the observed outcome. Finally, the Bayesian measures AIC and BIC, which are useful for comparing nonnested models, are increasingly popular. Thus, while I approach scalar measures of fit with some skepticism, their popularity and proliferation make a review useful.

4.3.1. R² in the LRM

While the definitions of R² that follow are equivalent in the LRM, for CLDVs they often produce different values and thus provide different measures of fit. Let the structural model be y = xβ + ε, with K regressors, an intercept, and N observations. The expected value of y is ŷ = xβ̂, where β̂ is the OLS estimator. The coefficient of determination can be defined in each of the following ways. Derivations of these formulas can be found in Judge et al. (1985, pp. 29-31), Goldberger (1991, pp. 176-179), and Pindyck and Rubinfeld (1991, pp. 61, 76-78, 98-99).

The Percentage of Explained Variation. Let RSS = Σᵢ(yᵢ − ŷᵢ)² be the sum of squared residuals, and let TSS = Σᵢ(yᵢ − ȳ)² be the total sum of squares. Then R² is the percentage of TSS explained by the x's:

R² = (TSS − RSS) / TSS = 1 − RSS/TSS   [4.7]

The Ratio of Var(ŷ) and Var(y). The ratio of the variances of ŷ and y is another definition:

R² = Var(ŷ) / Var(y)   [4.8]

A Transformation of the Likelihood Ratio. If the errors are assumed to be normal, then R² can be written as

R² = 1 − [L(Mα) / L(Mβ)]^(2/N)   [4.9]

where L(Mα) is the likelihood for the model with just the intercept, and L(Mβ) is the likelihood for the model including the regressors.

A Transformation of the F-Test. The hypothesis H₀: β₁ = ⋯ = βK = 0 can be tested using an F-test, with the test statistic F. R² can be written in terms of F as

R² = KF / (KF + N − K − 1)
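On a tiny made-up OLS fit, Equations 4.7 and 4.8 can be checked to agree, as they must in the LRM with an intercept:

```python
# Simple bivariate OLS on invented data
x = [1, 2, 3, 4, 5]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
     / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx
yhat = [b0 + b1 * xi for xi in x]

rss = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # residual sum of squares
tss = sum((yi - my) ** 2 for yi in y)                  # total sum of squares
r2_explained = 1 - rss / tss                                         # Eq. 4.7
r2_var_ratio = (sum((yh - my) ** 2 for yh in yhat) / n) / (tss / n)  # Eq. 4.8
print(r2_explained, r2_var_ratio)   # identical in OLS with an intercept
```

For categorical outcome models the analogous quantities no longer coincide, which is the point of the passage above.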
where ln L(Mβ) is the log likelihood for the full model and ln L(Mα) is the log likelihood for the model with only the intercept. As the fit of Mβ approaches the fit of Mα [i.e., as L(Mβ) → L(Mα)], the measure approaches 0. Maddala (1983, pp. 39-40) shows that it reaches a maximum of 1 − L(Mα)^(2/N). This led Cragg and Uhler (1970) to suggest the normed measure:

R² = [1 − [L(Mα)/L(Mβ)]^(2/N)] / [1 − L(Mα)^(2/N)]

Since both measures are defined in terms of the likelihood function, they can be applied to any model estimated by ML.

An Example: Labor Force Participation

To illustrate scalar measures of fit, consider two models. Model M₁ has the original set of independent variables: K5, K618, AGE, WC, HC, LWG, and INC. Model M₂ adds a squared age term AGE2 and drops the variables WC, HC, and LWG. The resulting measures of fit for the LPM and logit models are given in Table 4.2. Notice that for a given model many of the measures are identical for the LPM but not for the logit model. (You should try to reproduce these measures using the likelihoods for the full and restricted models.)

4.3.3. Pseudo-R²'s Using Observed Versus Predicted Values

Another approach to goodness of fit in models with categorical outcomes is to compare the observed values to the predicted values. While I develop this idea for models with two outcomes, it can be easily extended to models with J ordinal or nominal outcomes. The predicted outcome is taken to be the outcome with the highest predicted probability, which Cramer (1991, p. 90) calls the "maximum probability rule." This allows us to construct a table of observed and predicted values, such as Table 4.3, which is sometimes called a classification table.

TABLE 4.3 Classification Table of Observed and Predicted Outcomes for a Binary Response Model

                        Predicted Outcome
Observed Outcome        ŷ = 1             ŷ = 0             Row Total
y = 1                   n₁₁: correct      n₁₂: incorrect    n₁₊
y = 0                   n₂₁: incorrect    n₂₂: correct      n₂₊
Column Total            n₊₁               n₊₂               N

The Count R². A simple and seemingly appealing measure based on the table of observed and expected counts is the proportion of correct predictions, which Maddala (1992, p. 334) refers to as the count R²:

R²Count = (1/N) Σⱼ nⱼⱼ

where the nⱼⱼ's are the numbers of correct predictions for each outcome j, which are located in the diagonal cells of Table 4.3.

The Adjusted Count R². The count R² can give the faulty impression that the model is predicting very well when, in fact, it is not. In a binary model, without knowledge about the independent variables it is possible to correctly predict at least 50% of the cases by choosing the outcome category with the largest percentage of observed cases. For example, 57% of our sample were in the paid labor force. If we predict that all women are working, we would be correct 57% of the time.
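The count R² and its adjusted version follow directly from a classification table. The 2×2 table below is invented for illustration (its margins mimic a 753-case sample):

```python
def count_r2(table):
    # proportion of cases on the diagonal of the classification table
    n = sum(sum(row) for row in table)
    return sum(table[j][j] for j in range(len(table))) / n

def adjusted_count_r2(table):
    # proportion of correct predictions beyond always guessing the modal outcome
    n = sum(sum(row) for row in table)
    correct = sum(table[j][j] for j in range(len(table)))
    modal = max(sum(row) for row in table)   # largest observed row margin
    return (correct - modal) / (n - modal)

table = [[300, 128],   # observed y=1: n11 correct, n12 incorrect (hypothetical)
         [105, 220]]   # observed y=0: n21 incorrect, n22 correct (hypothetical)
print(count_r2(table), adjusted_count_r2(table))
```

The adjusted version can be much smaller than the raw count R², which is exactly the caution raised in the text.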
TABLE 4.4 Observed and Predicted Outcomes for the Logit Model of Labor Force Participation
[Most cell entries were lost in reproduction. Recoverable marginals: column totals 266 and 487, N = 753; row percentages 20.1/79.9 and 35.3/64.7.]

Table 4.4 shows the predicted values from the logit model with K5, K618, AGE, WC, HC, LWG, and INC. The marginals indicate the number of cases of a given outcome that were predicted to be either 1's or 0's. They show that the model is better at predicting 1's than 0's. Overall, the count R² is

R²Count = … / 753 = .69

which can be interpreted as: 69% of the cases were predicted correctly, including … of the cases that were observed as 1's. On the other hand, the adjusted count R² is …, indicating that knowledge of the independent variables reduces the errors in prediction by …%.

Akaike's (1973) information criterion is defined as

AIC = [−2 ln L(Mβ) + 2P] / N   [4.12]

where L(Mβ) is the likelihood of the model and P is the number of parameters in the model (e.g., K + 1 in the binary regression model, where K is the number of regressors). While Akaike (1973) formally derives AIC through the comparison of a given model to a set of inferior alternative models, here I only provide a heuristic motivation for the reasonableness of the formula.

L(Mβ) indicates the likelihood of the data for the model, with larger values indicating a better fit. −2 ln L(Mβ) ranges from 0 to +∞ with smaller values indicating a better fit. As the number of parameters in the model becomes larger, −2 ln L(Mβ) becomes smaller since more parameters make what is observed more likely. 2P is added to −2 ln L(Mβ) as a penalty for increasing the number of parameters. Since the number of observations affects −2 ln L(Mβ), we divide by N to obtain the average value.
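Equation 4.12 is a one-line computation. The log likelihood, parameter count, and sample size below are hypothetical:

```python
def aic(log_lik, n_params, n_obs):
    # AIC = [-2 ln L(M) + 2 P] / N   (Equation 4.12)
    return (-2.0 * log_lik + 2.0 * n_params) / n_obs

print(aic(log_lik=-452.6, n_params=8, n_obs=753))
```

Smaller values indicate a better fit after penalizing for the number of parameters.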
All else being equal, smaller values of AIC indicate a better-fitting model. AIC can be used to compare models across different samples or to compare nonnested models that cannot be compared with the LR test. In each case, the model with the smaller AIC is considered the better-fitting model.

The Bayesian Information Criterion

The Bayesian information criterion has been advocated by Raftery and is increasingly cited in the literature as a measure to assess the overall fit of a model and to allow the comparison of both nested and nonnested models. My presentation follows Raftery (1996), which derives the formulas for a variety of models. Consider two models, M₁ and M₂. The posterior odds of M₂ relative to M₁ equal Pr(M₂ | data)/Pr(M₁ | data). If the prior odds Pr(M₂)/Pr(M₁) of the two models equal 1 (we have no preference for one model over the other), Bayes theorem can be used to show that the posterior odds equal the ratio of the probabilities of the observed data under the two models; if the probability of the observed data is greater under M₂, then M₂ would be preferred.

The BIC statistic for model Mk is defined as

BICk = D(Mk) − dfk ln N   [4.13]

where D(Mk) is the deviance of the model and dfk is the degrees of freedom associated with the deviance. BICk compares Mk to the saturated model. Since the deviance of the saturated model is 0 (Why must this be the case?), the saturated model is preferred when BICk > 0. When BICk < 0, Mk is preferred; the more negative the BICk, the better the fit.

A second version of BIC is based on the LR chi-square in Equation 4.6, with df′k equal to the number of regressors (not parameters) in the model:

BIC′k = −G²(Mk) + df′k ln N   [4.14]

BIC′k compares Mk to the null model without any regressors, for which BIC′ is 0. The null model is preferred when BIC′k > 0, suggesting that Mk includes too many parameters or variables. When BIC′k < 0, then Mk is preferred; the more negative the BIC′k, the better the fit. Basically, BIC′k assesses whether Mk fits the data well enough relative to the number of parameters that are used.

Either BICk or BIC′k can be used to compare models, whether or not they are nested. Raftery (1996) shows that

2 ln(posterior odds of M₂ relative to M₁) ≈ BIC₁ − BIC₂   [4.15]

Thus, the difference in the BICs from two models indicates which model is more likely to have generated the observed data. Further, it can be shown that the two versions produce the same difference: BIC₁ − BIC₂ = BIC′₁ − BIC′₂.

To see the link between BIC and other measures of fit, consider the formula ( … , p. 19) for computing BIC′ in the LRM, which follows from Equations 4.9 and 4.14:

BIC′ = N ln(1 − R²) + K ln N

Conclusions

The analyses of outliers and influential observations apply only to models with binary outcomes. While some of the scalar measures of goodness of fit are only appropriate for models with binary outcomes, others apply with minor adjustments to any model estimated with ML.
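Equations 4.13 and 4.14 can be sketched as follows; the deviance, chi-square, and degrees of freedom below are hypothetical inputs:

```python
import math

def bic(deviance, df_resid, n_obs):
    # BIC_k = D(M_k) - df_k ln N   (Equation 4.13)
    return deviance - df_resid * math.log(n_obs)

def bic_prime(g2, n_regressors, n_obs):
    # BIC'_k = -G2(M_k) + K ln N   (Equation 4.14)
    return -g2 + n_regressors * math.log(n_obs)

# The null model has G2 = 0 and no regressors, so BIC' = 0:
print(bic_prime(0.0, 0, 753))
# Hypothetical comparison: the more negative BIC' marks the preferred model
print(bic_prime(124.5, 7, 753), bic_prime(110.2, 5, 753))
```

Because differences in BIC and BIC′ coincide, either version can be used when comparing two models.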
Examples of AIC and BIC Measures: Labor Force Participation

To illustrate the AIC and BIC measures, the logit model M₁ with the variables K5, K618, AGE, WC, HC, LWG, and INC, and the model M₂, which adds a squared age term AGE2 and drops WC, HC, and LWG, were estimated. Table 4.6 contains the measures along with the quantities that are used to compute them. (You should verify the listed statistics using the formulas given above.) The difference in the BICs,

BIC₁ − BIC₂ = BIC′₁ − BIC′₂ = −4.79

indicates that the evidence favoring M₁ over M₂ is positive but not strong.

4.5. Bibliographic Notes

The tests presented in this chapter have a long history. R. A. Fisher introduced the LR test in the 1920s, and A. Wald proposed the Wald test in the 1940s. Further details on these tests can be found in most econometrics texts. Godfrey (1988, pp. 8-20) and Cramer (1986, pp. 30-42) contain thorough discussions of the foundations of these tests. Buse (1982) provides an informative interpretation. Maddala (1992, pp. 118-124) presents an accessible discussion within the context of the linear regression model. Regression diagnostics for the binary response model were developed by Pregibon (1981). Amemiya (1981) and Windmeijer (1995) have reviews of measures of fit. Hosmer and Lemeshow (1989, Chapter 5) provide further details on diagnostics and tests of fit. The AIC was proposed by Akaike (1973). The BIC has been advocated by Raftery in a series of papers summarized in Raftery (1996), developed from Schwarz (1978) and Jeffreys (1961). See Judge et al. (1985, pp. 870-875) for a discussion of these and related measures.
[Figure 5.1 about here: Panel A plots the latent variable y* against x. The solid line graphs the latent regression; the thresholds are indicated by the horizontal lines marked τ₁, τ₂, and τ₃. The values of the observed variable y (1 through 4) over the range of y* are marked below with a dotted line.]

The ORM links an underlying continuous variable to the observed ordinal variable. I begin with this view of the model.
As with the binary response model, the structural model is y* = xβ + ε. Applying the LRM directly to the ordinal y is reasonable only if the thresholds are all about the same distance apart. When this is not the case, the LRM can give very misleading results.

Figure 5.1 also illustrates an important property of the ORM. In panel A, you could add another category without changing the structural model. Imagine adding a horizontal line between τ₁ and τ₂. This would correspond to adding another category to the ordinal scale, such as a "Neutral" category. The regression line for y* on x would not be affected. In the LRM, by contrast, the new category would correspond to a new horizontal row of observations, which would affect the results of the regression of y on x.

For the ordered probit model, ε has a standard normal distribution, with cdf

Φ(ε) = ∫₋∞^ε (1/√(2π)) exp(−t²/2) dt

For the ordered logit model, ε has a logistic distribution with a mean of 0 and a variance of π²/3. The pdf is

λ(ε) = exp(ε) / [1 + exp(ε)]²

and the cdf is

Λ(ε) = exp(ε) / [1 + exp(ε)]
Translating the latent variable model into the probabilities of the observed outcomes gives

Pr(yᵢ = 1 | xᵢ) = Φ(τ₁ − α − βxᵢ)
Pr(yᵢ = 2 | xᵢ) = Φ(τ₂ − α − βxᵢ) − Φ(τ₁ − α − βxᵢ)
Pr(yᵢ = 3 | xᵢ) = Φ(τ₃ − α − βxᵢ) − Φ(τ₂ − α − βxᵢ)
Pr(yᵢ = 4 | xᵢ) = 1 − Φ(τ₃ − α − βxᵢ)

For example, the predicted probabilities at three values of x are:

Predicted Probability    x = …    x = 40    x = 80
Pr(y = 1 | x)            0.68     0.20      0.00
Pr(y = 2 | x)            0.32     0.77      0.44
Pr(y = 3 | x)            0.00     0.03      0.47
Pr(y = 4 | x)            0.00     0.00      0.09

Figure 5.2. [The Ordered Model: the latent y* and the implied probabilities of the observed outcomes.]
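The outcome probabilities are differences of the cdf evaluated at successive thresholds. A minimal sketch for the ordered logit case, with arbitrary cutpoints and index value (not the book's estimates):

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def ordered_probs(xb, cutpoints, cdf=logistic_cdf):
    # Pr(y = m | x) = F(tau_m - xb) - F(tau_{m-1} - xb),
    # with F(tau_0 - xb) = 0 and F(tau_J - xb) = 1.
    F = [0.0] + [cdf(t - xb) for t in cutpoints] + [1.0]
    return [F[m] - F[m - 1] for m in range(1, len(F))]

probs = ordered_probs(xb=0.5, cutpoints=[-1.0, 0.4, 2.0])
print(probs)   # one probability per outcome category; the values sum to 1
```

Substituting a normal cdf for `logistic_cdf` gives the ordered probit version of the same calculation.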
5.3. Estimation

Let β be the vector containing the parameters from the structural model, with the intercept β₀ in the first row, and let τ be the vector containing the threshold parameters. Either β₀ or τ₁ is constrained to 0 to identify the model. From Equation 5.3, the probability of outcome m is

Pr(yᵢ = m | xᵢ, β, τ) = F(τₘ − xᵢβ) − F(τₘ₋₁ − xᵢβ)

and the likelihood is the product of these probabilities over the sample, maximized numerically. [Equations 5.4 through 5.8 could not be recovered from the source.]

Different programs should produce the same estimates of the parameters up to five significant digits, but the standard errors and test statistics can differ substantially, especially with small samples or with ill-conditioned data.
( … , pp. …) presents the gradients used in estimation and reviews Pratt's (1981) result that Newton-Raphson will converge to the global maximum of the likelihood, with estimates that are consistent and asymptotically normal.

5.3.1. Software Issues

There are several issues related to software that should be considered when estimating the ORM.

Parameterization of the Model. The most important issue is knowing which parameterization your program uses. Programs such as LIMDEP assume that τ₁ = 0 and estimate β₀, while programs such as Markov, SAS's LOGISTIC, and Stata assume that β₀ = 0 and estimate τ₁.

5.3.2. Example of the ORM and the LRM: Attitudes Toward Working Mothers

In 1977 and 1989, the General Social Survey asked respondents to evaluate the following statement: "A working mother can establish just as warm and secure a relationship with her children as a mother who does not work." Responses were coded in the variable WARM as: 1 = Strongly Disagree (SD); 2 = Disagree (D); 3 = Agree (A); and 4 = Strongly Agree (SA). With a sample of …, the marginal percentages are 13, 32, 37, and 18, respectively. The variables used in our analysis are described in Table 5.1. See Clogg and Shihadeh (1994, pp. 158-…) for an alternative analysis of the same data.

Table 5.2 contains the estimates from four models. Column 1 contains OLS estimates for the LRM:

WARM = β₀ + β₁YR89 + β₂MALE + β₃WHITE + β₄AGE + β₅ED + β₆PRST + ε

TABLE 5.1 Descriptions of the Variables
[Recoverable definitions: YR89 = 1 if surveyed in 1989, 0 if 1977; MALE = 1 if male, 0 if female; WHITE = 1 if white, 0 if nonwhite; AGE = age in years; ED = years of education; PRST = occupational prestige. The means and standard deviations were lost in reproduction.]
The two parameterizations of the ORM produce identical slope estimates, while the estimates of the intercept and thresholds differ, reflecting the different identifying constraints on the intercept and first threshold in the model.

If the idea of a continuous, latent variable makes substantive sense, simple interpretations are possible by rescaling the latent variable to a unit variance and computing y*-standardized and fully standardized coefficients. When concern is with the observed categories, regardless of whether a latent variable is assumed, methods from the binary response model can be extended to the case of multiple outcomes: predicted probabilities of the observed outcomes can be presented in tables or plots, partial and discrete changes in probabilities can be computed, and for the ordered logit model the coefficients can be interpreted in terms of odds ratios.

The variance of the latent variable can be estimated as

Var(ŷ*) = β̂′ V̂ar(x) β̂ + Var(ε)   [5.9]

where V̂ar(x) is the covariance matrix for the x's computed from the observed data; β̂ contains the ML estimates; and Var(ε) = 1 in the ordered probit model and Var(ε) = π²/3 in the ordered logit model. The y*-standardized coefficient for xₖ divides β̂ₖ by the estimated standard deviation of y*, and the fully standardized coefficient further multiplies by the standard deviation of xₖ.

The coefficients in Table 5.3 were computed from the coefficients in Table 5.2 and the descriptive statistics from Table 5.1. The variance of y* was estimated using Equation 5.9: 3.77 for the ordered logit model and 1.16 for the ordered probit model. Notice that 3.77/1.16 = 3.25, which is close to the ratio of the assumed error variances: Var(ε_logit)/Var(ε_probit) = π²/3 ≈ 3.29. The difference in the variance of y* for the two models is reflected in the magnitudes of the unstandardized β's, where the coefficients from the ordered logit model are 1.6 to 1.8 times larger than those for the ordered probit model. The fully standardized and y*-standardized coefficients are nearly identical across models since the scale of y* has been eliminated by dividing by its standard deviation. (Why are they not exactly equal?)

TABLE 5.3 Unstandardized, y*-Standardized, and Fully Standardized Coefficients for the Ordered Logit and Ordered Probit Models
[Rows for YR89, MALE, WHITE, AGE, ED, and PRST; most numeric entries were lost in reproduction.]
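Equation 5.9 and the two standardizations can be sketched directly. The slope vector and covariance matrix of the x's below are invented; the error variance is the ordered logit's π²/3:

```python
import math

def var_ystar(beta, cov_x, var_eps):
    # Var(y*) = b' Var(x) b + Var(eps)   (Equation 5.9)
    k = len(beta)
    quad = sum(beta[i] * cov_x[i][j] * beta[j]
               for i in range(k) for j in range(k))
    return quad + var_eps

def ystar_standardized(beta, sd_ystar):
    return [b / sd_ystar for b in beta]

def fully_standardized(beta, sd_x, sd_ystar):
    return [b * s / sd_ystar for b, s in zip(beta, sd_x)]

beta = [0.5, -0.2]                   # hypothetical ML slopes
cov_x = [[1.0, 0.0], [0.0, 4.0]]     # hypothetical Var(x)
sd = math.sqrt(var_ystar(beta, cov_x, math.pi ** 2 / 3))
print(ystar_standardized(beta, sd), fully_standardized(beta, [1.0, 2.0], sd))
```

Replacing π²/3 with 1 gives the ordered probit rescaling, which is why the standardized coefficients are nearly invariant across the two models.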
The y*-standardized and fully standardized coefficients can be interpreted as follows: for example, support for working mothers was greater in 1989 than in 1977, holding all other variables constant.

Predicted probabilities can be used in a variety of ways to show the relationship between the independent variables and the dependent categories.

Predicted Probabilities at the Mean and Other Values. Probabilities can be computed with the variables held at their means, minima, and maxima. Table 5.4 contains predicted probabilities computed in this way.

TABLE 5.4 Predicted Probabilities of Outcomes (SD, D, A, SA) Within the Ordered Logit Model
[Numeric entries were lost in reproduction.]

Plotting Predicted Probabilities. With a single independent variable, the entire probability curve can be plotted. When there are more variables, the effect of a variable can be examined while the remaining variables are held constant. For example, the effect of age on the probability of each ordinal outcome can be plotted by holding all other variables constant and allowing age to vary. To do this, let x* contain a 1 in the first column for the intercept, a 1 in the second column to specify the survey year 1989, a 0 in the third column to select women, and the means for the remaining variables. Then

Pr(WARM = m | x*) = F(τₘ − x*β) − F(τₘ₋₁ − x*β)

Cumulative probabilities can also be plotted, where Pr(y ≤ m | x) equals the sum of Pr(y = j | x) for j ≤ m. (Prove this equality.) In our example, Pr(y ≤ 1 | x) would be the probability of strongly disagreeing, Pr(y ≤ 2 | x) the probability of strongly disagreeing or disagreeing, and so on. These probabilities can be plotted to uncover overall trends. The cumulative probabilities from our example are plotted in panel B of Figure 5.3. Notice that the cumulative probabilities "stack" the probabilities from the top and show the overall increase with age in negative attitudes toward the statement that a working mother can establish just as warm and secure a relationship with her child as a mother who does not work.
TABLE 5.6 Marginal Effects for Women Surveyed in 1989
[Rows for AGE, ED, and PRST; numeric entries were lost in reproduction.]

The marginal effects can be evaluated at other values of the variables. For example, Table 5.6 contains the partial changes in the probabilities for women in 1989. These are computed by holding MALE at 0 and YR89 at 1, with the other variables at their means.
In general, the marginal effect does not indicate the amount of change in the probability that would be observed for a unit change in an independent variable. However, if an independent variable varies over a region of the probability curve that is nearly linear, the marginal effect can be used to summarize the effect of a unit change in the variable on the probability of an outcome. For example, given the nearly linear shape of the curve relating age to the probability of disagreeing shown in panel A of Figure 5.3, we can conclude:

• For women in 1989, each additional 10 years of age increases the probability of disagreeing with the claim that a working mother can establish as warm and secure a relationship with her child as a mother who does not work by .032, holding all other variables constant.

The value .032 is 10 times the marginal effect of age for outcome D. Beware that this interpretation is only reasonable when the probability curve is nearly linear over the range being considered. The sign of the marginal effect is not necessarily the same as the sign of β; indeed, it is possible for the sign of the marginal effect to change as x changes. This is seen in one of the plotted curves: the slope is positive at younger ages, while around age 40 the marginal effect changes sign, so that further increases in age decrease the probability of that outcome.

The appropriate measure of discrete change depends on the purpose of the analysis; the choices include:

• The total possible effect of xₖ, found by letting xₖ change from its minimum to its maximum value.
• The effect of a dummy variable, obtained by letting xₖ change from 0 to 1.
• The effect of a unit change centered on the mean, computed by changing xₖ from (x̄ₖ − 1/2) to (x̄ₖ + 1/2).
• The effect of a standard deviation change centered on the mean, computed by changing xₖ from (x̄ₖ − sₖ/2) to (x̄ₖ + sₖ/2).

For example,

• For each additional year of education, the probability of strongly … increases by .01, holding all other variables constant at their means.
• For a standard deviation increase in age, the probability of … increases by .05, holding all other variables constant at their means.
• Moving from the minimum to the maximum of … changes the predicted probability of … by .06, holding all other variables constant at their means.
Table 5.7 contains measures of discrete change for our example using the ordered logit model. For dummy variables, I present the change in the probabilities when the variable changes from 0 to 1. For variables that are not dummies, you should examine the change in the probabilities for a unit change centered on the mean, for a standard deviation change centered on the mean, and for a change from the minimum to the maximum.

The effects of a variable can be summarized by the average of the absolute values of the discrete changes across all of the outcome categories. The absolute value is taken since the sum of the changes without taking the absolute value is 0. These average absolute discrete changes are listed in the column labeled Δ̄ in the table. Clearly, the respondent's sex, education, and age have the strongest effects on attitudes toward working mothers.
where the model is referred to as the proportional odds model (Agresti, 1990, p. 322; McCullagh & Nelder, 1989). The model is often interpreted in terms of cumulative odds. The cumulative probability that the outcome is less than or equal to m defines the odds

Ω≤m(x) = Pr(y ≤ m | x) / [1 − Pr(y ≤ m | x)]   for m = 1, …, J − 1

Taking logs and substituting the ORM's formula for the cumulative probability results in the equation:

ln Ω≤m(x) = τₘ − xβ

Discussions of the ordered logit model that do not use a latent variable to justify the model often begin with this equation. In such cases, the model is referred to as the cumulative logit model.

• For an increase of δ in xₖ, the odds of an outcome less than or equal to m are changed by the factor exp(−δ × βₖ), holding all other variables constant.

If xₖ changes by 1, the odds ratio equals

Ω≤m(x, xₖ + 1) / Ω≤m(x, xₖ) = exp(−βₖ)

The coefficient for age is −0.022, and the standard deviation of age is s = 16.8. Thus, 100[exp(−s × β̂) − 1] = 44, which can be interpreted as:

• For a standard deviation increase in age, the odds of SD versus D, A, and SA are increased by 44%, holding all other variables constant. Equally, the odds of SD and D versus A and SA are greater by 44% for every standard deviation increase in age, and the odds of SD, D, and A versus SA are greater by 44%.
The Parallel Regression Assumption

The ORM specifies that

Pr(y ≤ m | x) = F(τₘ − xβ)   [5.12]

which is the cumulative distribution function F evaluated at τₘ − xβ. Since β is the same for all m, Equation 5.12 defines a set of binary response models with different intercepts:

Pr(y ≤ m | x) = F(τₘ − xβ),   m = 1, …, J − 1   [5.13]

Thus, we are in effect estimating J − 1 binary response models subject to the constraint that the slopes are identical across equations. For example, Figure 5.4 plots the cumulative probability curves when there are four ordered categories, resulting in three curves with intercepts τ₁ − β₀, τ₂ − β₀, and τ₃ − β₀. To see why these curves are parallel, pick a value of the outcome probability. For example, the probability .5 is indicated by a dotted, horizontal line. When we examine the slope of the three probability curves at this point, we find that the slopes are identical; each curve is simply the preceding curve shifted along the x-axis.

[A table comparing slope estimates from the separate binary regressions could not be recovered from the source.]

A Score Test. The score test evaluates how the likelihood of the ORM would change if the constraint in Equation 5.13 was removed. The resulting test statistic is distributed as chi-square with K(J − 2) degrees of freedom. (Show that this is the correct number by counting the number of constraints being tested.) For our example, the score test equals 48.4 with 12 degrees of freedom (p < .001). This is strong evidence that the parallel regression assumption is violated.

A Wald Test. The score test is an omnibus test that does not show whether the parallel regression assumption is violated for all independent variables or only some. A Wald test proposed by Brant (1990) allows both an overall test that all the βₘ's are equal and tests of the equality of coefficients for individual variables. While this test is not implemented in commercial programs, it is simple, albeit tedious, to compute with programs that include a matrix language (e.g., SAS, LIMDEP, GAUSS). The test is constructed as follows.

1. Estimate the β̂ₘ's and V̂ar(β̂ₘ)'s. Run J − 1 binary regressions on the outcomes defined by dichotomizing y at each of the m = 1, …, J − 1 cutpoints, with estimated slopes β̂ₘ and covariance matrices V̂ar(β̂ₘ). These binary regressions provide a consistent estimate of the β in Equation 5.12 (Clogg & Shihadeh, 1994, pp. …), and comparing the similarities and differences of the slopes from the binary probits with β̂ from the ORM provides an informal assessment of the parallel regression assumption.
2. Compute the Wald statistic. Let β̂* stack the slope estimates from the J − 1 binary regressions, with the covariance matrices V̂ar(β̂ₘ) from each binary regression whose elements were defined in step 2. Define the contrast matrix

D = ( I  −I   0  ⋯   0 )
    ( I   0  −I  ⋯   0 )
    ( ⋮                ⋮ )
    ( I   0   0  ⋯  −I )

where I is an identity matrix and 0 is a matrix of zeros, each of order equal to the number of regressors. Multiplying β̂* by this matrix results in the linear combinations β̂₁ − β̂ₘ. The Wald statistic takes the standard form

W = (Dβ̂*)′ [D V̂ar(β̂*) D′]⁻¹ (Dβ̂*)

For our example, the overall Wald test leads to the same conclusion as the score test. The tests of the equality of coefficients for each variable examined individually show, as given in Table 5.8, that there is strong evidence for the violation of the parallel regression assumption for some variables but not for others.

My experience is that the parallel regression assumption is frequently violated, based on either an informal test, the score test, or the Wald test. When the assumption of parallel regressions is rejected, alternative models should be considered that do not impose the constraint of parallel regressions. These models are considered in the next chapter.

5.6. Related Models for Ordinal Data

While ordered logit and ordered probit are the most frequently used models for ordinal outcomes in the social sciences (with the possible exception of the misuse of the LRM), there are a number of other models that are also available.
5.6.1. The Grouped Regression Model

Income, for example, is often recorded in intervals, with the highest category open-ended (e.g., $100,000 or more). Such variables are often analyzed by recoding their values to the midpoint of each interval, with some reasonable value used for the highest and lowest categories. The problem is that there is only weak justification for the recoded values. Alternatively, such variables are sometimes treated as though they are ordinal and the ORM is used (e.g., Anderson, 1984). However, since the cutpoints are known, they do not need to be estimated. Further, with known cutpoints it is possible to estimate the variance of the errors, which must be assumed in the ORM. Stewart (1983) proposed
an extension of the tobit model (see Chapter 7) and developed both two-stage and ML estimators. The ML estimator is available in LIMDEP (Greene, 1995), Stata, and SAS's LIFEREG procedure.
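The idea behind the grouped (interval) regression model can be sketched as follows. This is a minimal Python illustration under invented parameter values and cutpoints, not Stewart's estimator: each observation contributes the probability that the latent outcome falls in its known interval.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 400
x = rng.normal(size=n)
y_star = 1.0 + 0.5 * x + rng.normal(scale=0.8, size=n)   # latent outcome
cuts = np.array([-np.inf, 0.0, 1.0, 2.0, np.inf])        # known interval bounds
k = np.digitize(y_star, cuts[1:-1])                      # observed interval only
lo, hi = cuts[k], cuts[k + 1]

def negll(theta):
    """Pr(lo < y* <= hi) = Phi((hi - mu)/s) - Phi((lo - mu)/s)."""
    b0, b1, log_s = theta
    s = np.exp(log_s)
    mu = b0 + b1 * x
    p = norm.cdf((hi - mu) / s) - norm.cdf((lo - mu) / s)
    return -np.sum(np.log(np.clip(p, 1e-300, None)))

res = minimize(negll, np.zeros(3), method="BFGS")
b0, b1, s = res.x[0], res.x[1], np.exp(res.x[2])
```

Because the cutpoints are known, the error standard deviation s is estimable here, unlike in the ORM, where it must be fixed by assumption.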
5.6.2. Other Models for Ordinal Data

The adjacent categories model specifies that

ln[ Pr(y = m | x) / Pr(y = m + 1 | x) ] = tau_m - x beta

so that the outcome is the log of the odds of category m versus category m + 1. Unlike the ORM, this model is a special case of the multinomial logit model considered in the next chapter. The continuation ratio model (Fienberg, 1980, p. 110) specifies that

ln[ Pr(y = m | x) / Pr(y > m | x) ] = tau_m - x beta

which is the log of the odds of category m versus all higher categories. In this model, estimates will differ if adjacent categories are combined. Anderson (1984) proposed the stereotype model:

ln[ Pr(y = m | x) / Pr(y = n | x) ] = (theta_m - theta_n) - (phi_m - phi_n) x beta

where constraints are placed on the phi's (e.g., 1 = phi_1 >= phi_2 >= ... >= phi_J = 0) to ensure ordinality, and the effects of x differ across categories, thus relaxing the parallel regression assumption. The stereotype model is closely related to the multinomial logit model. Agresti (1990, pp. 318-336) and Clogg and Shihadeh (1994) review these models; Greenwood and Farewell compare several of these models in an analysis of medical data.

Notes

The ordered probit model grew out of Aitchison and Silvey's (1957) work, in which the latent continuous variable was an unobserved tolerance to some exposure, such as a poison. The tolerance could not be observed, but the resulting state of the subject, such as recovery or death, could be assessed. Their model was limited to a single independent variable. The origins of the ordered logit model can be found in Snell (1964). McKelvey and Zavoina (1975), in a paper written for social scientists, extended the work of Aitchison and Silvey to the case where there are multiple independent variables and provided a computationally efficient method of estimation. Independently, the ordered logit and probit models were developed by McCullagh (1980), whose models were limited to a single independent variable. His focus was on the ordered logit model, which he referred to as the proportional odds model. McCullagh's work stimulated a great deal of research in biostatistics, all of which seems to be unaware of the earlier work of McKelvey and Zavoina.

Several authors provide a review of models for ordinal variables. Agresti (1990, pp. 318-336) and Clogg and Shihadeh (1994) discuss models for ordinal variables with particular attention to their relationship to log-linear models. McCullagh and Nelder (1989, Chapter 5) discuss several of these models in the context of the generalized linear model. Winship and Mare (1984) reviewed models for ordinal variables with applications in sociology.
Conclusions

Ordinal outcomes can be analyzed both with models designed for ordinal data and with models for nominal outcomes; such models are considered in the next chapter.
Each binary logit is estimated by selecting the observations in the two categories being compared; each pair of outcomes is handled in the same way. Therefore, some of the comparisons are redundant: if you know the results for the binary logit of A versus B, and the results from the logit of B versus C, you can derive the results for the logit of A versus C. Writing beta_A|B for the coefficients from the logit of A versus B, the logits must satisfy

beta_A|B + beta_B|C = beta_A|C    [6.5]

There is, however, one complication. The equalities in Equation 6.5 describe necessary relationships among the parameters in the population. They will not hold with sample estimates from the three binary logits. (Try this using your own data.) The reason is that the three binary logits are based on different samples. The first sample has N_A + N_B observations, the second has N_B + N_C observations, and the third has N_A + N_C observations. In the multinomial logit model, all of the logits are estimated simultaneously, which enforces the logical relationships among the parameters and uses the data more efficiently. Nonetheless, thinking of the multinomial logit model as a linked set of binary logits is correct.
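The redundancy among the binary logits can be verified directly. In the sketch below, the coefficient values are invented for illustration; the point is that, at the level of population parameters, the contrast for A versus C is exactly the sum of the contrasts for A versus B and B versus C.

```python
import numpy as np

# MNLM parameters for outcomes A, B, C (A is the base category: beta_A = 0);
# the coefficient values are hypothetical
beta = {"A": np.array([0.0, 0.0]),
        "B": np.array([0.5, -1.0]),
        "C": np.array([1.5, 0.25])}

def logit_contrast(m, n):
    """Population logit coefficients for m versus n: beta_m - beta_n."""
    return beta[m] - beta[n]

# beta_A|B + beta_B|C = beta_A|C holds exactly in the population
lhs = logit_contrast("A", "B") + logit_contrast("B", "C")
rhs = logit_contrast("A", "C")
print(np.allclose(lhs, rhs))   # True
```

With sample estimates from three separately fitted binary logits, the equality would hold only approximately, for the reasons given in the text.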
If we know how x affects the odds of A versus B, and how x affects the odds of B versus C, it seems reasonable that this would tell us how x affects the odds of A versus C. Indeed, there is a necessary relationship among the three odds:

Omega_A|C(x) = Omega_A|B(x) * Omega_B|C(x)    [6.4]

6.2. The Multinomial Logit Model

The formal presentation of the MNLM begins by specifying the probability of each outcome as a nonlinear function of the x's. After issues of identification are resolved, I show that the nonlinear probability model leads to a model that is linear in the log of the odds; this is the form of the model that we have just considered. Two methods of interpretation are recommended: discrete change in the probabilities and factor change in the odds. While these methods are basically the same as those used for the binary logit model, the number of probabilities and odds involved requires graphical methods to summarize the results. To make the discussion concrete, I use the example of occupational attainment.
The outcomes are: M = menial; B = blue collar; C = craft; W = white collar; and P = professional. The independent variables are WHITE (1 = white; 0 = nonwhite), ED (number of years of formal education), and EXP (possible years of work experience: age minus years of education minus 5).

While the probabilities now sum to 1, they are not identified, since more than one set of parameters generates the same probabilities of the observed outcomes. To see that this is the case, we can multiply the numerator and denominator by exp(x tau), which is equivalent to adding tau to each beta_j; the value of each probability is unchanged.
Since beta_1 = 0, the model is commonly written with outcome 1 as the base category:

Pr(y = m | x) = exp(x beta_m) / SUM_j exp(x beta_j),  where beta_1 = 0

so that each beta_m is the effect relative to outcome 1.

6.2.2. The MNLM as an Odds Model

The model can also be expressed in terms of the odds, as was done in Chapter 3. Taking the ratio of the probabilities for outcomes m and n,

Omega_m|n(x) = exp(x beta_m) / exp(x beta_n) = exp[ x (beta_m - beta_n) ]

so the model is linear in the log of the odds of outcome m versus outcome n. Direct interpretation of the coefficients is inconvenient, since it is hard to think in terms of changes in the log of the odds. Alternative methods of interpretation are discussed in Section 6.6. This is the multinomial logit model that was proposed by Theil (1969), who derived the model in much the same way that I have. The model can also be derived as a discrete choice model, which is now considered.
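The probability and odds forms of the MNLM can be computed directly. This is a small Python sketch with made-up coefficients; the first coefficient vector is set to zero to impose the identification constraint beta_1 = 0.

```python
import numpy as np

def mnlm_probs(x, betas):
    """Pr(y = m | x) = exp(x beta_m) / sum_j exp(x beta_j);
    betas[0] is the zero vector (identification: beta_1 = 0)."""
    u = np.array([x @ b for b in betas])
    u -= u.max()                      # subtract max for numerical stability
    e = np.exp(u)
    return e / e.sum()

# hypothetical coefficients for J = 3 outcomes, K = 3 columns (incl. constant)
betas = [np.zeros(3), np.array([0.8, -0.4, 1.0]), np.array([-0.2, 0.3, 0.5])]
x = np.array([1.0, 2.0, -1.0])        # first element is the constant

p = mnlm_probs(x, betas)
# the log odds of outcome 2 versus outcome 1 is x (beta_2 - beta_1)
log_odds_21 = np.log(p[1] / p[0])
```

The computed log odds equals x(beta_2 - beta_1) exactly, illustrating that the probability model is linear in the log of the odds.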
6.2.3. The Multinomial Logit Model as a Discrete Choice Model

In the discrete choice approach, an individual chooses the outcome whose utility exceeds that of all other outcomes. The form of the discrete choice model is determined by the assumed distribution of the errors and by how mu_im, the average utility, is related to measured variables. To obtain the MNLM, let the average utility be a linear combination of the characteristics of an individual:

mu_im = x_i beta_m

McFadden (1973) demonstrated that the MNLM results if and only if the errors are independent and have a type I extreme-value distribution, with cdf F(e) = exp[ -exp(-e) ].

The likelihood is the product over all cases for which y_i equals m of the corresponding probabilities. Taking logs, we obtain the log likelihood, which can be maximized with numerical methods to estimate the parameters. In practice, convergence tends to be very quick. The resulting estimates are consistent, asymptotically normal, and asymptotically efficient. Under conditions that are likely to apply in practice, the likelihood function is globally concave, which ensures the uniqueness of the ML estimates.

6.3.1. Software Issues

Different programs analyzing the same data may appear to give different results. This can be understood most readily by comparing the parameterizations that the programs use.
6.5.1. Testing That a Variable Has No Effect

This section presents two tests that are very useful when using the MNLM. The first is a test that the effect of a variable is 0. The second is a test of whether a pair of outcome categories can be combined. Since it is important to understand how these tests can be implemented with the output from your software, the tests are presented in terms of the contrasts with outcome r. If the null hypothesis is true, the Wald statistic for a variable is distributed as chi-square with J - 1 degrees of freedom.

Example of Wald and LR Tests. Table 6.3 contains the Wald and LR tests for each variable from our example.
The average absolute discrete change is computed in a way similar to that for the ORM. The absolute value is taken before averaging since the sum of the changes across the outcome categories would otherwise be 0.
6.6.4. Discrete Change
Table 6.4 contains estimates of discrete change from our model of occupational attainment. First, consider the variable WHITE. Holding all other variables at their means, being white decreases the probability of a menial job by .13 and increases the probability of a professional job by .16. The average absolute change for a standard deviation change in education is .16, and is .03 for experience; the effect of education is the largest.

While it is possible to examine these changes one at a time, a discrete change plot quickly summarizes the information. Figure 6.1 shows the change in the probability along the horizontal axis, with each variable listed on the vertical axis. The letters show the change in the probability of each outcome for a change in the given variable, with the other variables held at their means. (Remember that different results would be obtained if the variables were held at other values.) It is easy to see that the effects of a standard deviation change in education are the largest, with an increase of over .35 for professional jobs.

Keep in mind that the amount and even the direction of the discrete change depend on the values at which the independent variables are held constant; most often, the other variables are held at their means. Your choice of the amount of change in the variable being assessed depends on the purpose of the analysis: dummy variables should be changed from 0 to 1, and it is often useful to change a continuous variable from its minimum to its maximum. See Sections 3.7.5 and 5.4.4.
TABLE 6.4 Discrete Change in Probability for the Multinomial Logit Model of Occupations. Jobs Are Classified as: M = Menial; B = Blue Collar; C = Craft; W = White Collar; and P = Professional
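The discrete changes reported in Table 6.4 are computed by evaluating the predicted probabilities at two values of a variable, with the remaining variables held at their means. The Python sketch below uses invented coefficients and means, not the estimates from the table, to show the mechanics.

```python
import numpy as np

def mnlm_probs(x, betas):
    """MNLM probabilities with betas[0] = 0 for identification."""
    e = np.exp([x @ b for b in betas])
    return e / e.sum()

# hypothetical coefficients on [constant, WHITE, ED]; outcome 1 is the base
betas = [np.zeros(3), np.array([-1.0, 0.8, 0.10]), np.array([-2.0, 1.2, 0.25])]
means = np.array([1.0, 0.7, 12.0])     # constant, mean of WHITE, mean of ED

x0, x1 = means.copy(), means.copy()
x0[1], x1[1] = 0.0, 1.0                # change WHITE from 0 to 1, others at means
change = mnlm_probs(x1, betas) - mnlm_probs(x0, betas)
avg_abs = np.abs(change).mean()        # average absolute discrete change
```

Note that the changes across outcomes sum to 0, which is exactly why the absolute value is taken before averaging.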
Figure 6.1. Discrete Change Plot for the Multinomial Logit Model of Occupations, With Other Variables Held at Their Means. Jobs Are Classified as: M = Menial; B = Blue Collar; C = Craft; W = White Collar; and P = Professional

If x_k is changed by delta, the effect of x_k can be measured by the ratio of the odds before and after the change in x_k:

Omega_m|n(x, x_k + delta) / Omega_m|n(x, x_k)
The effects of race are also large, with blacks, on average, being less likely to enter blue-collar jobs. The expected changes due to a standard deviation change in experience are much smaller and show that experience increases the probabilities of more highly skilled occupations.

While the discrete change in the probability is a useful way to assess the magnitude of effects in the MNLM, it is limited in two ways. First, it indicates the change only for a particular set of values of the independent variables; at different levels of the variables, the changes will be different. Second, measures of discrete change do not indicate the dynamics among the outcomes. For example, a decrease in education increases the probabilities of both blue-collar and craft jobs. But how does it affect the odds of a person choosing a craft job relative to a blue-collar job? To answer this type of question, we need to consider the odds formulation of the model.

All terms cancel except for exp(beta_k,m|n delta), which is the odds ratio. When delta = 1, the unstandardized odds ratio can be interpreted as:

- For a unit change in x_k, the odds are expected to change by a factor of exp(beta_k,m|n), holding all other variables constant.

When delta is the standard deviation of x_k, the x-standardized odds ratio can be interpreted as:

- For a standard deviation change in x_k, the odds are expected to change by a factor of exp(beta_k,m|n s_k), holding all other variables constant.
Very importantly, the factor change in the odds for a change in x_k does not depend on the level of x_k or on the level of any other variable. Recall that the MNLM can be written as Pr(y = m | x) = exp(x beta_m) / SUM_j exp(x beta_j). While the interpretation of each odds ratio is simple, the number of comparisons makes the task difficult: with J outcomes, each variable has a coefficient for every pair of categories.
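The key property, that the factor change in the odds does not depend on the levels of the variables, can be checked numerically. The coefficients below are invented for illustration.

```python
import numpy as np

def odds_mn(x, b_m, b_n):
    """Odds of outcome m versus n under the MNLM: exp(x(b_m - b_n))."""
    return np.exp(x @ b_m) / np.exp(x @ b_n)

b_m = np.array([0.5, 0.3])    # hypothetical coefficients for outcome m
b_n = np.array([-0.2, 0.9])   # hypothetical coefficients for outcome n
delta, k = 1.0, 1             # change the second variable by delta

ratios = []
for x in (np.array([1.0, 0.0]), np.array([1.0, 5.0])):
    x_new = x.copy()
    x_new[k] += delta
    # factor change in the odds after changing x_k by delta
    ratios.append(odds_mn(x_new, b_m, b_n) / odds_mn(x, b_m, b_n))
```

Both starting points give the same factor change, exp(delta * (b_m[k] - b_n[k])), regardless of the level of x.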
A coefficient of 0 means that the variable has no effect on the corresponding odds; a coefficient of ln 2 increases the odds by a factor of 2, that is, it doubles the odds, while a coefficient of ln(1/2) cuts the odds of B relative to A in half.

To plot these coefficients, the effect of a variable is shown as the distance between the letters for a pair of outcomes. If a variable increases the odds of A over B, then A is placed to the right of B, and vice versa. Figure 6.2 plots the coefficients on a scale in the units of the logit coefficients. The coefficients are plotted relative to category A, which is located at 0 on the bottom scale. If a unit increase in x_1 decreases the logit of B versus A, then B is located to the left of A. While the coefficients are plotted relative to outcome A, they could equally be plotted relative to outcome B. Since our interest is in the factor change in the odds, a second scale is printed at the top of the figure, with each value equal to the exponential of the corresponding logit coefficient.

In the occupation example, the coefficients are plotted relative to a menial occupation, and x-standardized coefficients have been plotted for education and experience. The effect of a standard deviation change in education is larger than the effect of experience, as indicated by the greater spread in the letters. Education sharply differentiates white-collar and professional jobs from menial jobs, as would be expected given the educational requirements of professional jobs, while menial and blue-collar jobs themselves are not sharply differentiated.

When using an odds ratio plot, it is essential to understand that the substantive meaning of a factor change of a given size is dependent on the current predicted probability or odds. For example, if the odds increase by a factor of 10 but the current odds are 1 in 10,000, then the substantive impact is small. Consequently, the odds ratio plot must be interpreted while keeping in mind the base probabilities and the discrete changes in the probabilities. The plot can be modified to convey this information by making the height of the letters in the odds ratio plot proportional to the discrete change.
Figure 6.6. Enhanced Odds Ratio Plot for the Multinomial Logit Model of Attitudes Toward Working Mothers. Discrete Changes Were Computed With All Variables Held at Their Means. Categories Are: 1 = Strongly Disagree; 2 = Disagree; 3 = Agree; and 4 = Strongly Agree

Pr(y_i = m | z_i) = exp(z_im gamma) / SUM_{j=1}^{J} exp(z_ij gamma)    [6.14]

which should be compared to the MNLM:

Pr(y_i = m | x_i) = exp(x_i beta_m) / SUM_{j=1}^{J} exp(x_i beta_j),  where beta_1 = 0    [6.15]

In Equation 6.15, there are J - 1 parameters beta_km for each x_k, but only a single value of x_k for each individual. In Equation 6.14, there is a single gamma_k for each variable z_k, but there are J values of the variable for each individual.
In the MNLM, each explanatory variable has a different effect on each outcome. For example, the effect of x_k on outcome m is beta_km, while the effect on outcome n is beta_kn. The conditional logit model (CLM), sometimes referred to as the Luce model or (confusingly) the multinomial logit model, is a closely related model in which the coefficients for a variable are the same for each outcome, but the values of the variables differ for each outcome. For example, if we are trying to explain a commuter's choice of transportation among the options of train, bus, and private automobile, we might consider the amount of time or the cost per trip for each option. The effect of time would be the same for each mode of travel, but the amount of time would differ by the mode of transportation.

The CLM was developed by McFadden and others, largely within the context of research on travel demand. McFadden (1968) used the model to study criteria used by a state highway department to select urban freeway routes.

An example of how the data are constructed for the CLM is useful for understanding the model. Assume there is a single independent variable z and three outcomes. For four individuals, the data might look as follows:

[Data listing: for each of the four individuals there are three rows, one per outcome, giving an indicator for the chosen outcome and the value z_ij of the variable for that outcome.]

For each individual, there are three observations corresponding to the three possible outcomes. The differences in the values of z for the different outcomes determine the probabilities of the various choices.
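The data construction and the resulting probabilities can be sketched as follows. The z values below are hypothetical, standing in for the alternative-specific data in the listing above (e.g., travel time for person i on mode j).

```python
import numpy as np

# hypothetical alternative-specific values z[i, j] for 4 individuals
# and 3 outcomes; one row per individual, one column per outcome
z = np.array([[7.0, 3.0, 1.0],
              [5.0, 1.0, 2.0],
              [3.0, 0.0, 1.0],
              [3.0, 2.0, 7.0]])
gamma = 0.5          # a single coefficient, shared across all outcomes

def clm_probs(z_i, gamma):
    """Pr(y_i = m | z_i) = exp(z_im gamma) / sum_j exp(z_ij gamma)."""
    e = np.exp(gamma * z_i)
    return e / e.sum()

P = np.vstack([clm_probs(z_i, gamma) for z_i in z])
most_likely = P.argmax(axis=1)
```

With gamma positive, the outcome with the largest value of z has the highest probability for each individual, which is why listing the data so that the largest z goes with the chosen outcome makes the choices likely under the model.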
I have listed the data such that the largest value of z is associated with the outcome that is chosen by the individual.

The models can be compared in terms of the odds form of the model. In the CLM, the odds correspond to the difference in the values of the variables associated with the two outcomes:

ln Omega_m|n(z_i) = (z_im - z_in) gamma

In the MNLM, the odds correspond to the difference in the coefficients for the two outcomes:

ln Omega_m|n(x_i) = x_i (beta_m - beta_n)

Boskin's (1974) application of the CLM to occupational attainment is a useful contrast to our analysis using the MNLM. In the MNLM we examined how race, education, and experience affected the odds of different occupations. For a given individual, the values of the regressors were the same for all outcomes. For example, a person's race did not change with the choice of an occupation. In Boskin's CLM, the variables were the costs and benefits associated with each occupation. For example, for each person he computed the present value of the wages in that occupation (wages times the number of hours the person will work in the future). The effect of the present value is the same for each occupation, but the value itself differs by occupation. For a given person, the present value of a professional occupation will exceed the value of a menial occupation, thus making a professional occupation more likely, all else equal.

The conditional and multinomial models reflect different aspects of the processes by which individuals attain occupations. I suspect that at some point the most useful models for the analysis of nominal outcomes will combine characteristics of the multinomial and conditional models. To see how these models can be combined, we can show the equivalence of the CLM and the MNLM. To illustrate this equivalence, consider a MNLM with a single independent variable and three dependent categories. To transform this into the CLM, we construct z vectors with four elements:

z_i1 = (0 0 0 0)
z_i2 = (1 x_i 0 0)
z_i3 = (0 0 1 x_i)

The first subscript for z is the observation number; the second is the outcome (either 1, 2, or 3); and the third indexes the variable, 1 through 4. z_i1 is a vector of 0's for all observations, which corresponds to the constraint that beta_1 = 0. Within z_i2, the first element is 1, the second is x_i, and the last two elements are always 0. Within z_i3, the first two elements are always 0, the third is always 1, and the last is x_i. To see how this construction of the z's leads to the MNLM, define gamma = (beta_20, beta_21, beta_30, beta_31)'. Then

z_i1 gamma = (0 x beta_20) + (0 x beta_21) + (0 x beta_30) + (0 x beta_31) = 0
z_i2 gamma = (1 x beta_20) + (x_i x beta_21) + (0 x beta_30) + (0 x beta_31) = beta_20 + beta_21 x_i
z_i3 gamma = (0 x beta_20) + (0 x beta_21) + (1 x beta_30) + (x_i x beta_31) = beta_30 + beta_31 x_i

Substituting into the equations for the CLM,

Pr(y_i = 1 | z_i) = 1 / [ 1 + exp(beta_20 + beta_21 x_i) + exp(beta_30 + beta_31 x_i) ]

Pr(y_i = 2 | z_i) = exp(beta_20 + beta_21 x_i) / [ 1 + exp(beta_20 + beta_21 x_i) + exp(beta_30 + beta_31 x_i) ]

Pr(y_i = 3 | z_i) = exp(beta_30 + beta_31 x_i) / [ 1 + exp(beta_20 + beta_21 x_i) + exp(beta_30 + beta_31 x_i) ]

which are exactly the probabilities from the MNLM.
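The equivalence can be confirmed numerically. The coefficient values and the value of x below are arbitrary; the constructed-z CLM reproduces the MNLM probabilities exactly.

```python
import numpy as np

# hypothetical MNLM coefficients for a model with one x and three outcomes
b20, b21, b30, b31 = 0.4, -0.7, 1.1, 0.3
gamma = np.array([b20, b21, b30, b31])

def mnlm_p(x):
    """MNLM probabilities with beta_1 = 0."""
    u = np.array([0.0, b20 + b21 * x, b30 + b31 * x])
    e = np.exp(u)
    return e / e.sum()

def clm_p(x):
    """CLM probabilities with the constructed z vectors:
    z1 = (0,0,0,0), z2 = (1,x,0,0), z3 = (0,0,1,x)."""
    Z = np.array([[0, 0, 0, 0], [1, x, 0, 0], [0, 0, 1, x]], dtype=float)
    e = np.exp(Z @ gamma)
    return e / e.sum()

x = 2.5
print(np.allclose(mnlm_p(x), clm_p(x)))   # True
```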
6.8.1. Testing IIA

Tests of IIA compare estimates from the full set of outcomes with estimates from a restricted set of outcomes. Note that the difference between the covariance matrices used by the test may not be positive semidefinite, in which case the statistic can be negative; this is often taken as evidence that IIA holds.

The Nested Logit Model. The nested logit model divides the choices into a hierarchy of levels and thus avoids the IIA assumption. See Amemiya (1981, p. 238), Cramer (1991), Greene (1995), and Maddala (1983).

Models for Ranked Data. Models for ranked data are also similar to the MNLM. Rank data occur when an individual ranks items from a set of choices. For example, a person might indicate the rank order of preference for three candidates running for office. References include Allison and Christakis (1994) and Hausman and Ruud (1987).
7 Limited Outcomes: The Tobit Model
In the linear regression model, the values of all variables are known for the entire sample. This chapter considers the situation in which the sample is limited by censoring or truncation. Censoring occurs when we observe the independent variables for the entire sample, but for some observations we have only limited information about the dependent variable. For example, we might know that the dependent variable is less than 100, but not know how much less. Truncation limits the data more severely by excluding observations based on characteristics of the dependent variable. For example, in a truncated sample all cases where the dependent variable is less than 100 would be deleted. While truncation changes the sample, censoring does not.

The classic example of censoring is Tobin's study of household expenditures. A consumer maximizes utility by purchasing durable goods under the constraint that total expenditures do not exceed income. Expenditures for durable goods must at least equal the cost of the least expensive item. If a consumer has only $50 left after other expenses and the least expensive item costs $100, the consumer can spend nothing on durable goods. The outcome is censored since we do not know how much a household would have spent if a durable good could be purchased for less than $100. Many other examples of censored outcomes
can be found: hours worked by wives (Quester & Greene, 1982), scientific publications (Stephan & Levin, 1992), extramarital affairs (Fair, 1978), foreign trade and investment (Eaton & Tamura, 1994), austerity protests in Third World countries (Walton & Ragin, 1990), damage caused by a hurricane (Fronstin & Holtmann, 1994), and IRA contributions (LeClere, 1994). Amemiya (1985, p. 365) lists many additional examples.

Hausman and Wise's (1977) analysis of the New Jersey Negative Income Tax Experiment is an early application of models for truncated data. In this study, families with incomes more than 1.5 times the poverty level were excluded from the sample. Thus, the sample itself is affected and is no longer representative of the population.

Many models have been developed for censoring and truncation. This chapter focuses on the most frequently used model for censoring, the tobit model. Section 7.6 briefly reviews related models for truncation, multiple censoring, and sample selection.

7.1. The Problem of Censoring

Let y* be a dependent variable that is not censored. Panel A of Figure 7.1 shows the distribution of y*, where the height of the curve indicates the relative frequency of a given value of y*. If we do not know the value of y* when y* <= 1, corresponding to the shaded region, then y* is a latent variable that cannot be observed over its entire range. The censored variable y is defined as

y = y*  if y* > 1
y = 0   if y* <= 1

Panel B plots the censored variable y with censored cases stacked at 0. The bar contains the cases from the shaded region in panel A. Panel C plots the truncated variable y | y > 1 (i.e., y given that y > 1), which simply deletes the shaded region from panel A.

To see how censoring and truncation affect the LRM, consider the model y* = 1.2 + .08x + e, where all of the assumptions of the LRM apply, including the normality of the errors. Panel A of Figure 7.2 shows a sample of 200 with no censoring. The solid line is the OLS estimate y-hat = 1.18 + .08x. If y* were censored below at 1, we would know x for all observations, but observe y* only for y* > 1. In panel B, values of y* at or below 1 are censored with y = 0 for censored cases. These are plotted with triangles. The three thick lines are the results of three approaches to estimation.

One way to estimate the parameters is with an OLS regression of y on x for all observations, with the censored data included as 0's. The resulting estimate y-hat = .95 + .11x is the long dashed line in panel B. The censored observations on the left pull down that end of the line, resulting in underestimates of the intercept and overestimates of the slope. This approach to censoring produces inconsistent estimates.

Since including censored observations causes problems, we might use OLS to estimate the regression after truncating the sample to exclude cases with a censored dependent variable. This changes the problem of censoring into the problem of a truncated sample. After deleting the cases at y = 0, the OLS estimate y-hat = 1.41 + .061x overestimates the intercept and underestimates the slope, as shown by the short dashed line. The uncensored observations at the left have pulled the line up, since those observations with large negative errors have been deleted. Truncation causes a correlation between x and e which produces inconsistent estimates.

A third approach is to estimate the tobit model, sometimes referred to as the censored regression model. The tobit model uses all of the information, including information about the censoring, and provides consistent estimates of the parameters. ML estimates for the tobit model are shown by the solid line, which is indistinguishable from the estimates in panel A where there is no censoring.
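The comparison of the three approaches can be reproduced in a small simulation. The Python sketch below follows the setup y* = 1.2 + .08x + e; the error standard deviation, the range of x, and the seed are my own choices, so the numerical estimates will not match those reported in the text, but the direction of the biases does.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x = rng.uniform(0, 20, n)
y_star = 1.2 + 0.08 * x + rng.normal(scale=0.5, size=n)   # latent outcome

tau = 1.0
censored = y_star <= tau
y = np.where(censored, 0.0, y_star)    # censored cases recorded as 0

def ols(xv, yv):
    """OLS intercept and slope via least squares."""
    X = np.column_stack([np.ones(len(xv)), xv])
    return np.linalg.lstsq(X, yv, rcond=None)[0]

b_latent = ols(x, y_star)                   # no censoring: near (1.2, .08)
b_cens = ols(x, y)                          # 0's at left inflate the slope
b_trunc = ols(x[~censored], y[~censored])   # truncation attenuates the slope
```

Comparing the slopes shows the pattern described in the text: including censored cases as 0's overestimates the slope, while deleting them underestimates it.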
Figure 7.1. Latent, Censored, and Truncated Variables

Example of Censoring and Truncation: Prestige of the First Job

Chapter 2 used as an example the regression of the prestige of a scientist's first academic job. (See Table 2.1, p. 19, for a description of the data.)
TABLE 7.1 Model With and Without Censoring and Truncation

The prestige of the first job was unavailable for jobs in departments without graduate programs; these cases were coded 1.0 and OLS was used to estimate the model. The results are reproduced in the column "OLS with Censored Data" in Table 7.1. Alternatively, we could truncate the sample by deleting the censored cases. The OLS estimates from the truncated sample are in the column "OLS with a Truncated Sample." Finally, tobit estimates are listed in the column "Tobit."

The most important difference between the results of the tobit analysis and the two OLS analyses concerns the effect of gender. In the tobit analysis, the effect of being a woman is significant and negative. With the censored data, the effect is substantially smaller and not significant. In the truncated sample, the effect is positive, although not significant. Thus, a key substantive result is dependent on the method of analysis. Other differences in relative magnitude and level of significance are also found.
7.2. Truncated and Censored Distributions

Before presenting the tobit model, we need some results on truncated and censored normal distributions. These distributions are the foundation of most models for truncation and censoring. I consider censoring and truncation on the left, which translates into censoring from below in the tobit model. Formulas for censoring and truncation on the right, and on both the left and right, are available. For more details, see Johnson et al. (1994) or Maddala (1983).

When mu = 0 and sigma = 1, the standard normal distribution is written in the simplified notation:

phi(y*) = phi(y* | mu = 0, sigma = 1)
PHI(y*) = PHI(y* | mu = 0, sigma = 1)

Any normal distribution, regardless of its mean mu and variance sigma^2, can be written as a function of the standard normal distribution. The cdf can be written as

Pr(y* <= y) = PHI( (y - mu) / sigma )

These results are often used to simplify the notation in this chapter. For example, Equations 7.1 and 7.2 can be written in terms of phi and PHI.

When values below tau are deleted, the variable y | y > tau has a truncated normal distribution. In terms of panel A of Figure 7.3, we want to consider the distribution of y* in the unshaded region.
The truncated distribution is created by dividing the normal pdf by the area to the right of tau, so that the truncated density, shown by the solid line, has an area of 1. This is seen by comparing the truncated distribution to the dotted normal curve. Using these results, we can write the truncated pdf as

f(y | y > tau, mu, sigma) = phi( (y - mu)/sigma ) / ( sigma [ 1 - PHI( (tau - mu)/sigma ) ] )

The ratio lambda = phi(delta) / [1 - PHI(delta)], where delta = (tau - mu)/sigma, is known as the inverse Mills ratio. It appears so often in this chapter that it deserves attention; lambda and its components phi and PHI are functions of delta. The probability that an observation is censored is

Pr(Censored | x_i) = PHI( (tau - x_i beta) / sigma )
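The inverse Mills ratio and the truncated mean can be computed directly from the standard normal pdf and cdf. The Python sketch below uses the convention for truncation from below given above; note that some treatments define lambda with PHI rather than 1 - PHI in the denominator, depending on the direction of censoring.

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def trunc_mean(mu, sigma, tau):
    """E(y | y > tau) = mu + sigma * lambda(delta), delta = (tau - mu)/sigma,
    with lambda(delta) = phi(delta) / (1 - Phi(delta))."""
    d = (tau - mu) / sigma
    lam = phi(d) / (1 - Phi(d))
    return mu + sigma * lam

# truncation from below raises the mean above mu:
m = trunc_mean(0.0, 1.0, 0.0)   # = 2 * phi(0) = sqrt(2/pi) ~ 0.798
```

Raising the truncation point pushes the truncated mean further above mu, which is the behavior illustrated in the figures.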
The effects of gender and fellowship status on the probability that the prestige of the first job is censored are illustrated in Figure 7.6. The solid line with open squares shows the probability of censoring for women who were not fellows. Female fellows are less likely to have the prestige of their first job censored (i.e., to have a first job with prestige below 1), as shown by the second solid line.
Figure 7.6. Probability of Censoring by Gender, Fellowship Status, and Ph.D. Prestige

As the latent values get closer and closer to tau, the observed and latent regressions diverge. Given these differences, it is clear why OLS produces inconsistent estimates when the sample is truncated.

In another approach, a variable is created in which all censored observations are assigned the value tau_y. The model is estimated with OLS over the entire sample; the censored cases have not been eliminated, but only assigned an unrealistic value, and the estimates remain inconsistent.

The tobit model is instead estimated by ML. For uncensored observations, the likelihood is the usual LRM density; for censored observations, it is the probability of being censored, Pr(Censored | x_i). Taking logs,

ln L(beta, sigma^2 | y, X) = SUM_Uncensored ln[ (1/sigma) phi( (y_i - x_i beta)/sigma ) ] + SUM_Censored ln PHI( (tau - x_i beta)/sigma )
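The log likelihood above can be maximized numerically. The following Python sketch uses simulated data with invented parameter values; it is an illustration of the likelihood's two parts, not production tobit code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(7)
n = 300
x = rng.normal(size=n)
y_star = 1.0 + 0.8 * x + rng.normal(size=n)      # latent outcome
tau = 0.5
cens = y_star <= tau
y = np.where(cens, tau, y_star)                  # censored from below at tau

def negll(theta):
    """Negative tobit log likelihood: LRM density for uncensored cases,
    Phi((tau - mu)/sigma) for censored cases."""
    b0, b1, log_s = theta
    s = np.exp(log_s)
    mu = b0 + b1 * x
    ll_unc = norm.logpdf((y - mu) / s) - np.log(s)
    ll_cen = norm.logcdf((tau - mu) / s)
    return -(ll_unc[~cens].sum() + ll_cen[cens].sum())

res = minimize(negll, np.zeros(3), method="BFGS")
b0_hat, b1_hat = res.x[:2]
```

Unlike OLS on the censored or truncated data, the ML estimates recover the parameters of the latent regression.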
7.4.1. Violations of the Assumptions

The ML estimator is inconsistent when the errors are heteroskedastic or nonnormal (Arabmazar & Schmidt, 1982).

Following the arguments in Section 2.2.1 for the LRM, standardized and semi-standardized coefficients can be computed. Consider the effect of a unit increase in x_k on the expected value of the truncated outcome. The partial derivative of E(y | y > tau, x) with respect to x_k is

dE(y | y > tau, x)/dx_k = beta_k [ 1 - lambda(delta)(lambda(delta) - delta) ]

where delta = (tau - x beta)/sigma and lambda(delta) = phi(delta)/[1 - PHI(delta)].
McDonald and Moffitt proposed a decomposition of the change in the censored outcome. The simplest approach is to differentiate Equation 7.13 with the product rule.

7.6. Extensions

Censoring From Above. This model can be obtained from the model with lower censoring simply by changing the sign of y: censoring y from above at tau is identical to censoring -y from below at -tau. Since this simple change has subtle effects on the signs in many formulas, I present the key results here. The probability of censoring is

Pr(Censored | x) = PHI(delta), where delta = (x beta - tau)/sigma
The formulas for the truncated and censored outcomes are mirror images of those discussed above; the expected value for the truncated outcome is given by Maddala (1983). Applications with censoring from above include Fronstin and Holtmann (1994).

With two limits, tau_L and tau_U, the likelihood function includes components for the lower censored cases, the upper censored cases, and the observed outcomes, with delta_L = (tau_L - x beta)/sigma and delta_U = (tau_U - x beta)/sigma. The expected value of the observed outcome is

E(y | x) = [tau_L x Pr(y = tau_L | x)] + [E(y | tau_L < y* < tau_U, x) x Pr(tau_L < y* < tau_U | x)] + [tau_U x Pr(y = tau_U | x)]
For truncated samples, the likelihood of each observation is not the same as for uncensored observations in the tobit model: each density must be adjusted for the area of the normal distribution that has been truncated. An OLS regression of y on x for the selected observations produces inconsistent estimates of the structural equation, since lambda has been omitted. The two-step estimator involves first estimating the selection model.

7.7. Conclusions
This chapter has only touched on the rich set of models that deal with censoring, truncation, and selection. In all of these models, the basic idea is the same. Due to some data collection mechanism, data are incomplete on some of the observations in a systematic way. As a consequence, the LRM yields biased and inconsistent estimates.

7.8. Bibliographic Notes

While censored and truncated distributions have a long history in biostatistics, within the social sciences work on censoring and truncation originated with Tobin's (1958) article on household expenditures for durable goods. Indeed, this entire class of models is sometimes referred to as tobit models, a term coined to stand for "Tobin's probit." In the 1970s, a series of papers extending Tobin's model appeared that stimulated a great deal of applied and theoretical work. These include Gronau (1973) and Hausman and Wise (1977). See Amemiya (1985) for an extensive review of this literature, and Breen (1996).

8. Count Outcomes: Regression Models for Counts

Variables that count the number of times that something has happened are common in the social sciences. Hausman et al. (1984) examined the effect of R&D expenditures on the number of patents received by U.S. companies; Cameron and Trivedi (1986) analyzed factors affecting how frequently a person visited the doctor; Grogger (1990) studied the deterrent effects of capital punishment on daily homicides; and King (1989b) examined the effects of alliances on the number of nations at war. Other count outcomes include derogatory reports in an individual's credit history (Greene, 1994); patterns of consumption (Mullahy, 1986); illnesses caused by pollution (Portney & Mullahy, 1986); party switching by members of the House of Representatives (King, 1988); industrial injuries (Ruser, 1991); the emergence of new companies (Hannan & Freeman, 1989, p. 230); and police arrests (Land, 1992).

Count variables are often treated as though they are continuous and the linear regression model is applied. The use of the LRM for count outcomes can result in inefficient, inconsistent, and biased estimates. Fortunately, there are a variety of models that deal explicitly with characteristics of count outcomes. The Poisson regression model is the most basic model. With this model, the probability of a count is determined by a Poisson distribution, where the mean of the distribution is a function
217
Let y be a random variable indicating the number of times that an event has occurred during an interval of time. y has a Poisson distribution with parameter μ > 0 if

Pr(y | μ) = exp(−μ) μ^y / y!    for y = 0, 1, 2, ...    [8.1]

[Figure 8.1. The Poisson Distribution]

Figure 8.1 plots the Poisson distribution for several values of μ and illustrates its properties. Since μ is the expected number of events during a fixed period of time, it can also be thought of as a rate. Among the properties of the distribution:

3. As μ increases, the probability of 0's decreases. For μ = .8, the probability of a 0 is .45; for μ = 1.5, it is .22; for μ = 2.9, it is .06; and for μ = 10.5, the probability is .00002. For many count variables, there are more observed 0's than predicted by the Poisson distribution.

4. As μ increases, the Poisson distribution approximates a normal distribution. This is shown in panel D, where a normal distribution with a mean and variance of 10.5 has been superimposed on the Poisson distribution.

In practice, the variance of a count variable is often greater than the mean, which is called overdispersion. The development of many models for count data is an attempt to account for overdispersion.

The Poisson distribution can be derived from a simple stochastic process, known as a Poisson process, where the outcome is the number of times that something has happened (see Taylor & Karlin, 1994, pp. 252-258, for a formal derivation of the Poisson distribution). A critical assumption of a Poisson process is that events are independent. This means that when an event occurs it does not affect the probability of the event occurring in the future. For example, consider the publication of articles by scientists. The assumption of independence implies that when a scientist publishes an article, her rate of publication does not change.
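The probabilities in Equation 8.1 can be checked directly. The sketch below, using only the standard library, computes the Poisson pmf and shows how the probability of a zero count falls as μ grows:

```python
from math import exp, factorial

def poisson_pmf(y, mu):
    """Pr(y | mu) = exp(-mu) * mu**y / y!  (Equation 8.1)."""
    return exp(-mu) * mu ** y / factorial(y)

# Property 3: the probability of a zero count falls as mu increases.
for mu in (0.8, 1.5, 2.9, 10.5):
    print("mu =", mu, " Pr(y=0) =", round(poisson_pmf(0, mu), 5))
```

The printed values reproduce those in the text (.45, .22, .06, .00002, up to rounding).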
The distribution of counts around the conditional mean of y in panel A of Figure 8.3 reflects the characteristics of the Poisson distribution that were discussed using Figure 8.1. Indeed, I constructed Figure 8.3 so that the means at x equal to 0, 5, 10, and 20 correspond to the means in the earlier figure. You can see that as μ increases: (1) the conditional variance of y increases; (2) the proportion of 0's decreases; and (3) the distribution around the expected value becomes approximately normal.

The figure also shows why the PRM can be thought of as a nonlinear regression model with errors ε = y − E(y | x). While the conditional mean of ε is 0, the errors are heteroscedastic, since Var(ε | x) = E(y | x) = exp(xβ). Note, however, that if your data are limited to a range of x where the relationship is approximately linear, the LRM is a reasonable approximation to the PRM. This is shown in panel B, which expands the portion of panel A between x = 15 and x = 20. There the relationship between μ and x is nearly linear, the conditional distributions are approximately normal, and there is only slight heteroscedasticity.
8.2.1. Estimation

The likelihood function for the PRM is

L(β | y, X) = ∏_{i=1}^{N} Pr(y_i | μ_i),    [8.3]

where μ_i = exp(x_i β). After taking the log, numerical maximization can be used. The gradient and Hessian of the log likelihood are given by Maddala (1983). Since the likelihood function is globally concave, if a maximum is found it will be unique.
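The maximization described above can be sketched in a few lines. The example below fits a one-predictor PRM, μ_i = exp(b0 + b1·x_i), by Newton-Raphson with step-halving; the data are hypothetical and for illustration only, not from the text:

```python
# Minimal sketch of ML estimation for the PRM (Equation 8.3 after taking logs).
# The data below are hypothetical, chosen only to illustrate the algorithm.
from math import exp, lgamma

x = [0, 1, 2, 3, 4, 5]
y = [1, 1, 3, 5, 8, 14]

def loglik(b0, b1):
    # Sum of log Poisson pmfs: y*log(mu) - mu - log(y!)
    return sum(yi * (b0 + b1 * xi) - exp(b0 + b1 * xi) - lgamma(yi + 1)
               for xi, yi in zip(x, y))

b0 = b1 = 0.0
for _ in range(100):
    mu = [exp(b0 + b1 * xi) for xi in x]
    g0 = sum(yi - mi for yi, mi in zip(y, mu))                 # gradient
    g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
    h00 = -sum(mu)                                             # Hessian
    h01 = -sum(mi * xi for xi, mi in zip(x, mu))
    h11 = -sum(mi * xi * xi for xi, mi in zip(x, mu))
    det = h00 * h11 - h01 * h01
    s0 = (h11 * g0 - h01 * g1) / det                           # Newton step H^-1 g
    s1 = (h00 * g1 - h01 * g0) / det
    t = 1.0
    while loglik(b0 - t * s0, b1 - t * s1) < loglik(b0, b1):
        t /= 2  # step-halving keeps every update an improvement
    b0, b1 = b0 - t * s0, b1 - t * s1
print(round(b0, 3), round(b1, 3))
```

Because the log likelihood is globally concave, the damped Newton iterations converge to the unique maximum regardless of the starting values.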
[Figure 8.3. Distribution of Counts for the Poisson Regression Model]

8.2.2. Interpretation

The way in which you interpret a count model depends on whether you are interested in the expected value of the count variable or in the distribution of counts. If interest is in the expected count, several methods can be used to compute the change in the expectation for a change in an independent variable. If interest is in the distribution of counts, or perhaps just the probability of a specific count, the probability of a count for a given level of the independent variables can be computed. Each of these methods is now considered.

Suppose, for example, that μ = .78. Using this value of μ, Pr(y = 0 | μ) = .46, Pr(y = 1 | μ) = .36, Pr(y = 2 | μ) = .14, and Pr(y = 3 | μ) = .04. Other probabilities can be computed similarly.
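These predicted probabilities follow directly from Equation 8.1 evaluated at the predicted mean:

```python
from math import exp, factorial

mu = 0.78  # the predicted mean from the example in the text
probs = [round(exp(-mu) * mu ** y / factorial(y), 2) for y in range(4)]
print(probs)  # -> [0.46, 0.36, 0.14, 0.04]
```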
The expected value of y given x in the PRM is

E(y | x) = exp(xβ).    [8.4]

When larger counts are frequent, the estimates from the LRM can be thought of roughly as approximations to the coefficients from the PRM. The simplest way to interpret the results of the PRM is by using the factor change in the expected count. For example, the coefficient for [...] implies a change in the expected number of articles by a factor of exp(β), holding all other variables constant. Or, equivalently, the percentage change in the expected count for a unit change in x_k is 100[exp(β_k) − 1].
8.3. The Negative Binomial Regression Model

In the PRM, y has a conditional mean that is a function of the x's: μ = exp(xβ). The negative binomial extension of the PRM adds an error that allows the conditional variance to exceed the conditional mean. The resulting model, hereafter the NBRM, can be derived in several ways; I consider the most common motivation of the model in terms of unobserved heterogeneity. The mean μ = exp(xβ) is replaced with the random variable

μ̃_i = exp(x_i β + ε_i) = exp(x_i β) exp(ε_i) = μ_i δ_i,    [8.8]

where δ_i is defined to equal exp(ε_i). Recall that the LRM was not identified until an assumption was made about the mean of the error (see Section 2.5.1). For similar reasons, the NBRM is not identified without an assumption about the mean of the error term. The most convenient assumption is that

E(δ_i) = 1    [8.9]

(Hausman et al., 1984). This assumption implies that the expected count after adding the new source of variation is the same as it was for the PRM: E(μ̃ | x) = μ E(δ) = μ. The distribution of observations given both x and δ is still Poisson:

Pr(y_i | x_i, δ_i) = exp(−μ_i δ_i)(μ_i δ_i)^{y_i} / y_i!    [8.10]

Even though the conditional distribution is unchanged, the new source of variation allows the variance to exceed the mean. However, since δ is unknown, we cannot compute Pr(y | x, δ) and instead need to compute the distribution of y given only x. To compute Pr(y | x) without conditioning on δ, we average Pr(y | x, δ) weighted by the probability of each value of δ. If g is the pdf for δ, then

Pr(y_i | x_i) = ∫_0^∞ [Pr(y_i | x_i, δ_i)] g(δ_i) dδ_i.    [8.11]

To clarify what this important equation is doing, assume that δ has only two values, d_1 and d_2. The counterpart to Equation 8.11 is

Pr(y_i | x_i) = [Pr(y_i | x_i, δ_i = d_1) Pr(δ_i = d_1)] + [Pr(y_i | x_i, δ_i = d_2) Pr(δ_i = d_2)].    [8.12]
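The two-point mixture in Equation 8.12 can be computed directly. In the sketch below (values chosen for illustration), μ = 2 and δ takes the values 0.5 and 1.5 with probability .5 each, so E(δ) = 1: the mixture keeps the mean at μ but inflates the variance above it.

```python
# Sketch of Equation 8.12: a two-point mixture of Poisson distributions.
# mu and the two delta values are hypothetical; E(delta) = 1 as in Eq. 8.9.
from math import exp, factorial

def poisson(y, m):
    return exp(-m) * m ** y / factorial(y)

mu, deltas, w = 2.0, (0.5, 1.5), 0.5
pr = [w * poisson(y, mu * deltas[0]) + w * poisson(y, mu * deltas[1])
      for y in range(60)]
mean = sum(y * p for y, p in enumerate(pr))
var = sum((y - mean) ** 2 * p for y, p in enumerate(pr))
print(round(mean, 3), round(var, 3))  # mean stays at 2; variance exceeds 2
```

Here the variance works out to 3 = 2 + Var(μδ), illustrating how unobserved heterogeneity generates overdispersion.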
To solve Equation 8.11, we must specify the form of the pdf for δ. While several distributions have been considered, the most common assumption is that δ_i has a gamma distribution with parameter ν_i:

g(δ_i) = [ν_i^{ν_i} / Γ(ν_i)] δ_i^{ν_i − 1} exp(−ν_i δ_i)    for ν_i > 0,    [8.13]

where Γ is the gamma function Γ(a) = ∫_0^∞ t^{a−1} e^{−t} dt. It can be shown that this gamma distribution has E(δ_i) = 1, as required by Equation 8.9, and Var(δ_i) = 1/ν_i. ν also affects the shape of the distribution: as ν increases, the distribution becomes increasingly bell shaped and centered around 1.

[Figure 8.5. The Density Function for the Gamma Distribution]

The negative binomial, hereafter NB, probability distribution is obtained by substituting 8.10 and 8.13 into Equation 8.11 and evaluating the integral (see Cameron & Trivedi, 1986, for details):

Pr(y_i | x_i) = [Γ(y_i + ν_i) / (y_i! Γ(ν_i))] [ν_i/(ν_i + μ_i)]^{ν_i} [μ_i/(ν_i + μ_i)]^{y_i},    [8.14]

with conditional variance

Var(y_i | x_i) = μ_i (1 + μ_i/ν_i).    [8.15]

Since μ and ν are positive, the conditional variance of y in the NBRM must exceed the conditional mean exp(xβ). (What must happen to ν to reduce the variance to that of the PRM?) The conditional variance in y increases the relative frequency of low and high counts. This is seen in Figure 8.6, where the Poisson and NB distributions are compared for means of 1 and 10. The NB distribution corrects a number of sources of poor fit that are often found when the Poisson distribution is used. First, the variance of the NB distribution exceeds the variance of the Poisson distribution for a given mean. Second, the increased variance results in substantially larger probabilities for small counts: in panel A, the probability of a zero count increases from .37 for the Poisson distribution to larger values as the variance of the NB distribution increases. Finally, there are slightly larger probabilities for large counts in the NB distribution.

While the mean structure is fully identified by 8.14, the variance is unidentified in Equation 8.15. The problem is that if ν varies by individual, then there are more parameters than observations. The most common identifying assumption is that ν is the same for all individuals: ν_i = α^{−1} for α > 0. This assumption simply states that the variance of δ is constant. (We set the value to α^{−1} rather than α to simplify the formulas that follow.) α is known as the dispersion parameter since increasing α increases the conditional variance of y. This is seen by substituting ν = α^{−1} into Equation 8.15:

Var(y | x) = μ(1 + αμ).    [8.16]
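A short numerical check of Equations 8.14-8.16, with illustrative values μ = 1 and α = .5, confirms the moments and the heavier zero probability relative to the Poisson:

```python
# Sketch of the NB distribution (Eq. 8.14) with nu = 1/alpha.
# mu and alpha below are hypothetical values for illustration.
from math import exp, lgamma, log

def nb_pmf(y, mu, alpha):
    """NB probability with E(y) = mu and Var(y) = mu * (1 + alpha * mu)."""
    nu = 1.0 / alpha
    return exp(lgamma(y + nu) - lgamma(y + 1) - lgamma(nu)
               + nu * log(nu / (nu + mu)) + y * log(mu / (nu + mu)))

mu, alpha = 1.0, 0.5
pr = [nb_pmf(y, mu, alpha) for y in range(200)]
mean = sum(y * p for y, p in enumerate(pr))
var = sum((y - mean) ** 2 * p for y, p in enumerate(pr))
print(round(mean, 3), round(var, 3))           # mu and mu*(1 + alpha*mu)
print(round(nb_pmf(0, mu, alpha), 3), round(exp(-mu), 3))  # more zeros than Poisson
```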
[Figure 8.6. Comparison of the Negative Binomial and Poisson Distributions]

[Figure 8.7. Distribution of Counts for the Negative Binomial Regression Model (Panel B: NBRM with α = 1.0)]
Alternatively, the NB distribution can be derived from a process known as contagion, using an approach developed by Eggenberger and Pólya in 1923. Contagion occurs when individuals with the same set of x's initially have the same probability of an event, but this probability changes as events occur. Suppose that there are two scientists with identical characteristics who initially have the same rate of productivity μ. If one scientist publishes an article, her rate of productivity increases as a result of contagion from the initial publication: she may receive additional resources as a result of publishing, and these resources may increase her rate of publication. The other scientist's rate would stay the same as long as he did not publish. The process is contagious in the sense that success in publishing increases the rate of future publishing. Thus, contagion violates the independence assumption of the Poisson distribution. Heterogeneity and contagion can generate the same NB distribution of counts. Consequently, heterogeneity is sometimes referred to as spurious contagion, as opposed to true contagion. With cross-sectional data it is impossible to determine whether the observed distribution of counts arose from true or spurious contagion.

The NBRM can be estimated by ML. The likelihood equation is

L(β, α | y, X) = ∏_{i=1}^{N} Pr(y_i | x_i),    [8.17]

where Pr(y_i | x_i) is the NB probability from Equation 8.14. After taking logs, the likelihood equation can be maximized with numerical methods. Lawless (1987) provides gradients and the Hessian.

8.3.3. Testing for Overdispersion

It is important to test for overdispersion if you use the PRM. Even with a correct specification of the mean structure, estimates from the PRM are inefficient, with standard errors that are biased downward (Cameron & Trivedi, 1986). If software is available to estimate the NBRM, a one-tailed z-test of H0: α = 0 can be used, since when α is zero the NBRM reduces to the PRM. Or, a LR test can be computed: if ln L_PRM is the log likelihood from the PRM and ln L_NBRM is the log likelihood from the NBRM, then 2(ln L_NBRM − ln L_PRM) is a test of H0: α = 0. To test at a given level, an adjusted critical value should be used, since α must be nonnegative. Cameron and Trivedi propose several tests based on the residuals from the PRM that do not require estimation of the NBRM.

8.3.4. Interpretation

Estimates of the NBRM for published articles are in Table 8.2. The coefficients can be interpreted in the same way as those from the PRM using the methods considered above. There is evidence of overdispersion: the estimate of α is positive, with z = 8.45, and a LR test can be computed, which is even more highly significant. Notice that the z-values for the NBRM are smaller than those for the PRM, which would be expected with overdispersion.

Table 8.4 shows that the NBRM does a much better job than the PRM in predicting the counts from 0 to 3. Another way to see the differences between the two models is to compare their predictions for the probability of not publishing as the levels of other variables change. In Figure 8.8, the probability of having zero publications is computed when each variable except the mentor's number of articles is held at its mean. For both models, the probability of a 0 decreases as the mentor's articles increase, but the probability of predicted 0's is significantly higher for the NBRM.
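The LR test described in Section 8.3.3 is simple arithmetic once the two log likelihoods are in hand. A minimal sketch, using hypothetical stand-ins for fitted-model output:

```python
# Sketch of the LR test for overdispersion: the PRM is the NBRM with
# alpha = 0, so G2 = 2(lnL_NBRM - lnL_PRM) tests H0: alpha = 0.
# The log likelihood values below are assumed, not from the text's data.
lnL_prm = -1651.0   # assumed log likelihood from the Poisson model
lnL_nbrm = -1560.9  # assumed log likelihood from the negative binomial model

g2 = 2 * (lnL_nbrm - lnL_prm)
# Because alpha is restricted to be nonnegative, the test is one-sided:
# compare g2 against the chi-square(1) critical value for twice the
# nominal level, e.g. 2.71 for a .05-level test.
print(round(g2, 1), g2 > 2.71)
```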
As the probability of a zero count goes to zero, the expected value for the zero truncated count converges to the expected value without truncation. Since the smallest counts are excluded, the variance is less than that without truncation.

Estimation is by ML. For the zero truncated Poisson model, the likelihood is

L(β | y, X) = ∏_{i=1}^{N} Pr(y_i | y_i > 0, x_i),    [8.24]

and similarly for the truncated negative binomial with the added parameter α. Gurmu and Trivedi (1992) developed tests for overdispersion in the truncated count model.
Comparable formulas are available for the other zero inflated models:

Pr(y_i | x_i)    for y_i > 0.    [8.28]

The ZINB model is created by replacing the Poisson distribution with the NB distribution, with corresponding adjustments to the probabilities. Greene (1994) shows that when there is no zero inflation the model reduces to the standard PRM; otherwise, the variance exceeds the mean.

For the ZIP model, predicted probabilities of a zero count are based on Equation 8.28:

Pr(y = 0 | x) = ψ + (1 − ψ) exp(−μ).

Similarly, for the ZINB model, the predicted probability of a zero count is

Pr(y = 0 | x) = ψ + (1 − ψ) Pr_NB(y = 0 | x).
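The ZIP zero probability combines the "always zero" group with the Poisson zeros. A minimal sketch, with hypothetical values of ψ and μ:

```python
# Sketch of the ZIP probability of a zero count: psi is the probability of
# membership in the "always zero" group; the rest follow a Poisson with
# mean mu. psi and mu below are hypothetical.
from math import exp

def zip_pr0(psi, mu):
    return psi + (1 - psi) * exp(-mu)

print(round(zip_pr0(0.2, 1.5), 3))  # always exceeds the plain Poisson exp(-1.5)
```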
Third, the magnitudes and significance levels of the parameters for the binary process are different from those for the count process. The level of [...] has the strongest effect in the binary process separating potential publishers from nonpublishers. None of the other variables is significant in the binary portion of the ZIP model, although being married makes it more likely that a scientist has the potential to publish in the ZINB model.

Figure 8.9 plots the difference between the observed proportions for each count and the mean probability from the four models. We see immediately that the major failure of the PRM is in predicting the number of zeros, with an underprediction of about .1. The ZIP does much better at predicting zeros, but has poor predictions for counts one through three. The NBRM predicts the zeros very well and also has much better predictions for the counts from one to three. The ZINB slightly overpredicts zeros and underpredicts ones, with similar predictions to the NBRM for other counts. Overall, the NBRM provides the most accurate predictions, which are slightly better than those for the ZINB.

We can also test differences between pairs of models. Section 8.3.3 showed that the PRM and the NBRM can be compared using the dispersion parameter α. Since the NBRM reduces to the PRM when α = 0, the models are nested. For our example, we found that α was significant, with a Wald test of z_α = 8.45 and a LR test of 180.2. There is clear evidence supporting the NBRM over the PRM. The ZIP and ZINB models are also nested, and we can test H0: α = 0 with either the z-statistic for α in the ZINB or a LR test: 2(ln L_ZINB − ln L_ZIP) = 109.6. There is evidence that the ZINB improves the fit over the ZIP model.
Conclusions

This chapter considered a variety of count models and issues of estimation, which were illustrated with the data on scientific productivity.

Bibliographic Notes

Cameron and Trivedi (1986) presented many of these models and tests for count models. King introduced the Poisson count model to political science, along with the negative binomial model. Truncated count models grew out of work on the truncated Poisson distribution (see Johnson et al., 1992, Chapter 4). Mullahy (1986) proposed the with zeros model to deal with the frequency of zero counts. Grogger and Carson (1991) extended the Poisson model to deal with truncation. Lambert (1992) developed an extension referred to as the zero inflated Poisson model. For an extensive and relatively nontechnical introduction to models related to the Poisson distribution, see [...], which provides much useful information and also considers related models.

9. Conclusions

We have considered many models in the last 250 pages. If you are encountering these models for the first time, it may be hard to keep track of the differences and similarities. Still, there are many important models that have not been considered, some of which are very closely related to the models in this book. In this brief chapter, I try to address both issues. First, I summarize the connections among the models from the prior chapters using three distinct but complementary frameworks: the latent variable model, the generalized linear model, and nonlinear probability models. Second, I show the connections between the models we have studied and two important classes of models: log-linear models and models for survival analysis. This brief discussion will be most useful for those who are familiar with log-linear and survival models and who are interested in their connections to the models in this book. Additional links between the models in this book and models for ordinal variables are found in Clogg and Shihadeh (1994). Heinen's (1996) book on latent class and discrete latent trait models also illustrates many links between these models and the models we have discussed.
Several models in this book were based on a structural model with a latent dependent variable:

y* = xβ + ε.

Different models correspond to different ways in which y* is observed.

The Linear Regression Model. If y* is observed for all cases, this leads to the linear regression model:

y = xβ + ε.

OLS or ML can be used to estimate the β's. Since y is observed, interpretation of the unstandardized coefficients is direct: β_k is the change in the expected value of y for a unit change in x_k.

The Tobit Model. The tobit model is formed by assuming that when y* is below some value τ we do not know its value, but only that y* is at or below τ. Our measurement model is

y = y*    if y* > τ
y = τ_y   if y* ≤ τ.

Therefore, y* is not observed for the censored observations, and we assume that y = τ_y if y* ≤ τ.

This model is illustrated in Figure 9.1. Consider the distribution of y* at the value of x with the distribution labeled A. When y* is above τ, y is equal to the latent variable; otherwise y* is censored and all we know is that y* ≤ τ. We can also compute the probability of a case being censored; this corresponds to area B in Figure 9.1. For a given x, the proportion of cases at or below τ is

Pr(y* ≤ τ | x) = Pr(ε ≤ τ − xβ | x).

The Binary Probit Model. The binary probit model can be thought of as a tobit model in which values both above and below τ are censored. The measurement model is

y = 1   if y* > τ
y = 0   if y* ≤ τ.

Interpretation of the probit model often focuses on the probability that a 1 was observed (i.e., y* is above the threshold) or that a 0 was observed (i.e., y* is at or below the threshold). This illustrates the close connection between the probit and tobit models. From Chapter 3,

Pr(y = 0 | x) = Pr(y* ≤ τ | x) = Pr(ε ≤ τ − xβ | x),    [9.4]

identical to the probability of censoring in the tobit model; this is why the tobit model is sometimes referred to as "Tobin's probit."
The Ordinal Regression Model. The ordinal regression model arises when y* is observed only in categories coded 1 through J. The variance of ε can be fixed only by assumption, since nothing observed reflects the scale of y*; censoring occurs between the τ's. If τ_U and τ_L are the upper and lower thresholds, the two-limit model contains other models as limiting cases: as τ_U goes to ∞, we obtain the tobit model censored from below; as τ_L goes to −∞, we obtain the tobit model censored from above.
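The identity in Equation 9.4 can be verified numerically: with normal errors, the censoring probability of the tobit model and Pr(y = 0) from the probit model are literally the same quantity. The values of τ, β, and x below are hypothetical.

```python
# Sketch of Equation 9.4 with standard normal errors; tau, beta, and x
# are hypothetical illustration values.
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

tau, beta0, beta1, x = 0.0, -0.5, 0.25, 3.0
pr_censored = Phi(tau - (beta0 + beta1 * x))  # tobit: Pr(y* <= tau | x)
pr_y0 = Phi(tau - (beta0 + beta1 * x))        # probit: Pr(y = 0 | x)
print(round(pr_censored, 3), pr_censored == pr_y0)
```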
Browne and Arminger (1995) review these models, which are known as mean and covariance structure models with nonmetric dependent variables. Versions for the logit model are not available, since there is no analog to the multivariate normal distribution of ε that is used to allow correlations across equations.

The ordered regression model is identical except that the values of the thresholds are unknown. Because the thresholds are unknown, we have no way to link the scale of y* to an observed variable; consequently, there is no way for us to estimate the variance of y*, and we can interpret fully standardized coefficients but not the unstandardized coefficients.

The Generalized Linear Model. For example, in the LRM, y is assumed to be distributed conditionally normally with mean μ. The systematic component of the GLM assumes that

η = xβ,

where η is called the linear predictor. The expected value μ is linked to the linear predictor through the link function g:

η = g(μ).

The distribution of the random error and the link function define the model. For example, if the link is the identity function, η = μ, and the errors are normal, we have the LRM.
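The link-function idea can be sketched compactly: each model in the GLM family pairs a distribution with a function g mapping the mean to the linear predictor. The value of η below is an arbitrary illustration.

```python
# Sketch of GLM link functions: eta = g(mu), mu = g^{-1}(eta).
# The identity link gives the LRM, the log link Poisson regression,
# and the logit link the logit model. eta = 0.4 is an arbitrary value.
from math import exp, log

links = {
    "identity": (lambda mu: mu, lambda eta: eta),             # LRM
    "log": (lambda mu: log(mu), lambda eta: exp(eta)),        # Poisson regression
    "logit": (lambda mu: log(mu / (1 - mu)),
              lambda eta: 1 / (1 + exp(-eta))),               # logit model
}

eta = 0.4
for name, (g, g_inv) in links.items():
    mu = g_inv(eta)
    print(name, round(mu, 3), abs(g(mu) - eta) < 1e-12)  # g recovers eta
```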
[Table 9.1. Death Penalty Verdicts by Race of Defendant (D), Race of Victim (V), and Penalty (P)]

To see the link between log-linear models and Poisson regression, define three dummy variables that equal 1 to indicate that an observation is in level 2 of a given variable:

x_D = 1 if D = 2, else 0;
x_V = 1 if V = 2, else 0;
x_P = 1 if P = 2, else 0.

Thus, whenever you are in a cell where D = 2, x_D equals 1. Then

ln μ_ijk = β_0 + β_D x_D + β_V x_V + β_P x_P    [9.6]

specifies a Poisson regression model. Consider several cells of the table, corresponding to Equation 9.6. For cell (1, 1, 1), x_D = 0, x_V = 0, and x_P = 0, so

ln μ_111 = β_0,

while for cell (2, 2, 2),

ln μ_222 = β_0 + β_D + β_V + β_P.

The estimates from this model are identical to those from the log-linear model

ln μ_ijk = λ + λ_i^D + λ_j^V + λ_k^P,

provided that constraints are imposed on the parameters. Identification in this model is similar to the situation with dummy variables in the LRM. For example, you cannot have one parameter for being a female and another for being a male. As with dummy variables in the LRM, we identify the model by assuming that the first level of each group of parameters is fixed at 0:

λ_1^D = 0;  λ_1^V = 0;  λ_1^P = 0.

With these constraints, β_0 = λ, β_D = λ_2^D, β_V = λ_2^V, and β_P = λ_2^P. Notice that the means for all cells where D = 2 include the parameter λ_2^D; all cells where V = 2 include the parameter λ_2^V; and so on. These parameters can be interpreted in the same way as the parameters for the Poisson regression model.

Interactions are added to the model to allow the counts in some combinations of cells to be more likely than would be expected if the variables were independent of one another. For example,

ln μ_ijk = λ + λ_i^D + λ_j^V + λ_k^P + [...] + λ_ijk^DVP.    [9.7]

To identify the model, we assume that the λ's equal 0 for all i, j, and k equal to 1. Equation 9.7 corresponds to a Poisson regression model with the corresponding interaction variables, and so on.
Notice that the modeled variable has been the number of observations in each cell, but our substantive focus is likely to be on the effects of the defendant's and the victim's race on the sentence received. The effects of race can be assessed by taking the difference between the logs of two counts when P = 2 and P = 1:

ln μ_ij2 − ln μ_ij1 = ln (μ_ij2 / μ_ij1).

This is the log of the odds, or logit, of not giving the death penalty given the races of the defendant and victim. Taking the difference of Equation 9.7 for a combination of D and V at the two levels of P gives

ln μ_ij2 − ln μ_ij1 = [...].    [9.8]

To show the link to the logit model, define some new dummy variables:

d = 1 if D = 2, else 0;

and similarly for V. Then Equation 9.8 can be written as

ln (μ_ij2 / μ_ij1) = β_0 + β_D d + [...],

where β_0 = λ_2^P and the remaining β's equal the corresponding interaction λ's. This is simply the logit model of Chapter 3, and it can be interpreted with predicted probabilities and factor changes in the odds in the same way. While my discussion of log-linear models oversimplifies a number of important issues, the basic ideas should be clear. See Agresti (1990) for a comprehensive discussion of log-linear models, or [...] (1996) for an excellent introduction.
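The differencing step above can be illustrated numerically: for each (D, V) combination, subtracting the log of the expected cell count at P = 1 from that at P = 2 yields exactly the log odds a logit model would fit. The 2×2×2 table of counts below is hypothetical.

```python
# Sketch of the log-linear / logit link: differences of log cell counts
# across the two levels of P are log odds. The counts are hypothetical.
from math import log

counts = {  # (D, V, P) -> count
    (1, 1, 1): 60, (1, 1, 2): 10,
    (1, 2, 1): 40, (1, 2, 2): 20,
    (2, 1, 1): 50, (2, 1, 2): 12,
    (2, 2, 1): 30, (2, 2, 2): 28,
}

for d in (1, 2):
    for v in (1, 2):
        logit = log(counts[(d, v, 2)]) - log(counts[(d, v, 1)])
        print(d, v, round(logit, 3))  # log odds of P = 2 vs. P = 1
```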
Appendix A. Answers to Exercises

Answers are given for selected exercises within the text.

Chapter 1: Introduction

Page 6: If [...], then the ratio of the numerators and denominators [...]:

exp(α* + β*x + δ*d) [...]

Chapter 2: Continuous Outcomes

Page 23: [The answer is a plot of E(y | x) against x from 0 to 100; figure omitted.]

Chapter 3: Binary Outcomes

Page 39: From Tables 3.1 and 3.2, xβ = −1.144 + [...] + (−.011 × 1.35) + (−.013 × [...]) + (.164 × 0) + (.019 × 0) + (.123 × 1.10) + (.007 × 20.13) = −.51, based on rounding to three decimal digits. Using the full precision stored by the computer, the value is −.48. Note that you should use full precision in making these computations.

Page 42: The point is located at (0.0, 0.50).

Page 45: At x_1, y_1* = α + βx_1; at x_2, y_2* = α + βx_2. The change in the expected value of y* is y_2* − y_1* = β(x_2 − x_1).
Page 55: The log likelihood is nearly flat around the maximum, so numerical methods would have difficulty finding the maximum.

[Figure: predicted probability by number of children; figure omitted.]

Therefore, ∂P/∂x = βP(1 − P), [which follows] from the symmetry of the probability curve.

Chapter 4: Hypothesis Testing and Goodness of Fit

Page 88: [...], thereby increasing the absolute value of the second derivative.

Page 91: If the variance is larger, we are less confident in the estimate and, accordingly, would want to give it less weight in making a decision.

Page 92: Let 0 be a 7 × 1 vector of 0's; let I be a 7 × 7 identity matrix. Then let Q = [0 I] and r = 0. [The models] are nested in M_4.

Page [...]: Summing just the cases where y = 0 [...]. Summing the cases where y = 1 [...]. Combining the two sums, Σ(y_i − ȳ)² = n_1 − 2n_1²/N + n_1²N/N² = n_1 − n_1²/N.

Page 111: This follows since D(M_F) = 0 and df_F = 0.

Chapter 5: Ordinal Outcomes

Page 121: For x = 15, the probabilities are computed from Φ[τ_m − .052(15)]: for example, Φ[.75 − .052(15)] [...] = 0.68; Φ[3.5 − .052(15)] [...]; [...] = 0.00.

Page 133: Consider m = 3. Then

Pr(y ≤ 3 | x) = Pr(y = 1 | x) + Pr(y = 2 | x) + Pr(y = 3 | x).

From Equation 5.6, Pr(y = m | x) = F(τ_m − xβ) − F(τ_{m−1} − xβ). Substituting and noting that F(τ_0 − xβ) = 0,

Pr(y ≤ 3 | x) = [F(τ_1 − xβ) − F(τ_0 − xβ)] + [F(τ_2 − xβ) − F(τ_1 − xβ)] + [F(τ_3 − xβ) − F(τ_2 − xβ)] = F(τ_3 − xβ).

Page 138: Combining Equations 5.2 and 5.10,

Pr(y ≤ m | x) = exp(τ_m − xβ) / [1 + exp(τ_m − xβ)].

Then

Pr(y > m | x) = 1 − Pr(y ≤ m | x) = 1 / [1 + exp(τ_m − xβ)].

Dividing,

Pr(y ≤ m | x) / Pr(y > m | x) = exp(τ_m − xβ).

Page 143: Excluding the intercept, β_m has K coefficients for each of the J − 1 binary logits, for a total of (J − 1)K coefficients; β has K coefficients excluding the intercept. Therefore, we are imposing (J − 1)K − K = K(J − 2) constraints.
Page 144: [Taking differences of] the ln Pr(B | x)'s [...].

Chapter 6: Nominal Outcomes

Page 173: Notice that the B's are lined up. The relative location of the other outcomes on the logit coefficient scale is [shown in the omitted figure].
Chapter 7: Limited Outcomes

Page 199: Since the normal distribution is symmetric, φ(−δ) = φ(δ) and Φ(−δ) = 1 − Φ(δ).

Page 203: If Φ(δ) = 1, then [...] and E(y | x) = xβ. That is, there is no censoring. If Φ(δ) = 0, then all cases are censored and E(y | x) = τ_y.

Page 207: Consider a figure with two values of x, with the conditional distributions A and B. The marginal distribution of y* would combine A and B to form a single marginal distribution indicated by the two peaks a and b. The marginal distribution has a much larger variance.

Chapter 8: Count Outcomes

Page [...]: Q is a matrix of 0's, 1's, and −1's [selecting the contrasted coefficients].

Page [...]: Pr(y = 0 | μ) = .46, Pr(y = 1 | μ) = .36, Pr(y = 2 | μ) = .14, and Pr(y = 3 | μ) = .04.
Berkson, J. [...]
Berndt, E. R. (1991). The practice of econometrics: Classic and contemporary. Reading, MA: Addison-Wesley.
Berndt, E. R., Hall, B. H., Hall, R. E., & Hausman, J. A. (1974). Estimation and inference in nonlinear structural models. Annals of Economic and Social Measurement, 3, 653-665.
Heckman, J. J. (1974). Shadow prices, market wages, and labor supply. Econometrica, 42, 679-694.
Heckman, J. J. (1976). The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement, 5, 475-492.
[...] regression models. In R. Gilchrist (Ed.), GLIM 82: Proceedings of the International Conference on Generalised Linear Models (pp. 109-[...]).
[...]. Applied regression. New York: John Wiley.
[...] (2nd ed.). New York: Oxford University Press.
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1994). Continuous univariate distributions. New York: John Wiley.
Johnson, N. L., Kotz, S., & Kemp, A. W. (1992). Univariate discrete distributions. New York: John Wiley.
Judge, G. G., Griffiths, W. E., Hill, R. C., Lütkepohl, H., & Lee, T.-C. (1985). The theory and practice of econometrics (2nd ed.). New York: John Wiley.
Kalbfleisch, J. D., & Prentice, R. L. (1980). The statistical analysis of failure time data. New York: John Wiley.
Kaufman, R. L. (1996). Comparing effects in dichotomous logistic regression: A variety of standardized coefficients. Social Science Quarterly, 77, 90-109.
King, G. (1988). Statistical models for political science event counts: Bias in conventional procedures and evidence for the exponential Poisson regression model. American Journal of Political Science, 32, 838-863.
King, G. (1989). Unifying political methodology: The likelihood theory of statistical inference. New York: Cambridge University Press.
Lambert, D. (1992). Zero-inflated Poisson regression with an application to defects in manufacturing. Technometrics, 34, 1-14.
Lancaster, T. (1990). The econometric analysis of transition data. New York: Cambridge University Press.
Land, K. C. [...]. Models of criminal careers: Some suggestions for moving beyond the current debate. Criminology, [...], 149-155.
Landwehr, J. M., Pregibon, D., & Shoemaker, A. C. (1984). Graphical methods for assessing logistic regression models. Journal of the American Statistical Association, 79, [...].
Lawless, J. F. (1987). Negative binomial and mixed Poisson regression. Canadian Journal of Statistics, 15, 209-225.
[...]. The decomposition of coefficients in censored regression models: [...] independent variables on taxpayer behavior. National Tax Journal, [...].
Lesaffre, E., & Albert, A. (1989). Multiple-group logistic regression diagnostics. Applied Statistics, 38, 425-440.
Liao, T. F. (1994). Interpreting probability models: Logit, probit, and other generalized linear models. Thousand Oaks, CA: Sage.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: John Wiley.
Long, J. S. (1983). Confirmatory factor analysis. Newbury Park, CA: Sage.
Long, J. S. (1987). A graphical method for the interpretation of multinomial logit analysis. Sociological Methods and Research, 15, 420-446.
Long, J. S. (1990). The origins of sex differences in science. Social Forces, 68, 1297-1315.
Long, J. S. (1993). MARKOV: A statistical environment for GAUSS, version 2. Maple Valley, WA: Aptech Systems, Inc.
Long, J. S., Allison, P. D., & McGinnis, R. (1979). Entrance into the academic career. American Sociological Review, 44, 816-830.
Long, J. S., & McGinnis, R. (1981). Organizational context and scientific productivity. American Sociological Review, 46, 422-442.
Longford, N. T. (1995). Random coefficient models. In G. Arminger, C. C. Clogg, & M. E. Sobel (Eds.), Handbook of statistical modeling for the social and behavioral sciences (pp. 519-578). New York: Plenum.
Luce, R. D. (1959). Individual choice behavior. New York: John Wiley.
Maddala, G. S. (1983). Limited-dependent and qualitative variables in econometrics. Cambridge: Cambridge University Press.
Maddala, G. S. (1992). Introduction to econometrics (2nd ed.). New York: Macmillan.
Maddala, G. S., & Nelson, F. D. (1975). Switching regression models with exogenous and endogenous switching. Proceedings of the American Statistical Association (Business and Economics Section), 423-426.
Maddala, G. S., & Trost, R. P. (1982). On measuring discrimination in loan markets. Housing Finance Review, 1, 245-268.
Magee, L. (1990). R² measures based on Wald and likelihood ratio joint significance tests. American Statistician, 44, 250-253.
Manski, C. F. (1995). Identification problems in the social sciences. Cambridge, MA: Harvard University Press.
Marcus, A., & Greene, W. H. (1985). The determinants of rating assignment and performance (Working Paper No. CRC528). Alexandria, VA: Center for Naval Analyses.
McCullagh, P. (1980). Regression models for ordinal data (with discussion). Journal of the Royal Statistical Society, Series B, 42, 109-142.
McCullagh, P. (1986). The conditional distribution of goodness-of-fit statistics for discrete data. Journal of the American Statistical Association, 81, 104-107.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). New York: Chapman and Hall.
McDonald, J. F., & Moffitt, R. A. (1980). The uses of tobit analysis. Review of Economics and Statistics, 62, 318-321.
McFadden, D. (1968). The revealed preferences of a government bureaucracy (Working Paper). Berkeley: University of California, Department of Economics.
McFadden, D. (1973). Conditional logit analysis of qualitative choice behavior. In P. Zarembka (Ed.), Frontiers of econometrics (pp. 105-142). New York: Academic Press.
McFadden, D. (1981). Econometric models of probabilistic choice. In C. F. Manski & D. McFadden (Eds.), Structural analysis of discrete data (pp. 198-272). Cambridge, MA: MIT Press.
"A major strength of the book is the way that it is organized. The chapter about each technique is written in a highly organized and parallel format. First the statistical basis and assumptions for the particular model is developed, then estimation issues are considered, then issues of testing and interpretation are considered, then variations and extensions are explored."
—Robert L. Kaufman, The Ohio State University