Binary Dependent Variables

• 9.1 Homogeneous models

– Logit, probit models
– Inference
– Tax preparers
• 9.2 Random effects models
• 9.3 Fixed effects models
• 9.4 Marginal models and GEE
• Appendix 9A - Likelihood calculations
9.1 Homogeneous models
• The response of interest, yit, now may be only a 0 or a 1, a binary
dependent variable.
– Typically indicates whether the ith subject possesses an
attribute at time t.
• Suppose that the probability that the response equals 1 is denoted
by Prob(yit = 1) = pit.
– Then, we may interpret the mean response to be the probability
that the response equals 1, that is, $\mathrm{E}\,y_{it} = 0 \times \mathrm{Prob}(y_{it} = 0) + 1 \times \mathrm{Prob}(y_{it} = 1) = p_{it}$.
– Further, straightforward calculations show that the variance is
related to the mean through the expression $\mathrm{Var}\,y_{it} = p_{it}(1 - p_{it})$.
Inadequacy of linear models
• Homogeneous means that we will not incorporate subject-specific
terms that account for heterogeneity.
• Linear models of the form $y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it}$ are inadequate because:
– The expected response is a probability and thus must vary
between 0 and 1 although the linear combination, $\mathbf{x}_{it}'\boldsymbol{\beta}$, may
vary between negative and positive infinity.
– Linear models assume homoscedasticity (constant variance) yet
the variance of the response depends on the mean which varies
over observations.
– The response must be either a 0 or 1 although the distribution of
the error term is typically regarded as continuous.
Using nonlinear functions of explanatory variables
• In lieu of linear, or additive, functions, we express the probability of
the response being 1 as a nonlinear function of explanatory variables
$p_{it} = \pi(\mathbf{x}_{it}'\boldsymbol{\beta})$.
• Two special cases are:
– the logit case, $\pi(z) = \dfrac{1}{1 + e^{-z}} = \dfrac{e^{z}}{e^{z} + 1}$
– the probit case, $\pi(z) = \Phi(z)$, the cumulative standard normal distribution function.
• These two functions are similar. I focus on the logit case because it
permits closed-form expressions unlike the cumulative normal
distribution function.
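As a quick illustration of the two special cases, the sketch below evaluates both functions at a few values of z. The function names are illustrative only, and the standard normal distribution function is computed through the error function in Python's `math` module.

```python
import math

def logit_cdf(z):
    """Logit case: pi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def probit_cdf(z):
    """Probit case: pi(z) = Phi(z), the standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Both functions are increasing, map the real line into (0, 1), and equal 0.5 at z = 0.
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z = {z:+.1f}   logit = {logit_cdf(z):.4f}   probit = {probit_cdf(z):.4f}")
```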
Threshold interpretation
• Suppose that there exists an underlying linear model,
$y_{it}^* = \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it}^*$.
– The response is interpreted to be the “propensity” to possess a characteristic.
– We do not observe the propensity but we do observe when the
propensity crosses a threshold, say 0.
– We observe
$$y_{it} = \begin{cases} 0 & y_{it}^* \le 0 \\ 1 & y_{it}^* > 0 \end{cases}$$
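• Using the logit distribution function,
$\mathrm{Prob}(\varepsilon_{it}^* \le a) = 1/(1 + \exp(-a))$.
• Note that $\mathrm{Prob}(-\varepsilon_{it}^* \le \mathbf{x}_{it}'\boldsymbol{\beta}) = \mathrm{Prob}(\varepsilon_{it}^* \le \mathbf{x}_{it}'\boldsymbol{\beta})$, because the logistic distribution is symmetric about zero. Thus,
$$\mathrm{Prob}(y_{it} = 1) = \mathrm{Prob}(y_{it}^* > 0) = \mathrm{Prob}(\varepsilon_{it}^* \le \mathbf{x}_{it}'\boldsymbol{\beta}) = \frac{1}{1 + \exp(-\mathbf{x}_{it}'\boldsymbol{\beta})} = \pi(\mathbf{x}_{it}'\boldsymbol{\beta}).$$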
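The threshold argument can be checked by simulation. The sketch below is only an illustration: the value of the linear combination (`xb`), the number of draws, and the seed are arbitrary choices, and the logistic errors are drawn by inverting the distribution function given above.

```python
import math
import random

def simulate_threshold(xb, n_draws=100_000, seed=0):
    """Share of draws with y* = xb + eps* > 0, where eps* is standard logistic."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        u = rng.random()
        while u == 0.0:                 # avoid log(0); essentially never triggers
            u = rng.random()
        eps = math.log(u / (1.0 - u))   # inverse-CDF draw from the logistic distribution
        if xb + eps > 0.0:
            hits += 1
    return hits / n_draws

xb = 0.5  # hypothetical value of the linear combination x_it' beta
print("simulated Prob(y = 1):", simulate_threshold(xb))
print("logit formula        :", 1.0 / (1.0 + math.exp(-xb)))
```

The two printed numbers should agree up to simulation error.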
Random utility interpretation
• In economics applications, we think of an individual choosing
among c categories.
– Preferences among categories are indexed by an
unobserved utility function.
– We model utility as a function of an underlying value plus
random noise, that is, $U_{itj} = u_{it}(V_{itj} + \varepsilon_{itj})$, $j = 0, 1$.
– If Uit1 > Uit0 , then denote this choice as yit = 1.
– Assuming that uit is a strictly increasing function, we have
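$$\mathrm{Prob}(y_{it} = 1) = \mathrm{Prob}(U_{it0} < U_{it1}) = \mathrm{Prob}\big(u_{it}(V_{it0} + \varepsilon_{it0}) < u_{it}(V_{it1} + \varepsilon_{it1})\big) = \mathrm{Prob}(\varepsilon_{it0} - \varepsilon_{it1} < V_{it1} - V_{it0}).$$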

• Parameterize the problem by taking $V_{it0} = 0$ and $V_{it1} = \mathbf{x}_{it}'\boldsymbol{\beta}$.

• We may take the difference in the errors, $\varepsilon_{it0} - \varepsilon_{it1}$, to be normal
or logistic, corresponding to the probit and logit cases.
Logistic regression
• This is another phrase used to describe the logit case.
• Using $p = \pi(z)$, the inverse of $\pi$ can be calculated as $z = \pi^{-1}(p) = \ln(p/(1-p))$.
– Define $\mathrm{logit}(p) = \ln(p/(1-p))$ to be the logit function.
– Here, p/(1-p) is known as the odds ratio. It has a convenient
economic interpretation in terms of fair games.
• That is, suppose that p = 0.25. Then, the odds ratio is 0.25/0.75 = 0.333.
• The odds of winning are 0.333 to 1, or 1 to 3 (that is, 3 to 1 against). If we bet $1, then in a
fair game we should win $3.
• The logistic regression models the linear combination of explanatory
variables as the logarithm of the odds ratio,
$\mathbf{x}_{it}'\boldsymbol{\beta} = \ln(p_{it}/(1-p_{it}))$.
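The fair-game arithmetic and the logit function can be reproduced in a few lines. This is a small sketch with illustrative function names, using the value p = 0.25 from the example above.

```python
import math

def odds(p):
    """p / (1 - p), the quantity these notes call the odds ratio."""
    return p / (1.0 - p)

def logit(p):
    """The logit function ln(p / (1 - p))."""
    return math.log(odds(p))

def inv_logit(z):
    """Inverse of the logit function, pi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.25
print(round(odds(p), 3))    # 0.333
print((1.0 - p) / p)        # fair-game winnings per $1 bet: 3.0
print(inv_logit(logit(p)))  # recovers p up to rounding error
```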
Parameter interpretation
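• To interpret $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_K)'$, we begin by assuming that the jth
explanatory variable, $x_{itj}$, is either 0 or 1.
• Then, with the notation $\mathbf{x}_{it} = (x_{it1} \;\cdots\; x_{itj} \;\cdots\; x_{itK})'$, we may interpret
$$\beta_j = (x_{it1} \;\cdots\; 1 \;\cdots\; x_{itK})\,\boldsymbol{\beta} - (x_{it1} \;\cdots\; 0 \;\cdots\; x_{itK})\,\boldsymbol{\beta}$$
$$= \ln\left(\frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1)}{1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1)}\right) - \ln\left(\frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0)}{1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0)}\right).$$
• Thus,
$$e^{\beta_j} = \frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1) \,/\, \big(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1)\big)}{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0) \,/\, \big(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0)\big)}.$$
• To illustrate, if $\beta_j = \ln 2 \approx 0.693$, then $\exp(\beta_j) = 2$.
– The odds (for y = 1) are twice as great for $x_j = 1$ as for $x_j = 0$.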
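The interpretation of exp(βj) can be verified numerically. In the sketch below, the coefficient values and the fixed covariate are hypothetical; the point is only that the ratio of the two odds equals exp(βj) whatever the other covariates are.

```python
import math

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

def odds_ratio(beta_j, other_part):
    """Ratio of the odds at x_j = 1 to the odds at x_j = 0, other covariates held fixed.

    other_part is the contribution of the remaining covariates to x_it' beta.
    """
    p1 = inv_logit(other_part + beta_j * 1)
    p0 = inv_logit(other_part + beta_j * 0)
    return odds(p1) / odds(p0)

beta_j = math.log(2.0)  # hypothetical coefficient, about 0.693
for other_part in (-1.0, 0.0, 0.51):
    print(other_part, odds_ratio(beta_j, other_part), math.exp(beta_j))
```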
More parameter interpretation
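• Similarly, assuming that the jth explanatory variable is
continuous, we have
$$\beta_j = \frac{\partial}{\partial x_{itj}}\,\mathbf{x}_{it}'\boldsymbol{\beta} = \frac{\partial}{\partial x_{itj}} \ln\left(\frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj})}{1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj})}\right) = \frac{\dfrac{\partial}{\partial x_{itj}}\left[\mathrm{Prob}(y_{it} = 1 \mid x_{itj}) \,/\, \big(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj})\big)\right]}{\mathrm{Prob}(y_{it} = 1 \mid x_{itj}) \,/\, \big(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj})\big)}.$$
• Thus, we may interpret $\beta_j$ as the proportional change in the
odds ratio, known as an elasticity in economics.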
Parameter estimation
• The customary estimation method is maximum likelihood.
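• The log likelihood of a single observation is
$$y_{it} \ln \pi(\mathbf{x}_{it}'\boldsymbol{\beta}) + (1 - y_{it}) \ln\big(1 - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big) = \begin{cases} \ln\big(1 - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big) & \text{if } y_{it} = 0 \\ \ln \pi(\mathbf{x}_{it}'\boldsymbol{\beta}) & \text{if } y_{it} = 1 \end{cases}$$
• The log likelihood of the data set is
$$\sum_{it} \left\{ y_{it} \ln \pi(\mathbf{x}_{it}'\boldsymbol{\beta}) + (1 - y_{it}) \ln\big(1 - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big) \right\}.$$
• Taking partial derivatives with respect to $\boldsymbol{\beta}$ yields the score equations
$$\sum_{it} \mathbf{x}_{it} \big(y_{it} - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big) \frac{\pi'(\mathbf{x}_{it}'\boldsymbol{\beta})}{\pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big(1 - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big)} = \mathbf{0}.$$
– The solution of these equations, say $\mathbf{b}_{MLE}$, yields the maximum
likelihood estimate.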
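• The score equations can also be expressed as a generalized estimating
equation:
$$\sum_{it} \left(\frac{\partial}{\partial \boldsymbol{\beta}} \mathrm{E}\,y_{it}\right) (\mathrm{Var}\,y_{it})^{-1} (y_{it} - \mathrm{E}\,y_{it}) = \mathbf{0},$$
• where
$$\mathrm{E}\,y_{it} = \pi(\mathbf{x}_{it}'\boldsymbol{\beta}), \qquad \frac{\partial}{\partial \boldsymbol{\beta}} \mathrm{E}\,y_{it} = \mathbf{x}_{it}\,\pi'(\mathbf{x}_{it}'\boldsymbol{\beta}), \qquad \mathrm{Var}\,y_{it} = \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big(1 - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big).$$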
For the logit function
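• The normal equations are (because $\pi'(z) = \pi(z)(1 - \pi(z))$ for the logit function):
$$\sum_{it} \mathbf{x}_{it} \big(y_{it} - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big) = \mathbf{0}.$$
– The solution depends on the responses $y_{it}$ only through the vector of
statistics $\sum_{it} \mathbf{x}_{it} y_{it}$.
• The solution of these equations, say $\mathbf{b}_{MLE}$, yields the
maximum likelihood estimate.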
• This method can be extended to provide standard errors for
the estimates.
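The logit score equations have no closed-form solution, so they are solved iteratively. The sketch below uses Newton's method, which is one common choice and is an assumption here rather than something these notes prescribe. It fits an intercept and one slope to a small made-up data set, and takes standard errors from the inverse of the information matrix $\sum_{it} \pi(1-\pi)\,\mathbf{x}_{it}\mathbf{x}_{it}'$.

```python
import math

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_and_information(b, xs, ys):
    """Score sum x_it (y_it - pi) and information sum pi (1 - pi) x_it x_it', with x_it = (1, x)."""
    u0 = u1 = i00 = i01 = i11 = 0.0
    for x, y in zip(xs, ys):
        p = inv_logit(b[0] + b[1] * x)
        w = p * (1.0 - p)
        u0 += y - p
        u1 += x * (y - p)
        i00 += w
        i01 += w * x
        i11 += w * x * x
    return (u0, u1), (i00, i01, i11)

def fit_logit(xs, ys, tol=1e-10, max_iter=50):
    """Newton iterations b <- b + I(b)^{-1} U(b) for an intercept and one slope."""
    b = [0.0, 0.0]
    for _ in range(max_iter):
        (u0, u1), (i00, i01, i11) = score_and_information(b, xs, ys)
        det = i00 * i11 - i01 * i01
        step0 = (i11 * u0 - i01 * u1) / det   # first row of the 2x2 inverse times the score
        step1 = (-i01 * u0 + i00 * u1) / det  # second row
        b[0] += step0
        b[1] += step1
        if abs(step0) + abs(step1) < tol:
            break
    # Standard errors: square roots of the diagonal of the inverse information at the solution.
    _, (i00, i01, i11) = score_and_information(b, xs, ys)
    det = i00 * i11 - i01 * i01
    se = (math.sqrt(i11 / det), math.sqrt(i00 / det))
    return b, se

# Made-up data, pooled over subjects and time as in the homogeneous model.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 1, 0, 1]
b_mle, se = fit_logit(xs, ys)
print("b_MLE:", b_mle)
print("standard errors:", se)
```

At the returned estimate both components of the score are numerically zero, which is the defining property of the solution.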
9.2 Random effects models
• We accommodate heterogeneity by incorporating subject-specific
variables of the form:
$p_{it} = \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})$.
– We assume that the intercepts are realizations of random variables
from a common distribution.
• We estimate the parameters of the $\{\alpha_i\}$ distribution and the K slope
parameters $\boldsymbol{\beta}$.
• By using the random effects specification, we dramatically reduce the
number of parameters to be estimated compared to the Section 9.3 fixed
effects set-up.
– This is similar to the linear model case.
• This model is computationally difficult to evaluate.
Commonly used distributions
• We assume that subject-specific effects are independent and come from a
common distribution.
– It is customary to assume that the subject-specific effects are normally distributed.
• We assume, conditional on subject-specific effects, that the responses are
independent. Thus, there is no serial correlation.
• There are two commonly used specifications of the conditional
distributions in the random effects panel data model.
– 1. A logistic model for the conditional distribution of a response. That is,
$$\mathrm{Prob}(y_{it} = 1 \mid \alpha_i) = \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) = \frac{1}{1 + \exp\big(-(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\big)}.$$
– 2. A normal model for the conditional distribution of a response. That is,
$$\mathrm{Prob}(y_{it} = 1 \mid \alpha_i) = \Phi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}),$$
where $\Phi$ is the standard normal distribution function.

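• Let $\mathrm{Prob}(y_{it} = 1 \mid \alpha_i) = \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})$ denote the conditional
probability for both the logistic and normal models.
• Conditional on $\alpha_i$, the likelihood for the it-th observation is:
$$\pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})^{y_{it}} \big(1 - \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\big)^{1 - y_{it}} = \begin{cases} 1 - \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) & \text{if } y_{it} = 0 \\ \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) & \text{if } y_{it} = 1 \end{cases}$$
• Conditional on $\alpha_i$, the likelihood for the ith subject is:
$$\prod_{t=1}^{T_i} \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})^{y_{it}} \big(1 - \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\big)^{1 - y_{it}}.$$
• Thus, the (unconditional) likelihood for the ith subject is:
$$l_i = \int \left\{ \prod_{t=1}^{T_i} \pi(a + \mathbf{x}_{it}'\boldsymbol{\beta})^{y_{it}} \big(1 - \pi(a + \mathbf{x}_{it}'\boldsymbol{\beta})\big)^{1 - y_{it}} \right\} \varphi(a)\,da.$$
– Here, $\varphi$ is the standard normal density function.

• Hence, the total log-likelihood is $\sum_i \ln l_i$.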
– Note: lots of evaluations of a numerical integral….
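To make the numerical integral concrete, here is a minimal sketch that approximates $l_i$ with a midpoint rule on a fixed grid. The grid limits, the number of nodes, the restriction to a single explanatory variable (K = 1), and all function names are assumptions made for illustration; production software typically uses a more efficient quadrature rule.

```python
import math

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def std_normal_pdf(a):
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def subject_likelihood(beta, xs, ys, lo=-8.0, hi=8.0, n=2000):
    """Midpoint-rule approximation to l_i for one subject (logit case, scalar x_it)."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        a = lo + (k + 0.5) * h
        cond = 1.0                       # conditional likelihood given alpha_i = a
        for x, y in zip(xs, ys):
            p = inv_logit(a + x * beta)
            cond *= p if y == 1 else 1.0 - p
        total += cond * std_normal_pdf(a)
    return total * h

def total_log_likelihood(beta, data):
    """Sum over subjects of ln l_i; data is a list of (xs, ys) pairs, one per subject."""
    return sum(math.log(subject_likelihood(beta, xs, ys)) for xs, ys in data)

# Made-up panel with three subjects and T_i = 3.
data = [([0.1, 0.4, 0.9], [0, 0, 1]),
        ([1.2, 0.3, 0.8], [1, 0, 1]),
        ([0.5, 0.6, 0.7], [0, 0, 0])]
print(total_log_likelihood(0.5, data))
```

Each evaluation of the log-likelihood requires one integral per subject, and a maximization routine evaluates the log-likelihood many times.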
Comparing logit to probit specification
• There are no important advantages or disadvantages when
choosing the conditional probability $\pi$ to be:
– logit function (logit model)
– standard normal (probit model)
• The likelihood involves roughly the same amount of work to
evaluate and maximize, although the logit function is slightly
easier to evaluate than the standard normal distribution function.
• The probit model is slightly easier to interpret because
unconditional probabilities can be expressed in terms of the
standard normal distribution function.
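• That is,
$$\mathrm{Prob}(y_{it} = 1) = \mathrm{E}\,\Phi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) = \Phi\left(\frac{\mathbf{x}_{it}'\boldsymbol{\beta}}{\sqrt{1 + \tau^2}}\right).$$
– Here, $\tau^2$ is the variance of the mean-zero, normally distributed subject-specific effects $\alpha_i$.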
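The closed-form expression for the unconditional probit probability can be checked against a direct numerical average over the random intercept. The sketch assumes that the intercepts are normal with mean zero and variance τ², and the values of the linear combination and of τ are arbitrary.

```python
import math

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def marginal_prob_numeric(xb, tau, lo=-8.0, hi=8.0, n=4000):
    """Midpoint-rule approximation to E Phi(alpha + xb), with alpha = tau * z and z standard normal."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        z = lo + (k + 0.5) * h
        total += std_normal_cdf(tau * z + xb) * std_normal_pdf(z)
    return total * h

def marginal_prob_closed(xb, tau):
    """Closed form Phi(xb / sqrt(1 + tau^2))."""
    return std_normal_cdf(xb / math.sqrt(1.0 + tau * tau))

xb, tau = 0.7, 1.5  # hypothetical values
print(marginal_prob_numeric(xb, tau), marginal_prob_closed(xb, tau))
```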
9.3 Fixed effects models
• As with homogeneous models, we express the probability of the
response being 1 as a nonlinear function of linear combinations of
explanatory variables.
• To accommodate heterogeneity, we incorporate subject-specific
variables of the form:
$p_{it} = \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})$.
– Here, the subject-specific effects account only for the intercepts and
do not include other variables.
– We assume that $\{\alpha_i\}$ are fixed effects in this section.
• In this chapter, we assume that responses are serially uncorrelated.
• Important point: Panel data models with dummy variables for the subject-specific effects provide inconsistent
parameter estimates.
Maximum likelihood estimation
• Unlike random effect models, maximum likelihood estimators are inconsistent in
fixed effects models.
– The log likelihood of the data set is
$$\sum_{it} \left\{ y_{it} \ln \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) + (1 - y_{it}) \ln\big(1 - \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\big) \right\}.$$
– This log likelihood can still be maximized to yield maximum likelihood estimators.
– However, as the subject size n tends to infinity, the number of parameters also tends to infinity.
• Intuitively, our ability to estimate $\boldsymbol{\beta}$ is corrupted by our inability to estimate
consistently the subject-specific effects $\{\alpha_i\}$.
– In the linear case, we had that the maximum likelihood estimates are equivalent to the least
squares estimates.
• The least squares estimates of $\boldsymbol{\beta}$ were consistent.
• The least squares procedure “swept out” intercept estimators when producing
estimates of $\boldsymbol{\beta}$.
Maximum likelihood estimation is inconsistent
• Example 9.2 (Chamberlain, 1978, Hsiao 1986).
– Let Ti = 2, K=1 and xi1 = 0 and xi2=1.
– Take derivatives of the likelihood function to get the
score functions – these are in display (9.8).
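– From (9.8), the score functions are
$$\frac{\partial L}{\partial \alpha_i} = y_{i1} + y_{i2} - \frac{e^{\alpha_i}}{1 + e^{\alpha_i}} - \frac{e^{\alpha_i + \beta}}{1 + e^{\alpha_i + \beta}}$$
– and
$$\frac{\partial L}{\partial \beta} = \sum_i \left( y_{i2} - \frac{e^{\alpha_i + \beta}}{1 + e^{\alpha_i + \beta}} \right).$$
– Appendix 9A.1
• Maximize this to get $b_{MLE}$.
• Show that the probability limit of $b_{MLE}$ is $2\beta$, and hence is an
inconsistent estimator of $\beta$.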
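A small simulation can illustrate the result. The sketch relies on a short calculation that is not spelled out in these slides: setting the first score function to zero for a subject with $y_{i1} + y_{i2} = 1$ gives $\alpha_i = -\beta/2$, subjects with a sum of 0 or 2 contribute nothing to the second score function, and solving the second score equation then gives $b_{MLE} = 2\ln(n_{01}/n_{10})$, where $n_{01}$ and $n_{10}$ count the (0, 1) and (1, 0) response patterns. The standard normal distribution for the true intercepts, the sample size, and the seed are assumptions chosen only for the illustration.

```python
import math
import random

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate_fixed_effects_mle(beta, n_subjects=100_000, seed=1):
    """Simulate T_i = 2 with x_i1 = 0, x_i2 = 1 and return b_MLE = 2 ln(n01 / n10)."""
    rng = random.Random(seed)
    n01 = n10 = 0
    for _ in range(n_subjects):
        alpha = rng.gauss(0.0, 1.0)  # assumed distribution of the true intercepts
        y1 = 1 if rng.random() < inv_logit(alpha) else 0
        y2 = 1 if rng.random() < inv_logit(alpha + beta) else 0
        if (y1, y2) == (0, 1):
            n01 += 1
        elif (y1, y2) == (1, 0):
            n10 += 1
    return 2.0 * math.log(n01 / n10)

beta_true = 1.0
print("true beta:", beta_true, " b_MLE:", simulate_fixed_effects_mle(beta_true))
```

With a large number of subjects the estimate settles near 2β rather than β.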
Conditional maximum likelihood
• This estimation technique provides consistent estimates of the
beta coefficients.
– It is due to Chamberlain (1980) in the context of fixed
effects panel data models.
• Let’s consider the logit specification of $\pi$, so that
$$p_{it} = \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) = \frac{1}{1 + \exp\big(-(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\big)}.$$
• Big idea: With this specification, it turns out that $\sum_t y_{it}$ is a
sufficient statistic for $\alpha_i$.
– Thus, if we condition on $\sum_t y_{it}$, then the distribution of the
responses will not depend on $\alpha_i$.
Example of the sufficiency
• To illustrate how to separate the intercept from the slope
effects, consider the case Ti = 2.
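– Suppose that the sum, $\sum_t y_{it} = y_{i1} + y_{i2}$, equals either 0 or 2.
• If sum equals 0, then $\mathrm{Prob}(y_{i1} = 0, y_{i2} = 0 \mid y_{i1} + y_{i2} = \mathrm{sum}) = 1$.
• If sum equals 2, then $\mathrm{Prob}(y_{i1} = 1, y_{i2} = 1 \mid y_{i1} + y_{i2} = \mathrm{sum}) = 1$.
• Neither conditional probability depends on $\alpha_i$.
• Both conditional events are certain and will contribute nothing
to a conditional likelihood.
– If sum equals 1,
$$\mathrm{Prob}(y_{i1} + y_{i2} = 1) = \mathrm{Prob}(y_{i1} = 0)\,\mathrm{Prob}(y_{i2} = 1) + \mathrm{Prob}(y_{i1} = 1)\,\mathrm{Prob}(y_{i2} = 0)$$
$$= \frac{\exp(\alpha_i + \mathbf{x}_{i1}'\boldsymbol{\beta}) + \exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})}{\big(1 + \exp(\alpha_i + \mathbf{x}_{i1}'\boldsymbol{\beta})\big)\big(1 + \exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})\big)}.$$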
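• Thus,
$$\mathrm{Prob}(y_{i1} = 0, y_{i2} = 1 \mid y_{i1} + y_{i2} = 1) = \frac{\mathrm{Prob}(y_{i1} = 0)\,\mathrm{Prob}(y_{i2} = 1)}{\mathrm{Prob}(y_{i1} + y_{i2} = 1)}$$
$$= \frac{\exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})}{\exp(\alpha_i + \mathbf{x}_{i1}'\boldsymbol{\beta}) + \exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})} = \frac{\exp(\mathbf{x}_{i2}'\boldsymbol{\beta})}{\exp(\mathbf{x}_{i1}'\boldsymbol{\beta}) + \exp(\mathbf{x}_{i2}'\boldsymbol{\beta})}.$$

• This does not depend on $\alpha_i$.

– Note that if an explanatory variable $x_{ij}$ is time-constant ($x_{ij,2} = x_{ij,1}$), then the corresponding parameter $\beta_j$ disappears
from the conditional likelihood.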
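The cancellation of the intercept in the case Ti = 2 is easy to confirm numerically. In this sketch the values of the two linear combinations and of the intercept are hypothetical; the conditional probability is computed directly from the two marginal probabilities and compared with the closed form.

```python
import math

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def cond_prob_direct(alpha, xb1, xb2):
    """Prob(y_i1 = 0, y_i2 = 1 | y_i1 + y_i2 = 1) from the two marginal probabilities."""
    p1 = inv_logit(alpha + xb1)
    p2 = inv_logit(alpha + xb2)
    return (1.0 - p1) * p2 / ((1.0 - p1) * p2 + p1 * (1.0 - p2))

def cond_prob_closed(xb1, xb2):
    """Closed form exp(xb2) / (exp(xb1) + exp(xb2)), which does not involve alpha_i."""
    return math.exp(xb2) / (math.exp(xb1) + math.exp(xb2))

# The direct calculation gives the same answer for every intercept.
for alpha in (-3.0, 0.0, 2.0):
    print(alpha, cond_prob_direct(alpha, 0.2, 1.1), cond_prob_closed(0.2, 1.1))

# Time-constant covariate: equal linear combinations give 1/2 whatever the coefficient.
print(cond_prob_closed(0.8, 0.8))
```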
Conditional likelihood estimation
• Let $S_i$ be the random variable representing $\sum_t y_{it}$ and let $\mathrm{sum}_i$ be
the realization of $\sum_t y_{it}$.
• The conditional likelihood of the data set is
$$\prod_{i=1}^{n} \frac{\prod_{t=1}^{T_i} p_{it}^{y_{it}} (1 - p_{it})^{1 - y_{it}}}{\mathrm{Prob}(S_i = \mathrm{sum}_i)}.$$
– Note that the ratio equals one when $\mathrm{sum}_i$ equals 0 or $T_i$.
– The distribution of Si is messy and is difficult to compute
for moderate size data sets with T more than 10.
• This provides a fix for the problem of “infinitely many
nuisance parameters.”
– Computationally difficult, hard to extend to more complex
models, hard to explain to consumers
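For a general Ti, one subject's contribution to the conditional likelihood can be computed by brute force, summing the joint probability over every response pattern with the observed total. The sketch below does this for a single explanatory variable; the data values and function names are illustrative. The number of patterns grows combinatorially with Ti, which is one way to see why the calculation becomes burdensome.

```python
import itertools
import math

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def conditional_contribution(alpha, beta, xs, ys):
    """Joint probability of the observed pattern divided by Prob(S_i = sum_i) (logit, scalar x_it)."""
    T = len(xs)
    ps = [inv_logit(alpha + x * beta) for x in xs]

    def joint(pattern):
        prob = 1.0
        for p, y in zip(ps, pattern):
            prob *= p if y == 1 else 1.0 - p
        return prob

    # Prob(S_i = sum_i): add up all patterns with the same number of ones.
    prob_sum = 0.0
    for ones in itertools.combinations(range(T), sum(ys)):
        pattern = [1 if t in ones else 0 for t in range(T)]
        prob_sum += joint(pattern)
    return joint(ys) / prob_sum

xs, ys, beta = [0.0, 1.0, 2.0], [0, 1, 1], 0.5  # made-up subject
for alpha in (-1.0, 0.0, 2.0):
    print(alpha, conditional_contribution(alpha, beta, xs, ys))
```

The printed value is the same for every intercept, as the sufficiency argument predicts.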
9.4 Marginal models and GEE
• Marginal models, also known as “population-averaged” models,
only require specification of the first two moments
– Means, variances and covariances
– Not a true probability model
– Ideal for moment estimation (GEE, GMM)
• Begin in the context of the random effects binary dependent
variable model
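– The mean is $\mathrm{E}\,y_{it} = \mu_{it} = \mu_{it}(\boldsymbol{\beta}, \boldsymbol{\tau}) = \int \pi(a + \mathbf{x}_{it}'\boldsymbol{\beta})\,dF_{\alpha}(a)$.

– The variance is $\mathrm{Var}\,y_{it} = \mu_{it}(1 - \mu_{it})$.

– The covariance is $\mathrm{Cov}(y_{ir}, y_{is}) = \int \pi(a + \mathbf{x}_{ir}'\boldsymbol{\beta})\,\pi(a + \mathbf{x}_{is}'\boldsymbol{\beta})\,dF_{\alpha}(a) - \mu_{ir}\mu_{is}$.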

GEE – generalized estimating equations
• This is a method of moments procedure
– Essentially the same as generalized method of moments
– One matches theoretical moments to sample moments, with
appropriate weighting.
• Idea – find the values of the parameters that satisfy
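$$\mathbf{0}_K = \sum_{i=1}^{n} \mathbf{G}_{\mu}(\mathbf{b}_{EE}, \boldsymbol{\tau}_{EE})\,\mathbf{V}_i(\mathbf{b}_{EE}, \boldsymbol{\tau}_{EE})^{-1}\,\big(\mathbf{y}_i - \boldsymbol{\mu}_i(\mathbf{b}_{EE}, \boldsymbol{\tau}_{EE})\big).$$

– We have already specified the variance matrix.

– We also use a $K \times T_i$ matrix of derivatives
$$\mathbf{G}_{\mu}(\boldsymbol{\beta}, \boldsymbol{\tau}) = \frac{\partial \boldsymbol{\mu}_i(\boldsymbol{\beta}, \boldsymbol{\tau})'}{\partial \boldsymbol{\beta}} = \left( \frac{\partial \mu_{i1}}{\partial \boldsymbol{\beta}} \;\cdots\; \frac{\partial \mu_{iT_i}}{\partial \boldsymbol{\beta}} \right).$$
– For binary variables, we have
$$\frac{\partial \mu_{it}}{\partial \boldsymbol{\beta}} = \mathbf{x}_{it} \int \pi'(a + \mathbf{x}_{it}'\boldsymbol{\beta})\,dF_{\alpha}(a).$$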
Marginal Model
• Choose the mean function to be $\mu_{it} = \Phi(\mathbf{x}_{it}'\boldsymbol{\beta})$.
– Motivated by probit specification
$$\mathrm{Prob}(y_{it} = 1) = \mathrm{E}\,\Phi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) = \Phi\left(\frac{\mathbf{x}_{it}'\boldsymbol{\beta}}{\sqrt{1 + \tau^2}}\right).$$

• For the variance function, consider $\mathrm{Var}\,y_{it} = \mu_{it}(1 - \mu_{it})$.
• Let $\mathrm{Corr}(y_{ir}, y_{is})$ denote the correlation between $y_{ir}$ and $y_{is}$.
– This is known as a working correlation.
• Use the exchangeable correlation structure specified as
$$\mathrm{Corr}(y_{ir}, y_{is}) = \begin{cases} 1 & \text{for } r = s \\ \rho & \text{for } r \ne s \end{cases}$$

• Here, the motivation is that the latent variable $\alpha_i$ is common to
all observations within a subject, thus inducing a common correlation.
• The parameters $\boldsymbol{\tau}$, including $\rho$, constitute the variance components.
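The variance function and the exchangeable working correlation combine into the matrix $\mathbf{V}_i$ used in the estimating equations. The sketch below assembles it in the usual way, with element (r, s) equal to the working correlation times the product of the two standard deviations. The function name, the value of ρ, and the linear combinations are hypothetical.

```python
import math

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def working_covariance(xbs, rho):
    """V_i with Var y_it = mu_it (1 - mu_it) on the diagonal and exchangeable correlation rho.

    xbs holds the linear combinations x_it' beta for one subject; mu_it = Phi(x_it' beta).
    """
    mus = [std_normal_cdf(xb) for xb in xbs]
    sds = [math.sqrt(m * (1.0 - m)) for m in mus]
    T = len(xbs)
    return [[sds[r] * sds[s] * (1.0 if r == s else rho) for s in range(T)]
            for r in range(T)]

V = working_covariance([0.0, 0.5, -0.3], rho=0.2)
for row in V:
    print(["%.4f" % v for v in row])
```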
Robust Standard Errors
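• Model-based standard errors are taken from the square root of
the diagonal elements of
$$\left( \sum_{i=1}^{n} \mathbf{G}_{\mu}(\mathbf{b}_{EE}, \boldsymbol{\tau}_{EE})\,\mathbf{V}_i(\mathbf{b}_{EE}, \boldsymbol{\tau}_{EE})^{-1}\,\mathbf{G}_{\mu}(\mathbf{b}_{EE}, \boldsymbol{\tau}_{EE})' \right)^{-1}.$$

• As an alternative, robust or empirical standard errors are taken from the square root of
the diagonal elements of
$$\left( \sum_{i=1}^{n} \mathbf{G}_{\mu} \mathbf{V}_i^{-1} \mathbf{G}_{\mu}' \right)^{-1} \left( \sum_{i=1}^{n} \mathbf{G}_{\mu} \mathbf{V}_i^{-1} (\mathbf{y}_i - \boldsymbol{\mu}_i)(\mathbf{y}_i - \boldsymbol{\mu}_i)' \mathbf{V}_i^{-1} \mathbf{G}_{\mu}' \right) \left( \sum_{i=1}^{n} \mathbf{G}_{\mu} \mathbf{V}_i^{-1} \mathbf{G}_{\mu}' \right)^{-1}.$$

• These are robust to misspecified heteroscedasticity as well as
time series correlation.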
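To show how the two variance expressions differ in practice, here is a deliberately simplified sketch: a single coefficient (K = 1), the probit mean function, and a working independence structure (ρ = 0) so that $\mathbf{V}_i$ is diagonal and no matrix inversion is needed. These simplifications, the data, and the value of b are assumptions for illustration. The formulas are evaluated at a supplied value of b rather than at a solved $b_{EE}$.

```python
import math

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def gee_variances(b, data):
    """Model-based and robust variances for a scalar b under working independence.

    data is a list of (xs, ys) pairs, one per subject; mu_it = Phi(x_it * b).
    """
    bread_inv = 0.0  # sum_i G V^{-1} G'
    meat = 0.0       # sum_i G V^{-1} (y - mu)(y - mu)' V^{-1} G'
    for xs, ys in data:
        gvg = 0.0
        gvr = 0.0
        for x, y in zip(xs, ys):
            mu = std_normal_cdf(x * b)
            g = x * std_normal_pdf(x * b)  # d mu_it / d beta
            v = mu * (1.0 - mu)            # variance function
            gvg += g * g / v
            gvr += g * (y - mu) / v
        bread_inv += gvg
        meat += gvr * gvr                  # squared within-subject sum keeps time dependence
    model_based = 1.0 / bread_inv
    robust = model_based * meat * model_based
    return model_based, robust

# Made-up panel: three subjects with T_i = 3.
data = [([0.5, 1.0, 1.5], [0, 1, 1]),
        ([0.2, 0.4, 0.6], [0, 0, 0]),
        ([1.0, 2.0, 3.0], [1, 1, 1])]
model_var, robust_var = gee_variances(0.3, data)
print("model-based s.e.:", math.sqrt(model_var), " robust s.e.:", math.sqrt(robust_var))
```

The robust version replaces the assumed variance with the observed cross-products of residuals within each subject, which is why it remains valid when the working correlation is wrong.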

