
Chapter 5

Models of Discrete Choice

5.1 Introduction

So far in our analysis, the dependent variable has been a continuous numerical variable.
In this chapter we turn our attention to models where that is not the case, and we
instead have discrete dependent variables. These variables reflect decisions like whether
to work or not, the choice of commuting option to get to work, whether to buy a house
or rent one, whether people drink Pepsi or Coca-Cola, and the like.

In general, for these models, the following requirements about the different alternatives must be met:

1. the alternatives must be mutually exclusive, i.e. selecting one necessarily precludes
the selection of any other alternative;

2. the alternatives must be collectively exhaustive, i.e. all possible alternatives are
included so the decision maker necessarily chooses one alternative; and

3. the number of alternatives must be finite, i.e. there is a limited amount of choices.

There are two ways to motivate these models: random utility and latent variable.


5.1.1 Random Utility

A decision maker, identified by i = 1, . . . , n, faces a choice over a limited number of alternatives, identified by j = 1, . . . , J. The utility that decision maker i gets from alternative j is Uij, and it has two parts: one observable by the researcher, Vij, and one not observable by the researcher and assumed to be random, εij, so

Uij = Vij + εij . (5.1)

The researcher does not observe the utilities of the different alternatives, but rather which
alternative is chosen. Since we assume that the decision maker is a utility maximizer,
she would choose alternative j if and only if the utility that alternative provides her with
is greater than the utility of any of the other alternatives, that is if

Uij > Uik ∀k ≠ j;
Vij + εij > Vik + εik ∀k ≠ j;
εik − εij < Vij − Vik ∀k ≠ j.

Since we don't observe the utilities across decision makers, we can no longer estimate what the expected utilities would be, i.e. the V̂ij. What we observe is whether the individual selected an alternative over the other ones. This means that we can estimate the probability that an alternative would be selected.¹ The uncertainty in the model comes from the unobserved utility parts, the εij. Let ε′i = (εi1 . . . εiJ) be the vector of unobservable utilities, and let f(εi) denote its joint density. Also let I(·) represent an indicator function that takes a value of 1 if the expression within it is true, and zero otherwise. Then, if Pij represents the probability that individual i will choose alternative j,

Pij = Prob (Uij > Uik, ∀k ≠ j)
    = Prob (εik − εij < Vij − Vik, ∀k ≠ j)
    = ∫ I (εik − εij < Vij − Vik, ∀k ≠ j) f(εi) dεi,    (5.2)

¹ Notice that if we identify the alternative chosen by a 1, and the rest by a zero, the average of the dependent variable is itself a probability, because it is the sample proportion that chose that alternative, so in fact estimating the probability of an alternative is not that different from estimating the expected value of a variable. We will make this even clearer shortly.

where equation (5.2) recognizes that Pij is the expected value of the mentioned indicator function over the density of the unobserved (uncertain) factors. Notice that the indicator function will return 1 when alternative j is selected, and 0 otherwise. If we thus set up a variable y such that yij = 1 when alternative j is selected, and zero otherwise, then modeling equation (5.2) is equivalent to modeling Pij = Prob (yij = 1), since yij = 1 if and only if εik − εij < Vij − Vik, ∀k ≠ j.
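To make equation (5.2) concrete, here is a minimal simulation sketch (the utility values and the choice of iid standard normal errors are illustrative assumptions, not from the text): it approximates Pij by drawing the unobserved εi many times and averaging the indicator function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative observable utilities for one individual and J = 3 alternatives.
V = np.array([1.0, 0.5, -0.2])

# Draw the unobserved utility parts; iid standard normal is an assumption here.
R = 100_000                       # number of simulation draws
eps = rng.standard_normal((R, 3))

# Utility of each alternative in each draw, and the chosen (max-utility) one.
U = V + eps
chosen = U.argmax(axis=1)

# P_ij is the expected value of the indicator that j has the highest utility.
P = np.bincount(chosen, minlength=3) / R
print(P, P.sum())                 # the probabilities sum to one
```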

5.1.2 Latent Variable

As in the case of utility in the random utility model, there are many cases where
we don't observe the variable we want, but rather whether the variable has crossed a
certain threshold. For example, we don’t observe an individual’s wage rate but rather
whether he is employed or not, or we don’t observe the individual’s actual income but
rather whether he belongs to a certain income group. Let us consider the example of the
reservation wage for simplicity.

Let yi∗ represent the actual variable we would like to have but do not observe, i.e. the latent variable (in our case the wage rate). If we were able to observe the latent variable, we could model its expected value by setting up a model like

yi∗ = Vi + εi . (5.3)

Instead we observe yij, which reflects whether the individual is employed, and thus the wage rate is above a certain threshold w̄, or is not employed, and the wage rate is below w̄. We really cannot measure E[yi∗], but if we let yij = 1 if the latent variable falls in the jth category, e.g. if the wage rate is large enough that we observe the individual employed, and 0 otherwise, we can model, once more, Pij = Prob (yij = 1).
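A minimal sketch of this latent-variable mechanism (the coefficients, the normal error, and the threshold are all illustrative assumptions): the simulation generates yi∗, but the estimation sample only ever contains the binary indicator.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

x = rng.normal(size=n)            # an observed covariate
eps = rng.normal(size=n)          # unobserved part of the latent variable
y_star = 0.5 + 1.0 * x + eps      # latent y* = V_i + eps_i (coefficients invented)

w_bar = 1.0                       # illustrative threshold (e.g. a reservation wage)
y = (y_star > w_bar).astype(int)  # all we observe: employed (1) or not (0)

# We cannot compute E[y*] from the observed data, but we can model
# Prob(y = 1 | x), the object of the rest of the chapter.
print(y.mean())                   # sample frequency of y = 1
```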

5.2 Identification of Choice Models

Having seen how these discrete choice models can arise from either a random utility
or a latent variable framework, from now on I will use the random utility motivation
to illustrate different characteristics of these models. The presentation in this section
follows Train (2009, pp. 19–29) very closely.

I would like to specify Vij to be linear in parameters with an alternative-specific constant αj, for simplicity of illustration and since it is the most common specification in empirical analysis, so let Vij = αj + x′ij β such that

Uij = αj + x′ij β + εij.    (5.4)

The identification topics that we are going to discuss can be summarized in two easy statements: "only differences in utility matter," and "the overall scale of utility is arbitrary."

5.2.1 Only Differences in Utility Matter

The absolute level of utility is irrelevant to both the decision maker's behavior and
the researcher's model. The decision maker will choose the same alternative whether
utility is measured by Uij ∀j or by U∗ij = Uij + c ∀j, for any
constant c. From the researcher's perspective it doesn't matter either. Notice that
Pij = Prob (Uij > Uik ∀k ≠ j) = Prob (Uij − Uik > 0 ∀k ≠ j), which depends only on
the difference in utilities. Consider what this means when using the typical linear in
parameters specification in equation (5.4):

Pij = Prob (Uij − Uik > 0 ∀k ≠ j)
    = Prob (αj + x′ij β + εij − αk − x′ik β − εik > 0 ∀k ≠ j)
    = Prob ((αj − αk) + (xij − xik)′ β + (εij − εik) > 0 ∀k ≠ j).    (5.5)

Alternative-Specific Constants

The alternative-specific constant αj captures the average effect on utility of alternative j of all the factors that are not included in the model, very much like the constant does in a regression model. When alternative-specific constants are included, the unobserved portion of utility, εij, has zero mean by construction.² It is thus reasonable to include an alternative-specific constant in Vij. However, since only differences in utility matter, only differences in the alternative-specific constants are relevant, not their absolute values. This is clearly seen in equation (5.5), where we see that what matters about

² If alternative-specific constants were not included, E[εij] = αj.

the constants is αj − αk ∀k ≠ j, i.e. only the difference in the constants matters. This
means that any model with the same differences in constants is equivalent.

In terms of estimation, this means that it is impossible to estimate all J constants, since an infinite number of values for the J constants would produce the same differences, and thus result in the same choice probabilities. To account for this fact, the researcher must normalize the absolute levels of the constants. The standard procedure is to normalize one of the constants to zero, and the reason is twofold: first, normalizing one constant to zero means dropping a parameter to be estimated, i.e. we only have to estimate J − 1 constants; and second, our estimates of the other J − 1 constants are easily interpreted as how much higher or lower the average effects of the non-included factors are for the different alternatives relative to the alternative for which the constant was dropped.

Individual-Specific Variables

When dealing with these models we have two types of variables: those that reflect
attributes of the alternatives (alternative-specific), and vary across alternatives and most
likely also across individuals, and those that reflect the attributes of the individuals
(individual-specific), and vary only across individuals and not across alternatives. For
example, consider an individual making the decision of whether to take the bus or the
car to work. The cost of the trip to work depends on the alternative chosen, bus fare
and gas, so it will vary across the alternatives. The cost will also vary across individuals
because the cost of taking the car depends on the distance of the commute, and different
individuals will have different distances. The cost of the commute is thus an alternative-
specific variable. Now consider the age of the commuter. The age is the same independent
of any alternative that the commuter may take, and thus it only varies across individuals:
age is an individual-specific variable. Consider then that the only two alternatives are car (C) and bus (B), and that the only two variables explaining the utility of each alternative are the total cost (Tij) and the individual's age (Ai).³ The utilities thus are

UiC = α⁰C + TiC β1 + Ai β⁰2C + εiC,
UiB = α⁰B + TiB β1 + Ai β⁰2B + εiB.
³ Notice that T has an ij subscript, whereas A has only an i subscript. This is to reflect that T varies
across individuals and across alternatives, since i identifies the individual and j the alternative, while
A varies only across individuals.

Notice that the alternative-specific variable has the same slope in both equations, i.e. T is always multiplied by β1, whereas the individual-specific variable has a different slope in each equation, β⁰2C in the car equation and β⁰2B in the bus equation. This is because T varies across alternatives, and thus the variation is reflected within the variable itself, whereas A doesn't vary across alternatives, and the variation is instead allowed through a different slope. Remember that to model the probability of choosing an alternative, for example car, we only care about the difference in utilities. So taking the difference between the two utilities,

UiB − UiC = (α⁰B − α⁰C) + (TiB − TiC) β1 + Ai (β⁰2B − β⁰2C) + (εiB − εiC).

We have just seen how the alternative-specific constants had to be normalized because any model that produces the same difference as α⁰B − α⁰C will be equivalent. Therefore we would set one constant, say α⁰C = 0, and estimate the other as the difference from the one dropped. With individual-specific variables we have the same problem with the coefficients. Notice that since the different effect of age on the utility for the individual comes via the different slope, when we take the difference between the two utilities for the individual we have the variable multiplied by the difference in the slopes (β⁰2B − β⁰2C). Again, any model that produces the same difference is equivalent, so we have to normalize the slope coefficients on individual-specific variables as well. We would set one of them, say β⁰2C, equal to zero, and estimate the other to be the differential effect of age on the utility of bus compared to car. Notice that this doesn't happen for alternative-specific variables, since the difference in the utilities comes from the variation within the variable, not the variation in the coefficients. The specification of the utilities we would estimate thus would be

UiC = TiC β1 + εiC,
UiB = αB + TiB β1 + Ai β2B + εiB,

where αB = α⁰B − α⁰C, and β2B = β⁰2B − β⁰2C.

Number of Independent Error Terms

Remember equation (5.2):

Pij = ∫ I (εik − εij < Vij − Vik, ∀k ≠ j) f(εi) dεi.

This probability is a J-dimensional integral over the density of the J error terms in εi. However, since only differences in utility matter, with J errors there are only J − 1 error differences. Let ε̃ikj = εik − εij ∀k ≠ j; then we have a (J − 1)-dimensional vector of error differences ε̃′ij = (ε̃i1j . . . ε̃iJj), where obviously ε̃ijj = εij − εij = 0 is the error difference dropped. Pij is equivalent to

Pij = ∫ I (ε̃ikj < Vij − Vik, ∀k ≠ j) g(ε̃ij) dε̃ij,

where g(·) is the density of these error differences. For any f(εi), the corresponding g(ε̃ij) can be derived. However, g(ε̃ij) is consistent with an infinite number of f(εi)'s. Since choice probabilities can always be expressed as depending only on g(ε̃ij), one dimension of the density f(εi) is not identified and must be normalized by the researcher.

Different models handle this normalization in different ways. Logit, for example, has a restrictive distribution of the error terms that automatically normalizes the needed dimension. Probit, in contrast, handles this normalization by specifying the model in terms of error differences, that is, by parameterizing the model in terms of g(·) without reference to f(·).

5.2.2 The Overall Scale of Utility is Arbitrary

In the same manner that adding a constant doesn't affect the decision maker's choice, neither does multiplying each alternative's utility by the same (positive) constant. The decision maker will prefer alternative j to alternative k independent of whether the utilities are measured by Uij = Vij + εij and Uik = Vik + εik, or by U⁰ij = λVij + λεij and U⁰ik = λVik + λεik, where λ > 0 is a constant. The choice of scale for the utility functions doesn't make any difference to the researcher either, because

Pij = Prob (Uij > Uik ∀k ≠ j) = Prob (λUij > λUik ∀k ≠ j) = Prob (U⁰ij > U⁰ik ∀k ≠ j).

The standard way to normalize the scale of utility is to normalize the variance of the error terms. The scale of utility and the variance of the error terms are linked by definition. When utility is multiplied by λ, each εij is multiplied by λ, so Var[λεij] = λ² Var[εij], i.e. the variance of the error terms increases by λ². Normalizing the variance of the error terms thus normalizes the scale of utility.

Normalizing with iid Errors

When errors are iid, the researcher normalizes the error variance to some number, which
is usually chosen for convenience. Since all errors have the same variance by assumption
(iid), normalizing the variance of any of them sets the variance for them all.

When the observed portion of utility is linear in parameters, the normalization provides a way of interpreting the coefficients. Consider the model

U⁰ij = αj + x′ij β + ε⁰ij,

where Var[ε⁰ij] = σ², and suppose the researcher normalizes the scale by setting the error variance to 1. The original model becomes the following equivalent specification:

U¹ij = (αj/σ) + x′ij (β/σ) + ε¹ij,

where Var[ε¹ij] = 1.

What if the researcher instead scales the variance of the error terms to be a constant λ ≠ 1? The model would then become

U²ij = (αj/σ)√λ + x′ij (β/σ)√λ + ε²ij,

and Var[ε²ij] = λ.

The standard logit model normalizes the variance of the errors to π²/6, whereas the standard probit model normalizes the variance to 1. The actual errors, obviously, have a variance of σ², so assuming that both estimators would estimate the same values for αj and β, the logit results would be π/√6 (approximately 1.28) times larger than the probit results. Notice that the estimations do not provide estimates of αj or β, but rather estimates of the scaled parameters, because of the necessary normalization of the variance of the error terms. Even though we know the scale used in each model and can use it to de-scale the estimates, we can only recover estimates of αj/σ and β/σ, because we don't know the true variance of the error term, and we cannot estimate it because we can't identify it.
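A tiny numeric sketch of this rescaling logic (all values invented): whatever the true (αj, β, σ), each normalization reports the parameters scaled by its assumed error standard deviation, so only the ratios αj/σ and β/σ are ever recoverable.

```python
import numpy as np

alpha, beta, sigma = 1.0, 2.0, 3.0   # illustrative true values; sigma is unknowable

# Normalizing Var(eps) to 1 reports (alpha/sigma, beta/sigma):
print(alpha / sigma, beta / sigma)

# Normalizing Var(eps) to lam instead multiplies everything by sqrt(lam):
lam = np.pi**2 / 6                   # e.g. the standard logit normalization
print(alpha / sigma * np.sqrt(lam), beta / sigma * np.sqrt(lam))

# The two sets of reported estimates differ by sqrt(lam) = pi/sqrt(6) ~ 1.28,
# even though the underlying preferences are identical.
print(np.sqrt(lam))
```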

Normalizing with Heteroskedastic Errors

When the variance of the errors differs for different segments of the population, i.e.
errors are heteroskedastic, we cannot simply set the overall level of utility

by normalizing the variance of the errors for all segments to the same value. Instead we have to set the overall scale of utility by normalizing the variance for one segment, and then estimating the variance of every other segment relative to the first. To illustrate this, consider that we want to estimate a model with data from Boston and Chicago, and that the errors in Chicago have a different variance than those in Boston, i.e. σ²C ≠ σ²B. Let σ²C/σ²B = k. If we were to estimate the following specifications

Uij = αj + x′ij β + ε^B_ij,
Uij = αj + x′ij β + ε^C_ij,

each on its own, normalizing the variance of each set of errors to the same value, we would get estimates of the coefficients for the Chicago equation that would be √(σ²B/σ²C) = 1/√k times the estimates we would get for Boston, since each normalization divides the parameters by a different error standard deviation. Visually we may be inclined to think that the marginal effects of the different variables on utility differ between people living in Chicago and people living in Boston, when the only difference comes from normalizing the two variances to the same value. To normalize the errors appropriately in this case, we can instead divide the utility for the people in Chicago by √k; this doesn't change their behavior, because the scale of utility doesn't matter, but it allows us to have the model

Uij = αj + x′ij β + ε^B_ij    (Boston);
Uij = (αj/√k) + x′ij (β/√k) + ε̃ij    (Chicago),

where ε̃ij = ε^C_ij/√k, so all the errors now have the same variance. The scale of utility is thus set by normalizing that common error variance, and the parameter k is estimated along with α and β. The estimated value k̂ tells the researcher the variance of unobserved factors in Chicago relative to those in Boston. For example, k̂ = 1.20 implies that the variance of unobserved factors is 20% greater in Chicago than in Boston.

Normalization with Correlated Errors

So far we have assumed that errors are independent across alternatives. When errors are
correlated across alternatives normalizing for scale becomes more complex. Normalizing
for the variance of the errors of one alternative is no longer sufficient. So far we have
talked about setting the scale of utility, but since only differences in utility matter, it is
really more appropriate to talk about setting the scale of utility differences.

Consider that we have four alternatives, i.e. Uij = Vij + εij for j = 1, 2, 3, 4. The error vector ε′i = (εi1 . . . εi4) has zero mean and covariance matrix

Ω = [ σ11  σ12  σ13  σ14 ]
    [  ·   σ22  σ23  σ24 ]    (5.6)
    [  ·    ·   σ33  σ34 ]
    [  ·    ·    ·   σ44 ],
where the dots refer to the corresponding elements in the upper part of the matrix, since Ω is symmetric. Since only differences in utility matter, this model is equivalent to one where all utilities are differenced from, say, the first alternative.⁴ The equivalent model is

Uij − Ui1 = Vij − Vi1 + εij − εi1 for j = 2, 3, 4;
Ũij1 = Ṽij1 + ε̃ij1 for j = 2, 3, 4.
Consider now the vector of error differences, ε̃′i1 = (ε̃i21  ε̃i31  ε̃i41). The variance of, for example, ε̃i21 depends on the variances of the first and second of the original errors, as well as on the covariance between them. That is,

Var[ε̃i21] = Var[εi2 − εi1] = σ11 + σ22 − 2σ12.

Similarly, the covariance between ε̃i21 and ε̃i31 is

Cov[ε̃i21, ε̃i31] = E[(εi2 − εi1)(εi3 − εi1)]
               = E[εi2 εi3 − εi2 εi1 − εi1 εi3 + ε²i1]
               = σ11 + σ23 − σ12 − σ13.


The covariance matrix of the error differences thus is

Ω̃1 = [ σ11 + σ22 − 2σ12   σ11 + σ23 − σ12 − σ13   σ11 + σ24 − σ12 − σ14 ]
     [        ·             σ11 + σ33 − 2σ13       σ11 + σ34 − σ13 − σ14 ]    (5.7)
     [        ·                    ·                 σ11 + σ44 − 2σ14    ].

A common way to set the scale of utility when errors are not iid is to normalize the variance of one of the error differences to some number. Suppose we normalize the variance of ε̃i21 to 1. The covariance matrix for the error differences becomes

[ 1   (σ11 + σ23 − σ12 − σ13)/m   (σ11 + σ24 − σ12 − σ14)/m ]
[ ·     (σ11 + σ33 − 2σ13)/m      (σ11 + σ34 − σ13 − σ14)/m ]    (5.8)
[ ·             ·                   (σ11 + σ44 − 2σ14)/m    ],
⁴ Any alternative could have been chosen, so why not the first one?

where m = σ11 + σ22 − 2σ12. Utility would thus be divided by √(σ11 + σ22 − 2σ12) to obtain this scaling.
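The algebra behind (5.6)-(5.8) is a single linear transformation, so it can be checked mechanically: with a differencing matrix M that subtracts alternative 1's error from the others, the covariance of the error differences is MΩM′. A sketch (the Ω entries are invented; any symmetric positive semi-definite matrix works):

```python
import numpy as np

# An illustrative 4x4 error covariance matrix (values invented; must be PSD).
Omega = np.array([[1.0, 0.3, 0.2, 0.1],
                  [0.3, 1.5, 0.4, 0.2],
                  [0.2, 0.4, 2.0, 0.5],
                  [0.1, 0.2, 0.5, 1.2]])

# M maps (e1, e2, e3, e4) to the differences (e2-e1, e3-e1, e4-e1).
M = np.array([[-1.0, 1.0, 0.0, 0.0],
              [-1.0, 0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0, 1.0]])

Omega_tilde = M @ Omega @ M.T      # covariance of the error differences, eq. (5.7)

# Normalize the (1,1) element to 1 as in eq. (5.8): dividing utility by sqrt(m)
# divides the whole covariance matrix by m.
m = Omega_tilde[0, 0]
print(Omega_tilde / m)
```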

A model with J alternatives has at most J(J − 1)/2 − 1 covariance parameters after normalization. Interpretation of the model is affected by the normalization. Consider that we estimate the following normalized covariance matrix for the error differences in the case of 4 alternatives:

Ω̃1 = [ 1   ωab  ωac ]
     [ ·   ωbb  ωbc ]    (5.9)
     [ ·    ·   ωcc ].

The parameter ωbb is the variance of the difference between the errors for the first and third alternatives relative to the variance of the difference between the errors for the first and second alternatives. This is not only complicated to interpret by itself, but even more so when you realize that each variance of an error difference reflects the variances of both alternatives' errors as well as their covariance.

5.3 Models for Binary Outcomes

We start our analysis of the different estimation models by considering the case of binary outcomes, i.e. the case when there are only 2 alternatives, so j = 1, 2. Let us define the variable y as a binary (dummy) variable that takes a value of 1 when alternative 1 is chosen, and 0 if alternative 2 is chosen, and let xi be the set of all explanatory variables for individual i, including an alternative-specific constant if necessary. A regression model is formed by parametrizing the probability p to depend on the regressor vector xi and a K × 1 parameter vector β. The commonly used models are of single-index form with conditional probability given by

pi ≡ Prob (yi = 1 | xi) = F(x′i β).    (5.10)

Notice that all we need to estimate is the probability of selecting the first alternative, because the probability of selecting the second alternative is by definition 1 − pi. The question then becomes what is the appropriate functional form for the probability in equation (5.10).

Marginal Effects

Notice that interest does not really lie in the estimates of β, but rather in the marginal effects on the probability of choosing one of the alternatives, in particular the marginal effect of a regressor on the probability that yi = 1. When the regressor xil is continuous, then

∂pi/∂xil = F′(x′i β) βl,    (5.11)

where F′(x′i β) = ∂F(x′i β)/∂(x′i β).

Notice that the marginal effect varies with the point of evaluation, because xi changes across individuals. We seldom want to compute the marginal effect for one particular xi; rather, we want the average marginal effect for the sample we are using in the estimation. There are two general ways of computing the average effect:

Average Marginal Effect (AME): This is the sample average of the marginal effects estimated at each observation, i.e. at each xi, so

AME = n⁻¹ ∑_i F′(x′i β̂) β̂l.

Marginal Effect at the Mean (MEM): This is the marginal effect evaluated at the mean of the regressors, so

MEM = F′(x̄′β̂) β̂l.

For categorical variables these measures are not really valid, because the regressor cannot take small increments, but rather takes only two values. Let xik be a dummy variable, let xi,−k be the vector of regressors other than regressor k, and let β−k be the vector of all parameters except βk. The effect of a change in regressor xik on pi is then

Δpi = F(x′i,−k β−k + βk) − F(x′i,−k β−k).    (5.12)

We can easily extend the AME and MEM measures as

AME = n⁻¹ ∑_i [ F(x′i,−k β̂−k + β̂k) − F(x′i,−k β̂−k) ]

and

MEM = F(x̄′−k β̂−k + β̂k) − F(x̄′−k β̂−k).

I believe a better measure for the AME can be calculated. Let q index the observations for which the categorical variable equals 1, and let r index the observations where it equals zero. We can then calculate the marginal effect as the difference in the average predicted probability between the two groups:

n_q⁻¹ ∑_q F(x′q β̂) − n_r⁻¹ ∑_r F(x′r β̂).

When there are very few zeros or ones in the sample, the sampling error of the mean for the group with few elements will be large, which will definitely affect this measure. The other two measures, however, are sensitive to whether belonging to the category affects the group means of the other regressors or not.
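A minimal sketch of these calculations for a logit F (the data are simulated, and the coefficient vector is treated as if it were the estimate β̂):

```python
import numpy as np

def F(z):                      # logistic CDF; swap in a normal CDF for probit
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(3)
n = 5_000
x1 = rng.normal(size=n)                    # a continuous regressor
d = rng.binomial(1, 0.4, size=n)           # a dummy regressor
X = np.column_stack([np.ones(n), x1, d])
b = np.array([-0.2, 0.7, 0.5])             # pretend these are the MLE estimates

xb = X @ b
fprime = F(xb) * (1 - F(xb))               # logistic density, F'(x'b)

# Continuous regressor x1 (coefficient b[1]):
ame_x1 = np.mean(fprime * b[1])                              # AME
mem_x1 = F(X.mean(0) @ b) * (1 - F(X.mean(0) @ b)) * b[1]    # MEM

# Dummy regressor d (coefficient b[2]): discrete change as in eq. (5.12).
X1, X0 = X.copy(), X.copy()
X1[:, 2], X0[:, 2] = 1, 0
ame_d = np.mean(F(X1 @ b) - F(X0 @ b))

# The group-difference measure proposed in the text:
diff_d = F(X[d == 1] @ b).mean() - F(X[d == 0] @ b).mean()

print(ame_x1, mem_x1, ame_d, diff_d)
```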

Maximum Likelihood Estimation

These nonlinear models are usually estimated by maximum likelihood. We consider each observation to be independent, so the likelihood of the sample will be the product of the individual probabilities. Notice that pi is the probability that yi = 1 for observation i, so it must be that the probability that yi = 0 is 1 − pi. So what probability should we apply to each observation? Well, if yi = 1 it ought to be pi, and if yi = 0 it ought to be 1 − pi. A convenient way of representing this is by letting

f(yi | xi) = pi^{yi} (1 − pi)^{1−yi}
          = F(x′i β)^{yi} [1 − F(x′i β)]^{1−yi}.

Notice that when yi = 1 this yields F(x′i β), and when yi = 0 it yields 1 − F(x′i β), which is what we wanted. The log-likelihood is then

L = ∑_{i=1}^{n} { yi ln F(x′i β) + (1 − yi) ln [1 − F(x′i β)] }.    (5.13)

To get the maximizing first-order conditions, we take the first derivative with respect to β and set each element equal to zero:

∑_{i=1}^{n} { yi F′(x′i β) xi / F(x′i β) − (1 − yi) F′(x′i β) xi / [1 − F(x′i β)] } = 0;
∑_{i=1}^{n} ( { yi [1 − F(x′i β)] − (1 − yi) F(x′i β) } / { F(x′i β) [1 − F(x′i β)] } ) F′(x′i β) xi = 0;
∑_{i=1}^{n} ( { yi − yi F(x′i β) − F(x′i β) + yi F(x′i β) } / { F(x′i β) [1 − F(x′i β)] } ) F′(x′i β) xi = 0;
∑_{i=1}^{n} ( F′(x′i β) / { F(x′i β) [1 − F(x′i β)] } ) [yi − F(x′i β)] xi = 0;
∑_{i=1}^{n} wi [yi − F(x′i β)] xi = 0,    (5.14)

where wi = F′(x′i β) / { F(x′i β) [1 − F(x′i β)] } can be considered a weight on each observation's prediction error. There is no explicit solution for β̂_MLE, but the log-likelihood function of the most popular specifications for F(x′i β) is globally concave, so the Newton-Raphson iterative procedure normally converges very quickly.
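Since the text notes that Newton-Raphson typically converges quickly here, a compact sketch for the logit case on simulated data may help. For logit the weight wi in (5.14) equals 1 (shown in section 5.3.2), so the score reduces to ∑i (yi − pi) xi:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 10_000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([0.5, -1.0, 0.8])           # illustrative truth
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(k)                               # starting values
for it in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    score = X.T @ (y - p)                        # gradient of the log-likelihood
    H = -(X * (p * (1 - p))[:, None]).T @ X      # Hessian: -X'WX, W = diag(p(1-p))
    step = np.linalg.solve(H, score)
    beta = beta - step                           # Newton-Raphson update
    if np.max(np.abs(step)) < 1e-10:
        break

print(beta)          # close to beta_true in a sample this large
```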

5.3.1 Linear Probability Model

The linear probability model uses OLS to estimate

yi = x′i β + εi.    (5.15)

This linear probability model has a number of shortcomings. A minor complication arises because εi is heteroskedastic in a way that depends on the model parameters. Because x′i β + εi must equal either zero or 1, εi equals either −x′i β or 1 − x′i β, with probabilities 1 − F and F, respectively, so Var[εi | xi] = (x′i β)(1 − x′i β). Even though this is a problem, we could correct the heteroskedasticity using an FGLS estimator.

A more serious flaw is that the model doesn't constrain ŷi = x′i β to the 0-1 interval. Such a model produces both nonsense probabilities and negative variances.

For these reasons, the linear probability model is becoming less frequently used except
as a basis for comparison to some other more appropriate models.

5.3.2 The Logit Model

The logit or logistic regression model specifies

pi = F(x′i β) = Λ(x′i β) = e^{x′i β} / (1 + e^{x′i β}).    (5.16)

The marginal effects under this specification are:

∂Λ(x′i β)/∂xi = Λ′(x′i β) β
             = Λ(x′i β) [1 − Λ(x′i β)] β.    (5.17)

The above derivation shows that Λ′(x′i β) = Λ(x′i β)[1 − Λ(x′i β)] (you should check this). Using equation (5.16) in the MLE first-order conditions in equation (5.14), and keeping this in mind, we get that the log-likelihood maximization first-order conditions are:

∑_{i=1}^{n} [yi − Λ(x′i β)] xi = 0,    (5.18)

since wi = 1 for all observations in this case. This condition implies that if xi includes a constant, then ∑_i [yi − Λ(x′i β̂)] = 0, so the residuals sum to zero. This means that the average in-sample predicted probability n⁻¹ ∑_i Λ(x′i β̂) equals the sample frequency ȳ.

This specification provides a unique interpretation of the model parameters β. In statistics it is very common to measure relative risk with the odds ratio, i.e. the probability of one outcome relative to the probability of another outcome, for example the odds of winning a roulette bet relative to those of losing. It is easy to show that for the logit model

pi / (1 − pi) = e^{x′i β},

so taking logs,

ln [ pi / (1 − pi) ] = x′i β.

This means that the parameters of the model are semi-elasticities of the odds ratio: a one-unit increase in xil changes the odds ratio by approximately 100·βl percent.
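A quick numeric check of this interpretation (the coefficient values are invented): e^βl is the exact multiplicative effect on the odds, and for small βl it is approximately 1 + βl.

```python
import numpy as np

beta_l = 0.05                     # illustrative coefficient on x_il
print(np.exp(beta_l))             # exact odds multiplier: ~1.0513
print(1 + beta_l)                 # the ~5% semi-elasticity approximation

# The approximation deteriorates for large coefficients:
print(np.exp(0.8), 1 + 0.8)       # 2.2255 vs 1.8
```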

5.3.3 The Probit Model

The probit model specifies pi to be the standard normal CDF, so

pi = F(x′i β) = Φ(x′i β) = ∫_{−∞}^{x′i β} φ(z) dz,    (5.19)

where Φ(·) represents the standard normal CDF, and φ(·) the standard normal PDF. The marginal effects are straightforward:

∂Φ(x′i β)/∂xi = φ(x′i β) β.    (5.20)

The probit MLE first-order conditions are

∑_{i=1}^{n} wi [yi − Φ(x′i β)] xi = 0.    (5.21)

Notice that in the probit case the weight wi = φ(x′i β) / { Φ(x′i β) [1 − Φ(x′i β)] } is different for each observation, so there is no guarantee that, when a constant is included, the sum of the residuals equals zero: the first-order condition only guarantees that the weighted sum of the residuals equals zero, where the weights are the wi. This means that there is no guarantee that the sample average prediction, n⁻¹ ∑_i Φ(x′i β̂), equals the sample frequency ȳ.

The probit model is not as simple as the logit model (more so for the multinomial
case we will see later) but it is widely used when the starting point is a latent normal
regression. It is also used as a first step in sample selection models.

5.3.4 Choosing a Binary Model

Which model should be used, probit or logit? Empirically, either can be used, and there is often little difference between the predicted probabilities. The difference is greatest in the tails, where probabilities are close to 0 or 1, and it is much smaller if the interest lies only in the marginal effects averaged over the sample rather than in those for each individual.

The natural metric for comparing the two models is the log-likelihood, since the number of parameters in both models is the same. Thus for each model compute

Lβ̂ = ∑_i { yi ln F(x′i β̂) + (1 − yi) ln [1 − F(x′i β̂)] }.

Measures of goodness of fit in nonlinear models are called pseudo-R². McFadden (1974b) proposed the following:

R²_Binary = 1 − Lβ̂ / Lȳ = 1 − ∑_i { yi ln F(x′i β̂) + (1 − yi) ln [1 − F(x′i β̂)] } / ( n [ ȳ ln ȳ + (1 − ȳ) ln(1 − ȳ) ] ).    (5.22)

The numerator of the fraction in equation (5.22) is the log-likelihood of the fitted model, and the denominator is the log-likelihood of the sample evaluated at the mean, which is equivalent to the log-likelihood of a binary model where the only regressor is a constant.

5.3.5 Testing Restrictions

Whether you use probit or logit, you can test restrictions in a similar manner. The significance of individual coefficients can easily be tested with a z-test or a t-test, using the estimated coefficients and standard errors. Testing the joint significance of several regressors can be done using a likelihood ratio test. Let L_U represent the log-likelihood of the unrestricted model, and L_R that of the restricted model. The test statistic is

LR = −2 (L_R − L_U),    (5.23)

which asymptotically follows a χ²(C) distribution, where C is the number of restrictions. This means that a simple test for joint significance of the regressors in the model can be done, since Lȳ is the log-likelihood of the model with only a constant (you don't even need to estimate that model to get that value). Thus the test for joint significance of the regressors is

JS = −2 (Lȳ − Lβ̂),    (5.24)

which asymptotically follows a χ²(K − 1) distribution, because the restrictions are that all the coefficients other than the intercept are zero.
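A sketch of computing the pseudo-R² (5.22) and the joint-significance statistic (5.24) from predicted probabilities (the data are simulated, and for brevity the "fitted" probabilities are built from the true parameters instead of an actual MLE fit):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 5_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
b = np.array([0.2, 0.6, -0.4])             # illustrative parameters
y = rng.binomial(1, 1 / (1 + np.exp(-X @ b)))

def loglik(p, y):
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Pretend p_hat came from an estimated model; here we cheat with the truth.
p_hat = 1 / (1 + np.exp(-X @ b))

L_fit = loglik(p_hat, y)
L_null = loglik(np.full(n, y.mean()), y)   # constant-only model: p_i = y-bar

pseudo_r2 = 1 - L_fit / L_null             # McFadden's pseudo-R^2, eq. (5.22)
JS = -2 * (L_null - L_fit)                 # joint-significance LR, eq. (5.24)
p_value = stats.chi2.sf(JS, df=X.shape[1] - 1)

print(pseudo_r2, JS, p_value)
```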

5.4 Multinomial Models

The previous section considered the case where there are only two choices, so they could be represented by a single binary variable taking the value 1 when a particular choice is made and zero when the other is. We now return to the general case from the beginning of the chapter, where there are J choices. We can create a variable for each choice that takes the value 1 when that choice is made and zero when it is not. Notice that in reality we would only need to create J − 1 such variables, because each observation for the last variable can be calculated by subtracting the sum of the previous J − 1 values for that observation from 1, but to present the model more clearly we keep all J variables. Then, in the case of 4 alternatives (J = 4),

Y = [ y11 y12 y13 y14 ]
    [ y21 y22 y23 y24 ]
    [  ⋮   ⋮   ⋮   ⋮  ]
    [ yn1 yn2 yn3 yn4 ].

In this matrix each row presents an individual's choices, and each column shows when the corresponding alternative is chosen in the sample. In this case we want to model the probability that individual i chooses alternative j, pij = Prob (yij = 1), and once again we are going to do it with a cumulative distribution F(·).

We now have, however, two types of variables: alternative-variant (alternative-specific) and alternative-invariant (individual-specific). To illustrate, consider that we are choosing how to commute to work, and we have four options: car, bus, train, other. An alternative-variant variable would be the cost of the trip. Notice that this variable varies not only by alternative but also by individual. For example, the price of gasoline and cars will vary depending on where the individual lives, so the cost of the trip by car definitely changes across individuals. An example of an alternative-invariant variable would be age, since a person's age doesn't change with the commuting mode he chooses. Alternative-variant variables have a common coefficient for all alternatives, because it is the variation across alternatives that causes the variable to have a different effect on each alternative. So it would be the difference in the cost of the trip between the car and bus alternatives times the slope, for example, that captures how cost affects commuting by car differently than commuting by bus. Alternative-invariant variables have alternative-specific coefficients, because the variable does not change across alternatives, so age will affect commuting by car differently than commuting by bus through different slopes, one for each alternative.

To model the alternative-variant variables, we take the same approach as with Y, so for cost we would create an n × J matrix, C. For alternative-invariant variables, remember from section 5.2 that when we have alternative-specific coefficients we have to normalize one of them to zero. So we normalize the coefficient for the fourth alternative to zero. What do we do with the other three? We can create a different matrix for each coefficient to be estimated, which only has values in the column representing that coefficient's alternative. So for the first alternative (car) we would have the values of age in the first column, and zeros everywhere else. We thus create 3 (J − 1) matrices for age, A1, A2, and A3, where the values of age appear in the column given by the index, and zeros in all other columns. Remember also that an intercept is alternative-invariant, so to have alternative-specific constants you do the same as with alternative-specific coefficients but using a column of ones instead of variable values. We thus create another three n × J matrices for the constants, D1, D2, and D3, where the index identifies where the column of ones is placed, and all other columns are zeros. We can then multiply each matrix by its coefficient to have

D1 α1 + D2 α2 + D3 α3 + A1 γ1 + A2 γ2 + A3 γ3 + Cδ,

which expands to

[ 1 0 0 0 ]        [ 0 1 0 0 ]        [ 0 0 1 0 ]
[ 1 0 0 0 ] α1  +  [ 0 1 0 0 ] α2  +  [ 0 0 1 0 ] α3  +
[ ⋮ ⋮ ⋮ ⋮ ]        [ ⋮ ⋮ ⋮ ⋮ ]        [ ⋮ ⋮ ⋮ ⋮ ]
[ 1 0 0 0 ]        [ 0 1 0 0 ]        [ 0 0 1 0 ]

[ a1 0 0 0 ]        [ 0 a1 0 0 ]        [ 0 0 a1 0 ]
[ a2 0 0 0 ] γ1  +  [ 0 a2 0 0 ] γ2  +  [ 0 0 a2 0 ] γ3  +
[ ⋮  ⋮ ⋮ ⋮ ]        [ ⋮ ⋮  ⋮ ⋮ ]        [ ⋮ ⋮ ⋮  ⋮ ]
[ an 0 0 0 ]        [ 0 an 0 0 ]        [ 0 0 an 0 ]

[ c11 c12 c13 c14 ]
[ c21 c22 c23 c24 ] δ.
[  ⋮   ⋮   ⋮   ⋮  ]
[ cn1 cn2 cn3 cn4 ]
You can see how the above expression places the right values in the right columns, but we cannot yet compress it into an Xβ format. So now we take a page from panel data and, instead of using n × J matrices for each variable, we stack the J values for each individual one on top of another, in a similar manner to how we stacked the time observations for each cross-sectional unit in panel data. This way the variables become

d1   d2   d3   a1   a2   a3   c
 1    0    0   a1    0    0   c11
 0    1    0    0   a1    0   c12
 0    0    1    0    0   a1   c13
 0    0    0    0    0    0   c14
 1    0    0   a2    0    0   c21
 0    1    0    0   a2    0   c22
 0    0    1    0    0   a2   c23
 0    0    0    0    0    0   c24
 ⋮    ⋮    ⋮    ⋮    ⋮    ⋮    ⋮
 1    0    0   an    0    0   cn1
 0    1    0    0   an    0   cn2
 0    0    1    0    0   an   cn3
 0    0    0    0    0    0   cn4

and we can compose X = [d1 d2 d3 a1 a2 a3 c].

Letting β = [α1 α2 α3 γ1 γ2 γ3 δ]′, we can write Xβ, and the predicted probabilities would be p = F(Xβ). Notice that each of the variables we have created is an nJ × 1 vector, so X is an nJ × K matrix (where K counts all coefficients to be estimated, i.e. it includes the alternative-specific constants). This means that p is also an nJ × 1 vector, and that pij is the value in the (i, j) row of p. Notice also that the (i, j) row of X, x′ij, has the values of all the variables for individual i and alternative j, so pij = F(x′ij β).

We also apply the same transformation to Y as to the rest of the variables, converting it from an n × J matrix into an nJ × 1 vector, y, where yij is the value in the (i, j) row of y.
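A sketch of this stacking in numpy (the names d, a, and c mirror the text; the data and dimensions are invented and deliberately tiny):

```python
import numpy as np

rng = np.random.default_rng(6)
n, J = 4, 4                            # tiny example: 4 individuals, 4 alternatives
age = rng.integers(20, 60, size=n)     # individual-specific variable
cost = rng.uniform(1, 10, size=(n, J)) # alternative-specific variable

rows = n * J
X_cols = []

# Alternative-specific constants d1..d3 (alternative 4 normalized to zero):
for j in range(J - 1):
    d = np.zeros(rows)
    d[j::J] = 1.0                      # a 1 in alternative j's row for everyone
    X_cols.append(d)

# Age columns a1..a3, alternative-specific slopes, fourth normalized to zero:
for j in range(J - 1):
    a = np.zeros(rows)
    a[j::J] = age                      # age appears only in alternative j's row
    X_cols.append(a)

# Cost column c: alternative-variant, single common coefficient:
X_cols.append(cost.ravel())            # stacks (c_i1, ..., c_iJ) per individual

X = np.column_stack(X_cols)            # the nJ x K design matrix
print(X.shape)                         # (16, 7): K = 3 + 3 + 1
```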

Maximum Likelihood

These models are usually estimated with maximum likelihood, and this is the estimator that most software packages implement. Cameron and Trivedi (2005, p. 497) mention that when the data present complications such as endogeneity or correlation across observational units, it can be more convenient to use moment-based estimators. Train (2009, ch. 13) considers the case of endogeneity in detail, along with estimation via maximum simulated likelihood and GMM. For the purpose of this course we only consider maximum likelihood estimation. If you ever have to deal with these problems, you can refer to the books just mentioned.

Consider the individual first. He has J options, and the probability that he chooses option j is pij. The dependent variable shows the option he actually chose: yij = 1 for that option only. This means that the probability we want in the likelihood function is the one for the alternative the individual actually chooses, so we have the multinomial density for the individual

f(yi) = pi1^{yi1} × pi2^{yi2} × · · · × piJ^{yiJ} = ∏_{j=1}^{J} pij^{yij}.    (5.25)
j=1

This is the probability of just one observations, but there are n observations. The
likelihood thus is
n Y
J
y
Y
L= pijij , (5.26)
i=1 j=1

so the log-likelihood is

L = ∑_{i=1}^{n} ∑_{j=1}^{J} yij ln pij = ∑_{i=1}^{n} ∑_{j=1}^{J} yij ln F(x′ij β).    (5.27)

The maximization first-order conditions thus are:

∂L/∂β = ∑_{i=1}^{n} ∑_{j=1}^{J} (yij/pij) ∂pij/∂β = ∑_{i=1}^{n} ∑_{j=1}^{J} yij [ F′(x′ij β) / F(x′ij β) ] xij = 0.    (5.28)

These first-order conditions are usually nonlinear in β, so there is no closed-form solution.

Model Evaluation and Selection

Several model evaluation methods are presented in Amemiya (1981) and Maddala (1983). Comparing predicted probabilities with actual outcomes is of limited value, as multinomial logit models estimated with intercepts restrict the average predicted probability for each alternative to equal that alternative's sample frequency. As before, a useful pseudo-R² measure is presented by McFadden (1974a):

R² = 1 − Lβ̂ / Lȳ,    (5.29)

where in this case

Lβ̂ = ∑_{i=1}^{n} ∑_{j=1}^{J} yij ln F(x′ij β̂)    (5.30)

is the log-likelihood of the fitted model, and

Lȳ = n ∑_{j=1}^{J} ȳj ln ȳj    (5.31)

is the log-likelihood of the model estimated using only alternative-specific intercepts, which sets the predicted probability of each alternative equal to its sample frequency ȳj.

Like in the case of binary choice, we can test restrictions using the likelihood ratio
test, where the test-statistic is given by equation (5.23), which means that the joint
significance of the coefficients can be tested using the test-statistic shown in equation
(5.24).

5.5 Multinomial Logit

The logit model we are going to consider encompasses two models that have traditionally been considered separately: the conditional logit and the multinomial logit. The difference between these two models is that the conditional logit uses alternative-varying regressors, and the multinomial logit uses individual-specific (alternative-invariant) regressors. We are going to consider both at the same time, for there is no reason not to, and doing so has sometimes been dubbed a mixed logit model. The problem is that the same name is also used for a logit model that allows for random parameters to capture unobserved heterogeneity, which is why we simply consider a single logit model that allows for both alternative-variant and individual-specific variables.

The logit model sets

pij = e^{x′ij β} / ∑_{m=1}^{J} e^{x′im β},    (5.32)

which implies that the log-likelihood is

L = ∑_{i=1}^{n} ∑_{j=1}^{J} yij ln [ e^{x′ij β} / ∑_{m=1}^{J} e^{x′im β} ].    (5.33)

McFadden (1974a) demonstrated that this log-likelihood is globally concave, which helps in solving for the maximum.

Consider logit's CDF, presented in figure 5.1. When Vij is either very low or very high, a small change in any of the explanatory variables will have a very small effect on the probability that alternative j is chosen. The point at which an increase in Vij has the greatest impact on the probability of alternative j being chosen is when that probability is close to 0.5. A small improvement then tips the balance in people's choices, inducing a (relatively) large change in the probability. This has important implications for decision makers. For example, consider the MBTA deciding in which areas to spend to improve the bus service they provide. Improving the service in areas where it is either too poor or too good will have very little effect, whereas investing in areas where the service is fine, so that some people are already using it but not too many, will have the largest effect.

Figure 5.1: Graph of the logit CDF

Maximum Likelihood

For logit, the first-order conditions in equation (5.28) simplify to⁵

∑_{i=1}^{n} ∑_{j=1}^{J} (yij − pij) xij = 0.    (5.34)

Rearranging, we have

∑_{i=1}^{n} ∑_{j=1}^{J} yij xij = ∑_{i=1}^{n} ∑_{j=1}^{J} pij xij.    (5.35)

The left-hand side of this expression is the average of the regressors over the observed choices, whereas the right-hand side is the average of those same regressors over the predicted probabilities. This means that maximum likelihood estimation of a logit model chooses the coefficients β̂ that make the predicted average of each explanatory variable equal to its sample average. This implies that when we use alternative-specific constants, the share of people in the sample that chose alternative j is the same as the predicted share choosing alternative j, and hence that the expected residual for alternative j is zero. This is

⁵ You can find the derivation of these conditions in Train (2009, p. 63).

something we already observed for binary logit, as it should be since binary logit is a
particular case of multinomial logit.

Marginal Effects and Elasticities

We can compute the effect that a change in the value of a variable for a specific alternative has on the probability of that alternative, as well as on the probabilities of the other alternatives. For the own alternative, the marginal effect of a change in xij,l on the probability of choosing alternative j is:

∂pij/∂xij,l = [ e^{x′ij β} βl ∑_{m=1}^{J} e^{x′im β} − e^{x′ij β} e^{x′ij β} βl ] / [ ∑_{m=1}^{J} e^{x′im β} ]²
           = [ e^{x′ij β} / ∑_{m=1}^{J} e^{x′im β} ] [ ( ∑_{m=1}^{J} e^{x′im β} − e^{x′ij β} ) / ∑_{m=1}^{J} e^{x′im β} ] βl
           = pij (1 − pij) βl.    (5.36)
And now the effect of a change in xij,l on the probability of choosing alternative k ≠ j:

∂pik/∂xij,l = [ 0 − e^{x′ij β} e^{x′ik β} βl ] / [ ∑_{m=1}^{J} e^{x′im β} ]²
           = −pij pik βl.    (5.37)
Notice that the sum across alternatives of the marginal effects must equal zero. If a variable increases the probability of one alternative, it must decrease the probabilities of the rest, and vice versa, because the probabilities sum to one across alternatives, so the total change in probability is zero. This is easily checked:

∑_{k=1}^{J} ∂pik/∂xij,l = pij (1 − pij) βl + ∑_{k≠j} (−pij pik βl)
                       = pij βl ( 1 − pij − ∑_{k≠j} pik )
                       = pij βl (1 − 1) = 0.

We can also derive the elasticities with respect to a change in a variable. We start again with the case of a variable that enters the own alternative:

ηij,l = (∂pij/∂xij,l) (xij,l/pij)
      = pij (1 − pij) βl (xij,l/pij)
      = (1 − pij) xij,l βl.    (5.38)

And now the cross-elasticity with respect to a variable that enters a different alternative:

ηijk,l = (∂pij/∂xik,l) (xik,l/pij)
       = −pik pij βl (xik,l/pij)
       = −pik xik,l βl.    (5.39)

This cross-elasticity is the same for all j ≠ k, since j appears nowhere on the right-hand side of the equation. This means that a percentage change in an attribute of alternative k changes the probabilities of all other alternatives by the same percentage. This is because logit satisfies the property of independence of irrelevant alternatives (IIA), which we discuss in the next section.
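A sketch that evaluates the logit probabilities and checks (5.36) and (5.37) numerically for one individual (the utilities and the coefficient βl are invented):

```python
import numpy as np

beta_l = 0.5                           # illustrative coefficient on attribute l
V = np.array([0.8, 0.2, -0.3, 0.5])    # representative utilities, one individual

p = np.exp(V) / np.exp(V).sum()        # logit probabilities, eq. (5.32)

j = 0                                  # change an attribute of alternative 1
effects = -p[j] * p * beta_l           # cross effects, eq. (5.37)
effects[j] = p[j] * (1 - p[j]) * beta_l   # own effect, eq. (5.36)

print(effects, effects.sum())          # effects sum to zero across alternatives

# Numerical check: bump V_j by the utility change from a tiny attribute change.
h = 1e-6
V2 = V.copy(); V2[j] += beta_l * h
p2 = np.exp(V2) / np.exp(V2).sum()
print((p2 - p) / h)                    # matches the analytic effects
```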

5.5.1 Logit’s Characteristics

This part relies heavily on Train (2009, pp. 42–52). As with any estimation method, it is always good to know exactly what its characteristics are, so that we know how and when to apply it. We can summarize the applicability of logit as follows:

1. Logit can represent systematic taste variation (that is taste variation related to
observed characteristics) but not random taste variation (differences in tastes that
cannot be linked to observed characteristics).

2. The logit model implies proportional substitution across alternatives, given the
researcher’s specification of representative utility.

3. Logit is only valid when unobserved factors are independent over time when dealing with repeated choice situations. Logit cannot handle situations where unobserved factors are correlated across time for the decision maker.

Taste Variation

The importance of a characteristic that affects choice normally varies across individuals. For example, car size is more important to households with many family members than to those with fewer members, low-income households are probably more concerned about the purchase price of a good than higher-income households, and so forth. When taste variations can be related to observed characteristics, like the ones just mentioned, logit can accommodate them by incorporating interaction variables, i.e. products or ratios of two variables, quadratic terms, etc. When these taste variations are related to unobserved characteristics, like someone's strong distaste for the train when choosing a mode of transportation, logit cannot model them.

To illustrate, consider that when deciding which car to buy, the only two attributes that matter are the purchase price, Pj, and shoulder room, Sj. The value that each household places on these characteristics varies across households, so

Uij = Sj βs,i + Pj βp,i + εij,    (5.40)

where βs,i and βp,i are specific to household i.

Consider first that the value a household places on the car's shoulder room varies only with the number of family members in the household, Mi, and nothing else, so that βs,i = ρMi, and that the importance of the purchase price is inversely related to income, Ii, so that βp,i = θ/Ii. Substituting these two expressions into equation (5.40) produces

Uij = (Sj Mi) ρ + (Pj/Ii) θ + εij,

which means that, under the assumption that each εij is iid extreme value, a logit model can be used.

The limitation of logit arises when we attempt to allow tastes to vary with respect to unobserved variables or purely randomly. Suppose, for example, that the value of shoulder room varied with the household's size, Mi, which is observable, plus the frequency with which the household travels together, which is unobservable. Assuming that the traveling-frequency effect is purely random across households, we can model it as a random variable, µi, so that βs,i = ρMi + µi. Similarly, assume that the importance of the purchase price is inversely related to income, Ii, as before, but also to how frugal the household is, which we can't observe. If we assume the level of family frugality to be random across households, we can model its effect again as a purely random variable, so that βp,i = θ/Ii + ηi. Substituting these expressions into equation (5.40) produces

Uij = (Sj Mi) ρ + Sj µi + (Pj/Ii) θ + Pj ηi + εij
    = (Sj Mi) ρ + (Pj/Ii) θ + ε̃ij,

where ε̃ij = Sj µi + Pj ηi + εij. It is clear that we cannot assume the ε̃ij to be iid. Notice that µi and ηi enter every alternative, so the errors are necessarily correlated across alternatives: Cov(ε̃ij, ε̃ik) = σ²µ Sj Sk + σ²η Pj Pk ≠ 0 for any two cars j and k. Furthermore, the variance of ε̃ij varies across alternatives, because Pj and Sj vary across alternatives: σ²ε̃ij = σ²µ S²j + σ²η P²j + σ²ε, which differs across j.

Proportional Substitution

As we have already discussed, when the probability of choosing one alternative increases, the probabilities of choosing the other alternatives decrease, since the sum of all changes must equal zero. The pattern of these substitutions among alternatives has important implications in many situations. Logit implies that the probabilities of all other alternatives change proportionally, which is why the cross-elasticities of the probabilities are all equal. If this reflects the actual pattern of substitution, which is true in many cases, then the logit model is appropriate. When it doesn't, the logit model is not valid. This particular characteristic of the logit model arises because it satisfies IIA.

To understand the IIA property in the logit model, consider the ratio of the probabilities of any two alternatives j and k:

pij/pik = [ e^{Vij} / ∑_m e^{Vim} ] / [ e^{Vik} / ∑_m e^{Vim} ] = e^{Vij} / e^{Vik} = e^{Vij − Vik}.

Notice that this ratio depends only on the observed attributes of the two alternatives being compared. This means that the ratio will not change when the characteristics of any other alternative, say alternative m different from both j and k, change, so the ratio is independent of irrelevant alternatives. This doesn't mean that a change in alternative m will not change the probabilities of choosing alternatives j and k, but rather that it will change them both in the same proportion, keeping the ratio of probabilities constant. This is the same as saying that the cross-elasticity of the probability of choosing alternative j with respect to a percentage change in Vim has to be the same as the cross-elasticity of the probability of choosing alternative k with respect to that same change, which is what we observed in equation (5.39).
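A quick numeric illustration of the IIA property (the utilities are invented): adding a new alternative changes every probability, but leaves the ratio of any two existing probabilities untouched.

```python
import numpy as np

def logit_probs(V):
    e = np.exp(V)
    return e / e.sum()

V = np.array([1.0, 0.4])               # say, car and bus
p = logit_probs(V)
print(p[0] / p[1])                     # ratio p_car / p_bus

V_new = np.append(V, 0.6)              # introduce a train alternative
p_new = logit_probs(V_new)
print(p_new[0] / p_new[1])             # identical ratio: IIA in action
```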

To illustrate this with an example, consider that a city currently only has bus as a
method of public transportation, so the commuter has only two options to get to work,
car and bus. The city now develops an underground train system that completely avoids
traffic, and is more convenient for the commuter than the bus. It doesn’t however reach
as many points in the city as the bus does. If this is the case, you would expect the
probability of choosing bus to decrease by more than the probability of choosing a car
as the commuting mode, because the underground system is another method of public
transportation and thus a closer substitute in people’s preferences to the bus than to
the car. Thus you would expect pic/pib, where c represents car and b represents bus, to increase when the underground train is introduced. Logit, however, would predict no change in this ratio, because it predicts that pic and pib decrease by the same proportion. When this is the case in real life, we need another method that provides more flexibility in the substitution patterns across alternatives.

Repeated Choices

In many settings the researcher can observe numerous choices made by the decision maker over time, i.e. repeated choices. This means that the researcher has panel data, and now models the probability that individual i chooses alternative j at time t:

Pijt = e^{Vijt} / ∑_m e^{Vimt}.    (5.41)

Logit is valid if we can assume that the errors are iid across alternatives j, individuals i, and time periods t. But sometimes current choices depend on choices made in the past, or how much we value a certain characteristic of an alternative changes through time.

To illustrate, consider that we are choosing soft drinks, and we have two options: Sprite and Pepsi. On average, people are very consistent in choosing one of the two options over time, so to capture that effect and still use logit we could include the previous period's choice in the representative utility, Vijt = αyij(t−1) + x′ijt β, where yij(t−1) = 1 if that alternative was chosen in the last period, and zero if it wasn't. A positive value of α would indicate that the consumer gets more utility from consistently choosing the same option, whereas a negative value would indicate that he likes to vary his consumption from period to period. Including the lagged variable does not make logit's estimates inconsistent, because we have assumed that the errors are independent across time and thus Cov(yij(t−1), εijt) = 0.

Logit's limitation in modeling the effects of previous choices arises when the effect of those choices depends on some unobserved characteristic of the individual, and thus varies across individuals randomly. It is not clear that α should be the same for all individuals, because some people don't like change and some do, and this we unfortunately cannot observe. This problem we have already discussed under taste variation. However, the time dimension also allows these effects to vary across time. It is not clear that people consistently get the same additional utility from the same choice. For example, you may have a strong preference for Pepsi, but your appreciation of variety may increase over time, not just because you are getting tired of always consuming the same thing (we can model that), but because your tastes change over time. This change of tastes across alternatives over time is unobservable, and presents a problem similar to that of tastes changing across individuals. If this unobservable part of the change in tastes is random across individuals, we cannot use logit, because the errors would be correlated across time and the variance of the errors would change over time.

5.6 Nested Logit

The nested logit model is appropriate when the set of alternatives faced by a decision
maker can be partitioned into subsets, called nests, in such a way that the following
properties hold:

1. For any two alternatives in the same nest, IIA holds with respect to all other
alternatives within that nest, i.e. IIA holds within each nest.
2. For any two alternatives in different nests the ratio of probabilities can depend on
the attributes of other alternatives in the two nests, i.e. IIA does not hold in
general across nests.

To illustrate, consider once again the choice of mode of transportation to work, where the
options now are: going by car alone, carpooling, public bus, and public train. Suppose
that the probabilities of choosing each alternative, both when all alternatives are available
and when each alternative is removed one by one, are given by the following table.⁶

Table 5.1: Nested IIA Example Probabilities

                          Probability with Alternative Removed
Alternative   Original   Car Alone       Carpool         Bus            Train
Car Alone       0.40     —               0.45 (+12.5%)   0.52 (+30%)    0.48 (+20%)
Carpool         0.10     0.20 (+100%)    —               0.13 (+30%)    0.12 (+20%)
Bus             0.30     0.48 (+60%)     0.33 (+10%)     —              0.40 (+33%)
Train           0.20     0.32 (+60%)     0.22 (+10%)     0.35 (+70%)    —

When either car alone or carpool is not an available option (the car may be in the shop
for repairs), the probabilities for bus and train increase by the same proportion, but the
probability of the other car option increases by a larger proportion. Likewise, when
either of the public transportation alternatives is unavailable, the probabilities
of choosing car alone or carpool increase by the same proportion, whereas the probability
of choosing the other public transit alternative increases by a larger proportion. This
means that we can distribute these alternatives into two separate nests: one that groups
the car options, and another that groups the public transportation options. That way
IIA will hold within each nest (the probabilities of two alternatives within the same nest
change in the same proportion) but not across nests (the probability of choosing an
alternative in one nest changes by a different proportion than the probability of choosing
an alternative in a different nest).

Graphically you could consider the nests as parts of a decision tree, as presented in
figure 5.2. You could see the choice between the four alternatives described here as a
two-step decision process where you would first choose whether to go by car or by public
transportation, and then choose an option within the selected nest, i.e. between car alone
and carpool if you had chosen car in the first step, or between bus and train if you had
chosen public transportation in the first step.

⁶ Table 5.1 reproduces table 4.1 in Train (2009, p. 78).

[Figure 5.2: Tree Diagram for Transport Mode Choice]

To present the functional form of p_ij for the nested logit, let the set of alternatives
j = 1, . . . , J be partitioned into K non-overlapping nests, denoted B_k, for k = 1, . . . , K.
Notice that we are still interested in knowing p_ij, but now alternative j belongs to a nest,
B_k. Keeping that in mind, the nested logit probability is

p_{ij} = \frac{e^{x_{ij}'\beta/\lambda_k} \left( \sum_{m \in B_k} e^{x_{im}'\beta/\lambda_k} \right)^{\lambda_k - 1}}{\sum_{\ell=1}^{K} \left( \sum_{m \in B_\ell} e^{x_{im}'\beta/\lambda_\ell} \right)^{\lambda_\ell}}.  (5.42)

The new parameter here is λ_k, which is specific to each nest, and thus equal for all the
alternatives within the nest, but varying across nests. This parameter is a measure of
the degree of independence of the errors (unobserved utility) among the alternatives in
nest k. Cameron and Trivedi (2005, p. 509) mention that it can be shown that
λ_k = \sqrt{1 − Cor(ε_{ij}, ε_{im})} for alternatives j, m ∈ B_k.⁷ The higher the level of independence of
the errors within a nest, the lower the correlation among the errors in that nest, and
the closer λ_k will be to 1. If all the errors were independent among alternatives in
every nest, we would have λ_k = 1 for all k. Notice that in that case equation (5.42)
becomes equation (5.32), the specification of the standard logit, which assumes
independence across all alternatives. So the nested logit adds flexibility to the
substitution patterns across alternatives by assuming a specific type of correlation
among the errors: errors within a nest are correlated with each other, but errors in
different nests are uncorrelated.

⁷ They use ρ instead of λ and have a different way to refer to the errors. Notice also that we are still
assuming that errors are independent across nests.
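
As a quick check of that claim, set λ_k = 1 for every nest in equation (5.42): the within-nest sum is raised to the power zero in the numerator and to the power one in the denominator, so

p_{ij} = \frac{e^{x_{ij}'\beta} \left( \sum_{m \in B_k} e^{x_{im}'\beta} \right)^{0}}{\sum_{\ell=1}^{K} \sum_{m \in B_\ell} e^{x_{im}'\beta}} = \frac{e^{x_{ij}'\beta}}{\sum_{m=1}^{J} e^{x_{im}'\beta}},

which is exactly the standard logit probability.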

Clearly the value of λ_k can vary across nests, reflecting the different level of independence
of the errors within each nest. Hypothesis tests can be carried out on the different
λ_k's. Notice that a test of joint significance of the λ_k's is not of real interest, because
that would be testing whether there is perfect correlation of the errors within each
nest, for all nests. A more interesting test is whether all λ_k's are equal to each other,
and in particular equal to 1. To test whether they are all equal to each other,
you would estimate a model with a single λ, thus constraining all of them to be equal, and do a
likelihood ratio test against the unconstrained nested logit model; the test statistic
would have K − 1 degrees of freedom, where K is the number of nests. To
test whether all λ_k's are equal to 1, you can estimate the nested logit without constraining
any λ_k and test it against the standard logit estimation, since we have already mentioned
that a nested logit with all λ_k = 1 collapses into the logit model. The likelihood ratio
test statistic would follow a χ² distribution with K degrees of freedom, where K is once
again the number of nests in this case.
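
A minimal sketch of both likelihood ratio tests; the log-likelihood values below are hypothetical placeholders standing in for output from your estimation software:

```python
from scipy.stats import chi2

K = 2                       # number of nests (assumed for illustration)
ll_nested = -1832.4         # unconstrained nested logit (hypothetical)
ll_single_lambda = -1835.1  # one common lambda across nests (hypothetical)
ll_logit = -1841.7          # standard logit, i.e. all lambdas = 1 (hypothetical)

# H0: all lambda_k equal to each other -> K - 1 restrictions
lr_equal = 2 * (ll_nested - ll_single_lambda)
print("equal lambdas: LR =", lr_equal, "p =", chi2.sf(lr_equal, df=K - 1))

# H0: all lambda_k = 1 (standard logit) -> K restrictions
lr_one = 2 * (ll_nested - ll_logit)
print("lambdas = 1:   LR =", lr_one, "p =", chi2.sf(lr_one, df=K))
```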

Even though it would make sense that λ_k ∈ [0, 1], since according to Cameron and
Trivedi (2005) λ_k = \sqrt{1 − Cor(ε_{ij}, ε_{im})} for alternatives j, m ∈ B_k and Cor(ε_{ij}, ε_{im}) ∈
[0, 1], Train (2009, p. 81) mentions that this may or may not be the case. He mentions
that when λ_k ∈ [0, 1] ∀k, the model is consistent with utility maximization for all possible
values of the explanatory variables. For λ_k > 1 the model is consistent with utility-
maximizing behavior for some range of explanatory variables, but not for all the values.
Finally, a negative value of λ_k is inconsistent with utility maximization and implies that
improving an alternative (increasing its value to the decision maker) can decrease the
probability that the alternative would be chosen.
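
A minimal sketch of equation (5.42), using the two transport nests of the example; the function name, utilities, and λ values are illustrative assumptions:

```python
import numpy as np

def nested_logit_probs(v, nests, lam):
    """Choice probabilities from equation (5.42).

    v     : array of representative utilities x'beta, one per alternative
    nests : list of index arrays, one per nest (a partition of alternatives)
    lam   : array of lambda_k, one per nest
    """
    v = np.asarray(v, dtype=float)
    # Per-nest sums S_k = sum_{m in B_k} exp(v_m / lambda_k)
    S = np.array([np.exp(v[idx] / lk).sum() for idx, lk in zip(nests, lam)])
    denom = (S ** lam).sum()
    p = np.empty_like(v)
    for idx, lk, Sk in zip(nests, lam, S):
        p[idx] = np.exp(v[idx] / lk) * Sk ** (lk - 1) / denom
    return p

# Hypothetical utilities for car alone, carpool, bus, train
v = [1.0, 0.2, 0.8, 0.5]
nests = [np.array([0, 1]), np.array([2, 3])]  # {car alone, carpool}, {bus, train}
p = nested_logit_probs(v, nests, lam=np.array([0.6, 0.6]))
print(p, p.sum())  # the probabilities sum to 1
```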

5.6.1 IIA

We can use equation (5.42) to show that IIA holds within each nest, but not across nests.
Consider alternatives j ∈ B_k and m ∈ B_ℓ. The ratio of probabilities would be

\frac{p_{ij}}{p_{im}} = \frac{e^{x_{ij}'\beta/\lambda_k} \left( \sum_{s \in B_k} e^{x_{is}'\beta/\lambda_k} \right)^{\lambda_k - 1}}{e^{x_{im}'\beta/\lambda_\ell} \left( \sum_{s \in B_\ell} e^{x_{is}'\beta/\lambda_\ell} \right)^{\lambda_\ell - 1}}.  (5.43)

This ratio depends on the values of all the alternatives in the two nests involved. So even
though we no longer have IIA when comparing two alternatives across nests, the ratio
of probabilities is still independent of all alternatives that belong to neither of the two
nests being compared.

When alternatives j and m both belong to the same nest, i.e. B_k = B_ℓ, then λ_k = λ_ℓ
and the two nest sums in equation (5.43) are identical, so they cancel and the ratio of
probabilities simplifies to

\frac{p_{ij}}{p_{im}} = \frac{e^{x_{ij}'\beta/\lambda_k}}{e^{x_{im}'\beta/\lambda_k}},  (5.44)
which clearly does not depend on any alternative other than the two being compared.
So IIA holds within each nest, but not across nests.
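
A quick numerical check of both claims, rebuilding the probability function from equation (5.42) (see the earlier sketch) and improving one alternative; all numbers are hypothetical:

```python
import numpy as np

def nl_probs(v, nests, lam):
    # Nested logit probabilities, equation (5.42).
    v = np.asarray(v, dtype=float)
    S = np.array([np.exp(v[idx] / lk).sum() for idx, lk in zip(nests, lam)])
    denom = (S ** lam).sum()
    p = np.empty_like(v)
    for idx, lk, Sk in zip(nests, lam, S):
        p[idx] = np.exp(v[idx] / lk) * Sk ** (lk - 1) / denom
    return p

nests = [np.array([0, 1]), np.array([2, 3])]   # {car alone, carpool}, {bus, train}
lam = np.array([0.5, 0.7])
v = np.array([1.0, 0.2, 0.8, 0.5])             # hypothetical utilities

p0 = nl_probs(v, nests, lam)
v2 = v.copy()
v2[3] += 0.5                                   # improve the train
p1 = nl_probs(v2, nests, lam)

print(p0[0] / p0[1], p1[0] / p1[1])  # within-nest ratio: unchanged (IIA holds)
print(p0[0] / p0[2], p1[0] / p1[2])  # across-nest ratio: changes (IIA fails)
```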

5.6.2 Decomposition into Two Logits

Even though equation (5.42) provides us with the nested logit probability, it is not very
intuitive at first glance. Consider the decision tree in figure 5.2 once more, and what
the probability of selecting car alone should be. To pick car alone I would first have to
pick car and then, given that I had already chosen car, pick alone over pooling from the
remaining two choices. So it would be far more intuitive if we could express the
probability this way. As you well know, any joint probability can be expressed as the
product of a conditional probability and a marginal probability, so here we are going to
do just that.

To that end, consider that we break x_ij into two parts: w_ik, which contains the
variables that describe nest k, and thus do not differ across the alternatives that belong
to nest k but do vary across nests; and z_ij, which contains the variables that describe
alternative j. These variables vary across all alternatives, even within nests. For example,
the price per gallon of gas would be common to car alone and carpooling, so it would
be included at the nest level. This allows us to rewrite equation (5.42) as
\begin{align*}
p_{ij} &= \frac{e^{(w_{ik}'\delta + z_{ij}'\gamma)/\lambda_k} \left( \sum_{m \in B_k} e^{(w_{ik}'\delta + z_{im}'\gamma)/\lambda_k} \right)^{\lambda_k - 1}}{\sum_{\ell=1}^{K} \left( \sum_{m \in B_\ell} e^{(w_{i\ell}'\delta + z_{im}'\gamma)/\lambda_\ell} \right)^{\lambda_\ell}} \\
&= \frac{e^{(w_{ik}'\delta + z_{ij}'\gamma)/\lambda_k}}{\sum_{m \in B_k} e^{(w_{ik}'\delta + z_{im}'\gamma)/\lambda_k}} \cdot \frac{\left( \sum_{m \in B_k} e^{(w_{ik}'\delta + z_{im}'\gamma)/\lambda_k} \right)^{\lambda_k}}{\sum_{\ell=1}^{K} \left( \sum_{m \in B_\ell} e^{(w_{i\ell}'\delta + z_{im}'\gamma)/\lambda_\ell} \right)^{\lambda_\ell}} \\
&= \frac{e^{w_{ik}'\delta/\lambda_k} e^{z_{ij}'\gamma/\lambda_k}}{e^{w_{ik}'\delta/\lambda_k} \sum_{m \in B_k} e^{z_{im}'\gamma/\lambda_k}} \cdot \frac{e^{w_{ik}'\delta} \left( \sum_{m \in B_k} e^{z_{im}'\gamma/\lambda_k} \right)^{\lambda_k}}{\sum_{\ell=1}^{K} e^{w_{i\ell}'\delta} \left( \sum_{m \in B_\ell} e^{z_{im}'\gamma/\lambda_\ell} \right)^{\lambda_\ell}} \\
&= \frac{e^{z_{ij}'\gamma/\lambda_k}}{\sum_{m \in B_k} e^{z_{im}'\gamma/\lambda_k}} \cdot \frac{e^{w_{ik}'\delta + \lambda_k I_{ik}}}{\sum_{\ell=1}^{K} e^{w_{i\ell}'\delta + \lambda_\ell I_{i\ell}}},
\end{align*}

where I_{ik} = \ln \sum_{m \in B_k} e^{z_{im}'\gamma/\lambda_k}. This expression is the product of two logit probabilities.

The left fraction is the probability of individual i selecting alternative j out of the
alternatives in nest Bk , so that is the conditional probability. The fraction to the right
is the probability of individual i picking nest Bk out of all K nests. Therefore, letting

p_{ij|B_k} = \frac{e^{z_{ij}'\gamma/\lambda_k}}{\sum_{m \in B_k} e^{z_{im}'\gamma/\lambda_k}},  (5.45)

and

p_{iB_k} = \frac{e^{w_{ik}'\delta + \lambda_k I_{ik}}}{\sum_{\ell=1}^{K} e^{w_{i\ell}'\delta + \lambda_\ell I_{i\ell}}},  (5.46)

the nested logit probability is

p_{ij} = p_{ij|B_k} \, p_{iB_k}.  (5.47)



Normally the choice of the nest (the marginal probability) is called the upper model, and
the choice of the alternative within the nest (the conditional probability) the lower model.
The quantity I_ik links the upper and lower models by bringing information from the
lower model to the upper one. Notice that I_ik is the natural log of the denominator of
the lower model, and λ_k I_ik is interpreted as the expected value of the alternatives in the
nest, which is why I_ik is often called the inclusive value of the nest. Notice that when
selecting the nest it makes sense to incorporate the expected value of the alternatives
in that nest plus the specific value of the nest itself, which is what we observe in
the marginal probability of the nest, p_{iB_k}. The conditional probability of the chosen
alternative simply considers the values of the alternatives in the nest: the chosen one in
the numerator and all of them in the denominator.
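
A small sketch of the decomposition, building equations (5.45) and (5.46) from hypothetical nest-level and alternative-level attributes and multiplying them as in (5.47); all coefficients and attribute values are made up for illustration:

```python
import numpy as np

# Hypothetical nest-level attribute w and alternative-level attribute z,
# with coefficients delta and gamma; two nests of two alternatives each.
delta, gamma = 0.8, 1.2
w = np.array([0.3, -0.1])                  # one attribute value per nest
z = np.array([0.5, -0.2, 0.4, 0.1])        # one attribute value per alternative
nests = [np.array([0, 1]), np.array([2, 3])]
lam = np.array([0.6, 0.9])

# Lower model: conditional probabilities within each nest, equation (5.45),
# and inclusive values I_ik, the log of the lower model's denominator.
p_cond = np.empty(4)
I = np.empty(2)
for k, idx in enumerate(nests):
    expz = np.exp(z[idx] * gamma / lam[k])
    p_cond[idx] = expz / expz.sum()
    I[k] = np.log(expz.sum())

# Upper model: marginal probability of each nest, equation (5.46).
num = np.exp(w * delta + lam * I)
p_nest = num / num.sum()

# Equation (5.47): the product of conditional and marginal probabilities.
p = np.array([p_cond[j] * p_nest[k] for k, idx in enumerate(nests) for j in idx])
print(p, p.sum())  # a proper probability distribution over the 4 alternatives
```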

5.6.3 Estimation

Using the decomposed probability in equation (5.47), we can express the nested
logit density function for an individual as

f(y_i) = \prod_{k=1}^{K} \prod_{j \in B_k} \left( p_{ij|B_k} \, p_{iB_k} \right)^{y_{ij}} = \prod_{k=1}^{K} \left( p_{iB_k}^{\,y_{iB_k}} \prod_{j \in B_k} p_{ij|B_k}^{\,y_{ij}} \right),  (5.48)

where y_{iB_k} = \sum_{j \in B_k} y_{ij} will equal 1 when an alternative in B_k is chosen, and zero
otherwise. The likelihood is thus

L = \prod_{i=1}^{n} \prod_{k=1}^{K} \left( p_{iB_k}^{\,y_{iB_k}} \prod_{j \in B_k} p_{ij|B_k}^{\,y_{ij}} \right),  (5.49)

and maximum likelihood maximizes the log-likelihood

\mathcal{L} = \sum_{i=1}^{n} \sum_{k=1}^{K} \left( y_{iB_k} \ln p_{iB_k} + \sum_{j \in B_k} y_{ij} \ln p_{ij|B_k} \right) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{iB_k} \ln p_{iB_k} + \sum_{i=1}^{n} \sum_{k=1}^{K} \sum_{j \in B_k} y_{ij} \ln p_{ij|B_k},  (5.50)

with respect to δ, γ, and the λ_k.

Both Cameron and Trivedi (2005) and Train (2009) mention that there is also a less
efficient sequential estimator that consists of two steps:

1. The first stage estimates the lower model, which basically maximizes the second
term in equation (5.50); this yields estimates of γ/λ_k and hence the Î_ik, i.e. estimates
of the different I_ik.

2. The second stage estimates the upper model, adding the different Î_ik as added
regressors.

Looking at equations (5.45) and (5.46), we see that the first stage provides an estimate
of γ/λ_k, and that the second stage provides δ̂ and λ̂_k. We then multiply the estimate of
γ/λ_k from the first stage by λ̂_k from the second stage to get γ̂.
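
A stylized sketch of the two-step arithmetic; every number below is a hypothetical stand-in for output of the two estimation stages:

```python
import numpy as np

# Stage 1 (lower model): a logit within each nest estimates gamma/lambda_k.
# With raw alternative attributes z, the inclusive value of each nest is
# I_ik = ln sum_{m in B_k} exp(z_im * (gamma/lambda_k)).
gamma_over_lambda = np.array([2.0, 4.0 / 3.0])     # hypothetical, one per nest
z = [np.array([0.5, -0.2]), np.array([0.4, 0.1])]  # attributes by nest
I_hat = np.array([np.log(np.exp(g * zk).sum())
                  for g, zk in zip(gamma_over_lambda, z)])

# Stage 2 (upper model): a logit across nests with I_hat as an extra
# regressor estimates delta and lambda_k (the coefficients on I_hat).
lambda_hat = np.array([0.6, 0.9])                  # hypothetical estimates

# Recover gamma: multiply the stage-1 ratio by the stage-2 lambda_k.
# Each nest should imply (roughly) the same gamma.
print(gamma_over_lambda * lambda_hat)  # -> [1.2, 1.2]
```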

5.7 Ordered and Ranked Outcomes

We now consider models where there is more structure among the alternatives than
in the previous models, like those where there is a natural ordering of alternatives, or
sequencing of decisions.

5.7.1 Ordered Multinomial Models

These models apply when there is a natural ordering of the alternatives, for example
the results of a poll that rates a service from 1 (poor) to 5 (excellent), or the income
category an individual belongs to.

The starting point is an index model, with a single latent variable

y_i^* = x_i'\beta + \varepsilon_i,  (5.51)

where xi does not include a constant term. As yi∗ crosses a series of increasing unknown
thresholds, we move up the ordering of alternatives. For example yi∗ is the value the
person really gives to a service. When it is very low, say less than a threshold α1 , we
observe a response of 1 (poor). If the actual valuation is above that first threshold but
less than a second threshold, α2 , then we observe a response of 2 (mediocre). And so
forth.

In general for a J-alternative ordered model, we observe

y_i = j \quad \text{if} \quad \alpha_{j-1} < y_i^* \leq \alpha_j,  (5.52)

where α0 = −∞ and αJ = +∞. Then

\begin{align*}
p_{ij} = \text{Prob}[y_i = j] &= \text{Prob}[\alpha_{j-1} < y_i^* \leq \alpha_j] \\
&= \text{Prob}[\alpha_{j-1} < x_i'\beta + \varepsilon_i \leq \alpha_j] \\
&= \text{Prob}[\alpha_{j-1} - x_i'\beta < \varepsilon_i \leq \alpha_j - x_i'\beta] \\
&= \text{Prob}[\varepsilon_i \leq \alpha_j - x_i'\beta] - \text{Prob}[\varepsilon_i \leq \alpha_{j-1} - x_i'\beta] \\
&= F(\alpha_j - x_i'\beta) - F(\alpha_{j-1} - x_i'\beta),  (5.53)
\end{align*}

where F(·) is the CDF of ε_i. Notice that we now have to estimate the equation
parameters β and the J − 1 thresholds α_1, . . . , α_{J−1}. We estimate them via maximum
likelihood, where the log-likelihood is given by

\mathcal{L} = \sum_{i=1}^{n} \sum_{j=1}^{J} y_{ij} \ln \left[ F(\alpha_j - x_i'\beta) - F(\alpha_{j-1} - x_i'\beta) \right].  (5.54)

The sign of the parameter estimates β̂ can be immediately interpreted as
determining whether the latent variable increases or not with that regressor, but not by
how much. We can calculate the marginal effects on the probabilities with

\frac{\partial p_{ij}}{\partial x_i} = \left[ F'(\alpha_{j-1} - x_i'\beta) - F'(\alpha_j - x_i'\beta) \right] \beta.  (5.55)

The only thing left to be able to estimate such a model is to specify F(·). The ordered
logit model assumes that ε follows the logistic distribution, so

F(\alpha_j - x_i'\beta) = \Lambda(\alpha_j - x_i'\beta) = \frac{e^{\alpha_j - x_i'\beta}}{1 + e^{\alpha_j - x_i'\beta}}.  (5.56)

The ordered probit model assumes that ε is standard normally distributed, so F(·)
is Φ(·), the standard normal CDF.
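
To make the mechanics concrete, here is a minimal sketch of the ordered logit probabilities in equations (5.53) and (5.56); the coefficients and thresholds are hypothetical:

```python
import numpy as np

def ordered_logit_probs(xb, alphas):
    """Ordered logit probabilities from equation (5.53).

    xb     : array of x_i'beta, one entry per individual
    alphas : increasing thresholds alpha_1 < ... < alpha_{J-1}
    Returns an (n, J) array of probabilities.
    """
    # Augment the thresholds with alpha_0 = -inf and alpha_J = +inf.
    a = np.concatenate(([-np.inf], alphas, [np.inf]))
    F = lambda t: 1.0 / (1.0 + np.exp(-t))         # logistic CDF, eq. (5.56)
    cdf = F(a[None, :] - np.asarray(xb)[:, None])  # F(alpha_j - x'beta)
    return np.diff(cdf, axis=1)                    # F(a_j - xb) - F(a_{j-1} - xb)

# Hypothetical index values and thresholds for a 1-5 rating scale.
xb = np.array([-0.5, 0.0, 1.2])
alphas = np.array([-1.0, 0.0, 1.0, 2.0])
p = ordered_logit_probs(xb, alphas)
print(p, p.sum(axis=1))  # each row sums to 1
```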

5.7.2 Ranked Data Models

There are cases where you not only know which alternative an individual would choose,
but you can also rank which alternative he would choose first, which second, and so
forth. Again we have J alternatives, and we are calculating the probability that the
ranking is the one we observe. Consider a four-alternative model, where an individual
ranks alternative 2 as the most favored one, and then alternative 3. That is, he would
choose alternative 2 out of the 4 alternatives, and then he would choose alternative 3
from the three alternatives that are left after removing alternative 2. The rank-ordered
logit model is easy to estimate because it models this joint probability as

\text{Prob}[j_1 = 2, j_2 = 3] = \frac{e^{x_{i2}'\beta}}{e^{x_{i1}'\beta} + e^{x_{i2}'\beta} + e^{x_{i3}'\beta} + e^{x_{i4}'\beta}} \cdot \frac{e^{x_{i3}'\beta}}{e^{x_{i1}'\beta} + e^{x_{i3}'\beta} + e^{x_{i4}'\beta}}.

This is but one possible ranking, of course, since there are 11 more such rankings.⁸
Notice that we could rank two, three, or all four alternatives; with this setup, ranking
three alternatives is equivalent to ranking all four of them.

⁸ There are four alternatives from which we are picking two, and the order matters, so these are
permutations of 2 alternatives out of 4, which gives a total of 4 × 3 = 12 possible rankings.
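
A minimal sketch of this sequential-choice probability, with hypothetical utilities; the ranking [1, 2] below corresponds to the example's alternatives 2 and 3 in zero-based indexing:

```python
import numpy as np

def rank_ordered_logit_prob(v, ranking):
    """Probability of observing a partial ranking under rank-ordered logit.

    v       : representative utilities x'beta, one per alternative
    ranking : alternatives in order of preference (zero-based indices)
    """
    v = np.asarray(v, dtype=float)
    remaining = list(range(len(v)))
    prob = 1.0
    for j in ranking:
        expv = np.exp(v[remaining])
        prob *= np.exp(v[j]) / expv.sum()  # logit choice among those left
        remaining.remove(j)                # drop the chosen alternative
    return prob

# Hypothetical utilities; rank alternative 2 first, then alternative 3.
v = [0.4, 1.0, 0.7, 0.1]
print(rank_ordered_logit_prob(v, [1, 2]))
```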

For the multinomial probit model there is no similar simplification, because the
multinomial probit probabilities do not have a closed form and must instead be
simulated.

5.8 Assignment

This time the assignment will use the dataset that I use for my research on charitable
giving. We are going to analyze what affects giving or not giving, so the dependent
variables for this assignment are:

Binary Models: FamGiveDum;


Multinomial Models: GiveCategories.

The independent variables are: HeadAge, HeadEducation, HeadFemDum, CoupleDum,
HeadRetiredDum, WifeRetiredDum, FamAvailTimePG, FamTotalIncomePG, ExogDonationBS,
ExogVolunteeringBS, ExogTaxesBS, StatePriceIndex, PriceDonation, and PriceTime.

We start the analysis with the binary choice between giving and not giving, so you
have to estimate the binary model using the LPM (OLS), logit, and probit. Remember that
the coefficients per se are not of much interest, so besides the results from the estimation
of the three models, I want you to present the AMEs and the MEMs for each model
(remember that for the LPM the coefficients are the marginal effects), and compare the
differences between the estimates.

The second part of the assignment will be to consider a more complex choice, one
between four alternatives: not giving, giving only money, giving only time, and giving
both money and time. For this part you will have to estimate the multinomial logit
model, and the multinomial probit model. There are no alternative-specific variables in
the explanatory variables, so the commands that you want are mlogit and mprobit.

A Note on STATA’s margins Command

STATA has included a powerful command from version 11 onwards: margins. You can
compute both AMEs and MEMs with margins, but it also does much more than that:
it allows you to correctly predict the value of the dependent variable at many specific
values of the independent ones. In any case, for you to be able to use it correctly,
so that it calculates the effect of a discrete change for dummy variables, you need to
identify the independent variables as either factor (dummy) variables, with the prefix i.,
or continuous variables, with the prefix c., in the regression command, not when calling
the margins command. For example you would write c.HeadAge i.HeadFemDum, and so
on for all other explanatory variables, when calling the logit command.

To compute the AMEs for binary logit and probit, you run the estimation command
with the prefixes for the explanatory variables, and then call:

AMEs: margins, dydx(*);


MEMs: margins, dydx(*) atmeans.

Both commands will appropriately report the discrete differences for the marginal effects
of a change in the dummy variable, as long as you prefixed all your dummy variables
with i. when calling the estimation command. Now, for multinomial estimations it
is slightly trickier. You have to specify what alternative you want the marginal effects
for. For binary models this is not necessary, because as long as you have the marginal
effects on the alternative for which y = 1, the marginal effect on the other alternative
is just the opposite, because the change in the overall probability is zero. That is, if
being female increases the probability of giving by 0.09, then being male decreases the
probability of giving by 0.09. So for each alternative you would have to type
margins, dydx(*) predict(outcome(j)) to get the AMEs, and
margins, dydx(*) predict(outcome(j)) atmeans to get the MEMs, replacing j with
the number of the outcome you want the effects for.
Bibliography

Amemiya, Takeshi, “Qualitative Response Models: A Survey,” Journal of Economic Literature, December 1981, 19 (4), 1483–1536.

Cameron, A. Colin and Pravin K. Trivedi, Microeconometrics: Methods and Applications, 1 ed., New York, NY, USA: Cambridge University Press, May 2005.

Maddala, G. S., Limited-Dependent and Qualitative Variables in Econometrics, Vol. 3 of Econometric Society Monographs, Cambridge, UK: Cambridge University Press, 1983.

McFadden, Daniel, “Conditional Logit Analysis of Qualitative Choice Behavior,” in P. Zarembka, ed., Frontiers in Econometrics, New York, NY, USA: Academic Press, 1974, chapter 4, pp. 105–142.

McFadden, Daniel, “The Measurement of Urban Travel Demand,” Journal of Public Economics, November 1974, 3 (4), 303–328.

Train, Kenneth E., Discrete Choice Methods with Simulation, 2 ed., New York, NY, USA: Cambridge University Press, 2009.
