The Independence of Irrelevant Alternatives (IIA)
A major drawback of the Mlogit is the relationship it imposes between the outcome probabilities. Recall the formula for the probabilities in the Mlogit:

$$P_m = \frac{\exp(x'\beta_m)}{1 + \sum_{j=1}^{M-1} \exp(x'\beta_j)} \tag{34}$$

for all m = 1,…,M-1. However, if we then form the ratio of two probabilities $P_j$ and $P_k$, we find that

$$\frac{P_j}{P_k} = \frac{\exp(x'\beta_j)}{\exp(x'\beta_k)} \tag{35}$$
In other words, the ratio of the probabilities of any two outcomes is independent of the probability of any other outcome. Adding an extra outcome to the range of choices therefore leaves this ratio of probabilities unchanged. This unreasonable characteristic is known as the independence of irrelevant alternatives, and forms the major criticism of the Mlogit.

The independence property follows from the initial assumption that the disturbances are independent and homoscedastic.
What is IIA in simple terms?

The odds of i being chosen over j are independent of the availability or attributes of alternatives other than i and j.

Example:
The odds of PhD students choosing mango juice over orange juice at breakfast are independent of the availability or attributes of other alternatives such as pineapple juice, apple juice, etc.
Why does this unreasonable drawback (i.e. IIA) exist in the mlogit model in the first place?

This is mainly because the $\varepsilon_{ij}$'s (i being the individual and j the choice/alternative) are assumed to be independent.

This assumption implies that (conditional on observed characteristics, the x's) the utility levels of any two alternatives are independent. Therefore, the probability ratios remain unchanged. This is especially troublesome if the alternatives are very similar.
Let us consider the typical ‘Blue’ and ‘Red’ bus example. Decompose a category ‘travel by bus’ into ‘travel by blue bus’ and ‘travel by red bus’. Logically, we would expect a high utility for a red bus to imply a high utility for a blue bus, which the independence assumption rules out.

Another way to look at the problem is to note that the probability ratio of two alternatives does not depend on the nature of any of the other alternatives.
Suppose that alternative 1 denotes travel by car and alternative 2 denotes travel by (blue) bus. Normalising $\beta_1 = 0$ for this example, the probability ratio (or odds ratio) is given by

$$\frac{\Pr(y = 2)}{\Pr(y = 1)} = e^{x'\beta_2}$$

irrespective of whether the 3rd alternative is a red bus or a train. This is an undesirable aspect of the mlogit model.
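The bus example can be made concrete with a few lines of Python. This is a minimal sketch, assuming all three modes have the same systematic utility (the value 0.0 and the function name `mlogit_probs` are hypothetical choices made only for illustration):

```python
import math

def mlogit_probs(utilities):
    """Multinomial logit probabilities from systematic utilities x'beta_j."""
    expo = {name: math.exp(v) for name, v in utilities.items()}
    total = sum(expo.values())
    return {name: e / total for name, e in expo.items()}

# Hypothetical utilities: car and blue bus are equally attractive.
two_way = mlogit_probs({"car": 0.0, "blue_bus": 0.0})
# Add a red bus that is identical to the blue bus in every respect.
three_way = mlogit_probs({"car": 0.0, "blue_bus": 0.0, "red_bus": 0.0})

# IIA: the blue bus / car odds do not react to the new alternative.
ratio_before = two_way["blue_bus"] / two_way["car"]
ratio_after = three_way["blue_bus"] / three_way["car"]

print("car share with 2 alternatives:", two_way["car"])
print("car share with 3 alternatives:", three_way["car"])
print("blue bus / car odds:", ratio_before, ratio_after)
```

The odds stay the same, so the car share has to fall from 1/2 to 1/3 once the red bus is added. Intuition suggests instead that the car share should stay near 1/2, with the two buses splitting the remainder.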
Testing for IIA

Hausman and McFadden (1984) suggest that if a subset of the choice set truly is irrelevant, omitting it from the model altogether will not change the parameter estimates systematically. Excluding these choices will be inefficient but will not lead to inconsistency.

But if the remaining odds ratios are not truly independent of these alternatives, then the parameter estimates obtained when these choices are eliminated will be inconsistent. This observation is the usual basis for Hausman's specification test.
The statistic is

$$\chi^2 = (\hat\beta_s - \hat\beta_f)'[\hat V_s - \hat V_f]^{-1}(\hat\beta_s - \hat\beta_f), \tag{36}$$

where s indicates the estimator based on the restricted subset, f indicates the estimator based on the full set of choices, and $\hat V_s$ and $\hat V_f$ are the respective estimates of the asymptotic covariance matrices. The statistic has a limiting chi-squared distribution with K degrees of freedom (see footnote 1).
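Once both models have been fitted, the statistic in (36) is a single quadratic form. The sketch below assumes that the two coefficient vectors have already been aligned so that they refer to the same K parameters. The function name `hausman_iia_statistic` and all numbers are hypothetical, and `numpy.linalg.solve` is used in place of an explicit matrix inverse:

```python
import numpy as np

def hausman_iia_statistic(beta_s, beta_f, V_s, V_f):
    """Hausman-McFadden statistic (36) for coefficients common to both fits."""
    diff = np.asarray(beta_s, dtype=float) - np.asarray(beta_f, dtype=float)
    V_diff = np.asarray(V_s, dtype=float) - np.asarray(V_f, dtype=float)
    # Solve V_diff * w = diff instead of forming the inverse explicitly.
    return float(diff @ np.linalg.solve(V_diff, diff))

# Hypothetical estimates, for illustration only (K = 2).
beta_s = [0.52, -1.10]           # restricted choice set
beta_f = [0.50, -1.00]           # full choice set
V_s = np.diag([0.010, 0.020])    # covariance matrix, restricted fit
V_f = np.diag([0.006, 0.012])    # covariance matrix, full fit

stat = hausman_iia_statistic(beta_s, beta_f, V_s, V_f)
print(round(stat, 4))
```

With these made-up inputs the statistic is about 1.35, well below the 5% chi-squared critical value of 5.99 for 2 degrees of freedom, so IIA would not be rejected. In finite samples the difference $\hat V_s - \hat V_f$ need not be positive definite, in which case the statistic can be negative or the solve can fail.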
Note: In STATA, one can test whether a given Mlogit model violates the IIA assumption.

Footnote 1: McFadden (1987) shows how this hypothesis can also be tested using a Lagrange multiplier test.
Conditional logit (clogit) vs Mlogit (Wooldridge, 2002)

The clogit and mlogit models have similar response probabilities, but they differ in some important respects.

In the mlogit case, the conditioning variables (covariates or regressors) do not change across alternatives: for each i, $x_i$ contains variables specific to the individual but not to the alternatives.
Context in which to use mlogit

The mlogit model is good for problems where characteristics of the alternatives are unimportant or are not of direct interest to the investigator, or where the data are simply not available.

For example, in a model of occupational choice, we do not usually know how much someone could make in every occupation. What we can usually collect data on are things that affect individual productivity and tastes, such as education and past experience. The mlogit model allows these characteristics to have different effects on the relative probabilities between any two choices.
Context in which to use clogit

The clogit model is intended specifically for problems where consumer or firm choices are at least partly made based on observable attributes of each alternative. The utility level of each choice is assumed to be a linear function of the choice attributes, $x_{ij}$, with common parameter vector $\beta$.

We can actually nest the mlogit model as a special case by appropriately choosing $x_{ij}$ so that they contain both the characteristics of the decision maker (individual/firm/household) and of the alternatives (i.e. the choice set). Therefore, some authors refer to the clogit as the mlogit model, with the understanding that alternative-specific characteristics are allowed in the response probability. However, the clogit model suffers from IIA, like the mlogit model discussed earlier.
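One way to see how the mlogit sits inside the clogit is to build the clogit regressors by placing the individual's characteristics in a block that belongs to a single alternative. The following is a minimal sketch of that construction. The helper names, the individual's data and the coefficient values are all hypothetical:

```python
import math

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    expo = [math.exp(s - top) for s in scores]
    total = sum(expo)
    return [e / total for e in expo]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def mlogit_probs(x_i, betas):
    """Mlogit: the same x_i for every alternative, one coefficient vector each."""
    return softmax([dot(x_i, b_j) for b_j in betas])

def clogit_probs(x_alts, beta):
    """Clogit: one attribute vector x_ij per alternative, common coefficients."""
    return softmax([dot(x_ij, beta) for x_ij in x_alts])

# Hypothetical individual: a constant and years of education.
x_i = [1.0, 12.0]
# Hypothetical mlogit coefficients; alternative 0 is the base (all zeros).
betas = [[0.0, 0.0], [0.5, -0.1], [-1.0, 0.2]]

# Recast as clogit: x_ij puts x_i in the block that belongs to alternative j.
zeros = [0.0] * len(x_i)
x_alts = [zeros + zeros, x_i + zeros, zeros + x_i]
beta_stacked = betas[1] + betas[2]

p_mlogit = mlogit_probs(x_i, betas)
p_clogit = clogit_probs(x_alts, beta_stacked)
print(p_mlogit)
print(p_clogit)
```

Both calls return the same probabilities, because the stacked clogit index for alternative j reduces to $x_i'\beta_j$.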
The Multinomial Probit (Mprobit) Model

Unlike the Mlogit model, the Mprobit model in its general form, which allows correlated errors, does not suffer from IIA. However, it involves the computation of multiple (complex) integrals, which makes convergence problematic. This problem has become less severe with advances in econometric software packages. Like the Mlogit, the Mprobit is relevant when the LHS variable is discrete and has more than two categories without a natural ordering. It is an extension of the binary probit model.

The stochastic error terms for this implementation of the model are assumed to have independent, standard normal distributions. To estimate the model successfully (e.g. using the mprobit command in STATA), you must have a single observation for each decision maker in the sample.
The mprobit model is frequently motivated using a latent-variable framework. The latent variable for the jth alternative, j = 1,…,J, is

$$\eta_{ij} = z_i\alpha_j + \xi_{ij} \tag{37}$$

where the 1×q row vector $z_i$ contains the observed independent variables (regressors) for the ith decision maker. Associated with $z_i$ are the J vectors of regression coefficients $\alpha_j$. The $\xi_{i,1},\ldots,\xi_{i,J}$ are distributed independently and identically standard normal. The decision maker chooses the alternative k such that $\eta_{ik} \ge \eta_{im}$ for all $m \ne k$.

Suppose that case i chooses alternative k, and take the difference between the latent variable $\eta_{ik}$ and the J-1 others:

$$\begin{aligned}
v_{ijk} &= \eta_{ij} - \eta_{ik} \\
        &= z_i(\alpha_j - \alpha_k) + \xi_{ij} - \xi_{ik} \\
        &= z_i\gamma_{j'} + \varepsilon_{ij'}
\end{aligned} \tag{38}$$

where j' = j if j < k and j' = j-1 if j > k, so that j' = 1,…,J-1. Notice that $\mathrm{var}(\varepsilon_{ij'}) = \mathrm{var}(\xi_{ij} - \xi_{ik}) = 2$ and $\mathrm{cov}(\varepsilon_{ij'}, \varepsilon_{il'}) = 1$ for $j' \ne l'$.
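The variance and covariance quoted above follow directly from the assumption that the alternative-specific errors are independent standard normal. A short derivation is sketched below. The labels $\xi$ for the original errors and $\varepsilon$ for their differences are a notational assumption, and j and l denote two different alternatives other than k:

```latex
\begin{align*}
\operatorname{var}(\xi_{ij} - \xi_{ik})
  &= \operatorname{var}(\xi_{ij}) + \operatorname{var}(\xi_{ik}) = 1 + 1 = 2, \\
\operatorname{cov}(\xi_{ij} - \xi_{ik},\; \xi_{il} - \xi_{ik})
  &= \operatorname{var}(\xi_{ik}) = 1 \qquad (j \neq l), \\
\operatorname{corr}(\varepsilon_{ij'}, \varepsilon_{il'})
  &= \frac{1}{\sqrt{2}\,\sqrt{2}} = \frac{1}{2}.
\end{align*}
```

Every pair of differenced errors shares the common term $\xi_{ik}$, which is why all the off-diagonal covariances are equal.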
The probability that alternative k is chosen is

$$\begin{aligned}
\Pr(i \text{ chooses } k) &= \Pr(v_{i1k} \le 0, \ldots, v_{i,J-1,k} \le 0) \\
 &= \Pr(\varepsilon_{i1} \le -z_i\gamma_1, \ldots, \varepsilon_{i,J-1} \le -z_i\gamma_{J-1})
\end{aligned} \tag{39}$$

Hence evaluating the likelihood function involves computing probabilities from the multivariate normal distribution. The fact that all the covariances are equal simplifies the problem somewhat.
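More formulas and technical details

Note that $\varepsilon_i = (\varepsilon_{i1},\ldots,\varepsilon_{i,J-1}) \sim MVN(0, \Sigma)$, where

$$\Sigma = \begin{pmatrix}
2 & 1 & 1 & \cdots & 1 \\
1 & 2 & 1 & \cdots & 1 \\
1 & 1 & 2 & \cdots & 1 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & 1 & 1 & \cdots & 2
\end{pmatrix} \tag{40}$$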
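Denote the deterministic part of the model as $\lambda_{ij'} = z_i\gamma_{j'}$; the probability that individual i chooses outcome k is

$$\begin{aligned}
\Pr(y_i = k) &= \Pr(v_{i1} \le 0, \ldots, v_{i,J-1} \le 0) \\
 &= \Pr(\varepsilon_{i1} \le -\lambda_{i1}, \ldots, \varepsilon_{i,J-1} \le -\lambda_{i,J-1}) \\
 &= \frac{1}{(2\pi)^{(J-1)/2}\,|\Sigma|^{1/2}} \int_{-\infty}^{-\lambda_{i1}} \cdots \int_{-\infty}^{-\lambda_{i,J-1}} \exp\left(-\tfrac{1}{2} z'\Sigma^{-1}z\right) dz
\end{aligned} \tag{41}$$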
Because of the exchangeable correlation structure of $\Sigma$ (note that $\rho_{ij} = 1/2$ for $i \ne j$), we can utilise Dunnett's (1989) result (see footnote 2) to reduce the multidimensional integral to a single dimension:

$$\Pr(y_i = k) = \frac{1}{\sqrt{\pi}} \int_0^{\infty} \left\{ \prod_{j=1}^{J-1} \Phi(-z\sqrt{2} - \lambda_{ij}) + \prod_{j=1}^{J-1} \Phi(z\sqrt{2} - \lambda_{ij}) \right\} e^{-z^2}\, dz \tag{42}$$
Because estimating the mprobit model through multiple integrals raises difficult convergence problems, Gaussian quadrature is used to approximate the above single-dimension integral. This results in the following K-point quadrature formula:

$$\Pr(y_i = k) \approx \frac{1}{2} \sum_{k=1}^{K} w_k \left\{ \prod_{j=1}^{J-1} \Phi(-\sqrt{2x_k} - \lambda_{ij}) + \prod_{j=1}^{J-1} \Phi(\sqrt{2x_k} - \lambda_{ij}) \right\} \tag{43}$$

where $w_k$ and $x_k$ are the weights and roots of the Laguerre polynomial of order K (the summation index k is distinct from the chosen alternative k on the left-hand side).

Footnote 2: Dunnett, C. W. (1989) Algorithm AS 251: Multivariate normal probability integrals with product correlation structure. Journal of the Royal Statistical Society, Series C 38: 564-579.
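The one-dimensional reduction can be checked numerically. The sketch below does not use the Laguerre rule described above. As a swapped-in technique, it applies Gauss-Hermite quadrature (`numpy.polynomial.hermite.hermgauss`) to the same integral written over the whole real line, which equals the half-line form with its two product terms. The function names and the choice of 64 nodes are assumptions made for illustration:

```python
import math
from numpy.polynomial.hermite import hermgauss

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mprobit_choice_prob(lam, n_nodes=64):
    """Pr(y_i = k) given lam = [lambda_i1, ..., lambda_i,J-1].

    Evaluates (1/sqrt(pi)) * integral over the real line of
    prod_j Phi(z*sqrt(2) - lambda_ij) * exp(-z^2) dz by Gauss-Hermite.
    """
    nodes, weights = hermgauss(n_nodes)  # rule for the weight exp(-z^2)
    total = 0.0
    for z, w in zip(nodes, weights):
        prod = 1.0
        for l in lam:
            prod *= std_normal_cdf(z * math.sqrt(2.0) - l)
        total += w * prod
    return total / math.sqrt(math.pi)

# Two alternatives: the result must equal Phi(-lambda / sqrt(2)).
p_two = mprobit_choice_prob([0.3])
# Three alternatives with identical deterministic parts: 1/3 by symmetry.
p_three = mprobit_choice_prob([0.0, 0.0])
print(p_two, std_normal_cdf(-0.3 / math.sqrt(2.0)))
print(p_three)
```

The two checks rely only on known closed forms: with a single differenced error of variance 2 the probability is $\Phi(-\lambda/\sqrt{2})$, and with equal deterministic parts every alternative is equally likely.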
Identification

In eq. (38), not all J of the $\alpha_j$ are identifiable. To remove the indeterminacy, $\alpha_l$ is set to the zero vector, where l is the base outcome. That fixes the lth latent variable to zero so that the remaining variables measure the attractiveness of the other alternatives relative to the base.
The ordered probit/logit model

There are some economic applications where a simple binary choice model is inappropriate or over-simplistic. Consider, for example, an economic model which seeks to examine not only the choice of labour force participation, but also whether a participant chooses to work part-time or full-time. Another example is the responses to a questionnaire which elicits a series of satisfaction rankings, as in taste surveys.
Each of the above examples involves more than two possible outcomes. A model which seeks to explain such outcomes must therefore involve a dependent variable with more than two possible observed values.

The application of ordered probit/logit models to this problem exploits the fact that the dependent variable outcomes have a natural (ordinal) ranking. In other words, the responses can be ordered in some meaningful fashion. The major advantage of these models is that, by exploiting such a feature, the resulting model is relatively easy to estimate. The downside is that the behavioural model may be considered too restrictive.
To explain the structure of the model, consider again a sample of data $\{y_i, x_i\}$ of size n drawn independently from some population, where now the dependent variable $y_i$ has M possible outcomes $y_i = 1,\ldots,M$ with a natural ordering (that is, m+1 is in some sense ‘better’ than m). The observed values are assumed to derive from some unobservable latent variable $y_i^*$ where, as with the binary choice models,

$$y_i^* = x_i'\beta + u_i, \quad i = 1,\ldots,n$$

for some k×1 parameter vector $\beta$ and (univariate) stochastic disturbance term $u_i$. The M outcomes for the observed variable $y_i$ are assumed to be related to the latent variable through the following observability criterion:

$$y_i = m \quad \text{if} \quad \mu_{m-1} < y_i^* \le \mu_m, \quad m = 1,\ldots,M,$$

for a set of parameters $\mu_0$ to $\mu_M$, with $\mu_0 < \mu_1 < \mu_2 < \cdots < \mu_M$, $\mu_0 = -\infty$ and $\mu_M = +\infty$.

Then, the conditional probability of observing the mth category (i.e. $y_i = m$) can be written as

$$\begin{aligned}
\Pr(y_i = m \mid x_i) &= \Pr(\mu_{m-1} < y_i^* \le \mu_m) \\
 &= \Pr(\mu_{m-1} < x_i'\beta + u_i \le \mu_m)
\end{aligned}$$

Rearranging to isolate $u_i$, we have

$$\begin{aligned}
\Pr(y_i = m \mid x_i) &= \Pr(\mu_{m-1} - x_i'\beta < u_i \le \mu_m - x_i'\beta) \\
 &= \Pr(u_i \le \mu_m - x_i'\beta) - \Pr(u_i \le \mu_{m-1} - x_i'\beta)
\end{aligned}$$

for m = 1,…,M. We can now see the importance of the ordering of the $\mu$'s: it is required for this probability to be strictly positive. To evaluate the conditional probability, we must make some distributional assumption for the disturbance term $u_i$.

The two most popular options are to assume a standard normal distribution for $u_i$ (yielding the ordered probit model) or a logistic distribution (yielding the ordered logit model).
Taking the ordered probit first, by assuming that $u_i \sim N(0,1)$, the conditional probabilities can be derived as

$$\Pr(y_i = m \mid x_i) = \Phi(\mu_m - x_i'\beta) - \Phi(\mu_{m-1} - x_i'\beta)$$

One can evaluate this probability for any combination of parameters $\{\beta, \mu\}$.
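Evaluating these probabilities for a given parameter combination takes only a few lines. The sketch below assumes the thresholds are passed as the M-1 finite interior cut points, with minus and plus infinity added at the two ends. The function names and the numerical values are hypothetical:

```python
import math

def std_normal_cdf(x):
    # math.erf handles +/- infinity, so the outer thresholds need no special case.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordered_probit_probs(xb, cutpoints):
    """Return [Pr(y = 1), ..., Pr(y = M)] for the linear index xb = x'beta.

    cutpoints holds the interior thresholds mu_1 < ... < mu_{M-1}.
    """
    mu = [-math.inf] + list(cutpoints) + [math.inf]
    return [
        std_normal_cdf(mu[m] - xb) - std_normal_cdf(mu[m - 1] - xb)
        for m in range(1, len(mu))
    ]

# Hypothetical example with M = 3 categories.
probs = ordered_probit_probs(xb=0.4, cutpoints=[-0.5, 0.8])
print(probs, sum(probs))
```

Because the thresholds are strictly increasing, every probability is positive, and the differences telescope so that the M probabilities sum to one.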
In order to isolate those parameters which best suit a sample of data, we again exploit ML techniques. Recall that in essence we are looking for an expression for the observed state probabilities, aggregated over a sample of data.

To derive the likelihood function for the ordered probit model, define M selection variables $z_{im} = 1(y_i = m)$, for m = 1,…,M. Then, the likelihood contribution for the ith observation in the sample can be written as

$$\begin{aligned}
l_i &= \prod_{m=1}^{M} \Pr(y_i = m \mid x_i)^{z_{im}} \\
    &= \prod_{m=1}^{M} [\Phi(\mu_m - x_i'\beta) - \Phi(\mu_{m-1} - x_i'\beta)]^{z_{im}}
\end{aligned}$$

and the likelihood becomes

$$L = \prod_{i=1}^{n} \prod_{m=1}^{M} [\Phi(\mu_m - x_i'\beta) - \Phi(\mu_{m-1} - x_i'\beta)]^{z_{im}}$$

Finally, taking logarithms, we come to the log-likelihood function which can be used to estimate the ML coefficients:

$$\ln L = \sum_{i=1}^{n} \sum_{m=1}^{M} z_{im} \ln[\Phi(\mu_m - x_i'\beta) - \Phi(\mu_{m-1} - x_i'\beta)].$$
Any gradient-based ML solution routine will be able to calculate those parameters $\{\tilde\beta, \tilde\mu\}$ which maximise the sample likelihood, satisfying the following FOCs:

$$\frac{\partial \ln L}{\partial \beta} = 0 \quad \text{and} \quad \frac{\partial \ln L}{\partial \mu} = 0.$$
Nested logit model [Wooldridge (2002); Greene (2003, pp. 725-727); Maddala (1983, pp. 67-70); Cameron and Trivedi (2006, pp. 508-512); Handbook of Econometrics]

A different approach to relaxing IIA is to specify a hierarchical model. The most popular of these is the nested logit (nlogit) model (see footnote 3). This model is based on the generalised extreme value (GEV) distribution and generalises the mlogit model to a nested multinomial logit model.

Footnote 3: See McFadden (1984) Econometric analysis of qualitative response models, in Handbook of Econometrics, vol. 2, ed. Z. Griliches and M. D. Intriligator. Amsterdam: North Holland, 1395-1457.
Suppose that i = first-level decision (limb) and j = bottom-level decision (branch). The decision tree structure is as follows:

Root
Limb 1 … Limb i … Limb I
Branch 1 … Branch J_1 … Branch j … Branch 1 … Branch J_I

The utility for the alternative in the ith of I limbs and the jth of $J_i$ branches is

$$U_{ij} = V_{ij} + \varepsilon_{ij}, \quad j = 1,2,\ldots,J_i;\ i = 1,2,\ldots,I$$

The nlogit model of McFadden (1978) (see footnote 4) arises when the error terms $\varepsilon_{ij}$ have the GEV joint cumulative distribution function

$$F(\varepsilon) = \exp[-G(e^{-\varepsilon_{11}},\ldots,e^{-\varepsilon_{1J_1}};\ldots;e^{-\varepsilon_{I1}},\ldots,e^{-\varepsilon_{IJ_I}})]$$

for the following particular specification of the function G(.):

$$G(Y) = G(Y_{11},\ldots,Y_{1J_1},\ldots,Y_{I1},\ldots,Y_{IJ_I}) = \sum_{i=1}^{I} \left( \sum_{j=1}^{J_i} Y_{ij}^{1/\tau_i} \right)^{\tau_i}$$

The parameter $\tau_i$ is a function of the correlation between $\varepsilon_{ij}$ and $\varepsilon_{ik}$ but does not exactly equal the correlation parameter. In fact

$$\tau_i = \sqrt{1 - \mathrm{Cor}[\varepsilon_{ij}, \varepsilon_{ik}]}$$

so $\tau_i$ is inversely related to the correlation and we expect $0 \le \tau_i \le 1$. The choice $\tau_i = 1$ corresponds to independence of $\varepsilon_{ij}$ and $\varepsilon_{ik}$ and leads to the mlogit model. We call the parameters $\tau_i$ the scale parameters.

The outcome indicator variables $y_{ij}$ equal 1 if alternative ij is chosen and 0 otherwise:

$$p_{ij} = \Pr[y_{ij} = 1] = \Pr[U_{ij} \ge U_{kl},\ \forall k, l]$$

Footnote 4: McFadden (1978) Modelling the choice of residential location, in Spatial Interaction Theory and Planning Models, 75-96, A. Karlqvist, L. Lundqvist, F. Snickars and J. Weibull (Eds.), Amsterdam, New York: North-Holland.
Closed-form solutions for this probability can be derived as a function of $V_{ij}$ and $\tau_i$ (see Cameron and Trivedi, 2006, p. 526). These are evaluated for the particular deterministic utility function

$$V_{ij} = z_i'\alpha + x_{ij}'\beta_i, \quad j = 1,\ldots,J_i,\ i = 1,\ldots,I,$$

where $z_i$ varies over limbs only and $x_{ij}$ varies over both limbs and branches. $\alpha$ and $\beta_i$ are the regression parameters.
In a nested logit model, alternatives are grouped into subgroups such that the IIA assumption holds only within each group. In other words, the errors in a random utility model are permitted to be correlated across the options within a group, but they are uncorrelated across groups.
Terminologies (STATA manual)

Level or decision level
The level or stage at which a decision is made. First-level decisions are made first, followed by second-level decisions, and so on.

Bottom level
The level where the final decision is made.

Alternative set
The set of all possible alternatives at any given decision level.

Bottom alternative set
The set of all possible alternatives at the bottom level. This is often referred to as the choice set in the economics-choice literature.

Alternative
A specific alternative within an alternative set. Not all alternatives (within an alternative set) are available to someone making a choice at a specific stage, only those that are nested within all prior decisions.

Chosen alternative
The alternative picked by the decision-maker from an alternative set.
The model

First, we illustrate the basic approach of the model where there are only two levels. Then we extend our exposition to the case of a three-level nlogit model. McFadden (1977, 1981) showed how the nlogit can be derived from a rational choice framework (see footnote 5). Amemiya (1985) shows how the model can be derived under the assumption of utility maximisation (see footnote 6).

Two-level nlogit model

We index the first-level alternative as i and the bottom-level alternative as j.

Footnote 5: McFadden, D. (1977) Quantitative methods for analysing behaviour of individuals: some recent developments. Cowles Foundation Discussion Paper no. 474. McFadden, D. (1981) Econometric models of probabilistic choice. In Structural Analysis of Discrete Data with Econometric Applications, pp. 198-272. Cambridge, MA: MIT Press.

Footnote 6: Amemiya (1985) Advanced Econometrics, Cambridge, MA: Harvard University Press.
Let $x_{ij}$ and $y_i$ refer to the vectors of explanatory variables specific to categories (i, j) and (i), respectively.

We write the joint probability of being on limb i and branch j as

$$\mathrm{Pr}_{ij} = \mathrm{Pr}_{j|i}\,\mathrm{Pr}_i$$

which is the product of the probability of choosing branch j conditional on being on limb i and the probability of choosing limb i. The conditional probability $\mathrm{Pr}_{j|i}$ will involve only the parameters $\beta$:

$$\mathrm{Pr}_{j|i} = \frac{\exp(x_{ij}\beta)}{\sum_k \exp(x_{ik}\beta)}$$

We define the inclusive value (see footnote 7), or log-sum, for category (i) as

$$I_i = \ln\left\{ \sum_k \exp(x_{ik}\beta) \right\}$$

then

$$\mathrm{Pr}_i = \frac{\exp(y_i\alpha + \tau_i I_i)}{\sum_m \exp(y_m\alpha + \tau_m I_m)}$$

where $\tau_i$ is the inclusive-value parameter of limb i.

Footnote 7: The nlogit command in STATA 9 allows you to apply linear constraints to the inclusive-value parameters. You can constrain inclusive-value parameters to, say, be equal to each other, or specify fixed values rather than allowing these parameters to be freely estimated. To estimate the model, you need to specify and display the tree structure of the nested logit model using nlogitgen and nlogittree. The former helps you to generate a categorical variable that identifies the first-level set of alternatives.
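The two-level formulas can be traced step by step in code. The sketch below follows the expressions above literally (conditional probabilities, inclusive values, then limb probabilities) and applies them to a car versus red/blue bus example. The function name, the data layout and all numerical values are hypothetical:

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def two_level_nlogit(x, y, beta, alpha, tau):
    """Return probs[i][j] = Pr(j | i) * Pr(i).

    x[i][j] is the attribute vector of branch j in limb i, y[i] the
    attribute vector of limb i, tau[i] the inclusive-value parameter.
    """
    cond, incl = [], []
    for x_i in x:
        expo = [math.exp(dot(x_ij, beta)) for x_ij in x_i]
        total = sum(expo)
        cond.append([e / total for e in expo])   # Pr(j | i)
        incl.append(math.log(total))             # inclusive value I_i
    limb_expo = [math.exp(dot(y_i, alpha) + t * I)
                 for y_i, t, I in zip(y, tau, incl)]
    limb_total = sum(limb_expo)
    limb = [e / limb_total for e in limb_expo]   # Pr(i)
    return [[limb[i] * c for c in cond[i]] for i in range(len(x))]

# Limb 0: car (one branch). Limb 1: bus (blue and red branches).
x = [[[0.0]], [[0.0], [0.0]]]
y = [[0.0], [0.0]]
beta, alpha = [1.0], [1.0]

p_logit = two_level_nlogit(x, y, beta, alpha, tau=[1.0, 1.0])
p_nested = two_level_nlogit(x, y, beta, alpha, tau=[1.0, 0.01])
print(p_logit)
print(p_nested)
```

With all inclusive-value parameters equal to 1 the three alternatives each receive probability 1/3, as in the plain logit. With a bus parameter close to 0 the car share moves to just under 1/2 and the two buses split the rest, which is the pattern intuition suggests for near-identical alternatives.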
Three-level nlogit model

Following Greene (2003), we index the first-level alternatives as i, the second-level alternatives as j and the bottom-level alternatives as k. Let $x_{ijk}$, $y_{ij}$ and $z_i$ refer to row vectors of explanatory variables specific to categories (i, j, k), (i, j) and (i), respectively.

$$\mathrm{Pr}_{ijk} = \mathrm{Pr}_{k|ij}\,\mathrm{Pr}_{j|i}\,\mathrm{Pr}_i$$

As in the two-level model, the conditional probability $\mathrm{Pr}_{k|ij}$ will involve only the parameters $\beta$:

$$\mathrm{Pr}_{k|ij} = \frac{\exp(x_{ijk}\beta)}{\sum_n \exp(x_{ijn}\beta)}$$

We then define the inclusive values for category (i, j) as

$$I_{ij} = \ln\left\{ \sum_n \exp(x_{ijn}\beta) \right\}$$

and

$$\mathrm{Pr}_{j|i} = \frac{\exp(y_{ij}\alpha + \tau_{ij} I_{ij})}{\sum_m \exp(y_{im}\alpha + \tau_{im} I_{im})}$$

We also define inclusive values for category (i) as

$$J_i = \ln\left\{ \sum_m \exp(y_{im}\alpha + \tau_{im} I_{im}) \right\}$$

then

$$\mathrm{Pr}_i = \frac{\exp(z_i\gamma + \delta_i J_i)}{\sum_l \exp(z_l\gamma + \delta_l J_l)}$$

If we restrict all the $\tau_{ij}$ and $\delta_i$ to be 1, we recover the clogit model of the following form:

$$\mathrm{Pr}_{ijk} = \frac{\exp(V_{ijk})}{\sum_l \sum_m \sum_n \exp(V_{lmn})}$$

where

$$V_{ijk} = x_{ijk}\beta + y_{ij}\alpha + z_i\gamma$$
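The collapse to the clogit can be checked by direct substitution. The derivation below is a sketch that takes the three-level formulas above, writes the inclusive-value parameters as $\tau_{ij}$ and $\delta_i$ (these symbol names are assumptions for illustration), and sets all of them to 1:

```latex
\begin{align*}
\exp(I_{ij}) &= \sum_n \exp(x_{ijn}\beta), \qquad
\exp(J_i) = \sum_m \exp(y_{im}\alpha + I_{im}) \\[4pt]
\mathrm{Pr}_{k|ij}\,\mathrm{Pr}_{j|i}
  &= \frac{\exp(x_{ijk}\beta)}{\exp(I_{ij})} \cdot
     \frac{\exp(y_{ij}\alpha + I_{ij})}{\exp(J_i)}
   = \frac{\exp(x_{ijk}\beta + y_{ij}\alpha)}{\exp(J_i)} \\[4pt]
\mathrm{Pr}_{ijk}
  &= \frac{\exp(x_{ijk}\beta + y_{ij}\alpha)}{\exp(J_i)} \cdot
     \frac{\exp(z_i\gamma + J_i)}{\sum_l \exp(z_l\gamma + J_l)}
   = \frac{\exp(V_{ijk})}{\sum_l \exp(z_l\gamma)\exp(J_l)} \\[4pt]
\exp(z_l\gamma)\exp(J_l)
  &= \sum_m \sum_n \exp(x_{lmn}\beta + y_{lm}\alpha + z_l\gamma)
   = \sum_m \sum_n \exp(V_{lmn})
\end{align*}
```

Substituting the last line into the denominator of the third gives the clogit probability, with the sum running over every alternative at the bottom level.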
Estimation and the log likelihood of the Nlogit model

There are two ways of estimating the nlogit model: sequential estimation and FIML (full-information maximum likelihood) (see footnote 8).

If g = 1, 2,…, G denotes the groups, and $\mathrm{Pr}^g_{ijk}$ is the probability of category (i, j, k) being a positive outcome in group g, the log likelihood of the nlogit model is

$$\ln L = \sum_g \ln(\mathrm{Pr}^g_{ijk}) = \sum_g \left( \ln \mathrm{Pr}^g_{k|ij} + \ln \mathrm{Pr}^g_{j|i} + \ln \mathrm{Pr}^g_i \right)$$

The sequential estimator is less efficient than the FIML estimator. At the second stage the usual clogit standard errors understate the true standard errors of the sequential estimator, since they do not allow for the estimation error in computing the inclusive value.

McFadden (1981) gives the formula for the correct standard errors; alternatively, the bootstrap or the delta method can be used.

Footnote 8: Note that STATA 9 uses FIML to fit the model.