
The Independence of Irrelevant Alternatives (IIA)

A major drawback of the Mlogit is the relationship between probabilities. Recall the formula for the probabilities in the Mlogit:

$$P_m = \frac{\exp(x'\beta_m)}{1 + \sum_{j=1}^{M-1} \exp(x'\beta_j)} \tag{34}$$

for all m = 1,…,M-1. However, if we then form the ratio of two probabilities $P_j$ and $P_k$, we find that

$$\frac{P_j}{P_k} = \frac{\exp(x'\beta_j)}{\exp(x'\beta_k)} \tag{35}$$

In other words, the ratio of the probabilities of any two outcomes is independent of the probability of any other outcome. Adding an extra outcome to the range of choices therefore leaves this ratio of probabilities unchanged. This unreasonable characteristic is known as the independence of irrelevant alternatives, and forms the major criticism of the Mlogit.

The independence assumption follows from the
initial assumption that the disturbances are
independent and homoscedastic.
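
The invariance of the probability ratio can be checked numerically. The sketch below is only an illustration: the function name `mlogit_probs` and the utility values are assumptions made for the example, and the probabilities are written in the general form (a base outcome with utility 0 reproduces the "1 +" in the denominator of the Mlogit formula).

```python
import math

def mlogit_probs(utilities):
    """Multinomial-logit probabilities for a list of systematic utilities x'beta_j."""
    top = max(utilities)                       # subtract the max for numerical stability
    expu = [math.exp(u - top) for u in utilities]
    total = sum(expu)
    return [e / total for e in expu]

# Hypothetical utilities: two alternatives, then the same two plus a third.
two = mlogit_probs([0.4, 1.0])
three = mlogit_probs([0.4, 1.0, 2.5])

ratio_two = two[0] / two[1]
ratio_three = three[0] / three[1]

# Both ratios equal exp(0.4 - 1.0), whatever the third utility is.
print(ratio_two, ratio_three)
```

The individual probabilities change when the third alternative is added, but their ratio does not, which is exactly the IIA property.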

What is IIA in simple terms?


The odds of i being chosen over j are independent of the availability or attributes of alternatives other than i and j.

Example:
The odds of PhD students choosing mango juice over orange juice at breakfast are independent of the availability or attributes of other alternatives such as pineapple juice, apple juice, etc.

Why does this unreasonable drawback (i.e. IIA) exist in the mlogit model in the first place?
This is mainly because the $\varepsilon_{ij}$'s (i being the individual and j the choice/alternative) are assumed to be independent.

This assumption implies that (conditional on the observed characteristics, the x's) the utility levels of any two alternatives are independent. Therefore, the probability ratios remain unchanged. This is especially troublesome if the alternatives are very similar.

Let us consider the typical ‘Blue’ and ‘Red’ bus example. Decompose a category ‘travel by bus’ into ‘travel by blue bus’ and ‘travel by red bus’.
Logically, we would expect that a high utility for a red bus implies a high utility for a blue bus.
Another way to look at the problem is to note that the probability ratio of two alternatives does not depend on the nature of any of the other alternatives.

Suppose that alternative 1 denotes travel by car and alternative 2 denotes travel by (blue) bus. With alternative 1 as the base outcome (so that $\beta_1 = 0$), the probability ratio (or odds ratio) is given by

$$\frac{\Pr(y=2)}{\Pr(y=1)} = e^{x'\beta_2}$$

irrespective of whether the 3rd alternative is a red bus or a train. This is an undesirable aspect of the mlogit model.
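
A small numerical sketch of the bus example follows. The utilities are made-up values chosen so that car and bus are equally attractive; `mlogit_probs` is a helper written for this illustration only.

```python
import math

def mlogit_probs(utilities):
    """Multinomial-logit probabilities for a list of systematic utilities."""
    expu = [math.exp(u) for u in utilities]
    total = sum(expu)
    return [e / total for e in expu]

# Car and blue bus equally attractive (illustrative assumption).
car, blue = mlogit_probs([0.0, 0.0])
print(car, blue)                     # 0.5 0.5

# Add a red bus that is identical to the blue bus except for its colour.
car3, blue3, red3 = mlogit_probs([0.0, 0.0, 0.0])
print(car3, blue3, red3)             # each equals one third

# The car/blue-bus odds are the same in both choice sets.
odds_before = blue / car
odds_after = blue3 / car3
```

Intuition suggests that the red bus should take its share from the blue bus only (car 1/2, each bus 1/4). The mlogit instead predicts 1/3 for each alternative, because it must keep the car-to-blue-bus odds at 1.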

Testing for IIA


Hausman and McFadden (1984) suggest that if a subset of the choice set truly is irrelevant, omitting it from the model altogether will not change the parameter estimates systematically. Exclusion of these choices will be inefficient but will not lead to inconsistency.

But if the remaining odds ratios are not truly independent of these alternatives, then the parameter estimates obtained when these choices are eliminated will be inconsistent. This is the usual basis for Hausman’s specification test.

The statistic is

$$\chi^2 = (\hat\beta_s - \hat\beta_f)'\,[\hat V_s - \hat V_f]^{-1}\,(\hat\beta_s - \hat\beta_f), \tag{36}$$

where s indicates the estimator based on the restricted subset, f indicates the estimator based on the full set of choices, and $\hat V_s$ and $\hat V_f$ are the respective estimates of the asymptotic covariance matrices. The statistic has a limiting chi-squared distribution with K degrees of freedom (see footnote 1).
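
To make the quadratic form concrete, here is a minimal sketch that evaluates the statistic for K = 2. All numbers (coefficient vectors and covariance matrices) are invented for illustration, and the helper names `solve` and `hausman_stat` are assumptions of this example, not part of any package.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def hausman_stat(b_s, b_f, v_s, v_f):
    """(b_s - b_f)' [V_s - V_f]^{-1} (b_s - b_f)."""
    d = [s - f for s, f in zip(b_s, b_f)]
    k = len(d)
    v = [[v_s[i][j] - v_f[i][j] for j in range(k)] for i in range(k)]
    w = solve(v, d)                                    # w = [V_s - V_f]^{-1} d
    return sum(di * wi for di, wi in zip(d, w))

# Made-up estimates from the restricted (s) and full (f) choice sets.
b_s, b_f = [0.52, -1.10], [0.50, -1.00]
v_s = [[0.020, 0.002], [0.002, 0.030]]
v_f = [[0.010, 0.001], [0.001, 0.015]]

stat = hausman_stat(b_s, b_f, v_s, v_f)
crit_5pct_df2 = 5.991      # 5% critical value of the chi-squared distribution with 2 df
print(stat, stat > crit_5pct_df2)
```

With these invented numbers the statistic is well below the critical value, so IIA would not be rejected. In practice the difference $\hat V_s - \hat V_f$ may fail to be positive definite in finite samples, in which case the statistic can be negative and the test is hard to interpret.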

Note: In STATA, one can test whether or not a given Mlogit model suffers from IIA.

1. McFadden (1987) shows how this hypothesis can also be tested using a Lagrange multiplier test.

Conditional logit (clogit) vs Mlogit (Wooldridge, 2002)

The clogit and mlogit models have similar response probabilities, but they differ in some important respects.

In the mlogit case, the conditioning variables (covariates or regressors) do not change across alternatives: for each i, $x_i$ contains variables specific to the individual but not to the alternatives.

Context in which to use mlogit

The mlogit model is good for problems where characteristics of the alternatives are unimportant or are not of direct interest to the investigator, or where the data are simply not available.
For example, in a model of occupational choice, we do not usually know how much someone could make in every occupation. What we can usually collect data on are things that affect individual productivity and tastes, such as education and past experience. The mlogit model allows these characteristics to have different effects on the relative probabilities between any two choices.

Context in which to use clogit

The clogit model is intended specifically for problems where consumer or firm choices are at least partly made based on observable attributes of each alternative. The utility level of each choice is assumed to be a linear function of the choice attributes, $x_{ij}$, with common parameter vector $\beta$.
The clogit model actually contains the mlogit model as a special case: we choose the $x_{ij}$ appropriately so that they contain both the characteristics of the decision maker (individual/firm/household) and of the alternatives (i.e. the choice set). Therefore, some authors refer to the clogit as the mlogit model, with the understanding that alternative-specific characteristics are allowed in the response probability. However, the clogit model suffers from IIA like the Mlogit model discussed earlier.
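
The following sketch shows how clogit response probabilities are built from alternative-specific attributes and a common parameter vector, and that the IIA property carries over. The travel-mode attributes, the coefficient values and the function name `clogit_probs` are all assumptions made for illustration.

```python
import math

def clogit_probs(x_alts, beta):
    """Conditional-logit probabilities.

    x_alts: one attribute vector x_ij per alternative j
    beta:   common parameter vector (the same for every alternative)
    """
    v = [sum(b * x for b, x in zip(beta, xj)) for xj in x_alts]
    top = max(v)                                  # stabilise the exponentials
    e = [math.exp(vj - top) for vj in v]
    s = sum(e)
    return [ej / s for ej in e]

# Hypothetical attributes (cost, travel time in hours) for car, bus, train.
x_i = [(4.0, 0.5), (2.0, 1.0), (1.0, 1.5)]
beta = (-0.3, -1.0)

p_all = clogit_probs(x_i, beta)        # full choice set
p_sub = clogit_probs(x_i[:2], beta)    # train removed

# IIA: the car/bus odds are the same with or without the train.
odds_all = p_all[0] / p_all[1]
odds_sub = p_sub[0] / p_sub[1]
print(p_all, odds_all, odds_sub)
```

Unlike the mlogit, the regressors here vary over alternatives while the coefficients do not; the odds between two alternatives still depend only on the attributes of those two alternatives.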

The Multinomial Probit (Mprobit) Model

Unlike the Mlogit model, the Mprobit model, in its general form with correlated disturbances, does not suffer from IIA. However, it involves the computation of multiple (complex) integrals, which makes convergence problematic. This problem has become less severe in recent times thanks to advances in econometric software packages. Like the Mlogit, the Mprobit is relevant when the LHS variable is discrete and has more than two categories that do not have a natural ordering. It is an extension of the binary probit model.

The stochastic error terms for this implementation of the model are assumed to have independent, standard normal distributions. To successfully estimate the model (e.g. using the mprobit command in STATA), you must have a single observation for each decision maker in the sample.

The mprobit model is frequently motivated using a latent-variable framework. The latent variable for the jth alternative, j = 1,…,J, is

$$\eta_{ij} = z_i\alpha_j + \xi_{ij} \tag{37}$$

where the 1×q row vector $z_i$ contains the observed independent variables (regressors) for the ith decision maker. Associated with $z_i$ are the J vectors of regression coefficients $\alpha_j$. The $\xi_{i,1},\dots,\xi_{i,J}$ are distributed independently and identically standard normal. The decision maker chooses the alternative k such that

$$\eta_{ik} \ge \eta_{im}, \quad m \ne k.$$

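Suppose that case i chooses alternative k, and take the difference between the latent variable $\eta_{ik}$ and the J-1 others:

$$\begin{aligned} v_{ijk} &= \eta_{ij} - \eta_{ik} \\ &= z_i(\alpha_j - \alpha_k) + \xi_{ij} - \xi_{ik} \\ &= z_i\gamma_{j'} + \varepsilon_{ij'} \end{aligned} \tag{38}$$

where j' = j if j < k and j' = j-1 if j > k, so that j' = 1,…,J-1. Notice that $\mathrm{Var}(\varepsilon_{ij'}) = \mathrm{Var}(\xi_{ij} - \xi_{ik}) = 2$ and $\mathrm{Cov}(\varepsilon_{ij'}, \varepsilon_{il'}) = 1$ for $j' \ne l'$.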
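
The variance of 2 and covariance of 1 follow because every differenced error shares the same standard normal term for the chosen alternative. A quick simulation, offered only as a sanity check under the stated iid standard normal assumption (the function name, sample size and seed are arbitrary choices), illustrates this:

```python
import random

def diff_error_moments(n, seed=0):
    """Simulate e1 = xi_1 - xi_k and e2 = xi_2 - xi_k with iid N(0,1) draws
    and return the sample variance of e1 and the sample covariance of (e1, e2)."""
    rng = random.Random(seed)
    e1, e2 = [], []
    for _ in range(n):
        xi_k, xi_1, xi_2 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        e1.append(xi_1 - xi_k)
        e2.append(xi_2 - xi_k)
    m1, m2 = sum(e1) / n, sum(e2) / n
    var1 = sum((a - m1) ** 2 for a in e1) / n
    cov12 = sum((a - m1) * (b - m2) for a, b in zip(e1, e2)) / n
    return var1, cov12

var1, cov12 = diff_error_moments(100_000)
print(var1, cov12)    # close to 2 and 1 respectively
```

The implied correlation between any two differenced errors is therefore 1/2.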

The probability that alternative k is chosen is

$$\begin{aligned} \Pr(i \text{ chooses } k) &= \Pr(v_{i1k} \le 0, \dots, v_{i,J-1,k} \le 0) \\ &= \Pr(\varepsilon_{i1} \le -z_i\gamma_1, \dots, \varepsilon_{i,J-1} \le -z_i\gamma_{J-1}) \end{aligned} \tag{39}$$

Hence evaluating the likelihood function involves computing probabilities from the multivariate normal distribution. That all the covariances are equal simplifies the problem somewhat.

More formulas and technical details

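Note that $\varepsilon_i = (\varepsilon_{i1},\dots,\varepsilon_{i,J-1}) \sim MVN(0, \Sigma)$, where

$$\Sigma = \begin{pmatrix} 2 & 1 & 1 & \cdots & 1 \\ 1 & 2 & 1 & \cdots & 1 \\ 1 & 1 & 2 & \cdots & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & 1 & \cdots & 2 \end{pmatrix} \tag{40}$$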

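Denote the deterministic part of the model as $\lambda_{ij'} = z_i\gamma_{j'}$; the probability that individual i chooses outcome k is

$$\begin{aligned} \Pr(y_i = k) &= \Pr(v_{i1} \le 0, \dots, v_{i,J-1} \le 0) \\ &= \Pr(\varepsilon_{i1} \le -\lambda_{i1}, \dots, \varepsilon_{i,J-1} \le -\lambda_{i,J-1}) \\ &= \frac{1}{(2\pi)^{(J-1)/2}\,|\Sigma|^{1/2}} \int_{-\infty}^{-\lambda_{i1}} \cdots \int_{-\infty}^{-\lambda_{i,J-1}} \exp\!\left(-\tfrac{1}{2}\, z'\Sigma^{-1}z\right) dz \end{aligned} \tag{41}$$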

Because of the exchangeable correlation structure of $\Sigma$ (note that $\rho_{ij} = 1/2$, $i \ne j$), we can utilise Dunnett’s (1989) result (see footnote 2) to reduce the multidimensional integral to a single dimension:

$$\Pr(y_i = k) = \frac{1}{\sqrt{\pi}} \int_0^{\infty} \left\{ \prod_{j=1}^{J-1} \Phi\!\left(-z\sqrt{2} - \lambda_{ij}\right) + \prod_{j=1}^{J-1} \Phi\!\left(z\sqrt{2} - \lambda_{ij}\right) \right\} e^{-z^2}\, dz \tag{42}$$
Because the multiple integrals make estimation of the mprobit model difficult, Gaussian quadrature is used to approximate the above single-dimension integral.
This results in the following K-point quadrature formula:

$$\Pr(y_i = k) \approx \frac{1}{2} \sum_{k=1}^{K} w_k \left\{ \prod_{j=1}^{J-1} \Phi\!\left(-\sqrt{2x_k} - \lambda_{ij}\right) + \prod_{j=1}^{J-1} \Phi\!\left(\sqrt{2x_k} - \lambda_{ij}\right) \right\} \tag{43}$$

where $w_k$ and $x_k$ are the weights and roots of the Laguerre polynomial of order K.

2. Dunnett, C. W. (1989) Algorithm AS 251: Multivariate normal probability integrals with product correlation structure. Journal of the Royal Statistical Society, Series C 38: 564-579.
Identification
In eq. (38), not all J of the $\alpha_j$ are identifiable. To remove the indeterminacy, $\alpha_l$ is set to the zero vector, where l is the base outcome. That fixes the lth latent variable to zero so that the remaining variables measure the attractiveness of the other alternatives relative to the base.

The ordered probit/logit model


There are some economic applications where a simple binary choice model is inappropriate or over-simplistic. Consider, for example, an economic model which seeks to examine not only the choice of labour force participation, but also whether a participant chooses to work part-time or full-time. Another example is the responses to a questionnaire which elicits a series of satisfaction rankings from taste surveys.

Each of the above examples involves more than two possible outcomes. A model which seeks to explain such outcomes must therefore involve a dependent variable with more than two possible observed values.

The application of ordered probit/logit models to this problem exploits the fact that the dependent variable outcomes have a natural (ordinal) ranking. In other words, the responses can be ordered in some meaningful fashion. The major advantage of these models is that, by exploiting such a feature, the resulting model is relatively easy to estimate. The down-side is that the behavioural model may be considered too restrictive.
To explain the structure of the model, consider again a sample of data $\{y_i, x_i\}$ of size n drawn independently from some population, where now the dependent variable $y_i$ has M possible outcomes $y_i = 1,\dots,M$ with a natural ordering (that is, m+1 is in some sense ‘better’ than m). The observed values are assumed to derive from some unobservable latent variable $y_i^*$ where, as with the binary choice models,

$$y_i^* = x_i'\beta + u_i, \quad i = 1,\dots,n$$

for some k×1 parameter vector $\beta$ and (univariate) stochastic disturbance term $u_i$. The M outcomes for the observed variable $y_i$ are assumed to be related to the latent variable through the following observability criterion:

$$y_i = m \quad \text{if} \quad \theta_{m-1} < y_i^* \le \theta_m, \quad m = 1,\dots,M,$$

for a set of parameters $\theta_0$ to $\theta_M$, with

$$\theta_0 < \theta_1 < \theta_2 < \dots < \theta_M, \quad \theta_0 = -\infty \ \text{and} \ \theta_M = +\infty.$$
Then, the conditional probability of observing the mth category (i.e. $y_i = m$) can be written as:

$$\begin{aligned} \Pr(y_i = m \mid x_i) &= \Pr(\theta_{m-1} < y_i^* \le \theta_m) \\ &= \Pr(\theta_{m-1} < x_i'\beta + u_i \le \theta_m) \end{aligned}$$

Rearranging to isolate $u_i$, we have

$$\begin{aligned} \Pr(y_i = m \mid x_i) &= \Pr(\theta_{m-1} - x_i'\beta < u_i \le \theta_m - x_i'\beta) \\ &= \Pr(u_i \le \theta_m - x_i'\beta) - \Pr(u_i \le \theta_{m-1} - x_i'\beta) \end{aligned}$$

for m = 1,…,M. We can now see the importance of the relationship between the $\theta$’s (their strict ordering) in order for this probability to be strictly positive. To evaluate the conditional probability, we are required to make some distributional assumption for the disturbance term $u_i$.
The two most popular options are to assume a standard normal distribution for $u_i$ (yielding the ordered probit model) or a logistic distribution (yielding the ordered logit model).

Taking the ordered probit first, by assuming that $u_i \sim N(0,1)$, the conditional probabilities can be derived as

$$\Pr(y_i = m \mid x_i) = \Phi(\theta_m - x_i'\beta) - \Phi(\theta_{m-1} - x_i'\beta)$$
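
A short sketch of these ordered probit probabilities, written as differences of the standard normal CDF at successive thresholds, is given below. The thresholds (called `cuts` here), the index value and the function names are illustrative assumptions.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordered_probit_probs(xb, cuts):
    """Probabilities of the M ordered outcomes.

    xb:   the index x_i'beta for one observation
    cuts: the M-1 interior thresholds in increasing order; the outer
          thresholds are fixed at -infinity and +infinity
    """
    theta = [-math.inf] + list(cuts) + [math.inf]
    return [Phi(theta[m] - xb) - Phi(theta[m - 1] - xb)
            for m in range(1, len(theta))]

# Hypothetical index and thresholds for M = 3 outcomes.
probs = ordered_probit_probs(xb=0.3, cuts=[-0.5, 0.8])
print(probs, sum(probs))
```

Because the thresholds are strictly increasing, every probability is positive, and the differences telescope so that the M probabilities sum to one.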

One can evaluate this probability for any combination of parameters $\{\theta, \beta\}$.

In order to isolate those parameters which best suit a sample of data, we again exploit ML techniques. Recall that in essence we are looking for an expression for the observed state probabilities, aggregated over a sample of data.

To derive the likelihood function for the ordered probit model, define M selection variables $z_{im} = 1(y_i = m)$, for m = 1,…,M. Then, the likelihood contribution for the ith observation in the sample can be written as

$$\begin{aligned} l_i &= \prod_{m=1}^{M} \Pr(y_i = m \mid x_i)^{z_{im}} \\ &= \prod_{m=1}^{M} \left[\Phi(\theta_m - x_i'\beta) - \Phi(\theta_{m-1} - x_i'\beta)\right]^{z_{im}} \end{aligned}$$

and the likelihood becomes

$$L = \prod_{i=1}^{n} \prod_{m=1}^{M} \left[\Phi(\theta_m - x_i'\beta) - \Phi(\theta_{m-1} - x_i'\beta)\right]^{z_{im}}$$

Finally, taking logarithms, we come to the log-likelihood function which can be used to estimate the ML coefficients:

$$\ln L = \sum_{i=1}^{n} \sum_{m=1}^{M} z_{im} \ln\left[\Phi(\theta_m - x_i'\beta) - \Phi(\theta_{m-1} - x_i'\beta)\right].$$
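
The log-likelihood can be evaluated directly for given parameter values, as in the sketch below. Since the selection variable is 1 only for the observed category, the inner sum collapses to a single term per observation. The tiny data set, the parameter values and the function names are invented for illustration; a real application would hand this function to a numerical optimiser.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordered_probit_loglik(beta, cuts, X, y):
    """Ordered probit log-likelihood.

    beta: coefficient vector; cuts: M-1 increasing interior thresholds
    X:    list of regressor vectors; y: observed categories coded 1..M
    """
    theta = [-math.inf] + list(cuts) + [math.inf]
    ll = 0.0
    for xi, yi in zip(X, y):
        xb = sum(b * x for b, x in zip(beta, xi))
        # only the observed category contributes (z_im = 1 for m = y_i)
        p = Phi(theta[yi] - xb) - Phi(theta[yi - 1] - xb)
        ll += math.log(p)
    return ll

# Made-up sample: one regressor, three ordered outcomes.
X = [(0.2,), (1.5,), (-0.7,), (0.9,)]
y = [1, 3, 1, 2]

ll_a = ordered_probit_loglik((1.0,), (0.0, 1.0), X, y)
ll_b = ordered_probit_loglik((-1.0,), (0.0, 1.0), X, y)
print(ll_a, ll_b)
```

In this invented sample higher values of the regressor go with higher categories, so the positive slope yields the larger log-likelihood of the two parameter vectors tried.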
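Any gradient-based ML solution routine will be able to calculate those parameters $\{\tilde\theta, \tilde\beta\}$ which maximise the sample likelihood, satisfying the following FOCs:

$$\frac{\partial \ln L}{\partial \beta} = 0 \quad \text{and} \quad \frac{\partial \ln L}{\partial \theta} = 0.$$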

Nested logit model [Wooldridge (2002); Greene (2003, pp. 725-727); Maddala (1983, pp. 67-70); Cameron and Trivedi (2006, pp. 508-512); Handbook of Econometrics]

A different approach to relaxing IIA is to specify a hierarchical model. The most popular of these is the nested logit (nlogit) model (see footnote 3). This model is based on the generalised extreme value (GEV) distribution and it generalises the mlogit model to a nested multinomial logit model.

3. See McFadden (1984) Econometric analysis of qualitative response models, in Handbook of Econometrics, vol. 2, ed. Z. Griliches and M. D. Intriligator. Amsterdam: North Holland, 1395-1457.

Suppose that i = first-level decision (limb) and j = bottom-level decision (branch). The decision tree structure is as follows:

Root
Limb 1 ….. Limb i ….. Limb I
Branch 1 …. Branch J_1 … Branch j … Branch 1 …. Branch J_I

The utility for the alternative in the ith of I limbs and the jth of $J_i$ branches is

$$U_{ij} = V_{ij} + \varepsilon_{ij}, \quad j = 1,2,\dots,J_i; \ i = 1,2,\dots,I$$


The nlogit model of McFadden (1978) (see footnote 4) arises when the error terms $\varepsilon_{ij}$ have the GEV joint cumulative distribution function

$$F(\varepsilon) = \exp\left[-G\left(e^{-\varepsilon_{11}},\dots,e^{-\varepsilon_{1J_1}};\dots;e^{-\varepsilon_{I1}},\dots,e^{-\varepsilon_{IJ_I}}\right)\right]$$

for the following particular specification of the function G(.):

$$G(Y) = G(Y_{11},\dots,Y_{1J_1},\dots,Y_{I1},\dots,Y_{IJ_I}) = \sum_{i=1}^{I} \left( \sum_{j=1}^{J_i} Y_{ij}^{1/\tau_i} \right)^{\tau_i}$$

4. McFadden (1978) Modelling the choice of residential location, in Spatial Interaction Theory and Planning Models, 75-96, A. Karlquist, L. Lundquist, F. Sinckars and W. Weibull et al. (Eds.), Amsterdam, New York, North-Holland.

The parameter $\tau_i$ is a function of the correlation between $\varepsilon_{ij}$ and $\varepsilon_{ik}$ (two alternatives in the same limb) but does not exactly equal the correlation parameter.

In fact

$$\tau_i = \sqrt{1 - \mathrm{Cor}[\varepsilon_{ij}, \varepsilon_{ik}]}$$

so $\tau_i$ is inversely related to the correlation and we expect $0 \le \tau_i \le 1$. The choice $\tau_i = 1$ corresponds to independence of $\varepsilon_{ij}$ and $\varepsilon_{ik}$ and leads to the mlogit model. We call the parameters $\tau_i$ the scale parameters.
The outcome indicator variables $y_{ij}$ equal 1 if alternative ij is chosen and 0 otherwise.

$$p_{ij} = \Pr[y_{ij} = 1] = \Pr[U_{ij} \ge U_{kl}, \ \forall k, l]$$

Closed-form solutions for this probability can be derived as a function of $V_{ij}$ and $\tau_i$ (see Cameron and Trivedi, 2006; p.526).

These are evaluated for the particular deterministic utility function

$$V_{ij} = z_i'\alpha + x_{ij}'\beta_i, \quad j = 1,\dots,J_i, \ i = 1,\dots,I,$$

where $z_i$ varies over limbs only and $x_{ij}$ varies over both limbs and branches. $\alpha$ and $\beta_i$ are the regression parameters.

In a nested logit model, alternatives are grouped into subgroups such that the IIA assumption is valid only within each group. In other words, the errors in a random utility model are permitted to be correlated for each option within the groups, but they are uncorrelated across groups.

Terminologies (STATA manual)

Level or decision level

The level or state at which a decision is made.
First-level decisions are made first, followed by
second-level decisions, and so on.

Bottom level

The level where the final decision is made.

Alternative set

The set of all possible alternatives at any given decision level.

Bottom alternative set

The set of all possible alternatives at the bottom level. This is often referred to as the choice set in the economics-choice literature.

Alternative

A specific alternative within an alternative set. Not all alternatives (within an alternative set) are available to someone making a choice at a specific stage, only those that are nested within all prior decisions.

Chosen alternative

The alternative picked by the decision-maker from an alternative set.

The model

First, we illustrate the basic approach of the model where there are only two levels in the hierarchy. Then we extend our exposition to the three-level nlogit model. McFadden (1977, 1981) (see footnote 5) showed how the nlogit can be derived from a rational choice framework (see footnote 6).

Amemiya (1985) shows how the model can be derived under the assumption of utility maximisation.

Two-level nlogit model


We index the first level alternative as i and the
bottom-level alternative as j.

5. McFadden, D. (1977) Quantitative methods for analysing behaviour of individuals: some recent developments. Cowles Foundation Discussion Paper no. 474. McFadden, D. (1981) Econometric models of probabilistic choice. In Structural Analysis of Discrete Data with Econometric Applications, pp. 198-272. Cambridge, MA: MIT Press.
6. Amemiya (1985) Advanced Econometrics, Cambridge, MA: Harvard University Press.

Let $x_{ij}$ and $y_i$ refer to the vectors of explanatory variables specific to categories (i, j) and (i), respectively.

We write the joint probability of being on limb i and branch j as

$$\Pr_{ij} = \Pr_{j|i} \Pr_i$$

which is the product of the probability of choosing branch j conditional on being on limb i and the probability of choosing limb i.
The conditional probability $\Pr_{j|i}$ will involve only the parameters $\beta$:

$$\Pr_{j|i} = \frac{\exp(x_{ij}\beta)}{\sum_k \exp(x_{ik}\beta)}$$

We define the inclusive value (see footnote 7), or log-sum, for category (i) as

$$I_i = \ln\left\{ \sum_k \exp(x_{ik}\beta) \right\}$$

then

$$\Pr_i = \frac{\exp(y_i\alpha + \tau_i I_i)}{\sum_m \exp(y_m\alpha + \tau_m I_m)}$$

7. The nlogit command in STATA 9 allows you to apply linear constraints on the inclusive-value parameters. You can constrain inclusive-value parameters to, say, be equal to each other, or specify fixed values rather than allowing these parameters to be freely estimated. To estimate the model, you need to specify and display the tree structure of the nested logit model using nlogitgen and nlogittree. The former helps you to generate a categorical variable that identifies the first-level set of alternatives.

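
The two-level probabilities can be assembled step by step: conditional branch probabilities, inclusive values, limb probabilities, and their product. The sketch below uses scalar regressors and invented values for the coefficients and inclusive-value parameters; the function name `nlogit_two_level` is an assumption of this example and is unrelated to STATA's nlogit command.

```python
import math

def nlogit_two_level(x, y, beta, alpha, tau):
    """Joint probabilities Pr_ij = Pr_{j|i} * Pr_i for a two-level nested logit.

    x[i][j]: scalar attribute of branch j in limb i
    y[i]:    scalar attribute of limb i
    tau[i]:  inclusive-value parameter of limb i
    """
    # Inclusive values: I_i = ln sum_k exp(x_ik * beta)
    incl = [math.log(sum(math.exp(xik * beta) for xik in xi)) for xi in x]
    limb_w = [math.exp(y[i] * alpha + tau[i] * incl[i]) for i in range(len(x))]
    limb_total = sum(limb_w)
    probs = []
    for i, xi in enumerate(x):
        pr_limb = limb_w[i] / limb_total
        denom = sum(math.exp(xik * beta) for xik in xi)
        probs.append([pr_limb * math.exp(xij * beta) / denom for xij in xi])
    return probs

# Hypothetical tree: limb 0 = car (one branch), limb 1 = bus (blue, red).
x = [[0.5], [1.0, 1.2]]
y = [0.0, 0.3]

nested = nlogit_two_level(x, y, beta=0.8, alpha=1.0, tau=[1.0, 0.5])
flat = nlogit_two_level(x, y, beta=0.8, alpha=1.0, tau=[1.0, 1.0])
print(nested, flat)
```

With every inclusive-value parameter set to 1 the inclusive value cancels against the denominator of the conditional probability, and the joint probability collapses to the clogit form $\exp(x_{ij}\beta + y_i\alpha)$ divided by the sum of that expression over all alternatives.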
Three-level nlogit model
Following Greene (2003), we index the first-level alternatives as i, the second-level alternatives as j and the bottom-level alternatives as k.
Let $x_{ijk}$, $y_{ij}$ and $z_i$ refer to row vectors of explanatory variables specific to categories (i, j, k), (i, j) and (i), respectively. Then

$$\Pr_{ijk} = \Pr_{k|ij} \Pr_{j|i} \Pr_i$$

Like the two-level model, the conditional probability $\Pr_{k|ij}$ will involve only the parameters $\beta$:

$$\Pr_{k|ij} = \frac{\exp(x_{ijk}\beta)}{\sum_n \exp(x_{ijn}\beta)}$$

We then define the inclusive values for category (i, j) as

$$I_{ij} = \ln\left\{ \sum_n \exp(x_{ijn}\beta) \right\}$$

and

$$\Pr_{j|i} = \frac{\exp(y_{ij}\alpha + \tau_{ij} I_{ij})}{\sum_m \exp(y_{im}\alpha + \tau_{im} I_{im})}$$

We also define inclusive values for category (i) as

$$J_i = \ln\left\{ \sum_m \exp(y_{im}\alpha + \tau_{im} I_{im}) \right\}$$

then

$$\Pr_i = \frac{\exp(z_i\gamma + \delta_i J_i)}{\sum_l \exp(z_l\gamma + \delta_l J_l)}$$
If we restrict all the $\tau_{ij}$ and $\delta_i$ to be 1, we recover the clogit model of the following form:

$$\Pr_{ijk} = \frac{\exp(V_{ijk})}{\sum_l \sum_m \sum_n \exp(V_{lmn})}$$

where

$$V_{ijk} = x_{ijk}\beta + y_{ij}\alpha + z_i\gamma$$

Estimation and the log likelihood of the Nlogit model

There are two ways of estimating the nlogit model: sequential estimation and FIML (full-information maximum likelihood) (see footnote 8).
If g = 1,2,…,G denotes the groups, and $\Pr^g_{ijk}$ is the probability of category (i, j, k) being a positive outcome in group g, the log likelihood of the nlogit model is

$$\begin{aligned} \ln L &= \sum_g \ln\left(\Pr^g_{ijk}\right) \\ &= \sum_g \left( \ln \Pr^g_{k|ij} + \ln \Pr^g_{j|i} + \ln \Pr^g_i \right) \end{aligned}$$

8. Note that STATA 9 uses FIML to fit the model.



g

The sequential estimator is less efficient than the FIML estimator. At the second stage the usual clogit standard errors understate the true standard errors of the sequential estimator, since they do not allow for the estimation error in computing the inclusive value.

McFadden (1981) gives the formula for the correct standard errors; alternatively, the bootstrap or the delta method can be used.
