
Econometrics 2, Fall 2005

Likelihood Analysis of
Binary Choice Models
Heino Bohn Nielsen


Introduction
In some situations, the object of interest is a binary variable:
Do married women have a paid job or not?
Which families own their own car?
Which students will pass the exam?

The binary variable is usually coded as
$$ y_i = \begin{cases} 1 & \text{if the woman works} \\ 0 & \text{if she does not.} \end{cases} $$

We want to explain $y_i$ ($i = 1, 2, \ldots, N$) given $K$ characteristics in $x_i$.

Applications often arise in micro-econometrics: explaining the choices of individuals and firms.


Outline of the Lecture


(1) Interpretation of a linear regression model, $y_i = x_i'\beta + \varepsilon_i$, where $y_i$ is binary.
(2) Alternative approaches: binary choice models.
(3) Motivation: a latent variable model.
(4) Likelihood analysis.
(5) Measuring the goodness-of-fit.
(6) Empirical example: teaching economics.


The Simple Linear Probability Model


Consider a linear regression
$$ y_i = x_i'\beta + \varepsilon_i, \qquad i = 1, 2, \ldots, N, $$
where $y_i$ is binary, and $x_i$ is a $K \times 1$ vector of binary or quantitative characteristics.

We assume that $E[\varepsilon_i \mid x_i] = 0$, so that $E[y_i \mid x_i] = x_i'\beta$.

Since $y_i$ is binary, the conditional expectation can be found as
$$ x_i'\beta = E[y_i \mid x_i] = 1 \cdot \text{Prob}(y_i = 1 \mid x_i) + 0 \cdot \text{Prob}(y_i = 0 \mid x_i) = \text{Prob}(y_i = 1 \mid x_i). $$
Therefore $x_i'\beta$ is interpretable as the probability that $y_i = 1$.

The derivative, $\beta$, is interpretable as the change in the probability,
$$ \frac{\partial\, \text{Prob}(y_i = 1 \mid x_i)}{\partial x_i} = \beta, $$
and not as the change in $y_i$.

Complications...

(1) Since $x_i'\beta$ is a probability, it should be bounded:
$$ 0 \le x_i'\beta \le 1. $$
This requires that $x_i$ is bounded or that $\beta$ is restricted in some way.

(2) The error term is highly non-normal and heteroskedastic.
The error term, $\varepsilon_i$, can take two values (conditional on $x_i$):

    $y_i$    $\varepsilon_i$       With prob.
    1        $1 - x_i'\beta$       $x_i'\beta$
    0        $-x_i'\beta$          $1 - x_i'\beta$

The mean is given by
$$ E[\varepsilon_i \mid x_i] = (1 - x_i'\beta)\, x_i'\beta + (-x_i'\beta)(1 - x_i'\beta) = 0. $$
The variance is
$$ V[\varepsilon_i \mid x_i] = (1 - x_i'\beta)^2\, x_i'\beta + (-x_i'\beta)^2 (1 - x_i'\beta) = x_i'\beta\,(1 - x_i'\beta), $$
which depends on $x_i$ and $\beta$.
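As a quick numerical check of these two moments, here is a minimal simulation sketch in Python (the value of $x_i'\beta$ is a made-up example):

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.3                                  # p = x_i'beta for one individual (made up)
y = rng.binomial(1, p, size=1_000_000)   # binary outcomes y_i
eps = y - p                              # LPM error: eps_i = y_i - x_i'beta

print(eps.mean())   # ~ 0, matching E[eps_i | x_i] = 0
print(eps.var())    # ~ p*(1 - p) = 0.21, matching V[eps_i | x_i]
```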


Binary Choice Models


As an alternative, consider a class of models,
$$ \text{Prob}(y_i = 1 \mid x_i) = F(x_i'\beta), $$
for some link function $F(\cdot)$ that always lies between 0 and 1.

[Figure: $\text{Prob}(y_i = 1 \mid x_i)$ as an S-shaped function of $x_i$, compared with the linear prediction $y_i = x_i'\beta$.]

Choice of Link Functions


All distribution functions are valid as link functions. Common choices are:

(1) The normal distribution, leading to the so-called probit model:
$$ F(w) = \int_{-\infty}^{w} \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{t^2}{2} \right) dt. $$

(2) The logistic distribution, leading to the logit model:
$$ F(w) = \frac{e^w}{1 + e^w}. $$

[Figure: standardized density functions and distribution functions of the normal and logistic distributions.]
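The two links are easy to inspect numerically; a minimal sketch using scipy.stats (standard library calls, nothing specific to the lecture):

```python
import numpy as np
from scipy.stats import norm, logistic

w = np.linspace(-4, 4, 9)

print(norm.cdf(w))       # probit link: standard normal CDF
print(logistic.cdf(w))   # logit link: e^w / (1 + e^w)
```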

Interpretation of Coefficients

The signs of the coefficients are directly interpretable, but not the magnitudes.
To interpret the model we can look at the marginal effects or the slopes:
$$ \frac{\partial\, \text{Prob}(y_i = 1 \mid x_i)}{\partial x_{ik}} = \frac{\partial F(x_i'\beta)}{\partial x_{ik}} = F'(x_i'\beta)\, \beta_k. $$
They depend on the values of $x_i$. They are often evaluated at
(a) the sample mean, or
(b) a standard individual defined by characteristics $x$.

The marginal effects depend on $F'(x_i'\beta)$.

To have the same marginal effects in a logit and a probit model, the coefficients have to be different. Note that
$$ F'_{\text{probit}}(0) = \frac{1}{\sqrt{2\pi}} \approx 0.4, \qquad F'_{\text{logit}}(0) = \frac{\exp(0)}{(1 + \exp(0))^2} = 0.25, $$
so there is a scale difference of $0.4/0.25 = 1.6$.
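A minimal numerical sketch of these density values and of a marginal effect (the index $w$ and the coefficient are made-up values):

```python
import numpy as np
from scipy.stats import norm, logistic

print(norm.pdf(0))                     # F'_probit(0) = 1/sqrt(2*pi) ~ 0.3989
print(logistic.pdf(0))                 # F'_logit(0)  = 0.25
print(norm.pdf(0) / logistic.pdf(0))   # scale difference ~ 1.6

# Marginal effect of x_k at an index w = x'beta (made-up values)
beta_k, w = 0.5, 0.8
print(norm.pdf(w) * beta_k)       # probit slope
print(logistic.pdf(w) * beta_k)   # logit slope
```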

The derivatives are only approximations if $x_k$ is a dummy variable.

An alternative is to calculate the discrete effect of a specific change, $\Delta$:
$$ \text{Prob}(y_i = 1 \mid x) - \text{Prob}(y_i = 1 \mid x + \Delta) = F(x'\beta) - F((x + \Delta)'\beta). $$
Here $x$ could be the mean or a standard person.

For a specific variable, $x_1$, this could be
$$ F(\beta_0 + x_1\beta_1 + x_2\beta_2 + \ldots + x_k\beta_k) - F(\beta_0 + (x_1 + 1)\beta_1 + x_2\beta_2 + \ldots + x_k\beta_k), $$
e.g. a dummy taking the values $x_1 = 0$ versus $x_1 = 1$.
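A sketch of such a discrete effect, assuming a logit link and made-up coefficients:

```python
import numpy as np
from scipy.stats import logistic

beta = np.array([-1.0, 1.5, 0.3])   # made-up: intercept, dummy x1, regressor x2

x_off = np.array([1.0, 0.0, 2.0])   # standard person with x1 = 0
x_on = np.array([1.0, 1.0, 2.0])    # the same person with x1 = 1

# Discrete change in Prob(y = 1) from switching the dummy on
print(logistic.cdf(x_on @ beta) - logistic.cdf(x_off @ beta))
```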


Underlying Latent Model


A particular way to motivate the binary choice model is the so-called latent variable representation.
Consider a woman, $i$, and consider her decision to work or not to work.
Assume that the utility gain from starting to work, $y_i^*$, is a function of observed characteristics, $x_i$, and unobserved characteristics, $\varepsilon_i$, i.e.
$$ y_i^* = x_i'\beta + \varepsilon_i. $$
Note that $y_i^*$ is unobservable. It is called a latent variable.
Assume that the distribution of $\varepsilon_i \mid x_i$ is symmetric, i.e. $F(-\tau) = 1 - F(\tau)$.

We can always normalize so that she goes to work if $y_i^* > 0$.


We only observe the decision:
$$ y_i = \begin{cases} 1 & \text{if she goes to work} \\ 0 & \text{otherwise.} \end{cases} $$

We can calculate the probability:
$$
\begin{aligned}
\text{Prob}(y_i = 1 \mid x_i) &= \text{Prob}(y_i^* > 0 \mid x_i) \\
&= \text{Prob}(x_i'\beta + \varepsilon_i > 0 \mid x_i) \\
&= \text{Prob}(\varepsilon_i > -x_i'\beta \mid x_i) \\
&= 1 - \text{Prob}(\varepsilon_i \le -x_i'\beta \mid x_i) \\
&= \text{Prob}(\varepsilon_i \le x_i'\beta \mid x_i) \quad \text{(by symmetry)} \\
&= F(x_i'\beta).
\end{aligned}
$$

The binary choice model can therefore be written in the latent variable representation. For the probit model that is
$$ y_i^* = x_i'\beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, 1), \qquad y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0. \end{cases} $$
For the logit model, $N(0, 1)$ is replaced by a standard logistic distribution.
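A minimal simulation sketch of this latent-variable mechanism for the probit case (one regressor, made-up parameters):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

n = 200_000
beta = np.array([0.5, 1.0])                       # made-up (intercept, slope)
x = np.column_stack([np.ones(n), rng.normal(size=n)])

y_star = x @ beta + rng.normal(size=n)            # latent utility, eps ~ N(0, 1)
y = (y_star > 0).astype(int)                      # observed binary decision

# For individuals with regressor near 0, the model implies Prob(y = 1) = F(0.5)
near_zero = np.abs(x[:, 1]) < 0.05
print(y[near_zero].mean())   # empirical frequency
print(norm.cdf(0.5))         # probit probability ~ 0.691
```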


Likelihood Analysis

We have a fully specified model, and a natural estimator is maximum likelihood.
The density for the binary variable is the probability
$$ f(y_i \mid x_i; \beta) = \text{Prob}(y_i = 1 \mid x_i; \beta)^{y_i}\, \text{Prob}(y_i = 0 \mid x_i; \beta)^{1 - y_i}. $$
Assuming that the individuals are independent, we have the likelihood function
$$ L(\beta) = f(y_1, \ldots, y_N \mid \beta) = \prod_{i=1}^{N} \text{Prob}(y_i = 1 \mid x_i; \beta)^{y_i}\, \text{Prob}(y_i = 0 \mid x_i; \beta)^{1 - y_i} = \prod_{i=1}^{N} F(x_i'\beta)^{y_i} \left( 1 - F(x_i'\beta) \right)^{1 - y_i}. $$
The log-likelihood function is given by
$$ \log L(\beta) = \sum_{i=1}^{N} \left[ y_i \log\!\left( F(x_i'\beta) \right) + (1 - y_i) \log\!\left( 1 - F(x_i'\beta) \right) \right]. $$

The f.o.c. is $s(\beta) = 0$, but the likelihood equations have no analytical solution.
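Because there is no closed form, the log-likelihood is maximized numerically. A minimal logit sketch with a generic optimizer (simulated data, illustrative names):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import logistic

rng = np.random.default_rng(2)

# Simulate from a logit model with made-up beta = (-1, 2)
n = 5_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-1.0, 2.0])
y = (rng.uniform(size=n) < logistic.cdf(X @ beta_true)).astype(int)

def neg_loglik(beta):
    # -log L(beta) = -sum[ y log F(x'b) + (1 - y) log(1 - F(x'b)) ]
    p = np.clip(logistic.cdf(X @ beta), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(res.x)   # ML estimates, close to beta_true in large samples
```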


Measuring Goodness-of-fit

The idea of $R^2$ is to compare with a model with only a constant. Let
$$ \log L_1 = \text{the maximized log-likelihood of the model}, $$
$$ \log L_0 = \text{the maximized value in a model with only a constant term}. $$
Two possible goodness-of-fit measures are
$$ \text{McFadden } R^2 = 1 - \frac{\log L_1}{\log L_0}, \qquad \text{Pseudo } R^2 = 1 - \frac{1}{1 + 2\,(\log L_1 - \log L_0)/N}. $$

Note that in the model with only a constant, $\text{Prob}(y_i = 1) = p$.
This is a simple binomial model, and $\hat{p}_{ML} = N_1/N$, where $N_1 = \sum y_i$.
In this case $\log L_0$ is given by
$$ \log L_0 = N_1 \log\!\left( \frac{N_1}{N} \right) + (N - N_1) \log\!\left( 1 - \frac{N_1}{N} \right). $$
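A minimal check in Python, using the numbers from the empirical example below ($N = 32$, $N_1 = 11$, and the logit value $\log L_1 = -12.890$):

```python
import numpy as np

N, N1 = 32, 11
logL1 = -12.890   # maximized logit log-likelihood (empirical example below)

logL0 = N1 * np.log(N1 / N) + (N - N1) * np.log(1 - N1 / N)

print(logL0)                                   # ~ -20.592
print(1 - logL1 / logL0)                       # McFadden R^2 ~ 0.374
print(1 - 1 / (1 + 2 * (logL1 - logL0) / N))   # Pseudo R^2   ~ 0.325
```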

Predicted Values

We can construct predictions from the probabilities:
$$ \hat{y}_i = 1 \text{ if } F(x_i'\hat\beta) > 1/2, \qquad \hat{y}_i = 0 \text{ if } F(x_i'\hat\beta) \le 1/2. $$

To measure the goodness-of-fit we can compare $y_i$ and $\hat{y}_i$ in a simple cross table:

                 $\hat{y}_i = 0$    $\hat{y}_i = 1$
    $y_i = 0$    $n_{00}$           $n_{01}$           $N_0$
    $y_i = 1$    $n_{10}$           $n_{11}$           $N_1$
                 $n_0$              $n_1$              $N$

Let $p = N_1/N > 1/2$.
Then a simple model saying $\hat{y}_i = 1$ for all $i$ will have $N_1/N$ correct predictions.
To beat that we should have at least
$$ \frac{n_{00} + n_{11}}{N} > \frac{N_1}{N}. $$
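A sketch of the cross table and hit rate, assuming a vector of outcomes and fitted probabilities $F(x_i'\hat\beta)$ are available (the arrays are made-up illustrations):

```python
import numpy as np

y = np.array([0, 0, 1, 1, 0, 1, 0, 1])                       # actual outcomes (made up)
p_hat = np.array([0.2, 0.6, 0.8, 0.4, 0.1, 0.9, 0.3, 0.7])   # fitted F(x'beta)

y_hat = (p_hat > 0.5).astype(int)

# 2x2 cross table with rows y_i and columns y_hat_i
table = np.zeros((2, 2), dtype=int)
for actual, pred in zip(y, y_hat):
    table[actual, pred] += 1
print(table)

print((table[0, 0] + table[1, 1]) / len(y))   # (n00 + n11)/N, the hit rate
print(max(y.mean(), 1 - y.mean()))            # naive benchmark: predict the majority
```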


Empirical Example: Teaching Economics!

We analyze whether a new way to teach economics (called PSI) is effective or not.
We have data for 32 classes, of which 14 were exposed to PSI.
The data set includes:

    Variable   Description                                          Mean
    GRADE      Whether the student's grades improved, (0/1).        0.344
    GPA        Grade point average.                                 3.117
    TUCE       Pretest score (entering knowledge of material).      21.938
    PSI        Whether the class was exposed to PSI, (0/1).         0.438

We want to estimate a binary choice model, i.e.
$$ \text{Prob}(\text{GRADE}_i = 1 \mid x_i) = F(x_i'\beta), $$
where $F(\cdot)$ is a logit or probit function and $x_i = (1, \text{GPA}_i, \text{TUCE}_i, \text{PSI}_i)'$.
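These variables match the classic Spector and Mazzeo (1980) data that ships with statsmodels; assuming it is the same data set, a minimal estimation sketch is:

```python
import statsmodels.api as sm

# Spector & Mazzeo (1980): GRADE, GPA, TUCE, PSI for 32 observations
data = sm.datasets.spector.load_pandas().data
X = sm.add_constant(data[["GPA", "TUCE", "PSI"]])
y = data["GRADE"]

logit_res = sm.Logit(y, X).fit()
probit_res = sm.Probit(y, X).fit()

print(logit_res.params)    # logit coefficients
print(probit_res.params)   # probit coefficients
print(logit_res.llf)       # maximized log-likelihood, ~ -12.89
```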


Results of Linear Regression, Logit, and Probit

                     Linear                Logit                 Probit
    Variable         Coef.     Deriv.     Coef.      Deriv.     Coef.     Deriv.
    Constant         -1.498    ...        -13.021    ...        -7.452    ...
                     (-2.75)              (-2.64)               (-2.90)
    GPA              0.464     0.464      2.826      0.534      1.626     0.533
                     (2.78)               (2.24)                (2.36)
    TUCE             0.010     0.010      0.095      0.018      0.052     0.017
                     (0.55)               (0.67)                (0.64)
    PSI              0.379     0.379      2.379      0.449      1.426     0.468
                     (2.38)               (2.23)                (2.43)
    Log-likelihood   -12.978              -12.8896              -12.8188
    N                32                   32                    32
    Pseudo R^2       ...                  0.325                 0.327
    McFadden R^2     ...                  0.374                 0.377
    (t-ratios in parentheses)

Note that the results are very close in terms of the derivatives.
Note the scale difference between the logit and probit coefficients, e.g. for PSI:
$$ \frac{\hat\beta_{\text{Logit}}}{\hat\beta_{\text{Probit}}} = \frac{2.379}{1.426} = 1.668. $$
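The derivative columns can be reproduced as marginal effects at the sample means; a sketch continuing the statsmodels example above (get_margeff is the standard statsmodels call for this):

```python
import statsmodels.api as sm

data = sm.datasets.spector.load_pandas().data
X = sm.add_constant(data[["GPA", "TUCE", "PSI"]])
logit_res = sm.Logit(data["GRADE"], X).fit(disp=0)
probit_res = sm.Probit(data["GRADE"], X).fit(disp=0)

# Marginal effects evaluated at the sample means, as in the derivative columns
print(logit_res.get_margeff(at="mean").summary())
print(probit_res.get_margeff(at="mean").summary())

# Scale difference for PSI
print(logit_res.params["PSI"] / probit_res.params["PSI"])   # ~ 1.67
```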


Results for 3 Logit Estimations

                     Model (A)             Model (B)             Model (C)
    Variable         Coef.     t-ratio     Coef.     t-ratio     Coef.     t-ratio
    Constant         -13.021   -2.64       -10.656   -2.63       -0.647    -1.74
    GPA              2.826     2.24        2.538     2.15        ...       ...
    TUCE             0.095     0.67        0.086     0.64        ...       ...
    PSI              2.379     2.23        ...       ...         ...       ...
    Log-likelihood   -12.890               -15.991               -20.592
    N                32                    32                    32
    Pseudo R^2       0.325                 0.223                 ...
    McFadden R^2     0.374                 0.223                 ...

PSI is significant.
The t-ratio (Wald test) is 2.23, with a p-value of 0.025 in $N(0, 1)$.
The likelihood ratio test is
$$ LR = -2 \left( -15.991 - (-12.890) \right) = 6.202, $$
with a p-value of 0.013 in a $\chi^2(1)$.
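A quick check of both tests in Python (chi2.sf and norm.sf are standard scipy.stats calls):

```python
from scipy.stats import chi2, norm

logL_A, logL_B = -12.890, -15.991   # with and without PSI

LR = -2 * (logL_B - logL_A)
print(LR)                  # 6.202
print(chi2.sf(LR, df=1))   # p-value ~ 0.013

print(2 * norm.sf(2.23))   # two-sided Wald p-value ~ 0.026
```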


To see the failure rate, look at a cross tabulation:

                 $\hat{y}_i = 0$    $\hat{y}_i = 1$    Total
    $y_i = 0$    18                 3                  21
    $y_i = 1$    3                  8                  11
                                                       32

A naive model ($\hat{y}_i = 0$ for all $i$) would have $21/32 = 0.656$ correct predictions.
The logit model has $26/32 = 0.813$.

To find the numerical effect of PSI we look at an average person, i.e. one with GPA = 3.117 and TUCE = 21.938.
The probabilities of an improvement in test scores are $F(x'\hat\beta)$:
$$ \text{Prob}(\text{GRADE} = 1 \mid \text{GPA} = 3.117, \text{TUCE} = 21.938, \text{PSI} = 0) = 0.1068, $$
$$ \text{Prob}(\text{GRADE} = 1 \mid \text{GPA} = 3.117, \text{TUCE} = 21.938, \text{PSI} = 1) = 0.5633. $$

The effect seems numerically important.
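These two probabilities can be reproduced from the Model (A) logit coefficients reported above (small rounding differences are expected, since the coefficients are printed to three decimals):

```python
import numpy as np
from scipy.stats import logistic

# Logit Model (A): constant, GPA, TUCE, PSI
beta = np.array([-13.021, 2.826, 0.095, 2.379])

x_psi0 = np.array([1.0, 3.117, 21.938, 0.0])   # average person, PSI = 0
x_psi1 = np.array([1.0, 3.117, 21.938, 1.0])   # average person, PSI = 1

print(logistic.cdf(x_psi0 @ beta))   # ~ 0.107
print(logistic.cdf(x_psi1 @ beta))   # ~ 0.562
```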
