
# Lecture 9: Logit/Probit

## Prof. Sharyn O'Halloran

Sustainable Development U9611
Econometrics II

## So far, we know how to handle linear estimation models of the type:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \varepsilon$$

Sometimes we transform the variables to get the equation to be linear (taking logs, for example).

Then we can run our estimation, do model checking, visualize results, etc.

## Nonlinear Estimation

In all these models Y, the dependent variable, was continuous. Independent variables could be dichotomous (dummy variables), but not the dependent variable.

This week we'll start our exploration of nonlinear estimation with dichotomous Y variables. These arise in many social science problems:

- Minority legislator elected: Yes/No
- Regime type: Autocratic/Democratic
- Involved in an armed conflict: Yes/No

## A link function connects the actual Y to the estimated Y in an econometric model

We have one example of this already: logs.

- Start with $Y = X\beta + \varepsilon$
- Then change to logs: $\log(Y) = X\beta + \varepsilon$ (different βs in the two equations)
- Run this like a regular OLS equation
- Then you have to back out the results

## If the coefficient on some particular X is β, then a 1-unit increase in X changes log(Y) by β

- That is, the new Y is $e^{\beta}$ times the old Y
- For small values of β, $e^{\beta} \approx 1 + \beta$, so this is almost the same as saying a β% increase in Y
- (This is why you should use natural log transformations rather than base-10 logs)

Since the log here plays the role of the link function, we can write this generally as

$$F(Y) = X\beta + \varepsilon$$
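A minimal Stata sketch of this interpretation, using the bundled auto dataset (price and weight are just illustrative variables, not from the lecture):

```stata
* Fit a log-linear model, then back out the multiplicative effect of X
sysuse auto, clear
generate lnprice = ln(price)
regress lnprice weight
* A 1-unit increase in weight multiplies price by exp(b);
* for small b, exp(b) - 1 is approximately b (roughly a 100*b% change)
display exp(_b[weight])
display exp(_b[weight]) - 1
```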

## How does this apply to situations with dichotomous dependent variables?

I.e., what happens when Y takes only the values 0 and 1?

First, let's look at what would happen if we tried to run this as a linear regression. As a specific example, take the election of minorities to the Georgia state legislature:

- Y = 0: Non-minority elected
- Y = 1: Minority elected

## Dichotomous Dependent Vars.

[Figure: scatter plot of Black Representative Elected (0/1) against Black Voting Age Population. The only values Y can take are 0 and 1.]

[Figure: the same scatter with a linear fit overlaid. Note that the fitted line goes below 0 and above 1.]

The line doesn't fit the data very well. And if we take values of Y between 0 and 1 to be probabilities, predictions below 0 or above 1 don't make sense.
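A sketch of this linear fit in Stata, assuming variables named black and bvap as in the probit output later in the lecture:

```stata
* Linear probability model: fitted values can stray outside [0,1]
regress black bvap
predict yhat, xb
summarize yhat    // note the minimum below 0 and/or maximum above 1
```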

## Redefining the Dependent Var.

How to solve this problem?

- We need to transform the dichotomous Y into a continuous variable Y′ ∈ (−∞, ∞)
- So we need a link function F(Y) that takes a dichotomous Y and gives us a continuous, real-valued Y′
- Then we can run $F(Y) = Y' = X\beta + \varepsilon$

[Diagram: Original Y → Y as a probability → ? → Y′ ∈ (−∞, ∞)]

## Redefining the Dependent Var.

What function F(Y) goes from the [0,1] interval to the real line? Well, we know at least one function that goes the other way around: the cumulative normal distribution Φ. That is, given any real value, Φ produces a number (probability) between 0 and 1.
## Redefining the Dependent Var.

So we would say that

$$Y = \Phi(X\beta + \varepsilon)$$
$$\Phi^{-1}(Y) = X\beta + \varepsilon$$
$$Y' = X\beta + \varepsilon$$

Then our link function is $F(Y) = \Phi^{-1}(Y)$. This is the probit model. The term was coined in the 1930s by biologists studying the dosage–cure rate link; it is short for "probability unit."
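Stata's invnormal() function is exactly this Φ⁻¹ link; a quick illustration:

```stata
display invnormal(.5)          // 0: a probability of .5 maps to z = 0
display invnormal(.975)        // about 1.96
display normal(invnormal(.3))  // the round trip recovers .3
```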

## Probit Estimation

After estimation, you can back out probabilities using the standard normal distribution.

[Figure: standard normal density over −4 to 4, shading Pr(Y=1) as the area to the left of Xβ.]

- Say that for a given observation, Xβ = −1. Then Pr(Y=1) = Φ(−1) ≈ .16, the area under the density to the left of −1
- Say instead that for a given observation, Xβ = 2. Then Pr(Y=1) = Φ(2) ≈ .98, the area to the left of 2
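In Stata, these probabilities come straight from normal(), the standard normal CDF:

```stata
display normal(-1)   // about .16
display normal(2)    // about .98
```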

## Probit Estimation

- In a probit model, the value of Xβ is taken to be the z-value of a normal distribution; higher values of Xβ mean that the event is more likely to happen
- Have to be careful about the interpretation of estimation results here: a one-unit change in xᵢ leads to a βᵢ change in the z-score of Y (more on this later)
- The estimated curve is an S-shaped cumulative normal distribution

## Probit Estimation

[Figure: estimated Probability of Electing a Black Rep. plotted against Black Voting Age Population; an S-shaped probit curve.]

- This fits the data much better than the linear estimation, and the predicted probability always lies between 0 and 1
- We can estimate, for instance, the BVAP at which Pr(Y=1) = 50%; this is the point of "equal opportunity"
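Pr(Y=1) = .5 exactly when the linear index is zero, so the crossover is −β₀/β₁. A sketch of backing it out in Stata, assuming the black/bvap probit shown later in the lecture:

```stata
probit black bvap
* Phi(b0 + b1*bvap) = .5 when b0 + b1*bvap = 0
nlcom -_b[_cons]/_b[bvap]    // BVAP at which Pr(Y=1) = 50%
```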

## Redefining the Dependent Var.

The probit is one function that takes us from {0,1} to the real line. We'll look at an alternative approach based on the odds ratio.

If some event occurs with probability p, then the odds of it happening are O(p) = p/(1−p):

- p = 0 ⇒ O(p) = 0
- p = 1/4 ⇒ O(p) = 1/3 (odds are 1-to-3 against)
- p = 1/2 ⇒ O(p) = 1 (even odds)
- p = 3/4 ⇒ O(p) = 3 (odds are 3-to-1 in favor)
- p = 1 ⇒ O(p) = ∞

## So taking the odds of Y occurring moves us from the [0,1] interval to the half-line [0, ∞)

The odds ratio is always non-negative. As a final step, then, take the log of the odds ratio, which moves us to the whole real line (−∞, ∞).

[Diagram: Original Y → Y as a probability → Odds of Y → Log-Odds of Y]

## Logit Function

This is called the logit function:

$$\operatorname{logit}(Y) = \log[O(Y)] = \log\!\left[\frac{Y}{1-Y}\right]$$

- At first, this was computationally easier than working with normal distributions
- It still has some nice properties that we'll investigate next time with multinomial dependent variables
- The density function associated with it is very close to a standard normal distribution
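Stata has both directions built in: logit() returns the log-odds and invlogit() goes back to a probability:

```stata
display logit(.75)            // log(.75/.25) = log(3), about 1.099
display invlogit(logit(.75))  // round trip recovers .75
```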

## Logit vs. Probit

[Figure: the logistic density overlaid on the standard normal density, from −4 to 4.]

The logit density is similar to the normal, but with slightly fatter tails.

## Logit Function

This translates back to the original Y as:

$$\log\frac{Y}{1-Y} = X\beta$$
$$\frac{Y}{1-Y} = e^{X\beta}$$
$$Y = (1-Y)\,e^{X\beta}$$
$$Y = e^{X\beta} - e^{X\beta}Y$$
$$Y + e^{X\beta}Y = e^{X\beta}$$
$$(1 + e^{X\beta})\,Y = e^{X\beta}$$
$$Y = \frac{e^{X\beta}}{1 + e^{X\beta}}$$

## Latent Variables

For the rest of the lecture we'll talk in terms of probits, but everything holds for logits too.

One way to state what's going on is to assume that there is a latent variable Y* such that

$$Y^* = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

A normal error term ε is what makes this a probit.

In a linear regression we would observe Y* directly. In probits, we observe only

$$y_i = \begin{cases} 0 & \text{if } y_i^* \le 0 \\ 1 & \text{if } y_i^* > 0 \end{cases}$$

(The cutoff of 0 here could be any constant.)

## Latent Variables

This translates to possible values for the error term:

$$y_i^* > 0 \;\iff\; x_i\beta + \varepsilon_i > 0 \;\iff\; \varepsilon_i > -x_i\beta$$

$$\Pr\!\big(y_i^* > 0 \mid x_i\big) = \Pr(y_i = 1 \mid x_i) = \Pr(\varepsilon_i > -x_i\beta) = \Pr\!\left(\frac{\varepsilon_i}{\sigma} > \frac{-x_i\beta}{\sigma}\right) = \Phi\!\left(\frac{x_i\beta}{\sigma}\right)$$

Similarly,

$$\Pr(y_i = 0 \mid x_i) = 1 - \Phi\!\left(\frac{x_i\beta}{\sigma}\right)$$

## Latent Variables

Look again at the expression for Pr(yᵢ = 1):

$$\Pr(y_i = 1 \mid x_i) = \Phi\!\left(\frac{x_i\beta}{\sigma}\right)$$

- We can't estimate both β and σ, since they enter the equation only as a ratio
- So we set σ = 1, making the distribution on ε a standard normal density (other normalizations would work as well)
- One (big) question is left: how do we actually estimate the values of the β coefficients here?

## Maximum Likelihood Estimation

For each observation yᵢ, we can plug in xᵢ and β to get Pr(yᵢ = 1) = Φ(xᵢβ). For example, say this probability works out to 0.8:

- If the actual observation was yᵢ = 1, we can say its likelihood (given β) is 0.8
- But if yᵢ = 0, then its likelihood was only 0.2

So for any given trial set of coefficients β, we can calculate the likelihood of each yᵢ. Then the likelihood of the entire sample is:

$$L(y_1) \cdot L(y_2) \cdot L(y_3) \cdots L(y_n) = \prod_{i=1}^{n} L(y_i)$$

Maximum likelihood estimation finds the βs that maximize this expression. Here's the same thing in visual form.

## Maximum Likelihood Estimation

[Figure: 0/1 observations plotted with the fitted curve P(y=1) = Φ(xβ), where y* = xβ + ε and the probit maps the real line into [0,1].]

- Given estimates of β, the distance from each observation yᵢ to the curve P(y=1) is 1 − L(yᵢ | β); the figure marks these distances, 1 − L(y₃) and 1 − L(y₉), for two sample points
- Changing β to some alternative β′ shifts the curve: it may bring the curve closer to some observations and further from others
- Remember, the object is to maximize the product of the likelihoods L(yᵢ | β)

For a single observation, for instance,

$$P(y_1 = 1) = P(y_1^* > 0) = P(\varepsilon_1 > -x_1\beta)$$

## Time Series

[Figure: binary (0/1) outcome series for Countries 1–4 over periods t = 1, 2, 3, …]

## Recall that a likelihood function is:

$$L(y_1) \cdot L(y_2) \cdot L(y_3) \cdots L(y_n) = \prod_{i=1}^{n} L(y_i) \equiv L$$

Since maximizing log(L) is the same as maximizing L, we can work with:

$$\log(L) = \log\!\left[\prod_{i=1}^{n} L(y_i)\right] = \sum_{i=1}^{n} \log[L(y_i)]$$
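As a check on this, the reported log-likelihood can be rebuilt by summing each observation's log-likelihood by hand; a sketch, assuming the black/bvap probit shown later:

```stata
probit black bvap
predict double p, pr
generate double ll_i = black*ln(p) + (1 - black)*ln(1 - p)
quietly summarize ll_i
display r(sum)    // matches e(ll), the log likelihood Stata reports
```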

## Let's see how this works on some simple examples

- Take a coin flip, so that yᵢ = 0 for tails and yᵢ = 1 for heads
- Say you toss the coin n times and get p heads; then the proportion of heads is p/n
- Since yᵢ is 1 for heads and 0 for tails, p/n is also the sample mean
- Intuitively, p/n should also be our estimate of the probability of heads

## If the true probability of heads for this coin is π, then the likelihood of observation yᵢ is:

$$L(y_i) = \begin{cases} \pi & \text{if } y_i = 1 \\ 1 - \pi & \text{if } y_i = 0 \end{cases} \;=\; \pi^{y_i}(1-\pi)^{1-y_i}$$

So we want to solve:

$$\max_{\pi}\; \log L = \log \prod_{i=1}^{n} \pi^{y_i}(1-\pi)^{1-y_i} = \sum_{i=1}^{n} \big[\, y_i \log(\pi) + (1-y_i)\log(1-\pi) \,\big]$$

## To maximize this, take the derivative with respect to π

$$\frac{d \log L}{d\pi} = \frac{d}{d\pi} \sum_{i=1}^{n} \big[\, y_i \log(\pi) + (1-y_i)\log(1-\pi) \,\big] = \sum_{i=1}^{n} \left[\, y_i \frac{1}{\pi} - (1-y_i)\frac{1}{1-\pi} \,\right]$$

## Finally, set this derivative to 0 and solve for π

$$\sum_{i=1}^{n} \left[\, \frac{y_i}{\pi} - \frac{1-y_i}{1-\pi} \,\right] = 0$$

$$\sum_{i=1}^{n} \big[\, y_i(1-\pi) - (1-y_i)\pi \,\big] = 0$$

$$\sum_{i=1}^{n} \big[\, y_i - y_i\pi - \pi + y_i\pi \,\big] = 0$$

$$n\pi = \sum_{i=1}^{n} y_i$$

$$\hat{\pi} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}$$

Magically, the value of π that maximizes the likelihood function is the sample mean, just as we thought.
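A quick simulated check of this result in Stata (the seed, sample size, and true π = .3 are arbitrary choices):

```stata
* The ML estimate of pi is just the sample mean of the 0/1 outcomes
clear
set seed 12345
set obs 1000
generate y = runiform() < .3   // coin flips with true pi = .3
summarize y                    // the sample mean...
probit y                       // ...matches the constant-only probit:
display normal(_b[_cons])      // implied Pr(y = 1)
```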

## Can do the same exercise for OLS regression

The set of coefficients that maximizes the likelihood also minimizes the sum of squared residuals, as before.

This works for logit/probit as well; in fact, it works for any estimation equation:

- Just write down the likelihood function L you're trying to maximize and the parameters β you can change
- Then search for the values of β that maximize L (we'll leave the searching to the computer)

Maximizing L can be computationally intense, but with today's computers it's usually not a big problem.

## This is what Stata does when you run a probit:

```
. probit black bvap

Iteration 0:   log likelihood = -735.15352
Iteration 1:   log likelihood = -292.89815
Iteration 2:   log likelihood = -221.90782
Iteration 3:   log likelihood = -202.46671
Iteration 4:   log likelihood = -198.94506
Iteration 5:   log likelihood = -198.78048
Iteration 6:   log likelihood = -198.78004

Probit estimates                                  Number of obs   =       1507
                                                  LR chi2(1)      =    1072.75
                                                  Prob > chi2     =     0.0000
Log likelihood = -198.78004                       Pseudo R2       =     0.7296

------------------------------------------------------------------------------
       black |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        bvap |     9.2316   .5446756    16.95   0.000       8.1641    10.2992
       _cons |    -4.7147     .27917   -16.89   0.000      -5.2619    -4.1676
------------------------------------------------------------------------------
```

The iteration log shows Stata maximizing the log-likelihood function, and both coefficients are significant.
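To turn the fitted index into predicted probabilities for each observation, follow the estimation with predict:

```stata
predict phat, pr   // phat = Phi(b0 + b1*bvap), the estimated Pr(black=1)
```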

## In linear regression, if the coefficient on x is β, then a 1-unit increase in x increases Y by β

But what exactly does it mean in probit that the coefficient on BVAP is 9.23 and significant?

- It means that a one-percentage-point (0.01) increase in BVAP raises the z-score of Pr(Y=1) by 0.0923
- And this coefficient is different from 0 at the 5% level

So raising BVAP has a constant effect on the z-score Y′. But this doesn't translate into a constant effect on the original Y, as the next two figures show.

[Figure: probit curve. Raising BVAP from .2 to .3 has little appreciable impact on Pr(Black Elected).]

[Figure: probit curve. But increasing BVAP from .5 to .6 does have a big impact on the probability.]
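A sketch of those two comparisons, using the coefficients stored right after the probit above:

```stata
* Change in Pr(Y=1) from BVAP .2 -> .3 versus .5 -> .6
display normal(_b[_cons] + _b[bvap]*.3) - normal(_b[_cons] + _b[bvap]*.2)
display normal(_b[_cons] + _b[bvap]*.6) - normal(_b[_cons] + _b[bvap]*.5)
```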

## So lesson 1 is that the marginal impact of changing a variable is not constant

Another way of saying the same thing: in the linear model,

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n, \quad \text{so} \quad \frac{\partial Y}{\partial x_i} = \beta_i$$

In the probit model,

$$Y = \Phi(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n), \quad \text{so} \quad \frac{\partial Y}{\partial x_i} = \beta_i\,\phi(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)$$

where φ is the standard normal density. This expression depends on not just βᵢ, but on the value of xᵢ and all the other variables in the equation. So to even calculate the impact of xᵢ on Y, you have to choose values for all the other variables xⱼ; typical choices are their means or their medians.

## Another approach is to fix the xⱼ and let xᵢ vary from its minimum to maximum values

Then you can plot how the marginal effect of xᵢ changes across its observed range of values, as in the sketch below.
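In recent versions of Stata, the margins command automates both approaches (older versions used dprobit for marginal effects); a sketch using the probit above:

```stata
probit black bvap
margins, dydx(bvap)                        // average marginal effect
margins, dydx(bvap) at(bvap=(.1(.1).9))    // effect across bvap's range
marginsplot                                // plot how the effect changes
```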

## Model voting for/against the incumbent as Probit(Y) = Xβ + ε, where

- x₁ᵢ = Constant
- x₂ᵢ = Party ID same as incumbent
- x₃ᵢ = National economic conditions
- x₄ᵢ = Personal financial situation
- x₅ᵢ = Can recall incumbent's name
- x₆ᵢ = Can recall challenger's name
- x₇ᵢ = Quality challenger

[Table: probit estimates, with significant coefficients.]

[Table: marginal effects. This backs out the marginal impact of a 1-unit change in each variable on the probability of voting for the incumbent; a few of the effects are especially big.]

## Or, calculate the impact of facing a quality challenger by hand, keeping all other variables at their median

- Without a quality challenger, the linear index is 1.52, so the probability of voting for the incumbent is Φ(1.52) = .936 (from the standard normal table)
- Facing a quality challenger lowers the index by .339, so the probability falls to Φ(1.52 − .339) = .881

So there's an increase of .936 − .881 = .055, about 5.5 percentage points, in votes for incumbents who avoid a quality challenger.
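The same arithmetic in Stata:

```stata
display normal(1.52)                          // about .936
display normal(1.52 - .339)                   // about .881
display normal(1.52) - normal(1.52 - .339)    // about .055
```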

## Model the probability that a bill is passed in the Senate (over a filibuster) based on:

- The size of the coalition preferring that the bill be passed
- Whether it is the end of the session
- An interactive term: size of coalition × end of session
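A sketch of this specification in Stata's factor-variable notation, with hypothetical variable names (pass, coalition, endsession):

```stata
* ## expands to both main effects plus their interaction
probit pass c.coalition##i.endsession
margins, dydx(coalition) at(endsession=(0 1))   // effect of coalition size by period
```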