
Semester II

Application of Econometric Theory with STATA

STATA Instruction Manual

By

TANIMA BANERJEE

UNIVERSITY OF CALCUTTA

Email id: tanima.bnrj@gmail.com


Unit 4: Limited dependent variable models – Logit and Probit model estimation

Logit and probit regressions are models in which the dependent variable is categorical, i.e.
qualitative. Here we consider only the case of a binary dependent variable, i.e. one that takes
only two values, 0 and 1. In economics, logit and probit models are often called qualitative
response or discrete choice models.

Let us consider the case of labour market entry:

yi = 1 if a person enters the labour market, and

yi = 0 if a person does not enter the labour market.

In a given group of individuals, we may want to find the probability that an individual in the
group enters the labour market, given a set of explanatory variables that determine this
probability. In other words, we want to estimate the probability, pi, that yi equals 1 given
factors such as age, gender, education level, etc.

To measure pi, the simplest idea is to let pi be a linear function of a set of covariates xi,
where xi represents the vector of covariates. Then we can write:

pi = xi′β          (1)

where β is a vector of regression coefficients. This model is called the linear probability
model, and we can estimate it by OLS. However, one major problem with this model is that the
left-hand side, pi, has to lie between 0 and 1, whereas the linear predictor on the right-hand
side, xi′β, can take any real value unless we impose complex restrictions on the coefficients.
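
The range problem is easy to see with numbers. The sketch below uses a made-up linear probability model (the intercept 0.20 and the age slope 0.03 are hypothetical values chosen for illustration, not estimates from emp.dta) and shows the fitted value leaving the unit interval:

```stata
* Hypothetical linear probability model: p = 0.20 + 0.03*age
* (coefficients invented for illustration only)
scalar b0 = 0.20
scalar b1 = 0.03
display "fitted p at age 20: " b0 + b1*20
display "fitted p at age 30: " b0 + b1*30
```

At age 30 the fitted value is 1.1, which cannot be a probability.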

We can solve this problem, and avoid imposing any complex restrictions on the coefficients, by
transforming the probability pi and making the transformed value a linear function of the
covariates. First, we convert pi into the odds:

oddsi = pi / (1 - pi)

The odds are the ratio of favourable to unfavourable cases. If pi is very small, then the odds
are also small. Next, we take the logarithm of the odds to form the logit, or log-odds:

logit(pi) = log[pi / (1 - pi)]

This transformation removes the restriction that the dependent variable has to lie between 0
and 1: the logit can take any value from - ∞ to + ∞. As pi falls towards zero, the odds move to
zero and the logit approaches - ∞. As pi approaches 1, the odds move to + ∞ and the logit also
approaches + ∞. Negative logits represent probabilities below 0.5 and positive logits represent
probabilities above 0.5. At probability 0.5, the logit is zero.
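
These properties can be checked directly in Stata, which has built-in logit() and invlogit() functions. The lines below are only a quick illustration with arbitrarily chosen probabilities; they need no data set in memory:

```stata
* logit(p) returns ln(p/(1-p)); invlogit(x) returns exp(x)/(1+exp(x))
* negative log-odds: probability below 0.5
display logit(0.1)
* zero log-odds: probability exactly 0.5
display logit(0.5)
* positive log-odds: probability above 0.5
display logit(0.9)
* invlogit() maps the log-odds back to the probability
display invlogit(logit(0.9))
```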

So, we can present the logistic regression model as the following:

logit(pi) = xi′β          (2)

This model is a generalized linear model with a binomial response and a logit link. In this
model the regression coefficient βj represents the change in the logit of the probability
associated with a unit change in the jth predictor, holding all other explanatory factors
constant.

To get the estimated value of pi, we take the following steps:

First, exponentiating both sides, we have:

pi / (1 - pi) = exp(xi′β)

Now, solving for pi, we have:

pi = exp(xi′β) / [1 + exp(xi′β)]

Hence, the marginal change in pi caused by a unit change in xj, holding the other predictors
constant, is:

dpi/dxij = βj pi (1 - pi)
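
For readers who want the intermediate step, the marginal effect follows from differentiating the expression for pi with the quotient rule. The notation below matches the text (pi written as p_i, xi′β as x_i'β):

```latex
\begin{aligned}
p_i &= \frac{e^{x_i'\beta}}{1+e^{x_i'\beta}}, \\[4pt]
\frac{\partial p_i}{\partial x_{ij}}
  &= \beta_j\,\frac{e^{x_i'\beta}}{\left(1+e^{x_i'\beta}\right)^{2}}
   = \beta_j \cdot \frac{e^{x_i'\beta}}{1+e^{x_i'\beta}} \cdot \frac{1}{1+e^{x_i'\beta}}
   = \beta_j\, p_i\,(1-p_i).
\end{aligned}
```

Since p_i(1 - p_i) is largest at p_i = 0.5, where it equals 1/4, the marginal effect can never exceed β_j/4 in absolute value.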


Thus, the effect of the jth predictor on pi depends on both the coefficient βj and the value of
pi. Often, we evaluate the impact of a particular factor on the probability by setting the
probability to the sample mean. The result then approximates the effect of the explanatory
factor on the probability near the mean of the response.

The logit model is estimated by maximum likelihood: each yi is a binary outcome that equals 1
with probability pi, and the likelihood of the sample is built from these probabilities.
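
Written out, and assuming the n observations are independent (n is a symbol introduced here for the sample size), the log likelihood that is maximized over β takes the standard form:

```latex
\ell(\beta) \;=\; \sum_{i=1}^{n} \Big[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\Big],
\qquad
p_i = \frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)} .
```

This is the quantity Stata reports as "log likelihood" at each step of its iteration log.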

Estimation of Logit Model in Stata:

Let us consider the following regression model:

logit(pi) = β1 + β2 agei + β3 sexi

where pi stands for the probability that the ith individual enters the labour market as a wage
labourer. (No error term appears in this equation: the randomness in yi is already captured by
the probability pi.)

To run a logistic regression, type:

logit dependent_variable independent_variable(s)

In this case, using the emp.dta data file, we should write:

logit d_entry age i.sex

We will get the output like the following:


. logit d_entry age i.sex

Iteration 0:   log likelihood = -5055.4392
Iteration 1:   log likelihood = -5047.4624
Iteration 2:   log likelihood = -5047.4548
Iteration 3:   log likelihood = -5047.4548

Logistic regression                             Number of obs     =     10,253
                                                LR chi2(2)        =      15.97
                                                Prob > chi2       =     0.0003
Log likelihood = -5047.4548                     Pseudo R2         =     0.0016

------------------------------------------------------------------------------
     d_entry |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0033597   .0012641     2.66   0.008     .0008821    .0058374
       2.sex |  -.1493168   .0500632    -2.98   0.003    -.2474389   -.0511948
       _cons |  -1.461519   .0547845   -26.68   0.000    -1.568894   -1.354143
------------------------------------------------------------------------------

Important points to be noted:

• If Prob > chi2 is less than 0.05, the model as a whole is statistically significant. This is
a likelihood-ratio test of the null hypothesis that all slope coefficients are jointly zero,
i.e. that the model explains no more than a model with only a constant. Here Prob > chi2 is
0.0003, so we reject that null hypothesis at the 5 percent level. This does not by itself
mean that the model predicts well: the Pseudo R2 is only 0.0016.

• P>|z| is the two-tailed p-value for the test of whether the corresponding coefficient is
different from zero. If it is below 0.05, we reject, at the 5 percent level of significance,
the null hypothesis that the coefficient is zero, and we can say that the variable has a
statistically significant influence on the dependent variable.

• Coef. presents the logit coefficients, which are in log-odds units and cannot be read like
OLS coefficients. To interpret them we need to estimate the predicted probabilities of y=1.
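
The three summary statistics in the header can be reproduced by hand from the iteration log, which may help in reading them. The sketch below assumes that the iteration 0 value is the log likelihood of the constant-only model (which is how Stata's logit starts) and that Pseudo R2 is McFadden's measure; the scalar names are arbitrary:

```stata
* Log likelihoods copied from the iteration log above
* ll0: iteration 0 (constant-only model); ll1: final iteration
scalar ll0 = -5055.4392
scalar ll1 = -5047.4548

* Likelihood-ratio statistic, its p-value with 2 degrees of freedom,
* and McFadden's pseudo R-squared
scalar lr = 2*(ll1 - ll0)
display "LR chi2(2)  = " %6.2f lr
display "Prob > chi2 = " %6.4f chi2tail(2, lr)
display "Pseudo R2   = " %6.4f (1 - ll1/ll0)
```

The displayed values should agree with the 15.97, 0.0003 and 0.0016 shown in the output.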

To get the predicted probability of entering the labour market for each observation/individual
in the data set, we need to type the following just after running the regression:

predict d_entry_hat

Now, for example, if we want the predicted probability of entering the labour market for a male
aged 30, we should type:

tabulate d_entry_hat if sex == 1 & age == 30
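
The same probability can be worked out by hand from the coefficients, which shows what predict is doing. This is a sketch using the estimates reported above; it relies on sex = 1 (male) being the base level, so the 2.sex coefficient drops out:

```stata
* Coefficients copied from the logit output: _cons and age
* sex = 1 (male) is the base level, so no sex term is added
scalar xb_m30 = -1.461519 + .0033597*30
display "Pr(d_entry = 1 | male, age 30) = " %6.4f invlogit(xb_m30)

* With the fitted model still in memory, the margins command below
* should report the same probability:
* margins, at(age = 30 sex = 1)
```

The result is roughly 0.204, i.e. about a 20 percent chance of entering the labour market.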

When running a logit regression, we can request odds ratios rather than logit coefficients by
adding the option or (after the comma).

Hence, in this case, we should type:

logit d_entry age i.sex, or

We will get the output as the following:

. logit d_entry age i.sex, or

Iteration 0:   log likelihood = -5055.4392
Iteration 1:   log likelihood = -5047.4624
Iteration 2:   log likelihood = -5047.4548
Iteration 3:   log likelihood = -5047.4548

Logistic regression                             Number of obs     =     10,253
                                                LR chi2(2)        =      15.97
                                                Prob > chi2       =     0.0003
Log likelihood = -5047.4548                     Pseudo R2         =     0.0016

------------------------------------------------------------------------------
     d_entry | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.003365   .0012684     2.66   0.008     1.000882    1.005855
       2.sex |   .8612962   .0431192    -2.98   0.003     .7807979    .9500936
       _cons |   .2318839   .0127036   -26.68   0.000     .2082754    .2581685
------------------------------------------------------------------------------

Here, the odds ratio is the factor by which the odds of Y=1 are multiplied when X increases by
1 unit, holding the other variables constant. It equals exp(logit coefficient).

If the odds ratio is greater than 1, the odds of Y=1 increase as X increases; equivalently, the
logit coefficient is positive.

If the odds ratio is less than 1, the odds of Y=1 decrease as X increases; equivalently, the
logit coefficient is negative.
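
The link between the two tables can be verified by exponentiating the logit coefficients reported earlier. The lines below simply reuse those numbers and need no data in memory:

```stata
* exp(logit coefficient) should reproduce the Odds Ratio column
* age
display %9.7f exp(.0033597)
* 2.sex
display %9.7f exp(-.1493168)
* _cons (the odds when age = 0 and sex is at its base level)
display %9.7f exp(-1.461519)
```

For example, the value for 2.sex is about 0.861: the odds of entering the labour market for a female are about 14 percent lower than for a male of the same age.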

Marginal effects at mean:

If we want to estimate the marginal effect of a predictor on the dependent variable (here, the
probability of entering the labour market) while holding the other factors constant, we need to
use the margins command. Generally we estimate marginal effects at the means of the predictors.
After running the logit regression, type:

margins, dydx(*) atmeans

In our case, we will get the following result:

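. margins, dydx(*) atmeans

Conditional marginal effects                    Number of obs     =     10,253
Model VCE    : OIM

Expression   : Pr(d_entry), predict()
dy/dx w.r.t. : age 2.sex
at           : age             =    33.18482 (mean)
               1.sex           =     .515849 (mean)
               2.sex           =     .484151 (mean)

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .000526   .0001977     2.66   0.008     .0001384    .0009135
       2.sex |  -.0233426   .0078078    -2.99   0.003    -.0386456   -.0080396
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.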

Here, dy/dx for age is 0.0005, which represents the change in the probability of entering the
labour market for a one-year increase in age, evaluated at the sample means. The effect is
significant at the 5 percent level.

dy/dx for sex is -0.023, which represents the difference in the probability of entering the
labour market for a female as compared to a male. It implies that a female's probability of
entering the labour market is approximately 2.3 percentage points lower than a male's. The
effect is significant at the 5 percent level.
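
Both numbers can be reproduced by hand from the logit coefficients and the sample means printed in the margins header. The sketch below applies the formula βj pi (1 - pi) for age and a difference of two predicted probabilities for sex; the scalar names are arbitrary:

```stata
* Coefficients and sample means copied from the outputs above
scalar b_age = .0033597
scalar b_sex = -.1493168
scalar b0    = -1.461519
scalar m_age = 33.18482
scalar m_fem = .484151

* dy/dx for age: beta * p * (1 - p), with p evaluated at the means
scalar p_mean = invlogit(b0 + b_age*m_age + b_sex*m_fem)
scalar me_age = b_age*p_mean*(1 - p_mean)

* dy/dx for 2.sex: discrete change from male to female, age at its mean
scalar d_sex = invlogit(b0 + b_age*m_age + b_sex) - invlogit(b0 + b_age*m_age)

display "age:   " %9.7f me_age
display "2.sex: " %9.7f d_sex
```

The first line uses the derivative because age is continuous; the second uses a difference because sex is a factor variable, which is what the note under the margins table refers to.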

If we want to estimate the marginal effect of a categorical independent variable, say sex, at
some given values of a continuous independent variable, say age, we need to type:

Example 1: margins, dydx(sex) at(age= 20) vsquish


Here, we are trying to find out marginal effect of sex on the probability of entering into the
labour market for those individuals who are 20 years old.

We will get the following result:

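. margins, dydx(sex) at(age= 20) vsquish

Conditional marginal effects                    Number of obs     =     10,253
Model VCE    : OIM

Expression   : Pr(d_entry), predict()
dy/dx w.r.t. : 2.sex
at           : age             =          20

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       2.sex |  -.0227117   .0076011    -2.99   0.003    -.0376096   -.0078139
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.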

Example 2: margins, dydx(sex) at(age=(20 30)) vsquish

Here, we are trying to find out marginal effect of sex on the probability of entering into the
labour market for those individuals who are 20 and 30 years old.

We will get the following result:


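. margins, dydx(sex) at(age=(20 30)) vsquish

Conditional marginal effects                    Number of obs     =     10,253
Model VCE    : OIM

Expression   : Pr(d_entry), predict()
dy/dx w.r.t. : 2.sex
1._at        : age             =          20
2._at        : age             =          30

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
2.sex        |
         _at |
          1  |  -.0227117   .0076011    -2.99   0.003    -.0376096   -.0078139
          2  |  -.0231899   .0077571    -2.99   0.003    -.0383935   -.0079863
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.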
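
To see where the two rows come from, the discrete change can be computed by hand at each age. This is a sketch that reuses the coefficients of the two-variable model (logit d_entry age i.sex); the scalar names p_male and p_female are arbitrary:

```stata
* Discrete change in Pr(d_entry = 1) from male (sex = 1) to female (sex = 2)
* at each requested age, using the coefficients of "logit d_entry age i.sex"
foreach a of numlist 20 30 {
    scalar p_male   = invlogit(-1.461519 + .0033597*`a')
    scalar p_female = invlogit(-1.461519 + .0033597*`a' - .1493168)
    display "age `a': " %9.7f (p_female - p_male)
}
```

The gap between the sexes changes slightly with age even though the model has no interaction term, because the logit curve is nonlinear in the predictors.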

Example 3: Let us now consider a case with three independent variables: age, sex and household
group (hhd_gr). Household group is a categorical variable with four groups: Scheduled Tribes
(STs) (Code 1), Scheduled Castes (SCs) (Code 2), Other Backward Classes (OBCs) (Code 3) and
Others (Code 4; this group appears as level 9 in the output below, which suggests it is coded
9 in emp.dta). So we first run a logit regression with one additional independent variable.
Type:

logit d_entry age i.sex i.hhd_gr

We will get the following outcome:


. logit d_entry age i.sex i.hhd_gr

Iteration 0:   log likelihood = -5055.4392
Iteration 1:   log likelihood = -5034.0459
Iteration 2:   log likelihood = -5033.8988
Iteration 3:   log likelihood = -5033.8988

Logistic regression                             Number of obs     =     10,253
                                                LR chi2(5)        =      43.08
                                                Prob > chi2       =     0.0000
Log likelihood = -5033.8988                     Pseudo R2         =     0.0043

------------------------------------------------------------------------------
     d_entry |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0037629    .001272     2.96   0.003     .0012698    .0062559
       2.sex |  -.1500732   .0501322    -2.99   0.003    -.2483305   -.0518158
             |
      hhd_gr |
          2  |  -.2981956   .1705197    -1.75   0.080    -.6324081    .0360169
          3  |  -.6790007    .185865    -3.65   0.000    -1.043289    -.314712
          9  |  -.5292953   .1649289    -3.21   0.001      -.85255   -.2060406
             |
       _cons |  -.9936585   .1691016    -5.88   0.000    -1.325092   -.6622253
------------------------------------------------------------------------------

Now, if we want to find the marginal effect of sex on the probability of entering the labour
market given that age is 20 and household group is Code 2, type:

margins, dydx(sex) at(age= 20 hhd_gr = 2) vsquish

We will get the following result:


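. margins, dydx(sex) at(age= 20 hhd_gr = 2) vsquish

Conditional marginal effects                    Number of obs     =     10,253
Model VCE    : OIM

Expression   : Pr(d_entry), predict()
dy/dx w.r.t. : 2.sex
at           : age             =          20
               hhd_gr          =           2

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       2.sex |  -.0253775   .0084874    -2.99   0.003    -.0420124   -.0087426
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.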

If we want to find the marginal effect of sex on the probability of entering the labour market
when age takes the values 20 and 30 and household group (hhd_gr) takes the values 2 and 3,
type:

margins, dydx(sex) at(age= (20 30) hhd_gr = (2 3)) vsquish

We will get the following result:


. margins, dydx(sex) at(age= (20 30) hhd_gr = (2 3)) vsquish

Conditional marginal effects                    Number of obs     =     10,253
Model VCE    : OIM

Expression   : Pr(d_entry), predict()
dy/dx w.r.t. : 2.sex
1._at        : age             =          20
               hhd_gr          =           2
2._at        : age             =          20
               hhd_gr          =           3
3._at        : age             =          30
               hhd_gr          =           2
4._at        : age             =          30
               hhd_gr          =           3

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
2.sex        |
         _at |
          1  |  -.0253775   .0084874    -2.99   0.003    -.0420124   -.0087426
          2  |  -.0199791   .0067653    -2.95   0.003    -.0332389   -.0067193
          3  |  -.0259198   .0086651    -2.99   0.003    -.0429032   -.0089364
          4  |  -.0204956   .0069333    -2.96   0.003    -.0340846   -.0069065
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

If we want to find the marginal effect of household group (hhd_gr), with Code 1 of hhd_gr
acting as the reference group, on the probability of entering the labour market when age takes
the values 20 and 30 and sex is 1 (male), type:

margins, dydx(hhd_gr) at(age= (20 30) sex = 1) vsquish

We will get the following result:


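. margins, dydx(hhd_gr) at(age= (20 30) sex = 1) vsquish

Conditional marginal effects                    Number of obs     =     10,253
Model VCE    : OIM

Expression   : Pr(d_entry), predict()
dy/dx w.r.t. : 2.hhd_gr 3.hhd_gr 9.hhd_gr
1._at        : age             =          20
               sex             =           1
2._at        : age             =          30
               sex             =           1

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
2.hhd_gr     |
         _at |
          1  |  -.0567479   .0343602    -1.65   0.099    -.1240926    .0105969
          2  |  -.0577799   .0349123    -1.66   0.098    -.1262068    .0106469
-------------+----------------------------------------------------------------
3.hhd_gr     |
         _at |
          1  |  -.1169387   .0354702    -3.30   0.001     -.186459   -.0474185
          2  |  -.1193385   .0360628    -3.31   0.001    -.1900204   -.0486567
-------------+----------------------------------------------------------------
9.hhd_gr     |
         _at |
          1  |  -.0949275   .0334047    -2.84   0.004    -.1603996   -.0294554
          2  |  -.0967944   .0339414    -2.85   0.004    -.1633183   -.0302705
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.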
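
Each entry in this table is the predicted probability for the named household group minus the predicted probability for the base group (Code 1), at the stated age and sex. A hand calculation for males aged 20, using the coefficients of the three-variable model, might look like this (scalar names are arbitrary):

```stata
* Coefficients from "logit d_entry age i.sex i.hhd_gr"
* male (base level of sex), age 20, hhd_gr = 1 (base level)
scalar xb_base = -.9936585 + .0037629*20
scalar p_base  = invlogit(xb_base)

* discrete change from the base group to groups 2, 3 and 9
scalar d2 = invlogit(xb_base - .2981956) - p_base
scalar d3 = invlogit(xb_base - .6790007) - p_base
scalar d9 = invlogit(xb_base - .5292953) - p_base

display "2.hhd_gr: " %9.7f d2
display "3.hhd_gr: " %9.7f d3
display "9.hhd_gr: " %9.7f d9
```

These should match the _at = 1 rows of the table above.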

Probit estimation in Stata:


To estimate the same model (with age and sex as predictors) by probit, we should type:

probit d_entry age i.sex

The remaining steps for calculating predicted probabilities and marginal effects of the
independent variables are the same as for logit regression. However, probit estimation does
not produce odds ratios, because the probit model uses the inverse standard normal
transformation (z score) of the probability pi rather than the log-odds. The probit regression
coefficients therefore indicate the influence of the predictors on the z score of pi. To find
the marginal effects of the predictors on pi itself, we run the ‘margins’ command after the
probit regression, in the same way as for logit regression.
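
The z-score transformation can be explored with Stata's built-in normal functions. In the sketch below the index value 0.3 and the coefficient 0.5 are hypothetical numbers chosen only for illustration (no probit estimates are reported in this manual); the last line uses the standard probit result that the marginal effect of a continuous predictor is its coefficient times the normal density at the index:

```stata
* normal() is the standard normal CDF, invnormal() its inverse
* (the probit transformation), normalden() the standard normal density
display normal(0)
display invnormal(0.5)
display normal(1.96)

* Hypothetical index value xb and coefficient bj, for illustration only
scalar xb = 0.3
scalar bj = 0.5
display "p = " %6.4f normal(xb) "   dp/dx = " %6.4f bj*normalden(xb)
```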

How to deal with heteroskedasticity in Logit and Probit regression

To guard against heteroskedasticity when running a logit or probit regression, we can follow
the same process as for OLS regression and request robust standard errors. As with OLS, this
changes only the estimated standard errors (and hence the z statistics, p-values and confidence
intervals); the estimated coefficients are the same as without the option. We just need to add
the vce(robust) option when running the logit or probit regression, as follows:

Logit Regression: Type-

logit dependent_variable independent_variable(s), vce(robust)

Example 1: (we are using emp.dta data file):

logit d_entry age i.sex, vce(robust)

Probit Regression: Type-

probit dependent_variable independent_variable(s), vce(robust)

Example: (we are using emp.dta data file):

probit d_entry age i.sex, vce(robust)
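
To see what the option changes, one can fit the logit model twice and compare. The sketch below assumes that emp.dta is in the current working directory; the scalar names are arbitrary:

```stata
* Assumes emp.dta is in the current working directory
use emp.dta, clear

* Default (OIM) standard errors
quietly logit d_entry age i.sex
scalar b_default  = _b[age]
scalar se_default = _se[age]

* Robust standard errors
quietly logit d_entry age i.sex, vce(robust)
display "coefficient on age: " b_default  " (default)   " _b[age]  " (robust)"
display "std. error on age:  " se_default " (default)   " _se[age] " (robust)"
```

The two coefficients should be identical, while the standard errors will generally differ.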
