
Chapter Five: Discrete Choice and

Limited Dependent Variable Models


Regression with dummy variables
Models with Discrete Dependent Variables
Models for Binary Choices
Models for Unordered Multiple Choices
Ordered Choice Models
Limited Dependent Variable Models
Censored and Truncated Regression Models
Sample Selection and Average Treatment Effect
Introduction
 In empirical analysis we may encounter four types of variables. These are ratio scale, interval
scale, ordinal scale and nominal scale variables.
 Regression analysis can deal with many interesting variables that are expressed in qualitative
terms (both explanatory and dependent). These qualitative measures have to be transformed
into some proxy so that they can be represented and used in a regression. Dummy variables are
discrete transformations used for this purpose. Variables that artificially assume the values 0
and 1 are called dummy variables
 Let’s start with dummy variables among the regressors
 Two shifters: Intercept shifter and Slope shifter
1. Effects on intercept
 Suppose you are investigating the relationship between schooling and earnings, and you have
both males and females in your sample. You would like to see if the sex of the respondent
makes a difference.
 Example : Suppose we want to compare gender differences in salary of an accountant for 22
accountants. Assume after regression, we found the following result:
Y = 35.20 + 10.25D
se (43.82) (3.45)
Where Y is income in thousands and D is a dummy variable taking value 1 for male and 0 otherwise.
Cont’d---
 Interpretation
The estimated mean salary of female accountants is $35,200 and the estimated mean salary of
male accountants is $45,450. Thus, t- statistics shows us gender differential is statistically
significant since β is significant (because the t-calculated value =β/se(β) =10.25/3.45=2.97
is greater than the t-tabulated value at 5% significance level for the two tailed test).

 Regression on one quantitative variable and one qualitative variable with two
classes, or categories
Yi = β0 + β1Di + β2Xi + Ui (5.1)
Where: Yi = annual salary of the ith accountant, Xi = years of experience, and Di = 1 if male,
0 otherwise
 Example: Assume the following regression result from the model given above, with Y being the
hourly wage rate, D a dummy for men, and X years of schooling. The dependent variable is
expressed in US dollars ($). Standard errors are given within parentheses:
Ŷ = β̂0 + 21.9D + 2.4X
se (8.16) (4.30) (0.63)
Use the regression results to calculate how much higher the average hourly wage rate is for men.
Cont’d---
 First we have to check if the coefficient for the male dummy is significant.
 With a t-value (β/se(β) = 21.9/4.3) of about 5, the coefficient is significantly different from
zero at any conventional significance level. The marginal effect measured with this regression
says that men earn $21.9 more per hour than women do on average, holding years of schooling
constant.
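 A minimal sketch in Python (statsmodels) of such an intercept-dummy regression. The data are simulated, since the original sample is not reproduced here; all names and numbers are hypothetical:

# Hypothetical sketch: wage regression with an intercept dummy, as in eq. (5.1)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
male = rng.integers(0, 2, n)             # D = 1 for men, 0 for women
school = rng.integers(8, 17, n)          # X = years of schooling
wage = 5 + 20 * male + 2.5 * school + rng.normal(0, 8, n)

X = sm.add_constant(np.column_stack([male, school]))
res = sm.OLS(wage, X).fit()
print(res.params)    # [intercept, coefficient on D, coefficient on X]
print(res.tvalues)   # t = beta/se(beta), the same test done by hand above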
 Regression on one qualitative variable with more than two classes
Example: If we want to look at the effect of location (Addis Ababa, Hawassa, Arba Minch, etc.) on
a person's salary in thousands of Birr (Y), we create three dummy variables for the towns:
 Addis Ababa: D1 = 1 if from Addis Ababa, 0 otherwise; Hawassa: D2 = 1 if from Hawassa, 0
otherwise; and Arba Minch: D3 = 1 if from Arba Minch, 0 otherwise
 Suppose the model Y = γ0 + γ1D1 + γ2D2 + e is estimated as Y = γ0 + 3.8D1 + 2.1D2
 Why was Arba Minch dropped? It is the base (reference) category; including all three dummies
together with an intercept would cause perfect multicollinearity (the dummy variable trap)
 Interpretation: γ1 = 3.8 means that people in Addis Ababa earn Birr 3,800 more on average
relative to people in Arba Minch, while those in Hawassa earn Birr 2,100 more on average in
reference to people in Arba Minch
Cont’d---
 Let us return to the accountants’ salary regression above, but now assume that in addition to
years of experience and sex, the skin color (black = 0 and white = 1) of the accountant is also an
important determinant of salary. The equation can be re-written as:
Yi = β1 + β2D2i + β3D3i + βXi + ui (5.2)
 Assuming E(ui) = 0, we can obtain the following regressions:
Mean salary for black female accountant:
E(Yi | D2 = 0, D3 = 0, Xi) = β1 + βXi (5.3)
Mean salary for black male accountant:
E(Yi | D2 = 1, D3 = 0, Xi) = (β1 + β2) + βXi (5.4)
Mean salary for white female accountant:
E(Yi | D2 = 0, D3 = 1, Xi) = (β1 + β3) + βXi (5.5)
Mean salary for white male accountant:
E(Yi | D2 = 1, D3 = 1, Xi) = (β1 + β2 + β3) + βXi (5.6)
Cont’d---
2. The Slope dummy variables
 As could be seen above, the dummy variable can work as an intercept shifter.
Sometimes it is reasonable to believe that the shift should take place in the slope
coefficient instead of the intercept.
 If we go back to the human capital model it is possible to argue that the difference in
wage rate between men and women could be due to differences in their return to
education. This would mean that men and women have slope coefficients that are
different in size.
 A model that control for differences in the slope coefficient for different categories of
the qualitative variable could be expressed in the following way:
ln Y = β0 + (β1 + β2D)X + Ui
= β0 + β1X + β2(DX) + Ui (5.7)
In this case the slope coefficient for X equals β1 when D = 0 and β1 + β2 when D = 1.
Hence, a way to test if the return to education differs between men and women is
to test whether β2 is different from zero (the coefficient for the interaction effect)
Cont’d---
Example: Assume the results from 22 observations are presented below with standard errors
within parenthesis:
lnY = 4.11 + 0.024X + 0.014DX
(0.031) (0.003) (0.002)
 Use the regression results to investigate if there is a difference in the return to education
between men and women.
 To answer that question we simply test the estimated coefficient for the cross product. Doing
that, we receive a t-value of 7 (= 0.014/0.002), which is above the critical values at any
conventional significance level. Hence, we conclude that the returns to education differ
between men and women.
3. A model with Intercept and Slope dummy variables
 Whenever working with cross products it is very important to include the involved
variables separately, to separate that kind of effect from the cross product. If D in itself
has a positive effect on the dependent variable, that unique effect would otherwise be
absorbed into the cross effect. Hence, whenever including a cross product the model
should be specified in the following way: lnY = β0 + β1X + β2D + β3(DX) + U (5.8)
 When we include the two variables, X and D separately and together with their product we
allow for changes in both the intercept and the slope.
Cont’d---
 Example: Extend the earlier equation by including D separately. Doing that,
we received the following results, with standard errors within parenthesis.
 lnY = 4.006 + 0.033X + 0.210D − 0.002DX
(0.045) (0.004) (0.062) (0.005)
 By investigating the t-values we see that β1 and β2 are statistically significantly
different from zero, but the t-value for the cross product is no longer significant.
Since D alone has a significant effect on the dependent variable, there is little
effect left for the cross product, and hence we conclude that there is no
difference in the return to education between men and women.
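 A hedged sketch (simulated data, hypothetical coefficients) of the full intercept-plus-slope-dummy model (5.8), showing how the interaction test is run in practice:

# Hypothetical sketch of eq. (5.8): intercept dummy, slope dummy, and their product
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
D = rng.integers(0, 2, n)                  # group dummy (e.g. male = 1)
X = rng.uniform(8, 16, n)                  # years of schooling
lnY = 4.0 + 0.03 * X + 0.2 * D + 0.0 * (D * X) + rng.normal(0, 0.1, n)

W = sm.add_constant(np.column_stack([X, D, D * X]))
res = sm.OLS(lnY, W).fit()
print(res.summary())   # the t-test on the D*X term asks whether the slopes differ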
Qualitative dependent Variables Regression
 In previous chapters, we have dealt with models where the range of the dependent variable is
unbounded. Now we shall consider imposing both an upper and a lower bound on the variable
and we consider models of Discrete choice and Limited Dependent Variables (LDV) in which
the economic agent's response is limited in some way.
 A LDV is broadly defined as a dependent variable, whose range of values is restricted, rather
than being continuous on the real line.
 Discrete choice/dependent variables are sometimes called qualitative response/choice models,
because the dependent variables are discrete taking on two or more possible values, rather than
continuous. These models specify the probability that an individual chooses an option among a
set of alternatives or choices/categories; hence the dependent variable is categorical
(Nominal and Ordinal variables).
 Discrete choice models can be classified according to the number of available alternatives.
* Binomial/binary choice models (dichotomous): 2 available alternatives
* Multinomial choice models (polytomous): 3 or more available alternatives
* Ordered (count data or ordered rating responses) or unordered alternatives
 Discrete choice and LDV models can be used for time series and panel data, but they are most
often applied to cross-sectional data. Thus, we focus on cross-sectional applications
Models for Binary Choice
 In regression analysis we often face a qualitative response (dependent) variable of the yes/no,
pass/fail, win/lose type, taking values 0 and 1.
 Discrete choice models dealing with such kind of binary responses are called binary choice
/Dummy dependent variable models
1. Linear Probability Model
 Technically, it is possible to estimate the binary choices using OLS. Such linear model for binary
choices where OLS is used is called linear probability model (LPM).
 The linear probability model is simply running OLS for a regression where the dependent
variable is a dummy (i.e. binary) variable:
Di = β0 + β1X1i + β2X2i + ... + βkXki + εi (5.9)
where Di is a dummy variable, and the Xs, βs, and ε are typical independent variables, regression
coefficients, and an error term, respectively.
 The term linear probability model comes from the fact that the right side of the equation is
linear while the expected value of the left side measures the probability that Di = 1
 It is based on the assumption that the probability of an event occurring, Pi, is linearly related to a set
of explanatory variables. The actual probabilities cannot be observed, so we estimate this linear
regression model by OLS; the coefficients, t-statistics, etc. are then interpreted in the
usual way
Cont’d---
 When y is a binary variable taking only two values (0 and 1), βi cannot be interpreted as a
measure of the change in y for a unit change in xi, other things being equal. It does not make sense
to say that a 1-point increase in x1, ceteris paribus, increases Di by β1
 In such cases, y either changes from 0 to 1, changes from 1 to 0, or does not change. Instead
of predicting Di itself, we predict the probability that Di = 1.
 It can make sense to say that a 1-point increase in x1, ceteris paribus, increases the probability of
success by β1
 In particular, when y takes on values 0 and 1, it is always true that the probability of success
(y = 1) is the same as the expected value of y:
Pr(y = 1 | x) = E(y | x) (5.10)
 In the LPM, βi measures the change in the probability of success when xi changes, holding other
covariates fixed:
ΔPr(y = 1 | x) = βi Δxi (5.11)

 Suppose, for example, that we wanted to model the probability that a firm i will pay a dividend p(yi = 1)
as a function of its market capitalization (x1i, measured in millions of US dollars), and we fit the
following line:
P̂i = −0.3 + 0.012x1i
where P̂i denotes the fitted or estimated probability for firm i.
 This model suggests that for every $1m increase in size, the probability that the firm will pay a dividend increases by 0.012 (i.e. 1.2 percentage points).
 A firm whose stock is valued at $50m will have a −0.3 + 0.012×50 = 0.3 (or 30%) probability of making a dividend payment.
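 The boundary problem is easy to see numerically. A small sketch using the fitted line above (the firm sizes are made up for illustration):

# Fitted LPM for the dividend example: P-hat = -0.3 + 0.012 * size
import numpy as np

size = np.array([10, 25, 50, 100, 120])      # market cap in $m (hypothetical)
p_hat = -0.3 + 0.012 * size
print(p_hat)   # [-0.18, 0.0, 0.3, 0.9, 1.14] -> escapes the [0, 1] interval
# In practice the LPM is run by OLS with robust standard errors, e.g.
# sm.OLS(d, sm.add_constant(x)).fit(cov_type="HC1") in statsmodels.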
Problems with LPM
 Though LPM is simple to interpret, it has a number of shortcomings:
o Except in cases where the probability does not depend on any of the covariates, the LPM is
always heteroskedastic (the disturbance variance changes systematically with the x’s). Therefore,
heteroskedasticity-robust standard errors are always used in the context of LDV models.
o The constant slope (β) in the LPM is less intuitive in many cases
o The predicted probability can be less than 0 or greater than 1, which violates the intuition
that probabilities should be between 0 and 1
 In the earlier example, for any firm whose value is less than $25m, the model-predicted
probability of dividend payment is negative, while for any firm worth more than about
$108m, the probability is greater than one
 An obvious solution is to censor/truncate the probabilities at 0 or 1, so that a probability
of, say, −0.3 would be set to zero, and a probability of, say, 1.2 would be set to 1.

 We need a procedure to translate our linear regression results into true probabilities
 Two common transformations that result in sigmoid functions are probit and logit
transformations.
Cont’d---
 The transformation function F that maps the linear combination into [0,1] and
satisfies in general F(−∞) = 0, F(+∞) = 1, and dF(z)/dz > 0 [that is, it is a
cumulative distribution function].
 We want a translator such that:
 The closer to -∞ is the value from our linear regression model, the closer to
0 is our predicted probability.
 The closer to +∞ is the value from our linear regression model, the closer
to 1 is our predicted probability.
 No predicted probabilities are less than 0
or greater than 1
Non-Linear Probability Models (NLPM)
 The NLPM (both the logit and probit model) approaches are able to overcome the limitation
of the LPM that it can produce estimated probabilities that are negative or greater than one.
 They do this by using a function that effectively transforms the regression model so that the
fitted values are bounded within the (0,1) interval.
 Visually, the fitted regression model will appear as an S-shape rather than a straight line, as
was the case for the LPM
 To begin, assume the appropriate statistical experiment: draws from a Bernoulli/binomial
distribution (where the relationship between x and p is non-linear). The probability model from
the Bernoulli distribution is given by:
f(yi | z) = p^yi (1 − p)^(1−yi) (5.12)
where p is a parameter reflecting the probability that y = 1 (success) and 1 − p is the probability of failure.
 This probability often follows an S-shaped curve. In other words, the probability that
y = 1 remains small until some threshold is crossed, at which point it rises rapidly and remains
large after the threshold. This suggests a cumulative distribution function.
 That is, the relationship between the probability P(y = 1 | x) and x is non-linear. The probability
is expected to follow a sigmoid or S-shape which resembles the cumulative distribution
function (CDF) of a random variable
Cont’d---
 A model whose dependent variable has an upper and a lower bound can be derived from an ordinary
regression model by mapping the dependent variable y through a sigmoid or S-shaped function. As y → - ∞ ,
the sigmoid tends to its lower asymptote whereas, as y → ∞, the sigmoid tends to the upper asymptote.
[Figure: S-shaped (sigmoid) curve of Pr(y = 1) against x, bounded between 0 and 1 as x runs from −∞ to +∞]
 The response function (logistic or probit) is an S-shaped function, which implies a fixed change
in X has a smaller impact on the probability when it is near zero than when it is near the
middle. Thus, it is a non-linear response function.
2. Binary Logit Model (BLM)
• The logit model uses the cumulative logistic distribution to transform the model so that the
probabilities follow the S-shape given on the previous slide
• The binomial logit is an estimation technique for equations with dummy dependent variables
that avoids the unboundedness problem of the linear probability model
 The BLM is non-linear and does so by using the cumulative logistic function:
Pi = e^Zi/(1 + e^Zi) = 1/(1 + e^−Zi), where Zi = β1 + β2X2i + ... + βkXki (5.13)
 Again, P̂i for the logit model is bounded between 0 and 1
 Logits cannot be estimated using OLS but are instead estimated by Maximum
Likelihood Estimation(MLE), an iterative estimation technique that is especially useful for
equations that are nonlinear in the coefficients/parameters
 Maximum likelihood estimates are those parameter values that maximize the probability of
drawing the data that are actually observed
 Moreover, just like in other procedures, the estimators (β̂, σ̂²) are obtained by optimizing
(maximizing) the so-called likelihood function:
L(β, σ²; X) = L(θ; X), θ = (β, σ²) (5.14)
 The likelihood function is the joint density evaluated at the observation points
 It is easier to maximise the logarithm of the likelihood function, ℓ(θ; y|X) = ln L(θ; y|X),
rather than the likelihood function itself

Cont’d---
The cumulative logistic function for logit is grounded in the concept of an odds ratio.
 More formally, the logit model in eq., 5.13 is derived as follow:
 Let p/(1 − p) be the odds ratio (prob. of success/prob. of failure); hence, the log odds ratio or
logit transformation is given as:
ln[P/(1 − P)] = z = β1 + β2X2i + ... + βkXki (5.15)
where p is the probability that the event Y occurs, p(Y = 1)
Assume a latent variable, Z, mediates between the explanatory variables and the dummy variable Y.
Then, solving for the probability that y = 1, we have:
P/(1 − P) = e^z
P = (1 − P)e^z = e^z − Pe^z
P + Pe^z = e^z
P(1 + e^z) = e^z
P = e^z/(1 + e^z) = 1/(1 + e^−z)
 The above logistic probability is simply denoted Λ(Z) (the cumulative logistic function), so that
f(Yi | Z) = Λ(Z)^Yi [1 − Λ(Z)]^(1−Yi)
 The higher Z is, the higher the probability that Y = 1.
 The logistic distribution constrains the estimated probabilities to lie between 0 and 1.
 The estimated probability of success/occurrence of event Y is:
p = 1/[1 + e^−(α+βX)] (5.16)
1 − p = 1/[1 + e^(α+βX)]
 If α + βX = 0, then p = 0.50 (probability of success)
 As α + βX gets really big, p approaches 1, but as α + βX gets really small, p approaches 0
3. Binary Probit Model (BPM)
 Instead of using the cumulative logistic function to transform the model, the probit model
uses the cumulative normal standard distribution
 Like the logit model, the Bernoulli trial conditional on Z is given by f(Y | Z) = P^Y (1 − P)^(1−Y)
for the probit model.
 Plugging the standard normal cumulative distribution function into the above function:
f(Yi | Z) = Φ(Z)^Yi [1 − Φ(Z)]^(1−Yi) (5.17)
 When the transformation function F is the cumulative distribution function (cdf) of the standard
normal distribution, the response probability is given by:
P(Y = 1) = ∫_{−∞}^{z} (1/√(2π)) e^(−t²/2) dt = Φ(z) (5.18)
where z = β1 + β2X2i + ... + βkXki

• As for the logistic approach, this function provides a transformation to ensure that the
fitted probabilities will lie between zero and one.
Logit or Probit?
 The Logit and Probit models are almost identical and the choice of the model is arbitrary (there is no
practical difference between the two models), although logit model has certain advantages (simplicity-
much easier to work with because the function can be simplified to a linear equation). Probit functions
are difficult to work with because they require integration, but computer packages can calculate probits
easily
 In the probit model, we assume εi is distributed by the standard normal. In the logit model, we assume εi
is distributed by the logistic
 Both the Probit and Logit models have the same basic structure.
1. Estimate a latent variable Z/Y* (unobserved; what we observe is a dummy variable y, with
y = 1 if y* > 0 and y = 0 otherwise) using a linear model. Z ranges from negative infinity to
positive infinity. The idea behind logit and probit is:
y = 1 if y* > 0; y = 0 if y* ≤ 0 (5.19)
2. Use a non-linear function to transform Z into a predicted Y value between 0 and 1
 Both are used to predict an outcome variable that is categorical (violates the assumption of linearity in
normal regression) from one or more categorical or continuous predictor variables
 Both use a CDF (the cumulative logistic or cumulative normal function) to transform the model
so that the probabilities follow the S-shape, but they differ in the relative thickness of the tails:
the logistic has thicker tails than the normal. This difference would, however, disappear as the
sample size gets large. If we have a small sample the two distributions can differ significantly in
their results (the slope varies dramatically), but they are quite similar in large samples.
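 A short sketch (simulated data) of fitting both models with statsmodels; it also illustrates the well-known scale difference between logit and probit coefficients:

# Fit logit and probit on the same hypothetical data and compare coefficients
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(0, 1, n)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(0.5 + 1.2 * x)))).astype(int)

X = sm.add_constant(x)
logit_res = sm.Logit(y, X).fit(disp=0)
probit_res = sm.Probit(y, X).fit(disp=0)
print(logit_res.params)                       # close to (0.5, 1.2)
print(logit_res.params / probit_res.params)   # roughly 1.6-1.8: the usual scale gap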
Estimation, hypothesis testing and Interpretation
• Since both are non-linear, maximum likelihood is usually used to estimate the parameters of
the model (of course, except LPM, binary choice models use MLE)
 The likelihood function for these models is given by:
L(β | Yi, Xi) = ∏_{i=1}^{n} Φ(zi)^Yi [1 − Φ(zi)]^(1−Yi), for the probit model
L(β | Yi, Xi) = ∏_{i=1}^{n} Λ(zi)^Yi [1 − Λ(zi)]^(1−Yi), for the logit model
 The log likelihood function of these models is given as:
ln L(β | Yi, Xi) = Σ_{i=1}^{n} [Yi ln Φ(zi) + (1 − Yi) ln(1 − Φ(zi))], for probit, and
ln L(β | Yi, Xi) = Σ_{i=1}^{n} [Yi ln Λ(zi) + (1 − Yi) ln(1 − Λ(zi))], for logit
 These functions can be optimized using standard methods to get the parameter values.
 Maximization of the likelihood function with respect to the parameter  gives the logit and
probit estimates depending on the cdf used (logistic or normal)
 The ML estimator of β is consistent and asymptotically normally distributed.
 However, the estimation rests on the strong assumption that the latent error term is normally
distributed and homoskedastic. If homoskedasticity is violated, there is no easy solution.

Cont’d---
The probability model in the form of a regression is:
E[Y | X] = 0·[1 − F(β′X)] + 1·F(β′X) = F(β′X) (5.20)
 Whatever distribution is used, the parameters of the model, like those of any other nonlinear
regression model, are not necessarily the marginal effects:
∂E[Y | X]/∂X = [dF(β′X)/d(β′X)]·β = f(β′X)·β (5.21)
where f(·) is the density function that corresponds to the cumulative distribution function F(·).
a) For the normal distribution, this is:
∂E[Y | X]/∂X = φ(β′X)·β (5.22)
where φ(·) is the standard normal density.
b) For the logistic distribution:
dΛ(β′X)/d(β′X) = e^(β′X)/[1 + e^(β′X)]² = Λ(β′X)[1 − Λ(β′X)] (5.23)
so that
∂E[Y | X]/∂X = Λ(β′X)[1 − Λ(β′X)]·β (5.24)
 In interpreting the estimated model, in most cases the means of the regressors are used.
Cont’d---
 In the logit model, we can interpret e^b as a multiplicative effect on the odds. That is, every
unit increase in X results in a multiplicative effect of e^b on the odds.
Example: If b = 0.25, then e^0.25 = 1.28. Thus, when X changes by one
unit, the odds increase by a factor of 1.28, i.e. by 28%.
 In the probit model, use the Z-score terminology. For every unit
increase in X, the Z-score (or the Probit of “success”) increases by b
units or, we can also say that an increase in X changes Z by b standard
deviation units.
 Let’s see the following regression out put
Cont’d---
Assume the following regression result from 644 observations using a binary logit
model. What do point spreads say about the probability of winning the game?
Cont’d---
The estimated slope of the point spread is -0.1098

 A 1-point increase in the point spread decreases E(Z ) by 0.1098 units.

 How do we interpret the slope dZ/dX ?

 In a linear regression, we look to coefficients for three elements:


1. Statistical significance:You can still read statistical significance from the slope dZ/dX.
The z-statistic reported for probit or logit is analogous to OLS’s t-statistic.
2. Sign: If dZ/dX is positive, then dProb(Y)/dX is also positive
 The z-statistic on the point spread is −7.22, well exceeding in absolute value the 5% critical
value of 1.96. The point spread is a statistically significant explanator of winning a game.
 The sign of the coefficient is negative. A higher point spread predicts a lower chance of
winning.
3. Magnitude: the magnitude of dZ/dX has no particular interpretation. We care about the
magnitude of dProb(Y)/dX. From the computer output for a probit or logit estimation, you
can interpret the statistical significance and sign of each coefficient directly. But
interpreting the magnitude is trickier or wrong.
 Problems in Interpreting Magnitude:
Cont’d---
1. The estimated coefficient relates X to Z. We care about the relationship between X and
Prob(Y = 1).
2. The effect of X on Prob(Y = 1) varies depending on Z.
 There are two basic approaches to assessing the magnitude of the estimated coefficient.

 One approach is to predict Prob(Y ) for different values of X, to see how the probability changes as
X changes.
 The effect of a 1-unit change in X varies greatly, depending on the initial value of E(Z ).

 E(Z ) depends on the values of all explanators.

 Step One: Calculate the E(Z) values for X = 5.88 and X = 6.88, using the fitted values:
Z(5.88) = 0 − 0.1098 × 5.88 = −0.6456
Z(6.88) = 0 − 0.1098 × 6.88 = −0.7554
 Step Two: Plug the E(Z) values into the formula for the logistic function:
For the logit, F(Ẑ) = exp(Ẑ)/[1 + exp(Ẑ)]
F(−0.7554) − F(−0.6456) = 0.320 − 0.344 = −0.024
Cont’d---
 Changing the point spread from 5.88 to 6.88 predicts a 2.4 percentage
point decrease in the team’s chance of victory.
 Note that changing the point spread from 8.88 to 9.88 predicts only a 2.1
percentage point decrease.
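 The arithmetic above is easy to reproduce. A sketch using only the fitted index Z = −0.1098 × spread (intercept taken as zero, as on the slide):

# Probability changes implied by the logistic CDF at different spreads
import numpy as np

def logistic_cdf(z):
    return np.exp(z) / (1 + np.exp(z))

b = -0.1098
for lo, hi in [(5.88, 6.88), (8.88, 9.88)]:
    dp = logistic_cdf(b * hi) - logistic_cdf(b * lo)
    print(lo, "->", hi, ":", round(dp, 4))   # about -0.024 and -0.021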
 In summary:
 The signs of the coefficients in the logit/ probit model have the same
meaning as in the linear probability (i.e. OLS) model
 The interpretation of the magnitude of the coefficients differs, though, because the
dependent variable has changed dramatically.
 Above all, note that you cannot interpret the coefficients directly in terms of
units of change in y for a unit change in x, as in regression analysis.
Cont’d---
 As stated, interpretation of the coefficients needs slight care
 A 1-unit increase in xi causes a βi·f(xi′β) change in the probability, where f(·) is the
relevant density function.
 Usually, these impacts of incremental changes in an explanatory variable are
evaluated by setting each of the explanatory variables to their mean values.
 These estimates are sometimes known as the marginal effects
 The “marginal effects” are not constant: the slope (i.e. the change in
probability) of the graph of the logit/probit changes as estimated z moves
from 0 to 1
 Therefore, the marginal effects can be evaluated at the sample means of the
data. Or the marginal effects can be evaluated at every observation and the
average can be computed to represent the marginal effects.
 More generally, the marginal effects are given as:
∂pi/∂xij = βj, for the linear probability model
∂pi/∂xij = βj·pi(1 − pi), for the logit model
∂pi/∂xij = βj·φ(zi), for the probit model
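 In statsmodels these marginal effects are available directly. A sketch (hypothetical data) showing effects at the sample means and averaged over observations, with a manual check of the logit formula βj·p(1 − p):

# Marginal effects for a fitted logit, matching the formulas above
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-(0.2 + 0.8 * x)))).astype(int)
res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

print(res.get_margeff(at="mean").summary())      # evaluated at the sample means
print(res.get_margeff(at="overall").summary())   # averaged over all observations
phat = res.predict()
print(np.mean(res.params[1] * phat * (1 - phat)))  # manual average marginal effect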
Hypothesis testing
 Standard errors and t-ratios will automatically be calculated by the econometric software
package used, and hypothesis tests can be conducted in the usual fashion for LPM
 Usually hypothesis testing can be conducted using z-statistic for NLBM
 One of the statistic that is used in testing hypotheses in logit and probit models is the
likelihood ratio (LR) test.
 It is twice the difference in the log-likelihoods:
LR = 2(Lur − Lr) (5.25)
,where Lur is the log-likelihood value for the unrestricted model (the maximum likelihood
function when maximized with respect to all the parameters) and the Lr is the log-likelihood
value for the restricted model (the maximum when maximized with restrictions).
 Since Lur ≥ Lr, LR ≥ 0 and it is usually strictly positive. The idea is that, because the MLE
maximizes the log-likelihood function, dropping variables generally leads to a smaller (or at
best equal) log-likelihood
 In binary choice models, the log-likelihood values themselves are negative, because the
likelihood is a product of probabilities strictly between 0 and 1, whose logarithms are negative.
 Multiplication by two is required so that LR has an appropriate χ² distribution under H0.
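 A sketch of the LR test (and of the pseudo R² discussed on the next slide) using the log-likelihoods stored by statsmodels; the data are hypothetical:

# LR test of the fitted logit against the intercept-only (restricted) model
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=400)
y = (rng.uniform(size=400) < 1 / (1 + np.exp(-x))).astype(int)
res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

lr = 2 * (res.llf - res.llnull)             # eq. (5.25): restricted = intercept only
pval = stats.chi2.sf(lr, df=res.df_model)   # df = number of excluded regressors
print(lr, pval)
print(res.prsquared)                        # McFadden pseudo R2 = 1 - llf/llnull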
Measures of Goodness of fit
 When the dependent variable to be measure is dichotomous, there is a problem of using the
conventional R2 as a measure of goodness of fit
 Various R² measures have been devised for logit and probit. However, none is a measure of
the closeness of observations to an expected value as in regression analysis; all are ad hoc
 It is also possible to calculate a pseudo R-squared (measures based on likelihood ratios) for
logit and probit comparable to the standard R2 for linear regressions by computing the
squared correlation between the fitted and actual values of yi.
 Let Lur be the log-likelihood maximized with respect to all the parameters
and L0 be the maximum when maximized with only an intercept.
 McFadden (1974) suggested:
pseudo R² = 1 − Lur/L0 (5.26)
where Lur is the log-likelihood function for the estimated model and L0 is the log-
likelihood function with only an intercept. More practically,
McFadden's R² = 1 − [LL(α̂, β̂)/LL(α̂)]
 If the covariates have no explanatory power, Lur/L0 = 1 and the pseudo R-squared is zero.
 Since usually |Lur| < |L0|, the pseudo R² is greater than zero.
 If Lur were zero, the pseudo R-squared would equal unity. But it cannot be zero, as that
would require all the estimated probabilities to be one when yi = 1 and zero when yi = 0.
Multinomial Models
 Multinomial models are multi-equation models used in unordered dependent
variables case
 Such unordered choice model can be motivated by a random utility model. For
the ith consumer faced with J choices, suppose that the utility of choice j is:
Uij = xij′θ + εij (5.27)
 If the consumer makes choice j in particular, then we assume that Uij is the
maximum among the J utilities.
 Hence, the statistical model is driven by the probability that choice j is made,
which is:
Pr(Uij > Uik) for all other k ≠ j (5.28)
 The probability that a particular consumer will choose a particular alternative
is given by the probability that the utility of that alternative to that consumer is
greater than the utility to that consumer of all other alternatives
 Two models in this category are multinomial logit and multinomial probit
models. These are simple extension to the logit and probit models when the
dependent variable can take more than two categorical values
Multinomial Logit Model
 Multinomial logit models are a straightforward extension to the logit model when the
dependent variable can take more than two categorical values
 The multinomial logit is developed specifically for the case with more than two
qualitative choices and the choice is made simultaneously
 Suppose we face a multiple choice response variable such as means of transportation
choices where the order of the choices does not matter
 A traveler has a choice for a trip to work either by: Car, Bus or Train, Or
 Choice of a major: economics, marketing, management, finance or accounting.
 How can we build and estimate a model for choosing from more than two different
choices?
 If the traveler would have the choice between car and bus only we would estimate a
model of the probability of going by car against going by bus, or vice versa
 With the extra choice we just add an extra model explaining train against bus
 So we go from 1 equation for 2 alternatives to 2 equations for 3 alternatives
Cont‘d---
 In general:
 Multinomial logit models are multi-equation models
 A response variable with K categories will generate K-1 equations
 Each of these K-1 equations is a binary logistic regression comparing a group
with the reference group
 Notice that multinomial logit can only be used when the choices are actually
mutually exclusive. That means that a person chooses exactly one of the
options, not more and not less
 Why multinomial logit and not probit?
 The reason for this is simple
 It is easy to extend the binary model to a multinomial model using the
logistic distribution
 It is not as easy using the normal distribution
Multinomial choice probabilities
 Pij = P(individual i chooses alternative j)
 The first category used as a reference

pi1 = P(individual i chooses alternative 1)
= 1/[1 + exp(β12 + β22xi) + exp(β13 + β23xi)], j = 1 (5.29)
pi2 = exp(β12 + β22xi)/[1 + exp(β12 + β22xi) + exp(β13 + β23xi)], j = 2 (5.30)
pi3 = exp(β13 + β23xi)/[1 + exp(β12 + β22xi) + exp(β13 + β23xi)], j = 3 (5.31)
Estimation
 Multinomial logistic regression simultaneously estimates the K-1 equations using
maximum likelihood estimation
 The assumptions underlying the model assure that the sum of the probabilities of all
the categories add up to 1
 There is no order within the categories of Y( any of them can be the baseline for
comparison)
 Since one alternative is the base for all other alternatives
 The estimated parameters per equation reflect the effect on the log-odds ratio of
alternative under consideration against the base alternative. That is:
ln(Pji/Pbi) = αj + βj Xji + εji, j = 1, …, J − 1 (5.32)
where J is the number of alternatives and b is the base alternative
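 A sketch of a multinomial logit in statsmodels, loosely mirroring the beer-brand example that follows; the data-generating process and all numbers are hypothetical:

# Multinomial logit with 3 unordered alternatives coded 0, 1, 2 (0 = base)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 600
female = rng.integers(0, 2, n)
age = rng.uniform(20, 60, n)
# random-utility DGP: Gumbel errors make the true model a multinomial logit
u = np.column_stack([np.zeros(n),
                     -8 + 0.25 * age + 0.4 * female,
                     -15 + 0.45 * age + 0.4 * female]) + rng.gumbel(size=(n, 3))
y = u.argmax(axis=1)

X = sm.add_constant(np.column_stack([female, age]))
res = sm.MNLogit(y, X).fit(disp=0)
print(res.summary())        # one equation per non-base category, as in eq. (5.32)
print(np.exp(res.params))   # relative-risk (odds) ratios against the base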
Interpretation
 We will get K-1 sets of estimates. One set of estimates for the effects of the
independent variable for each comparison with the reference level
 The sign of a coefficient reflects the direction of change in the odds ratio Pji/Pbi in response
to a ceteris paribus change in the value to which the coefficient is attached
 It does not reflect the direction of change in the individual probability Pji
 Significance of coefficients: similar to the binomial logit model
 Goodness of fit: similar to the binomial logit model with Chi-square test and pseudo-
R2
 Example
 Say people can choose among three brands of beer
 1- Dashen
 2- Harer
 3 -Meta
 We want to explain the preferences of male and female and the relation
between age and preference
Cont’d---
 Multinomial logistic regression Number of obs= 735
LR chi2(4)= 185.85 Prob > chi2= 0.0000
Log likelihood = -702.9707 Pseudo R2 = 0.1168
------------------------------------------------------------------------------------
brand | Coef. Std. Err. z P>|z| [95% Conf.Interval]
-------------+---------------------------------------------------------------------
2 |
female |.523 .194 2.70 0.007 .143 .904
age |.368 .055 6.69 0.000 .260 .476
_cons |-11.7 1.77 -6.64 0.000 -15.2 -8.2
-------------+-------------------------------------------------------------------
3 |
female | .465 .226 2.06 0.039 .022 .909
age | .685 .062 10.95 0.000 .561 .808
_cons|-22.7 2.05 -11.04 0.000 -26.7 -18.6
------------------------------------------------------------------------------------
Cont’d---
 The output above has two parts, labeled with the categories of the
outcome variable brand.
 They correspond to two equations:
o log(P(Harer)/P(Dashen)) = β20 + β21*female + β22*age
o log(P(Meta)/P(Dashen)) = β30 + β31*female + β32 *age,
 Log-odds
 We would say that for a one unit increase in age , we expect a Beta increase
or decrease in the log odds of Harer to Dashen, given all of the other
variables in the model are held constant
 From the above table:
With a one-year increase in age, the log of the ratio of the two probabilities,
P(Harer=2)/P(Dashen=1), will increase by 0.368
And the log of the ratio of the two probabilities P(Meta=3)/P(Dashen=1)
will increase by 0.685
Cont’d---
• Therefore, we can say that, in general, the older a person is, the more he/she will
prefer Harer and Meta relative to Dashen
• Which beer is actually preferred the most depends on the values of the constant and
the gender of the person
• We can calculate K-1 probabilities with the K-1 equations
• The Kth probability is 1 minus the sum of the K-1 calculated probabilities
 Odds ratio
 Parameters are interpreted as the effect of a variable on the log odds
 The odds-ratio is the ratio of the probability to choose an alternative relative to the
base
 The odds-ratio is also called relative risk
 exp( ) is the effect of the independent variable on the "odds ratio“
 The effect of one extra year of age on the relative risk of choosing Harer over
Dashen is exp(0.3682) = 1.45
 What does this mean? It means choosing Harer over Dashen becomes more likely
when age increases, because the effect is larger than one
Multinomial probit
 One important assumption of the estimation of multinomial logit is that the odds
ratios, Pij/Pik are independent of the other alternatives or choices/ remaining
probabilities .
 This assumption follows from the initial assumption that the disturbances are
independent and homoskedastic. This is termed the independence of irrelevant
alternatives (IIA) assumption
 Hausman and McFadden (1984) suggest that if a subset of the choice set is truly
irrelevant, omitting it from the model altogether will not change parameter estimates
systematically.
 Exclusion of these choices will be inefficient but will not lead to inconsistency.
 If the remaining odds ratios are not truly independent from these alternatives, then
the parameters obtained when these choices are excluded will be inconsistent
 An alternative to the multinomial logit model, particularly when it suffers from
violation of the IIA assumption, is the multinomial probit model
 Hence, when the IIA assumption fails, the natural alternative is multinomial probit
model
 The practical problem associated with multinomial probit model was the difficulty to
evaluate the multiple integral of the normal probability for dimensions higher than 2.
Ordered Choice Models
 The choice options in multinomial models have no natural ordering or
arrangement. However, in some cases choices are ordered in a specific way.
Examples include:
 Results of opinion surveys in which responses can be strongly disagree,
disagree, neutral, agree or strongly agree
 Assignment of grades or work performance ratings. Students receive
grades A, B, C, D, F which are ordered on the basis of a teacher’s
evaluation of their performance.
 Employees are often given evaluations on scales such as outstanding,
Excellent, Very Good, Good, Fair and Poor
 Levels of employment are unemployed, part-time, or full-time
 Educational attainment can be analyzed as illiterate, 1st cycle, 2nd cycle,
high school, tertiary level
Cont’d---
 When modeling these types of outcomes numerical values are assigned
to the outcomes, but the numerical values are ordinal, and reflect only
the ranking of the outcomes
 Ordered models assume there's some underlying, unobservable true
outcome variable, occurring on an interval scale
 We don't observe that interval-level information about the outcome, but
only whether that unobserved value crosses some threshold(s) that put
the outcome into a lower or a higher category, categories which are
ranked, revealing ordinal but not interval-level information.
 Both ordered logit and ordered probit can be used to estimate ordered
choices
Ordered Probit Model
 Example on labour participation
 Suppose married female answer the question ’how much would you like
to work?’ The choices are:
 Not work = 0
 Part time = 1
 Full time =2
 Thus, the value we assign to each outcome is no longer arbitrary
 To analyze the question, we introduce a latent variable (a variable that is
not observed)
 we introduce the latent variable y*
o The latent variable takes a low value if a woman does not want to work (y=0)
o A high value if she wants to work full-time (y=2)
o And an intermediate value if she wants to work part-time (y=1)
 The relation between y and y* is obvious
Cont’d---
 Introduce thresholds γ1 and γ2 to account for the low,
intermediate and high choices
 The formal relationship between y and y* is:
yi = 0 if yi* ≤ γ1 (5.33)
yi = 1 if γ1 < yi* ≤ γ2 (5.34)
yi = 2 if γ2 < yi* (5.35)
 If the number of choices increases, the number of gammas increases
 Using the thresholds, the ordered probit model is:
P  y i  0   P  y i   1  
P  X i    i   1     1 - X i   (5.36)
P  y i  1   P  1  y i   2 
P X i    i   2     2 - X i   -   1 - X i   (5.37)
P  y i  2   P  y i   2 
P X i    i   2   1 -   2 - X i   (5.38)
Cont’d---
 The ordered probit procedure uses maximum likelihood to estimate:
 The parameters
 And the values of the thresholds or cutoff points (intercepts)
 Replacing the normal distribution by the logistic distribution gives ordered
logit
 Most software includes options for both ordered probit, which depends on
the errors being standard normal, and ordered logit, which depends on the
assumption that the random errors follow a logistic distribution
 The values of the parameters and of the observed explanatory variables
determine the value of y*
 This value, together with the values of the thresholds predicts the outcome for
individual i
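 A minimal sketch of an ordered probit in statsmodels (OrderedModel); the slope and the thresholds are estimated jointly. Data and cutoffs are hypothetical; distr="logit" would give ordered logit instead:

# Ordered probit for a 3-category outcome (0 = none, 1 = part time, 2 = full time)
import numpy as np
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(6)
n = 500
x = rng.normal(size=n)
ystar = 0.8 * x + rng.normal(size=n)          # latent desired hours of work
y = np.digitize(ystar, bins=[-0.5, 0.7])      # thresholds create the categories

res = OrderedModel(y, x[:, None], distr="probit").fit(method="bfgs", disp=0)
print(res.params)   # slope beta followed by the threshold (cut point) parameters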
Ordered Logit Model
 The ordered logit model (also ordered logistic regression or proportional odds
model), is a regression model for ordinal dependent variables
 It can be thought of as an extension of the logistic regression model that
applies to dichotomous dependent variables, allowing for more than two
(ordered) response categories
 For example, if one question on a survey is to be answered by a choice among
"poor", "fair", "good", "very good", and "excellent", and the purpose of the
analysis is to see how well that response can be predicted by the responses to
other questions, some of which may be quantitative, then ordered logistic
regression may be used.
 One important assumption in ordered logit model is the parallel regression
assumption or the proportional odds assumption.
 Noting that ordered logit regression with J alternatives is equivalent to J-1
binary regressions, the proportional odds assumption states that the slope
coefficients are identical across each regression.
Cont’d---
 The parallel regression assumption can be tested using Wald test developed by Brant
(1990). The Wald statistic for the β coefficient is:
Wald = [β̂/se(β̂)]² (5.39)
which is distributed chi-square with 1 degree of freedom
 When the parallel regression assumption is rejected, an alternative model at least in
the ordered logit framework is the generalized ordered logit model.
 The GOL model allows the coefficients to differ across each of the J-1 comparisons
 MLE provides results that include “cut points” (thresholds) and coefficients. The cut
points reflect the predicted cumulative probabilities at covariate values of zero
 OLR essentially runs multiple equations- one less than the number of options on one’s
scale
 For example, assume that you have a 4 point scale, 1=not at all optimistic, 2=not very
optimistic, 3=somewhat optimistic, and 4=very optimistic.
 The first equation compares the likelihood that y=1 to the likelihood that y does not
=1 (that is, y=2 or 3 or 4)
 The second equation compares the likelihood that y=1 or 2 to the likelihood that y=3 or 4.
 The third equation compares the likelihood that y=1, 2, or 3 to the likelihood that y=4.
Cont’d---
 Note that OLR only reports one parameter estimate for each independent
variable. That is, it constrains the parameter estimates to be constant across
categories.
 It assumes that the coefficients for the variables would not vary if one actually
separately estimated the different equations.
 Suppose yn* is a continuous latent variable which is a linear function of the
explanatory variables
yn* = Xnβ + εn (5.40)
and can be ‘mapped’ onto an ordered multinomial variable as follows:
yn = 1 if γ0 < yn* ≤ γ1 (5.41)
yn = j if γj−1 < yn* ≤ γj (5.42)
yn = J if γJ−1 < yn* ≤ γJ (5.43)
with γ0 < γ1 < … < γj < … < γJ
Cont’d---
 The probabilities become:
P(Yn = j | Xn) = P(γj−1 < yn* ≤ γj)
= P(γj−1 − Xnβ < εn ≤ γj − Xnβ)
= F(γj − Xnβ) − F(γj−1 − Xnβ)
 Ordered logit model:
F(γj − Xnβ) = exp(γj − Xnβ)/[1 + exp(γj − Xnβ)]
 Interpretation of parameters (marginal effects):
∂P(Yn = 1)/∂xnk = −{1/[1 + exp(Xnβ − γ1)]}{1 − 1/[1 + exp(Xnβ − γ1)]}βk (5.44)
∂P(Yn = j)/∂xnk = ({1/[1 + exp(Xnβ − γj−1)]}{1 − 1/[1 + exp(Xnβ − γj−1)]}
− {1/[1 + exp(Xnβ − γj)]}{1 − 1/[1 + exp(Xnβ − γj)]})βk (5.45)
∂P(Yn = J)/∂xnk = {1/[1 + exp(Xnβ − γJ−1)]}{1 − 1/[1 + exp(Xnβ − γJ−1)]}βk (5.46)
Cont’d---
 Define Z = Xβ + U with no intercept and U ~ N(0,1)
Pr(y = 0) = Pr(Z ≤ _cut1)
Pr(y = 1) = Pr(_cut1 < Z ≤ _cut2)
Pr(y = 2) = Pr(_cut2 < Z)
        |            |
  y = 0 |   y = 1    |   y = 2
--------|------------|------------
      _cut1        _cut2
 Disadvantages of OLM
 Assumption of equal slopes βk (the parallel regression assumption)
 Biased estimates if the assumption of strictly ordered outcomes does not hold
 Treat outcomes as non-ordered unless there are good reasons for imposing a
ranking
Limited Dependent Variable (LDV) Models---
 A LDV is a continuous variable with a lot of repeated observations at the lower or
upper limit. Examples include the quantity of a product consumed, the number of
hours that an individual work, etc.
 LDVM arise when the variable of interest is constrained to lie between zero and one,
as in the case of a probability, or is constrained to be positive, as in the case of wages
or hours worked.
 Most often, researchers encounter data where some values of the dependent
variable are missing from a randomly drawn sample, probably due to
sample design such as top-coding or institutional constraints. In such cases, the
data is said to be censored.
 Such censoring problem is solved by censored regression model by utilizing the
information we have on whether the missing values are above or below a certain
threshold.
 The Tobit model was originally developed to deal with corner solution outcomes, but it is
used to estimate models of both kinds (corner solution outcomes, where the dependent
variable assumes zero values for a significant part of the data, and censoring of various
types).
 If no observations are censored, the Tobit model is the same as an OLS regression
Censoring vs Truncation
 Censored regression models are used when the value for the dependent variable is
unknown (clustered at a lower threshold, an upper threshold, or both so that values
above (or below) this cannot be observed) while the values of the independent
variables are still available
 But in truncated regression models entire observations are missing, so that the
values of both the dependent and independent variables are unknown when the dependent
variable is above (or below) a certain threshold.
 A censored sample includes consumers who consume zero quantities of a product
 But a truncated sample only includes consumers who choose positive quantities of a
product.
 A censored sample observes people who do not work, but their work hours are recorded
as zero.
 A truncated sample does not observe anything about people who do not work.
 The censored sample is representative of the population (only the mean for the
dependent variable is not) because all observations are included.
 The truncated sample is not representative of the population because some
observations are not included.
Cont’d---
 Truncation has greater loss of information than censoring (missing
observations rather than values for the dependent variable)
 Unlike the types of limited dependent variables examined so far, censored or
truncated variables may not necessarily be dummies
 Because of censoring, the dependent variable y is the incompletely observed
value of the latent dependent variable y*.
 Suppose that a privatisation is heavily oversubscribed, and you were trying to
model the demand for the shares using household income, age, education, and
region of residence as explanatory variables. The number of shares allocated to
each investor may have been capped at, say, 250, seemingly resulting in a truncated
distribution.
 In this example, even though we are likely to have many share allocations at
250 and none above this figure, all of the observations on the independent
variables are present and hence the dependent variable is censored, not
truncated.
Tobit Model
 This model is for a metric dependent variable that is limited in the sense that we
observe it only if it is above or below some cutoff level. For example,
 the wages may be limited from below by the minimum wage
 Top coding income at, say, at $300,000
 expenditures on a certain product cannot be lower than zero. In the case of a good like
food, that is not a problem, because almost everybody has food expenditures. In the
case of a good like cars this is different: you will observe a lot of people having zero
expenditure on cars
 Zero, in this case, is called a corner solution. The tobit model takes account of
corner solutions. This is also called the Tobit I model
 It is a standard regression model in which all negative values are censored at zero
 Reasoning behind:
 If we include the censored observations as y = 0, the censored observations on the left will
pull down the end of the line, resulting in underestimates of the intercept and overestimates
of the slope.
 If we exclude the censored observations and just use the observations for which y>0 (that is,
truncating the sample), it will overestimate the intercept and underestimate the slope.
Cont’d---
 The degree of bias in both will increase as the number of observations that take on the value of
zero increases.
 The Tobit model uses all of the information, including info on censoring and provides
consistent estimates
 The standard tobit model:
yi* = xi′β + εi
yi = yi* if yi* > 0 (5.47)
yi = 0 if yi* ≤ 0
 The latent variable is often explained intuitively as indicating that there are people that are
willing to have negative outcomes on the dependent variable.
 For instance on car sales. Mathematically that is what we are assuming
 However, intuitively that does not make sense
 To illustrate, suppose that we wanted to model the demand for shares, as discussed earlier, as a
function of income (x2i), age (x3i), education (x4i), and region of residence (x5i). The Tobit
model would be
yi* = β1 + β2x2i + β3x3i + β4x4i + β5x5i + ui
yi = yi* for yi* < 250 (5.48)
yi = 250 for yi* ≥ 250
Cont’d---
 yi* represents the true demand for shares (i.e. the number of shares
requested) and this will only be observable for demand less than 250.
 It is important to note in this model that 2, 3, etc., represent the impact on
the number of shares demanded (of a unit change in x2i, x3i, etc.) and not the
impact on the actual number of shares that will be bought (allocated).
 The Tobit model is also referred to as the censored regression model
 Example the Income of y*=4,000 will be censored as y=3,000 with top
coding of 3,000
 Censoring from below and Censoring from above
 Censoring from below
 The actual value for the dependent variable y is observed if the latent variable y* is
above the limit and the limit is observed for the censored observations.
 We observe the actual hours worked for people who work and zero for people
who do not work.
Cont’d---
 Censoring from above
 The actual value for the dependent variable y is observed if the latent variable y* is
below the limit and the limit is observed for the censored observations.
 If people make below $100,000, we observe their actual income and if they
make above $100,000, we record their income as 100,000 (censored values).
 The model consists of two parts
 The probability of observing a zero value
 The distribution of the positive values
 The probability of observing a zero value:
P(yi = 0) = P(yi* ≤ 0) = P(εi ≤ −xi′β)
= P(εi/σ ≤ −xi′β/σ) = Φ(−xi′β/σ) = 1 − Φ(xi′β/σ) (5.49)
Estimation and interpretation
 It is also a nonlinear model and similar to the probit model. It is estimated
using maximum likelihood estimation techniques. The likelihood function for
the tobit model takes the form:
L(β, σ) = ∏_{yi > 0} (1/σ)φ[(yi − xi′β)/σ] · ∏_{yi = 0} [1 − Φ(xi′β/σ)] (5.50)
 This is an unusual function: it consists of two terms, the first for non-censored
observations (the pdf) and the second for censored observations (the cdf)
 The maximum likelihood estimates of  and  are obtained by maximizing the
log-likelihood.
 The estimated tobit coefficients are the marginal effects of a change in xj on y*,
the unobservable latent variable and can be interpreted in the same way as in a
linear regression model.
 But such an interpretation may not be useful since we are interested in the
effect of X on the observable y (or change in the censored outcome).
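 Statsmodels has no built-in Tobit, so here is a minimal sketch of the likelihood in eq. (5.50), left-censored at zero and maximized with scipy; the data are simulated and all values hypothetical:

# Tobit by maximum likelihood: pdf term for y > 0, cdf term for y = 0
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(7)
n = 500
x = rng.normal(size=n)
ystar = 0.5 + 1.0 * x + rng.normal(size=n)    # latent variable
y = np.maximum(ystar, 0.0)                    # censoring at zero
X = np.column_stack([np.ones(n), x])

def neg_loglik(theta):
    *beta, log_sigma = theta
    sigma = np.exp(log_sigma)                 # keeps sigma positive
    xb = X @ np.asarray(beta)
    ll = np.where(y <= 0,
                  stats.norm.logcdf(-xb / sigma),                       # censored
                  stats.norm.logpdf((y - xb) / sigma) - np.log(sigma))  # uncensored
    return -ll.sum()

fit = optimize.minimize(neg_loglik, x0=[0.0, 0.0, 0.0], method="BFGS")
print(fit.x[:2], np.exp(fit.x[2]))            # beta-hat and sigma-hat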
Cont’d---
 It can be shown that the change in y is found by multiplying the coefficient by
Pr(a < y* < b), that is, the probability of being uncensored. Since this probability is a
fraction, the marginal effect is attenuated relative to the coefficient.
 In the above, a and b denote the lower and upper censoring points. For example, in left
censoring, the limits will be: a = 0, b = +∞.


 The t-statistics for each estimated β in the Tobit estimates are computed using
the estimated standard errors.
 The Wald test and LR test are used to test for overall stability or multiple
exclusion restrictions
Expectations and Marginal Effects
 There are three expectations worth noting:
 Expected value of the latent variable: E (y* | x )
 The “conditional” expectation: E (y | y > 0, x ) , which is the expected
value of y for the subpopulation where y is positive.
 The “unconditional expectation”: E (y | x )
 The expected value of the latent variable is less important: we do not know y*
anyway, the “conditional” expectation is crucial
 The marginal effects for the latent variable are the coefficients.
- Marginal effect on the desired hours of work.
 Marginal effects for the censored sample(Tobit model)
-For the censored sample (with zeros and positive amounts)
- Marginal effect on the actual hours of work for workers and non-workers.
 For the truncated sample (with positive amounts)
- Marginal effect on the actual hours of work for workers.
 We use the different marginal effects depending on what variable is of interest
in the study.
Cont’d---
 Assume the model explaining the number of affairs married people have
 The number of affairs is explained by: Male, Number of years married, Children (0/1),
Religiousness, Education years, Occupation (scale 1-7) & Self rating of marriage (scale 1-5)
# affairs OLS estimates Tobit estimates
estimate t-value Estimate Z-value
Constant 5.87 5.16 7.61 1.95
Male 0.05 0.17 0.95 0.89
# years married -0.05 -2.25 -0.19 -2.37
Children (0/1) -0.14 -0.41 1.02 0.80
Religiousness -0.48 -4.27 -1.70 -4.19
Years of education -0.01 -0.21 0.03 0.11
Occupation 0.10 1.18 0.21 0.66
Self-rating of marriage -0.71 -5.93 -2.27 -5.47
R-square 0.13 0.15

 The signs of the effects of children and years of education change when using tobit instead of
OLS
 The effect of self rating of marriage is much higher using tobit
Cont’d---
 Before moving on, two important limitations of Tobit modelling should be
noted.
 First, such models are much more seriously affected by non-normality and
heteroscedasticity than are standard regression models, and biased and
inconsistent estimation will result.
 Second, the tobit model requires it to be plausible that the dependent variable
can have values close to the limit.
 There is no problem with the privatisation example discussed above since the
demand could be for 249 shares.
 However, it would not be appropriate to use the Tobit model in situations
where this is not the case, such as the number of shares issued by each firm in
a particular month.
 For most companies, this figure will be exactly zero, but for those where it is
not, the number will be much higher and thus it would not be feasible to
issue, say, 1 or 3 or 15 shares.
 In this case, an alternative approach should be used.
 The truncated regression can be considered
Truncation
 Truncation is a situation where a subset of the population in a sampling scheme is
excluded on the basis of the dependent variable y before sampling.
 A truncated dependent variable, on the other hand, occurs when the observations for
both the dependent and the independent variables are missing when the dependent
variable is above (or below) a certain threshold.
 Example: Sample consists only of people with values of Y below a limit c. (truncated
regression model.)
 In the case of truncation, we do not have a random sample from the underlying
population. The rule that was used to include units in the sample is, however, known.
No information on the explanatory variables is available for the excluded units.
 Thus, dealing with truncated data is really a sample selection problem because the
sample of data that can be observed is not representative of the population of interest
- the sample is biased, very likely resulting in biased and inconsistent
parameter estimates.
 This is a common problem, which will result whenever data for buyers or users only
can be observed while data for non-buyers or non-users cannot.
Cont’d--
 The distribution of the positive values:
o This is equal to the distribution of the dependent variable given that it is
positive
o This is a truncated normal distribution with expectation:
E[yi | yi > 0] = xi′β + E[εi | εi > −xi′β]
= xi′β + σ·φ(xi′β/σ)/Φ(xi′β/σ) (5.51)
φ(·) is the normal density function
 The marginal effect on the expected value of y of a change in explanatory
variable k is:
∂E[yi]/∂xik = βk·Φ(xi′β/σ) (5.52)
 Notice that
 The expectation depends on the x-values of person i. And, therefore, marginal effects differ
between individuals
Cont’d--
 For truncated data, a more general model is employed that contains two
equations - one for whether a particular data point will fall into the observed
or constrained categories and another for modelling the resulting variable.
 The second equation is equivalent to the tobit approach.
 This two-equation methodology allows for a different set of factors to affect
the sample selection
 If it is thought that the two sets of factors will be the same, then a single
equation can be used and the tobit approach is sufficient.
 In many cases, however, the researcher may believe that the variables in the
sample selection and estimation equations should be different.
 Thus the equations could be
ai* = α1 + α2z2i + α3z3i + α4z4i + … + αm zmi + εi (5.53)
yi* = β1 + β2x2i + β3x3i + β4x4i + … + βk xki + ui (5.54)
where yi = yi* for ai* > 0 and yi is unobserved for ai* ≤ 0.
ai* denotes the relative ‘advantage’ of being in the observed sample relative to the
unobserved sample.
Cont’d--
 The first equation determines whether the particular data point i will be
observed or not, by regressing a proxy for the latent (unobserved) variable,
ai*, on a set of factors, zi.
 The second equation is similar to the tobit model.
 Ideally, the two equations will be fitted jointly by maximum likelihood.
 This is usually based on the assumption that the error terms are multivariate
normally distributed, allowing for any possible correlations between them.
 Maximization of the likelihood function of the truncated regression model
with respect to β and σ² gives the maximum likelihood estimators.
 The MLEs are consistent and approximately normal.
 The inference, including standard errors and log-likelihood statistics, is
standard.
 However, while joint estimation of the equations is more efficient, it is
computationally more complex and hence a two-stage procedure popularised
by Heckman (1976) is often used.
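 A minimal sketch of the Heckman two-step idea under joint normality: a first-stage probit for selection, then OLS on the selected sample with the inverse Mills ratio added as a regressor. Data, regressors, and the error correlation are all hypothetical:

# Heckman two-step: selection probit, then OLS with the inverse Mills ratio
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 2000
z = rng.normal(size=n)                     # selection-equation regressor
x = rng.normal(size=n)                     # outcome-equation regressor
e = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=n)
selected = (0.3 + 1.0 * z + e[:, 0]) > 0   # a_i* > 0: unit is observed
y = 1.0 + 0.5 * x + e[:, 1]                # outcome, seen only if selected

# Step 1: probit for selection on the full sample
Zc = sm.add_constant(z)
probit = sm.Probit(selected.astype(int), Zc).fit(disp=0)
idx = Zc @ probit.params
imr = stats.norm.pdf(idx) / stats.norm.cdf(idx)   # inverse Mills ratio

# Step 2: OLS on the selected observations with the IMR as extra regressor
Xs = sm.add_constant(np.column_stack([x[selected], imr[selected]]))
res = sm.OLS(y[selected], Xs).fit()
print(res.params)   # intercept, beta on x, IMR coefficient (= rho * sigma_u)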
