Chapter 4 (Compatibility Mode)
Regression on one quantitative variable and one qualitative variable with two
classes, or categories

Yi = β0 + β1Di + β2Xi + Ui (5.1)

where Yi = annual salary of the ith accountant, Xi = years of experience, and Di = 1 if male, 0 otherwise
Example: Assume the following regression result from the model given above, with Y the
hourly wage rate, D a dummy for men, and X years of schooling. The dependent
variable is expressed in US dollars ($). Standard errors are given in parentheses:

Ŷ = … + 21.9D + 2.4X
    (8.16) (4.30) (0.63)
Use the regression results to calculate how much higher the average hourly wage rate is for men.
Cont’d---
First we have to check whether the coefficient on the male dummy is significant.
With a t-value of β/se(β) = 21.9/4.3 ≈ 5.09, the coefficient is significantly different from
zero at any conventional significance level. The marginal effect measured by this regression
says that men earn $21.9 more per hour than women do on average, holding years of schooling
constant.
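The significance check above can be sketched numerically; the figures 21.9 and 4.3 are taken from the regression result quoted in the text:

```python
# t-test for the male dummy: t = estimated coefficient / standard error.
coef_male = 21.9  # estimated hourly wage premium for men, in $
se_male = 4.3     # its standard error

t_value = coef_male / se_male
print(round(t_value, 2))  # about 5.09, well above the 5% critical value (~1.96)
```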
Regression on one qualitative variable with more than two classes
Example: Suppose we want to look at the effect of location (Addis Ababa, Hawassa, Arba Minch) on
a person's salary in thousands of Birr (Y). We create three dummy variables for the towns:
Addis Ababa: D1 = 1 if from Addis Ababa, 0 otherwise; Hawassa: D2 = 1 if from Hawassa, 0
otherwise; and Arba Minch: D3 = 1 if from Arba Minch, 0 otherwise.
Suppose the result of the model Y = γ0 + γ1D1 + γ2D2 + e becomes Y = γ0 + 3.8D1 + 2.1D2
Why was Arba Minch dropped? Including all three dummies alongside the intercept would cause
perfect multicollinearity (the dummy variable trap), so one category is left out and serves as the reference.
Interpretation: γ1 = 3.8 means that people in Addis Ababa earn Birr 3,800 more
on average relative to people in Arba Minch, while those in Hawassa earn Birr 2,100 more on
average relative to people in Arba Minch.
Cont’d---
Let us return to the accountants' salary regression above, but now assume that in addition to
years of experience and sex, the skin colour (black = 0 and white = 1) of the accountant is also an
important determinant of salary. The equation can be re-written as:

Yi = α1 + α2D2i + α3D3i + βXi + ui (5.2)

Assuming E(ui) = 0, we can obtain the following from the regression:

Mean salary for a black female accountant:
E(Yi | D2 = 0, D3 = 0, Xi) = α1 + βXi (5.3)

Mean salary for a black male accountant:
E(Yi | D2 = 1, D3 = 0, Xi) = (α1 + α2) + βXi (5.4)

Mean salary for a white female accountant:
E(Yi | D2 = 0, D3 = 1, Xi) = (α1 + α3) + βXi (5.5)

Mean salary for a white male accountant:
E(Yi | D2 = 1, D3 = 1, Xi) = (α1 + α2 + α3) + βXi (5.6)
Cont’d---
2. The Slope dummy variables
As seen above, the dummy variable can work as an intercept shifter.
Sometimes it is reasonable to believe that the shift should take place in the slope
coefficient instead of the intercept.
If we go back to the human capital model it is possible to argue that the difference in
wage rate between men and women could be due to differences in their return to
education. This would mean that men and women have slope coefficients that are
different in size.
A model that controls for differences in the slope coefficient across the categories of
the qualitative variable can be expressed in the following way:

ln Y = β0 + (β1 + β2D)X + Ui
     = β0 + β1X + β2(DX) + Ui (5.7)

In this case the slope coefficient for X equals β1 when D = 0 and β1 + β2 when D = 1.
Hence, a way to test whether the return to education differs between men and women is
to test whether β2 is different from zero (the coefficient for the interaction effect)
Cont’d---
Example: Assume the results from 22 observations are presented below, with standard errors
in parentheses:
lnY = 4.11 + 0.024X + 0.014DX
(0.031) (0.003) (0.002)
Use the regression results to investigate if there is a difference in the return to education
between men and women.
To answer that question we simply test the estimated coefficient for the cross product. Doing
that, we receive a t-value of 7 which is above any critical values of conventional significance
level. Hence, we can conclude that the returns to education differ between men and women.
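A small numeric sketch of the interaction result above (values taken from the quoted regression):

```python
# lnY = 4.11 + 0.024X + 0.014(D*X): return to schooling is b1 for women (D=0)
# and b1 + b2 for men (D=1); the t-test on b2 checks whether they differ.
b1, b2, se_b2 = 0.024, 0.014, 0.002

t_value = b2 / se_b2           # 7.0: interaction term is significant
return_women = b1              # slope when D = 0
return_men = b1 + b2           # slope when D = 1
print(round(t_value, 1), return_women, round(return_men, 3))
```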
Suppose, for example, that we wanted to model the probability that a firm i will pay a dividend p(yi = 1)
as a function of its market capitalization (x1i, measured in millions of US dollars), and we fit the
following line:
P̂i = −0.3 + 0.012x1i

where P̂i denotes the fitted or estimated probability for firm i.
This model suggests that for every $1m increase in size, the probability that the firm will pay a dividend increases by 0.012 (or 1.2%).
A firm whose stock is valued at $50m will have a −0.3 + 0.012 × 50 = 0.3 (or 30%) probability of making a dividend payment.
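The fitted LPM line can be evaluated directly; this also previews the boundary problem discussed next:

```python
# Fitted linear probability model from the text: P_hat = -0.3 + 0.012 * x1,
# with firm size x1 in millions of dollars.
def lpm_prob(x1):
    return -0.3 + 0.012 * x1

print(round(lpm_prob(50), 2))   # 0.3: a $50m firm has a 30% fitted probability
print(round(lpm_prob(10), 2))   # -0.18: fitted "probability" below zero
print(round(lpm_prob(120), 2))  # 1.14: fitted "probability" above one
```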
Problems with LPM
Though LPM is simple to interpret, it has a number of shortcomings:
o Except in cases where the probability does not depend on any of the covariates, the LPM is
always heteroskedastic (the disturbance term changes systematically with the x's). Therefore,
heteroskedasticity-robust standard errors are always used in the context of LDV models.
o The constant slope (β) in the LPM is less intuitive in many cases
o The predicted probability can be less than 0 or greater than 1, which violates the intuition
that probabilities should lie between 0 and 1
In the earlier example, for any firm whose value is less than $25m, the model-predicted
probability of dividend payment is negative, while for any firm worth more than about $108m,
the probability is greater than one
An obvious solution is to censor/truncate the probabilities at 0 or 1, so that a probability
of -0.3, say, would be set to zero, and a probability of, say, 1.2, would be set to 1.
We need a procedure to translate our linear regression results into true probabilities
Two common transformations that result in sigmoid functions are probit and logit
transformations.
Cont’d---
The transformation function F maps the linear combination into [0, 1] and
satisfies, in general, F(−∞) = 0, F(+∞) = 1, and dF(z)/dz > 0; that is, it is a
cumulative distribution function.
We want a translator such that:
The closer to -∞ is the value from our linear regression model, the closer to
0 is our predicted probability.
The closer to +∞ is the value from our linear regression model, the closer
to 1 is our predicted probability.
No predicted probabilities are less than 0
or greater than 1
Non-Linear Probability Models (NLPM)
The NLPM approaches (both the logit and the probit model) are able to overcome the limitation
of the LPM that it can produce estimated probabilities that are negative or greater than one.
They do this by using a function that effectively transforms the regression model so that the
fitted values are bounded within the (0,1) interval.
Visually, the fitted regression model will appear as an S-shape rather than a straight line, as
was the case for the LPM
To begin, assume the appropriate statistical experiment: draws from a Bernoulli
distribution, where the relationship between x and p is non-linear.
The probability model from the Bernoulli distribution is:

f(yi | z) = p^yi (1 − p)^(1−yi) (5.12)

where p is a parameter reflecting the probability that y = 1 (success) and 1 − p is the probability of failure.
This probability often follows an S-shaped curve. In other words, the probability that
y = 1 remains small until some threshold is crossed, at which point it rises rapidly and remains
large thereafter. This suggests a cumulative distribution function.
That is, the relationship between the probability P(y = 1 | x) and x is non-linear. The probability
is expected to follow a sigmoid or S-shape, which resembles the cumulative distribution
function (CDF) of a random variable
Cont’d---
A model whose dependent variable has an upper and a lower bound can be derived from an ordinary
regression model by mapping the dependent variable y through a sigmoid or S-shaped function. As y → - ∞ ,
the sigmoid tends to its lower asymptote whereas, as y → ∞, the sigmoid tends to the upper asymptote.
[Figure: sigmoid curve of Pr(x), rising from 0 as x → −∞ to 1 as x → +∞]
The response function (logistic or probit) is an S-shaped function, which implies a fixed change
in X has a smaller impact on the probability when it is near zero than when it is near the
middle. Thus, it is a non-linear response function.
2. Binary Logit Model (BLM)
• The logit model uses the cumulative logistic distribution to transform the model so that the
probabilities follow the S-shape given on the previous slide
• The binomial logit is an estimation technique for equations with dummy dependent variables
that avoids the unboundedness problem of the linear probability model
The BLM is non-linear and achieves this by using the cumulative logistic function:

Pi = 1/(1 + e^(−zi)), where zi = β1 + β2X2i + … + βkXki (5.13)

Again, the logit model is bounded between 0 and 1
Logits cannot be estimated using OLS but are instead estimated by Maximum
Likelihood Estimation(MLE), an iterative estimation technique that is especially useful for
equations that are nonlinear in the coefficients/parameters
Maximum likelihood estimates are those parameter values that maximize the probability of
drawing the data that are actually observed
Moreover, just as in other procedures, the estimators are obtained by optimizing
(maximizing) the so-called likelihood function:

L(β, δ²; X) = L(θ; X), θ = (β, δ²) (5.14)

The likelihood function is the joint density evaluated at the observation points.
It is easier to maximise the logarithm of the likelihood function, l(θ; y|X) = ln L(θ; y|X),
rather than the likelihood function itself.
Cont’d---
The cumulative logistic function for the logit is grounded in the concept of an odds ratio.
More formally, the logit model in eq. (5.13) is derived as follows:
Let p/(1 − p) be the odds ratio (probability of success / probability of failure); the log-odds ratio, or logit transformation, is then:

ln[P/(1 − P)] = z = β1 + β2X2i + … + βkXki (5.15)

where P is the probability that the event Y occurs, P(Y = 1)
Assume a latent variable, Z, mediates between the explanatory variables and the dummy variable Y. Then, solving for the
probability that Y = 1, we have:

P/(1 − P) = e^z
P = (1 − P)e^z = e^z − Pe^z
P + Pe^z = e^z
P(1 + e^z) = e^z
P = e^z/(1 + e^z) = 1/(1 + e^(−z))

The above logistic probability is simply denoted Λ(Z) (the cumulative logistic function), so that f(Yi | Z) = Λ(Z)^Yi [1 − Λ(Z)]^(1−Yi)
The higher Z is, the higher the probability that Y = 1.
The logistic distribution constrains the estimated probabilities to lie between 0 and 1.
The estimated probability of success/occurrence of event Y is:

p = 1/[1 + e^(−(α + βX))] (5.16)
1 − p = 1/[1 + e^(α + βX)]

If you let α + βX = 0, then p = 0.50 (success and failure are equally likely)
As α + βX gets very large, p approaches 1; as α + βX gets very small (large and negative), p approaches 0
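A quick numerical check of the logistic transformation in eq. (5.16): the logistic function inverts the log-odds and stays strictly between 0 and 1:

```python
import math

def logistic(z):
    # p = 1 / (1 + e^(-z)), the cumulative logistic function
    return 1.0 / (1.0 + math.exp(-z))

p = logistic(1.5)
print(round(math.log(p / (1 - p)), 6))  # the log-odds recover z = 1.5
print(logistic(0.0))                    # 0.5 when alpha + beta*X = 0
print(round(logistic(20), 6), round(logistic(-20), 6))  # tends to 1 and to 0
```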
3. Binary Probit Model (BPM)
Instead of using the cumulative logistic function to transform the model, the probit model
uses the cumulative normal standard distribution
Like the logit model, the Bernoulli trial of Y conditional on Z is given by f(Y | Z) = P^Y (1 − P)^(1−Y)
for the probit model.
Plugging the standard normal cumulative distribution function Φ into the above gives:

f(Yi | Z) = Φ(Z)^Yi [1 − Φ(Z)]^(1−Yi) (5.17)

When the transformation function F is the cumulative distribution function (cdf) of the standard
normal distribution, the response probability is given by:

P(Y = 1) = ∫ from −∞ to z of (1/√(2π)) e^(−t²/2) dt = Φ(z) (5.18)

where z = β1 + β2X2i + … + βkXki
• As for the logistic approach, this function provides a transformation to ensure that the
fitted probabilities will lie between zero and one.
Logit or Probit?
The logit and probit models are almost identical, and the choice between them is arbitrary (there is
little practical difference between the two), although the logit model has the advantage of simplicity:
it is much easier to work with because the function can be simplified to a linear equation in the
log-odds. Probit functions are harder to work with because they require integration, but computer
packages can calculate probits easily
In the probit model, we assume εi is distributed by the standard normal. In the logit model, we assume εi
is distributed by the logistic
Both the Probit and Logit models have the same basic structure.
1. Estimate a latent variable Z (or Y*) using a linear model. Z is unobserved; what we observe is
the dummy variable y. Z ranges from negative infinity to positive infinity. The idea behind
logit and probit is:

y = 1 if y* > 0
y = 0 if y* ≤ 0 (5.19)
2. Use a non-linear function to transform Z into a predicted Y value between 0 and 1
Both are used to predict an outcome variable that is categorical (violates the assumption of linearity in
normal regression) from one or more categorical or continuous predictor variables
Both use a CDF (the cumulative logistic or the cumulative normal function) to transform the model
so that the probabilities follow the S-shape, but they differ in the relative thickness of the tails:
the logistic has thicker tails than the normal. This difference tends to disappear as the sample size
gets large. In a small sample the two distributions can differ significantly in their results (the
slope can vary dramatically), but they are quite similar in large samples.
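The tail difference can be seen by comparing the two CDFs numerically; the 1.6 rescaling of the logistic argument (a common rule of thumb) aligns the two curves near the centre:

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (0.0, 1.0, 3.0):
    print(z, round(logistic_cdf(1.6 * z), 4), round(normal_cdf(z), 4))
# In the far tail (z = 3) the normal CDF is closer to 1: the logit has thicker tails.
```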
Estimation, hypothesis testing and Interpretation
• Since both are non-linear, maximum likelihood is usually used to estimate the parameters of
the model (of course, except LPM, binary choice models use MLE)
The likelihood function for these models is given by:

L(β | Yi, Xi) = Π(i=1..n) Φ(zi)^Yi [1 − Φ(zi)]^(1−Yi), for the probit model

L(β | Yi, Xi) = Π(i=1..n) Λ(zi)^Yi [1 − Λ(zi)]^(1−Yi), for the logit model

The log-likelihood function of these models is given as:

ln L(β | Yi, Xi) = Σ(i=1..n) {Yi ln Φ(zi) + (1 − Yi) ln[1 − Φ(zi)]}, for probit, and

ln L(β | Yi, Xi) = Σ(i=1..n) {Yi ln Λ(zi) + (1 − Yi) ln[1 − Λ(zi)]}, for logit

These functions can be optimized using standard methods to obtain the parameter values.
Maximization of the likelihood function with respect to the parameters gives the logit and
probit estimates, depending on the cdf used (logistic or normal)
The ML estimator of β is consistent and asymptotically normally distributed.
However, the estimation rests on the strong assumption that the latent error term is normally
distributed and homoscedastic. If homoscedasticity is violated, there is no easy solution.
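As a sketch of how MLE works for the logit, the Newton-Raphson iteration below maximizes the log-likelihood on a tiny made-up data set (one regressor plus an intercept; all numbers are illustrative, not from the text):

```python
import math

# Made-up sample: x is the regressor, y the binary outcome.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 1, 0, 1, 1]

b0, b1 = 0.0, 0.0
for _ in range(50):  # Newton-Raphson: update by (Hessian)^-1 * score
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        g0 += yi - p                       # score w.r.t. the intercept
        g1 += (yi - p) * xi                # score w.r.t. the slope
        w = p * (1.0 - p)                  # weight: logistic density p(1-p)
        h00 += w; h01 += w * xi; h11 += w * xi * xi
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

# Attained log-likelihood at the estimates:
loglik = sum(
    yi * math.log(1 / (1 + math.exp(-(b0 + b1 * xi))))
    + (1 - yi) * math.log(1 - 1 / (1 + math.exp(-(b0 + b1 * xi))))
    for xi, yi in zip(x, y)
)
print(round(b0, 3), round(b1, 3), round(loglik, 3))
```

At the maximum the score (g0, g1) is essentially zero, and the attained log-likelihood exceeds that of a constant-probability model, which is the sense in which MLE "maximizes the probability of drawing the data actually observed".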
Cont’d---
The probability model in the form of a regression is:

E(Y | X) = 0·[1 − F(β′X)] + 1·F(β′X) = F(β′X) (5.20)

Whatever distribution is used, the parameters of the model, like those of any other nonlinear
regression model, are not necessarily the marginal effects:

∂E(Y | X)/∂X = [dF(β′X)/d(β′X)]·β = f(β′X)·β (5.21)

where f(·) is the density function that corresponds to the cumulative distribution function F(·).
a) For the normal distribution, this is:

∂E(Y | X)/∂X = φ(β′X)·β (5.22)

where φ(·) is the standard normal density.
b) For the logistic distribution:

dΛ(β′X)/d(β′X) = e^(β′X)/[1 + e^(β′X)]² = Λ(β′X)[1 − Λ(β′X)] (5.23)

so that

∂E(Y | X)/∂X = Λ(β′X)[1 − Λ(β′X)]·β (5.24)
In interpreting the estimated model, in most cases the means of the regressors are used.
Cont’d---
In the logit model, we can interpret a coefficient b through its effect on the odds: every
unit increase in X has a multiplicative effect of e^b on the odds.
Example: if b = 0.25, then e^0.25 ≈ 1.28. Thus, when X changes by one
unit, the odds that Y = 1 increase by a factor of 1.28, i.e. by 28%.
In the probit model, use the Z-score terminology. For every unit
increase in X, the Z-score (or the Probit of “success”) increases by b
units or, we can also say that an increase in X changes Z by b standard
deviation units.
Let's see the following regression output
Cont’d---
Assume the following regression result from 644 observations using a binary logit
model. What do point spreads say about the probability of winning the game?
Cont’d---
The estimated slope of the point spread is -0.1098
One approach is to predict Prob(Y ) for different values of X, to see how the probability changes as
X changes.
The effect of a 1-unit change in X varies greatly, depending on the initial value of E(Z ).
Step One: Calculate the E(Z ) values for X = 5.88 and X = 6.88, using the fitted values.
Step Two: Plug the E(Z) values into the formula for the logistic function.
Z(5.88) = 0 − 0.1098 × 5.88 = −0.6456
Z(6.88) = 0 − 0.1098 × 6.88 = −0.7554

For the logit, F(Ẑ) = exp(Ẑ)/[1 + exp(Ẑ)]

F(−0.7554) − F(−0.6456) = 0.320 − 0.344 = −0.024
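The worked example can be reproduced directly; the intercept of 0 and slope of −0.1098 are the figures quoted above:

```python
import math

def fitted_prob(spread, intercept=0.0, slope=-0.1098):
    # logistic transformation F(Z) = exp(Z) / (1 + exp(Z))
    z = intercept + slope * spread
    return math.exp(z) / (1.0 + math.exp(z))

change = fitted_prob(6.88) - fitted_prob(5.88)
print(round(fitted_prob(5.88), 3))   # about 0.344
print(round(fitted_prob(6.88), 3))   # about 0.320
print(round(change, 3))              # about -0.024
```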
Cont’d---
Changing the point spread from 5.88 to 6.88 predicts a 2.4 percentage
point decrease in the team’s chance of victory.
Note that changing the point spread from 8.88 to 9.88 predicts only a 2.1
percentage point decrease.
In summary:
The signs of the coefficients in the logit/ probit model have the same
meaning as in the linear probability (i.e. OLS) model
The interpretation of the magnitude of the coefficients differs, though, since the
dependent variable has changed dramatically.
Above all, note that you cannot interpret the coefficients directly in terms of
units of change in y for a unit change in x, as in regression analysis.
Cont’d---
As stated, interpretation of the coefficients needs slight care
A 1-unit increase in xi will cause a βi·f(β′xi) change in the probability, where f(·) is the relevant density.
Usually, these impacts of incremental changes in an explanatory variable are
evaluated by setting each variable to its mean value.
These estimates are known as the marginal effects
The marginal effects are not constant: the slope (i.e. the change in
probability) of the logit/probit curve changes as the estimated probability moves
from 0 to 1
Therefore, the marginal effects can be evaluated at the sample means of the
data. Or the marginal effects can be evaluated at every observation and the
average can be computed to represent the marginal effects.
More generally, the marginal effects are given as:

∂pi/∂xij = βj, for the linear probability model
∂pi/∂xij = βj·pi(1 − pi), for the logit model
∂pi/∂xij = βj·φ(zi), for the probit model
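The three formulas can be compared at a single point; βj = 0.5 and z = 0 are hypothetical values chosen for illustration:

```python
import math

def me_lpm(beta_j):
    return beta_j                          # constant slope

def me_logit(beta_j, z):
    p = 1.0 / (1.0 + math.exp(-z))
    return beta_j * p * (1.0 - p)          # beta_j * p(1 - p)

def me_probit(beta_j, z):
    phi = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
    return beta_j * phi                    # beta_j * standard normal density

beta_j, z = 0.5, 0.0
print(me_lpm(beta_j), me_logit(beta_j, z), round(me_probit(beta_j, z), 4))
# 0.5, 0.125, 0.1995: the nonlinear models scale the slope by a density < 1
```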
Hypothesis testing
Standard errors and t-ratios will automatically be calculated by the econometric software
package used, and hypothesis tests can be conducted in the usual fashion for LPM
Usually hypothesis testing can be conducted using z-statistic for NLBM
One of the statistic that is used in testing hypotheses in logit and probit models is the
likelihood ratio (LR) test.
It is twice the difference in the log-likelihoods:

LR = 2(Lur − Lr) (5.25)

where Lur is the log-likelihood value for the unrestricted model (the likelihood
function maximized with respect to all the parameters) and Lr is the log-likelihood
value for the restricted model (the maximum when maximizing subject to the restrictions).
Since Lur ≥ Lr, LR ≥ 0, and it is usually strictly positive. The idea is that, because the MLE
maximizes the log-likelihood function, dropping variables generally leads to a smaller (or at
best equal) log-likelihood
In binary choice models, the log-likelihood itself is a negative number, because the fitted
probabilities entering it are strictly between 0 and 1 and their logarithms are negative.
Multiplication by two is required so that LR has an appropriate χ² distribution under H0.
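A numeric sketch of the LR test; the two log-likelihood values are hypothetical, and 5.99 is the familiar 5% χ² critical value for 2 degrees of freedom:

```python
L_ur = -120.5   # log-likelihood, unrestricted model (hypothetical)
L_r = -130.2    # log-likelihood, restricted model with 2 variables dropped

LR = 2.0 * (L_ur - L_r)          # LR = 2(L_ur - L_r)
chi2_crit_5pct_2df = 5.99        # 5% chi-square critical value, 2 df
print(round(LR, 1), LR > chi2_crit_5pct_2df)  # 19.4 True: reject restrictions
```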
Measures of Goodness of fit
When the dependent variable to be measured is dichotomous, there is a problem with using the
conventional R² as a measure of goodness of fit
Various R² measures have been devised for logit and probit. However, none is a measure of
the closeness of observations to an expected value as in regression analysis; all are ad hoc
It is also possible to calculate a pseudo R-squared (measures based on likelihood ratios) for
logit and probit comparable to the standard R2 for linear regressions by computing the
squared correlation between the fitted and actual values of yi.
Let Lur be the maximized log-likelihood of the estimated model and L0 be the maximum when
only the intercept is included.
McFadden (1974) suggested:

pseudo R² = 1 − Lur/L0 (5.26)

where Lur is the log-likelihood function for the estimated model and L0 is the log-
likelihood function with only an intercept. More practically,

McFadden's R² = 1 − [LL(α, β)/LL(α)]

If the covariates have no explanatory power, Lur/L0 = 1 and the pseudo R-squared is zero.
Since usually |Lur| < |L0|, the pseudo R² is greater than zero.
If Lur were zero, the pseudo R-squared would equal unity. But it cannot be zero, since that
would require all the estimated probabilities to equal one when yi = 1 and zero when yi = 0.
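McFadden's measure in eq. (5.26) with hypothetical log-likelihood values:

```python
L_ur = -120.5   # fitted model log-likelihood (hypothetical)
L_0 = -160.0    # intercept-only log-likelihood (hypothetical)

pseudo_r2 = 1.0 - L_ur / L_0    # McFadden's pseudo R-squared
print(round(pseudo_r2, 3))      # 0.247
```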
Multinomial Models
Multinomial models are multi-equation models used when the dependent
variable is unordered (categorical)
Such unordered choice model can be motivated by a random utility model. For
the ith consumer faced with J choices, suppose that the utility of choice j is:
Uij = x′ij β + eij (5.27)
If the consumer makes choice j in particular, then we assume that Uij is the
maximum among the J utilities.
Hence, the statistical model is driven by the probability that choice j is made,
which is:
Pr(Uij > Uik) for all other k ≠ j (5.28)
The probability that a particular consumer will choose a particular alternative
is given by the probability that the utility of that alternative to that consumer is
greater than the utility to that consumer of all other alternatives
Two models in this category are multinomial logit and multinomial probit
models. These are simple extension to the logit and probit models when the
dependent variable can take more than two categorical values
Multinomial Logit Model
Multinomial logit models are a straightforward extension to the logit model when the
dependent variable can take more than two categorical values
The multinomial logit is developed specifically for the case with more than two
qualitative choices and the choice is made simultaneously
Suppose we face a multiple choice response variable such as means of transportation
choices where the order of the choices does not matter
A traveler has a choice for a trip to work either by: Car, Bus or Train, Or
Choice of a major: economics, marketing, management, finance or accounting.
How can we build and estimate a model for choosing from more than two different
choices?
If the traveler would have the choice between car and bus only we would estimate a
model of the probability of going by car against going by bus, or vice versa
With the extra choice we just add an extra model explaining train against bus
So we go from 1 equation for 2 alternatives to 2 equations for 3 alternatives
Cont‘d---
In general:
Multinomial logit models are multi-equation models
A response variable with K categories will generate K-1 equations
Each of these K-1 equations is a binary logistic regression comparing a group
with the reference group
Notice that multinomial logit can only be used when the choices are actually
mutually exclusive. That means that a person chooses exactly one of the
options, not more and not less
Why multinomial logit and not probit?
The reason for this is simple: it is easy to extend the binary model to a
multinomial model using the logistic distribution, but not using the normal
distribution
Multinomial choice probabilities
Pij = P(individual i chooses alternative j)
The first category is used as the reference:

pi1 = 1 / [1 + exp(β12 + β22xi) + exp(β13 + β23xi)], for j = 1 (5.29)

pi2 = exp(β12 + β22xi) / [1 + exp(β12 + β22xi) + exp(β13 + β23xi)], for j = 2 (5.30)
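Eqs (5.29)-(5.30) can be evaluated for three alternatives with category 1 as the base; the coefficient values are hypothetical, chosen only for illustration:

```python
import math

def mnl_probs(xi, b12, b22, b13, b23):
    # multinomial logit with alternative 1 as the base category
    e2 = math.exp(b12 + b22 * xi)   # alternative 2 relative to the base
    e3 = math.exp(b13 + b23 * xi)   # alternative 3 relative to the base
    denom = 1.0 + e2 + e3
    return (1.0 / denom, e2 / denom, e3 / denom)

p1, p2, p3 = mnl_probs(xi=2.0, b12=0.5, b22=-0.2, b13=-1.0, b23=0.4)
print(round(p1, 3), round(p2, 3), round(p3, 3))
print(p1 + p2 + p3)  # the three probabilities sum to 1 (up to rounding)
```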
For the ordered logit model, the marginal effects of regressor xnk on the outcome probabilities are:

∂P(Yn = 1)/∂xnk = −[1/(1 + exp(Xnβ − μ1))]·[1 − 1/(1 + exp(Xnβ − μ1))]·βk (5.44)

∂P(Yn = j)/∂xnk = {[1/(1 + exp(Xnβ − μj−1))]·[1 − 1/(1 + exp(Xnβ − μj−1))]
 − [1/(1 + exp(Xnβ − μj))]·[1 − 1/(1 + exp(Xnβ − μj))]}·βk (5.45)

∂P(Yn = J)/∂xnk = [1/(1 + exp(Xnβ − μJ−1))]·[1 − 1/(1 + exp(Xnβ − μJ−1))]·βk (5.46)

where μ1, …, μJ−1 are the cut points.
Cont’d---
Define Z = Xβ+U with no intercept and U ~ N(0,1)
Pr(y=0) = Pr(z < _cut1)
Pr(y=1) = Pr(_cut1 < z < _cut2)
Pr(y=2) = Pr(_cut2 < z)
| |
y=0 | y = 1 | y=2
| |
---------------|------------|------------
_cut1 _cut2
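The cut-point scheme above can be evaluated numerically; xb (the value of Xβ) and the cut points below are hypothetical values:

```python
import math

def Phi(t):
    # standard normal CDF, since U ~ N(0, 1)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

xb, cut1, cut2 = 0.3, -0.5, 1.0
p0 = Phi(cut1 - xb)                # Pr(y = 0) = Pr(Z < _cut1)
p1 = Phi(cut2 - xb) - Phi(cut1 - xb)  # Pr(y = 1) = Pr(_cut1 < Z < _cut2)
p2 = 1.0 - Phi(cut2 - xb)          # Pr(y = 2) = Pr(_cut2 < Z)
print(round(p0, 3), round(p1, 3), round(p2, 3))
```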
Disadvantages of OLM
Assumption of equal slopes βk across outcome categories (the parallel-regressions assumption)
Biased estimates if the assumption of strictly ordered outcomes does not hold
Treat outcomes as non-ordered unless there are good reasons for imposing a ranking
Limited Dependent Variable (LDV) Models---
A LDV is a continuous variable with a lot of repeated observations at the lower or
upper limit. Examples include the quantity of a product consumed, the number of
hours that an individual works, etc.
LDVM arise when the variable of interest is constrained to lie between zero and one,
as in the case of a probability, or is constrained to be positive, as in the case of wages
or hours worked.
Most often, researchers encounter data where some values of the dependent
variable are missing from a randomly drawn sample, often due to the sample design,
such as top-coding, or institutional constraints. In such cases, the data are said to
be censored.
Such censoring problem is solved by censored regression model by utilizing the
information we have on whether the missing values are above or below a certain
threshold.
The Tobit model was originally developed to deal with corner solution outcomes, but it is
used to estimate models of both kinds (corner solution outcomes, where the dependent
variable assumes zero values for a significant part of the data, and censoring of various
types).
If no observations are censored, the Tobit model is the same as an OLS regression
Censoring vs Truncation
Censored regression models are used when the value for the dependent variable is
unknown (clustered at a lower threshold, an upper threshold, or both so that values
above (or below) this cannot be observed) while the values of the independent
variables are still available
In truncated regression models, by contrast, entire observations are missing, so that the
values of both the dependent and the independent variables are unknown when the
dependent variable is above (or below) a certain threshold.
A censored sample includes consumers who consume zero quantities of a product,
whereas a truncated sample includes only consumers who choose positive quantities of a
product.
A censored sample observes people who do not work, recording their work hours
as zero; a truncated sample observes nothing about people who do not work.
The censored sample is representative of the population (only the mean of the
dependent variable is not) because all observations are included.
The truncated sample is not representative of the population because some
observations are excluded.
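The distinction can be illustrated with a small made-up sample of desired work hours (negative desired hours cannot be observed):

```python
latent = [-5.0, -2.0, 0.0, 3.0, 8.0, 12.0]   # hypothetical latent values y*

censored = [max(h, 0.0) for h in latent]     # every unit kept, values clipped at 0
truncated = [h for h in latent if h > 0.0]   # whole observations are dropped

print(censored)    # [0.0, 0.0, 0.0, 3.0, 8.0, 12.0]: same sample size
print(truncated)   # [3.0, 8.0, 12.0]: the sample is no longer representative
```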
Cont’d---
Truncation has greater loss of information than censoring (missing
observations rather than values for the dependent variable)
Unlike the types of limited dependent variables examined so far, censored or
truncated variables may not necessarily be dummies
Because of censoring, the dependent variable y is the incompletely observed
value of the latent dependent variable y*.
Suppose that a privatisation is heavily oversubscribed, and you were trying to
model the demand for the shares using household income, age, education, and
region of residence as explanatory variables. The number of shares allocated to
each investor may have been capped at, say, 250, which might at first appear to
produce a truncated distribution.
In this example, even though we are likely to have many share allocations at
250 and none above this figure, all of the observations on the independent
variables are present and hence the dependent variable is censored, not
truncated.
Tobit Model
This model is for a metric (continuous) dependent variable that is limited in the sense that we
observe it only if it is above or below some cut-off level. For example,
the wages may be limited from below by the minimum wage
income may be top-coded at, say, $300,000
expenditure on a certain product cannot be lower than zero. For a good like
food that is not a problem, because almost everybody has food expenditures. For
a good like cars it is different: you will observe a lot of people with zero
expenditure on cars
Zero, in this case, is called a corner solution. The Tobit model takes account of
corner solutions; this is also called the Tobit I model
It is a standard regression model in which all negative values are censored at zero
Reasoning behind:
If we include the censored observations as y = 0, the censored observations on the left will
pull down the end of the line, resulting in underestimates of the intercept and overestimates
of the slope.
If we exclude the censored observations and just use the observations for which y>0 (that is,
truncating the sample), it will overestimate the intercept and underestimate the slope.
Cont’d---
The degree of bias in both will increase as the number of observations that take on the value of
zero increases.
The Tobit model uses all of the information, including info on censoring and provides
consistent estimates
The standard tobit model:
yi* = xi′β + εi
yi = yi* if yi* > 0 (5.47)
yi = 0 if yi* ≤ 0
The latent variable is often explained intuitively as indicating that there are people who are
"willing" to have negative outcomes on the dependent variable, for instance negative
expenditure on cars. Mathematically that is what we are assuming; intuitively, however, it
does not make much sense.
To illustrate, suppose that we wanted to model the demand for shares, as discussed earlier, as a
function of income (x2i), age (x3i), education (x4i), and region of residence (x5i). The Tobit
model would be
yi* = β1 + β2x2i + β3x3i + β4x4i + β5x5i + ui
yi = yi* for yi* < 250 (5.48)
yi = 250 for yi* ≥ 250
Cont’d---
yi* represents the true demand for shares (i.e. the number of shares
requested) and this will only be observable for demand less than 250.
It is important to note in this model that 2, 3, etc., represent the impact on
the number of shares demanded (of a unit change in x2i, x3i, etc.) and not the
impact on the actual number of shares that will be bought (allocated).
The Tobit model is also referred to as the censored regression model
Example: an income of y* = 4,000 will be censored as y = 3,000 with top
coding at 3,000
Censoring from below and Censoring from above
Censoring from below
The actual value for the dependent variable y is observed if the latent variable y* is
above the limit and the limit is observed for the censored observations.
We observe the actual hours worked for people who work and zero for people
who do not work.
Cont’d---
Censoring from above
The actual value for the dependent variable y is observed if the latent variable y* is
below the limit and the limit is observed for the censored observations.
If people make below $100,000, we observe their actual income and if they
make above $100,000, we record their income as 100,000 (censored values).
The model consists of two parts
The probability of observing a zero value
The distribution of the positive values
The probability of observing a zero value:
P(yi = 0) = P(yi* ≤ 0) = P(εi ≤ −xi′β) = Φ(−xi′β/σ) = 1 − Φ(xi′β/σ) (5.49)
Estimation and interpretation
It is also a nonlinear model, similar to the probit model. It is estimated
using maximum likelihood estimation techniques. The likelihood function for
the Tobit model takes the form:

L = Π(yi>0) (1/σ)·φ[(yi − xi′β)/σ] · Π(yi=0) [1 − Φ(xi′β/σ)] (5.50)

This is an unusual function: it consists of two terms, the first for non-censored
observations (the pdf), and the second for censored observations (the cdf)
The maximum likelihood estimates of β and σ are obtained by maximizing the
log-likelihood.
The estimated tobit coefficients are the marginal effects of a change in xj on y*,
the unobservable latent variable and can be interpreted in the same way as in a
linear regression model.
But such an interpretation may not be useful since we are interested in the
effect of X on the observable y (or change in the censored outcome).
Cont’d---
It can be shown that the change in y is found by multiplying the coefficient by
Pr(a < y* < b), that is, the probability of being uncensored. Since this probability is a
fraction, the marginal effect is attenuated (smaller in magnitude than the coefficient).
In the above, a and b denote the lower and upper censoring points; for example, with left
censoring at zero, a = 0 and b = +∞.
The signs of the effects of children and years of education change when using tobit instead of
OLS
The effect of self rating of marriage is much higher using tobit
Cont’d---
Before moving on, two important limitations of Tobit modelling should be
noted.
First, such models are much more seriously affected by non-normality and
heteroscedasticity than are standard regression models, and biased and
inconsistent estimation will result.
Second, the tobit model requires it to be plausible that the dependent variable
can have values close to the limit.
There is no problem with the privatisation example discussed above since the
demand could be for 249 shares.
However, it would not be appropriate to use the Tobit model in situations
where this is not the case, such as the number of shares issued by each firm in
a particular month.
For most companies, this figure will be exactly zero, but for those where it is
not, the number will be much higher and thus it would not be feasible to
issue, say, 1 or 3 or 15 shares.
In this case, an alternative approach should be used.
The truncated regression can be considered
Truncation
Truncation is a situation where a subset of the population in a sampling scheme is
excluded on the basis of the dependent variable y before sampling.
A truncated dependent variable, on the other hand, occurs when the observations for
both the dependent and the independent variables are missing when the dependent
variable is above (or below) a certain threshold.
Example: Sample consists only of people with values of Y below a limit c. (truncated
regression model.)
In the case of truncation, we do not have a random sample from the underlying
population. The rule that was used to include units in the sample is, however, known.
For the excluded observations, no information on the explanatory variables is available either.
Thus, dealing with truncated data is really a sample selection problem because the
sample of data that can be observed is not representative of the population of interest
- the sample is biased, very likely resulting in biased and inconsistent
parameter estimates.
This is a common problem, which will result whenever data for buyers or users only
can be observed while data for non-buyers or non-users cannot.
Cont’d--
The distribution of the positive values:
o This is equal to the distribution of the dependent variable given that it is
positive
o This is a truncated normal distribution with expectation:

E[yi | yi > 0] = xi′β + E[εi | εi > −xi′β]
            = xi′β + σ·φ(xi′β/σ)/Φ(xi′β/σ) (5.51)

where φ(·) is the standard normal density function.
The marginal effect of a change in explanatory variable k on the expected value of y is:

∂E[yi]/∂xik = βk·Φ(xi′β/σ) (5.52)

Notice that the expectation depends on the x-values of person i; therefore, the marginal
effects differ between individuals.
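The pieces of the Tobit decomposition, eqs (5.49), (5.51) and (5.52), can be evaluated at a single point; xb (= xi′β) and sigma are hypothetical values chosen for illustration:

```python
import math

def phi(t):   # standard normal density
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(t):   # standard normal CDF
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

xb, sigma, beta_k = 1.2, 2.0, 0.8
a = xb / sigma

p_zero = 1.0 - Phi(a)                       # P(y = 0), eq. (5.49)
e_positive = xb + sigma * phi(a) / Phi(a)   # E[y | y > 0], eq. (5.51)
me_k = beta_k * Phi(a)                      # marginal effect on E[y], eq. (5.52)
print(round(p_zero, 3), round(e_positive, 3), round(me_k, 3))
```

Note that the marginal effect me_k is the coefficient scaled down by the probability of being uncensored, in line with the attenuation discussed earlier.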
Cont’d--
For truncated data, a more general model is employed that contains two
equations - one for whether a particular data point will fall into the observed
or constrained categories and another for modelling the resulting variable.
The second equation is equivalent to the tobit approach.
This two-equation methodology allows for a different set of factors to affect
the sample selection
If it is thought that the two sets of factors will be the same, then a single
equation can be used and the tobit approach is sufficient.
In many cases, however, the researcher may believe that the variables in the
sample selection and estimation equations should be different.
Thus the equations could be
ai* = α1 + α2z2i + α3z3i + α4z4i + … + αm zmi + εi (5.53)