
Chapter 5

Models for qualitative and limited dependent variables

Applied Econometrics
Winter Term 2020/21
Prof. Dr. Simone Maxand
Humboldt University Berlin
5.1 Introduction 2 | 144

Contents I
5.1 Introduction
5.2 Binary response models
5.2.1 Introduction & model formulation
5.2.2 Probit and logit models
5.2.3 Maximum likelihood estimation
5.2.4 Model diagnostics
5.3 Limited dependent variables
5.3.1 Introduction
5.3.2 Truncation and censoring
5.3.3 Truncated regression model
5.3.4 Censored regression model
5.A Literature

Applied Econometrics – Chapter 5



5.1 Introduction
What is Microeconometrics?
I Analysis of individual data, i.e. data concerning the behaviour and
attitudes of persons, households or firms.
. Econometric methods to study microeconomic phenomena.
. The underlying model is typically a microeconomic model
where individual decisions and behavior are a function of
exogenous parameters.
I Typical questions: How do individual characteristics affect
. decision to work (or to buy a new product)?
. choice of travel mode (train, bus, car, bike)?
. household purchases of durable goods?
. number of hours worked?
. number of children?
. duration of unemployment?

The classical linear regression model might need to be modified for analyzing individual data, because
I the dependent variables are often non-continuous (qualitative, e.g. binary),
I the dependent variables may be limited, e.g.,
. wages of workers are non-negative,
. the monthly household expenditures for a certain consumer
good may be described by a positive continuous variable which
coexists with a discrete cluster of observations at zero, or
I the data may be subject to systematic sample selection since
. the data are non-experimental (“observational”), i.e. they are
not “generated by a random experiment”; instead they are
collected from surveys and administration records.


Types of variables / data


I Basic distinction: Quantitative (metric) vs. qualitative (categorical,
non-metric) variables
I A quantitative variable may be continuous or discrete.
. One can “measure” the size of the difference between any two
variable values, so that it is sometimes called metric variable.
. Examples: (log)salary, or age in years
. Often, except for count data, the distinction between discrete
and continuous data doesn’t matter (usually one considers the
real line as the variable’s support).
⇒ The classical linear model may be applied in case of a quantitative
dependent variable (except for count data) provided that
. it is not limited (truncated or censored), and
. the observations form a random sample from the population.

I Qualitative variables have a finite number of mutually exclusive categories, which can be
. non-ordered (nominal - binary or multinomial - variable), or
. ordered (ordinal variable).
I A binary (dichotomous) variable has two possible outcomes; it
indicates whether a certain property is present or not, e.g.:
. Has a credit application been approved (yes/no)?
. Is a person’s willingness-to-pay greater than the asking price
(yes/no)?
I A multinomial variable has three or more possible outcomes
(non-ordered categories), e.g.:
. Type of health insurance (state, private, or none)
. Employment status (full-time, part-time, unemployed, or not in
labor force)


I An ordered variable has categories which are ordered, but differences between the categories are “not defined”, e.g.:
. Self-assessed health satisfaction in GSOEP with scale from 0
(totally dissatisfied) to 10 (totally satisfied).
In this course:
I Binary dependent variables: Only two possible values.
. E.g. decision of employees to take on a job.
I Limited dependent variables: Values of variables are limited, e.g. on
an interval.
. Example: Duration of unemployment is censored from below
at 0, where 0 possesses a positive probability.
⇒ Models for truncated & censored data


Figure 1: Example of binary dependent variable from HU student survey (axes: Y: BVG ticket, X: INC)


Censored dependent variable


I yi takes on values in a limited range.
I Often: yi takes on zero for significant fraction of observations.
I Examples:
. Household purchases of durable good
. Number of hours worked
I Problem: Modeling the conditional expectation of yi given xi by a linear model may be misleading, since
. the linear relation may hold in the population, but the observed sample should not be considered representative!
I Solution: Censored regression model (e.g. Tobit model)


Figure 2: Limited dependent variable from HU student survey (axes: Y: RENT, X: INC)


Figure 3: Limited dependent variable (see Greene (2003), Ex. 22.8) (axes: Y: Wage of wives, X: log(Family income))


Truncated data
I Data for (xi , yi ) are not available if yi is above or below a certain
threshold.
I That is, some observations have been systematically excluded from
the sample.
I Example: Sample of (data on) households with income below
100,000 $
. The sample necessarily excludes all households with income
above that level. ⇒ No random sample of all households.
I Using truncated sample for investigating relationship between y and
x is potentially misleading (when using a linear model).
I Solution: Truncated regression model


Common aspects of models


I Often, interest is in the conditional probability distribution instead
of the conditional expectation (as in case of a linear model).
I Question: What is the ceteris paribus effect of a change in one
explanatory variable on the entire distribution of the dependent
variable?
I Here: Parametric approach, i.e. we assume that the (conditional)
distribution of the dependent variable is known up to a
finite-dimensional parameter which in turn is specified as a
(parametric) function of the explanatory variables.
I Assuming (cross-sectional) independence, the parameters of the
model are estimated by maximum likelihood (instead of least
squares in the linear regression model).


Empirical example 1: Determinants of fertility


I Ref.: Winkelmann & Boes
I Individual fertility decisions (no. of children born to a woman) depend on
. labor market opportunities and thus on education,
. social norms and values,
. marital status,
. health, etc.
I Focus in empirical studies is on women’s education:
If higher education of women leads to fewer children per woman,
then we have
. an explanation for fertility decline in developed world in 2nd
half of last century, and
. a recipe for reducing high population growth rates in some
parts of the developing world.

Example 1: Data
I US General Social Survey (GSS):
. Annual or biennial cross-sectional survey (started in 1972)
. Information on no. of children ever born to a woman, etc.
I Number of children is a count variable!
I Alternatively, we could investigate the proportion of childless women
⇒ binary variable!
I Here: Every 4th year 1974 - 2002
I Restriction to women beyond child-bearing age (40 years) to avoid
interfering effect of age:
. “Younger women tend to have fewer children than older women”.
. Otherwise: Consider no. of children for younger women as
censored.
Example 1: Descriptive statistics
I Pool observations over years ⇒ 5150 women (age ≥ 40)
No. of children ever born     Frequencies
to women (age ≥ 40)           Absolute    Relative (%)
0                              744        14.45
1                              706        13.71
2                             1368        26.56
3                             1002        19.46
4                              593        11.51
5                              309         6.00
6                              190         3.69
7                               89         1.73
8 or more                      149         2.89

Table 1: Fertility distribution

Example 1: Research questions


1. Is there a downward trend in fertility (i.e. do earlier birth cohorts
have a higher fertility than later ones)?
2. If yes, to what extent can this trend be explained by the rising
education levels of women?
I Aim: Statistical explanation (more educated women have fewer
children; proportion of more educated women increases over time ⇒
average fertility declines)
I No analysis of why more educated women have fewer children.
⇒ Investigate, whether (over time)
. average levels of fertility went down?
. average levels of education increased?
I Count data model: Problem of “censoring” (category: “8 or more”)


Example 1: Year-by-year statistics


Year   No. of         No. of      Proportion     Years of
       observations   children    of childless   schooling
1974   410            3.17        0.09           11.07
                      (0.10)      (0.01)         (0.16)
1978   445            2.73        0.14           11.00
1982   577            2.96        0.14           11.05
1986   470            2.70        0.16           11.34
1990   431            2.50        0.15           12.41
1994   989            2.40        0.15           12.78
1998   911            2.42        0.15           12.94
2002   917            2.36        0.16           13.25
                      (0.06)      (0.01)         (0.10)

Table 2: Fertility and average education level by years



Example 1: Interpretation of Table 2


I Large no. of observations per year
⇒ Small confidence intervals for population parameters
I Clear evidence of a downward trend in fertility
I Possibly, this trend can (partially) be explained by increased levels of
education.
I Exercise:
. Can the average of a discrete variable be normally distributed?
(Consequences for inference?)
. Test whether the average no. of children is the same in 1974
and in 2002.
. Is the difference in education levels between 1974 and 2002
statistically significant?
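
A minimal sketch of how the last two exercise items could be approached. It assumes that the values in parentheses in Table 2 are standard errors of the yearly means and that the 1974 and 2002 samples are independent; both are assumptions, not statements from the slides. The helper name `z_stat` is hypothetical.

```python
import math


def z_stat(mean_a, se_a, mean_b, se_b):
    """Two-sample z statistic for H0: equal means (independent samples)."""
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Values from Table 2; the parenthesized numbers are ASSUMED to be
# standard errors of the yearly means.
z_kids = z_stat(3.17, 0.10, 2.36, 0.06)     # no. of children: 1974 vs. 2002
z_educ = z_stat(13.25, 0.10, 11.07, 0.16)   # years of schooling: 2002 vs. 1974

print(f"z (no. of children):    {z_kids:.2f}")
print(f"z (years of schooling): {z_educ:.2f}")
```

Under these assumptions both statistics lie far above the usual 5% critical value of 1.96. The test relies on the central limit theorem: the sample mean of a discrete variable is not exactly normal, but it is approximately normal in samples of this size.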


Example 1: Linear regression analysis


Dependent Variable: KIDS
Method: Least Squares
Date: 10/12/09 Time: 12:48
Sample: 1 9120 IF AGE >= 40
Included observations: 5150

Variable     Coefficient   Std. Error   t-Statistic   Prob.
YEAR=1974    3.170732      0.093298     33.98506      0.0000
YEAR=1978    2.734831      0.089554     30.53846      0.0000
YEAR=1982    2.960139      0.078646     37.63887      0.0000
YEAR=1986    2.702128      0.087139     31.00926      0.0000
YEAR=1990    2.496520      0.090997     27.43533      0.0000
YEAR=1994    2.400404      0.060071     39.95942      0.0000
YEAR=1998    2.422613      0.062590     38.70613      0.0000
YEAR=2002    2.364231      0.062385     37.89755      0.0000

R-squared            0.018405     Mean dependent var      2.586408
Adjusted R-squared   0.017069     S.D. dependent var      1.905469
S.E. of regression   1.889137     Akaike info criterion   4.111669
Sum squared resid    18350.97     Schwarz criterion       4.121839
Log likelihood       -10579.55    Durbin-Watson stat      1.808056

Figure 4: No. of children in dependence of year dummies


I Why model without constant?
I Predicted no. of children in 1982?

Dependent Variable: KIDS
Method: Least Squares
Date: 10/12/09 Time: 12:55
Sample: 1 9120 IF AGE >= 40
Included observations: 5150

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C             3.026134     0.055684     54.34433      0.0000
TIME         -0.026256     0.002929     -8.963844     0.0000

R-squared            0.015368     Mean dependent var      2.586408
Adjusted R-squared   0.015177     S.D. dependent var      1.905469
S.E. of regression   1.890954     Akaike info criterion   4.112428
Sum squared resid    18407.74     Schwarz criterion       4.114971
Log likelihood       -10587.50    F-statistic             80.35050
Durbin-Watson stat   1.802490     Prob(F-statistic)       0.000000

Figure 5: No. of children in dependence of time


I time:=year-1974
I Predicted no. of children in 1982?
I Prediction of no. of children in 2000?
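
The prediction questions can be worked through with the coefficients printed in Figures 4 and 5. The following is only a sketch: the function name `predict_trend` is hypothetical, and the numbers are the rounded estimates from the output tables.

```python
# Coefficient of the 1982 dummy in the year-dummy model (Figure 4):
# in a model with a full set of dummies and no constant, this is the 1982 sample mean.
pred_1982_dummies = 2.960139

# Linear-trend model (Figure 5), with time := year - 1974
CONST, B_TIME = 3.026134, -0.026256


def predict_trend(year):
    """Predicted no. of children from the linear-trend model."""
    return CONST + B_TIME * (year - 1974)


pred_1982_trend = predict_trend(1982)   # time = 8
pred_2000_trend = predict_trend(2000)   # time = 26; 2000 is not a survey year

print(f"1982, year dummies: {pred_1982_dummies:.3f}")
print(f"1982, linear trend: {pred_1982_trend:.3f}")
print(f"2000, linear trend: {pred_2000_trend:.3f}")
```

The year-dummy model can only reproduce the survey years in the sample, whereas the trend model also yields a value for a year such as 2000 by interpolation.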


Dependent Variable: KIDS
Method: Least Squares
Date: 10/12/09 Time: 12:57
Sample: 1 9120 IF AGE >= 40
Included observations: 5150

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C             4.391621     0.102744     42.74341      0.0000
TIME         -0.014002     0.002967     -4.719580     0.0000
EDUC         -0.128275     0.008187     -15.66723     0.0000

R-squared            0.060188     Mean dependent var      2.586408
Adjusted R-squared   0.059823     S.D. dependent var      1.905469
S.E. of regression   1.847595     Akaike info criterion   4.066229
Sum squared resid    17569.83     Schwarz criterion       4.070042
Log likelihood       -10467.54    F-statistic             164.8140
Durbin-Watson stat   1.837188     Prob(F-statistic)       0.000000

Figure 6: No. of children in dependence of time and years of schooling

I Which model would you prefer?
I Is education related to fertility?
I Discuss potential shortcomings of linear regression.


5.2 Binary response models

5.2.1 Introduction and model formulation
I Binary dependent variables occur very frequently:
Outcome is true/false, yes/no decision, success/failure.
I E.g.: Event that a person buys a car, is unemployed, smokes, is
childless, has visited a doctor during the last quarter, has an
extramarital affair, has been granted a bank loan, etc.
I Depending on the aim of the analysis, binary variables may be
constructed from multinomial, ordered or continuous variables (with
loss of information!):
. Happy/unhappy instead of ordered variable happiness with
scale from 0 (“completely unhappy”) to 10 (“completely
happy”)
. Recession yes/no? based on GDP (continuous variable)

Example: Discrimination in mortgage market?


I Does a bank treat mortgage applicants of German and non-German origin (who want to buy an identical house) the same way?
I By law they must receive identical treatment.
I Loans are made and denied for many legitimate reasons.
I It is not sufficient to compare the fraction of both groups of
applicants who were denied a mortgage.
⇒ We need a method for comparing rates of denial, holding other
characteristics constant.
I Problem is similar to multiple regression, but now the dependent
variable is binary.


Bernoulli variable
I The two possible outcomes of y are usually coded as
1 (yes/“success”) and 0 (no/“failure”), i.e.:
$y = 1$ if the event occurred, otherwise $y = 0$.
I No loss of generality if interest is only in the probability of success.
⇒ With $p = P(y = 1) = 1 - P(y = 0)$:
$$y \sim \mathrm{Bernoulli}(p) = \mathrm{Bin}(1, p)$$
⇒ Maximum likelihood approach is appropriate.
I Outcome of $y \in \{0, 1\}$ (and thus $p$) might depend on the individual’s characteristics $x = (x_1, \dots, x_K)'$.
I Econometric analysis requires observations of $(y, x)$:
$$(y_i, x_i), \; i = 1, \dots, N, \quad \text{with } x_i = (x_{i1}, \dots, x_{iK})', \; x_{i1} \equiv 1$$


Linear probability model


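I Model, with assumption $E(\varepsilon_i \mid x_i) = 0$ ($\Rightarrow E(\varepsilon_i) = 0$):
$$y_i = x_i'\beta + \varepsilon_i = \beta_1 + \sum_{j=2}^{K} \beta_j x_{ij} + \varepsilon_i, \quad i = 1, \dots, N$$
I “Probability of success”:
$$p_i := P(y_i = 1 \mid x_i) = E(y_i \mid x_i) = x_i'\beta$$
⇒ $y_i \sim \mathrm{Bernoulli}(p_i) = \mathrm{Bin}(1, p_i)$
⇒ $V(y_i \mid x_i) = p_i(1 - p_i) = x_i'\beta\,(1 - x_i'\beta)$
I Marginal (probability) effect of $x_j$ (on $E(y_i \mid x_i)$ or $p_i$):
$$\frac{\partial P(y_i = 1 \mid x_i)}{\partial x_{ij}} = \beta_j = \frac{\partial E(y_i \mid x_i)}{\partial x_{ij}} \quad (j = 2, \dots, K)$$
. The last equality holds (only) for the chosen coding.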

Disadvantages of linear probability model


I Implicit restrictions on $\beta$: $0 \le x_i'\beta \le 1$ ($\forall i$)
I $y_i$ binary ⇒ $\varepsilon_i$ binary (not normally distributed!), with two possible outcomes:
$1 - x_i'\beta$ (with probability $p_i$) and $-x_i'\beta$ (with probability $1 - p_i$)
⇒ Given $x_i$, $\varepsilon_i$ is heteroskedastic:
$$V(\varepsilon_i \mid x_i) = V(y_i \mid x_i) = x_i'\beta\,(1 - x_i'\beta)$$
⇒ The OLS estimator (OLSE) $\hat\beta_{OLS}$ of $\beta$ is unbiased, but inefficient.
I $\hat\beta_{OLS}$ neglects the parameter restriction ⇒ possibly:
$$\hat p_i = \hat P(y_i = 1 \mid x_i) = x_i'\hat\beta_{OLS} \notin [0, 1]!$$
I OLSE with robust standard errors (or MLE for $p_i = x_i'\beta$) may serve as a useful exploratory tool (often: reasonable direct estimation of average marginal effects and hint to statistically relevant variables).

Emp. Example 1 (determinants of fertility)


I Linear probability model
I CHILDLESS=1 if woman has no children
I Explanatory variables: TIME, EDUC (years of schooling), WHITE
(dummy variable), SIBS (number of siblings)
Dependent Variable: CHILDLESS
Method: Least Squares
Date: 10/19/09 Time: 17:54
Sample: 1 9120 IF AGE >= 40
Included observations: 5150

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C             0.039729     0.026033      1.526072     0.1271
TIME          0.000602     0.000564      1.067451     0.2858
EDUC          0.007573     0.001637      4.626005     0.0000
WHITE         0.014248     0.013702      1.039826     0.2985
SIBS         -0.002354     0.001561     -1.508272     0.1315

R-squared            0.007911     Mean dependent var      0.144466
Adjusted R-squared   0.007139     S.D. dependent var      0.351596
S.E. of regression   0.350338     Akaike info criterion   0.741136
Sum squared resid    631.4819     Schwarz criterion       0.747492
Log likelihood       -1903.426    F-statistic             10.25640
Durbin-Watson stat   1.948998     Prob(F-statistic)       0.000000


Example 1, Linear probability model, cont.

I 1 more year of schooling ⇒ Probability of being childless increases (ceteris paribus) by 0.76 percentage points.
I Probability of being childless for a white woman surveyed in 1994 with 20 years of schooling and 3 siblings?
21.07%
I Extreme example: time=0, educ=0, white=0, sibs=23
⇒ predicted probability: −1.4% (makes no sense)!
I Interpret the other coefficients, too!
I How would you estimate the error variance (for different women)?
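
The predictions on this slide can be retraced from the printed LPM coefficients. The sketch below uses the rounded estimates from the output table; the function name `lpm_predict` is hypothetical. With these rounded inputs the first prediction comes out at roughly 0.21, close to the 21.07% quoted above; the small gap is not resolved here.

```python
# Rounded LPM estimates from the CHILDLESS regression output
COEF = {"C": 0.039729, "TIME": 0.000602, "EDUC": 0.007573,
        "WHITE": 0.014248, "SIBS": -0.002354}


def lpm_predict(time, educ, white, sibs):
    """Fitted probability x'b of the linear probability model."""
    return (COEF["C"] + COEF["TIME"] * time + COEF["EDUC"] * educ
            + COEF["WHITE"] * white + COEF["SIBS"] * sibs)


# White woman surveyed in 1994 (time = 1994 - 1974 = 20), 20 years of schooling, 3 siblings
p_1994 = lpm_predict(time=20, educ=20, white=1, sibs=3)

# Extreme example from the slide: the fitted "probability" is negative
p_extreme = lpm_predict(time=0, educ=0, white=0, sibs=23)

# Individual-specific error variance implied by the LPM: p(1 - p)
var_1994 = p_1994 * (1 - p_1994)

print(f"p_1994    = {p_1994:.4f}")
print(f"p_extreme = {p_extreme:.4f}")
print(f"var_1994  = {var_1994:.4f}")
```

The last line addresses the final question: since $V(\varepsilon_i \mid x_i) = x_i'\beta(1 - x_i'\beta)$, a natural plug-in estimate of the error variance for a given woman is $\hat p_i(1 - \hat p_i)$, which is only meaningful when $\hat p_i$ lies in $[0, 1]$.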


Example 1, Linear probability model, cont.


I Heteroskedasticity: Use of White standard errors
Dependent Variable: CHILDLESS
Method: Least Squares
Date: 10/19/09 Time: 18:13
Sample: 1 9120 IF AGE >= 40
Included observations: 5150
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C             0.039729     0.029000      1.369958     0.1708
TIME          0.000602     0.000539      1.118592     0.2634
EDUC          0.007573     0.001877      4.034242     0.0001
WHITE         0.014248     0.013155      1.083089     0.2788
SIBS         -0.002354     0.001538     -1.530040     0.1261

R-squared            0.007911     Mean dependent var      0.144466
Adjusted R-squared   0.007139     S.D. dependent var      0.351596
S.E. of regression   0.350338     Akaike info criterion   0.741136
Sum squared resid    631.4819     Schwarz criterion       0.747492
Log likelihood       -1903.426    F-statistic             10.25640
Durbin-Watson stat   1.948998     Prob(F-statistic)       0.000000


Nonlinear models for probabilities


$$p_i = P(y_i = 1 \mid x_i) = G(x_i'\beta)$$
I “Natural” requirements on the (known) function $G$:
(i) $G$ is monotonically increasing.
(ii) $G(z) \to 0$ as $z \to -\infty$ and $G(z) \to 1$ as $z \to \infty$.
⇒ Practice: $G$ is some (cumulative) distribution function (cdf).
I (i) ⇒ For $\beta_j > 0$, $P(y_i = 1 \mid x_i)$ is increasing in $x_{ij}$, that is, positive (negative) coefficients correspond to positive (negative) effects on $p_i$.
I If $G$ is a differentiable cdf, then $g(z) = G'(z)$ is its density.
⇒ Level-dependent marginal effects (on probability of success):
$$\frac{\partial E(y_i \mid x_i)}{\partial x_{ij}} = \frac{\partial P(y_i = 1 \mid x_i)}{\partial x_{ij}} = g(x_i'\beta)\,\beta_j \quad (j = 2, \dots, K)$$

I Usually, $g$ is a density with relatively smaller values in the tails and relatively larger values near the mean.
⇒ Effects are smallest for individuals for which $P(y_i = 1 \mid x_i)$ is near 0 (in the left tail of $g$) or near 1 (in the right tail of $g$).
. Corresponds to intuition: Individuals with clear-cut preferences are less affected by changes in the explanatory variables.
I Sensitivity of decisions to changes in $x$ depends on the shape of $g$.
W.l.o.g. ($x_{i1} \equiv 1$!): $\int z\,g(z)\,dz = 0$ (i.e.: $E[Z] = 0$ if $Z \sim G$)
. Typically, $g$ is unimodal and symmetric around 0.
⇒ $g(z) = g(-z)$ and $\max_z g(z) = g(0)$
⇒ Marginal effects are maximal if $g(x_i'\beta)$ is maximal (i.e. if $x_i'\beta \approx 0$). Then: $P(y_i = 1 \mid x_i) = G(x_i'\beta) \approx G(0) = \tfrac{1}{2}$.


Examples for choice of G


a) Probit model
$$G(z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx \quad (\text{cdf of } N(0, 1))$$
b) Logit model
$$G(z) = \Lambda(z) = \frac{e^{z}}{1 + e^{z}} \quad [\text{(standard) logistic cdf}]$$
c) Complementary log-log model
$$G(z) = C(z) = 1 - \exp(-\exp(z)) \quad (\text{cdf of extreme value distr.})$$
I Remark: $G(z) = z$ (identity function, no cdf)
⇒ Linear probability model!
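
As an illustration, the three choices of $G$ and their densities $g = G'$ can be written with the Python standard library only. This is a sketch; all function names are hypothetical, and the marginal effect follows the formula $g(x_i'\beta)\beta_j$ from the preceding slides.

```python
import math


def probit_cdf(z):
    """Phi(z), standard normal cdf, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def logit_cdf(z):
    """Lambda(z) = e^z / (1 + e^z), written as 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def logit_pdf(z):
    p = logit_cdf(z)
    return p * (1.0 - p)


def cloglog_cdf(z):
    """C(z) = 1 - exp(-exp(z))."""
    return 1.0 - math.exp(-math.exp(z))


def cloglog_pdf(z):
    # derivative of C(z): exp(z) * exp(-exp(z))
    return math.exp(z - math.exp(z))


def marginal_effect(pdf, index, beta_j):
    """Marginal probability effect g(x'b) * b_j at a given index value x'b."""
    return pdf(index) * beta_j


for name, cdf, pdf in [("probit", probit_cdf, probit_pdf),
                       ("logit", logit_cdf, logit_pdf),
                       ("cloglog", cloglog_cdf, cloglog_pdf)]:
    print(f"{name:8s} G(0) = {cdf(0.0):.4f}   g(0) = {pdf(0.0):.4f}")
```

Note that the complementary log-log cdf is not symmetric around 0, so $G(0) \ne 1/2$ in that case.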


Identification considerations
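I Identifiability of $\beta$ requires: $G(z)$ is strictly increasing, $\mathrm{rk}(X) = K$.
I Moreover, mean and variance must be fixed: Let $G$, $\tilde G$ be cdfs with associated densities $g$, $\tilde g$, and suppose that
$$Z \sim G, \quad U := (Z + \mu)/\sigma \sim \tilde G \quad (\text{for } \sigma > 0).$$
⇒ $\tilde G(u) := P(U < u) = G(\sigma u - \mu)$, $\tilde g(u) = \sigma\, g(\sigma u - \mu)$
⇒
$$P(y_i = 1 \mid x_i) = G(x_i'\beta) = \tilde G\!\left(\frac{x_i'\beta + \mu}{\sigma}\right) = \tilde G\!\left(\frac{\beta_1 + \mu}{\sigma} + \sum_{j=2}^{K} \frac{\beta_j}{\sigma}\, x_{ij}\right)$$
⇒ $\beta$ is not identifiable unless $\mu$ and $\sigma$ are fixed:
$$\tilde\beta_1 = (\beta_1 + \mu)/\sigma, \quad \tilde\beta_j = \beta_j/\sigma \;\; (j = 2, \dots, K) \;\Rightarrow\; G(x_i'\beta) = \tilde G(x_i'\tilde\beta).$$
⇒ $\beta$ and $\tilde\beta$ are not distinguishable (unless fixing $\mu$ and $\sigma$).
⇒ Typically, $\mu = 0$ (as before); additionally, fix $\sigma$.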
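
The identification argument can be checked numerically. The sketch below takes $G$ to be the standard logistic cdf and uses made-up values for $\beta$, $\mu$ and $\sigma$ (all hypothetical); it shows that $(G, \beta)$ and $(\tilde G, \tilde\beta)$ imply the same probabilities.

```python
import math


def G(z):
    """Standard logistic cdf, used here as an example of G."""
    return 1.0 / (1.0 + math.exp(-z))


def G_tilde(u, mu, sigma):
    """cdf of U = (Z + mu) / sigma when Z ~ G."""
    return G(sigma * u - mu)


# Hypothetical parameter values, chosen only for illustration
beta = [-0.5, 1.2]          # (beta_1, beta_2)
mu, sigma = 0.3, 2.0

# Rescaled parameters from the slide
beta_tilde = [(beta[0] + mu) / sigma, beta[1] / sigma]

max_gap = 0.0
for x2 in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    p = G(beta[0] + beta[1] * x2)
    p_tilde = G_tilde(beta_tilde[0] + beta_tilde[1] * x2, mu, sigma)
    max_gap = max(max_gap, abs(p - p_tilde))

print(f"largest difference in implied probabilities: {max_gap:.2e}")
```

Since the two parameterizations are observationally equivalent, data on $(y_i, x_i)$ alone cannot separate them, which is why location and scale of $G$ have to be fixed in advance.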

Model interpretation in terms of latent variables
I $y^*$: latent (unobservable, continuous) variable to be explained
⇒ “Natural” regression model for $y^*$ (index function model):
$$y_i^* = x_i'\beta + \varepsilon_i, \quad \varepsilon_i \text{ i.i.d.}$$
. $y_i^*$ depends on individual characteristics $x_i$ via an index function $x_i'\beta$ (representing systematic effects on $y_i^*$).
. Model is not estimable, since $y^*$ is not observed!
I Instead we observe only a binary variable $y$, which takes value 1 or 0 according to whether or not $y^*$ crosses a threshold:
$$y_i = \begin{cases} 1, & \text{if } y_i^* > 0 \\ 0, & \text{else} \end{cases}$$

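I Motivation of threshold 0:
. $\varepsilon_i$, $x_i$ independent with $\varepsilon_i \sim G$ (otherwise, $\varepsilon_i \mid x_i \sim G$) ⇒
$$P(y_i = 1 \mid x_i) = P(y_i^* > 0 \mid x_i) = 1 - G(-x_i'\beta) = G(x_i'\beta) \quad [\text{if } g(z) = g(-z)]$$
I Again, identification of the single-index model requires a restriction on $V(\varepsilon_i)$, because $\beta$ is identifiable only up to scaling.
. We observe only whether
$$y_i^* > 0 \;\Leftrightarrow\; x_i'\beta + \varepsilon_i > 0 \;\Leftrightarrow\; x_i'(\sigma\beta) + (\sigma\varepsilon_i) > 0 \quad (\forall\, \sigma > 0).$$
⇒ Uniqueness is achievable by a restriction on the error variance, e.g.
$$V(\varepsilon_i) = \begin{cases} 1, & \text{in the probit model} \\ \pi^2/3, & \text{in the logit model.} \end{cases}$$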


Choice of threshold

I Threshold is not necessarily 0, e.g.
$$y_i = 1 \;\Leftrightarrow\; y_i^* > z_i'\delta \quad (\text{if } x_i \text{ and } z_i \text{ deterministic}).$$
⇒ $P(y_i = 1) = G(x_i'\beta - z_i'\delta)$
⇒ $\delta$ is separately identifiable only if all components of $z_i$ and $x_i$ are distinguishable.
. Require that $(X, Z)$ has full rank.
. In particular, $z_i$ and $x_i$ should not both include an intercept.


Model interpretation in terms of utilities


I The individual chooses between two alternatives 0 and 1 such that the utility is maximized.
I Let $u_{i0}$ and $u_{i1}$ be the utility when choosing 0 and 1, resp.
I Example: $u_{i0}$: utility of rental housing; $u_{i1}$: utility of home ownership
I Specification by an additive random utility model:
$$u_{i0} = x_i'\beta^0 + \varepsilon_{0i}$$
$$u_{i1} = x_i'\beta^1 + \varepsilon_{1i}$$
. $x_i'\beta^j$, $\varepsilon_{ji}$ are the deterministic and stochastic utility components, respectively; $j = 0, 1$.

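I Utility maximization yields
$$y_i = 1 \;\Leftrightarrow\; u_{i0} < u_{i1} \;\Leftrightarrow\; x_i'\underbrace{(\beta^1 - \beta^0)}_{=:\beta} + \underbrace{\varepsilon_{1i} - \varepsilon_{0i}}_{=:\varepsilon_i} > 0$$
or $y_i = 0 \;\Leftrightarrow\; x_i'\beta + \varepsilon_i \le 0$
⇒ Model as before: $P(y_i = 1 \mid x_i) = G(x_i'\beta)$ (under the same assumption on the distribution of $\varepsilon_i$)
I Model requires (for identification) a scale normalization, since:
$$u_{i1} > u_{i0} \;\Leftrightarrow\; \sigma u_{i1} > \sigma u_{i0} \quad (\forall\, \sigma > 0)$$
This is usually done by specifying the variance of $\varepsilon_i = \varepsilon_{1i} - \varepsilon_{0i}$ (as before) or by specifying the variances of $\varepsilon_{1i}$ and $\varepsilon_{0i}$ separately.
I The random utility formulation is especially useful for specifying unordered multinomial choice models.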


Graphical illustration with artificial data

I $x_{i2} = i$ for $i = 1, \dots, 200$
I $y_i^* = -10 + 0.1\,x_{i2} + \varepsilon_i$, simulated by drawing $\varepsilon_i$ independently from $N(0, 1)$ for $i = 1, \dots, 200$.
I Binary dependent variable:
$$y_i = 0 \text{ if } y_i^* < 0, \qquad y_i = 1 \text{ if } y_i^* \ge 0$$
I Regress $y_i$ on $x_i = (1, x_{i2})'$ (estimate LPM by OLS).
I Plot $x_{i2}$ against $y_i$ (scatter plot) together with the regression line ($x_i'\hat\beta_{OLS}$).
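
The simulation design described above can be reproduced in a few lines. This is only a sketch using the Python standard library: the seed is arbitrary, the variable names are hypothetical, and the plot is left out; the OLS fit uses the closed-form formulas for a regression on a constant and one regressor.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility of the sketch

N = 200
x = [float(i) for i in range(1, N + 1)]                     # x_i2 = i
y_star = [-10.0 + 0.1 * xi + random.gauss(0.0, 1.0) for xi in x]
y = [1.0 if ys >= 0 else 0.0 for ys in y_star]              # observed binary outcome

# OLS of y on (1, x): slope = S_xy / S_xx, intercept = ybar - slope * xbar
x_bar = sum(x) / N
y_bar = sum(y) / N
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * xi for xi in x]
n_outside = sum(1 for f in fitted if f < 0.0 or f > 1.0)

print(f"intercept = {intercept:.3f}, slope = {slope:.4f}")
print(f"fitted values outside [0, 1]: {n_outside} of {N}")
```

Because $y_i$ switches from 0 to 1 around $x_{i2} = 100$, the fitted line has to cross both 0 and 1 within the sample range, so a sizeable share of fitted values falls outside the unit interval.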


Figure 7: Observed values ($y_i$) and fitted values ($x_i'\hat\beta_{OLS}$) in the LPM (axes: Y against X).


Figure 8: Nonlinear (probit) model: observed values ($y_i$) and fitted probabilities $G(x_i'\hat\beta)$ (axes: Y and PROB Y against X).


5.2.2 Probit and logit models

I Probit model: $G(z) = \Phi(z) = \int_{-\infty}^{z} \phi(x)\, dx$
. $\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$ and $\Phi(z)$ denote the density and the cdf, respectively, of the standard normal distribution.
I Logit model: $G(z) = \Lambda(z) = \dfrac{e^{z}}{1 + e^{z}}$
. $\Lambda$ denotes the cdf of the standard logistic distribution.
. The associated density function is
$$\lambda(z) \; [= \Lambda'(z)] = \frac{e^{z}}{(1 + e^{z})^2}$$


The general logistic distribution


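I A random variable $Z$ is logistically distributed with parameters $\mu$ and $\kappa$ if
$$\Lambda(z; \mu, \kappa) = \frac{1}{1 + \exp\!\left(-\frac{z - \mu}{\kappa}\right)} = \frac{e^{\frac{z - \mu}{\kappa}}}{1 + e^{\frac{z - \mu}{\kappa}}} \quad (\text{distribution function})$$
$$\lambda(z; \mu, \kappa) = \frac{d}{dz}\Lambda(z; \mu, \kappa) = \frac{e^{-\frac{z - \mu}{\kappa}}}{\kappa \left[1 + e^{-\frac{z - \mu}{\kappa}}\right]^2} \quad (\text{density function})$$
I Moments for $Z \sim \Lambda(\cdot\,; \mu, \kappa)$, where $\Lambda(z; 0, 1) = \Lambda(z)$:
$$E[Z] = \mathrm{Median}(Z) = \mu, \qquad V(Z) = \tfrac{1}{3}\pi^2 \kappa^2$$
$$\frac{E[(Z - \mu)^3]}{[V(Z)]^{3/2}} = 0 \;\; (\text{skewness}), \qquad \frac{E[(Z - \mu)^4]}{[V(Z)]^{2}} = 3 + \frac{6}{5} \;\; (\text{kurtosis})$$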


Model comparison

I Moments
cdf         Expectation   Variance    Skewness   Kurtosis
$\Phi$      0             1           0          3
$\Lambda$   0             $\pi^2/3$   0          $3 + 6/5$

I For comparing the distributions, we standardize the standard logistic distribution by $\sigma = \pi/\sqrt{3}$:
⇒ Standardized logistic distribution with
. cdf $\Lambda(\sigma z)$ and density $\sigma\lambda(\sigma z)$
⇒ Expectation 0, Variance 1 (as for $\Phi$)


Cdf of standard normal law and standardized logistic distribution
Figure: Distribution functions G(x) of the standard normal distribution and the standardized logistic distribution, for x between −4 and 4.


Density function of standard normal law and standardized logistic distribution
Figure: Density functions g(x) of the standard normal distribution and the standardized logistic distribution, for x between −4 and 4.

I The above figures show:
. Both distributions (models) are very similar.
. Differences exist mainly in the center (around the mean 0) and also in both tails, where the logistic density function has larger values than that of the normal distribution.
⇒ There, the (absolute) marginal effects in the logit model are larger than in the probit model.
⇒ The results based on both models do not differ much unless the tails of the distributions are of importance. In particular, they are similar if the data are not very “unbalanced”, in the sense that the sample share of successes
$$\#\{i \mid y_i = 1\}/N = \sum_{i=1}^{N} y_i / N$$
does not differ too much from $\tfrac{1}{2}$.


Comparing parameters
I For comparing the parameter estimates in both models, a scaling different from the factor $\pi/\sqrt{3} \approx 1.8$ is recommended.
I The parameters should be scaled such that the maximal effects (obtained at $x'\beta = 0$) are comparable:
$$\frac{\max_z \phi(z)}{\max_z \lambda(z)} = \frac{\phi(0)}{\lambda(0)} = \frac{\frac{1}{\sqrt{2\pi}}\, e^{0}}{\frac{e^{0}}{(e^{0} + 1)^2}} = \frac{4}{\sqrt{2\pi}} \approx 1.6 =: \rho$$
⇒ For comparison, the (estimated) parameters of the probit model are multiplied by $\rho$ ($\approx 1.6$) (or those of the logit model are divided by $\rho$).
I Note: if $\lambda$ is the density of $Z$, then the density $\tilde\lambda$ of $\tilde Z = Z/\rho$ satisfies
$$\tilde\lambda(z) = \rho \cdot \lambda(\rho z) \quad \text{and} \quad \tilde\lambda(0) = \phi(0).$$

Reporting marginal effects


I Marginal effects vary across the individuals (depend on $x_i$)!
I Average marginal effects of the individuals in the sample:
$$\frac{1}{N}\sum_{i=1}^{N} \frac{\partial P(y_i = 1 \mid x_i)}{\partial x_{ij}} = \frac{1}{N}\sum_{i=1}^{N} \beta_j\, g(x_i'\beta) \quad (j = 2, \dots, K).$$
I Sometimes: effect evaluated at the average values of the $x_i$’s (simpler to compute, but less clear interpretation):
$$\left.\frac{\partial P(y_i = 1 \mid x_i)}{\partial x_{ij}}\right|_{x_i = \bar x_\cdot} = g(\bar x_\cdot'\beta)\,\beta_j, \qquad \bar x_\cdot = \frac{1}{N}\sum_{i=1}^{N} x_i$$
I In general, the two values differ because $g$ is nonlinear and thus:
$$E[g(x_i'\beta)] \ne g(E[x_i'\beta]).$$
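
The difference between the two summary measures can be illustrated for a logit model with one regressor. The parameter values and the five data points below are made up for illustration only, and all names are hypothetical.

```python
import math


def logistic_pdf(z):
    """lambda(z) = Lambda(z) * (1 - Lambda(z))."""
    p = 1.0 / (1.0 + math.exp(-z))
    return p * (1.0 - p)


# Hypothetical logit parameters and regressor values (not from the slides)
beta_1, beta_2 = -1.0, 0.8
x2 = [-2.0, -1.0, 0.0, 1.0, 4.0]
N = len(x2)

# Average marginal effect: mean over i of beta_2 * g(x_i'beta)
ame = sum(beta_2 * logistic_pdf(beta_1 + beta_2 * v) for v in x2) / N

# Marginal effect at the mean: beta_2 * g(xbar'beta)
x2_bar = sum(x2) / N
mem = beta_2 * logistic_pdf(beta_1 + beta_2 * x2_bar)

print(f"average marginal effect:     {ame:.4f}")
print(f"marginal effect at the mean: {mem:.4f}")
```

In this example the effect at the mean is larger, because the average index lies closer to 0, where the logistic density peaks, than most of the individual indices do.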


Case of discrete explanatory variables


I Recommended: compute the discrete change in the probabilities associated with a discrete change in the j-th explanatory variable by an amount $\Delta x_{ij}$:
$$\Delta p_{ij} := G(x_i'\beta + \Delta x_{ij}\,\beta_j) - G(x_i'\beta)$$
I Marginal effects of dummy variables ($x_{ij} \in \{0, 1\}$) ⇒ compare the probabilities of success in the situations $x_{ij} = 0$ and $x_{ij} = 1$:
$$P(y_i = 1 \mid x_i, x_{ij} = 1) - P(y_i = 1 \mid x_i, x_{ij} = 0) = G\Big(\beta_j + \sum_{k \ne j} \beta_k x_{ik}\Big) - G\Big(\sum_{k \ne j} \beta_k x_{ik}\Big).$$
I Again, one can compute the average effect, or the effect for the average characteristics.
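
A short sketch of the dummy-variable effect for a logit model. The index value of the remaining regressors and the dummy coefficient are hypothetical numbers chosen for illustration.

```python
import math


def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values: index of all other regressors and the dummy's coefficient
rest_index = -0.5     # sum over k != j of beta_k * x_ik
beta_dummy = 0.9      # beta_j

# Discrete change: G(beta_j + rest) - G(rest)
delta_p = logistic_cdf(beta_dummy + rest_index) - logistic_cdf(rest_index)

# Derivative-based approximation g(rest) * beta_j, for comparison
p0 = logistic_cdf(rest_index)
approx = p0 * (1.0 - p0) * beta_dummy

print(f"discrete change:          {delta_p:.4f}")
print(f"derivative approximation: {approx:.4f}")
```

The two numbers are close but not identical, which is why the discrete change is the preferred measure for dummy regressors.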

An advantage of the logit model


I For computing marginal effects, we need the density $g(x_i'\beta)$. In case of the logit model one obtains for $g(x_i'\beta) = \lambda(x_i'\beta)$:
$$\lambda(z) = \Lambda(z)[1 - \Lambda(z)], \quad z = x_i'\beta$$
I An advantage of the logit model is that the logistic distribution function $\Lambda(z)$ and its inverse, the so-called logit function, are given in closed form, in contrast to $\Phi(z)$ and its inverse (probit function). It can be shown that, with $p = \Lambda(z)$ ($= \Lambda(x'\beta)$),
$$z = [x'\beta =]\; \mathrm{logit}(\Lambda(z)) = \mathrm{logit}(p) = \Lambda^{-1}(p) = \ln\!\left(\frac{p}{1 - p}\right),$$
i.e. the logit function may be expressed as log-odds.
⇒ $\beta_j$ is the semielasticity of the odds w.r.t. the j-th regressor.
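
The closed-form relations above are easy to verify numerically. In the sketch below the index value and the coefficient are hypothetical; it checks the density identity, the inverse relation between $\Lambda$ and the logit function, and the odds interpretation of $\beta_j$.

```python
import math


def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds ln(p / (1 - p)), the inverse of the logistic cdf."""
    return math.log(p / (1.0 - p))


z = 0.7            # hypothetical index value x'beta
beta_j = 0.4       # hypothetical coefficient

p = logistic_cdf(z)

# Density identity: e^z / (1 + e^z)^2 equals Lambda(z) * (1 - Lambda(z))
density_direct = math.exp(z) / (1.0 + math.exp(z)) ** 2
density_identity = p * (1.0 - p)

# Odds before and after a one-unit increase in regressor j
odds_before = p / (1.0 - p)
p_after = logistic_cdf(z + beta_j)
odds_after = p_after / (1.0 - p_after)
odds_ratio = odds_after / odds_before

print(f"logit(Lambda(z)) = {logit(p):.4f} (z = {z})")
print(f"density: direct {density_direct:.6f}, identity {density_identity:.6f}")
print(f"odds ratio = {odds_ratio:.4f}, exp(beta_j) = {math.exp(beta_j):.4f}")
```

A one-unit increase in $x_j$ multiplies the odds by $\exp(\beta_j)$, that is, it changes the log-odds by $\beta_j$; for small $\beta_j$ this is approximately a relative change in the odds of $100\,\beta_j$ percent.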


Special cases of generalized linear models


I The (conditional) expectation of the dependent variable $\mu_i = E(y_i \mid x_i)$ is related to a systematic component (linear predictor, i.e. $\eta_i = x_i'\beta$) via a link function $h$:
$$\eta_i = h(\mu_i).$$
I Binary dependent variable ⇒ $\mu_i = P(y_i = 1 \mid x_i) = p_i$ and:
Logit model: $\eta_i = \mathrm{logit}(\mu_i) = \ln\!\left(\dfrac{\mu_i}{1 - \mu_i}\right)$ (logit function)
Probit model: $\eta_i = \mathrm{probit}(\mu_i) = \Phi^{-1}(\mu_i)$ (probit function)
Complementary log-log model: $\eta_i = \ln(-\ln(1 - \mu_i))$
I Canonical link function (for binary y): logit function (cp. McCullagh & Nelder, 1983)


5.2.3 Maximum likelihood (ML) estimation


I The notation of this subsection distinguishes between random
variables $Y_i$ and their realizations $y_i$. Moreover,
$Y = (Y_1, \ldots, Y_N)'$ and $y = (y_1, \ldots, y_N)'$.

I Assumption: $Y_i$ and $x_i$ are stochastic with $(Y_i, x_i)$ i.i.d.
. or $x_i$ deterministic and $Y_i$ independent random variables
⇒ $Y_i \mid x_i$ (stochastically) independent ($i = 1, \ldots, N$)
⇒ $Y_i \mid x_i \sim \text{Bernoulli}(p_i)$ (since $Y_i$ binary)

I Model assumption: $p_i = G(x_i'\beta)$ [$G(z) = \Lambda(z)$ or $G(z) = \Phi(z)$]

I Marginal distribution of $x_i$ does not depend on $\beta$ ⇒ it contains no
information about $\beta$ (regressors are then weakly exogenous) and it
suffices to consider the (conditional) likelihood function.


(Conditional) likelihood function


I For binary random variables $Y_i$ with realizations $y_i \in \{0, 1\}$ we
obtain as probability mass function (“density”):
\[
f(y_i \mid x_i) = P(Y_i = y_i \mid x_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}.
\]

⇒ (Conditional) likelihood function (joint density of Y (given X), read
as function of the parameter $\beta$ for given observation y):
\[
L(\beta) = L(\beta; y) = \prod_{i=1}^{N} f(y_i \mid x_i) = \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{1 - y_i}
= \prod_{i=1}^{N} G(x_i'\beta)^{y_i} [1 - G(x_i'\beta)]^{1 - y_i}.
\]

. Usually, this function is called the likelihood function, although it
is, strictly speaking, a conditional likelihood function.

ML estimator (MLE)
I Log-likelihood function:
\[
\ell(\beta) = \ell(\beta; y) = \ln[L(\beta)] = \sum_{i=1}^{N} \{ y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \}
= \sum_{i=1}^{N} \{ y_i \ln[G(x_i'\beta)] + (1 - y_i) \ln[1 - G(x_i'\beta)] \}.
\]

I Any maximizer $\hat\beta$ of the (log-)likelihood function is called MLE.

⇒ Necessary condition for MLE (likelihood equations):
\[
s(\beta; y) = \frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{N} \frac{(y_i - p_i)}{p_i (1 - p_i)}\, g(x_i'\beta)\, x_i \overset{!}{=} 0 \qquad (p_i = G(x_i'\beta)).
\]
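
As a plausibility check, the score formula can be compared with a finite-difference gradient of the log-likelihood. The Python sketch below assumes the logit case ($G = \Lambda$, $g = \lambda$); the sample `X`, `y` and the evaluation point `beta0` are hypothetical.

```python
import math

def Lambda(z):
    """Logistic cdf."""
    return 1.0 / (1.0 + math.exp(-z))

def lam(z):
    """Logistic density, Lambda(z) * (1 - Lambda(z))."""
    p = Lambda(z)
    return p * (1.0 - p)

def loglik(beta, X, y, G=Lambda):
    """l(beta) = sum_i { y_i ln G(x_i'beta) + (1 - y_i) ln[1 - G(x_i'beta)] }."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = G(sum(a * b for a, b in zip(xi, beta)))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def score(beta, X, y, G=Lambda, g=lam):
    """s(beta) = sum_i (y_i - p_i) / (p_i (1 - p_i)) * g(x_i'beta) * x_i."""
    s = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        z = sum(a * b for a, b in zip(xi, beta))
        p = G(z)
        weight = (yi - p) / (p * (1.0 - p)) * g(z)
        for k, xik in enumerate(xi):
            s[k] += weight * xik
    return s

def numerical_score(beta, X, y, h=1e-6):
    """Central finite-difference approximation of the gradient of loglik."""
    grad = []
    for k in range(len(beta)):
        up = list(beta)
        up[k] += h
        dn = list(beta)
        dn[k] -= h
        grad.append((loglik(up, X, y) - loglik(dn, X, y)) / (2 * h))
    return grad

# Hypothetical sample: intercept plus one regressor.
X = [[1.0, -1.0], [1.0, 0.0], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 1, 0, 1, 1]
beta0 = [0.2, 0.5]

analytic = score(beta0, X, y)
numeric = numerical_score(beta0, X, y)
print("analytic score:", analytic)
print("numeric score: ", numeric)
```

Passing a different cdf/density pair for `G` and `g` (for example the standard normal ones) would give the probit version of the same check.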


Numerical procedures
I There is no explicit solution to the likelihood equations.

I However, for both the logit and the probit model the
log-likelihood function $\ell(\beta)$ is globally concave, so that the MLE
is unique whenever it exists.

⇒ Numerical procedures, such as the Newton-Raphson procedure, will
converge quickly to the MLE in these models:
\[
\beta_{(n+1)} = \beta_{(n)} - H^{-1}(\beta_{(n)})\, s(\beta_{(n)}) \qquad \text{(Iteration)},
\]
where $s(\beta) = \frac{\partial \ell(\beta)}{\partial \beta}$ is the gradient (score) vector and $H(\beta)$ is the
Hessian matrix of $\ell(\beta)$:
\[
H(\beta) = \frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta'}.
\]
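
A minimal Newton-Raphson sketch in Python for the logit case is given below. It uses the logit score $\sum_i (y_i - p_i) x_i$ and Hessian $-\sum_i p_i (1 - p_i) x_i x_i'$ (both standard for the logit model). The data, the starting value 0, the tolerance, and the helper `solve` are assumptions made for this illustration.

```python
import math

def Lambda(z):
    """Logistic cdf."""
    return 1.0 / (1.0 + math.exp(-z))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def newton_logit(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson iterations for the logit MLE, started at beta = 0."""
    K = len(X[0])
    beta = [0.0] * K
    for _ in range(max_iter):
        s = [0.0] * K                          # score vector
        neg_H = [[0.0] * K for _ in range(K)]  # -H = sum_i p_i (1 - p_i) x_i x_i'
        for xi, yi in zip(X, y):
            p = Lambda(sum(a * b for a, b in zip(xi, beta)))
            for k in range(K):
                s[k] += (yi - p) * xi[k]
                for l in range(K):
                    neg_H[k][l] += p * (1.0 - p) * xi[k] * xi[l]
        step = solve(neg_H, s)                 # equals -H^{-1} s
        beta = [b + d for b, d in zip(beta, step)]
        if max(abs(d) for d in step) < tol:
            break
    return beta

# Hypothetical sample (intercept plus one regressor), not perfectly separated.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
X = [[1.0, v] for v in xs]

beta_hat = newton_logit(X, y)
p_hat = [Lambda(sum(a * b for a, b in zip(xi, beta_hat))) for xi in X]
score_at_mle = [sum((yi - pi) * xi[k] for xi, yi, pi in zip(X, y, p_hat)) for k in range(2)]
print("beta_hat:", beta_hat)
print("score at beta_hat:", score_at_mle)
```

Because the model contains an intercept, the mean of `p_hat` should equal the sample frequency of successes once the iterations have converged.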

Fisher information
I The Fisher information is the negative expected Hessian matrix
(here: conditional expectation given X if X is random), i.e.:
\[
\mathcal{I}(\beta) = -E_\beta[H(\beta)] = -E_\beta\left[\frac{\partial s(\beta)}{\partial \beta'}\right] = \sum_{i=1}^{N} \frac{g(x_i'\beta)^2}{p_i (1 - p_i)}\, x_i x_i'.
\]

⇒ $\mathcal{I}(\beta)$ is positive definite for any value of $\beta \in \mathbb{R}^K$, if
. G is a cdf with $0 < G(z) < 1$ for all $z \in \mathbb{R}$, and
. $X = (x_1, \ldots, x_N)'$ has full rank K (with probability one).
⇒ The expected Hessian is then negative definite.
I However, from this fact it does not follow that the actual Hessian is
negative definite (and thus that the log-likelihood is globally
concave).


Fisher information - Interpretation


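I Cramér-Rao: $\mathcal{I}(\beta)^{-1} \preceq V[\tilde\beta]$ for any unbiased estimator $\tilde\beta$.
I Alternative definition (equivalent under regularity conditions):
\[
\underset{K \times K}{\mathcal{I}(\beta)} = E_\beta[s(\beta, Y)\, s(\beta, Y)'] = E_\beta\left[\frac{\partial \ell(\beta)}{\partial \beta} \frac{\partial \ell(\beta)}{\partial \beta'}\right].
\]
⇒ For K = 1:
\[
\mathcal{I}(\beta) = E_\beta\left[\left(\frac{\partial \ell(\beta; Y)}{\partial \beta}\right)^2\right] = E_\beta\left[\left(\frac{\frac{d}{d\beta} f(Y, \beta)}{f(Y, \beta)}\right)^2\right]
\]
I $\frac{\frac{d}{d\beta} f(y, \beta)}{f(y, \beta)}$ describes the relative rate of change of the density in y.
⇒ The larger $\mathcal{I}(\beta)$ is at $\beta = \beta_0$, the easier it is to distinguish $\beta_0$
from adjacent parameter values and the more precisely one can
estimate the parameter value $\beta = \beta_0$.
⇒ $\mathcal{I}(\beta)$ describes the information contained in y about $\beta$.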

Asymptotic properties of the MLE $\hat\beta$

I Assumption: The model is correctly specified. Then:
I $\hat\beta$ is consistent, i.e.
\[
\hat\beta \xrightarrow{p} \beta \quad \text{as } N \to \infty.
\]
. This follows basically from $E(Y_i \mid x_i) = p_i$.
I Under certain regularity conditions (see Amemiya, 1985), $\hat\beta$ is
asymptotically normally distributed and asymptotically efficient, i.e.
\[
\sqrt{N}(\hat\beta - \beta) \xrightarrow[N \to \infty]{d} \mathcal{N}\Big(0, \lim_{N \to \infty} [\mathcal{I}(\beta)/N]^{-1}\Big),
\]
where $\mathcal{I}(\beta)$ is the Fisher information matrix.

⇒ Approximate (“asymptotic”) covariance matrix of the MLE:
\[
AV(\hat\beta) = \mathcal{I}(\beta)^{-1}.
\]

On the definition of asymptotic efficiency

I A consistent and asymptotically normally distributed estimator is called
asymptotically efficient if the covariance matrix of the asymptotic
distribution is $V_\infty(\beta) := \lim_{N \to \infty} [\mathcal{I}(\beta)/N]^{-1}$ or the resulting
approximate covariance matrix is $\mathcal{I}(\beta)^{-1}$.
I There exist several asymptotically efficient estimators with very
different finite sample properties.
I The concept suggests $V_\infty(\beta)$ as a lower bound for the covariance matrix
in the asymptotic distribution of consistent estimators.
However, the covariance matrix (in the asymptotic distribution) of
so-called super-efficient estimators may fall below that bound!


Statistical inference
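I For large N, the following approximate distribution can be used:
\[
\hat\beta \approx \mathcal{N}_K(\beta, \hat V) \qquad (\hat V \text{ suitable estimator of } \mathcal{I}(\beta)^{-1}).
\]
(a) $\hat V_1 = \mathcal{I}(\hat\beta)^{-1}$
(b) $\hat V_2 = \left[\sum_{i=1}^{N} \left.\frac{\partial \ell_i}{\partial \beta} \frac{\partial \ell_i}{\partial \beta'}\right|_{\beta = \hat\beta}\right]^{-1}$
(c) $\hat V_3 = \left[-\left.\frac{\partial^2 \ell}{\partial \beta\, \partial \beta'}\right|_{\beta = \hat\beta}\right]^{-1}$

(a) ⇒ $\hat V_1 = \left[\sum_{i=1}^{N} \frac{g(x_i'\hat\beta)^2}{\hat p_i (1 - \hat p_i)}\, x_i x_i'\right]^{-1}$
(b) ⇒ $\hat V_2 = \left[\sum_{i=1}^{N} \frac{(y_i - \hat p_i)^2}{\hat p_i^2 (1 - \hat p_i)^2}\, g(x_i'\hat\beta)^2\, x_i x_i'\right]^{-1}$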

Special case of logit model


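I $G(z) = \Lambda(z) = \frac{e^z}{1 + e^z}$, $g(z) = \lambda(z) = \Lambda'(z) = \frac{e^z}{(1 + e^z)^2}$
⇒ $\Lambda(z)(1 - \Lambda(z)) = \lambda(z)$
I $p_i = P(Y_i = 1 \mid x_i) = \Lambda(x_i'\beta) = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}} =: \Lambda_i$
⇒ $\lambda_i := \lambda(x_i'\beta) = \Lambda_i (1 - \Lambda_i)$
⇒ Likelihood equations:
\[
s(\beta) = \frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{N} (y_i - p_i)\, x_i = 0.
\]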


Logit model, cont.


I The vector of residuals $(y_i - \hat p_i)_{i=1,\ldots,N}$ with $\hat p_i = \Lambda(x_i'\hat\beta)$ is orthogonal
to the regressors (similar to the linear model).
I If $x_{i1} \equiv 1$ (i.e. model includes an intercept), then the first likelihood
equation yields:
\[
\sum_{i=1}^{N} [y_i - \Lambda(x_i'\hat\beta)] = 0
\quad \Leftrightarrow \quad
\bar{y} = \frac{1}{N} \sum_{i=1}^{N} \Lambda(x_i'\hat\beta) = \bar{\hat p}
\]
⇒ The average estimated probability of success equals the observed
frequency of success.


Logit model: Uniqueness of MLE


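I Hessian matrix of $\ell(\beta)$ is negative definite and thus $\ell(\beta)$ is globally
concave:
\[
\frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta'} = \frac{\partial s(\beta)}{\partial \beta'} = -\sum_{i=1}^{N} x_i \underbrace{\frac{\partial p_i}{\partial \beta'}}_{\lambda(x_i'\beta)\, x_i'} = -\sum_{i=1}^{N} \underbrace{\lambda_i}_{\Lambda_i (1 - \Lambda_i)} x_i x_i'
\]
\[
\Rightarrow \quad \frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta'} = -\sum_{i=1}^{N} p_i (1 - p_i)\, x_i x_i' \quad \text{n.d.}
\]
I Assume that $X'X = \sum_{i=1}^{N} x_i x_i'$ is regular (p.d.) and $p_i \in (0, 1)$,
implying $\lambda_i = p_i (1 - p_i) > 0$ ($\forall i$).
I Note that the Hessian matrix does not depend on y.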


Logit model: Fisher Information


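\[
\mathcal{I}(\beta) = -E\left[\frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta'}\right] = \sum_{i=1}^{N} p_i (1 - p_i)\, x_i x_i'
\]
\[
\Rightarrow \quad \hat V = \mathcal{I}(\hat\beta)^{-1} = \left[\sum_{i=1}^{N} \hat p_i (1 - \hat p_i)\, x_i x_i'\right]^{-1}
\]

I This estimator of $AV(\hat\beta)$ corresponds to $\hat V_3$ and coincides with the
representation of $\hat V_1$ [note $\hat\lambda_i = g(x_i'\hat\beta) = \hat p_i (1 - \hat p_i)$].
I Previous representation of $\hat V_2$: use $(y_i - \hat p_i)^2$ instead of $\hat p_i (1 - \hat p_i)$,
since again $\hat\lambda_i = g(x_i'\hat\beta) = \hat p_i (1 - \hat p_i)$.
. Relation between the “outer product” of the score vector and
the Fisher information is obvious from
\[
E[(Y_i - p_i)^2 \mid x_i] = V(Y_i \mid x_i) = p_i (1 - p_i)
\]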


Probit model

I The analysis is technically somewhat more involved.

I But the Hessian matrix is here again negative definite, so that there
are generally no problems with the numerical determination of the
MLE.


Perfect prediction
I An MLE does not always exist.
For example, if rank(X ) < K , then the parameter β is not
identifiable (as in the linear case).
Assuming rank(X ) = K (achievable e.g. by a re-parametrization of
the model) avoids that problem.
I However, in a nonlinear binary response model one may be
confronted with the so-called problem of perfect prediction.
. It is typically a problem of the sample at hand and not of
identification.
. It would possibly disappear if more data (or another sample)
were available.


Perfect prediction: Example


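I Response $y_i$; regressors $x_i$ and a dummy variable $d_i$ with:
$y_i = 1$ whenever $d_i = 1$, and $y_i = 1$ or 0 if $d_i = 0$.
⇒ Impossible to estimate the effect of $d_i$ on $P(Y_i = 1 \mid x_i, d_i)$:
\[
\ell(\beta, \delta) = \sum_{i=1}^{N} \{ y_i \ln[G(x_i'\beta + \delta d_i)] + (1 - y_i) \ln[1 - G(x_i'\beta + \delta d_i)] \}
\]
\[
= \sum_{i:\, d_i = 1} \ln[G(x_i'\beta + \delta)] + \sum_{i:\, d_i = 0} \{ y_i \ln[G(x_i'\beta)] + (1 - y_i) \ln[1 - G(x_i'\beta)] \}
\]

I Only the first sum (over i with $d_i = 1$) depends on $\delta$. Since G is a strictly
increasing cdf (logit or probit), this sum is strictly increasing in $\delta$ and
approaches its supremum 0 only as $\delta \to \infty$.

⇒ There is no (finite) MLE of $\delta$!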
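
The absence of a finite maximizer can be made visible numerically. The Python sketch below evaluates the part of the log-likelihood that depends on $\delta$ for a logit model; the index values `index_d1` (the $x_i'\beta$ of the observations with $d_i = 1$) are hypothetical.

```python
import math

def log_Lambda(z):
    """ln Lambda(z), written so that it stays stable for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

# Hypothetical values of x_i'beta for the observations with d_i = 1
# (all of which have y_i = 1 under perfect prediction).
index_d1 = [-0.4, 0.1, 0.9]

def delta_part(delta):
    """Sum over {i: d_i = 1} of ln G(x_i'beta + delta), with G = Lambda."""
    return sum(log_Lambda(z + delta) for z in index_d1)

deltas = [0, 1, 2, 5, 10, 20]
values = [delta_part(d) for d in deltas]
for d, v in zip(deltas, values):
    print(f"delta = {d:>2}: contribution to log-likelihood = {v:.10f}")
```

The printed values increase with every step and approach 0 from below, so any candidate value of $\delta$ can be improved upon by a larger one.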


Perfect prediction, cont.


I The same problem may also arise in the following cases:
. yi = 0 whenever di = 0,
. yi = 1 whenever di = 0, or
. yi = 0 whenever di = 1.
I The smaller the number of observations with di = 1, the more likely
the problem of perfect prediction occurs.
I In the extreme case with di = 1 for just one observation, perfect
prediction must even occur.
I To get rid of the problem, one should exclude the dummy variable
from the regressors.


Emp. Example 1 (determinants of fertility)


I Probit Model: Estimation and information criteria (Stata output)
. probit childless time educ white sibs if age>39

Iteration 0:   log likelihood = -2126.8908
Iteration 1:   log likelihood = -2107.1463
Iteration 2:   log likelihood = -2107.1116
Iteration 3:   log likelihood = -2107.1116

Probit regression                         Number of obs   =       5150
                                          LR chi2(4)      =      39.56
                                          Prob > chi2     =     0.0000
Log likelihood = -2107.1116               Pseudo R2       =     0.0093

-----------------------------------------------------------------------------
   childless |      Coef.   Std. Err.       z    P>|z|   [95% Conf. Interval]
-------------+---------------------------------------------------------------
        time |   .0027483    .0025405    1.08    0.279    -.002231   .0077275
        educ |   .0314184    .0071296    4.41    0.000    .0174446   .0453923
       white |   .0625978    .0626362    1.00    0.318   -.0601669   .1853624
        sibs |  -.0117455    .0071229   -1.65    0.099   -.0257061   .0022152
       _cons |     -1.503    .1157881  -12.98    0.000   -1.729941   -1.27606
-----------------------------------------------------------------------------

. estat ic

-----------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df         AIC         BIC
-------------+---------------------------------------------------------------
           . |   5150   -2126.891   -2107.112      5    4224.223    4256.957
-----------------------------------------------------------------------------


Logit model: Estimation and IC


. logit childless time educ white sibs if age>39

Iteration 0:   log likelihood = -2126.8908
Iteration 1:   log likelihood = -2106.1626
Iteration 2:   log likelihood = -2105.9611
Iteration 3:   log likelihood = -2105.9611

Logistic regression                       Number of obs   =       5150
                                          LR chi2(4)      =      41.86
                                          Prob > chi2     =     0.0000
Log likelihood = -2105.9611               Pseudo R2       =     0.0098

-----------------------------------------------------------------------------
   childless |      Coef.   Std. Err.       z    P>|z|   [95% Conf. Interval]
-------------+---------------------------------------------------------------
        time |   .0049797      .00467    1.07    0.286   -.0041733   .0141328
        educ |   .0630456    .0136222    4.63    0.000    .0363466    .089744
       white |   .1287142    .1181644    1.09    0.276   -.1028837   .3603122
        sibs |  -.0210833    .0134808   -1.56    0.118   -.0475053   .0053387
       _cons |  -2.676407    .2251076  -11.89    0.000    -3.11761  -2.235204
-----------------------------------------------------------------------------

. estat ic

-----------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df         AIC         BIC
-------------+---------------------------------------------------------------
           . |   5150   -2126.891   -2105.961      5    4221.922    4254.656
-----------------------------------------------------------------------------

→ See R code for further analysis results!



5.2.4 Model diagnostics


Covariate patterns

I Two observations share the same covariate pattern if their regressors
are identical.

I Statistical information in the sample can be summarized by the
covariate patterns, the number of observations with each covariate
pattern, and the number of positive outcomes.

I For example, Stata calculates residuals and diagnostic statistics in
terms of covariate patterns.


I Assume: M covariate patterns; pattern j with $n_j$ observations ($\sum_j n_j = N$)
⇒ Number of positive outcomes with pattern j:
\[
\tilde y_j := \sum_{i:\, x_i = x_j} y_i \sim \text{Bin}(n_j, p_j)
\]
⇒ Likelihood function:
\[
\prod_{j=1}^{M} \binom{n_j}{\tilde y_j} p_j^{\tilde y_j} (1 - p_j)^{n_j - \tilde y_j}
\]
⇒ Maximized log-likelihood function of current model:
\[
\ln(\hat L_c) = \sum_{j=1}^{M} \left\{ \ln\binom{n_j}{\tilde y_j} + \tilde y_j \ln(\hat p_j) + (n_j - \tilde y_j) \ln(1 - \hat p_j) \right\},
\]
where $\hat p_j = G(x_j'\hat\beta)$.


Pearson’s $\chi^2$ goodness-of-fit statistic

I Measure for discrepancy between the data and the (fitted) model:
\[
\chi^2 = \sum_{j=1}^{M} \frac{(\tilde y_j - n_j \hat p_j)^2}{n_j \hat p_j (1 - \hat p_j)}
\]
. Given the model holds, $\chi^2$ follows approximately a $\chi^2_{M-K}$
distribution (often inadequate if M is large).
. Asymptotic justification requires a fixed M and $n_j \to \infty$ ($\forall j$).
I Hosmer-Lemeshow goodness-of-fit $\chi^2$:
. Similar to Pearson, but instead of using M covariate patterns
as groups it uses quantiles of the predicted probabilities to
form a smaller number m of groups (e.g. m = 10).
. m groups lead to a statistic with an approximate $\chi^2_{m-2}$
distribution given the model holds.
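
The grouping by covariate patterns and the Pearson statistic can be sketched in a few lines of Python. The regressors, outcomes, and the probabilities `p_hat` below are hypothetical; in particular, `p_hat` is simply assumed rather than obtained from a fitted model, so the example only illustrates the arithmetic.

```python
from collections import defaultdict

def pearson_chi2(X, y, p_hat):
    """Pearson chi^2 over covariate patterns.

    X: list of regressor rows, y: binary outcomes,
    p_hat: predicted probabilities (identical within a covariate pattern).
    Returns (chi2, M) with M the number of covariate patterns.
    """
    groups = defaultdict(lambda: [0, 0, 0.0])  # pattern -> [n_j, y_tilde_j, p_j]
    for xi, yi, pi in zip(X, y, p_hat):
        g = groups[tuple(xi)]
        g[0] += 1      # n_j: observations with this pattern
        g[1] += yi     # y_tilde_j: positive outcomes with this pattern
        g[2] = pi      # p_j: predicted probability for this pattern
    chi2 = 0.0
    for n_j, y_j, p_j in groups.values():
        chi2 += (y_j - n_j * p_j) ** 2 / (n_j * p_j * (1.0 - p_j))
    return chi2, len(groups)

# Hypothetical data with M = 2 covariate patterns (intercept, dummy).
X = [(1, 0)] * 4 + [(1, 1)] * 4
y = [1, 0, 0, 0, 1, 1, 1, 0]
p_hat = [0.25] * 4 + [0.5] * 4   # assumed values, not fitted probabilities

chi2, M = pearson_chi2(X, y, p_hat)
print(f"chi2 = {chi2}, number of covariate patterns M = {M}")
```

In this toy example the first pattern contributes 0 (one success out of four at $\hat p = 0.25$) and the second contributes $(3 - 2)^2 / (4 \cdot 0.5 \cdot 0.5) = 1$.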


I There is no direct generalization of $R^2$ to nonlinear models, since
the estimation procedures do not aim at maximizing the “fraction of
the explained variance”.
I Testing the overall goodness-of-fit corresponds to testing the null
hypothesis $H_0: \beta_2 = \ldots = \beta_K = 0$.
I Likelihood ratio (LR) test ⇒ Test statistic:
\[
LR = 2(\ln(L_U) - \ln(L_R)),
\]
where $L_U = \max_{\beta} L(\beta) = L(\hat\beta)$ and $L_R = \max_{H_0} L(\beta)$.

I Under $H_0$:
\[
LR \xrightarrow[N \to \infty]{d} \chi^2_{K-1}.
\]
⇒ An asymptotic α-test rejects $H_0$ if $LR > \chi^2_{K-1;\,1-\alpha}$.


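McFadden’s pseudo $R^2$
\[
R_F^2 = 1 - \frac{\ln(L_U)}{\ln(L_R)}
\]

I $\ln[L(\beta)]$ is a sum of log probabilities, $L_U \geq L_R$ ⇒
\[
0 \geq \ln(L_U) \geq \ln(L_R) \;\Rightarrow\; 1 \geq \frac{\ln(L_U)}{\ln(L_R)} \geq 0 \;\Rightarrow\; 0 \leq R_F^2 \leq 1
\]

I $R_F^2 = 0 \Leftrightarrow \hat\beta_2 = \ldots = \hat\beta_K = 0$ (i.e. under $H_0$)

I $R_F^2 = 1 \Leftrightarrow L_U = 1$ ($\Leftrightarrow \ln(L_U) = 0$)
$\Leftrightarrow \hat p_i = y_i$ ($\forall i$) (practically, not achievable for finite $\hat\beta$)
(i.e. model provides perfect prediction).
I But values between 0 and 1 have no natural interpretation!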
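
The LR statistic and McFadden’s pseudo $R^2$ in the Stata output of Emp. Example 1 (Subsection 5.2.3) can be reproduced from the reported log-likelihood values. The Python sketch below uses only those reported numbers; the function names are chosen for this illustration, and the 5% critical value of the $\chi^2_4$ distribution (about 9.488) is a standard table value.

```python
# Log-likelihood values as reported in the Stata output (Emp. Example 1).
ll_null = -2126.8908     # restricted model (intercept only), iteration 0
ll_probit = -2107.1116   # unrestricted probit model
ll_logit = -2105.9611    # unrestricted logit model

def lr_statistic(ll_u, ll_r):
    """LR = 2 (ln L_U - ln L_R)."""
    return 2.0 * (ll_u - ll_r)

def mcfadden_r2(ll_u, ll_r):
    """R_F^2 = 1 - ln(L_U) / ln(L_R)."""
    return 1.0 - ll_u / ll_r

CHI2_4_CRIT_95 = 9.488   # 0.95 quantile of chi^2 with K - 1 = 4 degrees of freedom

for name, ll in [("probit", ll_probit), ("logit", ll_logit)]:
    lr = lr_statistic(ll, ll_null)
    r2 = mcfadden_r2(ll, ll_null)
    print(f"{name}: LR = {lr:.2f}, pseudo R2 = {r2:.4f}, "
          f"reject H0 at 5%: {lr > CHI2_4_CRIT_95}")
```

Both models reject $H_0$ clearly although the pseudo $R^2$ is below 0.01, which illustrates that the two quantities answer different questions.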

Model selection
I Comparison of model candidates $m \in \mathcal{M}$ e.g. by
\[
AIC(m) = -\frac{2}{N} \hat\ell_m + \frac{2|m|}{N}, \quad \text{where}
\]
. $\hat\ell_m$ is the maximized log-likelihood for model m
. |m| denotes the dimension (number of parameters) of model m
⇒ Minimizing AIC(m) over $m \in \mathcal{M}$ provides a trade-off between
good model fit (small bias) and low model complexity (small
estimation error/variance)
. AIC(m): approximately unbiased estimate of (twice the)
expected Kullback-Leibler discrepancy of model m
. NLM: $AIC(m) = \ln(\hat\sigma_m^2) + 2|m|/N$ ($\hat\sigma_m^2$ MLE of $\sigma^2$ under m)
. Min.-AIC-procedure is (under assumptions) asymptotically optimal
I BIC(m) uses factor ln(N) instead of 2 as penalty for |m|
(under assumptions: consistent model selection procedure)
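
The information criteria reported by `estat ic` in Emp. Example 1 can be recomputed from the log-likelihoods. Note that Stata reports the unscaled versions $-2\hat\ell_m + 2|m|$ and $-2\hat\ell_m + \ln(N)|m|$, i.e. N times the per-observation definition used above; the ranking of models is unaffected. The Python sketch below uses the reported values; the function names are chosen for illustration.

```python
import math

N = 5150   # number of observations (Emp. Example 1)
k = 5      # number of parameters |m| (four slopes plus intercept)

ll_probit = -2107.1116   # maximized log-likelihood, probit
ll_logit = -2105.9611    # maximized log-likelihood, logit

def aic(ll, k):
    """Unscaled AIC as reported by Stata: -2 ll + 2 k."""
    return -2.0 * ll + 2.0 * k

def bic(ll, k, n):
    """Unscaled BIC as reported by Stata: -2 ll + ln(n) k."""
    return -2.0 * ll + math.log(n) * k

for name, ll in [("probit", ll_probit), ("logit", ll_logit)]:
    print(f"{name}: AIC = {aic(ll, k):.3f}, BIC = {bic(ll, k, N):.3f}, "
          f"AIC per observation = {aic(ll, k) / N:.5f}")
```

Since both models have the same number of parameters, either criterion simply prefers the model with the larger log-likelihood, here the logit model.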

Predictive quality

I Alternative model specifications may be compared by evaluating
their classification properties.

I The MLE $\hat\beta$ of the model parameter $\beta$ provides an estimator
$\hat p_i = G(x_i'\hat\beta)$ of the probabilities of success $p_i = G(x_i'\beta)$ (for
choosing $y_i = 1$).

I This can be used to predict $y_i$ (the choice):
\[
\hat y_i = \hat y_i(c) = \begin{cases} 1, & \text{if } \hat p_i > c \\ 0, & \text{else} \end{cases}
\]


On choosing the cutoff

I In practice one often chooses the threshold (cutoff) c = 0.5.

I This cutoff choice may be regarded as reasonable if the outcomes 0
and 1 are equally likely to occur in the population, and if the costs
of incorrectly predicting 0 and 1 are approximately the same.

I However, this threshold level has the weakness that if most
outcomes are successes ($y_i = 1$), then it is very likely that for all
observations $\hat p_i > 0.5$ and thus $\hat y_i = 1$, leading to
$\sum_i (y_i - \hat y_i)^2 = N(1 - \bar y)$ as the number of wrong predictions.

I A similar argument holds if most outcomes are failures.


2 × 2 classification table
I Results can be summarized in a 2 × 2 classification table of the
predicted responses $\hat y_i$ against the observed responses $y_i$:

                              Actual value
                          y_i = 1    y_i = 0    Total
Predicted    ŷ_i = 1        TP         FP       TP+FP
outcome      ŷ_i = 0        FN         TN       FN+TN
             Total         TP+FN      FP+TN       N

I TP (True Positives), FP (False Positives), FN (False Negatives),
TN (True Negatives)
I This contingency table could also be given for observations from an
independent validation sample (out-of-sample) instead of using the
actual sample (in-sample).

The hit rate


I The hit rate is defined as the fraction of correct predictions
\[
\hat h = \hat h(c) = \frac{1}{N} \sum_{i=1}^{N} I(\hat y_i = y_i) = \frac{TP + TN}{N} \overset{!}{=} 1 - \frac{\sum_{i=1}^{N} (y_i - \hat y_i)^2}{N}
\]
and estimates the unconditional probability of correct classification
$h(c) = P(\hat y(c) = y)$.
I Instead of treating the cutoff as given, say c = 0.5, one could try to
find an “optimal” cutoff for the data set by evaluating different
cutoff values and minimizing the associated proportion of incorrectly
predicted outcomes or, equivalently, by maximizing the hit rate $\hat h(c)$.
I This approach seems reasonable when the data are a random sample
from the population of interest, and the costs of incorrectly
predicting 0 and 1 are the same.


Specificity and sensitivity


I Maximization of the hit rate (minimization of the estimated unconditional
probability of misclassification) is not always meaningful.
I E.g., if the outcome 1 (“success”) is very rare, then one tends to
choose a very large c such that a failure is predicted for everyone.
⇒ Alternative: Consider the conditional probabilities of correct
classification given $y_i = 0$ and $y_i = 1$, respectively, i.e.
. Specificity: $h_0 = h_0(c) = P(\hat y(c) = 0 \mid y = 0)$,
. Sensitivity: $h_1 = h_1(c) = P(\hat y(c) = 1 \mid y = 1)$.
. There is a close relation to the errors of type I and II for tests.
I $h_0$ and $h_1$ can be estimated by the proportion of correct predictions
separately for the outcomes $y_i = 0$ and $y_i = 1$, resp. [proportion of
actual negatives (positives) which are correctly identified]:
\[
\hat h_0 = \hat h_0(c) = \frac{TN}{FP + TN}, \qquad \hat h_1 = \hat h_1(c) = \frac{TP}{TP + FN}.
\]
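
The classification table and the three summary measures can be computed directly from predicted probabilities. In the Python sketch below, the outcomes `y` and probabilities `p_hat` are hypothetical, and the function names are chosen for illustration.

```python
def classify(p_hat, c=0.5):
    """Predict 1 if the estimated probability exceeds the cutoff c, else 0."""
    return [1 if p > c else 0 for p in p_hat]

def classification_table(y, y_hat):
    """Counts of true/false positives/negatives."""
    table = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for yi, yh in zip(y, y_hat):
        if yh == 1 and yi == 1:
            table["TP"] += 1
        elif yh == 1 and yi == 0:
            table["FP"] += 1
        elif yh == 0 and yi == 1:
            table["FN"] += 1
        else:
            table["TN"] += 1
    return table

# Hypothetical observed outcomes and predicted probabilities.
y = [1, 1, 1, 0, 0, 0, 1, 0]
p_hat = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1, 0.8, 0.3]

t = classification_table(y, classify(p_hat, c=0.5))
hit_rate = (t["TP"] + t["TN"]) / len(y)
specificity = t["TN"] / (t["FP"] + t["TN"])   # estimate of h_0
sensitivity = t["TP"] / (t["TP"] + t["FN"])   # estimate of h_1

print(t)
print(f"hit rate = {hit_rate}, specificity = {specificity}, sensitivity = {sensitivity}")
```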

Receiver operating characteristic (ROC)


I ROC curves measure the predictive power in two-class problems.
I There is usually a trade-off between $h_0$ and $h_1$: The higher $h_0$, the
lower $h_1$, and vice versa.
I Graphical representation by the ROC curve: Plot of sensitivity,
$h_1(c)$, versus one minus specificity, $1 - h_0(c)$, as the cutoff is varied
(from 1 to 0). [In practice: Plot $\hat h_1(c)$ versus $1 - \hat h_0(c)$.]
I The curve starts at (0, 0), corresponding to c = 1, and continues
to (1, 1), corresponding to c = 0.
I A model with no predictive ability (complete randomness) yields a
straight (diagonal) line from (0, 0) to (1, 1).
I The curve in case of a perfect prediction goes straight from (0, 0)
via (0, 1) to (1, 1).

Area under the ROC curve (AUC)


I The predictive ability could be assessed by the area under the ROC
curve (AUC), which varies from 0.5 for “random prediction” to 1 for
“perfect prediction”.
I The greater the predictive power of a model, the more bowed the
curve, and hence the larger the area under the curve. This allows a
comparison of competing models.
I The ROC curve may be used to determine an optimal cutoff value
c, e.g. by minimizing the sum of the (conditional) error frequencies
or, equivalently, by maximizing h0 (c) + h1 (c).
Graphically, this point is obtained by shifting in parallel the diagonal
line to the northwest until it is just tangent to the ROC curve.
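
An empirical ROC curve, its area, and a cutoff maximizing $\hat h_0(c) + \hat h_1(c)$ can be obtained as sketched below in Python. The outcomes and predicted probabilities are hypothetical; the AUC is computed with the trapezoidal rule, which is one common choice and an assumption of this sketch.

```python
def roc_points(y, p_hat):
    """Return (cutoff, 1 - specificity, sensitivity) for cutoffs from 1 down to 0."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    cutoffs = sorted(set(p_hat) | {0.0, 1.0}, reverse=True)
    points = []
    for c in cutoffs:
        tp = sum(1 for yi, pi in zip(y, p_hat) if pi > c and yi == 1)
        fp = sum(1 for yi, pi in zip(y, p_hat) if pi > c and yi == 0)
        points.append((c, fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (_, x0, y0), (_, x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical observed outcomes and predicted probabilities.
y = [1, 1, 1, 0, 0, 0, 1, 0]
p_hat = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1, 0.8, 0.3]

points = roc_points(y, p_hat)
area = auc(points)

# Cutoff maximizing specificity + sensitivity (ties: first cutoff found).
best = max(points, key=lambda t: (1.0 - t[1]) + t[2])
best_sum = (1.0 - best[1]) + best[2]

print(f"AUC = {area}")
print(f"cutoff = {best[0]}, h0 + h1 = {best_sum}")
```

For these values several cutoffs attain the same maximal sum, so the selected cutoff is not unique; with real data one would typically inspect the whole curve.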


Analysis of residuals
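I MLE is inconsistent if the model is not correctly specified.
I $y_i \mid x_i \sim \text{Bernoulli}(p_i)$
⇒ $E(y_i \mid x_i) = p_i$ and $Var(y_i \mid x_i) = p_i (1 - p_i)$
⇒ Pearson (or “standardized”) residuals:
\[
r_i = \frac{y_i - \hat p_i}{\sqrt{\hat p_i (1 - \hat p_i)}}
\]
I Case of covariate patterns (as before)
. $\tilde y_j := \sum_{i:\, x_i = x_j} y_i \sim \text{Bin}(n_j, p_j)$
⇒ $E(\tilde y_j \mid x_j) = n_j p_j$ and $V(\tilde y_j \mid x_j) = n_j p_j (1 - p_j)$
\[
\Rightarrow \quad r_j = \frac{\tilde y_j - n_j \hat p_j}{\sqrt{n_j \hat p_j (1 - \hat p_j)}} \quad \Rightarrow \quad \chi^2 = \sum_{j=1}^{M} r_j^2
\]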

I Outlier detection: e.g. histogram or box plot of (Pearson) residuals
I Potential heteroskedasticity? Plot $r_j$ vs. explanatory variables.
I Heteroskedasticity problems may also be caused by a
misspecification of the function G or by omitting a relevant
explanatory variable.
I If potential omitted variables are known, a Wald, LR or LM test
could be derived to test for that type of misspecification.
I Similarly, testing for heteroskedasticity is usually based on the LM
test statistic (assuming some heteroskedastic model under the
alternative).
I Similar to the linear regression case, more sophisticated tools for
identifying outliers, high leverage and influential values, etc. are
available in the literature (e.g. Pregibon, 1981).


Logit model diagnostics: Basic building blocks


I For identification of outlying and influential observations we need:
residuals and an appropriate projection matrix.
I Likelihood equations (FOC):
\[
s(\beta) = \sum_{i=1}^{N} (y_i - p_i)\, x_i = X'e = 0 \quad \text{with } e_i = y_i - p_i,\; e = (e_1, \ldots, e_N)'
\]
I Hessian matrix of $\ell(\beta)$:
\[
H = -\sum_{i=1}^{N} p_i (1 - p_i)\, x_i x_i' = -X'WX \quad \text{with } W = \text{diag}[p_i (1 - p_i)]
\]
I Newton-Raphson procedure:
\[
\beta_{(n+1)} = \beta_{(n)} - H^{-1}(\beta_{(n)})\, s(\beta_{(n)})
\]

⇒ With pseudo observations $z_{(n)} = X\beta_{(n)} + W_{(n)}^{-1} e_{(n)}$:
\[
\beta_{(n+1)} = \beta_{(n)} + (X'W_{(n)}X)^{-1} X'e_{(n)} = (X'W_{(n)}X)^{-1} X'W_{(n)} z_{(n)}
\]
I At convergence ($\beta_{(n+1)} = \beta_{(n)} = \hat\beta$):
\[
\hat\beta = (X'WX)^{-1} X'Wz \qquad (W, z \text{ are } W_{(n)}, z_{(n)} \text{ evaluated at } \hat\beta)
\]
I Define the projection $H = W^{1/2} X (X'WX)^{-1} X' W^{1/2}$ corresponding to
the hat matrix (P) in the linear model.
I Standardized Pearson residuals: $\tilde r_j = r_j / \sqrt{1 - h_{jj}}$
I Pregibon (1981) influence statistic:
\[
\Delta\hat\beta_j = (\hat\beta - \hat\beta_{-j})' X'WX (\hat\beta - \hat\beta_{-j}) = \frac{r_j^2 h_{jj}}{(1 - h_{jj})^2} = \frac{\tilde r_j^2 h_{jj}}{(1 - h_{jj})}
\]
. corresponds to $K \cdot C_j$ ($C_j$: Cook’s distance) in the linear model
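
The weighted-least-squares form of the iteration and the resulting diagnostics can be sketched in Python for a logit model with an intercept and a single regressor, where $(X'WX)^{-1}$ has a simple closed form. The data are hypothetical; every observation has its own covariate pattern ($n_j = 1$), and the fixed number of 25 iterations is an assumption of this sketch.

```python
import math

def Lambda(z):
    """Logistic cdf."""
    return 1.0 / (1.0 + math.exp(-z))

def weighted_sums(x, w):
    """Entries of X'WX for X = (1, x): sum w, sum w x, sum w x^2, and the determinant."""
    s_w = sum(w)
    s_wx = sum(wi * xi for wi, xi in zip(w, x))
    s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    return s_w, s_wx, s_wxx, s_w * s_wxx - s_wx ** 2

def irls_logit(x, y, n_iter=25):
    """Iteratively reweighted least squares: regress z on (1, x) with weights W."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = [Lambda(b0 + b1 * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Pseudo observations z = X beta + W^{-1} e
        z = [b0 + b1 * xi + (yi - pi) / wi for xi, yi, pi, wi in zip(x, y, p, w)]
        s_w, s_wx, s_wxx, det = weighted_sums(x, w)
        s_wz = sum(wi * zi for wi, zi in zip(w, z))
        s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
    return b0, b1

# Hypothetical sample, not perfectly separated.
x = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = irls_logit(x, y)
p_hat = [Lambda(b0 + b1 * xi) for xi in x]
w = [p * (1.0 - p) for p in p_hat]
s_w, s_wx, s_wxx, det = weighted_sums(x, w)

# Diagonal of H = W^{1/2} X (X'WX)^{-1} X' W^{1/2}
h = [wi * (s_wxx - 2.0 * s_wx * xi + s_w * xi * xi) / det for wi, xi in zip(w, x)]
r = [(yi - pi) / math.sqrt(pi * (1.0 - pi)) for yi, pi in zip(y, p_hat)]
r_std = [ri / math.sqrt(1.0 - hi) for ri, hi in zip(r, h)]
delta_beta = [rs ** 2 * hi / (1.0 - hi) for rs, hi in zip(r_std, h)]

for xi, hi, rs, db in zip(x, h, r_std, delta_beta):
    print(f"x = {xi:5.1f}  h = {hi:.3f}  std. residual = {rs:6.3f}  delta beta = {db:.3f}")
```

Since H is a projection matrix of rank K, the leverages should sum to K = 2 here, which gives a simple check of the computation.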


Emp. Example 1 (determinants of fertility)


I Logit Model: Probability of being childless as a function of time,
educ, sibs and white
I Definition of variables and estimation results: Subsections 5.2.1,
5.2.3
I N = 5150 observations, K = 5
I Hosmer-Lemeshow: $\chi^2(8) = 54.35$ (10 groups), p-value: 0
I Pearson $\chi^2$ goodness-of-fit statistic: $\chi^2(1603) = 1852.48$ (1608
covariate patterns), p-value: 0
I Pseudo $R^2$: $R_F^2 = 0.0098$
I Information criteria: AIC = 4221.922, BIC = 4254.656
I See R code for further model diagnostics.

5.3 Models for limited dependent variables


5.3.1 Introduction
Now, we consider regression problems in which
(i) the dependent variable of interest is not observed completely
(e.g. due to truncation or censoring),
or
(ii) the dependent variable is observed completely, but the chosen
sample is not representative for the population (e.g. because the
persons self-select into the sample).
⇒ The OLSE is inconsistent (even in case of a linear regression
function).


Truncation

I In case of truncation we do not have all observations, neither for
the dependent variable y nor for the regressors $x = (x_1, \ldots, x_K)'$.

I Focus is here on the case of a continuous dependent variable.

I Example 1: For the analysis of income equations we only have data
$(y_i, x_i)$ for persons with low income (say with income $y_i < a$ for
some threshold a; truncation from above).


I Example 2: When the relation between car prices $y_i$ and the
characteristics of buyers $x_i$ (age, income etc.) is studied, we often
only have data $(y_i, x_i)$, where the car price $y_i$ is not below some
minimal car price (truncation from below); the sample does not
contain data about persons/households, for which all available cars
are too expensive.

I The truncation effect must be taken into account: If, e.g. in
Example 2, we want to forecast the potential interest for a cheaper
new car, most potential buyers are not contained in the sample.


Latent variable model for truncated data

\[
y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim (0, \sigma^2) \text{ i.i.d.}
\]

I Model holds for the population.
[In Example 2 the population consists of potential and actual buyers.]
I Then, in case of truncation from below at a:
\[
y_i = y_i^* = x_i'\beta + \varepsilon_i \quad \text{if } y_i^* > a
\]
\[
x_i, y_i \text{ are not observed} \quad \text{if } y_i^* \leq a
\]

⇒ Sample is drawn from a restricted part of the population.


Graphical illustration of truncation effects


[Figure: two scatter plots, YSTAR vs. X (left panel) and Y vs. X (right panel)]

Figure 9: Truncation from below, truncation point a = 0


Censoring

I In the case of censoring, information is lost only about the
dependent variable, but not about the regressors.

I Example 1: The sample contains persons of all income classes, for
which data on the relevant characteristics are available. However, for
confidentiality reasons all incomes above a certain threshold a (e.g.
a = 100,000 euros) are top-coded (for these we only know that their
income is ≥ a; censoring from above).


I Example 2: For the analysis of durable goods (e.g. cars,
refrigerators) we have data for all relevant explanatory variables, but
purchases below a certain minimum value are recorded as 0.

I In contrast to truncated observations (which are missing from the
sample), censored observations are available for the analysis.

I Standard model for the analysis: Tobit Model (Tobin, 1958).


Latent variable model for censored data

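\[
y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim (0, \sigma^2) \text{ i.i.d.}
\]

I Model holds for the population.

I Observe, in case of censoring from below at a,
\[
y_i = y_i^* = x_i'\beta + \varepsilon_i \quad \text{if } y_i^* > a
\]
\[
y_i = a \quad \text{if } y_i^* \leq a
\]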


Graphical illustration of censoring effects

[Figure: two scatter plots, YSTAR vs. X (left panel) and Y vs. X (right panel)]

Figure 10: Censoring from below at a = 0


5.3.2 Truncation and censoring

I Truncation is a property of the distribution, censoring is a property
of the sample.

(I) Truncated distributions

I Let Y be a random variable (RV) and $a \in \mathbb{R}$ some given threshold.
I Then the conditional distribution of Y given Y > a (or given
Y < a) is called distribution of Y truncated from below/from the
left (or from above/from the right) by a.

. Y is only observed above or below the threshold.


CDF of truncated distribution

I Let F denote the cdf of Y, i.e. $F(y) = P(Y < y)$.

⇒ The distribution of Y truncated from below by a has the cdf
\[
F_a(y) = P(Y < y \mid Y > a) = \frac{P(a < Y < y)}{P(Y > a)}
= \begin{cases} \dfrac{F(y) - P(Y \leq a)}{1 - P(Y \leq a)} = \dfrac{F(y) - F(a) - P(Y = a)}{1 - F(a) - P(Y = a)}, & \text{if } y > a \\ 0, & \text{if } y \leq a \end{cases}
\]


Density / probability mass function under truncation

I If Y is a discrete RV, then the truncated (from below by a)
distribution is characterized by the probability mass function
\[
p_a(y) = P(Y = y \mid Y > a) = \begin{cases} \dfrac{P(Y = y)}{P(Y > a)}, & \text{if } y > a \\ 0, & \text{otherwise.} \end{cases}
\]
I For a continuous RV Y with density f it holds that $P(Y = a) = 0$ and
$f(y) = F'(y)$. Consequently, the density of the truncated (from
below by a) distribution is
\[
f_a(y) = \begin{cases} \dfrac{f(y)}{1 - F(a)}, & \text{if } y > a \\ 0, & \text{otherwise.} \end{cases}
\]


Example of truncated normal distribution


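I Let $Y \sim N(\mu, \sigma^2)$, and let $\phi(z)$ and $\Phi(z)$ denote the standard
normal pdf and cdf, respectively.
\[
\Rightarrow \quad f(y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2} \left(\frac{y - \mu}{\sigma}\right)^2 \right\} = \frac{1}{\sigma}\, \phi\left(\frac{y - \mu}{\sigma}\right)
\quad \text{and} \quad F(y) = \int_{-\infty}^{y} f(x)\, dx = \Phi\left(\frac{y - \mu}{\sigma}\right).
\]

⇒ Density when truncating from below by a:
\[
f_a(y) = \begin{cases} \dfrac{\frac{1}{\sigma}\, \phi\left(\frac{y - \mu}{\sigma}\right)}{1 - \Phi\left(\frac{a - \mu}{\sigma}\right)}, & \text{if } y > a \\ 0, & \text{otherwise.} \end{cases}
\]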
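
The truncated normal density can be coded directly from this formula. The Python sketch below uses the parameter values µ = 3, σ = 1, a = 2 and checks by a simple midpoint rule (an assumption of this sketch, as are the function names) that the truncated density integrates to one.

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_normal_pdf(y, mu, sigma, a):
    """Density of N(mu, sigma^2) truncated from below by a."""
    if y <= a:
        return 0.0
    return phi((y - mu) / sigma) / sigma / (1.0 - Phi((a - mu) / sigma))

def integrate(f, lo, hi, n=20000):
    """Midpoint rule on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

mu, sigma, a = 3.0, 1.0, 2.0

total_mass = integrate(lambda v: truncated_normal_pdf(v, mu, sigma, a), a, mu + 10.0 * sigma)
print(f"integral of truncated density: {total_mass:.8f}")
print(f"f_a(mu) = {truncated_normal_pdf(mu, mu, sigma, a):.4f} "
      f"vs. untruncated f(mu) = {phi(0.0) / sigma:.4f}")
```

Above the truncation point the truncated density is the original density scaled up by the constant factor $1 / (1 - \Phi((a - \mu)/\sigma))$.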


[Figure: densities f(y) of the normal and the truncated normal distribution, with a and µ marked on the horizontal axis]

Figure 11: Comparison of densities of $N(\mu, \sigma^2)$ and the corresponding
left-truncated distribution (µ = 3, σ = 1, a = 2).


Moments of truncated normal distribution


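Theorem 5.1. Let $Y \sim N(\mu, \sigma^2)$. Then:

(a1) $E(Y \mid Y > a) = \mu + \sigma\,\lambda\!\left( \frac{a - \mu}{\sigma} \right)$,
where $\lambda(z) = \frac{\phi(z)}{1 - \Phi(z)}$ is the hazard function of $N(0, 1)$.

(a2) $E(Y \mid Y < a) = \mu + \sigma\,\tilde\lambda\!\left( \frac{a - \mu}{\sigma} \right)$,
where $\tilde\lambda(z) = -\frac{\phi(z)}{\Phi(z)}$ is the negative "inverse Mills ratio".

(b1) $V(Y \mid Y > a) = \sigma^2 \left( 1 - \delta\!\left( \frac{a - \mu}{\sigma} \right) \right)$,
where $\delta(z) = \lambda(z)\,(\lambda(z) - z)$.

(b2) $V(Y \mid Y < a) = \sigma^2 \left( 1 - \tilde\delta\!\left( \frac{a - \mu}{\sigma} \right) \right)$,
where $\tilde\delta(z) = \tilde\lambda(z)\,(\tilde\lambda(z) - z)$.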
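The four formulas of Theorem 5.1 translate directly into code. The sketch below uses only the standard library; the function names are hypothetical. It evaluates the truncated mean and variance for truncation from below and from above.

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_moments_below(mu, sigma, a):
    """Mean and variance of Y | Y > a, parts (a1) and (b1)."""
    z = (a - mu) / sigma
    lam = phi(z) / (1.0 - Phi(z))       # hazard function lambda(z)
    dlt = lam * (lam - z)               # delta(z)
    return mu + sigma * lam, sigma ** 2 * (1.0 - dlt)

def truncated_moments_above(mu, sigma, a):
    """Mean and variance of Y | Y < a, parts (a2) and (b2)."""
    z = (a - mu) / sigma
    lam_t = -phi(z) / Phi(z)            # negative inverse Mills ratio
    dlt_t = lam_t * (lam_t - z)
    return mu + sigma * lam_t, sigma ** 2 * (1.0 - dlt_t)

# Example with the Figure 11 values; a = 4 mirrors a = 2 around mu = 3.
mean_b, var_b = truncated_moments_below(3.0, 1.0, 2.0)
mean_a, var_a = truncated_moments_above(3.0, 1.0, 4.0)
print(mean_b, var_b, mean_a, var_a)
```

With thresholds placed symmetrically around $\mu$, the two means are mirror images of each other and the two variances coincide, which is a quick consistency check on the signs in (a2) and (b2).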


Remarks
(i) We always have ($\forall z$): $0 < \delta(z) < 1$ and $0 < \tilde\delta(z) < 1$.
(ii) $\lambda(-z) = \dfrac{\phi(-z)}{1 - \Phi(-z)} = \dfrac{\phi(z)}{\Phi(z)} = -\tilde\lambda(z)$.
(iii) Truncation reduces the variance.
(iv) Truncation from below (above) increases (reduces) the expectation.
(v) For $a = 0$ it follows:
$$
E(Y \mid Y > 0) = \mu + \sigma\,\frac{\phi(\mu/\sigma)}{\Phi(\mu/\sigma)},
$$
where $\lambda(-z) = -\tilde\lambda(z) = \dfrac{\phi(z)}{\Phi(z)}$ is the "inverse Mills ratio".
(vi) (a2) follows from (a1), since $-Y \sim N(-\mu, \sigma^2)$ and $E(Y \mid Y < a) = -E(-Y \mid -Y > -a)$.
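Remark (vi) can be written out in one line. The following worked step is a sketch that applies (a1) to $-Y \sim N(-\mu, \sigma^2)$ with threshold $-a$; the shorthand $z = (a - \mu)/\sigma$ is introduced here only for readability.

$$
E(Y \mid Y < a) = -E(-Y \mid -Y > -a)
= -\left[ -\mu + \sigma\,\lambda\!\left( \frac{-a - (-\mu)}{\sigma} \right) \right]
= \mu - \sigma\,\lambda(-z)
= \mu - \sigma\,\frac{\phi(z)}{\Phi(z)}
= \mu + \sigma\,\tilde\lambda(z).
$$

The second-to-last equality uses remark (ii).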

(II) Censored data


I Censored data are often described using latent variables.
I Let $Y^*$ be a RV. Then the corresponding variable censored from below/from the left by $a$ is given by

$$
Y = \max\{Y^*, a\} = I(Y^* > a)\,Y^* + I(Y^* \le a)\,a =
\begin{cases}
Y^*, & \text{if } Y^* > a, \\
a, & \text{if } Y^* \le a,
\end{cases}
$$

and the variable censored from above/from the right by $a$ is given by

$$
Y = \min\{Y^*, a\} =
\begin{cases}
Y^*, & \text{if } Y^* < a, \\
a, & \text{else.}
\end{cases}
$$
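The difference between censoring and truncation is easy to see on simulated data. The sketch below is illustrative only: the seed, the sample size and the variable names are arbitrary assumptions. It draws a latent normal sample and builds both a sample censored from below and a truncated one.

```python
import random

random.seed(0)                     # arbitrary seed, for reproducibility only
mu, sigma, a = 3.0, 1.0, 2.0
latent = [random.gauss(mu, sigma) for _ in range(10000)]

# Censoring from below: every draw is kept, values at or below a are recorded as a.
censored = [max(y_star, a) for y_star in latent]

# Truncation from below: draws at or below a are not observed at all.
truncated = [y_star for y_star in latent if y_star > a]

# Share of observations piled up at the threshold in the censored sample.
share_at_a = sum(1 for y in censored if y == a) / len(censored)
print(len(censored), len(truncated), share_at_a)
```

The censored sample has the same size as the latent sample and a point mass at $a$, whereas the truncated sample is smaller and contains no information about how many draws were lost.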


Example

I Let $Y_i^*$ denote the price that an individual $i$ is willing to pay for a good (e.g. a refrigerator).

I We observe $Y_i = Y_i^*$ if some lower threshold $a$ is exceeded. Otherwise, we observe $Y_i = a$ (if $Y_i^* \le a$).

I Typically $a = 0$, which is not a restriction if we use a linear model with intercept for $Y^*$.


Distribution of censored normal RV


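I If $Y^* \sim N(\mu, \sigma^2)$, then the distribution of $Y = \max\{Y^*, a\}$ is mixed continuous-discrete.
I The value $y = a$ is attained with positive probability:
$$
P(Y = a) = P(Y^* \le a) = F_{Y^*}(a) = \Phi\!\left( \frac{a - \mu}{\sigma} \right).
$$
I For $y > a$, the density of $Y$ is:
$$
f(y) = \frac{1}{\sigma}\,\phi\!\left( \frac{y - \mu}{\sigma} \right).
$$
I Cdf of $Y$:
$$
F(y) = P(Y < y) =
\begin{cases}
\Phi\!\left( \frac{y - \mu}{\sigma} \right), & \text{if } y > a, \\
0, & \text{otherwise.}
\end{cases}
$$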


[Figure: two panels plotted against y, with a and µ marked on the horizontal axis and the level Φ((a − µ)/σ) marked on the vertical axis; left panel f(y), right panel F(y).]

Figure 12: Distribution of $Y = \max\{Y^*, a\}$, where $Y^* \sim N(\mu, \sigma^2)$ (censored normally distributed RV), with $\mu = 3$, $\sigma = 1$, $a = 2$.
Left: Density / probability mass at point $a$; Right: cdf.


Moments of censored normal RVs


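Theorem 5.2. Let $Y^* \sim N(\mu, \sigma^2)$ and $Y = \max\{Y^*, a\}$ be the RV censored from below by $a$. Then:

(a) $E(Y) = \Phi\!\left( \frac{a - \mu}{\sigma} \right) a + \left[ 1 - \Phi\!\left( \frac{a - \mu}{\sigma} \right) \right] \left[ \mu + \sigma\,\lambda\!\left( \frac{a - \mu}{\sigma} \right) \right]$

(b) $V(Y) = \sigma^2 \left[ 1 - \Phi\!\left( \frac{a - \mu}{\sigma} \right) \right] \cdot \left[ 1 - \delta\!\left( \frac{a - \mu}{\sigma} \right) + \left( \frac{a - \mu}{\sigma} - \lambda\!\left( \frac{a - \mu}{\sigma} \right) \right)^2 \Phi\!\left( \frac{a - \mu}{\sigma} \right) \right]$

I Remarks:
(i) $\lambda(z)$ and $\delta(z)$ are explained in Theorem 5.1.
(ii) For $a = 0$ it follows $E(Y) = \Phi(\mu/\sigma) \cdot \mu + \sigma\,\phi(\mu/\sigma)$.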
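The moments in Theorem 5.2 can be compared with a simulation. The following sketch uses only the standard library; the function name, seed and sample size are assumptions made for illustration. It implements (a) and (b) and computes the corresponding sample moments of a simulated censored variable.

```python
import math
import random

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def censored_normal_moments(mu, sigma, a):
    """Mean and variance of Y = max(Y*, a) with Y* ~ N(mu, sigma^2)."""
    z = (a - mu) / sigma
    lam = phi(z) / (1.0 - Phi(z))
    dlt = lam * (lam - z)
    mean = Phi(z) * a + (1.0 - Phi(z)) * (mu + sigma * lam)
    var = sigma ** 2 * (1.0 - Phi(z)) * ((1.0 - dlt) + (z - lam) ** 2 * Phi(z))
    return mean, var

# Monte Carlo comparison with the Figure 12 values.
random.seed(1)
mu, sigma, a = 3.0, 1.0, 2.0
draws = [max(random.gauss(mu, sigma), a) for _ in range(200000)]
sim_mean = sum(draws) / len(draws)
sim_var = sum((d - sim_mean) ** 2 for d in draws) / len(draws)
mean, var = censored_normal_moments(mu, sigma, a)
print(mean, sim_mean, var, sim_var)
```

Part (b) is the usual variance decomposition: the within-group variance of the uncensored part plus the between-group variance of the two conditional means $a$ and $\mu + \sigma\lambda$.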


5.3.3 The truncated regression model (truncated tobit model)

I Regression model without truncation:

$$
y_i = x_i'\beta + \varepsilon_i, \quad i = 1, \ldots, N
$$

. Under the assumption $\varepsilon_i \mid x_i \sim N(0, \sigma^2)$ i.i.d. it follows:

$$
E(y_i \mid x_i) = x_i'\beta \quad\text{and}\quad V(y_i \mid x_i) = V(\varepsilon_i \mid x_i) = \sigma^2 = V(\varepsilon_i),
$$

i.e.: $y_i \mid x_i \sim N(x_i'\beta, \sigma^2)$ independent.


I Truncated regression model:
We investigate the dependence of the conditional expectation of $y_i$ given $x_i$ under the condition $y_i > a$. It follows from Theorem 5.1:

$$
E(y_i \mid x_i; y_i > a) = x_i'\beta + \sigma\,\lambda\!\left( \frac{a - x_i'\beta}{\sigma} \right),
\qquad
V(y_i \mid x_i; y_i > a) = \sigma^2 \left[ 1 - \delta\!\left( \frac{a - x_i'\beta}{\sigma} \right) \right].
$$

⇒ Moments are shifted compared to the model without truncation.

I Truncation induces a nonlinear conditional expectation and heteroscedasticity.


Marginal effects

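I Without truncation (i.e. in the latent variable model):

$$
\frac{\partial E(y_i \mid x_i)}{\partial x_{ij}} = \beta_j
$$

I Under truncation:

$$
\begin{aligned}
\frac{\partial E(y_i \mid x_i; y_i > a)}{\partial x_{ij}}
&= \beta_j + \sigma\,\frac{\partial}{\partial x_{ij}}\,\lambda\!\left( \frac{a - x_i'\beta}{\sigma} \right) \\
&= \beta_j + \sigma\,\delta\!\left( \frac{a - x_i'\beta}{\sigma} \right) \cdot \left( -\frac{\beta_j}{\sigma} \right) \\
&= \beta_j \left[ 1 - \delta\!\left( \frac{a - x_i'\beta}{\sigma} \right) \right]
\end{aligned}
$$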


I Here we have used the following result for the derivative of $\lambda(z)$:
$$
\lambda'(z) = \frac{\phi(z)}{1 - \Phi(z)} \cdot \left( \frac{\phi(z)}{1 - \Phi(z)} - z \right) = \delta(z).
$$
I Truncation shrinks the marginal effect relative to $\beta_j$, since $0 < \delta(z) < 1$.
⇒ Correction of the truncation effect is necessary!
I For interpretation, we calculate average values of these effects (over the individuals).
I The relative effect of the $j$-th and the $k$-th explanatory variable remains $\beta_j / \beta_k$, since the shrinking factors for $\beta_j$ and $\beta_k$ are equal.
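The identity $\lambda'(z) = \delta(z)$ and the resulting shrinking factor $1 - \delta$ can be verified with a finite difference. The sketch below is illustrative; the function names, the evaluation point and the step size are assumptions. It differentiates the truncated conditional mean with respect to the index $x_i'\beta$ numerically and compares the result with the analytic factor.

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def hazard(z):
    return phi(z) / (1.0 - Phi(z))

def delta(z):
    lam = hazard(z)
    return lam * (lam - z)

def cond_mean_truncated(xb, sigma, a=0.0):
    """E(y | x; y > a) as a function of the index xb = x'beta."""
    return xb + sigma * hazard((a - xb) / sigma)

def shrink_factor(xb, sigma, a=0.0):
    """Analytic factor 1 - delta((a - xb)/sigma) multiplying beta_j."""
    return 1.0 - delta((a - xb) / sigma)

xb, sigma, h = 1.0, 2.0, 1e-5   # arbitrary evaluation point and step size
numeric = (cond_mean_truncated(xb + h, sigma) - cond_mean_truncated(xb - h, sigma)) / (2.0 * h)
analytic = shrink_factor(xb, sigma)
print(numeric, analytic)
```

Multiplying the factor by $\beta_j$ gives the marginal effect of $x_{ij}$ for one individual; averaging the factor over all $x_i$ in the sample gives the average effect mentioned above.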


Parameter estimation

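I Without loss of generality (due to the intercept): $a = 0$.

I The linear OLSE is inconsistent, since it does not account for the "truncation correction" and the bias does not vanish asymptotically (cf. lecture).

I The "convenient" estimation equation would be:

$$
y_i = x_i'\beta + \sigma\,\lambda\!\left( \frac{-x_i'\beta}{\sigma} \right) + \varepsilon_i,
$$

where $E(\varepsilon_i \mid x_i; y_i > 0) = 0$ holds (here $\varepsilon_i$ denotes the deviation of $y_i$ from its conditional mean in the truncated sample).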


⇒ Nonlinear OLS estimation:

$$
\sum_{i=1}^{N} \left[ y_i - x_i'\beta - \sigma\,\lambda\!\left( \frac{-x_i'\beta}{\sigma} \right) \right]^2 \to \min_{\beta, \sigma}.
$$

. For this nonlinear minimization problem, we can use e.g. the Newton-Raphson method.
. This estimator is consistent, albeit not efficient, since heteroscedasticity is ignored (cf. Theorem 5.1b). For this reason, this estimator is rarely used in practice.
. Additionally necessary: correct specification of the conditional expectation (normal distribution and homoscedastic errors).
. Since $\lambda(-x'\beta/\sigma)$ might be almost linear in $x'\beta$, we possibly have a multicollinearity problem and thus imprecise estimators.


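I Maximum likelihood estimation

. Data: $y_i, x_i$, given $y_i > a = 0$, $i = 1, \ldots, N$.

. Likelihood function of the truncated distribution (notation without the condition $x_i$, i.e. $f(y_i)$ denotes the conditional density of $y_i$ given $x_i$, and $f_a(y_i)$ accordingly):

$$
L(\beta, \sigma^2) = \prod_{i=1}^{N} f_a(y_i) = \prod_{i=1}^{N} \frac{f(y_i)}{1 - F(0)}
= \prod_{i=1}^{N} \frac{\frac{1}{\sigma}\,\phi\!\left( \frac{y_i - x_i'\beta}{\sigma} \right)}{1 - \Phi\!\left( \frac{-x_i'\beta}{\sigma} \right)}
= \prod_{i=1}^{N} \frac{\phi\!\left( \frac{y_i - x_i'\beta}{\sigma} \right)}{\sigma\,\Phi\!\left( \frac{x_i'\beta}{\sigma} \right)}
$$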


⇒ Log-likelihood (note: $\phi(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}$)

$$
\ell(\beta, \sigma^2) = \ln L(\beta, \sigma^2)
= -\frac{N}{2} \ln(\sigma^2) - \frac{N}{2} \ln(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - x_i'\beta)^2 - \sum_{i=1}^{N} \ln\left[ \Phi(x_i'\beta/\sigma) \right]
$$

. The maximization of $\ell(\beta, \sigma^2)$ is a nonlinear problem and requires numerical methods.
. Consistency, asymptotic normality and asymptotic efficiency of the MLE hold, as long as $\varepsilon_i \mid x_i \sim N(0, \sigma^2)$ i.i.d. can be assumed.
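The log-likelihood of the truncated regression model (with $a = 0$) can be coded term by term. The sketch below uses only the standard library; the function name and the toy data are hypothetical and serve only to show the structure. In practice the function would be handed to a numerical optimizer.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_loglik(beta, sigma, X, y):
    """Log-likelihood of the regression model truncated from below at 0.

    X is a list of regressor rows (including the constant), y the observed
    responses, all of which must satisfy y_i > 0.
    """
    ll = 0.0
    for xi, yi in zip(X, y):
        xb = sum(b * v for b, v in zip(beta, xi))
        ll += (-math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
               - 0.5 * ((yi - xb) / sigma) ** 2
               - math.log(Phi(xb / sigma)))   # truncation correction term
    return ll

# Toy data (made up for illustration): constant plus one regressor.
X = [[1.0, 0.5], [1.0, 2.0]]
y = [0.7, 1.9]
beta, sigma = [0.2, 0.8], 1.3
ll = truncated_loglik(beta, sigma, X, y)
print(ll)
```

The last term, $-\ln \Phi(x_i'\beta/\sigma)$, is the only difference from the Gaussian log-likelihood of the untruncated model; it is always positive, so ignoring it understates the likelihood of the observed sample.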


5.3.4 Regression with censored data (tobit model)
I Assumption: Structural equation for the latent variable:
$$
y_i^* = x_i'\beta + \varepsilon_i, \quad \varepsilon_i \mid x_i \sim N(0, \sigma^2) \text{ i.i.d.}, \quad i = 1, \ldots, N
$$

I Model with censoring from below (w.l.o.g. $a = 0$):

$$
y_i =
\begin{cases}
0, & \text{if } y_i^* \le 0, \\
x_i'\beta + \varepsilon_i, & \text{if } y_i^* > 0.
\end{cases}
$$

. The constant $a$ does not affect the estimation of the slope coefficients in $\beta$, since it is absorbed by the intercept.
. Alternative: individual thresholds $a_i$ (no problem if $a_i$ is known).
. Censoring from above is treated in an analogous way.

Marginal effects

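I Two parts ($\beta_j > 0$):

(i) $y_i = 0$, $x_{ij} \uparrow$ ⇒ $P(y_i > 0 \mid x_i) \uparrow$ (obvious)

(ii) $y_i > 0$, $x_{ij} \uparrow$ ⇒ $E(y_i \mid x_i) \uparrow$

. Formally, Theorem 5.2 (Remark (ii)) provides with $a = 0$ and $\mu = x_i'\beta$:

$$
E(y_i \mid x_i) = \Phi\!\left( \frac{x_i'\beta}{\sigma} \right) x_i'\beta + \sigma\,\phi\!\left( \frac{x_i'\beta}{\sigma} \right)
\quad\Rightarrow\quad
\frac{\partial E(y_i \mid x_i)}{\partial x_{ij}} = \Phi\!\left( \frac{x_i'\beta}{\sigma} \right) \cdot \beta_j \quad (> 0 \text{ for } \beta_j > 0).
$$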
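The marginal effect follows from the product and chain rules together with $\phi'(z) = -z\,\phi(z)$. The following worked step is a sketch; the shorthand $z_i = x_i'\beta/\sigma$ is introduced here only for readability.

$$
\begin{aligned}
\frac{\partial}{\partial x_{ij}} \left[ \Phi(z_i)\,x_i'\beta + \sigma\,\phi(z_i) \right]
&= \phi(z_i)\,\frac{\beta_j}{\sigma}\,x_i'\beta + \Phi(z_i)\,\beta_j + \sigma\,\phi'(z_i)\,\frac{\beta_j}{\sigma} \\
&= z_i\,\phi(z_i)\,\beta_j + \Phi(z_i)\,\beta_j - z_i\,\phi(z_i)\,\beta_j \\
&= \Phi(z_i)\,\beta_j.
\end{aligned}
$$

The two terms involving $\phi(z_i)$ cancel, so the effect on the censored mean is $\beta_j$ scaled by the probability of being uncensored.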


I The difference to $\beta_j$ is small (large), if $\frac{x_i'\beta}{\sigma}$ is large (small).

I This is not surprising, since for large $\frac{x_i'\beta}{\sigma}$ also $y_i^*$ will be large, so censoring occurs only rarely.

I On the other hand, if $\frac{x_i'\beta}{\sigma}$ is small, we mostly get $y_i = 0$ and therefore large probabilities $P(y_i = 0 \mid x_i)$.


Parameter estimation
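I Linear OLS is based on

$$
y_i = x_i'\beta + \eta_i \quad (i = 1, \ldots, N).
$$

(a) OLS for observations with positive $y_i$ (truncated model)

⇒ $\eta_i = \varepsilon_i + \sigma\lambda_i$ with
$$
\lambda_i = \lambda\!\left( \frac{-x_i'\beta}{\sigma} \right) = \frac{\phi(x_i'\beta/\sigma)}{\Phi(x_i'\beta/\sigma)}
$$
⇒ $E(\eta_i \mid x_i; y_i > 0) \ne 0$

⇒ OLSE is biased and inconsistent!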


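(b) In case of censored data ($y_i = 0$ for $y_i^* \le 0$):

$$
E(y_i \mid x_i) = \underbrace{\Phi\!\left( \frac{x_i'\beta}{\sigma} \right)}_{=: \Phi_i} x_i'\beta + \sigma \underbrace{\phi\!\left( \frac{x_i'\beta}{\sigma} \right)}_{=: \phi_i}
$$

⇒ $E(\eta_i \mid x_i) = (\Phi_i - 1)\,x_i'\beta + \sigma\phi_i \ne 0$ (in general)

⇒ $E(\hat\beta_{OLS} \mid X) = \beta + (X'X)^{-1} X' \underbrace{E(\eta \mid X)}_{\ne 0}$

⇒ Inconsistency of OLS estimator!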


I Nonlinear OLS takes into account that $E(\varepsilon_i \mid x_i) = 0$, where $\varepsilon_i := y_i - \Phi_i\,x_i'\beta - \sigma\phi_i$:

$$
\sum_i \left( y_i - \Phi_i\,x_i'\beta - \sigma\phi_i \right)^2 \to \min_{\beta, \sigma}
$$

⇒ Consistent estimation that is not asymptotically efficient because of heteroscedasticity!
⇒ Rarely applied.


I ML estimation:

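. Let $d_i = I(y_i^* > 0) = I(y_i > 0)$ ⇒ $y_i = d_i\,y_i^*$

$I_0 = \{ i \in \{1, \ldots, N\} : d_i = 0 \}$, $|I_0| =: N_0$

$I_1 = \{ i \in \{1, \ldots, N\} : d_i = 1 \} = \{1, \ldots, N\} \setminus I_0$, $|I_1| =: N_1 = N - N_0$

$$
\Rightarrow\quad f(y_i \mid x_i) = \left[ \Phi\!\left( -\frac{x_i'\beta}{\sigma} \right) \right]^{1 - d_i} \cdot \left[ \frac{1}{\sigma}\,\phi\!\left( \frac{y_i - x_i'\beta}{\sigma} \right) \right]^{d_i}
$$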


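⇒ Likelihood function:

$$
\begin{aligned}
L(\beta, \sigma^2) &= \prod_{i=1}^{N} f(y_i \mid x_i) \\
&= \prod_{i=1}^{N} \left[ 1 - \Phi\!\left( \frac{x_i'\beta}{\sigma} \right) \right]^{1 - d_i} \cdot \left[ \frac{1}{\sigma}\,\phi\!\left( \frac{y_i - x_i'\beta}{\sigma} \right) \right]^{d_i} \\
&= \prod_{i \in I_0} \left[ 1 - \Phi\!\left( \frac{x_i'\beta}{\sigma} \right) \right] \cdot \prod_{i \in I_1} \frac{1}{\sigma}\,\phi\!\left( \frac{y_i - x_i'\beta}{\sigma} \right)
\end{aligned}
$$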


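⇒ Log-likelihood function:

$$
\begin{aligned}
\ell(\beta, \sigma^2) &= \sum_{i \in I_0} \ln\left[ 1 - \Phi\!\left( \frac{x_i'\beta}{\sigma} \right) \right] - \frac{1}{2} \sum_{i \in I_1} \left[ \ln(2\pi) + \ln(\sigma^2) + \frac{(y_i - x_i'\beta)^2}{\sigma^2} \right] \\
&= \sum_{i=1}^{N} (1 - d_i) \ln(1 - \Phi_i) - \sum_{i=1}^{N} \frac{d_i}{2} \left[ \ln(2\pi) + \ln(\sigma^2) + \frac{(y_i - x_i'\beta)^2}{\sigma^2} \right]
\end{aligned}
$$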
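The two-part structure of the tobit log-likelihood maps onto a simple branch in code. The sketch below uses only the standard library; the function name and the toy data are hypothetical. Censored observations contribute $\ln(1 - \Phi_i)$ and uncensored observations contribute the Gaussian log-density.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(beta, sigma, X, y):
    """Log-likelihood of the tobit model with censoring from below at 0."""
    ll = 0.0
    for xi, yi in zip(X, y):
        xb = sum(b * v for b, v in zip(beta, xi))
        if yi > 0:
            # d_i = 1: continuous part, density of N(x'beta, sigma^2) at y_i
            ll -= 0.5 * (math.log(2.0 * math.pi) + math.log(sigma ** 2)
                         + (yi - xb) ** 2 / sigma ** 2)
        else:
            # d_i = 0: discrete part, probability of censoring.
            # Note: 1 - Phi(.) can underflow for very large xb / sigma.
            ll += math.log(1.0 - Phi(xb / sigma))
    return ll

# Toy data (made up for illustration); the second observation is censored.
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
y = [0.7, 0.0, 1.9]
beta, sigma = [0.2, 0.8], 1.3
ll = tobit_loglik(beta, sigma, X, y)
print(ll)
```

Unlike the truncated case, every observation enters the sum, so the regressors of censored individuals still carry information about $\beta$ and $\sigma$.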


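⇒ Likelihood equations:

$$
\frac{\partial \ell}{\partial \beta} = \frac{1}{\sigma^2} \sum_{i=1}^{N} \left[ -(1 - d_i)\,\frac{\sigma\phi_i}{1 - \Phi_i} + d_i\,(y_i - x_i'\beta) \right] x_i = 0,
$$

$$
\frac{\partial \ell}{\partial \sigma^2} = \sum_{i=1}^{N} \left[ (1 - d_i)\,\frac{\phi_i\,x_i'\beta}{2\sigma^3 (1 - \Phi_i)} + \frac{d_i}{2} \left( \frac{(y_i - x_i'\beta)^2}{\sigma^4} - \frac{1}{\sigma^2} \right) \right] = 0.
$$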


I The maximization of $\ell(\beta, \sigma^2)$ (or solving the likelihood equations with respect to $\beta$ and $\sigma^2$) yields the MLE and requires numerical methods (e.g. Newton-Raphson).

I The log-likelihood function $\ell$ is globally concave as a function of $\beta^* = \frac{\beta}{\sigma}$ and $\sigma^* = \frac{1}{\sigma}$!

I The MLE is consistent, asymptotically normal and asymptotically efficient, if $\varepsilon_i \mid x_i \sim N(0, \sigma^2)$ i.i.d., i.e., if the model is correctly specified.


Two-step estimation (Heckman)

I Idea: censored data are a combination of a binary dependent variable

$$
\tilde y_i =
\begin{cases}
0, & \text{if } y_i = 0, \\
1, & \text{if } y_i > 0,
\end{cases}
$$

followed by a linear relation for the truncated sample ($y_i > 0$):

$$
y_i = x_i'\beta + \varepsilon_i.
$$


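I Tobit Model ⇒

$$
\begin{aligned}
P(\tilde y_i = 1 \mid x_i) &= P(y_i > 0 \mid x_i) = 1 - \Phi\!\left( -\frac{x_i'\beta}{\sigma} \right) = \Phi\!\left( \frac{x_i'\beta}{\sigma} \right), \\
P(\tilde y_i = 0 \mid x_i) &= P(y_i = 0 \mid x_i) = 1 - \Phi\!\left( \frac{x_i'\beta}{\sigma} \right).
\end{aligned}
$$

I 1st step:
Estimate $\gamma = \beta/\sigma$ using ML in the probit model:

$$
P(\tilde y_i = 1 \mid x_i) = \Phi(x_i'\gamma).
$$

⇒ Consistent estimator $\hat\gamma$ of $\gamma$ (i.e., the bias correction term is estimated using a probit model).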


I 2nd step: We consider the truncated sample with $y_i > 0$:

$$
E(y_i \mid x_i, y_i > 0) = x_i'\beta + \sigma \underbrace{\frac{\phi(x_i'\gamma)}{\Phi(x_i'\gamma)}}_{= \lambda(-x_i'\gamma) =: \lambda_i}
$$

. Replace $\lambda_i$ by $\hat\lambda_i = \lambda(-x_i'\hat\gamma)$ and regress $y_i$ on $x_i'\beta + \sigma\hat\lambda_i$ (for $y_i > 0$).

⇒ Consistent estimators of $\beta$ and $\sigma$ (these are OLS estimators in a linear model with the estimated bias correction term $\hat\lambda_i$ as an additional regressor for the truncated sample).


I The two-step method of Heckman is easier than MLE, but not efficient.

I However, the resulting estimators can serve as initial values for an iterative method to determine the MLE.


Consequences if assumptions are not satisfied


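(i) $\varepsilon_i$ is heteroscedastic. ⇒ MLE is inconsistent.

$H_1: \sigma_i^2 = \exp(x_i'\alpha)$
⇒ $H_0: \alpha_2 = \ldots = \alpha_K = 0$
. LM test requires only calculation of MLE under $H_0$:
$$
\left. \frac{\partial \ell}{\partial \theta'} \right|_{\hat\theta_H}
\underbrace{\left( - \left. \frac{\partial^2 \ell}{\partial \theta\,\partial \theta'} \right|_{\hat\theta_H} \right)^{-1}}_{= I(\hat\theta_H)^{-1}}
\left. \frac{\partial \ell}{\partial \theta} \right|_{\hat\theta_H}
\;\overset{as.}{\sim}\; \chi^2_{K-1} \quad \text{under } H_0.
$$

. LR test requires MLE under $H_0$ and $H_1$.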


(ii) $\varepsilon_i$ is not normally distributed. ⇒ MLE is inconsistent.

I Test idea, e.g. in the i.i.d. case: $y_i^* \sim N(\mu, \sigma^2)$ i.i.d.

. Estimate $P(y_i^* > 0)$
(a) by $\Phi(\hat\mu / \hat\sigma)$ (i.e. under the assumption of normality), and
(b) by the frequency of uncensored observations ($N_1 / N$).

. Compare (a) and (b) using a Hausman statistic (Nelson test).


(iii) $\varepsilon_i$ is autocorrelated. ⇒ No problem for consistency!

I However, autocorrelation cannot be neglected for inferential purposes.

. The construction of asymptotically justified tests or confidence regions requires a consistent estimation of the covariance matrix of the estimator to get robust (correct) standard errors.


Remarks

I Generalizations, e.g. twice censored Tobit model.

I Dependent variable is not completely observable: partially continuous, partially censored.

I In contrast to the truncated model, the explanatory variables are available for all observations.


I It is important to use all available information:
. 0/1 parts and continuous outcome.
⇒ Likelihood has two parts (probit part and linear OLS part).
I Both parts are determined by the same x 0 β.
. This restriction is not always plausible, since 0/1 decisions can
have different determinants (with different coefficients)
compared to the metric outcomes.
I Such relations between a binary (0/1) variable and a continuous
variable can generally be treated by
. assuming a parameter vector β(1) for the 0/1 decision, and β(2)
for the continuous part.
⇒ Hypothesis of identical parameters is testable:
H0 : β(1) = β(2) .


Empirical Example

To be added.


5.A Literature
I Amemiya, T. (1985). Advanced Econometrics. Harvard University Press.
Cambridge, Ma.
I Cameron, A. C. and Trivedi, P. K. (2005). Microeconometrics - Methods
and Applications. Cambridge University Press.
I Heij, C.; de Boer, P.; Franses, P. H.; Kloek, T. and van Dijk, H. K.
(2004). Econometric Methods with Applications in Business and
Economics. Oxford University Press.
I McCullagh, P. and Nelder, J. A. (1983). Generalized Linear Models.
Chapman and Hall, London.
I Nelson, F. D. (1977). Censored Regression Models with Unobserved,
Stochastic Censoring Threshold. Journal of Econometrics 6, 309-327.
I Nelson, F. D. (1981). A Test for Misspecification in the Censored Normal
Model. Econometrica 49, 1317-1329.
