Chapter 5
Models for qualitative and limited dependent variables
Applied Econometrics
Winter Term 2020/21
Prof. Dr. Simone Maxand
Humboldt University Berlin
Contents I
5.1 Introduction
5.2 Binary response models
5.2.1 Introduction & model formulation
5.2.2 Probit and logit models
5.2.3 Maximum likelihood estimation
5.2.4 Model diagnostics
5.3 Limited dependent variables
5.3.1 Introduction
5.3.2 Truncation and censoring
5.3.3 Truncated regression model
5.3.4 Censored regression model
5.A Literature
5.1 Introduction
What is Microeconometrics?
I Analysis of individual data, i.e. data concerning the behaviour and
attitudes of persons, households or firms.
. Econometric methods to study microeconomic phenomena.
. The underlying model is typically a microeconomic model
where individual decisions and behavior are a function of
exogenous parameters.
I Typical questions: How do individual characteristics affect
. decision to work (or to buy a new product)?
. choice of travel mode (train, bus, car, bike)?
. household purchases of durable goods?
. number of hours worked?
. number of children?
. duration of unemployment?
The classical linear regression model might need to be modified for
analyzing individual data, because
I the dependent variables are often non-continuous (qualitative, e.g.
binary),
I the dependent variables may be limited, e.g.,
. wages of workers are non-negative,
. the monthly household expenditures for a certain consumer
good may be described by a positive continuous variable which
coexists with a discrete cluster of observations at zero, or
I the data may be subject to systematic sample selection since
. the data are non-experimental (“observational”), i.e. they are
not “generated by a random experiment”; instead they are
collected from surveys and administration records.

Types of variables / data

I Basic distinction: Quantitative (metric) vs. qualitative (categorical,
non-metric) variables
I A quantitative variable may be continuous or discrete.
. One can “measure” the size of the difference between any two
variable values, so that it is sometimes called metric variable.
. Examples: (log)salary, or age in years
. Often, except for count data, the distinction between discrete
and continuous data doesn’t matter (usually one considers the
real line as the variable’s support).
⇒ The classical linear model may be applied in case of a quantitative
dependent variable (except for count data) provided that
. it is not limited (truncated or censored), and
. the observations form a random sample from the population.
I Qualitative variables have a finite number of mutually exclusive
categories, which can be
. non-ordered (nominal - binary or multinomial - variable), or
. ordered (ordinal variable).
I A binary (dichotomous) variable has two possible outcomes; it
indicates whether a certain property is present or not, e.g.:
. Has a credit application been approved (yes/no)?
. Is a person’s willingness-to-pay greater than the asking price
(yes/no)?
I A multinomial variable has three or more possible outcomes
(non-ordered categories), e.g.:
. Type of health insurance (state, private, or none)
. Employment status (full-time, part-time, unemployed, or not in
labor force)

I An ordered variable has categories which are ordered, but differences
between the categories are “not defined”, e.g.:
. Self-assessed health satisfaction in GSOEP with scale from 0
(totally dissatisfied) to 10 (totally satisfied).
In this course:
I Binary dependent variables: Only two possible values.
. E.g. decision of employees to take on a job.
I Limited dependent variables: Values of variables are limited, e.g. on
an interval.
. Example: Duration of unemployment is censored from below
at 0, where 0 possesses a positive probability.
⇒ Models for truncated & censored data

Figure 1: Example of binary dependent variable from HU student survey
(scatter plot; Y: BVG ticket, X: INC)

Censored dependent variable

I yi takes on values in a limited range.
I Often: yi takes on zero for significant fraction of observations.
I Examples:
. Household purchases of durable good
. Number of hours worked
I Problem: Modeling the conditional expectation of yi given xi by a
linear model may be misleading since
. the linear relation may hold in the population, but observed
sample should not be considered as representative!
I Solution: Censored regression model (e.g. Tobit model)

Figure 2: Limited dependent variable from HU student survey
(scatter plot; Y: RENT, X: INC)

Figure 3: Limited dependent variable (see Greene (2003), Ex. 22.8)
(scatter plot; Y: wage of wives, X: log(family income))

Truncated data
I Data for (xi , yi ) are not available if yi is above or below a certain
threshold.
I That is, some observations have been systematically excluded from
the sample.
I Example: Sample of (data on) households with income below
100,000 $
. The sample necessarily excludes all households with income
above that level. ⇒ No random sample of all households.
I Using truncated sample for investigating relationship between y and
x is potentially misleading (when using a linear model).
I Solution: Truncated regression model

Common aspects of models

I Often, interest is in the conditional probability distribution instead
of the conditional expectation (as in case of a linear model).
I Question: What is the ceteris paribus effect of a change in one
explanatory variable on the entire distribution of the dependent
variable?
I Here: Parametric approach, i.e. we assume that the (conditional)
distribution of the dependent variable is known up to a
finite-dimensional parameter which in turn is specified as a
(parametric) function of the explanatory variables.
I Assuming (cross-sectional) independence, the parameters of the
model are estimated by maximum likelihood (instead of least
squares in the linear regression model).

Empirical example 1: Determinants of fertility

I Ref.: Winkelmann & Boes
I Individual fertility decisions (no. of children born to a woman)
depend on
. labor market opportunities and thus on education,
. social norms and values,
. marital status,
. health, etc.
I Focus in empirical studies is on women’s education:
If higher education of women leads to fewer children per woman,
then we have
. an explanation for fertility decline in developed world in 2nd
half of last century, and
. a recipe for reducing high population growth rates in some
parts of the developing world.
Example 1: Data
I US General Social Survey (GSS):
. Annual or biannual cross-sectional survey (started in 1972)
. Information on no. of children ever born to a woman, etc.
I Number of children is a count variable!
I Alternatively, we could investigate the proportion of childless women
⇒ binary variable!
I Here: Every 4th year 1974 - 2002
I Restriction to women beyond child-bearing age (40 years) to avoid
interfering effect of age:
. “Younger women tend to have fewer children than older women”.
. Otherwise: Consider no. of children for younger women as
censored.
Example 1: Descriptive statistics
I Pool observations over years ⇒ 5150 women (age ≥ 40)
No. of children ever born        Frequencies
to women (age ≥ 40)          Absolute   Relative (%)
0                               744        14.45
1                               706        13.71
2                              1368        26.56
3                              1002        19.46
4                               593        11.51
5                               309         6.00
6                               190         3.69
7                                89         1.73
8 or more                       149         2.89
Table 1: Fertility distribution
Example 1: Research questions

1. Is there a downward trend in fertility (i.e. do earlier birth cohorts
have a higher fertility than later ones)?
2. If yes, to what extent can this trend be explained by the rising
education levels of women?
I Aim: Statistical explanation (more educated women have fewer
children; proportion of more educated women increases over time ⇒
average fertility declines)
I No analysis of why more educated women have fewer children.
⇒ Investigate, whether (over time)
. average levels of fertility went down?
. average levels of education increased?
I Count data model: Problem of “censoring” (category: “8 or more”)

Example 1: Year-by-year statistics

Year   No. of         No. of     Proportion     Years of
       observations   children   of childless   schooling
1974   410            3.17       0.09           11.07
                      (0.10)     (0.01)         (0.16)
1978   445            2.73       0.14           11.00
1982   577            2.96       0.14           11.05
1986   470            2.70       0.16           11.34
1990   431            2.50       0.15           12.41
1994   989            2.40       0.15           12.78
1998   911            2.42       0.15           12.94
2002   917            2.36       0.16           13.25
                      (0.06)     (0.01)         (0.10)
Table 2: Fertility and average education level by years

Example 1: Interpretation of Table 2

I Large no. of observations per year
⇒ Small confidence intervals for population parameters
I Clear evidence of a downward trend in fertility
I Possibly, this trend can (partially) be explained by increased levels of
education.
I Exercise:
. Can the average of a discrete variable be normally distributed?
(Consequences for inference?)
. Test whether the average no. of children is the same in 1974
and in 2002.
. Is the difference in education levels between 1974 and 2002
statistically significant?

Example 1: Linear regression analysis

Dependent Variable: KIDS
Method: Least Squares
Date: 10/12/09 Time: 12:48
Sample: 1 9120 IF AGE >= 40
Included observations: 5150
Variable     Coefficient   Std. Error   t-Statistic   Prob.
YEAR=1974    3.170732      0.093298     33.98506      0.0000
YEAR=1978    2.734831      0.089554     30.53846      0.0000
YEAR=1982    2.960139      0.078646     37.63887      0.0000
YEAR=1986    2.702128      0.087139     31.00926      0.0000
YEAR=1990    2.496520      0.090997     27.43533      0.0000
YEAR=1994    2.400404      0.060071     39.95942      0.0000
YEAR=1998    2.422613      0.062590     38.70613      0.0000
YEAR=2002    2.364231      0.062385     37.89755      0.0000
R-squared            0.018405    Mean dependent var      2.586408
Adjusted R-squared   0.017069    S.D. dependent var      1.905469
S.E. of regression   1.889137    Akaike info criterion   4.111669
Sum squared resid    18350.97    Schwarz criterion       4.121839
Log likelihood       -10579.55   Durbin-Watson stat      1.808056
Figure 4: No. of children in dependence of year dummies

I Why model without constant?
I Predicted no. of children in 1982?

Date: 10/12/09 Time: 12:55
C       3.026134    0.055684   54.34433    0.0000
TIME    -0.026256   0.002929   -8.963844   0.0000
Log likelihood       -10587.50   F-statistic         80.35050
Durbin-Watson stat   1.802490    Prob(F-statistic)   0.000000
Figure 5: No. of children in dependence of time

I time:=year-1974
I Predicted no. of children in 1982?
I Prediction of no. of children in 2000?
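
The two prediction questions for the linear trend model can be checked with a few lines of arithmetic. The following is a minimal sketch that plugs the coefficient estimates reported in Figure 5 into the fitted line; the function name `predict_kids` is a hypothetical helper introduced only for illustration.

```python
# Coefficients of the linear trend model as reported in Figure 5
INTERCEPT = 3.026134
SLOPE_TIME = -0.026256

def predict_kids(year):
    """Predicted no. of children from the trend model, with time := year - 1974."""
    time = year - 1974
    return INTERCEPT + SLOPE_TIME * time

pred_1982 = predict_kids(1982)  # survey year in the sample (time = 8)
pred_2000 = predict_kids(2000)  # not a survey year (time = 26)
print(round(pred_1982, 3), round(pred_2000, 3))
```

For 1982 the trend line gives a lower value than the year-dummy estimate in Figure 4 (2.96), because the line smooths over the year-to-year variation.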


Date: 10/12/09 Time: 12:57
C       4.391621    0.102744   42.74341    0.0000
TIME    -0.014002   0.002967   -4.719580   0.0000
EDUC    -0.128275   0.008187   -15.66723   0.0000
Figure 6: No. of children in dependence of time and years of schooling
I Which model would you prefer?
I Is education related to fertility?
I Discuss potential shortcomings of the linear regression.

5.2 Binary response models

5.2.1. Introduction and model formulation
I Binary dependent variables occur very frequently:
Outcome is true/false, yes/no decision, success/failure.
I E.g.: Event that a person buys a car, is unemployed, smokes, is
childless, has visited a doctor during the last quarter, has an
extramarital affair, has been granted a bank loan, etc.
I Depending on the aim of the analysis, binary variables may be
constructed from multinomial, ordered or continuous variables (with
loss of information!):
. Happy/unhappy instead of ordered variable happiness with
scale from 0 (“completely unhappy”) to 10 (“completely
happy”)
. Recession yes/no? based on GDP (continuous variable)
Example: Discrimination in mortgage market?

I Does a bank treat applicants for a mortgage with German and
Non-German origin (to buy an identical house) the same way?
I By law they must receive identical treatment.
I Loans are made and denied for many legitimate reasons.
I It is not sufficient to compare the fraction of both groups of
applicants who were denied a mortgage.
⇒ We need a method for comparing rates of denial, holding other
characteristics constant.
I Problem is similar to multiple regression, but now the dependent
variable is binary.

Bernoulli variable
I Two possible outcomes of $y$ are usually coded by
1 (yes/“success”) and 0 (no/“failure”), i.e.:
$y = 1$ if “event occurred”, otherwise $y = 0$.
I No loss in generality if interest is only in the probability of success.
⇒ With $p = P(y = 1) = 1 - P(y = 0)$:
$$y \sim \text{Bernoulli}(p) = \text{Bin}(1, p)$$
⇒ Maximum likelihood approach is appropriate.

I Outcome of $y \in \{0, 1\}$ (and thus $p$) might depend on individual’s
characteristics $x = (x_1, \dots, x_K)'$.
I Econometric analysis requires observations of $(y, x)$:
$$(y_i, x_i),\ i = 1, \dots, N; \quad \text{with } x_i = (x_{i1}, \dots, x_{iK})',\ x_{i1} \equiv 1$$

Linear probability model

I Assumption: $E(\varepsilon_i | x_i) = 0$ (⇒ $E(\varepsilon_i) = 0$)
$$y_i = x_i'\beta + \varepsilon_i = \beta_1 + \sum_{j=2}^{K} \beta_j x_{ij} + \varepsilon_i, \quad i = 1, \dots, N$$
I “Probability of success”
$$p_i := P(y_i = 1 | x_i) = E(y_i | x_i) = x_i'\beta$$
⇒ $y_i \sim \text{Bernoulli}(p_i) = \text{Bin}(1, p_i)$
⇒ $V(y_i | x_i) = p_i(1 - p_i) = x_i'\beta(1 - x_i'\beta)$
I Marginal (probability) effect of $x_j$ (on $E(y_i | x_i)$ or $p_i$):
$$\frac{\partial P(y_i = 1 | x_i)}{\partial x_{ij}} = \beta_j = \frac{\partial E(y_i | x_i)}{\partial x_{ij}} \quad (j = 2, \dots, K)$$
. The last equality holds (only) for the chosen coding.
Disadvantages of linear probability model

I Implicit restrictions on $\beta$: $0 \le x_i'\beta \le 1$ ($\forall i$)
I $y_i$ binary ⇒ $\varepsilon_i$ binary (not normally distributed!):
two possible outcomes: $1 - x_i'\beta$ (with probability $p_i$) and
$-x_i'\beta$ (with probability $1 - p_i$)
⇒ Given $x_i$, $\varepsilon_i$ is heteroskedastic:
$$V(\varepsilon_i | x_i) = V(y_i | x_i) = x_i'\beta(1 - x_i'\beta)$$
⇒ OLSE $\hat\beta_{OLS}$ of $\beta$ is unbiased, but inefficient.
I $\hat\beta_{OLS}$ neglects parameter restriction ⇒ possibly:
$$\hat p_i = \hat P(y_i = 1 | x_i) = x_i'\hat\beta_{OLS} \notin [0, 1]!$$
I OLSE with robust standard errors (or MLE for $p_i = x_i'\beta$) may serve
as useful exploratory tool (often: reasonable direct estimation of
average marginal effects and hint to statistically relevant variables).
Emp. Example 1 (determinants of fertility)

I Linear probability model
I CHILDLESS=1 if woman has no children
I Explanatory variables: TIME, EDUC (years of schooling), WHITE
(dummy variable), SIBS (number of siblings)
Dependent Variable: CHILDLESS
Date: 10/19/09 Time: 17:54
Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          0.039729      0.026033     1.526072      0.1271
TIME       0.000602      0.000564     1.067451      0.2858
EDUC       0.007573      0.001637     4.626005      0.0000
WHITE      0.014248      0.013702     1.039826      0.2985
SIBS       -0.002354     0.001561     -1.508272     0.1315

Example 1, Linear probability model, cont.
I 1 year more schooling ⇒ Probability of being childless increases
(ceteris paribus) by 0.76 percentage points.
I Probability of being childless for a white woman surveyed in 1994
with 20 years of schooling and 3 siblings?
21.07%
I Extreme example: time=0, educ=0, white=0, sibs=23
⇒ predicted probability: −1.4% (makes no sense)!
I Interpret the other coefficients, too!
I How would you estimate the error variance (for different women)?
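
The predicted probabilities discussed above can be reproduced, up to rounding, from the printed OLS coefficients. The sketch below assumes the rounded coefficients of the CHILDLESS regression and TIME = year − 1974; the helper name `lpm_prob` is hypothetical.

```python
# OLS coefficients of the linear probability model as printed in the output
coef = {"const": 0.039729, "time": 0.000602, "educ": 0.007573,
        "white": 0.014248, "sibs": -0.002354}

def lpm_prob(time, educ, white, sibs):
    """Fitted 'probability' x'beta of the LPM; it is not restricted to [0, 1]."""
    return (coef["const"] + coef["time"] * time + coef["educ"] * educ
            + coef["white"] * white + coef["sibs"] * sibs)

# White woman surveyed in 1994 (time = 20) with 20 years of schooling, 3 siblings
p_1994 = lpm_prob(time=20, educ=20, white=1, sibs=3)
# Extreme covariate combination: no schooling, 23 siblings
p_extreme = lpm_prob(time=0, educ=0, white=0, sibs=23)
print(p_1994, p_extreme)
```

With the rounded coefficients the first value is roughly 21%, while the second is slightly negative, which illustrates that the LPM does not respect the bounds of a probability.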

Example 1, Linear probability model, cont.

I Heteroskedasticity: Use of White standard errors
Dependent Variable: CHILDLESS
Date: 10/19/09 Time: 18:13
White Heteroskedasticity-Consistent Standard Errors & Covariance
Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          0.039729      0.029000     1.369958      0.1708
TIME       0.000602      0.000539     1.118592      0.2634
EDUC       0.007573      0.001877     4.034242      0.0001
WHITE      0.014248      0.013155     1.083089      0.2788
SIBS       -0.002354     0.001538     -1.530040     0.1261

Nonlinear models for probabilities

$$p_i = P(y_i = 1 | x_i) = G(x_i'\beta)$$
I “Natural” requirements on (known) function $G$:
(i) $G$ is monotonically increasing.
(ii) $G(z) \to 0$ as $z \to -\infty$ and $G(z) \to 1$ as $z \to \infty$.
⇒ Practice: $G$ is some (cumulative) distribution function (cdf).
I (i) ⇒ For $\beta_j > 0$, $P(y_i = 1 | x_i)$ is increasing in $x_{ij}$, that is, positive
(negative) coefficients correspond to positive (negative) effects on
$p_i$.
I If $G$ is a differentiable cdf, then $g(z) = G'(z)$ is its density.
⇒ Level-dependent marginal effects (on probability of success):
$$\frac{\partial E(y_i | x_i)}{\partial x_{ij}} = \frac{\partial P(y_i = 1 | x_i)}{\partial x_{ij}} = g(x_i'\beta)\beta_j \quad (j = 2, \dots, K)$$
I Usually, $g$ is a density with relatively smaller values in the tails and
relatively larger values near the mean.
⇒ Effects are smallest for individuals for which $P(y_i = 1 | x_i)$ is near 0
(in the left tail of $g$) or near 1 (in the right tail of $g$).
. Corresponds to intuition: Individuals with clear-cut preferences
are less affected by changes in the explanatory variables.
I Sensitivity of decisions to changes in $x$ depends on shape of $g$.
W.l.o.g. ($x_{i1} \equiv 1$!): $\int z\, g(z)\, dz = 0$ (i.e.: $E[Z] = 0$ if $Z \sim G$)
. Typically, $g$ is unimodal and symmetric around 0.
⇒ $g(z) = g(-z)$ and $\max_z g(z) = g(0)$
⇒ Marginal effects are maximal if $g(x_i'\beta)$ is maximal (i.e. if
$x_i'\beta \approx 0$). Then: $P(y_i = 1 | x_i) = G(x_i'\beta) \approx G(0) = \frac{1}{2}$.

Examples for choice of G

a) Probit model
$$G(z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx \quad \text{(cdf of } N(0, 1)\text{)}$$
b) Logit model
$$G(z) = \Lambda(z) = \frac{e^z}{1 + e^z} \quad \text{[(standard) logistic cdf]}$$
c) Complementary log-log model
$$G(z) = C(z) = 1 - \exp(-\exp(z)) \quad \text{(cdf of extreme value distr.)}$$
I Remark: $G(z) = z$ (identity function, no cdf)
⇒ Linear probability model!
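
The three choices of $G$ listed above can be evaluated with the Python standard library alone. This is a minimal sketch; the function names `probit_cdf`, `logit_cdf` and `cloglog_cdf` are illustrative and not taken from any package.

```python
import math

def probit_cdf(z):
    """Standard normal cdf Phi(z), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit_cdf(z):
    """Standard logistic cdf Lambda(z) = e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def cloglog_cdf(z):
    """Complementary log-log cdf C(z) = 1 - exp(-exp(z))."""
    return 1.0 - math.exp(-math.exp(z))

# Tabulate the three response probabilities at a few index values z = x'beta
for z in (-2.0, 0.0, 2.0):
    print(z, probit_cdf(z), logit_cdf(z), cloglog_cdf(z))
```

Probit and logit both give $G(0) = 1/2$, whereas the complementary log-log cdf is not symmetric around 0 and gives $C(0) = 1 - e^{-1}$.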

Identification considerations
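I Identifiability of $\beta$ requires: $G(z)$ is strictly increasing, $\text{rk}(X) = K$.
I Moreover, mean and variance must be fixed: Let $G$, $\tilde G$ be cdf’s with
associated densities $g$, $\tilde g$, and suppose that
$$Z \sim G, \quad U := (Z + \mu)/\sigma \sim \tilde G \quad (\text{for } \sigma > 0).$$
⇒ $\tilde G(u) := P(U < u) = G(\sigma u - \mu)$, $\tilde g(u) = \sigma g(\sigma u - \mu)$
⇒
$$P(y_i = 1 | x_i) = G(x_i'\beta) = \tilde G\left(\frac{x_i'\beta + \mu}{\sigma}\right) = \tilde G\left(\frac{\beta_1 + \mu}{\sigma} + \sum_{j=2}^{K} \frac{\beta_j}{\sigma}\, x_{ij}\right)$$
⇒ $\beta$ is not identifiable unless $\mu$ and $\sigma$ are fixed:
$$\tilde\beta_1 = (\beta_1 + \mu)/\sigma, \quad \tilde\beta_j = \beta_j/\sigma \ (j = 2, \dots, K) \quad \Rightarrow \quad G(x_i'\beta) = \tilde G(x_i'\tilde\beta).$$
⇒ $\beta$ and $\tilde\beta$ are not distinguishable (unless fixing $\mu$ and $\sigma$).
⇒ Typically, $\mu = 0$ (as before); additionally, fix $\sigma$.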
Model interpretation in terms of latent variables
I $y^*$ - latent (unobservable, continuous) variable to be explained
⇒ “Natural” regression model for $y^*$ (Index function model):
$$y_i^* = x_i'\beta + \varepsilon_i, \quad \varepsilon_i \text{ i.i.d.}$$
. $y_i^*$ depends on individual characteristics $x_i$ via an index
function $x_i'\beta$ (representing systematic effects on $y_i^*$).
. Model is not estimable, since $y^*$ is not observed!
I Instead we observe only a binary variable $y$, which takes value 1 or 0
according to whether or not $y^*$ crosses a threshold:
$$y_i = \begin{cases} 1, & \text{if } y_i^* > 0 \\ 0, & \text{else} \end{cases}$$
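I Motivation of threshold 0:
. $\varepsilon_i$, $x_i$ independent with $\varepsilon_i \sim G$ (otherwise, $\varepsilon_i | x_i \sim G$) ⇒
$$P(y_i = 1 | x_i) = P(y_i^* > 0 | x_i) = 1 - G(-x_i'\beta) = G(x_i'\beta) \quad [\text{if } g(z) = g(-z)]$$
I Again, identification of single-index model requires restriction on
$V(\varepsilon_i)$, because $\beta$ is identifiable only up to scaling.
. Observe only, whether
$$y_i^* > 0 \ \Leftrightarrow\ x_i'\beta + \varepsilon_i > 0 \ \Leftrightarrow\ x_i'(\sigma\beta) + (\sigma\varepsilon_i) > 0 \quad (\forall\, \sigma > 0).$$
⇒ Uniqueness is achievable by a restriction on the error variance, e.g.
$$V(\varepsilon_i) = \begin{cases} 1, & \text{in probit model} \\ \pi^2/3, & \text{in logit model.} \end{cases}$$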

Choice of threshold
I Threshold is not necessarily 0, e.g.
$$y_i = 1 \ \Leftrightarrow\ y_i^* > z_i'\delta \quad (\text{if } x_i \text{ and } z_i \text{ deterministic}).$$
⇒ $P(y_i = 1) = G(x_i'\beta - z_i'\delta)$
⇒ $\delta$ is separately identifiable only if all components of $z_i$ and $x_i$ are
distinguishable.
. Require that $(X, Z)$ has full rank.
. In particular, $z_i$ and $x_i$ should not both include an intercept.

Model interpretation in terms of utilities

I Individual chooses between two alternatives 0 and 1 such that the
utility is maximized.
I Let $u_{i0}$ and $u_{i1}$ be the utility when choosing 0 and 1, resp.
I Example: $u_{i0}$ - utility of rental housing;
$u_{i1}$ - utility of home ownership
I Specification by an additive random utility model:
$$u_{i0} = x_i'\beta^0 + \varepsilon_{0i}$$
$$u_{i1} = x_i'\beta^1 + \varepsilon_{1i}$$
. $x_i'\beta^j$, $\varepsilon_{ji}$ are the deterministic and stochastic utility
components, respectively; $j = 0, 1$.
I Utility maximization yields
$$y_i = 1 \ \Leftrightarrow\ u_{i0} < u_{i1} \ \Leftrightarrow\ x_i'\underbrace{(\beta^1 - \beta^0)}_{=:\beta} + \underbrace{\varepsilon_{1i} - \varepsilon_{0i}}_{=:\varepsilon_i} > 0$$
or $y_i = 0 \ \Leftrightarrow\ x_i'\beta + \varepsilon_i \le 0$
⇒ Model as before: $P(y_i = 1 | x_i) = G(x_i'\beta)$
(under same assumption on distribution of $\varepsilon_i$)
I Model requires (for identification) a scale normalization, since:
$$u_{i1} > u_{i0} \ \Leftrightarrow\ \sigma u_{i1} > \sigma u_{i0} \quad (\forall\, \sigma > 0)$$
This is usually done by specifying the variance of $\varepsilon_i = \varepsilon_{1i} - \varepsilon_{0i}$ (as
before) or by specifying the variances of $\varepsilon_{1i}$ and $\varepsilon_{0i}$ separately.
I The random utility formulation is especially useful for specifying
unordered multinomial choice models.

Graphical illustration with artificial data
I $x_{i2} = i$ for $i = 1, \dots, 200$
I $y_i^* = -10 + 0.1\, x_{i2} + \varepsilon_i$ simulated by drawing $\varepsilon_i$ independently
from $N(0, 1)$ for $i = 1, \dots, 200$.
I Binary dependent variable:
$y_i = 0$ if $y_i^* < 0$
$y_i = 1$ if $y_i^* \ge 0$
I Regress $y_i$ on $x_i = (1, x_{i2})'$ (estimate LPM by OLS).
I Plot $x_{i2}$ against $y_i$ (scatter plot) together with regression line
($x_i'\hat\beta_{OLS}$).
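
The simulation steps above can be sketched as follows, using only the standard library. The seed is an arbitrary assumption made for reproducibility, and the OLS estimates are obtained from the closed-form formulas of the simple regression rather than from a regression package.

```python
import random

random.seed(0)  # arbitrary seed, only to make the sketch reproducible

# Artificial data: x_i2 = i, y*_i = -10 + 0.1 x_i2 + eps_i with eps_i ~ N(0, 1)
x = list(range(1, 201))
y_star = [-10 + 0.1 * xi + random.gauss(0.0, 1.0) for xi in x]
y = [1 if ys >= 0 else 0 for ys in y_star]

# LPM by OLS of y on (1, x): closed-form simple-regression formulas
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Fitted values of the regression line at the observed regressor values
fitted = [intercept + slope * xi for xi in x]
print(intercept, slope, min(fitted), max(fitted))
```

Because the observed $y_i$ switch from 0 to 1 around $x_{i2} = 100$, the fitted line falls below 0 for small $x_{i2}$ and rises above 1 for large $x_{i2}$.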

Figure 7: Observed ($y_i$) and fitted values ($x_i'\hat\beta_{OLS}$) in LPM.
(plot of Y against X, with X from 0 to 200)

Figure 8: Nonlinear (probit) model: $y_i$ and $G(x_i'\hat\beta)$
(plot of Y and PROB Y against X, with X from 0 to 200)

5.2.2 Probit and logit models

I Probit-model: $G(z) = \Phi(z) = \int_{-\infty}^{z} \phi(x)\, dx$
. $\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$ and $\Phi(z)$ denote the density and the cdf,
respectively, of the standard normal distribution.
I Logit-model: $G(z) = \Lambda(z) = \frac{e^z}{1 + e^z}$
. $\Lambda$ denotes the cdf of the standard logistic distribution.
. The associated density function is
$$\lambda(z) \ [= \Lambda'(z)] = \frac{e^z}{(1 + e^z)^2}$$

The general logistic distribution

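I Random variable $Z$ is logistically distributed with the parameters $\mu$
and $\kappa$, if
$$\Lambda(z; \mu, \kappa) = \frac{1}{1 + \exp\left(-\frac{z - \mu}{\kappa}\right)} = \frac{e^{(z - \mu)/\kappa}}{1 + e^{(z - \mu)/\kappa}} \quad \text{(distribution function)}$$
$$\lambda(z; \mu, \kappa) = \frac{d}{dz}\Lambda(z; \mu, \kappa) = \frac{e^{-(z - \mu)/\kappa}}{\kappa\left[1 + e^{-(z - \mu)/\kappa}\right]^2} \quad \text{(density function)}$$
I Moments for $Z \sim \Lambda(\cdot; \mu, \kappa)$, where $\Lambda(z; 0, 1) = \Lambda(z)$:
$$E[Z] = \text{Median}(Z) = \mu, \quad V(Z) = \frac{1}{3}\pi^2 \kappa^2$$
$$\frac{E[(Z - \mu)^3]}{[V(Z)]^{3/2}} = 0 \ \text{(Skewness)}, \quad \frac{E[(Z - \mu)^4]}{[V(Z)]^2} = 3 + \frac{6}{5} \ \text{(Kurtosis)}$$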

Model comparison
I Moments
cdf   Expectation   Variance   Skewness   Kurtosis
Φ     0             1          0          3
Λ     0             π²/3       0          3 + 6/5
I For comparing the distributions, we standardize the standard logistic
distribution by $\sigma = \pi/\sqrt{3}$:
⇒ Standardized logistic distribution with
. cdf $\Lambda(\sigma z)$ and density $\sigma\lambda(\sigma z)$
⇒ Expectation 0, Variance 1 (as for $\Phi$)

Cdf of standard normal law and standardized logistic distribution
(Figure: distribution functions G(x) of the standard normal and the
standardized logistic distribution, x from −4 to 4)

Density function of standard normal law and standardized logistic distribution
(Figure: density functions g(x) of the standard normal and the
standardized logistic distribution, x from −4 to 4)
I The above figures show:
. Both distributions (models) are very similar.
. Differences exist mainly in the center (around the mean 0) and
also in both tails, where the logistic density function has larger
values than that of the normal distribution.
⇒ There, the (absolute) marginal effects in the logit model are
larger than in the probit model.
⇒ The results based on both models do not differ much unless the
tails of the distributions are of importance. They are similar as long
as the data are not very “unbalanced”, i.e. as long as the share of successes
$$\#\{i \mid y_i = 1\}/N = \sum_{i=1}^{N} y_i / N$$
does not differ too much from $\frac{1}{2}$.

Comparing parameters
I For comparing the parameter estimates in both models, a scaling
different from the factor $\pi/\sqrt{3} \approx 1.8$ is recommended.
I The parameters should be scaled such that the maximal effects
(obtained at $x'\beta = 0$) are comparable:
$$\frac{\max_z \phi(z)}{\max_z \lambda(z)} = \frac{\phi(0)}{\lambda(0)} = \frac{\frac{1}{\sqrt{2\pi}}\, e^0}{\frac{e^0}{(e^0 + 1)^2}} = \frac{4}{\sqrt{2\pi}} \approx 1.6 =: \rho$$
⇒ For comparison, the (estimated) parameters of the probit model are
multiplied by $\rho$ ($\approx 1.6$) (or those of the logit model are divided by
$\rho$).
I Note: $\lambda$ density of $Z$ ⇒ density $\tilde\lambda$ of $\tilde Z = Z/\rho$ satisfies:
$$\tilde\lambda(z) = \rho \cdot \lambda(\rho z) \quad \text{and} \quad \tilde\lambda(0) = \phi(0)$$
Reporting marginal effects

I Marginal effects vary across the individuals (depend on $x_i$)!
I Average marginal effects of the individuals in the sample:
$$\frac{1}{N}\sum_{i=1}^{N} \frac{\partial P(y_i = 1 | x_i)}{\partial x_{ij}} = \frac{1}{N}\sum_{i=1}^{N} \beta_j\, g(x_i'\beta) \quad (j = 2, \dots, K).$$
I Sometimes: effect evaluated at the average values of $x_i$’s (simpler
to compute, but less clear interpretation):
$$\left.\frac{\partial P(y_i = 1 | x_i)}{\partial x_{ij}}\right|_{x_i = \bar x_\cdot} = g(\bar x_\cdot'\beta)\,\beta_j, \quad \bar x_\cdot = \frac{1}{N}\sum_{i=1}^{N} x_i$$
I In general, different values because $g$ is nonlinear and thus:
$$E[g(x_i'\beta)] \ne g(E[x_i'\beta]).$$
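
The difference between the two summary measures can be illustrated with a small logit example. The coefficient values and regressor values below are hypothetical and chosen only to make the arithmetic visible.

```python
import math

def logistic_pdf(z):
    """lambda(z) = Lambda(z) * (1 - Lambda(z))."""
    p = 1.0 / (1.0 + math.exp(-z))
    return p * (1.0 - p)

# Hypothetical logit coefficients (intercept, slope) and regressor values
beta = (-1.0, 0.5)
x2 = [0.0, 1.0, 2.0, 4.0, 8.0]

index = [beta[0] + beta[1] * v for v in x2]

# Average marginal effect: sample mean of beta_j * g(x_i' beta)
ame = sum(beta[1] * logistic_pdf(z) for z in index) / len(index)

# Marginal effect at the average regressor value: g(xbar' beta) * beta_j
x2_bar = sum(x2) / len(x2)
mem = beta[1] * logistic_pdf(beta[0] + beta[1] * x2_bar)

print(ame, mem)
```

In this example the effect at the average is larger than the average effect, because the average regressor value lies closer to the center of the density than most individual observations do.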

Case of discrete explanatory variables

I Recommended: compute the discrete change in the probabilities
associated with a discrete change in the j-th explanatory variable by
an amount $\Delta x_{ij}$:
$$\Delta p_{ij} := G(x_i'\beta + \Delta x_{ij}\,\beta_j) - G(x_i'\beta)$$
I Marginal effects of dummy variables ($x_{ij} \in \{0, 1\}$) ⇒ compare the
probabilities of success in the situations $x_{ij} = 0$ and $x_{ij} = 1$:
$$P(y_i = 1 | x_i, x_{ij} = 1) - P(y_i = 1 | x_i, x_{ij} = 0) = G\Big(\beta_j + \sum_{k \ne j} \beta_k x_{ik}\Big) - G\Big(\sum_{k \ne j} \beta_k x_{ik}\Big).$$
I Again, one can compute the average effect, or the effect for the
average characteristics.
An advantage of the logit model

I For computing marginal effects, we need the density $g(x_i'\beta)$. In case
of the logit model one obtains for $g(x_i'\beta) = \lambda(x_i'\beta)$:
$$\lambda(z) = \Lambda(z)[1 - \Lambda(z)], \quad z = x_i'\beta$$
I An advantage of the logit model is that the logistic distribution
function $\Lambda(z)$ and its inverse, the so-called logit function, are given
in closed form - in contrast to $\Phi(z)$ and its inverse (probit
function). It can be shown that, with $p = \Lambda(z)$ ($= \Lambda(x'\beta)$),
$$z = [x'\beta =]\ \text{logit}(\Lambda(z)) = \text{logit}(p) = \Lambda^{-1}(p) = \ln\frac{p}{1 - p},$$
i.e. the logit function may be expressed as log-odds.
⇒ $\beta_j$ is the semielasticity of the odds w.r.t. the j-th regressor.
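
The closed-form relations of the logit model are easy to verify numerically. In the sketch below the index value `z = 0.8` is an arbitrary assumption; the function names are illustrative.

```python
import math

def Lambda(z):
    """Standard logistic cdf."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of Lambda: the log-odds ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

z = 0.8                 # hypothetical index value x'beta
p = Lambda(z)
odds = p / (1.0 - p)    # equals exp(z)

# The density follows from the cdf: lambda(z) = Lambda(z) * (1 - Lambda(z))
density = p * (1.0 - p)
density_direct = math.exp(z) / (1.0 + math.exp(z)) ** 2
print(logit(p), odds, density, density_direct)
```

Since the odds equal $\exp(x'\beta)$, a one-unit increase in the j-th regressor multiplies the odds by $\exp(\beta_j)$.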

Special cases of generalized linear models

I (Conditional) expectation of the dependent variable $\mu_i = E(y_i | x_i)$ is
related to a systematic component (linear predictor, i.e. $\eta_i = x_i'\beta$)
via a link function $h$:
$$\eta_i = h(\mu_i).$$
I Binary dependent variable ⇒ $\mu_i = P(y_i = 1 | x_i) = p_i$ and:
Logit model: $\eta_i = \text{logit}(\mu_i) = \ln\frac{\mu_i}{1 - \mu_i}$ (logit function)
Probit model: $\eta_i = \text{probit}(\mu_i) = \Phi^{-1}(\mu_i)$ (probit function)
Complementary log-log model: $\eta_i = \ln(-\ln(1 - \mu_i))$
I Canonical link function (for binary $y$): logit function (cp.
McCullagh & Nelder, 1983)

5.2.3 Maximum likelihood (ML) estimation

I The notation of this subsection distinguishes between random
variables $Y_i$ and their realizations $y_i$. Moreover,
$$Y = (Y_1, \dots, Y_N)' \quad \text{and} \quad y = (y_1, \dots, y_N)'$$
I Assumption: $Y_i$ and $x_i$ are stochastic with $(Y_i, x_i)$ i.i.d.
. or $x_i$ deterministic and $Y_i$ independent random variables
⇒ $Y_i | x_i$ (stochastically) independent ($i = 1, \dots, N$)
⇒ $Y_i | x_i \sim \text{Bernoulli}(p_i)$ (since $Y_i$ binary)
I Model assumption: $p_i = G(x_i'\beta)$ [$G(z) = \Lambda(z)$ or $G(z) = \Phi(z)$]
I Marginal distribution of $x_i$ does not depend on $\beta$ ⇒ it contains no
information about $\beta$ (regressors are then weakly exogenous) and it
suffices to consider the (conditional) likelihood function.

(Conditional) likelihood function

I For binary random variables $Y_i$ with realizations $y_i \in \{0, 1\}$ we
obtain as probability mass function (“density”):
$$f(y_i | x_i) = P(Y_i = y_i | x_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}.$$
⇒ (Conditional) likelihood function (joint density of $Y$ (given $X$), read
as function of the parameter $\beta$ for given observation $y$):
$$L(\beta) = L(\beta; y) = \prod_{i=1}^{N} f(y_i | x_i) = \prod_{i=1}^{N} p_i^{y_i}(1 - p_i)^{1 - y_i} = \prod_{i=1}^{N} G(x_i'\beta)^{y_i}\,[1 - G(x_i'\beta)]^{1 - y_i}.$$
. Usually, this function is called likelihood function, although it
is - strictly speaking - a conditional likelihood function.
ML estimator (MLE)
I Log-likelihood function:
$$\ell(\beta) = \ell(\beta; y) = \ln[L(\beta)] = \sum_{i=1}^{N} \{y_i \ln(p_i) + (1 - y_i)\ln(1 - p_i)\} = \sum_{i=1}^{N} \{y_i \ln[G(x_i'\beta)] + (1 - y_i)\ln[1 - G(x_i'\beta)]\}.$$
I Any maximizer $\hat\beta$ of the (log-) likelihood function is called MLE.
⇒ Necessary condition for MLE (likelihood equations):
$$s(\beta; y) = \frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{N} \frac{(y_i - p_i)}{p_i(1 - p_i)}\, g(x_i'\beta)\, x_i \overset{!}{=} 0 \quad (p_i = G(x_i'\beta)).$$

Numerical procedures
I There is no explicit solution to the likelihood equations.
I However, in case of both the logit and the probit model the
log-likelihood function $\ell(\beta)$ is globally concave, so that the MLE
is unique whenever it exists.
⇒ Numerical procedures, such as the Newton-Raphson procedure, will
converge quickly to the MLE in these models:
$$\beta_{(n+1)} = \beta_{(n)} - H^{-1}(\beta_{(n)})\, s(\beta_{(n)}) \quad \text{(Iteration)},$$
where $s(\beta) = \frac{\partial \ell(\beta)}{\partial \beta}$ is the gradient (score) vector and $H(\beta)$ is the
Hessian matrix of $\ell(\beta)$:
$$H(\beta) = \frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta'}.$$
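
The Newton-Raphson iteration can be sketched for a logit model with an intercept and one regressor ($K = 2$). The sketch assumes the logit forms of the score, $\sum_i (y_i - p_i)x_i$, and of the Hessian, $-\sum_i p_i(1 - p_i)x_i x_i'$; the toy data and the function names are hypothetical, and the 2×2 inverse is written out by hand to avoid any dependency.

```python
import math

def Lambda(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_and_hessian(beta, x, y):
    """Score s(beta) and Hessian H(beta) of the logit log-likelihood, K = 2."""
    s = [0.0, 0.0]
    h = [[0.0, 0.0], [0.0, 0.0]]
    for xi, yi in zip(x, y):
        row = (1.0, xi)                      # regressor vector (intercept, x_i2)
        p = Lambda(beta[0] * row[0] + beta[1] * row[1])
        for a in range(2):
            s[a] += (yi - p) * row[a]
            for b in range(2):
                h[a][b] -= p * (1.0 - p) * row[a] * row[b]
    return s, h

def newton_raphson(x, y, tol=1e-10, max_iter=50):
    beta = [0.0, 0.0]                        # starting value
    for _ in range(max_iter):
        s, h = score_and_hessian(beta, x, y)
        det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
        # step = H^{-1} s, using the explicit inverse of a 2x2 matrix
        step = [(h[1][1] * s[0] - h[0][1] * s[1]) / det,
                (h[0][0] * s[1] - h[1][0] * s[0]) / det]
        beta = [beta[0] - step[0], beta[1] - step[1]]
        if max(abs(step[0]), abs(step[1])) < tol:
            break
    return beta

# Hypothetical toy data (zeros and ones overlap, so a finite MLE exists)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
beta_hat = newton_raphson(x, y)
final_score, _ = score_and_hessian(beta_hat, x, y)
print(beta_hat, final_score)
```

At the returned value the score is numerically zero, i.e. the likelihood equations hold. For larger $K$ one would replace the hand-written inverse by a linear solver.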
Fisher information
I The Fisher information is the negative expected Hessian matrix
(here: conditional expectation given $X$ if $X$ is random), i.e.:
$$I(\beta) = -E_\beta[H(\beta)] = -E_\beta\left[\frac{\partial s(\beta)}{\partial \beta'}\right] = \sum_{i=1}^{N} \frac{g(x_i'\beta)^2}{p_i(1 - p_i)}\, x_i x_i'.$$
⇒ $I(\beta)$ is positive definite for any value of $\beta \in \mathbb{R}^K$, if
. $G$ is cdf with $0 < G(z) < 1$ for all $z \in \mathbb{R}$, and
. $X = (x_1, \dots, x_N)'$ has full rank $K$ (with probability one).
⇒ The expected Hessian is then negative definite.
I However, from this fact it does not follow that the actual Hessian is
negative definite (and thus that the log-likelihood is globally
concave).

Fisher information - Interpretation

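I Cramér-Rao: $I(\beta)^{-1} \preceq V[\tilde\beta]$ for any unbiased estimator $\tilde\beta$.
I Alternative definition (equivalent under regularity conditions):
$$\underset{K \times K}{I(\beta)} = E_\beta[s(\beta, Y)\, s(\beta, Y)'] = E_\beta\left[\frac{\partial \ell(\beta)}{\partial \beta}\, \frac{\partial \ell(\beta)}{\partial \beta'}\right].$$
⇒ For $K = 1$:
$$I(\beta) = E_\beta\left[\left(\frac{\partial \ell(\beta; Y)}{\partial \beta}\right)^2\right] = E_\beta\left[\left(\frac{\frac{d}{d\beta} f(Y, \beta)}{f(Y, \beta)}\right)^2\right]$$
I $\frac{\frac{d}{d\beta} f(y, \beta)}{f(y, \beta)}$ describes the relative rate of change of the density in $y$.
⇒ The larger $I(\beta)$ is in $\beta = \beta_0$, the easier it is to distinguish $\beta_0$
from adjacent parameter values and the more precise one can
estimate the parameter value $\beta = \beta_0$.
⇒ $I(\beta)$ describes the information contained in $y$ about $\beta$.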
Asymptotic properties of the MLE $\hat\beta$

I Assumption: The model is correctly specified. Then:
I $\hat\beta$ is consistent, i.e.
$$\hat\beta \xrightarrow{p} \beta \quad \text{as } N \to \infty.$$
. This follows basically from $E(Y_i | x_i) = p_i$.
I Under certain regularity conditions (see Amemiya, 1985), $\hat\beta$ is
asymptotically normally distributed and asymptotically efficient, i.e.
$$\sqrt{N}(\hat\beta - \beta) \xrightarrow[N \to \infty]{d} N\left(0, \lim_{N \to \infty} [I(\beta)/N]^{-1}\right),$$
where $I(\beta)$ is the Fisher information matrix.
⇒ Approximate (“asymptotic”) covariance matrix of the MLE:
$$AV(\hat\beta) = I(\beta)^{-1}.$$
On the definition of asymptotic efficiency
I A consistent and asymptotically normally distributed estimator is called
asymptotically efficient, if the covariance matrix of the asymptotic
distribution is $V_\infty(\beta) := \lim_{N \to \infty} [I(\beta)/N]^{-1}$ or the resulting
approximate covariance matrix is $I(\beta)^{-1}$.
I There exist several asymptotically efficient estimators with very
different finite sample properties.
I Concept suggests $V_\infty(\beta)$ as lower bound for the covariance matrix
in the asymptotic distribution of consistent estimators.
However, the covariance matrix (in asympt. distribution) of
so-called super-efficient estimators may fall below that bound!

Statistical inference
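I For large $N$, the following approximate distribution can be used:
$$\hat\beta \approx N_K(\beta, \hat V) \quad (\hat V \text{ suitable estimator of } I(\beta)^{-1}).$$
(a) $\hat V_1 = I(\hat\beta)^{-1}$
(b) $\hat V_2 = \left[\sum_{i=1}^{N} \left.\frac{\partial \ell_i}{\partial \beta}\, \frac{\partial \ell_i}{\partial \beta'}\right|_{\beta = \hat\beta}\right]^{-1}$
(c) $\hat V_3 = -\left[\left.\frac{\partial^2 \ell}{\partial \beta\, \partial \beta'}\right|_{\beta = \hat\beta}\right]^{-1}$
(a) ⇒ $\hat V_1 = \left[\sum_{i=1}^{N} \frac{g(x_i'\hat\beta)^2}{\hat p_i(1 - \hat p_i)}\, x_i x_i'\right]^{-1}$
(b) ⇒ $\hat V_2 = \left[\sum_{i=1}^{N} \frac{(y_i - \hat p_i)^2}{\hat p_i^2 (1 - \hat p_i)^2}\, g(x_i'\hat\beta)^2\, x_i x_i'\right]^{-1}$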
Special case of logit model

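I $G(z) = \Lambda(z) = \frac{e^z}{1 + e^z}$, $g(z) = \lambda(z) = \Lambda'(z) = \frac{e^z}{(1 + e^z)^2}$
⇒ $\Lambda(z)(1 - \Lambda(z)) = \lambda(z)$
I $p_i = P(Y_i = 1 | x_i) = \Lambda(x_i'\beta) = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}} =: \Lambda_i$
⇒ $\lambda_i := \lambda(x_i'\beta) = \Lambda_i(1 - \Lambda_i)$
⇒ Likelihood equations:
$$s(\beta) = \frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{N} (y_i - p_i)\, x_i = 0.$$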

Logit model, cont.

I Vector of residuals $(y_i - \hat p_i)_{i=1,\dots,N}$ with $\hat p_i = \Lambda(x_i'\hat\beta)$ is orthogonal
to the regressors (similar to the linear model).
I If $x_{i1} \equiv 1$ (i.e. model includes an intercept), then the first likelihood
equation yields:
$$\sum_{i=1}^{N} [y_i - \Lambda(x_i'\hat\beta)] = 0 \quad \Leftrightarrow \quad \bar y_\cdot = \frac{1}{N}\sum_{i=1}^{N} \Lambda(x_i'\hat\beta) = \bar{\hat p}_\cdot$$
⇒ The average estimated probability of success equals the observed
frequency of success.

Logit model: Uniqueness of MLE

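I Hessian matrix of $\ell(\beta)$ is negative definite and thus $\ell(\beta)$ is globally
concave:
$$\frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta'} = \frac{\partial s(\beta)}{\partial \beta'} = -\sum_{i=1}^{N} x_i \underbrace{\frac{\partial p_i}{\partial \beta'}}_{\lambda(x_i'\beta)\, x_i'} = -\sum_{i=1}^{N} \underbrace{\lambda_i}_{\Lambda_i(1 - \Lambda_i)} x_i x_i'$$
⇒
$$\frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta'} = -\sum_{i=1}^{N} p_i(1 - p_i)\, x_i x_i' \quad \text{n.d.}$$
I Assume that $X'X = \sum_{i=1}^{N} x_i x_i'$ is regular (p.d.) and $p_i \in (0, 1)$,
implying $\lambda_i = p_i(1 - p_i) > 0$ ($\forall i$).
I Note that the Hessian matrix does not depend on $y$.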

Logit model: Fisher Information

$$I(\beta) = -E\left[\frac{\partial^2 \ell(\beta)}{\partial\beta\,\partial\beta'}\right] = \sum_{i=1}^{N} p_i(1-p_i)\,x_i x_i'$$
$$\Rightarrow\ \hat V = I(\hat\beta)^{-1} = \left[\sum_{i=1}^{N} \hat p_i(1-\hat p_i)\,x_i x_i'\right]^{-1}$$
I This estimator of $AV(\hat\beta)$ corresponds to $\hat V_3$ and coincides with the
representation of $\hat V_1$ [note $\hat\lambda_i = g(x_i'\hat\beta) = \hat p_i(1-\hat p_i)$].
I Previous representation of $\hat V_2$: use $(y_i - \hat p_i)^2$ instead of $\hat p_i(1-\hat p_i)$,
since again $\hat\lambda_i = g(x_i'\hat\beta) = \hat p_i(1-\hat p_i)$.
. Relation between the “outer product” of the score vector and
the Fisher information is obvious from
$$E[(Y_i - p_i)^2|x_i] = V(Y_i|x_i) = p_i(1-p_i)$$
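
The logit results above (score, Fisher information, and the two variance estimators) can be illustrated with a minimal Python sketch. It assumes a model with an intercept and a single regressor, so that the 2x2 information matrix can be inverted by hand; the simulated data, the true coefficients (-0.5, 1.0) and all function names are hypothetical choices for illustration, not part of the lecture material.

```python
import math
import random

def sigmoid(z):
    """Logistic cdf Lambda(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def inv2(a, b, c):
    """Inverse of the symmetric 2x2 matrix [[a, b], [b, c]] as (i00, i01, i11)."""
    det = a * c - b * b
    return c / det, -b / det, a / det

def weighted_moments(w, x):
    """Entries of sum_i w_i x_i x_i' for x_i = (1, x_i)'."""
    return (sum(w),
            sum(wi * xi for wi, xi in zip(w, x)),
            sum(wi * xi * xi for wi, xi in zip(w, x)))

def fit_logit(x, y, tol=1e-12, max_iter=50):
    """Newton-Raphson for P(Y=1|x) = Lambda(b0 + b1*x)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Score s(beta) = sum_i (y_i - p_i) x_i
        s0 = sum(yi - pi for yi, pi in zip(y, p))
        s1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        # Fisher information I(beta) = sum_i p_i (1 - p_i) x_i x_i'
        i00, i01, i11 = inv2(*weighted_moments([pi * (1.0 - pi) for pi in p], x))
        d0 = i00 * s0 + i01 * s1
        d1 = i01 * s0 + i11 * s1
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Simulated sample (hypothetical data-generating process).
random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(2000)]
y = [1 if random.random() < sigmoid(-0.5 + 1.0 * xi) else 0 for xi in x]

b0_hat, b1_hat = fit_logit(x, y)
p_hat = [sigmoid(b0_hat + b1_hat * xi) for xi in x]
score = (sum(yi - pi for yi, pi in zip(y, p_hat)),
         sum((yi - pi) * xi for yi, pi, xi in zip(y, p_hat, x)))

# V1 = V3: inverse information; V2: inverse outer product of the scores.
V1 = inv2(*weighted_moments([pi * (1.0 - pi) for pi in p_hat], x))
V2 = inv2(*weighted_moments([(yi - pi) ** 2 for yi, pi in zip(y, p_hat)], x))
se_info = math.sqrt(V1[2])   # standard error of b1_hat based on V1
se_opg = math.sqrt(V2[2])    # standard error of b1_hat based on V2
```

At the solution the score is numerically zero, so the mean of `p_hat` matches the sample mean of `y`, and the two standard errors should be close when the model is correctly specified.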

Probit model
I The analysis is technically somewhat more involved.
I But the Hessian matrix is here again negative definite, so that there
are generally no problems with the numerical determination of the
MLE.

Perfect prediction
I An MLE does not always exist.
For example, if rank(X ) < K , then the parameter β is not
identifiable (as in the linear case).
Assuming rank(X ) = K (achievable e.g. by a re-parametrization of
the model) avoids that problem.
I However, in a nonlinear binary response model one may be
confronted with the so-called problem of perfect prediction.
. It is typically a problem of the sample at hand and not of
identification.
. It would possibly disappear if more data (or another sample)
were available.

Perfect prediction: Example

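I Response $y_i$; regressors $x_i$ and a dummy variable $d_i$ with:
$y_i = 1$ whenever $d_i = 1$, and $y_i = 1$ or $0$ if $d_i = 0$.
⇒ Impossible to estimate the effect of $d_i$ on $P(Y_i = 1|x_i, d_i)$:
$$\ell(\beta, \delta) = \sum_{i=1}^{N} \{y_i \ln[G(x_i'\beta + \delta d_i)] + (1 - y_i) \ln[1 - G(x_i'\beta + \delta d_i)]\}$$
$$= \sum_{i:\, d_i=1} \ln[G(x_i'\beta + \delta)] + \sum_{i:\, d_i=0} \{y_i \ln[G(x_i'\beta)] + (1 - y_i) \ln[1 - G(x_i'\beta)]\}$$
I Only the first sum (over i with $d_i = 1$) depends on δ. Since G is strictly
increasing, this sum increases in δ and approaches its supremum 0 only
for δ → ∞.
⇒ There is no (finite) MLE of δ!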

Perfect prediction, cont.

I The same problem may also arise in the following cases:
. yi = 0 whenever di = 0,
. yi = 1 whenever di = 0, or
. yi = 0 whenever di = 1.
I The smaller the number of observations with di = 1, the more likely
the problem of perfect prediction occurs.
I In the extreme case with di = 1 for just one observation, perfect
prediction must even occur.
I To get rid of the problem, one should exclude the dummy variable
from the regressors.
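
The perfect prediction problem can be made concrete with a small sketch. It assumes a logit model with an intercept and the dummy only, evaluated on a hypothetical toy sample in which every observation with d = 1 has y = 1. With the intercept held at its MLE from the d = 0 group, the log-likelihood keeps increasing in δ without attaining its supremum.

```python
import math

def log_sigmoid(z):
    """Numerically stable ln(Lambda(z)) for the logistic cdf Lambda."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

# Hypothetical toy sample of (d_i, y_i): y_i = 1 whenever d_i = 1.
data = [(1, 1), (1, 1), (1, 1), (0, 1), (0, 0), (0, 0), (0, 1), (0, 0)]

def loglik(beta0, delta):
    """Logit log-likelihood with index beta0 + delta * d_i."""
    total = 0.0
    for d, y in data:
        z = beta0 + delta * d
        # ln(1 - Lambda(z)) = ln(Lambda(-z))
        total += y * log_sigmoid(z) + (1 - y) * log_sigmoid(-z)
    return total

# Intercept MLE from the d = 0 group: 2 successes in 5 trials, logit(2/5).
beta0_hat = math.log(2.0 / 3.0)

deltas = [0.0, 2.0, 5.0, 10.0, 20.0]
values = [loglik(beta0_hat, delta) for delta in deltas]

# Supremum over delta: the d = 1 terms vanish, only the d = 0 part remains.
supremum = 2 * math.log(0.4) + 3 * math.log(0.6)
```

The list `values` is strictly increasing in δ and stays below `supremum`, which is exactly why an iterative optimizer drifts towards ever larger values of δ.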


I Probit Model: Estimation and information criteria (STATA output)
. probit childless time educ white sibs if age>39
Iteration 0: log likelihood = -2126.8908

Probit regression Number of obs = 5150

LR chi2(4) = 39.56
Prob > chi2 = 0.0000
Log likelihood = -2107.1116 Pseudo R2 = 0.0093
-----------------------------------------------------------------------
childless | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------------+---------------------------------------------------------
time | .0027483 .0025405 1.08 0.279 -.002231 .0077275
educ | .0314184 .0071296 4.41 0.000 .0174446 .0453923
white | .0625978 .0626362 1.00 0.318 -.0601669 .1853624
sibs |-.0117455 .0071229 -1.65 0.099 -.0257061 .0022152
_cons | -1.503 .1157881 -12.98 0.000 -1.729941 -1.27606
. estat ic
-----------------------------------------------------------------
Model | Obs ll(null) ll(model) df AIC BIC
-------------+---------------------------------------------------------------
. | 5150 -2126.891 -2107.112 5 4224.223 4256.957
-----------------------------------------------------------------------------

Logit model: Estimation and IC

. logit childless time educ white sibs if age>39

Logistic regression Number of obs = 5150

LR chi2(4) = 41.86
Prob > chi2 = 0.0000
Log likelihood = -2105.9611 Pseudo R2 = 0.0098
-----------------------------------------------------------------------
childless | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------------+---------------------------------------------------------
time | .0049797 .00467 1.07 0.286 -.0041733 .0141328
educ | .0630456 .0136222 4.63 0.000 .0363466 .089744
white | .1287142 .1181644 1.09 0.276 -.1028837 .3603122
sibs |-.0210833 .0134808 -1.56 0.118 -.0475053 .0053387
_cons |-2.676407 .2251076 -11.89 0.000 -3.11761 -2.235204
. estat ic
-----------------------------------------------------------------------------
Model | Obs ll(null) ll(model) df AIC BIC
-------------+---------------------------------------------------------------
. | 5150 -2126.891 -2105.961 5 4221.922 4254.656
-----------------------------------------------------------------------------
→ See R code for further analysis results!

5.2 Binary response models | 5.2.4 Model diagnostics

Covariate patterns
I Two observations share the same covariate pattern if their regressors
are identical.
I Statistical information in the sample can be summarized by the
covariate patterns, the number of observations with that covariate
pattern, and the number of positive outcomes.
I For example, Stata calculates residuals and diagnostic statistics in
terms of covariate patterns.
I Assume: M covariate patterns; pattern j with $n_j$ observations ($\sum_j n_j = N$)
⇒ Number of positive outcomes with pattern j:
$$\tilde y_j := \sum_{i:\, x_i = x_j} y_i \sim \mathrm{Bin}(n_j, p_j)$$
⇒ Likelihood function:
$$\prod_{j=1}^{M} \binom{n_j}{\tilde y_j}\, p_j^{\tilde y_j} (1 - p_j)^{n_j - \tilde y_j}$$
⇒ Maximized log-likelihood function of current model:
$$\ln(\hat L_c) = \sum_{j=1}^{M} \left[ \ln\binom{n_j}{\tilde y_j} + \tilde y_j \ln(\hat p_j) + (n_j - \tilde y_j) \ln(1 - \hat p_j) \right],$$
where $\hat p_j = G(x_j'\hat\beta)$.

Pearson’s χ2 goodness-of-fit statistic

I Measure for discrepancy between the data and the (fitted) model:
$$\chi^2 = \sum_{j=1}^{M} \frac{(\tilde y_j - n_j \hat p_j)^2}{n_j \hat p_j (1 - \hat p_j)}$$
. Given the model holds, $\chi^2$ follows approximately a $\chi^2_{M-K}$
distribution (often inadequate if M is large).
. Asymptotic justification requires a fixed M and $n_j \to \infty$ (∀ j).
I Hosmer-Lemeshow goodness-of-fit $\chi^2$:
. Similar to Pearson, but instead of using M covariate patterns
as groups it uses quantiles of the predicted probabilities to
form a smaller number m of groups (e.g. m = 10).
. m groups lead to a statistic with an approximate $\chi^2_{m-2}$
distribution given the model holds.
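
As an illustration of how the Pearson statistic is assembled from covariate patterns, the following sketch groups a hypothetical raw sample by pattern and sums the squared Pearson residuals. The sample and the fitted probabilities in `p_hat` are invented for illustration; in an application they would come from $\hat p_j = G(x_j'\hat\beta)$.

```python
import math
from collections import defaultdict

# Hypothetical raw sample: (covariate pattern x_i, outcome y_i).
sample = ([((1, 0), 1)] * 3 + [((1, 0), 0)] * 7
          + [((1, 1), 1)] * 12 + [((1, 1), 0)] * 8
          + [((1, 2), 1)] * 4 + [((1, 2), 0)] * 1)

# Hypothetical fitted probabilities per covariate pattern.
p_hat = {(1, 0): 0.25, (1, 1): 0.5, (1, 2): 0.9}

# Aggregate: pattern -> [n_j, y_tilde_j]
counts = defaultdict(lambda: [0, 0])
for pattern, y in sample:
    counts[pattern][0] += 1
    counts[pattern][1] += y

def pearson_residual(n, y_tilde, p):
    """r_j = (y_tilde_j - n_j p_j) / sqrt(n_j p_j (1 - p_j))."""
    return (y_tilde - n * p) / math.sqrt(n * p * (1.0 - p))

residuals = {pattern: pearson_residual(n, y_tilde, p_hat[pattern])
             for pattern, (n, y_tilde) in counts.items()}

M = len(counts)                                # number of covariate patterns
N = sum(n for n, _ in counts.values())         # total number of observations
chi2 = sum(r * r for r in residuals.values())  # Pearson chi-square statistic
```

The statistic `chi2` would then be compared with a $\chi^2_{M-K}$ distribution, where K is the number of estimated parameters.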

I There is no direct generalization of $R^2$ to nonlinear models, since
the estimation procedures do not aim at maximizing the “fraction of
the explained variance”.
I Testing the overall goodness-of-fit corresponds to testing the null
hypothesis $H_0: \beta_2 = \dots = \beta_K = 0$.
I Likelihood ratio (LR) test ⇒ Test statistic:
$$LR = 2(\ln(L_U) - \ln(L_R)),$$
where $L_U = \max_{\beta} L(\beta) = L(\hat\beta)$ and $L_R = \max_{H_0} L(\beta)$.
I Under $H_0$:
$$LR \xrightarrow{\ d\ } \chi^2_{K-1} \quad (N \to \infty).$$
⇒ An asymptotic α-test rejects $H_0$ if $LR > \chi^2_{K-1;\,1-\alpha}$.

McFadden’s pseudo $R^2$
$$R_F^2 = 1 - \frac{\ln(L_U)}{\ln(L_R)}$$
I $\ln[L(\beta)]$ is a sum of log probabilities, $L_U \ge L_R$ ⇒
$$0 \ge \ln(L_U) \ge \ln(L_R) \ \Rightarrow\ 1 \ge \frac{\ln(L_U)}{\ln(L_R)} \ge 0 \ \Rightarrow\ 0 \le R_F^2 \le 1$$
I $R_F^2 = 0 \Leftrightarrow \hat\beta_2 = \dots = \hat\beta_K = 0$ (i.e. under $H_0$)
I $R_F^2 = 1 \Leftrightarrow L_U = 1$ ($\Leftrightarrow \ln(L_U) = 0$)
$\Leftrightarrow \hat p_i = y_i$ (∀i) (practically, not achievable for finite $\hat\beta$)
(i.e. model provides perfect prediction).
I But values between 0 and 1 have no natural interpretation!
Model selection
I Comparison of model candidates $m \in \mathcal{M}$ e.g. by
$$AIC(m) = -\frac{2}{N}\hat\ell_m + \frac{2|m|}{N},$$ where
. $\hat\ell_m$ is the maximized log-likelihood for model m
. $|m|$ denotes the dimension (number of parameters) of model m
⇒ Minimizing $AIC(m)$ over $m \in \mathcal{M}$ provides trade-off between
good model fit (small bias) and low model complexity (small
estimation error/variance)
. $AIC(m)$: approximately unbiased estimate of (twice the)
expected Kullback-Leibler discrepancy of model m
. NLM: $AIC(m) = \ln(\hat\sigma_m^2) + 2|m|/N$ ($\hat\sigma_m^2$: MLE of $\sigma^2$ under m)
. Min.-AIC-procedure is (under assumptions) asymptotically optimal
I $BIC(m)$ uses factor $\ln(N)$ instead of 2 as penalty for $|m|$
(under assumptions: consistent model selection procedure)
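
The fit measures reported in the Stata output of the childless example can be reproduced from the log-likelihood values alone. The sketch below is only an arithmetic check; it assumes that Stata's `estat ic` reports AIC and BIC without the 1/N scaling used in the definition above (i.e. $-2\hat\ell_m + 2|m|$ and $-2\hat\ell_m + \ln(N)|m|$), which does not change the ranking of models estimated on the same sample. The function and variable names are illustrative.

```python
import math

N = 5150                # number of observations (from the output)
K = 5                   # number of parameters incl. intercept (df column)
ll_null = -2126.8908    # log-likelihood of the intercept-only model
ll_model = {"probit": -2107.1116, "logit": -2105.9611}

def fit_measures(ll, ll0, k, n):
    """LR statistic, McFadden pseudo R^2, and unscaled AIC/BIC."""
    return {
        "LR": 2.0 * (ll - ll0),
        "pseudo_R2": 1.0 - ll / ll0,
        "AIC": -2.0 * ll + 2.0 * k,
        "BIC": -2.0 * ll + math.log(n) * k,
        # Per-observation version as defined on the slide:
        "AIC_scaled": (-2.0 * ll + 2.0 * k) / n,
    }

results = {name: fit_measures(ll, ll_null, K, N) for name, ll in ll_model.items()}

for name, res in results.items():
    print(name, {key: round(val, 4) for key, val in res.items()})
```

Both criteria are slightly smaller for the logit model, in line with its larger log-likelihood at the same number of parameters.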
Predictive quality
I Alternative model specifications may be compared by evaluating
their classification properties.
I The MLE $\hat\beta$ of the model parameter β provides an estimator
$\hat p_i = G(x_i'\hat\beta)$ of the probabilities of success $p_i = G(x_i'\beta)$ (for
choosing $y_i = 1$).
I This can be used to predict $y_i$ (the choice):
$$\hat y_i = \hat y_i(c) = \begin{cases} 1, & \text{if } \hat p_i > c \\ 0, & \text{else} \end{cases}$$

On choosing the cutoff
I In practice one often chooses the threshold (cutoff) c = 0.5.
I This cutoff choice may be regarded as reasonable if the outcomes 0
and 1 are equally likely to occur in the population, and if the costs
of incorrectly predicting 0 and 1 are approximately the same.
I However, this threshold level has the weakness that if most
outcomes are successes ($y_i = 1$), then it is very likely that for all
observations $\hat p_i > 0.5$ and thus $\hat y_i = 1$, leading to
$\sum_i (y_i - \hat y_i)^2 = N(1 - \bar y)$ as the number of wrong predictions.
I A similar argument holds if most outcomes are failures.

2 × 2 classification table
I Results can be summarized in a 2 × 2 classification table of the
predicted responses $\hat y_i$ against the observed responses $y_i$:

                            Actual value
                            y_i = 1    y_i = 0    Total
    Predicted   ŷ_i = 1     TP         FP         TP+FP
    outcome     ŷ_i = 0     FN         TN         FN+TN
                Total       TP+FN      FP+TN      N

I TP (True Positives), FP (False Positives), FN (False Negatives),
TN (True Negatives)
I This contingency table could also be given for observations from an
independent validation sample (out-of-sample) instead of using the
actual sample (in-sample).
The hit rate

I The hit rate is defined as fraction of correct predictions
$$\hat h = \hat h(c) = \frac{1}{N}\sum_{i=1}^{N} I(\hat y_i = y_i) = \frac{TP + TN}{N} = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat y_i)^2}{N}$$
and estimates the unconditional probability of correct classification
$h(c) = P(\hat y(c) = y)$.
I Instead of treating the cutoff as given, say c = 0.5, one could try to
find an “optimal” cutoff for the data set by evaluating different
cutoff values and minimizing the associated proportion of incorrectly
predicted outcomes or, equivalently, by maximizing the hit rate $\hat h(c)$.
I This approach seems reasonable when the data is a random sample
from the population of interest, and the costs of incorrectly
predicting 0 and 1 are the same.

Specificity and sensitivity

I Maximization of the hit rate (min. of the estimated unconditional
probability of misclassification) is not always meaningful.
I E.g., if the outcome 1 (“success”) is very rare, then one tends to
choose a very large c such that a failure is predicted for everyone.
⇒ Alternative: Consider the conditional probabilities of correct
classification given $y_i = 0$ and $y_i = 1$, respectively, i.e.
. Specificity: $h_0 = h_0(c) = P(\hat y(c) = 0|y = 0)$,
. Sensitivity: $h_1 = h_1(c) = P(\hat y(c) = 1|y = 1)$.
. There is a close relation to the errors of type I and II for tests.
I $h_0$ and $h_1$ can be estimated by the proportion of correct predictions
separately for the outcomes $y_i = 0$ and $y_i = 1$, resp. [proportion of
actual negatives (positives) which are correctly identified]:
$$\hat h_0 = \hat h_0(c) = \frac{TN}{FP + TN}, \qquad \hat h_1 = \hat h_1(c) = \frac{TP}{TP + FN}.$$
Receiver operating characteristic (ROC)

I ROC curves measure the predictive power in two-class problems.
I There is usually a trade-off between $h_0$ and $h_1$: The higher $h_0$, the
lower $h_1$, and vice versa.
I Graphical representation by the ROC curve: Plot of sensitivity,
$h_1(c)$, versus (one minus specificity), $1 - h_0(c)$, as the cutoff is varied
(from 1 to 0). [In practice: Plot $\hat h_1(c)$ versus $1 - \hat h_0(c)$.]
I The curve starts at (0, 0), corresponding to c = 1, and continues
to (1, 1), corresponding to c = 0.
I A model with no predictive ability (complete randomness) yields a
straight (diagonal) line from (0, 0) to (1, 1).
I The curve in case of a perfect prediction goes straight from (0, 0)
via (0, 1) to (1, 1).
Area under the ROC curve (AUC)

I The predictive ability could be assessed by the area under the ROC
curve (AUC), which varies from 0.5 for “random prediction” to 1 for
“perfect prediction”.
I The greater the predictive power of a model, the more bowed the
curve, and hence the larger the area under the curve. This allows a
comparison of competing models.
I The ROC curve may be used to determine an optimal cutoff value
c, e.g. by minimizing the sum of the (conditional) error frequencies
or, equivalently, by maximizing h0 (c) + h1 (c).
Graphically, this point is obtained by shifting in parallel the diagonal
line to the northwest until it is just tangent to the ROC curve.
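
The classification measures and the ROC curve can be computed directly from the observed outcomes and the fitted probabilities. The following sketch uses a small hypothetical sample (the vectors `y` and `p_hat` are invented, with all fitted probabilities strictly between 0 and 1) and approximates the area under the empirical ROC curve by the trapezoidal rule.

```python
def classify(y, p_hat, c):
    """Return (TP, FP, FN, TN) for the rule y_hat = 1 if p_hat > c."""
    tp = fp = fn = tn = 0
    for yi, pi in zip(y, p_hat):
        y_pred = 1 if pi > c else 0
        if y_pred == 1 and yi == 1:
            tp += 1
        elif y_pred == 1:
            fp += 1
        elif yi == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def rates(y, p_hat, c):
    """Hit rate, specificity h0 and sensitivity h1 at cutoff c."""
    tp, fp, fn, tn = classify(y, p_hat, c)
    return (tp + tn) / len(y), tn / (fp + tn), tp / (tp + fn)

def roc_curve(y, p_hat):
    """Points (1 - h0(c), h1(c)) as the cutoff c decreases from 1 to 0."""
    cutoffs = [1.0] + sorted(set(p_hat), reverse=True) + [0.0]
    points = []
    for c in cutoffs:
        _, h0, h1 = rates(y, p_hat, c)
        points.append((1.0 - h0, h1))
    return points

def auc(points):
    """Area under the piecewise linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical outcomes and fitted probabilities.
y = [1, 1, 0, 1, 0, 0, 1, 0]
p_hat = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

hit, h0, h1 = rates(y, p_hat, 0.5)
points = roc_curve(y, p_hat)
area = auc(points)

# Cutoff maximizing h0(c) + h1(c) among the evaluated candidates.
best_c = max(sorted(set(p_hat)), key=lambda c: sum(rates(y, p_hat, c)[1:]))
```

For this toy sample the cutoff c = 0.5 gives a hit rate of 0.625 with specificity 0.5 and sensitivity 0.75, and the area under the curve is 0.75.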

Analysis of residuals
I MLE is inconsistent if the model is not correctly specified.
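I $y_i|x_i \sim \mathrm{Bernoulli}(p_i)$
⇒ $E(y_i|x_i) = p_i$ and $Var(y_i|x_i) = p_i(1 - p_i)$
⇒ Pearson (or “standardized”) residuals:
$$r_i = \frac{y_i - \hat p_i}{\sqrt{\hat p_i(1 - \hat p_i)}}$$
I Case of covariate patterns (as before)
. $\tilde y_j := \sum_{i:\, x_i = x_j} y_i \sim \mathrm{Bin}(n_j, p_j)$
⇒ $E(\tilde y_j|x_j) = n_j p_j$ and $V(\tilde y_j|x_j) = n_j p_j(1 - p_j)$
$$\Rightarrow\ r_j = \frac{\tilde y_j - n_j \hat p_j}{\sqrt{n_j \hat p_j(1 - \hat p_j)}} \quad\Rightarrow\quad \chi^2 = \sum_{j=1}^{M} r_j^2$$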
I Outlier detection: e.g. histogram or box plot of (Pearson) residuals

I Potential heteroscedasticity? Plot rj vs. explanatory variables.
I Heteroskedasticity problems may also be caused by a
misspecification of the function G or by omitting a relevant
explanatory variable.
I If potential omitted variables are known, a Wald, LR or LM test
could be derived to test for that type of misspecification.
I Similarly, testing for heteroskedasticity is usually based on the LM
test statistic (assuming some heteroskedastic model under the
alternative).
I Similar to the linear regression case, more sophisticated tools for
identifying outliers, high leverage and influential values, etc. are
available in the literature (e.g. Pregibon, 1981).

Logit model diagnostics: Basic building blocks

I For identification of outlying and influential observations we need:
Residuals and an appropriate projection matrix.
I Likelihood equations (FOC):
$$s(\beta) = \sum_{i=1}^{N} (y_i - p_i)x_i = X'e = 0 \quad\text{with } e_i = y_i - p_i,\ e = (e_1, \dots, e_N)'$$
I Hessian matrix of $\ell(\beta)$:
$$H = -\sum_{i=1}^{N} p_i(1 - p_i)x_i x_i' = -X'WX \quad\text{with } W = \mathrm{diag}[p_i(1 - p_i)]$$
I Newton-Raphson procedure:
$$\beta_{(n+1)} = \beta_{(n)} - H^{-1}(\beta_{(n)})\,s(\beta_{(n)})$$
⇒ With pseudo observations $z_{(n)} = X\beta_{(n)} + W_{(n)}^{-1} e_{(n)}$:
$$\beta_{(n+1)} = \beta_{(n)} + (X'W_{(n)}X)^{-1}X'e_{(n)} = (X'W_{(n)}X)^{-1}X'W_{(n)}z_{(n)}$$
I At convergence ($\beta_{(n+1)} = \beta_{(n)} = \hat\beta$):
$$\hat\beta = (X'WX)^{-1}X'Wz \quad (W, z \text{ are } W_{(n)}, z_{(n)} \text{ evaluated at } \hat\beta)$$
I Define projection $H = W^{1/2}X(X'WX)^{-1}X'W^{1/2}$ with diagonal
elements $h_{jj}$, corresponding to the hat matrix (P) in the linear model
(from here on, H denotes this projection and not the Hessian).
I Standardized Pearson residuals: $\tilde r_j = r_j/\sqrt{1 - h_{jj}}$
I Pregibon (1981) influence statistic:
$$\Delta\hat\beta_j = (\hat\beta - \hat\beta_{-j})'X'WX(\hat\beta - \hat\beta_{-j}) = \frac{r_j^2 h_{jj}}{(1 - h_{jj})^2} = \frac{\tilde r_j^2 h_{jj}}{1 - h_{jj}}$$
. corresponds to $K \cdot C_j$ ($C_j$: Cook’s distance) in the linear model
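
The building blocks above can be sketched for a logit model with an intercept and one regressor. The code iterates the weighted least squares form of the Newton-Raphson step and then computes the diagonal of the projection matrix, the standardized Pearson residuals and the influence statistic. The ten observations are hypothetical, and each observation is treated as its own covariate pattern ($n_j = 1$).

```python
import math

# Hypothetical sample; x_i = (1, x_i)' includes an intercept.
x = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

def fitted(b0, b1):
    """Fitted probabilities p_i and weights w_i = p_i (1 - p_i)."""
    p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    w = [pi * (1.0 - pi) for pi in p]
    return p, w

def xwx(w):
    """Entries (a, b, c) of X'WX = [[a, b], [b, c]] and its determinant."""
    a = sum(w)
    b = sum(wi * xi for wi, xi in zip(w, x))
    c = sum(wi * xi * xi for wi, xi in zip(w, x))
    return a, b, c, a * c - b * b

b0 = b1 = 0.0
for _ in range(30):
    p, w = fitted(b0, b1)
    # Pseudo observations z = X beta + W^{-1} e
    z = [b0 + b1 * xi + (yi - pi) / wi for xi, yi, pi, wi in zip(x, y, p, w)]
    a, b, c, det = xwx(w)
    r0 = sum(wi * zi for wi, zi in zip(w, z))               # first entry of X'Wz
    r1 = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))   # second entry of X'Wz
    # beta = (X'WX)^{-1} X'Wz via the explicit 2x2 inverse
    b0, b1 = (c * r0 - b * r1) / det, (a * r1 - b * r0) / det

p, w = fitted(b0, b1)
a, b, c, det = xwx(w)

# h_jj = w_j x_j' (X'WX)^{-1} x_j with x_j = (1, x_j)'
h = [wi * (c - 2.0 * b * xi + a * xi * xi) / det for wi, xi in zip(w, x)]
r = [(yi - pi) / math.sqrt(wi) for yi, pi, wi in zip(y, p, w)]
r_std = [ri / math.sqrt(1.0 - hi) for ri, hi in zip(r, h)]
delta_beta = [ri ** 2 * hi / (1.0 - hi) ** 2 for ri, hi in zip(r, h)]
```

Since the projection matrix has rank K, the values in `h` sum to K = 2 here, which is a convenient check of the computation.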


I Logit Model: Probability of being childless in dependence of time,
educ, sibs and white
I Definition of variables and estimation results: Subsections 5.2.1,
5.2.3
I N = 5150 observations, K = 5
I Hosmer-Lemeshow: $\chi^2(8) = 54.35$ (10 groups), p-value: 0
I Pearson $\chi^2$ goodness-of-fit statistic: $\chi^2(1603) = 1852.48$ (1608
covariate patterns), p-value: 0
I Pseudo $R^2$: $R_F^2 = 0.0098$
I Information criteria: AIC = 4221.922, BIC = 4254.656
I See R code for further model diagnostics.
5.3 Models for limited dependent variables

5.3.1 Introduction
Now, we consider regression problems in which
(i) the dependent variable of interest is not observed completely
(e.g. due to truncation or censoring),
or
(ii) the dependent variable is observed completely, but the chosen
sample is not representative for the population (e.g. because the
persons self-select into the sample).
⇒ The OLSE is inconsistent (even in case of a linear regression
function).

Truncation
I In case of truncation we do not have all observations - neither for

the dependent variable y nor for the regressors x = (x1 , . . . , xK )0 .
I Focus is here on case of continuous dependent variable.
I Example 1: For the analysis of income equations we only have data

(yi , xi ), for persons with low income (say with income yi < a for
some threshold a; truncation from above).

I Example 2: When the relation between car prices yi and the

characteristics of buyers xi (age, income etc.) is studied, we often
only have data (yi , xi ), where the car price yi is not below some
minimal car price (truncation from below); the sample does not
contain data about persons/households, for which all available cars
are too expensive.
I The truncation effect must be taken into account: If, e.g. in

Example 2, we want to forecast the potential interest for a cheaper
new car, most potential buyers are not contained in the sample.

Latent variable model for truncated data
$$y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim (0, \sigma^2) \text{ i.i.d.}$$
I Model holds for the population.
[In Example 2 the population are potential and actual buyers.]
I Then, in case of truncation from below at a:
$$y_i = y_i^* = x_i'\beta + \varepsilon_i \quad\text{if } y_i^* > a$$
$(x_i, y_i)$ are not observed if $y_i^* \le a$
⇒ Sample is drawn from restricted part of the population.

Graphical illustration of truncation effects

[Figure 9: Truncation from below, truncation point a = 0. Left panel: YSTAR vs. X; right panel: Y vs. X.]

Censoring
I In the case of censoring, information is lost only about the

dependent variable, but not about the regressors.
I Example 1: The sample contains persons of all income classes, for
which data on the relevant characteristics are available. However, for
confidentiality reasons all incomes above a certain threshold a (e.g.
a = 100,000 euros) are top-coded (for these we only know that their
income is ≥ a; censoring from above).

I Example 2: For the analysis of durable goods (e.g. cars,

refrigerators) we have data for all relevant explanatory variables, but
the purchases under a certain minimum value have the value 0.
I In contrast to truncated observations (where we have an information

loss), censored observations are available for the analysis.
I Standard model for the analysis: Tobit Model (Tobin, 1958).

Latent variable model for censored data
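$$y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim (0, \sigma^2) \text{ i.i.d.}$$
I Model holds for the population.
I Observe, in case of censoring from below at a,
$$y_i = y_i^* = x_i'\beta + \varepsilon_i \quad\text{if } y_i^* > a$$
$$y_i = a \quad\text{if } y_i^* \le a$$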

Graphical illustration of censoring effects

[Figure 10: Censoring from below at a = 0. Left panel: YSTAR vs. X; right panel: Y vs. X.]

5.3 Limited dependent variables | 5.3.2 Truncation and censoring
I Truncation is a property of the distribution, censoring is a property
of the sample.
(I) Truncated distributions

I Let Y be a random variable (RV) and a ∈ R some given threshold.
I Then the conditional distribution of Y given Y > a (or given
Y < a) is called distribution of Y truncated from below/from the
left (or from above/from the right) by a.
. Y is only observed above or below the threshold.

CDF of truncated distribution
I Let F denote the cdf of Y, i.e. $F(y) = P(Y < y)$.
⇒ The distribution of Y truncated from below by a has the cdf
$$F_a(y) = P(Y < y|Y > a) = \frac{P(a < Y < y)}{P(Y > a)} = \begin{cases} \dfrac{F(y) - P(Y \le a)}{1 - P(Y \le a)} = \dfrac{F(y) - F(a) - P(Y = a)}{1 - F(a) - P(Y = a)}, & \text{if } y > a \\ 0, & \text{if } y \le a \end{cases}$$

Density / probability mass function under truncation
I If Y is a discrete RV, then the truncated (from below by a)
distribution is characterized by the probability mass function
$$p_a(y) = P(Y = y|Y > a) = \begin{cases} \dfrac{P(Y = y)}{P(Y > a)}, & \text{if } y > a \\ 0, & \text{otherwise.} \end{cases}$$
I For a continuous RV Y with density f it holds $P(Y = a) = 0$ and
$f(y) = F'(y)$. Consequently, the density of the truncated (from
below by a) distribution is
$$f_a(y) = \begin{cases} \dfrac{f(y)}{1 - F(a)}, & \text{if } y > a \\ 0, & \text{otherwise.} \end{cases}$$

Example of truncated normal distribution

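I Let $Y \sim N(\mu, \sigma^2)$, and $\phi(z)$ and $\Phi(z)$ denote the standard
normal pdf and cdf, respectively.
$$\Rightarrow\ f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2\right\} = \frac{1}{\sigma}\phi\left(\frac{y - \mu}{\sigma}\right)$$
$$\text{and}\quad F(y) = \int_{-\infty}^{y} f(x)\,dx = \Phi\left(\frac{y - \mu}{\sigma}\right).$$
⇒ Density when truncating from below by a:
$$f_a(y) = \begin{cases} \dfrac{\frac{1}{\sigma}\phi\left(\frac{y - \mu}{\sigma}\right)}{1 - \Phi\left(\frac{a - \mu}{\sigma}\right)}, & \text{if } y > a \\ 0, & \text{otherwise.} \end{cases}$$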


[Figure 11: Comparison of densities of $N(\mu, \sigma^2)$ and the corresponding left-truncated distribution ($\mu = 3$, $\sigma = 1$, $a = 2$). Curves: truncated normal, normal; vertical axis: f(y).]

Moments of truncated normal distribution

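Theorem 5.1. Let $Y \sim N(\mu, \sigma^2)$. Then:
(a1) $E(Y|Y > a) = \mu + \sigma\lambda\left(\frac{a - \mu}{\sigma}\right)$,
where $\lambda(z) = \frac{\phi(z)}{1 - \Phi(z)}$ is the hazard function of $N(0, 1)$.
(a2) $E(Y|Y < a) = \mu + \sigma\tilde\lambda\left(\frac{a - \mu}{\sigma}\right)$,
where $\tilde\lambda(z) = -\frac{\phi(z)}{\Phi(z)}$ is the negative “inverse Mills ratio”.
(b1) $V(Y|Y > a) = \sigma^2\left(1 - \delta\left(\frac{a - \mu}{\sigma}\right)\right)$,
where $\delta(z) = \lambda(z)(\lambda(z) - z)$.
(b2) $V(Y|Y < a) = \sigma^2\left(1 - \tilde\delta\left(\frac{a - \mu}{\sigma}\right)\right)$,
where $\tilde\delta(z) = \tilde\lambda(z)(\tilde\lambda(z) - z)$.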

Remarks
(i) We always have (∀z): $0 < \delta(z) < 1$ and $0 < \tilde\delta(z) < 1$.
(ii) $\lambda(-z) = \frac{\phi(-z)}{1 - \Phi(-z)} = \frac{\phi(z)}{\Phi(z)} = -\tilde\lambda(z)$.
(iii) Truncation reduces the variance.
(iv) Truncation from below (above) increases (reduces) the expectation.
(v) For a = 0 it follows:
$$E(Y|Y > 0) = \mu + \sigma\frac{\phi(\mu/\sigma)}{\Phi(\mu/\sigma)},$$
where $\lambda(-z) = -\tilde\lambda(z) = \frac{\phi(z)}{\Phi(z)}$ is the “inverse Mills ratio”.
(vi) (a2) follows from (a1), since $-Y \sim N(-\mu, \sigma^2)$ and
$E(Y|Y < a) = -E(-Y|-Y > -a)$.
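
Theorem 5.1 (a1) and (b1) can be checked numerically. The sketch below compares the closed-form moments with a midpoint-rule integration of the truncated density, using the parameter values of Figure 11 ($\mu = 3$, $\sigma = 1$, $a = 2$). The integration bounds and the number of grid points are ad hoc choices, and the function names are illustrative.

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def hazard(z):
    """lambda(z) = phi(z) / (1 - Phi(z))."""
    return phi(z) / (1.0 - Phi(z))

def delta(z):
    """delta(z) = lambda(z) (lambda(z) - z)."""
    lam = hazard(z)
    return lam * (lam - z)

def moments_formula(mu, sigma, a):
    """E(Y|Y>a) and V(Y|Y>a) according to Theorem 5.1 (a1), (b1)."""
    z = (a - mu) / sigma
    return mu + sigma * hazard(z), sigma ** 2 * (1.0 - delta(z))

def moments_numeric(mu, sigma, a, n=50_000):
    """Same moments by midpoint-rule integration of the truncated density."""
    upper = mu + 12.0 * sigma          # density is negligible beyond this point
    step = (upper - a) / n
    norm = 1.0 - Phi((a - mu) / sigma)
    m1 = m2 = 0.0
    for k in range(n):
        yk = a + (k + 0.5) * step
        dens = phi((yk - mu) / sigma) / (sigma * norm)
        m1 += yk * dens * step
        m2 += yk * yk * dens * step
    return m1, m2 - m1 * m1

mean_f, var_f = moments_formula(3.0, 1.0, 2.0)
mean_n, var_n = moments_numeric(3.0, 1.0, 2.0)
```

Both approaches agree up to the integration error, and they also confirm Remarks (iii) and (iv): the truncated mean exceeds $\mu$ and the truncated variance is below $\sigma^2$.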
(II) Censored data

I The description of censored data is often done using latent variables.
I Let $Y^*$ be a RV. Then the corresponding variable censored from
below/from left (or from above/from right) by a is given by
$$Y = \max\{Y^*, a\} = I(Y^* > a)Y^* + I(Y^* \le a)\,a = \begin{cases} Y^*, & \text{if } Y^* > a \\ a, & \text{if } Y^* \le a \end{cases}$$
$$\left(\text{or}\quad Y = \min\{Y^*, a\} = \begin{cases} Y^*, & \text{if } Y^* < a \\ a, & \text{else} \end{cases}\right)$$

Example
I Let Yi∗ denote the price that an individual i is willing to pay for a
good (e.g. a refrigerator).
I We observe Yi = Yi∗ if some lower threshold a is exceeded.

Otherwise, we observe Yi = a (if Yi∗ ≤ a).
I Typically a = 0, which is not a restriction, if we use a linear model

with intercept for Y ∗ .

Distribution of censored normal RV

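I If $Y^* \sim N(\mu, \sigma^2)$, then the distribution of $Y = \max\{Y^*, a\}$ is mixed
continuous-discrete.
I The value y = a is attained with positive probability:
$$P(Y = a) = P(Y^* \le a) = F_{Y^*}(a) = \Phi\left(\frac{a - \mu}{\sigma}\right).$$
I For y > a, the density of Y is:
$$f(y) = \frac{1}{\sigma}\phi\left(\frac{y - \mu}{\sigma}\right).$$
I Cdf of Y:
$$F(y) = P(Y < y) = \begin{cases} \Phi\left(\frac{y - \mu}{\sigma}\right), & \text{if } y > a \\ 0, & \text{otherwise.} \end{cases}$$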

[Figure 12: Distribution of $Y = \max\{Y^*, a\}$, where $Y^* \sim N(\mu, \sigma^2)$ (censored normally distributed RV), with $\mu = 3$, $\sigma = 1$, $a = 2$. Left: Density / probability mass $\Phi((a - \mu)/\sigma)$ at point a; Right: cdf.]

Moments of censored normal RVs

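Theorem 5.2. Let $Y^* \sim N(\mu, \sigma^2)$ and $Y = \max\{Y^*, a\}$ be the RV
censored from below by a. Then:
$$\text{(a)}\quad E(Y) = \Phi\left(\frac{a - \mu}{\sigma}\right) a + \left[1 - \Phi\left(\frac{a - \mu}{\sigma}\right)\right]\left[\mu + \sigma\lambda\left(\frac{a - \mu}{\sigma}\right)\right]$$
$$\text{(b)}\quad V(Y) = \sigma^2\left[1 - \Phi\left(\frac{a - \mu}{\sigma}\right)\right] \cdot \left[1 - \delta\left(\frac{a - \mu}{\sigma}\right) + \left(\frac{a - \mu}{\sigma} - \lambda\left(\frac{a - \mu}{\sigma}\right)\right)^2 \Phi\left(\frac{a - \mu}{\sigma}\right)\right]$$
I Remarks:
(i) $\lambda(z)$ and $\delta(z)$ are explained in Theorem 5.1.
(ii) For a = 0 it follows $E(Y) = \Phi(\mu/\sigma) \cdot \mu + \sigma\phi(\mu/\sigma)$.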
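
A quick Monte Carlo experiment offers a plausibility check of Theorem 5.2. The sketch draws from $N(\mu, \sigma^2)$, censors the draws from below at a, and compares the sample mean and variance with the closed-form expressions. The parameter values follow Figure 12; the seed and the number of draws are arbitrary choices.

```python
import math
import random

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def censored_moments(mu, sigma, a):
    """E(Y) and V(Y) for Y = max(Y*, a) according to Theorem 5.2."""
    alpha = (a - mu) / sigma
    prob_censored = Phi(alpha)
    lam = phi(alpha) / (1.0 - prob_censored)     # hazard lambda(alpha)
    dlt = lam * (lam - alpha)                    # delta(alpha)
    mean = prob_censored * a + (1.0 - prob_censored) * (mu + sigma * lam)
    var = sigma ** 2 * (1.0 - prob_censored) * (
        (1.0 - dlt) + (alpha - lam) ** 2 * prob_censored)
    return mean, var

mu, sigma, a = 3.0, 1.0, 2.0
mean_f, var_f = censored_moments(mu, sigma, a)

# Monte Carlo counterpart: censor simulated draws of the latent variable.
random.seed(1)
draws = [max(random.gauss(mu, sigma), a) for _ in range(200_000)]
mean_mc = sum(draws) / len(draws)
var_mc = sum((d - mean_mc) ** 2 for d in draws) / len(draws)

# Remark (ii): closed form for a = 0.
mean_a0, _ = censored_moments(mu, sigma, 0.0)
mean_a0_remark = Phi(mu / sigma) * mu + sigma * phi(mu / sigma)
```

The simulated moments agree with the formulas up to simulation error, and the special case a = 0 reproduces Remark (ii).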

5.3.3 The truncated regression model (truncated tobit model)
I Regression model without truncation:
$$y_i = x_i'\beta + \varepsilon_i, \quad i = 1, \dots, N$$
. Under the assumption $\varepsilon_i|x_i \sim N(0, \sigma^2)$ i.i.d. it follows:
$E(y_i|x_i) = x_i'\beta$ and
$V(y_i|x_i) = V(\varepsilon_i|x_i) = \sigma^2 = V(\varepsilon_i)$,
i.e.: $y_i|x_i \sim N(x_i'\beta, \sigma^2)$ independent.

I Truncated regression model:
We investigate the dependence of the conditional expectation of $y_i$
given $x_i$ under the condition $y_i > a$. It follows from Theorem 5.1:
$$E(y_i|x_i; y_i > a) = x_i'\beta + \sigma\lambda\left(\frac{a - x_i'\beta}{\sigma}\right),$$
$$V(y_i|x_i; y_i > a) = \sigma^2\left[1 - \delta\left(\frac{a - x_i'\beta}{\sigma}\right)\right].$$
⇒ Moments are shifted compared to the model without truncation.

I Truncation induces a nonlinear conditional expectation and
heteroscedasticity.

Marginal effects
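I Without truncation (i.e. in the latent variable model):
$$\frac{\partial E(y_i|x_i)}{\partial x_{ij}} = \beta_j$$
I Under truncation:
$$\frac{\partial E(y_i|x_i; y_i > a)}{\partial x_{ij}} = \beta_j + \sigma\frac{\partial}{\partial x_{ij}}\lambda\left(\frac{a - x_i'\beta}{\sigma}\right) = \beta_j + \sigma\delta\left(\frac{a - x_i'\beta}{\sigma}\right) \cdot \left(-\frac{\beta_j}{\sigma}\right) = \beta_j\left[1 - \delta\left(\frac{a - x_i'\beta}{\sigma}\right)\right]$$
I Here, we have used the following result for the derivative of $\lambda(z)$:
$$\lambda'(z) = \frac{\phi(z)}{1 - \Phi(z)} \cdot \left(\frac{\phi(z)}{1 - \Phi(z)} - z\right) = \delta(z).$$
I Truncation leads to a shrinking of $\beta_j$.
⇒ Correction of the truncation effect is necessary!
I For interpretation, we calculate average values of these effects (over
the individuals).
I The relative effect of the j-th and the k-th explanatory variable
remains $\beta_j/\beta_k$, since the shrinking factors for $\beta_j$ and $\beta_k$ are equal.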
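
The marginal effect under truncation can be verified by a finite-difference check. The sketch differentiates the truncated conditional mean numerically with respect to $x_{ij}$ and compares the result with $\beta_j[1 - \delta(\cdot)]$. The values of $x_i'\beta$, $\beta_j$, $\sigma$ and $a$ are hypothetical.

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def hazard(z):
    """lambda(z) = phi(z) / (1 - Phi(z))."""
    return phi(z) / (1.0 - Phi(z))

def delta(z):
    """delta(z) = lambda(z) (lambda(z) - z)."""
    lam = hazard(z)
    return lam * (lam - z)

def truncated_mean(xb, sigma, a):
    """E(y | x, y > a) = x'beta + sigma * lambda((a - x'beta) / sigma)."""
    return xb + sigma * hazard((a - xb) / sigma)

# Hypothetical values: index x'beta, coefficient beta_j, sigma, threshold a.
xb, beta_j, sigma, a = 1.0, 0.8, 1.5, 0.0

# Changing x_ij by eps changes the index x'beta by beta_j * eps.
eps = 1e-6
numeric = (truncated_mean(xb + beta_j * eps, sigma, a)
           - truncated_mean(xb - beta_j * eps, sigma, a)) / (2.0 * eps)
analytic = beta_j * (1.0 - delta((a - xb) / sigma))
shrink_factor = analytic / beta_j
```

The shrinking factor lies strictly between 0 and 1 and does not depend on j, which is why ratios of coefficients keep their interpretation.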

Parameter estimation
I Without loss of generality (due to intercept): a = 0.
I The linear OLSE is inconsistent, since it doesn’t account for the
“truncation correction” and the bias doesn’t vanish asymptotically
(cf. lecture).
I The “convenient” estimation equation would be:
$$y_i = x_i'\beta + \sigma\lambda\left(\frac{-x_i'\beta}{\sigma}\right) + \varepsilon_i,$$
where $E(\varepsilon_i|x_i; y_i > 0) = 0$ holds.

⇒ Nonlinear OLS estimation:
$$\sum_{i=1}^{N}\left[y_i - x_i'\beta - \sigma\lambda\left(\frac{-x_i'\beta}{\sigma}\right)\right]^2 \to \min_{\beta, \sigma}$$
. For this nonlinear minimization problem, we can use e.g. the
Newton-Raphson method.
. This estimator is consistent, albeit not efficient, since
heteroscedasticity is ignored (cf. Theorem 5.1b). For this
reason, this estimator is rarely used in practice.
. Additionally necessary: correct specification of the conditional
expectation (normal distribution and homoscedastic errors).
. Since $\lambda(-x'\beta/\sigma)$ might be almost linear in $x'\beta$, we possibly
have a multicollinearity problem and thus imprecise estimators.

I Maximum likelihood estimation
. Data: $y_i, x_i$, given $y_i > a = 0$, $i = 1, \dots, N$.
. Likelihood function of the truncated distribution (notation
without the condition on $x_i$, i.e. $f(y_i)$ denotes the conditional
density of $y_i$ given $x_i$, and $f_a(y_i)$ accordingly):
$$L(\beta, \sigma^2) = \prod_{i=1}^{N} f_a(y_i) = \prod_{i=1}^{N} \frac{f(y_i)}{1 - F(0)} = \prod_{i=1}^{N} \frac{\frac{1}{\sigma}\phi\left(\frac{y_i - x_i'\beta}{\sigma}\right)}{1 - \Phi\left(\frac{-x_i'\beta}{\sigma}\right)} = \prod_{i=1}^{N} \frac{\phi\left(\frac{y_i - x_i'\beta}{\sigma}\right)}{\sigma\,\Phi\left(\frac{x_i'\beta}{\sigma}\right)}$$
⇒ Log-likelihood (note: $\phi(z) = \frac{1}{\sqrt{2\pi}}e^{-z^2/2}$):
$$\ell(\beta, \sigma^2) = \ln L(\beta, \sigma^2) = -\frac{N}{2}\ln(\sigma^2) - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - x_i'\beta)^2 - \sum_{i=1}^{N}\ln[\Phi(x_i'\beta/\sigma)]$$
. The maximization of $\ell(\beta, \sigma^2)$ is a nonlinear problem and
requires numerical methods.
. Consistency, asy. normality and asy. efficiency of the MLE
hold, as long as $\varepsilon_i|x_i \sim N(0, \sigma^2)$ i.i.d. can be assumed.

5.3.4 Regression with censored data (tobit model)
I Assumption: Structural equation for the latent variable:
$$y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i|x_i \sim N(0, \sigma^2) \text{ i.i.d.}, \quad i = 1, \dots, N$$
I Model with censoring from below (w.l.o.g. a = 0):
$$y_i = \begin{cases} 0, & \text{if } y_i^* \le 0 \\ x_i'\beta + \varepsilon_i, & \text{if } y_i^* > 0 \end{cases}$$
. The constant a does not affect the estimation of β, since it is
captured by the intercept.
. Alternative: individual thresholds $a_i$ (no problem if $a_i$ is known)
. Censoring from above is treated in an analogous way.
Marginal effects
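I Two parts ($\beta_j > 0$):
(i) $y_i = 0$, $x_{ij} \uparrow$ ⇒ $P(y_i > 0|x_i) \uparrow$ (obvious)
(ii) $y_i > 0$, $x_{ij} \uparrow$ ⇒ $E(y_i|x_i) \uparrow$
. Formally, Theorem 5.2 (Remark (ii)) provides with a = 0 and
$\mu = x_i'\beta$:
$$E(y_i|x_i) = \Phi\left(\frac{x_i'\beta}{\sigma}\right)x_i'\beta + \sigma\phi\left(\frac{x_i'\beta}{\sigma}\right)$$
$$\Rightarrow\ \frac{\partial E(y_i|x_i)}{\partial x_{ij}} = \Phi\left(\frac{x_i'\beta}{\sigma}\right) \cdot \beta_j \quad (> 0 \text{ for } \beta_j > 0).$$
I The difference to $\beta_j$ is small (large), if $x_i'\beta/\sigma$ is large (small).
I This is not surprising, since for large $x_i'\beta/\sigma$ also $y_i^*$ will be large, so
censoring occurs only rarely.
I On the other hand, if $x_i'\beta/\sigma$ is small, we mostly get $y_i = 0$ and
therefore large probabilities $P(y_i = 0|x_i)$.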
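
The dependence of the tobit marginal effect on $x_i'\beta/\sigma$ can be illustrated numerically. The sketch evaluates the censored mean at three hypothetical index values and compares a finite-difference derivative with $\Phi(x_i'\beta/\sigma)\,\beta_j$; the values of $\beta_j$ and $\sigma$ are arbitrary.

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def censored_mean(xb, sigma):
    """E(y | x) = Phi(x'beta / sigma) x'beta + sigma phi(x'beta / sigma) for a = 0."""
    return Phi(xb / sigma) * xb + sigma * phi(xb / sigma)

beta_j, sigma = 0.8, 1.0      # hypothetical coefficient and error std. dev.
eps = 1e-6

effects = []
for xb in (-2.0, 0.0, 2.0):   # hypothetical values of the index x'beta
    numeric = (censored_mean(xb + beta_j * eps, sigma)
               - censored_mean(xb - beta_j * eps, sigma)) / (2.0 * eps)
    analytic = Phi(xb / sigma) * beta_j
    effects.append((xb, numeric, analytic))
```

The marginal effect rises from about 2% of $\beta_j$ at $x_i'\beta/\sigma = -2$ to about 98% at $x_i'\beta/\sigma = 2$, in line with the interpretation above.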

Parameter estimation
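I Linear OLS is based on
$$y_i = x_i'\beta + \eta_i \quad (i = 1, \dots, N).$$
(a) OLS for observations with positive $y_i$ (truncated model)
⇒ $\eta_i = \varepsilon_i + \sigma\lambda_i$ with
$$\lambda_i = \lambda\left(\frac{-x_i'\beta}{\sigma}\right) = \frac{\phi(x_i'\beta/\sigma)}{\Phi(x_i'\beta/\sigma)}$$
⇒ $E(\eta_i|x_i; y_i > 0) \ne 0$
⇒ OLSE is biased and inconsistent!

(b) In case of censored data ($y_i = 0$ for $y_i^* \le 0$):
$$E(y_i|x_i) = \Phi_i\, x_i'\beta + \sigma\phi_i \quad\text{with } \Phi_i := \Phi\left(\frac{x_i'\beta}{\sigma}\right),\ \phi_i := \phi\left(\frac{x_i'\beta}{\sigma}\right)$$
⇒ $E(\eta_i|x_i) = (\Phi_i - 1)x_i'\beta + \sigma\phi_i \ne 0$ (in general)
⇒ $E(\hat\beta_{OLS}|X) = \beta + (X'X)^{-1}X'E(\eta|X)$ with $E(\eta|X) \ne 0$
⇒ Inconsistency of OLS estimator!

I Nonlinear OLS takes into account that
$E(\varepsilon_i|x_i) = 0$, where $\varepsilon_i := y_i - \Phi_i x_i'\beta - \sigma\phi_i$:
$$\sum_i (y_i - \Phi_i x_i'\beta - \sigma\phi_i)^2 \to \min_{\beta, \sigma}$$
⇒ Consistent estimation, that is not asymptotically efficient because of
heteroscedasticity!
⇒ Rarely applied.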

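I ML estimation:
. Let $d_i = I(y_i^* > 0) = I(y_i > 0)$ ⇒ $y_i = d_i y_i^*$
$I_0 = \{i \in \{1, \dots, N\} : d_i = 0\}$, $|I_0| =: N_0$
$I_1 = \{i \in \{1, \dots, N\} : d_i = 1\} = \{1, \dots, N\} \setminus I_0$,
$|I_1| =: N_1 = N - N_0$
$$\Rightarrow\ f(y_i|x_i) = \left[\Phi\left(-\frac{x_i'\beta}{\sigma}\right)\right]^{1 - d_i} \cdot \left[\frac{1}{\sigma}\phi\left(\frac{y_i - x_i'\beta}{\sigma}\right)\right]^{d_i}$$
⇒ Likelihood function:
$$L(\beta, \sigma^2) = \prod_{i=1}^{N} f(y_i|x_i) = \prod_{i=1}^{N}\left[1 - \Phi\left(\frac{x_i'\beta}{\sigma}\right)\right]^{1 - d_i} \cdot \left[\frac{1}{\sigma}\phi\left(\frac{y_i - x_i'\beta}{\sigma}\right)\right]^{d_i}$$
$$= \prod_{i \in I_0}\left[1 - \Phi\left(\frac{x_i'\beta}{\sigma}\right)\right] \cdot \prod_{i \in I_1}\frac{1}{\sigma}\phi\left(\frac{y_i - x_i'\beta}{\sigma}\right)$$
⇒ Log-likelihood function:
$$\ell(\beta, \sigma^2) = \sum_{i \in I_0}\ln\left[1 - \Phi\left(\frac{x_i'\beta}{\sigma}\right)\right] - \frac{1}{2}\sum_{i \in I_1}\left[\ln(2\pi) + \ln(\sigma^2) + \frac{(y_i - x_i'\beta)^2}{\sigma^2}\right]$$
$$= \sum_{i=1}^{N}(1 - d_i)\ln(1 - \Phi_i) - \sum_{i=1}^{N}\frac{d_i}{2}\left[\ln(2\pi) + \ln(\sigma^2) + \frac{(y_i - x_i'\beta)^2}{\sigma^2}\right]$$
⇒ Likelihood equations:
$$\frac{\partial\ell}{\partial\beta} = \frac{1}{\sigma^2}\sum_{i=1}^{N}\left[-(1 - d_i)\frac{\sigma\phi_i}{1 - \Phi_i} + d_i(y_i - x_i'\beta)\right]x_i = 0,$$
$$\frac{\partial\ell}{\partial\sigma^2} = \sum_{i=1}^{N}\left[(1 - d_i)\frac{\phi_i\, x_i'\beta}{2\sigma^3(1 - \Phi_i)} + \frac{d_i}{2}\left(\frac{(y_i - x_i'\beta)^2}{\sigma^4} - \frac{1}{\sigma^2}\right)\right] = 0.$$

I The maximization of $\ell(\beta, \sigma^2)$ (or solving of the likelihood equations
with respect to β and $\sigma^2$) yields the MLE and requires numerical
methods (e.g. Newton-Raphson).
I The log-likelihood function $\ell$ is globally concave as a function of
$\beta^* = \beta/\sigma$ and $\sigma^* = 1/\sigma$!
I The MLE is consistent, asymptotically normal and asymptotically
efficient, if
$\varepsilon_i|x_i \sim N(0, \sigma^2)$ i.i.d.,
i.e., if the model is correctly specified.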
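
The tobit log-likelihood and the likelihood equations can be cross-checked numerically. The sketch below implements both for a model with an intercept and one regressor and compares the analytic score with central finite differences of the log-likelihood. The simulated data and the evaluation point are hypothetical; an actual MLE would additionally require a numerical optimizer.

```python
import math
import random

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(b0, b1, sigma2, x, y):
    """Log-likelihood of the tobit model censored from below at 0."""
    sigma = math.sqrt(sigma2)
    total = 0.0
    for xi, yi in zip(x, y):
        xb = b0 + b1 * xi
        if yi > 0:   # uncensored observation (d_i = 1)
            total -= 0.5 * (math.log(2.0 * math.pi) + math.log(sigma2)
                            + (yi - xb) ** 2 / sigma2)
        else:        # censored observation (d_i = 0)
            total += math.log(1.0 - Phi(xb / sigma))
    return total

def tobit_score(b0, b1, sigma2, x, y):
    """Analytic derivatives with respect to (b0, b1, sigma^2)."""
    sigma = math.sqrt(sigma2)
    g0 = g1 = gs = 0.0
    for xi, yi in zip(x, y):
        xb = b0 + b1 * xi
        if yi > 0:
            u = (yi - xb) / sigma2
            gs += 0.5 * ((yi - xb) ** 2 / sigma2 ** 2 - 1.0 / sigma2)
        else:
            phi_i, Phi_i = phi(xb / sigma), Phi(xb / sigma)
            u = -sigma * phi_i / (1.0 - Phi_i) / sigma2
            gs += phi_i * xb / (2.0 * sigma ** 3 * (1.0 - Phi_i))
        g0 += u
        g1 += u * xi
    return g0, g1, gs

# Hypothetical censored sample: y = max(0.5 + 1.0 x + eps, 0).
random.seed(2)
x = [random.gauss(0.0, 1.0) for _ in range(200)]
y = [max(0.5 + 1.0 * xi + random.gauss(0.0, 1.0), 0.0) for xi in x]

theta = (0.3, 0.8, 1.5)       # arbitrary evaluation point (b0, b1, sigma^2)
analytic = tobit_score(*theta, x, y)

eps = 1e-6
numeric = []
for k in range(3):
    up = list(theta)
    down = list(theta)
    up[k] += eps
    down[k] -= eps
    numeric.append((tobit_loglik(*up, x, y) - tobit_loglik(*down, x, y)) / (2.0 * eps))
```

Agreement of `analytic` and `numeric` at an arbitrary parameter point indicates that the likelihood equations above are the derivatives of the stated log-likelihood.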

Two-step estimation (Heckman)
I Idea: censored data are a combination of a binary dependent
variable
$$\tilde y_i = \begin{cases} 0, & \text{if } y_i = 0, \\ 1, & \text{if } y_i > 0 \end{cases}$$
followed by a linear relation for the truncated sample ($y_i > 0$)
$$y_i = x_i'\beta + \varepsilon_i.$$

I Tobit Model ⇒
$$P(\tilde y_i = 1|x_i) = P(y_i > 0|x_i) = 1 - \Phi\left(-\frac{x_i'\beta}{\sigma}\right) = \Phi\left(\frac{x_i'\beta}{\sigma}\right)$$
$$P(\tilde y_i = 0|x_i) = P(y_i = 0|x_i) = 1 - \Phi\left(\frac{x_i'\beta}{\sigma}\right)$$
I 1st step:
Estimate $\gamma = \beta/\sigma$ using ML in the probit model:
$$P(\tilde y_i = 1|x_i) = \Phi(x_i'\gamma).$$
⇒ Consistent estimator $\hat\gamma$ of γ (i.e., the bias correction term is
estimated using a probit model).

I 2nd step: We consider the truncated sample with $y_i > 0$:
$$E(y_i|x_i, y_i > 0) = x_i'\beta + \sigma\frac{\phi(x_i'\gamma)}{\Phi(x_i'\gamma)}, \qquad \frac{\phi(x_i'\gamma)}{\Phi(x_i'\gamma)} = \lambda(-x_i'\gamma) =: \lambda_i$$
. Replace $\lambda_i$ by $\hat\lambda_i = \lambda(-x_i'\hat\gamma)$ and regress
$y_i$ on $x_i$ and $\hat\lambda_i$ (for $y_i > 0$).
⇒ Consistent estimators of β and σ (these are OLS estimators in
a linear model with estimated bias correction term $\hat\lambda_i$ as an
additional regressor for the truncated sample).

I The two-step method of Heckman is easier than MLE, but not
efficient.
I However, the resulting estimators can serve as initial values for an
iterative method to determine the MLE.

Consequences if assumptions are not satisfied

(i) $\varepsilon_i$ is heteroscedastic. ⇒ MLE is inconsistent.
$H_1: \sigma_i^2 = \exp(x_i'\alpha)$
⇒ $H_0: \alpha_2 = \dots = \alpha_K = 0$
. LM test requires only calculation of MLE under $H_0$:
$$\left.\frac{\partial\ell}{\partial\theta'}\right|_{\hat\theta_H}\ \underbrace{\left(-\left.\frac{\partial^2\ell}{\partial\theta\,\partial\theta'}\right|_{\hat\theta_H}\right)^{-1}}_{= I(\hat\theta_H)^{-1}}\ \left.\frac{\partial\ell}{\partial\theta}\right|_{\hat\theta_H}\ \overset{as.}{\sim}\ \chi^2_{K-1} \quad\text{under } H_0.$$
. LR test requires MLE under $H_0$ and $H_1$.

(ii) $\varepsilon_i$ is not normally distributed. ⇒ MLE is inconsistent.
I Test idea, e.g. in the i.i.d. case:
$y_i^* \sim N(\mu, \sigma^2)$ i.i.d.
. Estimate $P(y_i^* > 0)$
(a) by $\Phi(\hat\mu/\hat\sigma)$ (i.e. under the assumption of normality), and
(b) by the frequency of uncensored observations ($N_1/N$).
. Compare (a) and (b) using a Hausman statistic (Nelson test).

(iii) $\varepsilon_i$ is autocorrelated. ⇒ No problem for consistency!
I However, autocorrelation cannot be neglected for inferential
purposes.
. The construction of asymptotically justified tests or confidence
regions requires a consistent estimation of the covariance
matrix of the estimator to get robust (correct) standard errors.

Remarks
I Generalizations, e.g. twice censored Tobit model.
I Dependent variable is not completely observable:

partially continuous, partially censored.
I In contrast to the truncated model, the explanatory variables are

available for all observations.

I It is important to use all available information:

. 0/1 parts and continuous outcome.
⇒ Likelihood has two parts (probit part and linear OLS part).
I Both parts are determined by the same x 0 β.
. This restriction is not always plausible, since 0/1 decisions can
have different determinants (with different coefficients)
compared to the metric outcomes.
I Such relations between a binary (0/1) variable and a continuous
variable can generally be treated by
. assuming a parameter vector β(1) for the 0/1 decision, and β(2)
for the continuous part.
⇒ Hypothesis of identical parameters is testable:
H0 : β(1) = β(2) .

Empirical Example
To be added.

5.A Literature
I Amemiya, T. (1985). Advanced Econometrics. Harvard University Press.
Cambridge, Ma.
I Cameron, A. C. and Trivedi, P. K. (2005). Microeconometrics - Methods
and Applications. Cambridge University Press.
I Heij, C.; de Boer, P.; Franses, P. H.; Kloek, T. and van Dijk, H. K.
(2004). Econometric Methods with Applications in Business and
Economics. Oxford University Press.
I McCullagh, P. and Nelder, J. A. (1983). Generalized Linear Models.
Chapman and Hall, London.
I Nelson, F. D. (1977). Censored Regression Models with Unobserved,
Stochastic Censoring Threshold. Journal of Econometrics 6, 309-327.
I Nelson, F. D. (1981). A Test for Misspecification in the Censored Normal
Model. Econometrica 49, 1317-1329.

