
Nonparametric and Semiparametric Models

Wolfgang Härdle
Marlene Müller
Stefan Sperlich
Axel Werwatz

Institut für Statistik und Ökonometrie
CASE - Center for Applied Statistics and Economics
Humboldt-Universität zu Berlin

Outline
Introduction
Part I. Nonparametric Function Estimation

Histogram

Nonparametric Density Estimation

Nonparametric Regression
Part II. Generalized and Semiparametric Models

Semiparametric Regression (Overview)

Single Index Models

Generalized Partial Linear Models

Additive Models

Generalized Additive Models


Introduction 1-1
Chapter 1:
Introduction
SPM
Linear Regression
$$E(Y|X) = X_1\beta_1 + \ldots + X_d\beta_d = X^\top\beta$$
Example: Wage equation
Y = log wages,
X = schooling (measured in years), labor market experience (measured as AGE - SCHOOL - 6) and experience squared.
$$E(Y|SCHOOL, EXP) = \beta_1 + \beta_2\,SCHOOL + \beta_3\,EXP + \beta_4\,EXP^2$$
CPS 1985, n = 534

Coefficient estimates for the wage equation:
Dependent variable: log wages
Variable    Coefficient   S.E.     t-value
SCHOOL       0.0898       0.0083   10.788
EXP          0.0349       0.0056    6.185
EXP^2       -0.0005       0.0001   -4.307
constant     0.5202       0.1236    4.209
R^2 = 0.24, sample size n = 534
Table 1: Results from OLS estimation

[Figure: two panels, "Wage <-- Schooling" (x-axis: x1) and "Wage <-- Experience" (x-axis: x2)]
Figure 1: Wage-schooling profile and wage-experience profile
SPMcps85lin

[Figure: surface plot "Wage <-- Schooling, Experience" (axes: Schooling, Experience)]
Figure 2: Parametrically estimated regression function
SPMcps85lin

Linear regression
$$E(Y|SCHOOL, EXP) = const + \beta_1\,SCHOOL + \beta_2\,EXP$$
Nonparametric regression
$$E(Y|SCHOOL, EXP) = m(SCHOOL, EXP)$$
$m(\bullet)$ is a smooth function

[Figure: surface plot "Wage <-- Schooling, Experience" (axes: Schooling, Experience)]
Figure 3: Nonparametrically estimated regression function
SPMcps85reg

Example: Engel curve
Engel (1857)
"..., daß je ärmer eine Familie ist, einen desto größeren Anteil von der Gesammtausgabe muß zur Beschaffung der Nahrung aufgewendet werden."
(The poorer a family, the bigger the share of total expenditures that has to be used for food.)

[Figure: "Engel Curve" (x-axis: Net-income, y-axis: Food)]
Figure 4: Engel curve, U.K. Family Expenditure Survey 1973
SPMengelcurve2

Example: Income distribution changes?
data: UK net-income 1969-1983 (Family Expenditure Survey)
[Figure: two panels, "Lognormal Density Estimates" and "Kernel Density Estimates", year axis from 71 to 81]
Figure 5: Log-normal and kernel density estimates of net-income
SPMfesdensities

Example: Foreign exchange rates volatility
data: DM/US Dollar exchange rate $S_t$ (October 1992 until September 1993), returns $Y_t = \ln(S_t/S_{t-1})$
model:
$$Y_t = m(Y_{t-1}) + \sigma(Y_{t-1})\,\xi_t$$
where
$m(\bullet)$ is the mean function
$\sigma(\bullet)$ is the volatility function

[Figure: two panels, "FX Rate DM/US$ (10/92-09/93)" and "FX Returns DM/US$ (10/92-09/93)"]
Figure 6: DM/US Dollar exchange rate and returns
SPMfxrate

[Figure: "FX Volatility Function" (x-axis: y(t-1)*E-3)]
Figure 7: The estimated variance function for DM/US Dollar with uniform confidence bands
SPMfxvolatility

Nonparametric regression
$$E(Y|SCHOOL, EXP) = m(SCHOOL, EXP)$$
Semiparametric regression
$$E(Y|SCHOOL, EXP) = \alpha + g_1(SCHOOL) + g_2(EXP)$$
$g_1(\bullet)$, $g_2(\bullet)$ are smooth functions

Example: Wage equation
[Figure: two panels, "Wage <-- Schooling" and "Wage <-- Experience" (axes: X, Y)]
Figure 8: Additive model fit vs. parametric fit, wage-schooling (left) and wage-experience (right) profiles
SPMcps85add

[Figure: surface plot "Wage <-- Schooling, Experience" (axes: Schooling, Experience)]
Figure 9: Surface plot for the additive model
SPMcps85add

Example: Binary choice model
$$Y = \begin{cases} 1 & \text{if the person can imagine moving to the west,} \\ 0 & \text{otherwise.} \end{cases}$$
$$E(Y|x) = P(Y = 1|x) = G(\beta^\top x)$$
typically: logistic link function (logit model)
$$E(Y|x) = P(Y = 1|x) = \frac{1}{1 + \exp(-\beta^\top x)}$$

[Figure: "Logit Model" (x-axis: Index, y-axis: Link Function, Responses)]
Figure 10: Logit model for migration
SPMlogit

Heteroscedasticity problems
binary choice model with
$$\mathrm{Var}(\varepsilon) = \frac{1}{4}\left\{1 + (\beta^\top x)^2\right\}^2 \cdot \mathrm{Var}(u),$$
where u has a (standard) logistic distribution

[Figure: "True versus Logit Link" (x-axis: Index, y-axis: G(Index))]
Figure 11: The link function of a homoscedastic logit model (thin) vs. a heteroscedastic model (solid)
SPMtruelogit

Wrong link consequences
[Figure: "True Ratio vs. Sampling Distribution" (x-axis: True+Estimated Ratio, y-axis: Sampling Distribution)]
Figure 12: Sampling distribution of the ratio of estimated coefficients and the ratio's true value
SPMsimulogit

[Figure: "Single Index Model" (x-axis: Index, y-axis: Link Function, Responses)]
Figure 13: Single index model for migration
SPMsim

Example: Implied volatility surface (IVS)
The Black-Scholes (BS) formula values European call options:
$$C_t^{BS} = S_t\,\Phi(d_1) - K e^{-r\tau}\,\Phi(d_2)$$
$$d_1 = \frac{\log(S_t/K) + (r + \frac{1}{2}\sigma^2)\tau}{\sigma\sqrt{\tau}}, \qquad d_2 = d_1 - \sigma\sqrt{\tau}$$
$\Phi$: CDF of the standard normal distribution
$\sigma$: volatility
$r$: interest rate
$S_t$: asset price at time t
$K$: strike price
$\tau = T - t$: time to maturity with maturity date T

IVS Example. Cont.
IVS: $\hat\sigma$ solving
$$C_t - C_t^{BS}(S_t, K, r, \tau, \hat\sigma) = 0$$
Empirical IVS by using a nonparametric method (IVS2001.mov)
[Figure: implied volatility surface]
Figure 14: t = 20020516 ODAX

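The Black-Scholes formula and the implied-volatility equation can be sketched in a few lines. This is a minimal illustration, not the method used for the ODAX surface: the function names, the bisection bracket and the numerical inputs are assumptions made for the example.

```python
import math


def norm_cdf(x: float) -> float:
    """CDF of the standard normal distribution, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(s: float, k: float, r: float, tau: float, sigma: float) -> float:
    """Black-Scholes value of a European call with time to maturity tau."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return s * norm_cdf(d1) - k * math.exp(-r * tau) * norm_cdf(d2)


def implied_vol(price: float, s: float, k: float, r: float, tau: float) -> float:
    """Solve bs_call(s, k, r, tau, sigma) = price for sigma by bisection.

    The bracket [1e-6, 5.0] is an illustrative assumption; bisection works
    because the call value is increasing in sigma.
    """
    lo, hi = 1e-6, 5.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(s, k, r, tau, mid) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical inputs: at-the-money call, one year to maturity.
price = bs_call(100.0, 100.0, 0.05, 1.0, 0.2)
sigma_hat = implied_vol(price, 100.0, 100.0, 0.05, 1.0)
```

Repeating the inversion over a grid of strikes and maturities gives the raw points that a nonparametric smoother would turn into a surface.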
Example: Inner-German Migration
$$Y = \begin{cases} 1 & \text{if the person can imagine moving to the west,} \\ 0 & \text{otherwise.} \end{cases}$$
$$E(Y|X) = P(Y = 1|X) = G(\beta^\top X),$$
where $X$ is a vector of personal features and $f(X) = G(\beta^\top X)$ is the related parameter.
This leads to the log-likelihood
$$L(\beta) = \sum_{i=1}^{n}\left[ Y_i \log\frac{G(\beta^\top X_i)}{1 - G(\beta^\top X_i)} + \log\left\{1 - G(\beta^\top X_i)\right\}\right].$$

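The choice of the logistic link function $G(u) = (1 + e^{-u})^{-1}$ (logit model) corresponds to the canonical parametrization. Since $\log\{G(u)/(1 - G(u))\} = u$ and $\log\{1 - G(u)\} = -\log(1 + e^{u})$, the log-likelihood becomes
$$L(\beta) = \sum_{i=1}^{n}\left\{ Y_i\,\beta^\top X_i - \log\left(1 + e^{\beta^\top X_i}\right)\right\}.$$
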
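As a quick numerical check, the general binary-choice log-likelihood and its canonical logit form can be evaluated side by side. This is a sketch on made-up data; the function names, the toy observations and the coefficient vector are assumptions. With the logistic link, log{1 - G(u)} = -log(1 + e^u), so the two functions should agree.

```python
import math


def logistic(u: float) -> float:
    """Logistic link G(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def index(beta, x) -> float:
    """Linear index beta' x."""
    return sum(b * xj for b, xj in zip(beta, x))


def loglik_general(beta, xs, ys, link=logistic) -> float:
    """sum_i [ Y_i log{G/(1-G)} + log(1-G) ] for an arbitrary link G."""
    total = 0.0
    for x, y in zip(xs, ys):
        g = link(index(beta, x))
        total += y * math.log(g / (1.0 - g)) + math.log(1.0 - g)
    return total


def loglik_logit(beta, xs, ys) -> float:
    """Canonical form: sum_i [ Y_i beta'X_i - log(1 + exp(beta'X_i)) ]."""
    total = 0.0
    for x, y in zip(xs, ys):
        u = index(beta, x)
        total += y * u - math.log(1.0 + math.exp(u))
    return total


# Toy data (hypothetical): intercept plus one covariate.
xs = [(1.0, 0.5), (1.0, -1.2), (1.0, 2.0), (1.0, 0.1)]
ys = [1, 0, 1, 0]
beta = (0.3, 0.8)
general_value = loglik_general(beta, xs, ys)
logit_value = loglik_logit(beta, xs, ys)
```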
[Figure: "influence of household income" (x-axis: income, y-axis: m(income); curves: semiparametric fit, parametric fit)]
Figure 15: Estimated influence of income, f(t)

Figure 16: Logit model for migration

Summary: Introduction
Parametric models are fully determined up to a parameter (vector). They lead to an easy interpretation of the resulting fits.
Nonparametric models only impose qualitative restrictions like a smooth density function or a smooth regression function m. They may support known parametric models. They open the way for new models by their flexibility.
Semiparametric models combine parametric and nonparametric parts. This keeps the easy interpretation of the results, but gives more flexibility in some aspects of the model.

Chapter 2: Histogram

Histogram
$X$: continuous random variable with probability density function $f$
$X_1, X_2, \ldots, X_n$: i.i.d. observations
aim: estimate the distribution (density function) and display it graphically
[Figure: histogram (axes: X, Y)]
Figure 17: Histogram for exponential data
SPMsimulatedexponential

Example: Buffalo Snowfall Data
[Figure: density estimate of the snowfall data]
Figure 18: Winter snowfall (in inches) at Buffalo, New York, 1910/11 to 1972/73 (n = 63)
SPMbuffadata

Construction of the Histogram
(a) divide the range (support of $f$) into bins
$$B_j: [x_0 + (j-1)h,\ x_0 + jh), \quad j \in \mathbb{Z},$$
with origin $x_0$ and binwidth $h$
Example: bins $[x_0, x_0 + h)$, $[x_0 + h, x_0 + 2h)$, $[x_0 + 2h, x_0 + 3h)$, ...
(b) count the observations in each $B_j$ ($=: n_j$)
(c) normalize to 1: $f_j = \frac{n_j}{nh}$ (relative frequencies, divided by $h$)
(d) draw bars with height $f_j$ for bin $B_j$

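Steps (a) to (c) translate directly into code. The following is a minimal sketch with hypothetical function names and illustrative data values; step (d), the plotting, is left out.

```python
import math
from collections import Counter


def bin_index(x: float, x0: float, h: float) -> int:
    """Index j of the bin B_j = [x0 + (j-1)h, x0 + jh) that contains x."""
    return math.floor((x - x0) / h) + 1


def histogram_heights(data, x0: float, h: float) -> dict:
    """Return {j: f_j} with f_j = n_j / (n h) for every non-empty bin."""
    n = len(data)
    counts = Counter(bin_index(x, x0, h) for x in data)  # step (b)
    return {j: n_j / (n * h) for j, n_j in counts.items()}  # step (c)


def histogram_at(x: float, data, x0: float, h: float) -> float:
    """Histogram estimate at a point x (zero if the bin is empty)."""
    return histogram_heights(data, x0, h).get(bin_index(x, x0, h), 0.0)


# Illustrative data, origin x0 = 0 and binwidth h = 5.
data = [11, 12, 9, 8, 10, 14, 18, 5, 7, 3]
heights = histogram_heights(data, x0=0.0, h=5.0)
```

Because the heights are relative frequencies divided by h, the bar areas sum to one, which is what makes the histogram a density estimate.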
[Figure: "Buffalo Snowfall Data and Bin Grid"]
Figure 19: Buffalo snowfall data and bin grid, binwidth h = 10 and origin $x_0 = 20$
SPMbuffagrid

[Figure: "Histogram of Buffalo Snowfall Data"]
Figure 20: Binwidth h = 10 and origin $x_0 = 20$
SPMbuffahisto

Example: Histogram of stock returns
[Figure: "Histogram of Stock Returns" (x-axis: Data x, y-axis: Histogram)]
Figure 21: Stock returns, binwidth h = 0.02 and origin $x_0 = 0$
SPMstockreturnhisto

Motivation of the histogram
[Figure: "Construction of Histogram" (x-axis: Data x, y-axis: Density f)]
Figure 22: Approximation of the area under the density over an interval by erecting a rectangle over the interval

Mathematical Motivation
with $m_j$ = center of the $j$th bin, it holds that
$$P(X \in B_j) = \int_{B_j} f(u)\,du \approx f(m_j)\cdot h$$
approximation by
$$P(X \in B_j) \approx \frac{1}{n}\,\#\{X_i \in B_j\}$$
$$\Longrightarrow\quad \hat f_h(m_j) = \frac{1}{nh}\,\#\{X_i \in B_j\}$$

Histogram as a Maximum-Likelihood Estimator
likelihood function:
$$L = \prod_{i=1}^{n} f(x_i)$$
support of the random variable: $X \in [\underline{x}, \overline{x}]$
divide the support into $d$ bins
$$B_j = [\underline{x} + (j-1)h,\ \underline{x} + jh), \quad j \in \{1, \ldots, d\},$$
with length
$$h = \frac{\overline{x} - \underline{x}}{d}$$

if the density function is a step function
$$f(x) = \sum_{j=1}^{d} f_j\, I(x \in B_j)$$
with
$$\int_{\underline{x}}^{\overline{x}} f(x)\,dx = 1,$$
then it follows that
$$\sum_{j=1}^{d} f_j h = 1$$

the likelihood function is
$$L = \prod_{j=1}^{d} f_j^{\,n_j}$$
with constraint
$$\sum_{j=1}^{d} f_j h = 1$$
hence, maximize
$$\mathcal{L}(f_1, \ldots, f_d) = \sum_{j=1}^{d} n_j \log f_j + \lambda\left(1 - \sum_{j=1}^{d} h f_j\right)$$

the first order conditions are
$$n_j = \lambda h \hat f_j, \quad j = 1, \ldots, d \qquad (2.1)$$
$$\sum_{j=1}^{d} \hat f_j h = 1 \qquad (2.2)$$
Summing over the $d$ equations in (2.1) and using (2.2) gives $n = \lambda$; thus, it follows from (2.1) that
$$\hat f_j = \frac{n_j}{nh}$$

Example: Two-steps densities
consider
$$\mathcal{F} = \left\{ f : f = f_1 I(x \in B_1) + f_2 I(x \in B_2),\ \int f = 1,\ h = 1 \right\}$$
$$L = \prod_{i=1}^{n} f(x_i) = f_1^{\,n_1} f_2^{\,n_2}$$
$$\log L = n_1 \log f_1 + n_2 \log f_2 = n_1 \log f_1 + n_2 \log(1 - f_1)$$
first order condition:
$$\frac{n_1}{f_1} + \frac{n_2}{1 - f_1}(-1) = 0 \iff n_1(1 - f_1) = n_2 f_1 \iff n_1 = f_1(n_1 + n_2)$$
$$\hat f_1 = \frac{n_1}{n_1 + n_2} = \frac{n_1}{n} = \frac{\#\{X_i \in B_1\}}{n}.$$

(Asymptotic) Statistical Properties
for $X_i$ ($i = 1, \ldots, n$) $\sim f$ (density) we have
$$\hat f_h(x) = \frac{1}{nh}\sum_{i=1}^{n}\sum_{j} I(X_i \in B_j)\, I(x \in B_j)$$
consistency:
$$\hat f_h(x) \xrightarrow{P} f(x) \quad \text{if bias and variance} \to 0$$
notice that $\hat f_h$ is constant within a bin

Bias, Variance and MSE
for $f(x)$, $x$ fixed, $m_j$ = center of the bin in which $x$ falls
Bias
$$E\{\hat f_h(x)\} - f(x) \approx f'(m_j)\,(m_j - x)$$
Variance
$$\mathrm{Var}\{\hat f_h(x)\} \approx \frac{1}{nh} f(m_j)$$
Mean Squared Error (MSE)
$$\mathrm{MSE}\{\hat f_h(x)\} = E\{\hat f_h(x) - f(x)\}^2$$
$$\mathrm{MSE}(x) = \mathrm{Var} + \mathrm{Bias}^2 \approx \frac{1}{nh} f(m_j) + \{f'(m_j)\}^2 (m_j - x)^2$$

Expectation of the histogram
$$E\{\hat f_h(x)\} = \frac{1}{nh}\sum_{i=1}^{n} E\{I(X_i \in B_j)\}, \quad x \in B_j$$
$$= \frac{1}{nh}\, n\, E\{I(X_i \in B_j)\}$$
$$= \frac{1}{h}\int_{(j-1)h}^{jh} f(u)\,du$$
$$= \frac{1}{h}\cdot \text{bin probability}$$

Bias of the histogram
in general
$$E\{\hat f_h(x)\} - f(x) = \frac{1}{h}\int_{(j-1)h}^{jh} f(u)\,du - f(x) \neq 0$$
normally, we have
$$\frac{1}{h}\int_{(j-1)h}^{jh} f(u)\,du \neq f(x), \quad\text{i.e.}\quad \underbrace{\int_{(j-1)h}^{jh} f(u)\,du}_{\text{bin probability}} \neq \underbrace{h\cdot f(x)}_{\text{rectangle}}$$

[Figure: "Bias of the Histogram" (x-axis: x, y-axis: Normal distribution)]
Figure 23: Binwidth h = 2, x = 1 (bin center), true distribution is normal
SPMhistobias2

approximation of the bias using a Taylor expansion:
$$\frac{1}{h}\int_{B_j} \{f(u) - f(x)\}\,du = \frac{1}{h}\int_{B_j} \{f'(m_j)(u - x) + o(h)\}\,du$$
$$\approx \frac{1}{h}\int_{B_j} f'(m_j)(u - x)\,du = f'(m_j)\underbrace{(m_j - x)}_{O(h)}$$
using
$$f(u) - f(x) \approx \{f(m_j) + f'(m_j)(u - m_j)\} - \{f(m_j) + f'(m_j)(x - m_j)\}$$

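Variance of the histogram
$$\mathrm{Var}\{\hat f_h(x)\} = \frac{1}{n^2h^2}\, n\, \mathrm{Var}\{I(X_i \in B_j)\}$$
$$\overset{\text{(Bernoulli)}}{=} \frac{1}{n^2h^2}\, n \left(\int_{B_j} f(u)\,du\right)\left(1 - \int_{B_j} f(u)\,du\right)$$
$$= \frac{1}{nh}\left(\frac{1}{h}\int_{B_j} f(u)\,du\right)\{1 - O(h)\}$$
$$= \frac{1}{nh} f(x) + o\left(\frac{1}{nh}\right) \approx \frac{1}{nh} f(x)$$
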
Mean squared error (MSE)
$$\mathrm{MSE}\{\hat f_h(x)\} = E\{\hat f_h(x) - f(x)\}^2$$
$$\mathrm{MSE}\{\hat f_h(x)\} = \text{variance} + \text{bias}^2 \approx \frac{1}{nh} f(m_j) + \{f'(m_j)\}^2 (m_j - x)^2$$
if we let $h \to 0$ and $nh \to \infty$ with $|m_j - x| \le h/2$, we obtain
$$\mathrm{MSE}\{\hat f_h(x)\} \to 0$$

[Figure: "Bias^2, Variance and MSE" (x-axis: Bandwidth h)]
Figure 24: Squared bias (thin solid line), variance (dashed line) and MSE (thick line) for the histogram (at x = 0.5)
SPMhistmse

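MISE - a Global Quality Measure
$$\mathrm{MISE} = E\left[\int_{-\infty}^{\infty} \{\hat f_h(x) - f(x)\}^2\,dx\right] = \int_{-\infty}^{\infty} E\left[\{\hat f_h(x) - f(x)\}^2\right]dx = \int_{-\infty}^{\infty} \mathrm{MSE}\{\hat f_h(x)\}\,dx$$
two interpretations:
average global error
accumulated pointwise error
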
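AMISE
Using the approximation for the MSE
$$\mathrm{MSE}\{\hat f_h(x)\} \approx \frac{1}{nh} f(x) + f'(m_j)^2 (m_j - x)^2$$
we derive the Asymptotic MISE (AMISE)
$$\mathrm{MISE} = \int_{-\infty}^{\infty} \mathrm{MSE}\{\hat f_h(x)\}\,dx \approx \frac{1}{nh} + \frac{h^2}{12}\int_{-\infty}^{\infty} f'(x)^2\,dx = \mathrm{AMISE}$$
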
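derivation of the AMISE:
$$\mathrm{MISE} \approx \int_{-\infty}^{\infty}\left[\frac{1}{nh} f(x) + \sum_j I(x \in B_j)\, f'(m_j)^2 (m_j - x)^2\right]dx$$
$$= \frac{1}{nh}\int_{-\infty}^{\infty} f(x)\,dx + \sum_j f'(m_j)^2 \int_{B_j} (m_j - x)^2\,dx$$
$$= \frac{1}{nh} + \sum_j f'(m_j)^2 \int_{(j-1)h}^{jh}\left\{\left(j - \tfrac{1}{2}\right)h - x\right\}^2 dx$$
$$= \frac{1}{nh} + \sum_j f'(m_j)^2 \cdot \frac{2}{3}\left(\frac{h}{2}\right)^3$$
$$= \frac{1}{nh} + \frac{h^2}{12}\sum_j h\, f'(m_j)^2$$
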
Example: Another approximation
$$g(x) \overset{\text{def}}{=} f'(x)^2 \quad\text{and}\quad g(x) \approx g(m_j) + g'(m_j)(x - m_j)$$
it follows that
$$\int_{B_j} g(x)\,dx \approx \int_{B_j} g(m_j)\,dx + \int_{B_j} g'(m_j)(x - m_j)\,dx$$
$$= g(m_j)\int_{m_j - h/2}^{m_j + h/2} dx + g'(m_j)\int_{m_j - h/2}^{m_j + h/2} (x - m_j)\,dx = g(m_j)\cdot h$$
thus
$$f'(m_j)^2\, h \approx \int_{B_j} f'(x)^2\,dx$$

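using $f'(m_j)^2 h \approx \int_{B_j} f'(x)^2\,dx$:
$$\mathrm{MISE} \approx \frac{1}{nh} + \frac{h^2}{12}\sum_j \int_{B_j} f'(x)^2\,dx$$
from
$$\sum_j \int_{B_j} f'(x)^2\,dx = \int_{-\infty}^{\infty} f'(x)^2\,dx$$
we obtain
$$\mathrm{AMISE} = \frac{1}{nh} + \frac{h^2}{12}\int_{-\infty}^{\infty} f'(x)^2\,dx$$
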
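optimization of AMISE w.r.t. $h$:
$$h_{opt} = \left(\frac{6}{n\int_{-\infty}^{\infty} f'(x)^2\,dx}\right)^{1/3} = \text{constant}\cdot n^{-1/3} \sim n^{-1/3}$$
best convergence rate of MISE:
$$\mathrm{MISE}(h_{opt}) = \frac{1}{nh_{opt}} + \frac{h_{opt}^2}{12}\int_{-\infty}^{\infty} f'(x)^2\,dx + o(h_{opt}^2) + o\left(\frac{1}{nh_{opt}}\right)$$
$$= \text{constant}\cdot n^{-2/3} + o(\text{constant}\cdot n^{-2/3}) + o(\text{constant}\cdot n^{-2/3})$$
thus
$$\mathrm{MISE}(h_{opt}) \sim n^{-2/3}$$
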
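The optimization step can be written out as a short worked derivation. The abbreviation $\|f'\|_2^2 = \int f'(x)^2\,dx$ is used here; the algebra is a sketch of the standard first-order condition for the AMISE of the histogram.

```latex
\begin{align*}
\mathrm{AMISE}(h) &= \frac{1}{nh} + \frac{h^2}{12}\,\|f'\|_2^2, \\
\frac{\partial\,\mathrm{AMISE}}{\partial h} &= -\frac{1}{nh^2} + \frac{h}{6}\,\|f'\|_2^2 = 0
  \iff h_{opt}^3 = \frac{6}{n\,\|f'\|_2^2}, \\
\frac{h_{opt}^2}{12}\,\|f'\|_2^2 &= \frac{h_{opt}^3\,\|f'\|_2^2}{12\,h_{opt}} = \frac{1}{2\,n\,h_{opt}}, \\
\mathrm{AMISE}(h_{opt}) &= \frac{3}{2\,n\,h_{opt}}
  = \frac{3}{2}\left(\frac{\|f'\|_2^2}{6}\right)^{1/3} n^{-2/3}.
\end{align*}
```

At the optimum the squared-bias part is exactly half the variance part, which is why both terms share the rate $n^{-2/3}$.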
the solution $h_{opt}$ does not help us in practice, since we need
$$\int_{-\infty}^{\infty} f'(x)^2\,dx \qquad (2.3)$$
How can we obtain that value?
Possible approach:
plug-in method: choose a density and calculate (2.3)

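Example: Normal density
Assuming a normal density with variance $\sigma^2$, i.e.
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\},$$
we can calculate
$$\int_{-\infty}^{\infty} f'(x)^2\,dx = \frac{1}{2\sigma^5\sqrt{\pi}}\int_{-\infty}^{\infty} (x - \mu)^2\, \frac{1}{\sqrt{2\pi}\,\sigma/\sqrt{2}}\exp\left\{-\frac{(x - \mu)^2}{2\,\sigma^2/2}\right\}dx = \frac{1}{2\sigma^5\sqrt{\pi}}\cdot\frac{\sigma^2}{2} = \frac{1}{4\sigma^3\sqrt{\pi}}$$
The integral term is the formula for the variance of a normally distributed random variable with mean $\mu$ and variance $\sigma^2/2$. So the optimal binwidth is
$$h_{opt} = \left(\frac{6}{n\,\frac{1}{4\sigma^3\sqrt{\pi}}}\right)^{1/3} = (24\sqrt{\pi})^{1/3}\,\sigma\, n^{-1/3} \approx 3.5\,\sigma\, n^{-1/3}$$
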
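The normal-reference binwidth is easy to compute once a scale estimate is available. The sketch below replaces the unknown sigma by the sample standard deviation, which is an assumption of this example and not something the formula prescribes; the function and constant names are hypothetical.

```python
import math
import statistics

# (24 * sqrt(pi))^(1/3), the constant that is rounded to 3.5 in the formula.
NORMAL_REFERENCE_CONSTANT = (24.0 * math.sqrt(math.pi)) ** (1.0 / 3.0)


def normal_reference_binwidth(data) -> float:
    """h = (24 sqrt(pi))^(1/3) * sigma_hat * n^(-1/3).

    sigma_hat is the sample standard deviation (an illustrative choice).
    """
    n = len(data)
    sigma_hat = statistics.stdev(data)
    return NORMAL_REFERENCE_CONSTANT * sigma_hat * n ** (-1.0 / 3.0)


# Illustrative sample with n = 8, so n^(-1/3) = 1/2.
sample = [1, 2, 3, 4, 5, 6, 7, 8]
h = normal_reference_binwidth(sample)
```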
[Figure: four panels, "h=0.007, x0=0", "h=0.02, x0=0", "h=0.05, x0=0", "h=0.1, x0=0" (axes: x, fh)]
Figure 25: Four histograms for the stock returns data with binwidths h = 0.007, h = 0.02, h = 0.05, h = 0.1; origin $x_0 = 0$
SPMhisdiffbin

[Figure: four panels, "h=0.04, x0=0", "h=0.04, x0=0.01", "h=0.04, x0=0.02", "h=0.04, x0=0.03" (axes: x, fh)]
Figure 26: Four histograms for the stock returns data corresponding to different origins: $x_0 = 0$, $x_0 = 0.01$, $x_0 = 0.02$, $x_0 = 0.03$; binwidth h = 0.04
SPMhisdiffori

Drawbacks of the Histogram
constant over intervals, step function
results depend strongly on the origin
binwidth choice
(slow rate of convergence)
solution to the strong dependence on the origin $x_0$: ASH = Averaged Shifted Histograms

Average Shifted Histogram (ASH)
having a histogram with origin $x_0 = 0$, i.e.
$$B_j = [(j-1)h,\ jh), \quad j \in \mathbb{Z},$$
we generate $M - 1$ new bin grids by shifting:
$$B_{j,l} \overset{\text{def}}{=} [(j - 1 + l/M)h,\ (j + l/M)h), \quad l \in \{1, \ldots, M-1\}$$
the ASH is then obtained by
$$\hat f_h(x) = \frac{1}{M}\sum_{l=0}^{M-1}\frac{1}{nh}\sum_{i=1}^{n}\left\{\sum_j I(X_i \in B_{j,l})\, I(x \in B_{j,l})\right\}$$
$$= \frac{1}{n}\sum_{i=1}^{n}\left\{\frac{1}{Mh}\sum_{l=0}^{M-1}\sum_j I(X_i \in B_{j,l})\, I(x \in B_{j,l})\right\}.$$

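The ASH formula can be implemented by looping over the M shifted grids and counting, for each grid, the observations that share a bin with x. This is a minimal sketch with a hypothetical function name and illustrative data; it is not optimized.

```python
import math


def ash(x: float, data, h: float, m_shifts: int, x0: float = 0.0) -> float:
    """Averaged shifted histogram at x.

    Grid l has bins [(j - 1 + l/M) h, (j + l/M) h) shifted by x0.
    For each grid we count the X_i falling into the same bin as x.
    """
    n = len(data)
    total = 0
    for l in range(m_shifts):
        shift = x0 + l * h / m_shifts
        bin_of_x = math.floor((x - shift) / h)
        total += sum(1 for xi in data if math.floor((xi - shift) / h) == bin_of_x)
    return total / (m_shifts * n * h)


# Illustrative data with M = 5 shifts and binwidth h = 5.
data = [11, 12, 9, 8, 10, 14, 18, 5, 7, 3]
value_at_10 = ash(10.0, data, h=5.0, m_shifts=5)
```

For x = 10 the five grids contribute 4, 4, 5, 5 and 4 observations, so the estimate is 22 / (5 * 10 * 5) = 0.088.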
Example
$\{X_1, \ldots, X_{10}\} = \{11, 12, 9, 8, 10, 14, 18, 5, 7, 3\}$, we take M = 5 and h = 5:
[Figure: shifted bins $B_{4,0}$, $B_{3,1}$, $B_{3,2}$, $B_{3,3}$, $B_{3,4}$ over the data axis (ticks 3 to 21) with weights 1, 2, 3, 4, 5, 4, 3, 2, 1]
Figure 27: Bins for ASH

Example: Stock returns
[Figure: "Average Shifted Histogram" (x-axis: Stock Returns*E-2, y-axis: ASH)]
Figure 28: Average shifted histogram for stock returns data; average of 8 histograms (M = 8) with binwidth h = 0.04 and origin $x_0 = 0$.
SPMashstock

for comparison:
[Figure: "Ordinary Histogram" (x-axis: Stock Returns*E-2, y-axis: Histogram)]
Figure 29: Ordinary histogram for stock returns; binwidth h = 0.005; origin $x_0 = 0$
SPMhiststock

Summary: Histogram
The bias of a histogram is $E\{\hat f_h(x) - f(x)\} \approx f'(m_j)(m_j - x)$.
The variance of a histogram is $\mathrm{Var}\{\hat f_h(x)\} \approx \frac{1}{nh} f(x)$.
The asymptotic MISE is given by $\mathrm{AMISE}(\hat f_h) = \frac{1}{nh} + \frac{h^2}{12}\|f'\|_2^2$.
The optimal binwidth $h_{opt}$ that minimizes AMISE is
$$h_{opt} = \left(\frac{6}{n\|f'\|_2^2}\right)^{1/3} \sim n^{-1/3}.$$
The optimal binwidth $h_{opt}$ that minimizes AMISE for $N(\mu, \sigma^2)$ is
$$h_{opt} \approx 3.5\,\sigma\, n^{-1/3}.$$

Chapter 3: Kernel Density Estimation

Kernel Density Estimation
kernel density estimate (KDE)
$$\hat f_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)$$
with kernel function $K(u)$
Example
$$K(u) = \frac{1}{2}\, I(|u| \le 1)$$
$$\hat f_h(x) = \frac{1}{nh}\sum_{i=1}^{n}\frac{1}{2}\, I\left(\left|\frac{x - X_i}{h}\right| \le 1\right)$$

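A direct transcription of the estimator, with the uniform kernel as default, might look as follows. The function names and the toy data are assumptions for illustration.

```python
def uniform_kernel(u: float) -> float:
    """K(u) = 1/2 * I(|u| <= 1)."""
    return 0.5 if abs(u) <= 1.0 else 0.0


def kde(x: float, data, h: float, kernel=uniform_kernel) -> float:
    """Kernel density estimate (1 / (n h)) * sum_i K((x - X_i) / h)."""
    n = len(data)
    return sum(kernel((x - xi) / h) for xi in data) / (n * h)


# Toy data: two of the three observations lie within h = 1 of x = 1.5.
data = [1.0, 2.0, 4.0]
estimate = kde(1.5, data, h=1.0)
```

With the uniform kernel the estimate is simply the number of observations in [x - h, x + h] divided by 2nh, here 2 / (2 * 3 * 1) = 1/3.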
KDE as a generalization of the histogram
histogram: $(nh)^{-1}\,\#\{X_i \text{ in interval containing } x\}$
KDE with uniform kernel: $(n2h)^{-1}\,\#\{X_i \text{ in interval around } x\}$
$$\hat f_h(x) = \frac{1}{2hn}\sum_{i=1}^{n} I\left(\left|\frac{x - X_i}{h}\right| \le 1\right) = \frac{1}{2hn}\,\#\{X_i \text{ in interval around } x\}$$

Required properties of kernels
$K(\bullet)$ is a density function:
$$\int_{-\infty}^{\infty} K(u)\,du = 1 \quad\text{and}\quad K(u) \ge 0$$
$K(\bullet)$ is symmetric:
$$\int_{-\infty}^{\infty} u\,K(u)\,du = 0$$

Different Kernel Functions
Kernel          K(u)
Uniform         $\frac{1}{2}\, I(|u| \le 1)$
Triangle        $(1 - |u|)\, I(|u| \le 1)$
Epanechnikov    $\frac{3}{4}(1 - u^2)\, I(|u| \le 1)$
Quartic         $\frac{15}{16}(1 - u^2)^2\, I(|u| \le 1)$
Triweight       $\frac{35}{32}(1 - u^2)^3\, I(|u| \le 1)$
Gaussian        $\frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}u^2)$
Cosinus         $\frac{\pi}{4}\cos(\frac{\pi}{2}u)\, I(|u| \le 1)$
Table 2: Kernel functions

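The kernels in the table can be written as plain functions and checked numerically against the two required properties (unit mass, zero first moment). The midpoint-rule integrator, its grid and the dictionary name are assumptions of this sketch.

```python
import math


def _on_support(u: float) -> bool:
    return abs(u) <= 1.0


KERNELS = {
    "uniform": lambda u: 0.5 if _on_support(u) else 0.0,
    "triangle": lambda u: (1.0 - abs(u)) if _on_support(u) else 0.0,
    "epanechnikov": lambda u: 0.75 * (1.0 - u * u) if _on_support(u) else 0.0,
    "quartic": lambda u: 15.0 / 16.0 * (1.0 - u * u) ** 2 if _on_support(u) else 0.0,
    "triweight": lambda u: 35.0 / 32.0 * (1.0 - u * u) ** 3 if _on_support(u) else 0.0,
    "gaussian": lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi),
    "cosinus": lambda u: math.pi / 4.0 * math.cos(math.pi / 2.0 * u) if _on_support(u) else 0.0,
}


def integrate(g, a: float = -8.0, b: float = 8.0, m: int = 16_000) -> float:
    """Midpoint rule on [a, b]; the range is wide enough for the Gaussian."""
    step = (b - a) / m
    return step * sum(g(a + (i + 0.5) * step) for i in range(m))


def check_kernel(kernel):
    """Return (integral of K, integral of u * K) as a numerical check."""
    mass = integrate(kernel)
    first_moment = integrate(lambda u: u * kernel(u))
    return mass, first_moment


checks = {name: check_kernel(k) for name, k in KERNELS.items()}
```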
[Figure: four panels, "Uniform", "Triangle", "Epanechnikov", "Quartic" (axes: u, K(u))]
Figure 30: Some kernel functions: Uniform (top left), Triangle (bottom left), Epanechnikov (top right), Quartic (bottom right)
SPMkernel

Example: Construction of the KDE
consider the KDE using a Gaussian kernel
$$\hat f_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) = \frac{1}{nh}\sum_{i=1}^{n}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}u^2\right)$$
here we have
$$u = \frac{x - X_i}{h}$$

[Figure: "Construction of Kernel Density" (x-axis: Data x, y-axis: Density estimate fh, Weights Kh)]
Figure 31: Kernel density estimate as a sum of bumps
SPMkdeconstruct

[Figure: four panels, "h=0.004", "h=0.015", "h=0.008", "h=0.050" (axes: x, fh)]
Figure 32: Four kernel density estimates for the stock returns data with different bandwidths and Quartic kernel
SPMdensity

[Figure: two panels, "Quartic Kernel, h=0.018" and "Uniform Kernel, h=0.018" (axes: x, fh)]
Figure 33: Two kernel density estimates for the stock returns data with different kernels and h = 0.018
SPMdenquauni

[Figure: two panels, "Epanechnikov Kernel, h=0.18" and "Triweight Kernel, h=0.18" (axes: x, fh)]
Figure 34: Different continuous kernels, net-income data from the U.K. Family Expenditure Survey
SPMdenepatri

(Asymptotic) Statistical Properties
bias of the kernel density estimator
$$\mathrm{Bias}\{\hat f_h(x)\} = E\{\hat f_h(x)\} - f(x) = E\left\{\frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)\right\} - f(x) \approx \frac{h^2}{2} f''(x)\,\mu_2(K) \quad\text{for } h \to 0$$
where we define
$$\mu_2(K) \overset{\text{def}}{=} \int_{-\infty}^{\infty} u^2 K(u)\,du$$

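Derivation of the Bias Approximation
$$E\{\hat f_h(x)\} = \frac{1}{nh}\sum_{i=1}^{n} E\left\{K\left(\frac{x - X_i}{h}\right)\right\} = \frac{1}{h} E\left\{K\left(\frac{x - X}{h}\right)\right\}$$
$$= \int_{-\infty}^{\infty}\frac{1}{h} K\left(\frac{x - z}{h}\right) f(z)\,dz = \int_{-\infty}^{\infty} K(s)\, f(x + sh)\,ds$$
$$\overset{\text{(Taylor)}}{=} \int_{-\infty}^{\infty} K(s)\left\{f(x) + sh f'(x) + \frac{s^2h^2}{2} f''(x) + o(h^2)\right\}ds$$
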
we obtain
$$E\{\hat f_h(x)\} = \int_{-\infty}^{\infty} K(s)\left\{f(x) + sh f'(x) + \frac{s^2h^2}{2} f''(x) + o(h^2)\right\}ds$$
$$= f(x) + \frac{h^2}{2} f''(x)\int s^2 K(s)\,ds + o(h^2)$$
$$\approx f(x) + \frac{h^2}{2} f''(x)\,\mu_2(K) \quad\text{for } h \to 0$$
and thus
$$\mathrm{Bias}\{\hat f_h(x)\} = E\{\hat f_h(x)\} - f(x) \approx \frac{h^2}{2} f''(x)\,\mu_2(K)$$

[Figure: "Kernel Density Bias Effects" (x-axis: x, y-axis: Density f, Bias*E-2)]
Figure 35: $f(x)$ and approximation for $E\{\hat f_h(x)\}$
SPMkdebias

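Variance of the Kernel Density Estimator
$$\mathrm{Var}\{\hat f_h(x)\} = \mathrm{Var}\left\{\frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)\right\} \approx \frac{1}{nh}\|K\|_2^2\, f(x) \quad\text{for } nh \to \infty$$
where we define
$$\|K\|_2^2 \overset{\text{def}}{=} \int_{-\infty}^{\infty} K(u)^2\,du$$
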
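Derivation of the Variance Approximation
$$\mathrm{Var}\{\hat f_h(x)\} = \frac{1}{n^2}\mathrm{Var}\left\{\frac{1}{h}\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)\right\} = \frac{1}{n}\mathrm{Var}\left\{\frac{1}{h} K\left(\frac{x - X}{h}\right)\right\}$$
$$= \frac{1}{n} E\left\{\frac{1}{h^2} K\left(\frac{x - X}{h}\right)^2\right\} - \frac{1}{n}\left[E\left\{\frac{1}{h} K\left(\frac{x - X}{h}\right)\right\}\right]^2$$
$$= \frac{1}{nh}\int_{-\infty}^{\infty} K(s)^2 f(x + sh)\,ds - \frac{1}{n}\left[E\{\hat f_h(x)\}\right]^2$$
using the Taylor expansion of $f(x + sh)$ and
$$\int_{-\infty}^{\infty} K(s)^2\, s\,ds = 0$$
(from the symmetry of the kernel)
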
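we obtain
$$\mathrm{Var}\{\hat f_h(x)\} = \frac{1}{nh}\int_{-\infty}^{\infty} K(s)^2 f(x + sh)\,ds - \frac{1}{n}\left[E\{\hat f_h(x)\}\right]^2$$
$$= \frac{1}{nh}\|K\|_2^2\{f(x) + O(h)\} - \frac{1}{n}\{f(x) + O(h)\}^2$$
$$= \frac{1}{nh}\|K\|_2^2 f(x) + o\left(\frac{1}{nh}\right)$$
$$\approx \frac{1}{nh}\|K\|_2^2 f(x) \quad\text{for } nh \to \infty$$
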
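MSE of the kernel density estimator
$$\mathrm{MSE}\{\hat f_h(x)\} = E\left[\{\hat f_h(x) - f(x)\}^2\right] \approx \frac{h^4}{4} f''(x)^2\,\mu_2(K)^2 + \frac{1}{nh}\|K\|_2^2 f(x)$$
for $h \to 0$ and $nh \to \infty$
recall the definitions
$$\mu_2(K) \overset{\text{def}}{=} \int_{-\infty}^{\infty} u^2 K(u)\,du, \qquad \|K\|_2^2 \overset{\text{def}}{=} \int_{-\infty}^{\infty} K(u)^2\,du$$
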
[Figure: "Bias^2, Variance and MSE" (x-axis: Bandwidth h*E-2, y-axis: Bias^2, Variance and MSE*E-4)]
Figure 36: Bias$^2$ (thin solid), variance (thin dashed) and MSE (thick blue solid) for kernel density estimate as a function of the bandwidth
SPMkdemse

How to choose the bandwidth for the KDE?
find the bandwidth which minimizes the MISE
$$\mathrm{MISE}(\hat f_h) = E\left[\int_{-\infty}^{\infty}\{\hat f_h(x) - f(x)\}^2\,dx\right] = \int_{-\infty}^{\infty}\mathrm{MSE}\{\hat f_h(x)\}\,dx$$
$$= \frac{1}{nh}\|K\|_2^2 + \frac{h^4}{4}\mu_2(K)^2\|f''\|_2^2 + o\left(\frac{1}{nh}\right) + o(h^4)$$

we obtain for $h \to 0$ and $nh \to \infty$
$$\mathrm{AMISE}(\hat f_h) = \frac{1}{nh}\|K\|_2^2 + \frac{h^4}{4}\mu_2(K)^2\|f''\|_2^2$$
and thus
$$h_{opt} = \left(\frac{\|K\|_2^2}{\|f''\|_2^2\,\mu_2(K)^2\, n}\right)^{1/5} \sim n^{-1/5}$$

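The step from the AMISE to the optimal bandwidth is the usual first-order condition; written out as a worked derivation (a sketch, using the symbols defined for the AMISE):

```latex
\begin{align*}
\frac{\partial\,\mathrm{AMISE}(\hat f_h)}{\partial h}
  &= -\frac{1}{nh^2}\,\|K\|_2^2 + h^3\,\mu_2(K)^2\,\|f''\|_2^2 = 0
  \iff h_{opt}^5 = \frac{\|K\|_2^2}{\|f''\|_2^2\,\mu_2(K)^2\,n}, \\
\frac{h_{opt}^4}{4}\,\mu_2(K)^2\,\|f''\|_2^2
  &= \frac{h_{opt}^5\,\mu_2(K)^2\,\|f''\|_2^2}{4\,h_{opt}}
   = \frac{\|K\|_2^2}{4\,n\,h_{opt}}, \\
\mathrm{AMISE}(\hat f_{h_{opt}})
  &= \frac{5}{4}\,\frac{\|K\|_2^2}{n\,h_{opt}} \sim n^{-4/5}.
\end{align*}
```

At the optimum the squared-bias part equals one quarter of the variance part.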
Comparing MISE for histogram and KDE
histogram: $\mathrm{MISE}(h_{opt}) \sim n^{-2/3}$
kernel density estimator: $\mathrm{MISE}(h_{opt}) \sim n^{-4/5}$
the error converges to zero faster ($-4/5 < -2/3$) for the kernel density estimator
the kernel density estimator has better asymptotic properties

Methods for Bandwidth Selection
we are trying to estimate the optimal bandwidth $h_{opt}$, which depends on $\|f''\|_2^2$ (which is unknown)
we mainly distinguish
plug-in methods
jackknife methods
these methods can be compared via their asymptotic properties (the speed of their convergence to the true value of $h_{opt}$)

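Plug-in Methods
Silverman's rule of thumb
assuming a normal density function
$$f(x) = \sigma^{-1}\varphi\left(\frac{x - \mu}{\sigma}\right)$$
and using a Gaussian kernel $K(u) = \varphi(u)$ gives
$$\|f''\|_2^2 = \sigma^{-5}\int_{-\infty}^{\infty}\varphi''(z)^2\,dz = \sigma^{-5}\frac{3}{8\sqrt{\pi}} \approx 0.212\,\sigma^{-5}$$
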
thus
$$\|\hat f''\|_2^2 = 0.212\,\hat\sigma^{-5}$$
Silverman's rule-of-thumb bandwidth estimator is
$$\hat h_{ROT} = \left(\frac{\|\varphi\|_2^2}{\|\hat f''\|_2^2\,\mu_2^2(\varphi)\, n}\right)^{1/5} = 1.06\,\hat\sigma\, n^{-1/5} \qquad (3.1)$$

Better rule of thumb
Practical problem: a single outlier may cause an estimate of $\sigma$ that is too large.
A more robust estimator is derived by calculating the interquartile range $R = X_{[0.75n]} - X_{[0.25n]}$, while we still assume $X \sim N(\mu, \sigma^2)$ and $Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$.
Under these assumptions, inserting the quartiles of the standard normal distribution gives $R \approx \{0.67 - (-0.67)\}\,\sigma = 1.34\,\sigma$, which results in:
$$\hat\sigma = \frac{\hat R}{1.34} \qquad (3.2)$$

Plugging the relation (3.2) into (3.1) leads to
$$\hat h_0 = 1.06\,\frac{\hat R}{1.34}\, n^{-1/5} = 0.79\,\hat R\, n^{-1/5}$$
and combining this with Silverman's rule of thumb results in the better rule of thumb:
$$\hat h_{ROT} = 1.06\,\min\left(\hat\sigma, \frac{\hat R}{1.34}\right) n^{-1/5}.$$

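Both rules of thumb can be computed from a sample in a few lines. In this sketch the quartiles come from `statistics.quantiles`, whose interpolation convention differs slightly from the order statistics $X_{[0.25n]}$, $X_{[0.75n]}$; that choice, the function name and the sample with one gross outlier are assumptions of the example.

```python
import statistics


def rule_of_thumb_bandwidths(data):
    """Return (Silverman, better) rule-of-thumb bandwidths.

    Silverman: 1.06 * sigma_hat * n^(-1/5)
    Better:    1.06 * min(sigma_hat, R_hat / 1.34) * n^(-1/5)
    """
    n = len(data)
    sigma_hat = statistics.stdev(data)
    q1, _median, q3 = statistics.quantiles(data, n=4)
    r_hat = q3 - q1  # interquartile range
    silverman = 1.06 * sigma_hat * n ** (-0.2)
    better = 1.06 * min(sigma_hat, r_hat / 1.34) * n ** (-0.2)
    return silverman, better


# Illustrative sample: 1, ..., 19 plus one gross outlier.
sample = list(range(1, 20)) + [500]
silverman, better = rule_of_thumb_bandwidths(sample)
```

The outlier inflates the standard deviation but barely moves the quartiles, so the robust version returns a much smaller bandwidth for this sample.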
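Park and Marron's plug-in estimator
estimate
$$\hat f''_{\tilde h}(x) = \frac{1}{n\tilde h^3}\sum_{i=1}^{n} K''\left(\frac{x - X_i}{\tilde h}\right)$$
with bandwidth $\tilde h$ (calculated using the rule of thumb)
using a bias correction
$$\widehat{\|f''\|_2^2} = \|\hat f''_{\tilde h}\|_2^2 - \frac{1}{n\tilde h^5}\|K''\|_2^2$$
gives
$$\hat h_{PM} = \left(\frac{\|K\|_2^2}{\widehat{\|f''\|_2^2}\,\mu_2(K)^2\, n}\right)^{1/5}$$
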
Park and Marron showed that
$$\frac{\hat h_{PM} - h_{opt}}{h_{opt}} = O_p\left(n^{-4/13}\right)$$
The performance of $\hat h_{PM}$ in simulations is quite good. In practice though, for small bandwidths, $\widehat{\|f''\|_2^2}$ may be negative.

Cross Validation
instead of estimating $\|f''\|_2^2$, least squares cross-validation minimizes the integrated squared error
$$\mathrm{ISE}(\hat f_h) = \int\{\hat f_h(x) - f(x)\}^2\,dx = \|\hat f_h\|_2^2 - 2\int\hat f_h(x) f(x)\,dx + \|f\|_2^2$$
note: only the first and second terms depend on $h$

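the first term can be rewritten:
$$\|\hat f_h\|_2^2 = \int\left\{\frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)\right\}^2 dx$$
$$= \frac{1}{n^2h^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\int K\left(\frac{x - X_i}{h}\right) K\left(\frac{x - X_j}{h}\right)dx$$
$$= \frac{1}{n^2h}\sum_{i=1}^{n}\sum_{j=1}^{n}\int K(s)\, K\left(\frac{X_i - X_j}{h} + s\right)ds$$
$$= \frac{1}{n^2h}\sum_{i=1}^{n}\sum_{j=1}^{n}\int K(s)\, K\left(\frac{X_j - X_i}{h} - s\right)ds$$
$$= \frac{1}{n^2h}\sum_{i=1}^{n}\sum_{j=1}^{n} K \star K\left(\frac{X_j - X_i}{h}\right)$$
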
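Example
$X, Y \sim N(0, 1)$ with $f_X = \varphi$ and $f_Y = \varphi$
$Z = X + Y$
$\mathcal{L}(Z) = N(0, 2)$
$$f_Z(z) = \frac{1}{\sqrt{2\pi\cdot 2}}\exp\left\{-\frac{1}{2}\left(\frac{z}{\sqrt{2}}\right)^2\right\} = \frac{1}{\sqrt{2}}\,\varphi(z/\sqrt{2}) = \varphi \star \varphi\,(z)$$
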
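the second term can be estimated:
$$\int\hat f_h(x) f(x)\,dx = E\{\hat f_h(X)\} \approx \frac{1}{n}\sum_{i=1}^{n}\hat f_{h,-i}(X_i)$$
$$= \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h(n-1)}\sum_{j \ne i} K\left(\frac{X_i - X_j}{h}\right)$$
$$= \frac{1}{n(n-1)h}\sum_{i=1}^{n}\sum_{j \ne i} K\left(\frac{X_i - X_j}{h}\right)$$
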
cross-validation criterion
$$CV(h) = \|\hat f_h\|_2^2 - 2\int\hat f_h(x) f(x)\,dx$$
$$= \frac{1}{n^2h}\sum_{i=1}^{n}\sum_{j=1}^{n} K \star K\left(\frac{X_j - X_i}{h}\right) - \frac{2}{n(n-1)h}\sum_{i=1}^{n}\sum_{j=1, j \ne i}^{n} K\left(\frac{X_i - X_j}{h}\right)$$
we choose the bandwidth
$$\hat h_{CV} = \arg\min_h CV(h)$$

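Example: Gaussian kernel ($K = \varphi$)
$$CV(h) = \frac{1}{n^2h}\sum_{i=1}^{n}\sum_{j=1}^{n}\varphi \star \varphi\left(\frac{X_j - X_i}{h}\right) - \frac{2}{n(n-1)h}\sum_{i=1}^{n}\sum_{j=1, j \ne i}^{n}\varphi\left(\frac{X_i - X_j}{h}\right)$$
Applying the Gaussian kernel and $\varphi \star \varphi\,(u) = \frac{1}{\sqrt{2}}\varphi(u/\sqrt{2})$ leads to
$$CV(h) = \frac{1}{2n^2h\sqrt{\pi}}\sum_{i=1}^{n}\sum_{j=1}^{n}\exp\left\{-\frac{1}{2}\left(\frac{X_j - X_i}{h\sqrt{2}}\right)^2\right\} - \frac{2}{n(n-1)h}\frac{1}{\sqrt{2\pi}}\sum_{i=1}^{n}\sum_{j=1, j \ne i}^{n}\exp\left\{-\frac{1}{2}\left(\frac{X_i - X_j}{h}\right)^2\right\}.$$
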
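The Gaussian cross-validation criterion can be evaluated with two nested loops and minimized over a grid of candidate bandwidths. This is a sketch only: the grid search, the function names and the toy sample are assumptions, and the O(n^2) loop is not meant for large samples.

```python
import math


def cv_gaussian(h: float, data) -> float:
    """CV(h) for the Gaussian kernel.

    First term:  1/(2 n^2 h sqrt(pi)) * sum_{i,j} exp(-d_ij^2 / 4)
    Second term: 2/(n (n-1) h sqrt(2 pi)) * sum_{i != j} exp(-d_ij^2 / 2)
    with d_ij = (X_j - X_i) / h.
    """
    n = len(data)
    conv_sum = 0.0  # convolution kernel, all pairs including i == j
    loo_sum = 0.0   # leave-one-out part, pairs with i != j only
    for i in range(n):
        for j in range(n):
            d = (data[j] - data[i]) / h
            conv_sum += math.exp(-0.25 * d * d)
            if i != j:
                loo_sum += math.exp(-0.5 * d * d)
    first = conv_sum / (2.0 * n * n * h * math.sqrt(math.pi))
    second = 2.0 * loo_sum / (n * (n - 1) * h * math.sqrt(2.0 * math.pi))
    return first - second


def cv_bandwidth(data, grid):
    """Return the candidate bandwidth in grid with the smallest CV(h)."""
    return min(grid, key=lambda h: cv_gaussian(h, data))


# Toy sample and an illustrative grid of candidate bandwidths.
sample = [0.1, 0.4, 0.5, 0.9, 1.3, 1.4, 2.2, 2.5, 3.1, 3.3]
grid = [0.1 * k for k in range(1, 31)]
h_cv = cv_bandwidth(sample, grid)
```

In practice the criterion can have several local minima, so a grid search over a plausible range is a common, if crude, way to locate the minimizer.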
The CV bandwidth fulfills:
$$\frac{\mathrm{ISE}(\hat h_{CV})}{\min_h \mathrm{ISE}(h)} \xrightarrow{a.s.} 1$$
This is called asymptotic optimality. The result goes back to Stone (1984).

Fan and Marron (1992):
$$\sqrt{n}\left(\frac{\hat h - h_{opt}}{h_{opt}}\right) \xrightarrow{L} N(0, \sigma_f^2)$$
$$\sigma_f^2 = \frac{4}{25}\left(\frac{\int f^{(4)}(x)^2 f(x)\,dx}{\|f''\|_2^2} - 1\right)$$
The relative speed is $n^{-1/2}$!
$\sigma_f^2$ is the smallest possible variance.

An optimal bandwidth selector?
one best method does not exist!
even asymptotically optimal criteria may show bad behavior in small sample simulations
recommendation: use different selection methods and compare the resulting density estimates

How to choose the kernel?
or: how can we compare density estimators based on different kernels and bandwidths?
compare the AMISE for different kernels where the bandwidths are
$$h_{opt}^{(l)} = \left(\frac{\|K^{(l)}\|_2^2}{\mu_2(K^{(l)})^2}\right)^{1/5}\left(\frac{1}{\|f''\|_2^2\, n}\right)^{1/5} = \delta^{(l)} c$$
and $(l) \in \{\text{Uniform}, \text{Gaussian}, \ldots\}$

we have that
$$\mathrm{AMISE}(\hat f_{h_{opt}^{(l)}}) = \frac{1}{n h_{opt}^{(l)}}\|K^{(l)}\|_2^2 + \frac{(h_{opt}^{(l)})^4}{4}\|f''\|_2^2\,\mu_2^2(K^{(l)}),$$
which is equivalent to
$$\mathrm{AMISE}(\hat f_{h_{opt}^{(l)}}) = \left(\frac{1}{nc} + \frac{c^4}{4}\|f''\|_2^2\right) T(K^{(l)}) \qquad (3.3)$$
if we set
$$T(K^{(l)}) \overset{\text{def}}{=} \left\{(\|K^{(l)}\|_2^2)^4\,\mu_2(K^{(l)})^2\right\}^{1/5}$$

derivation of equation (3.3):
$$\mathrm{AMISE}(\hat f_{h_{opt}^{(l)}}) = \frac{1}{n h_{opt}^{(l)}}\|K^{(l)}\|_2^2 + \frac{(h_{opt}^{(l)})^4}{4}\|f''\|_2^2\,\mu_2^2(K^{(l)})$$
$$= \frac{1}{n\delta^{(l)} c}\|K^{(l)}\|_2^2 + \frac{(\delta^{(l)} c)^4}{4}\|f''\|_2^2\,\mu_2^2(K^{(l)})$$
$$= \frac{1}{nc}\|K^{(l)}\|_2^2\left\{\frac{\mu_2(K^{(l)})^2}{\|K^{(l)}\|_2^2}\right\}^{1/5} + \frac{c^4\|f''\|_2^2}{4}\left\{\frac{\|K^{(l)}\|_2^2}{\mu_2(K^{(l)})^2}\right\}^{4/5}\mu_2(K^{(l)})^2$$
$$= \left(\frac{1}{nc} + \frac{c^4}{4}\|f''\|_2^2\right)\left\{(\|K^{(l)}\|_2^2)^4\,\mu_2(K^{(l)})^2\right\}^{1/5}$$

AMISE with $h_{opt}^i$ consists of a kernel-independent term and a factor of proportionality which depends on $K^i$
$$\mathrm{AMISE}(\hat f_{h_{opt}^i}) = \left(\frac{1}{nc} + \frac{c^4}{4}\|f''\|_2^2\right) T(K^i) = \frac{5}{4}\left(\|f''\|_2^2\right)^{1/5} T(K^i)\, n^{-4/5}$$
where
$$c \overset{\text{def}}{=} \left(\frac{1}{\|f''\|_2^2\, n}\right)^{1/5}$$

using the fact that the AMISE is the product of a kernel-independent term and $T(K^i)$, we obtain
$$\frac{\mathrm{AMISE}(\hat f_{h_{opt}^i})}{\mathrm{AMISE}(\hat f_{h_{opt}^j})} = \frac{T(K^i)}{T(K^j)}$$
thus, to obtain a minimal AMISE, kernel $i$ should be preferred to kernel $j$ if
$$T(K^i) \le T(K^j)$$

Efficiency of Kernels
Kernel          T(K)      T(K)/T(K_Epa)
Uniform         0.3701    1.0602
Triangle        0.3531    1.0114
Epanechnikov    0.3491    1.0000
Quartic         0.3507    1.0049
Triweight       0.3699    1.0595
Gaussian        0.3633    1.0408
Cosinus         0.3494    1.0004
Table 3: Efficiency of kernels

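The constant T(K) = {(||K||_2^2)^4 mu_2(K)^2}^(1/5) can be reproduced numerically for kernels with support [-1, 1]. The sketch below uses a simple midpoint rule; the integrator, its grid size and the function names are assumptions, and only three of the tabulated kernels are checked.

```python
def integrate(g, a: float = -1.0, b: float = 1.0, m: int = 20_000) -> float:
    """Midpoint rule on [a, b]."""
    step = (b - a) / m
    return step * sum(g(a + (i + 0.5) * step) for i in range(m))


def efficiency_constant(kernel) -> float:
    """T(K) = {(||K||_2^2)^4 * mu_2(K)^2}^(1/5) for a kernel on [-1, 1]."""
    k_norm_sq = integrate(lambda u: kernel(u) ** 2)
    mu2 = integrate(lambda u: u * u * kernel(u))
    return (k_norm_sq ** 4 * mu2 ** 2) ** 0.2


def uniform(u: float) -> float:
    return 0.5


def epanechnikov(u: float) -> float:
    return 0.75 * (1.0 - u * u)


def quartic(u: float) -> float:
    return 15.0 / 16.0 * (1.0 - u * u) ** 2


t_uniform = efficiency_constant(uniform)
t_epa = efficiency_constant(epanechnikov)
t_quartic = efficiency_constant(quartic)
relative_uniform = t_uniform / t_epa
```

The ratio for the uniform kernel comes out close to the tabulated 1.0602, which illustrates how little the AMISE depends on the kernel.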
Results for choosing the kernel
since, using the optimal bandwidths for kernels $i$ and $j$, we have
$$\frac{\mathrm{AMISE}(\hat f_{h_{opt}^i})}{\mathrm{AMISE}(\hat f_{h_{opt}^j})} \approx 1$$
we can conclude:
the choice of the kernel has only a small influence on AMISE

Canonical Kernels
recall: we have derived for
$$h_{opt}^i = \delta^i c$$
that AMISE is the product of a kernel-independent term and a kernel-dependent scaling factor
this result also holds for non-optimal bandwidths $h \ne c$ if they fulfill
$$h^i = \delta^i h.$$
in that case:
$$\mathrm{AMISE}(\hat f_{h^i}) = \left(\frac{1}{nh} + \frac{h^4}{4}\|f''\|_2^2\right) T(K^i)$$

thus, if $h^i = \delta^i h$
$$\frac{\mathrm{AMISE}(\hat f_{h^i})}{\mathrm{AMISE}(\hat f_{h^j})} = \frac{T(K^i)}{T(K^j)}$$
the factors of proportionality $\delta^i$ which guarantee an equal amount of smoothing are the canonical bandwidths and it holds that
$$\frac{h^i}{h^j} = \frac{\delta^i}{\delta^j} \quad\Longrightarrow\quad h^i = h^j\,\frac{\delta^i}{\delta^j}$$

Canonical Bandwidths for Different Kernels
Kernel          $\delta^i$
Uniform         $(9/2)^{1/5} \approx 1.3510$
Epanechnikov    $15^{1/5} \approx 1.7188$
Quartic         $35^{1/5} \approx 2.0362$
Triweight       $(9450/143)^{1/5} \approx 2.3122$
Gaussian        $(1/(4\pi))^{1/10} \approx 0.7764$
Table 4: Canonical bandwidths $\delta^i$ for different kernels

[Figure: "The averaged squared error" (x-axis: h; curves labelled 1 and 2)]
Figure 37: Averaged squared error for Gaussian (label 1) and quartic (label 2) kernel smoothers. The ASE is smallest for the Gaussian kernel with $h_1 = 0.08$, for the quartic kernel with $h_2 = 0.21$.
SPMsimase

Adjusting the Bandwidth across Kernels
we want
$$\mathrm{AMISE}(h_A, K_A) \approx \mathrm{AMISE}(h_B, K_B)$$
which holds if
$$h_B = h_A\,\frac{\delta^B}{\delta^A}.$$
Example
kernel $K_A$: A = Uniform
kernel $K_B$: B = Gauss
$\delta^A = 1.35$, $h_A = 1.2$, $\delta^B = 0.77$ $\Longrightarrow$ $h_B = 0.68$

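The bandwidth conversion is a one-line rescaling by the ratio of canonical bandwidths. In the sketch below the dictionary and function names are hypothetical; the constants are the closed-form canonical bandwidths, with the Gaussian entry taken as (1/(4 pi))^(1/10).

```python
import math

# Canonical bandwidths delta for several kernels (closed forms).
CANONICAL_BANDWIDTH = {
    "uniform": (9.0 / 2.0) ** 0.2,
    "epanechnikov": 15.0 ** 0.2,
    "quartic": 35.0 ** 0.2,
    "triweight": (9450.0 / 143.0) ** 0.2,
    "gaussian": (1.0 / (4.0 * math.pi)) ** 0.1,
}


def convert_bandwidth(h_a: float, kernel_a: str, kernel_b: str) -> float:
    """h_B = h_A * delta_B / delta_A, giving a comparable amount of smoothing."""
    return h_a * CANONICAL_BANDWIDTH[kernel_b] / CANONICAL_BANDWIDTH[kernel_a]


# Uniform kernel with h_A = 1.2, converted to the Gaussian kernel.
h_gauss = convert_bandwidth(1.2, "uniform", "gaussian")
```

With unrounded constants the result is about 0.69; the value 0.68 in the example comes from using the rounded factors 1.35 and 0.77.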
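Confidence Intervals
let $h = h_n = c\,n^{-1/5}$, then the KDE has the following asymptotic normal distribution:
$$n^{2/5}\left\{\hat f_{h_n}(x) - f(x)\right\} \xrightarrow{L} N\Big(\underbrace{\frac{c^2}{2} f''(x)\,\mu_2(K)}_{b_x},\ \underbrace{\frac{1}{c} f(x)\|K\|_2^2}_{v_x^2}\Big),$$
consequently:
$$P\left(b_x - z_{1-\frac{\alpha}{2}} v_x \le n^{2/5}\left\{\hat f_h(x) - f(x)\right\} \le b_x + z_{1-\frac{\alpha}{2}} v_x\right) \approx 1 - \alpha$$
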
$1 - \alpha$ confidence interval for $f(x)$:
$$\left[\hat f_h(x) - \frac{h^2}{2} f''(x)\,\mu_2(K) - z_{1-\frac{\alpha}{2}}\sqrt{\frac{f(x)\|K\|_2^2}{nh}},\ \ \hat f_h(x) - \frac{h^2}{2} f''(x)\,\mu_2(K) + z_{1-\frac{\alpha}{2}}\sqrt{\frac{f(x)\|K\|_2^2}{nh}}\right]$$

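The interval involves the unknown f(x) and f''(x), so any implementation has to approximate them. The sketch below makes two simplifying assumptions that are not part of the formula: the bias term is dropped (as one would after undersmoothing) and f(x) in the variance is replaced by the estimate. It uses a Gaussian kernel, for which ||K||_2^2 = 1/(2 sqrt(pi)); all function names are hypothetical.

```python
import math
from statistics import NormalDist


def gaussian_kde(x: float, data, h: float) -> float:
    """KDE with Gaussian kernel."""
    n = len(data)
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return total / (n * h * math.sqrt(2.0 * math.pi))


def kde_confidence_interval(x: float, data, h: float, alpha: float = 0.05):
    """Approximate pointwise (1 - alpha) interval for f(x).

    Assumptions of this sketch: bias term neglected, f(x) replaced by the
    estimate in the variance term.
    """
    n = len(data)
    f_hat = gaussian_kde(x, data, h)
    k_norm_sq = 1.0 / (2.0 * math.sqrt(math.pi))  # ||K||_2^2, Gaussian kernel
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(f_hat * k_norm_sq / (n * h))
    return f_hat - half_width, f_hat + half_width


# Toy sample, evaluation point and bandwidth (all illustrative).
sample = [0.0, 0.5, 1.0, 1.5, 2.0]
lower, upper = kde_confidence_interval(1.0, sample, h=0.5)
```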
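Confidence Bands
$$\lim_{n\to\infty} P\Bigg(\hat f_h(x) - \left\{\frac{\hat f_h(x)\|K\|_2^2}{nh}\right\}^{1/2}\left\{\frac{z}{(2\delta\log n)^{1/2}} + d_n\right\}^{1/2} \le f(x)$$
$$\le \hat f_h(x) + \left\{\frac{\hat f_h(x)\|K\|_2^2}{nh}\right\}^{1/2}\left\{\frac{z}{(2\delta\log n)^{1/2}} + d_n\right\}^{1/2}\ \ \forall x\Bigg) = \exp\{-2\exp(-z)\} = 0.95$$
with
$$d_n = (2\delta\log n)^{1/2} + (2\delta\log n)^{-1/2}\log\left(\frac{1}{2\pi}\frac{\|K'\|_2}{\|K\|_2}\right)$$
$h = n^{-\delta}$, $\delta \in (\frac{1}{5}, \frac{1}{2})$
undersmoothing for bias reduction
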
Example: Average hourly earnings
[Figure: "Lognormal and Kernel Density" (x-axis: Wages, y-axis: Density*E-2)]
Figure 38: Average hourly earnings, parametric (log-normal, broken line) and nonparametric density estimate (Quartic kernel, h = 5, solid line), CPS 1985, n = 534
SPMcps85dist

[Figure: "Kernel Density and Confidence Bands" (x-axis: Wages, y-axis: Density*E-2)]
Figure 39: Average hourly earnings, parametric (log-normal, broken line) with nonparametric kernel density estimate (thick solid line) and confidence bands (thin solid lines) for the kernel estimate, CPS 1985, n = 534
SPMcps85dist

Kernel Density Estimation 3-57
Example: Difference between confidence intervals and bands
[Figure: "Confidence Bands and Intervals"; x-axis: Wages, y-axis: Density*E-2]
Figure 40: Average hourly earnings, 95% confidence bands (solid lines) and confidence intervals (broken lines) of the kernel density estimate, CPS 1985, n = 534
SPMcps85dist
SPM
Kernel Density Estimation 3-58
Summary: Univariate Density
The asymptotic MISE is given by
$$\mathrm{AMISE}(\hat f_h) = \frac{1}{nh}\|K\|_2^2 + \frac{h^4}{4}\{\mu_2(K)\}^2\|f''\|_2^2$$
The optimal bandwidth $h_{opt}$ that minimizes AMISE is
$$h_{opt} = \left(\frac{\|K\|_2^2}{\|f''\|_2^2\,\{\mu_2(K)\}^2\, n}\right)^{1/5} \sim n^{-1/5}$$
hence $\mathrm{MISE}(h_{opt}) \sim n^{-4/5}$
The AMISE optimal bandwidth according to Silverman's rule of thumb
$$\hat h_{opt} \approx 1.06\,\hat\sigma\, n^{-1/5}$$
The bandwidth can be estimated by Cross-Validation (based on the delete-one estimates) or by Plug-In Methods (based on some estimate of the term $\|f''\|_2^2$ in the expression for AMISE).
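The rule of thumb is easy to try out numerically. The sketch below assumes a Gaussian kernel and takes the sample standard deviation as $\hat\sigma$; the function names and the simulated data are illustrative and do not come from the slides.

```python
import numpy as np


def silverman_bandwidth(data: np.ndarray) -> float:
    """Rule-of-thumb bandwidth 1.06 * sigma_hat * n^(-1/5)."""
    data = np.asarray(data, dtype=float)
    return 1.06 * data.std(ddof=1) * data.size ** (-1 / 5)


def gaussian_kde(grid: np.ndarray, data: np.ndarray, h: float) -> np.ndarray:
    """Kernel density estimate (1/(n h)) * sum_i K((x - X_i)/h), Gaussian K."""
    u = (grid[:, None] - data[None, :]) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / h


rng = np.random.default_rng(0)
sample = rng.normal(size=500)           # simulated data, standard normal
h_rot = silverman_bandwidth(sample)
grid = np.linspace(-6, 6, 1201)
f_hat = gaussian_kde(grid, sample, h_rot)
```

For standard normal data with n = 500 the bandwidth should come out near 0.3, and the estimate should integrate to approximately one over the grid.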
SPM
Kernel Density Estimation 3-59
Multivariate Kernel Density Estimation
d-dimensional data and d-dimensional kernel
$$X = (X_1, \dots, X_d)^\top, \qquad \mathcal{K}: R^d \to R_+$$
multivariate kernel density estimator (simple)
$$\hat f_h(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h^d}\,\mathcal{K}\left(\frac{x - X_i}{h}\right)$$
each component is scaled equally.
multivariate kernel density estimator (more general)
$$\hat f_h(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h_1 \cdots h_d}\,\mathcal{K}\left(\frac{x_1 - X_{i1}}{h_1}, \dots, \frac{x_d - X_{id}}{h_d}\right)$$
SPM
Kernel Density Estimation 3-60
multivariate kernel density estimator (most general)
$$\hat f_H(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{\det(H)}\,\mathcal{K}\left\{H^{-1}(x - X_i)\right\}$$
where H is a (symmetric) bandwidth matrix.
Each component is scaled separately, correlation between components can be handled.
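A minimal sketch of the estimator with a full bandwidth matrix, assuming a multivariate standard Gaussian kernel for $\mathcal{K}$ (the slides do not fix the kernel at this point). The function name is hypothetical.

```python
import numpy as np


def kde_bandwidth_matrix(x: np.ndarray, data: np.ndarray, H: np.ndarray) -> float:
    """Evaluate f_hat_H(x) = (1/n) sum_i det(H)^-1 K(H^-1 (x - X_i)).

    x    : point of evaluation, shape (d,)
    data : observations X_1, ..., X_n, shape (n, d)
    H    : symmetric positive definite bandwidth matrix, shape (d, d)
    The kernel K is assumed to be the d-variate standard normal density.
    """
    d = data.shape[1]
    # Rows of `scaled` are H^-1 (x - X_i).
    scaled = np.linalg.solve(H, (x - data).T).T
    kernel = np.exp(-0.5 * np.sum(scaled**2, axis=1)) / (2 * np.pi) ** (d / 2)
    return float(kernel.mean() / np.linalg.det(H))


# One observation at the origin: the estimate at the origin is K(0) / det(H).
single = np.zeros((1, 2))
value_identity = kde_bandwidth_matrix(np.zeros(2), single, np.eye(2))
value_diagonal = kde_bandwidth_matrix(np.zeros(2), single, np.diag([1.0, 0.5]))
```

Halving the bandwidth in one direction halves det(H) and therefore doubles the height of the estimate at an observation, which is what the two values above illustrate.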
SPM
Kernel Density Estimation 3-61
How to get a Multivariate Kernel?
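$u = (u_1, \dots, u_d)^\top$
product kernel
$$\mathcal{K}(u) = K(u_1) \cdot \ldots \cdot K(u_d)$$
radially symmetric or spherical kernel
$$\mathcal{K}(u) = \frac{K(\|u\|)}{\int_{R^d} K(\|u\|)\,du} \quad\text{with}\quad \|u\| \stackrel{\mathrm{def}}{=} \sqrt{u^\top u}$$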
SPM
Kernel Density Estimation 3-62
Example: Bandwidth matrix $H = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
[Figure: two contour plots, "Product Kernel" and "Radial-symmetric Kernel"; axes x1 and x2]
Figure 41: Contours from bivariate product (left) and bivariate radially symmetric (right) Epanechnikov kernel
SPMkernelcontours
SPM
Kernel Density Estimation 3-63
Example: Bandwidth matrix $H = \begin{pmatrix} 1 & 0 \\ 0 & 0.5 \end{pmatrix}$
[Figure: two contour plots, "Product Kernel" and "Radial-symmetric Kernel"; axes x1 and x2]
Figure 42: Contours from bivariate product (left) and bivariate radially symmetric (right) Epanechnikov kernel
SPMkernelcontours
SPM
Kernel Density Estimation 3-64
Example: Bandwidth matrix $H = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}^{1/2}$
[Figure: two contour plots, "Product Kernel" and "Radial-symmetric Kernel"; axes x1 and x2]
Figure 43: Contours from bivariate product (left) and bivariate radially symmetric (right) Epanechnikov kernel
SPMkernelcontours
SPM
Kernel Density Estimation 3-65
Kernel properties
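$\mathcal{K}$ is a density function
$$\int_{R^d} \mathcal{K}(u)\,du = 1, \qquad \mathcal{K}(u) \ge 0$$
$\mathcal{K}$ is symmetric
$$\int_{R^d} u\,\mathcal{K}(u)\,du = 0_d$$
$\mathcal{K}$ has a second moment (matrix)
$$\int_{R^d} u u^\top \mathcal{K}(u)\,du = \mu_2(\mathcal{K})\, I_d,$$
where $I_d$ denotes the $d \times d$ identity matrix
$\mathcal{K}$ has a kernel norm
$$\|\mathcal{K}\|_2^2 = \int \mathcal{K}^2(u)\,du$$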
SPM
Kernel Density Estimation 3-66
Statistical Properties
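$$E\left\{\hat f_H(x)\right\} - f(x) \approx \frac12\,\mu_2(\mathcal{K})\,\mathrm{tr}\{H^\top \mathcal{H}_f(x) H\}$$
$$\mathrm{Var}\left\{\hat f_H(x)\right\} \approx \frac{1}{n\det(H)}\,\|\mathcal{K}\|_2^2\, f(x)$$
$$\mathrm{AMISE}(H) = \frac14\,\mu_2^2(\mathcal{K})\int_{R^d}\left[\mathrm{tr}\{H^\top \mathcal{H}_f(x) H\}\right]^2 dx + \frac{1}{n\det(H)}\,\|\mathcal{K}\|_2^2$$
derivation using the multivariate Taylor expansion
$$f(x + t) = f(x) + t^\top \nabla_f(x) + \frac12\, t^\top \mathcal{H}_f(x)\, t + o(t^\top t)$$
where $\nabla_f$ denotes the gradient and $\mathcal{H}_f$ the Hessian of $f$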
SPM
Kernel Density Estimation 3-67
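$$\begin{aligned}
E\left\{\hat f_H(x)\right\} &= E\left[\frac1n\sum_{i=1}^n \frac{1}{\det(H)}\,\mathcal{K}\left\{H^{-1}(x - X_i)\right\}\right] \\
&= \frac{1}{\det(H)}\int_{R^d} \mathcal{K}\left\{H^{-1}(u - x)\right\} f(u)\,du \\
&= \frac{1}{\det(H)}\int_{R^d} \mathcal{K}(s)\, f(Hs + x)\det(H)\,ds \\
&\overset{\text{(Taylor)}}{\approx} \int_{R^d} \mathcal{K}(s)\left\{f(x) + s^\top H^\top \nabla_f(x) + \frac12\, s^\top H^\top \mathcal{H}_f(x) H s\right\} ds \\
&= f(x) + \frac12\int_{R^d} \mathrm{tr}\left\{H^\top \mathcal{H}_f(x) H s s^\top\right\}\mathcal{K}(s)\,ds \\
&= f(x) + \frac12\,\mu_2(\mathcal{K})\,\mathrm{tr}\{H^\top \mathcal{H}_f(x) H\}
\end{aligned}$$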
SPM
Kernel Density Estimation 3-68
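$$\begin{aligned}
\mathrm{Var}\left\{\hat f_H(x)\right\} &= \frac{1}{n^2}\mathrm{Var}\left[\sum_{i=1}^n \frac{1}{\det(H)}\,\mathcal{K}\left\{H^{-1}(X_i - x)\right\}\right] \\
&= \frac1n\,\mathrm{Var}\left[\frac{1}{\det(H)}\,\mathcal{K}\left\{H^{-1}(X - x)\right\}\right] \\
&= \frac1n\,E\left[\frac{1}{\det(H)^2}\,\mathcal{K}\left\{H^{-1}(X - x)\right\}^2\right] - \frac1n\left[E\left\{\hat f_H(x)\right\}\right]^2 \\
&\approx \frac{1}{n\det(H)^2}\int_{R^d} \mathcal{K}\left\{H^{-1}(u - x)\right\}^2 f(u)\,du \\
&= \frac{1}{n\det(H)}\int_{R^d} \mathcal{K}(s)^2 f(x + Hs)\,ds \\
&\overset{\text{(Taylor)}}{\approx} \frac{1}{n\det(H)}\int_{R^d} \mathcal{K}(s)^2\left\{f(x) + s^\top H^\top \nabla_f(x)\right\} ds \\
&\approx \frac{1}{n\det(H)}\,\|\mathcal{K}\|_2^2\, f(x)
\end{aligned}$$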
SPM
Kernel Density Estimation 3-69
Univariate Expressions as a Special Case
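for d = 1 we obtain $H = h$, $\mathcal{K} = K$, $\mathcal{H}_f(x) = f''(x)$ and
$$E\left\{\hat f_H(x)\right\} - f(x) \approx \frac12\,\mu_2(\mathcal{K})\,\mathrm{tr}\{H^\top \mathcal{H}_f(x) H\} = \frac12\,\mu_2(K)\,h^2 f''(x)$$
$$\mathrm{Var}\left\{\hat f_H(x)\right\} \approx \frac{1}{n\det(H)}\,\|\mathcal{K}\|_2^2\, f(x) = \frac{1}{nh}\,\|K\|_2^2\, f(x)$$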
SPM
Kernel Density Estimation 3-70
Bandwidth selection
plug-in bandwidths (rule-of-thumb bandwidths)
resampling methods (cross-validation)
Rule-of-thumb bandwidth
find bandwidth matrix that minimizes
$$\mathrm{AMISE}(H) = \frac14\,\mu_2^2(\mathcal{K})\int_{R^d}\left[\mathrm{tr}\{H^\top \mathcal{H}_f(x) H\}\right]^2 dx + \frac{1}{n\det(H)}\,\|\mathcal{K}\|_2^2$$
SPM
Kernel Density Estimation 3-71
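reformulation of the first term of AMISE under the assumptions
$\mathcal{K}$ = multivariate Gaussian, $f \sim N_d(\mu, \Sigma)$.
Hence $\mu_2(\mathcal{K}) = 1$, $\|\mathcal{K}\|_2^2 = 2^{-d}\pi^{-d/2}$
$$\int_{R^d}\left[\mathrm{tr}\{H^\top \mathcal{H}_f(t) H\}\right]^2 dt = \frac{1}{2^{d+2}\pi^{d/2}\det(\Sigma)^{1/2}}\left[2\,\mathrm{tr}\{(H^\top\Sigma^{-1}H)^2\} + \{\mathrm{tr}(H^\top\Sigma^{-1}H)\}^2\right]$$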
SPM
Kernel Density Estimation 3-72
simple case $H = \mathrm{diag}(h_1, \dots, h_d)$
$$\hat h_{j,opt} = \left(\frac{4}{d+2}\right)^{1/(d+4)} n^{-1/(d+4)}\,\hat\sigma_j$$
note: for d = 1 we arrive at Silverman's rule of thumb $\hat h = 1.06\, n^{-1/5}\,\hat\sigma$
SPM
Kernel Density Estimation 3-73
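Scott's rule of thumb
Since the factor $\left(\frac{4}{d+2}\right)^{1/(d+4)} \in [0.924, 1.059]$, David Scott proposes:
$$\hat h_{j,opt} = n^{-1/(d+4)}\,\hat\sigma_j$$
David Scott on BBI:
General matrix rule of thumb
$$\hat H = n^{-1/(d+4)}\,\hat\Sigma^{1/2}$$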
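The three rules of thumb can be compared on simulated data. This is a sketch with made-up function names; it assumes that $\hat\sigma_j$ is the sample standard deviation of component j and that $\hat\Sigma^{1/2}$ is the symmetric square root of the sample covariance matrix.

```python
import numpy as np


def rule_of_thumb_bandwidths(data: np.ndarray) -> np.ndarray:
    """Diagonal rule: (4/(d+2))^(1/(d+4)) * n^(-1/(d+4)) * sigma_j."""
    n, d = data.shape
    factor = (4 / (d + 2)) ** (1 / (d + 4))
    return factor * n ** (-1 / (d + 4)) * data.std(axis=0, ddof=1)


def scott_bandwidths(data: np.ndarray) -> np.ndarray:
    """Scott's rule: n^(-1/(d+4)) * sigma_j, i.e. the factor is set to 1."""
    n, d = data.shape
    return n ** (-1 / (d + 4)) * data.std(axis=0, ddof=1)


def scott_bandwidth_matrix(data: np.ndarray) -> np.ndarray:
    """Matrix rule: n^(-1/(d+4)) * Sigma^(1/2), for d >= 2."""
    n, d = data.shape
    cov = np.cov(data, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    sqrt_cov = (eigvec * np.sqrt(eigval)) @ eigvec.T  # symmetric square root
    return n ** (-1 / (d + 4)) * sqrt_cov


rng = np.random.default_rng(1)
sample = rng.normal(size=(200, 2))
h_rot = rule_of_thumb_bandwidths(sample)
h_scott = scott_bandwidths(sample)
H_scott = scott_bandwidth_matrix(sample)
```

For d = 2 the factor $(4/(d+2))^{1/(d+4)}$ equals one, so both diagonal rules coincide there; for d = 1 the factor is about 1.06, which is Silverman's constant.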
SPM
Kernel Density Estimation 3-74
Cross-Validation
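$$\begin{aligned}
\mathrm{ISE}(H) &= \int \left\{\hat f_H(x) - f(x)\right\}^2 dx \\
&= \int \hat f_H^2(x)\,dx - 2\int \hat f_H(x) f(x)\,dx + \int f^2(x)\,dx
\end{aligned}$$
$$\widehat{E}\left\{\hat f_H(X)\right\} = \frac1n\sum_{i=1}^n \hat f_{H,-i}(X_i), \qquad \hat f_{H,-i}(x) = \frac{1}{n-1}\sum_{j=1,\, j\ne i}^n \mathcal{K}_H(X_j - x)$$
$$\mathrm{CV}(H) = \frac{1}{n^2\det(H)}\sum_{i=1}^n\sum_{j=1}^n \mathcal{K}\star\mathcal{K}\left\{H^{-1}(X_j - X_i)\right\} - \frac{2}{n(n-1)}\sum_{i=1}^n\sum_{j=1,\, j\ne i}^n \mathcal{K}_H(X_j - X_i)$$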
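The cross-validation criterion can be written down directly in the univariate case. The sketch below makes several assumptions that are not on the slide: d = 1 with H = h, a Gaussian kernel (whose convolution with itself is the N(0, 2) density), and the usual convention $K_h(u) = K(u/h)/h$. The function name is hypothetical.

```python
import numpy as np


def cv_score(data: np.ndarray, h: float) -> float:
    """Least-squares cross-validation criterion CV(h) for a Gaussian kernel."""
    n = data.size
    u = (data[:, None] - data[None, :]) / h
    # K * K for the standard normal kernel is the N(0, 2) density.
    convolved = np.exp(-0.25 * u**2) / np.sqrt(4 * np.pi)
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    first_term = convolved.sum() / (n**2 * h)
    # Leave-one-out part: sum over all pairs with j != i.
    off_diagonal = kernel.sum() - np.trace(kernel)
    second_term = 2 * off_diagonal / (n * (n - 1) * h)
    return float(first_term - second_term)


rng = np.random.default_rng(2)
sample = rng.normal(size=200)
bandwidths = np.linspace(0.05, 1.5, 30)
scores = np.array([cv_score(sample, h) for h in bandwidths])
h_cv = bandwidths[scores.argmin()]  # grid minimiser of CV(h)
```

The term $\int f^2(x)\,dx$ is dropped because it does not depend on the bandwidth, so CV(h) can be negative.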
SPM
Kernel Density Estimation 3-75
Canonical bandwidths
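Let j denote a kernel out of {multivariate Gaussian, multivariate Quartic, . . .}, then the canonical bandwidth of kernel j is
$$\delta^j = \left\{\frac{\|\mathcal{K}^j\|_2^2}{\mu_2(\mathcal{K}^j)^2}\right\}^{1/(d+4)}.$$
consequently, it holds
$$\mathrm{AMISE}(H^j, \mathcal{K}^j) \approx \mathrm{AMISE}(H^i, \mathcal{K}^i)$$
where
$$H^i = \frac{\delta^i}{\delta^j}\,H^j$$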
SPM
Kernel Density Estimation 3-76
Example: Adjust from Gaussian to Quartic product kernel
d    δ^G      δ^Q      δ^Q/δ^G
1    0.7764   2.0362   2.6226
2    0.6558   1.7100   2.6073
3    0.5814   1.5095   2.5964
4    0.5311   1.3747   2.5883
5    0.4951   1.2783   2.5820
Table 5: Bandwidth adjusting factors for Gaussian and multiplicative Quartic kernel for different dimensions d
SPM
Kernel Density Estimation 3-77
Example: East-West German migration intention
A two-dimensional density estimation $\hat f_h(x) = \hat f_h(x_1, x_2)$ for two explanatory variables on East-West German migration intention in Spring 1991.
[Figure: "2D Density Estimate", surface plot over age and household income]
Figure 44: Two-dimensional density estimation for age and household income from East German SOEP 1991
SPMdensity2D
SPM
Kernel Density Estimation 3-78
[Figure: "2D Density Estimate", contour plot; x-axis: Age, y-axis: Income*E3]
Figure 45: Two-dimensional density estimation for age and household income from East German SOEP 1991
SPMcontour2D
distribution is considerably left skewed
younger people are more likely to move to the western part of
Germany.
SPM
Kernel Density Estimation 3-79
Example: Credit scoring
Credit scoring explained by duration of credit, income and age.
To visualize this three-dimensional density estimate, one variable is held fixed and the other two are displayed.
SPM
Kernel Density Estimation 3-80
[Figure: three surface plots, "Duration fixed at 38", "Income fixed at 9337" and "Age fixed at 47"]
Figure 46: Three-dimensional density estimation, graphical representation by two-dimensional intersections
SPMslices3D
In this example a lower income and a lower age will probably
lead to a higher credit scoring.
Results for the variable duration are mixed.
SPM
Kernel Density Estimation 3-81
[Figure: "Contours, 3D Density Estimate"]
Figure 47: Three-dimensional density estimation for duration of credit, household income and age, graphical representation by contour plot
SPMcontour3D
SPM
Kernel Density Estimation 3-82
Curse of Dimensionality
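recall the AMISE for the multivariate KDE
$$\mathrm{AMISE}(H) = \frac14\,\mu_2^2(\mathcal{K})\int_{R^d}\left[\mathrm{tr}\{H^\top \mathcal{H}_f(x) H\}\right]^2 dx + \frac{1}{n\det(H)}\,\|\mathcal{K}\|_2^2$$
consider the special case $H = h \cdot I_d$
$$h_{opt} \sim n^{-\frac{1}{4+d}}, \qquad \mathrm{AMISE}(h_{opt}) \sim n^{-\frac{4}{4+d}}$$
Hence the convergence rate decreases with dimension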
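To get a feeling for the rate $n^{-4/(4+d)}$, the sketch below computes how many observations are needed in dimension d to reach the AMISE order attained with a given sample size in one dimension. It ignores all constants in the AMISE, so the numbers are orders of magnitude only; the function name is made up.

```python
def equivalent_sample_size(n_univariate: float, d: int) -> float:
    """Sample size n_d with n_d^(-4/(4+d)) = n_1^(-4/5), constants ignored."""
    return n_univariate ** ((4 + d) / 5)


for d in (1, 2, 5, 10):
    print(d, round(equivalent_sample_size(100, d)))
```

Under this crude comparison, 100 observations in one dimension correspond to roughly 4,000 in five dimensions and about 400,000 in ten.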
SPM
Kernel Density Estimation 3-83
Summary: Multivariate Density
Straightforward generalization of the one-dimensional density
estimator.
Multivariate kernels based on the univariate ones (product or
radially-symmetric).
Bandwidth selection based on same principles as bandwidth
selection for univariate densities.
New problem is the graphical representation of the estimates.
We can use contour plots, 3D plots or 2D intersections.
We find a curse of dimensionality: the rate of convergence of the estimate towards the true density decreases with the dimension.
SPM
Nonparametric Regression 4-1
Chapter 4:
Nonparametric Regression
SPM
Nonparametric Regression 4-2
Nonparametric Regression
$(X_i, Y_i)$, $i = 1, \dots, n$; $X \in R^d$, $Y \in R$
Engel curve: X = net-income, Y = expenditure
$$Y = m(X) + \varepsilon$$
CHARN model: time series of the form
$$Y_t = m(Y_{t-1}) + \sigma(Y_{t-1})\,\xi_t$$
SPM
Nonparametric Regression 4-3
Univariate Kernel Regression
model
$$Y_i = m(X_i) + \varepsilon_i, \quad i = 1, \dots, n$$
$m(\bullet)$ smooth regression function, $\varepsilon_i$ i.i.d. error terms with $E\varepsilon_i = 0$
we aim to estimate the conditional expectation of Y given X = x
$$m(x) = E(Y|X = x) = \int y\, f(y|x)\,dy = \int y\,\frac{f(x, y)}{f_X(x)}\,dy$$
where $f(x, y)$ denotes the joint density of $(X, Y)$ and $f_X(x)$ the marginal density of X
Example: Normal variables
$(X, Y) \sim N(\mu, \Sigma) \implies m(x)$ is linear
SPM
Nonparametric Regression 4-4
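Let $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_p(\mu, \Sigma)$, $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$ and $X_{2.1} = X_2 - \Sigma_{21}\Sigma_{11}^{-1}X_1$.
The conditional distribution of $X_2$ given $X_1 = x_1$ is normal with mean $\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1)$ and covariance $\Sigma_{22.1}$, i.e.,
$$(X_2 | X_1 = x_1) \sim N_{p-r}\left(\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1),\ \Sigma_{22.1}\right).$$
Proof
Since $X_2 = X_{2.1} + \Sigma_{21}\Sigma_{11}^{-1}X_1$ and $X_{2.1}$ is independent of $X_1$, for a fixed value of $X_1 = x_1$, $X_2$ is equivalent to $X_{2.1}$ plus a constant term:
$$(X_2 | X_1 = x_1) = (X_{2.1} + \Sigma_{21}\Sigma_{11}^{-1}x_1),$$
which has the normal distribution $N(\mu_{2.1} + \Sigma_{21}\Sigma_{11}^{-1}x_1,\ \Sigma_{22.1})$ with $\mu_{2.1} = \mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\mu_1$.
Note that the conditional mean of $(X_2 | X_1)$ is a linear function of $X_1$ and that the conditional variance does not depend on the particular value of $X_1$.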
SPM
Nonparametric Regression 4-5
Fixed vs. Random Design
fixed design: we know $f_X$, the density of $X_i$
random design: density $f_X$ has to be estimated
Nonparametric regression estimator
$$\hat m(x) = \frac1n\sum_{i=1}^n W_{hi}(x)\,Y_i$$
with weights
$$W^{NW}_{hi}(x) = \frac{K_h(x - x_i)}{\hat f_h(x)} \quad\text{for random design}$$
$$W^{FD}_{hi}(x) = \frac{K_h(x - x_i)}{f_X(x)} \quad\text{for fixed design}$$
SPM
Nonparametric Regression 4-6
Nadaraya-Watson Estimator
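idea: $(X_i, Y_i)$ have a joint pdf, so we can estimate $m(\bullet)$ by a multivariate kernel estimator
$$\hat f_{h,\tilde h}(x, y) = \frac1n\sum_{i=1}^n \frac1h K\left(\frac{x - X_i}{h}\right)\frac{1}{\tilde h} K\left(\frac{y - Y_i}{\tilde h}\right)$$
and therefore
$$\int y\,\hat f_{h,\tilde h}(x, y)\,dy = \frac1n\sum_{i=1}^n K_h(x - X_i)\,Y_i$$
resulting estimator:
$$\hat m_h(x) = \frac{n^{-1}\sum_{i=1}^n K_h(x - X_i)\,Y_i}{n^{-1}\sum_{j=1}^n K_h(x - X_j)} = \frac{\hat r_h(x)}{\hat f_h(x)}$$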
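The estimator is a locally weighted average of the responses, which a short sketch makes explicit. A Gaussian kernel is assumed here (the slides do not prescribe one for this formula), and the simulated data and function name are illustrative only.

```python
import numpy as np


def nadaraya_watson(grid: np.ndarray, X: np.ndarray, Y: np.ndarray, h: float) -> np.ndarray:
    """m_hat_h(x) = sum_i K_h(x - X_i) Y_i / sum_j K_h(x - X_j) on a grid."""
    u = (grid[:, None] - X[None, :]) / h
    weights = np.exp(-0.5 * u**2)  # constants 1/(h sqrt(2 pi)) cancel in the ratio
    return (weights @ Y) / weights.sum(axis=1)


rng = np.random.default_rng(3)
X = rng.uniform(0, 3, size=200)
Y = np.sin(X) + rng.normal(scale=0.2, size=200)   # simulated regression data
grid = np.linspace(0.2, 2.8, 50)
m_hat = nadaraya_watson(grid, X, Y, h=0.2)
```

Because the weights are nonnegative and sum to one, every fitted value lies between the smallest and the largest observed response.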
SPM
Nonparametric Regression 4-7
Engel Curve
[Figure: x-axis: Net-income, y-axis: Food]
Figure 48: Nadaraya-Watson kernel regression, h = 0.2, U.K. Family Expenditure Survey 1973
SPMengelcurve1
SPM
Nonparametric Regression 4-8
Gasser & Müller Estimator
for fixed design with $x_i \in [a, b]$, and $s_i = \left(x_{(i)} + x_{(i+1)}\right)/2$, $s_0 = a$, $s_{n+1} = b$, Gasser and Müller (1984) suggest the weights
$$W^{GM}_{hi}(x) = n\int_{s_{i-1}}^{s_i} K_h(x - u)\,du,$$
motivated by (with mean value theorem):
$$(s_i - s_{i-1})\,K_h(x - \xi) = \int_{s_{i-1}}^{s_i} K_h(x - u)\,du$$
note that as for the Nadaraya-Watson estimator $n^{-1}\sum_i W^{GM}_{hi}(x) = 1$
SPM
Nonparametric Regression 4-9
[Figure: two panels; x-axis: X, y-axis: Y]
Figure 49: Gasser-Müller estimator (left) and its first derivative (right) for simulated data.
SPMgasmul
SPM
Nonparametric Regression 4-10
Statistical properties of the GM estimator
Theorem
For $h \to 0$, $nh \to \infty$ and regularity conditions we obtain
$$n^{-1}\sum_{i=1}^n W^{GM}_{hi}(x)\,Y_i = \hat m_h(x) \xrightarrow{P} m(x)$$
and
$$\mathrm{AMSE}\left\{\hat m^{GM}_h(x)\right\} = \frac{1}{nh}\,\sigma^2\|K\|_2^2 + \frac{h^4}{4}\,\mu_2^2(K)\,\{m''(x)\}^2.$$
Note: here we have fixed design; for random design $\hat m^{NW}_h$ may have non-existing MSE (random denominator!)
SPM
Nonparametric Regression 4-11
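Statistical properties of the NW estimator
$$\begin{aligned}
\hat m_h(x) - m(x) &= \left\{\frac{\hat r_h(x)}{\hat f_h(x)} - m(x)\right\}\left[\frac{\hat f_h(x)}{f_X(x)} + \left\{1 - \frac{\hat f_h(x)}{f_X(x)}\right\}\right] \\
&= \frac{\hat r_h(x) - m(x)\hat f_h(x)}{f_X(x)} + \{\hat m_h(x) - m(x)\}\,\frac{f_X(x) - \hat f_h(x)}{f_X(x)}
\end{aligned}$$
calculate now bias and variance in the same way as for the KDE:
$$\mathrm{AMSE}\{\hat m_h(x)\} = \underbrace{\frac{1}{nh}\,\frac{\sigma^2(x)}{f_X(x)}\,\|K\|_2^2}_{\text{variance part}} + \underbrace{\frac{h^4}{4}\left\{m''(x) + 2\,\frac{m'(x) f'_X(x)}{f_X(x)}\right\}^2\mu_2^2(K)}_{\text{bias part}}$$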
SPM
Nonparametric Regression 4-12
[Figure: four panels titled h=0.05, h=0.1, h=0.2 and h=0.5; x-axis: Net-income, y-axis: Food]
Figure 50: Four kernel regression estimates for the 1973 U.K. Family Expenditure data with bandwidths h = 0.05, h = 0.1, h = 0.2, and h = 0.5
SPMregress
SPM
Nonparametric Regression 4-13
Asymptotic normal distribution
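Under regularity conditions and $h = c\,n^{-1/5}$
$$n^{2/5}\{\hat m_h(x) - m(x)\} \xrightarrow{L} N\Big(\underbrace{c^2\mu_2(K)\left\{\frac{m''(x)}{2} + \frac{m'(x) f'_X(x)}{f_X(x)}\right\}}_{b_x},\ \underbrace{\frac{\sigma^2(x)\,\|K\|_2^2}{c\,f_X(x)}}_{v_x^2}\Big).$$
bias is a function of $m''$ and $m'$
variance is a function of $\sigma^2$ and $f_X$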
SPM
Nonparametric Regression 4-14
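Pointwise Confidence Intervals
$$\left[\hat m_h(x) - z_{1-\frac{\alpha}{2}}\sqrt{\frac{\|K\|_2^2\,\hat\sigma^2(x)}{nh\,\hat f_h(x)}},\ \ \hat m_h(x) + z_{1-\frac{\alpha}{2}}\sqrt{\frac{\|K\|_2^2\,\hat\sigma^2(x)}{nh\,\hat f_h(x)}}\right]$$
$$\hat\sigma^2(x) = n^{-1}\sum_{i=1}^n W_{hi}(x)\,\{Y_i - \hat m_h(x)\}^2,$$
$\hat\sigma^2$ and $\hat f_h$ both influence the precision of the confidence interval
correction for bias!?
analogous to KDE: confidence bands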
SPM
Nonparametric Regression 4-15
[Figure: "Confidence Intervals"; x-axis: Net-income, y-axis: Food]
Figure 51: Nadaraya-Watson kernel regression and 95% confidence intervals, h = 0.2, U.K. Family Expenditure Survey 1973
SPMengelconf
SPM
Nonparametric Regression 4-16
Local Polynomial Estimation
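Taylor expansion for sufficiently smooth functions
$$m(t) \approx m(x) + m'(x)(t - x) + m''(x)(t - x)^2\,\frac{1}{2!} + \dots + m^{(p)}(x)(t - x)^p\,\frac{1}{p!}$$
consider a weighted least squares problem
$$\min_\beta \sum_{i=1}^n \left\{Y_i - \beta_0 - \beta_1(X_i - x) - \beta_2(X_i - x)^2 - \dots - \beta_p(X_i - x)^p\right\}^2 K_h(x - X_i)$$
The resulting estimate $\hat\beta$ provides estimates for $m^{(\nu)}(x)$, $\nu = 0, 1, \dots, p$, namely $\nu!\,\hat\beta_\nu$.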
SPM
Nonparametric Regression 4-17
notations
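$$\mathbf{X} = \begin{pmatrix}
1 & X_1 - x & (X_1 - x)^2 & \dots & (X_1 - x)^p \\
1 & X_2 - x & (X_2 - x)^2 & \dots & (X_2 - x)^p \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_n - x & (X_n - x)^2 & \dots & (X_n - x)^p
\end{pmatrix}$$
$$\mathbf{Y} = (Y_1, \dots, Y_n)^\top, \qquad \mathbf{W} = \mathrm{diag}\left(\{K_h(x - X_i)\}_{i=1}^n\right)$$
local polynomial estimator
$$\hat\beta(x) = \left(\mathbf{X}^\top\mathbf{W}\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{W}\mathbf{Y}$$
local polynomial regression estimator
$$\hat m_{h,p}(x) = \hat\beta_0(x)$$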
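The matrix formula translates almost literally into code. The following sketch assumes a Gaussian kernel and solves the weighted normal equations at a single point x; the function name and the test data are illustrative.

```python
import numpy as np


def local_polynomial(x: float, X: np.ndarray, Y: np.ndarray, h: float, p: int = 1) -> np.ndarray:
    """Return beta_hat(x) = (X'WX)^-1 X'WY for a local polynomial of degree p.

    beta_hat[0] estimates m(x); nu! * beta_hat[nu] estimates the nu-th derivative.
    """
    # Design matrix with columns 1, (X_i - x), ..., (X_i - x)^p.
    design = np.vander(X - x, N=p + 1, increasing=True)
    weights = np.exp(-0.5 * ((x - X) / h) ** 2)   # Gaussian K_h up to a constant
    weighted_design = design * weights[:, None]    # rows of W X
    return np.linalg.solve(design.T @ weighted_design, weighted_design.T @ Y)


X = np.linspace(0, 1, 50)
Y = 2 + 3 * X                      # exactly linear data
beta_hat = local_polynomial(0.5, X, Y, h=0.2, p=1)
```

On exactly linear data a local linear fit reproduces the line, so `beta_hat` is (3.5, 3) up to rounding: the regression value at x = 0.5 and the slope.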
SPM
Nonparametric Regression 4-18
(Asymptotic) Statistical Properties
under regularity conditions, $h = c\,n^{-\frac15}$, $nh \to \infty$
$$n^{2/5}\{\hat m_{h,1}(x) - m(x)\} \xrightarrow{L} N\left(c^2\mu_2(K)\,\frac{m''(x)}{2},\ \frac{\sigma^2(x)\,\|K\|_2^2}{c\,f_X(x)}\right)$$
remarks:
asymptotically equivalent to higher order kernel
an analogous theorem can be stated for derivative estimation
SPM
Nonparametric Regression 4-19
Local Polynomial Regression
[Figure: x-axis: Net-income, y-axis: Food]
Figure 52: Local polynomial regression, p = 1, h = 0.2, U.K. Family Expenditure Survey 1973
SPMlocpolyreg
SPM
Nonparametric Regression 4-20
Derivative Estimation
[Figure: x-axis: Net-income, y-axis: Food]
Figure 53: Local polynomial derivative estimation, p = 2, h by rule of thumb, U.K. Family Expenditure Survey 1973
SPMderivest
SPM
Nonparametric Regression 4-21
Other Smoothers
k-Nearest Neighbor Estimator
$$\hat m_k(x) = n^{-1}\sum_{i=1}^n W_{ki}(x)\,Y_i$$
with weights defined as
$$W_{ki}(x) = \begin{cases} n/k & \text{if } i \in J_x \\ 0 & \text{otherwise} \end{cases}$$
where $J_x = \{i : X_i \text{ is one of the } k \text{ nearest observations to } x\}$.
smoothing parameter k large: smooth estimate
smoothing parameter k small: rough estimate
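With these weights the k-NN estimate is simply the average response over the k nearest observations. A minimal sketch for scalar X follows; the function name and the toy data are assumptions made for illustration.

```python
import numpy as np


def knn_regression(x: float, X: np.ndarray, Y: np.ndarray, k: int) -> float:
    """Average of Y_i over the k observations X_i closest to x."""
    nearest = np.argsort(np.abs(X - x))[:k]   # index set J_x
    return float(Y[nearest].mean())           # equals (1/n) * sum_i W_ki(x) Y_i


X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
estimate = knn_regression(2.1, X, Y, k=3)
print(estimate)  # prints 20.0
```

The three nearest neighbours of 2.1 are the observations at 2, 3 and 1, whose responses average to 20. Setting k = n returns the overall mean, the smoothest possible estimate.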
SPM
Nonparametric Regression 4-22
k-NN Regression
[Figure: x-axis: Net-income, y-axis: Food]
Figure 54: k-nearest-neighbor regression, k = 101, U.K. Family Expenditure Survey 1973
SPMknnreg
SPM
Nonparametric Regression 4-23
Statistical Properties
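for $k \to \infty$, $k/n \to 0$ and $n \to \infty$
$$E\{\hat m_k(x)\} - m(x) \approx \frac{\mu_2(K)}{8 f_X(x)^2}\left\{m''(x) + 2\,\frac{m'(x) f'_X(x)}{f_X(x)}\right\}\left(\frac{k}{n}\right)^2$$
$$\mathrm{Var}\{\hat m_k(x)\} \approx 2\,\|K\|_2^2\,\frac{\sigma^2(x)}{k}.$$
variance independent of $f_X$!
bias has an additional term depending on $f_X^2$
$k = 2nh f_X(x)$ corresponds to kernel estimator with local bandwidth!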
SPM
Nonparametric Regression 4-24
Median Smoothing
$$\hat m(x) = \mathrm{med}\{Y_i : i \in J_x\}$$
where
$J_x = \{i : X_i \text{ is one of the } k \text{ nearest neighbors of } x\}$.
SPM
Nonparametric Regression 4-25
Median Smoothing Regression
[Figure: x-axis: Net-income, y-axis: Food]
Figure 55: Median smoothing regression, k = 101, U.K. Family Expenditure Survey 1973
SPMmesmooreg
SPM
Nonparametric Regression 4-26
Spline Smoothing
The residual sum of squares $\mathrm{RSS} = \sum_{i=1}^n \{Y_i - m(X_i)\}^2$ is taken as a criterion for the goodness of fit.
Spline smoothing avoids functions that minimize the RSS but merely interpolate the data, by adding a stabilizer that penalizes non-smoothness.
Possible stabilizer:
$$\|m''\|_2^2 = \int \{m''(x)\}^2\,dx$$
Resulting minimization problem:
$$\hat m_\lambda = \arg\min_m \sum_{i=1}^n \{Y_i - m(X_i)\}^2 + \lambda\,\|m''\|_2^2,$$
where $\lambda$ controls the weight given to the stabilizer.
SPM
Nonparametric Regression 4-27
In the class of all twice differentiable functions the solution $\hat m_\lambda$ is a set of cubic polynomials
$$p_i(x) = \alpha_i + \beta_i x + \gamma_i x^2 + \delta_i x^3, \quad i = 1, \dots, n - 1.$$
For the estimator to be twice continuously differentiable at $X_{(i)}$ we require
$$p_i(X_{(i)}) = p_{i-1}(X_{(i)}), \quad p'_i(X_{(i)}) = p'_{i-1}(X_{(i)}), \quad p''_i(X_{(i)}) = p''_{i-1}(X_{(i)})$$
and an additional boundary condition
$$p''_1\left(X_{(1)}\right) = p''_{n-1}\left(X_{(n)}\right) = 0.$$
SPM
Nonparametric Regression 4-28
Computational algorithm (introduced by Reinsch)
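$$\mathrm{RSS} = \sum_{i=1}^n \left\{Y_{(i)} - m(X_{(i)})\right\}^2 = (\mathbf{Y} - \mathbf{m})^\top(\mathbf{Y} - \mathbf{m}), \qquad (4.1)$$
where $\mathbf{Y} = (Y_{(1)}, \dots, Y_{(n)})^\top$
If $m(\bullet)$ were indeed a piecewise cubic polynomial, then the penalty term could be expressed as a quadratic form in $\mathbf{m}$:
$$\int \{m''(x)\}^2\,dx = \mathbf{m}^\top\mathbf{K}\mathbf{m}, \qquad (4.2)$$
where $\mathbf{K}$ can be decomposed to $\mathbf{K} = \mathbf{Q}\mathbf{R}^{-1}\mathbf{Q}^\top$. $\mathbf{Q}$ and $\mathbf{R}$ denote band matrices and functions of $h_i = X_{(i+1)} - X_{(i)}$.
From (4.1) and (4.2) we get
$$\hat{\mathbf{m}}_\lambda = (\mathbf{I} + \lambda\mathbf{K})^{-1}\mathbf{Y},$$
where $\mathbf{I}$ denotes an n-dimensional identity matrix.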
SPM
Nonparametric Regression 4-29
Example
We apply the algorithm for cubic spline estimate on the Engel
curve example.
Spline Regression
[Figure: x-axis: Net-income, y-axis: Food]
Figure 56: Spline regression, $\lambda$ = 0.005, U.K. Family Expenditure Survey 1973
SPMspline
SPM
Nonparametric Regression 4-30
Spline Kernel
Under certain conditions the spline smoother is asymptotically equivalent to a kernel smoother with a spline kernel:
$$K_S(u) = \frac12\exp\left(-\frac{|u|}{\sqrt2}\right)\sin\left(\frac{|u|}{\sqrt2} + \frac{\pi}{4}\right),$$
with local bandwidth $h(X_i) = \lambda^{1/4}\, n^{-1/4}\, f_X(X_i)^{-1/4}$.
SPM
Nonparametric Regression 4-31
Orthogonal Series Regression
$m(\bullet)$ represented as a series of basis functions (e.g. a Fourier series):
$$m(x) = \sum_{j=0}^\infty \beta_j\,\varphi_j(x),$$
where $\{\varphi_j\}_{j=0}^\infty$ is a known basis of functions and $\{\beta_j\}_{j=0}^\infty$ are the unknown Fourier coefficients.
The goal is to estimate the unknown coefficients.
SPM
Nonparametric Regression 4-32
There are infinitely many coefficients, which cannot all be estimated from a finite number of observations (n).
Choose the number of terms N that will be included in the representation.
Series estimation procedure:
1. select a basis of functions
2. select N, N < n
3. estimate the N unknown coefficients by a suitable method
SPM
Nonparametric Regression 4-33
Legendre polynomials
As functions $\varphi_j$ we used the Legendre polynomials (orthogonalized polynomials):
$$(m + 1)\,p_{m+1}(x) = (2m + 1)\,x\,p_m(x) - m\,p_{m-1}(x)$$
On the interval [−1, 1] we have:
$$p_0(x) = \frac{1}{\sqrt2}, \quad p_1(x) = \frac{\sqrt3\,x}{\sqrt2}, \quad p_2(x) = \frac{\sqrt5}{2\sqrt2}\,(3x^2 - 1), \ \dots$$
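The recurrence and the normalisation can be checked numerically, and the resulting basis can be used for a least-squares series fit. This sketch assumes the recurrence is applied to the unnormalised Legendre polynomials, which are then scaled by $\sqrt{(2m+1)/2}$ to obtain the orthonormal versions listed above, and that the regressor has already been mapped to [−1, 1]; all function names are hypothetical.

```python
import numpy as np


def legendre_basis(x: np.ndarray, N: int) -> np.ndarray:
    """Orthonormal Legendre polynomials p_0, ..., p_{N-1} evaluated at x.

    Returns an array of shape (len(x), N).
    """
    x = np.asarray(x, dtype=float)
    P = np.empty((N, x.size))
    P[0] = 1.0
    if N > 1:
        P[1] = x
    for m in range(1, N - 1):
        # (m + 1) P_{m+1}(x) = (2m + 1) x P_m(x) - m P_{m-1}(x)
        P[m + 1] = ((2 * m + 1) * x * P[m] - m * P[m - 1]) / (m + 1)
    norm = np.sqrt((2 * np.arange(N) + 1) / 2)
    return (P * norm[:, None]).T


def series_regression(X: np.ndarray, Y: np.ndarray, N: int) -> np.ndarray:
    """Least-squares estimates of the first N series coefficients."""
    coefficients, *_ = np.linalg.lstsq(legendre_basis(X, N), Y, rcond=None)
    return coefficients


# Orthonormality check with Gauss-Legendre quadrature on [-1, 1].
nodes, quad_weights = np.polynomial.legendre.leggauss(10)
B = legendre_basis(nodes, 5)
gram = B.T @ (quad_weights[:, None] * B)   # should be close to the identity
```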
SPM
Nonparametric Regression 4-34
Orthogonal Series Regression
[Figure: x-axis: Net-income, y-axis: Food]
Figure 57: Orthogonal series regression with Legendre polynomials, N = 9, U.K. Family Expenditure Survey 1973
SPMorthogon
SPM
Nonparametric Regression 4-35
Wavelet Regression
Wavelets: an alternative to orthogonal series regression
Basis of functions is orthonormal.
The wavelet basis $\{\psi_{jk}\}$ on R is generated from a mother wavelet $\psi(\bullet)$ and can be computed by
$$\psi_{jk}(x) = 2^{j/2}\,\psi(2^j x - k),$$
where k is a location shift and $2^j$ a scale factor
SPM
Nonparametric Regression 4-36
An example of the mother wavelet: the Haar wavelet
$$\psi(x) = \begin{cases} 1 & x \in [0, 1/2] \\ -1 & x \in (1/2, 1] \\ 0 & \text{otherwise.} \end{cases}$$
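A short sketch of the Haar mother wavelet and the basis functions generated from it by shifting and scaling, $\psi_{jk}(x) = 2^{j/2}\psi(2^j x - k)$. The function names are illustrative.

```python
import numpy as np


def haar(x):
    """Haar mother wavelet: 1 on [0, 1/2], -1 on (1/2, 1], 0 otherwise."""
    x = np.asarray(x, dtype=float)
    upper = (x >= 0) & (x <= 0.5)
    lower = (x > 0.5) & (x <= 1)
    return np.where(upper, 1.0, np.where(lower, -1.0, 0.0))


def haar_jk(x, j: int, k: int):
    """Shifted and scaled wavelet psi_jk(x) = 2^(j/2) * psi(2^j x - k)."""
    return 2 ** (j / 2) * haar(2**j * np.asarray(x, dtype=float) - k)


# psi_{2,1} lives on [1/4, 1/2]: positive on the first half, negative on the second.
values = haar_jk(np.array([0.2, 0.3, 0.45, 0.6]), j=2, k=1)
```

Larger j gives narrower and taller functions, which is what lets a wavelet basis adapt to local features of the regression curve.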
SPM
Nonparametric Regression 4-37
Wavelet Regression
[Figure: x-axis: X, y-axis: Y]
Figure 58: Wavelet regression for simulated data, n = 256
SPMwavereg
SPM
Nonparametric Regression 4-38
Bandwidth Choice in Kernel Regression
[Figure: "Bias^2, Variance and MASE"; x-axis: Bandwidth h, y-axis: Bias^2, Variance and MASE]
Figure 59: MASE (thick line), bias part (thin solid line) and variance part (thin dashed line) for simulated data
SPMsimulmase
SPM
Nonparametric Regression 4-39
Example: Simulated data for previous figure
[Figure: "Simulated Data"; x-axis: x, y-axis: y, m, $\hat m_h$]
Figure 60: Simulated data with curve $m(x) = \{\sin(2\pi x^3)\}^3$, $Y_i = m(X_i) + \varepsilon_i$, $X_i \sim U[0, 1]$, $\varepsilon_i \sim N(0, 0.1)$
SPMsimulmase
SPM
Nonparametric Regression 4-40
Cross Validation
$$\mathrm{ASE} = n^{-1}\sum_{j=1}^n \{\hat m_h(X_j) - m(X_j)\}^2\, w(X_j)$$
$$\mathrm{MASE} = E\{\mathrm{ASE}\,|\,X_1 = x_1, \dots, X_n = x_n\}$$
estimate MASE by resubstitution function (w is a weight function)
$$p(h) = \frac1n\sum_{i=1}^n \{Y_i - \hat m_h(X_i)\}^2\, w(X_i)$$
separate estimation and validation by using leave-one-out estimators
$$\mathrm{CV}(h) = \frac1n\sum_{i=1}^n \{Y_i - \hat m_{h,-i}(X_i)\}^2\, w(X_i)$$
minimizing gives $\hat h_{CV}$
Nonparametric Regression 4-41
Nadaraya-Watson Estimate
[Figure: x-axis: Net-income, y-axis: Food]
Figure 61: Nadaraya-Watson kernel regression, cross-validated bandwidth $\hat h_{CV}$ = 0.15, U.K. Family Expenditure Survey 1973
SPMnadwaest
SPM
Nonparametric Regression 4-42
Local Polynomial Estimate
[Figure: x-axis: Net-income, y-axis: Food]
Figure 62: Local polynomial regression, p = 1, cross-validated bandwidth $\hat h_{CV}$ = 0.56, U.K. Family Expenditure Survey 1973
SPMlocpolyest
SPM
Nonparametric Regression 4-43
Penalizing Functions
The conditional expectation of p(h) is not equal to the conditional expectation of ASE(h). The penalizing function approach corrects the bias by penalizing too small h. The corrected version of p(h):
$$G(h) = \frac1n\sum_{i=1}^n \{Y_i - \hat m_h(X_i)\}^2\ \Xi\left(\frac1n W_{hi}(X_i)\right) w(X_i). \qquad (4.3)$$
The penalizing functions that satisfy the first-order Taylor expansion $\Xi(u) = 1 + 2u + O(u^2)$, $u \to 0$:
1. Shibata's model selector: $\Xi_S(u) = 1 + 2u$
2. Generalized cross-validation: $\Xi_{GCV}(u) = (1 - u)^{-2}$
3. Akaike's Information Criterion: $\Xi_{AIC}(u) = \exp(2u)$
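That the three penalizing functions share the expansion $1 + 2u + O(u^2)$ can be seen numerically for small u. The dictionary keys below are labels chosen for this sketch.

```python
import math

PENALIZING_FUNCTIONS = {
    "shibata": lambda u: 1 + 2 * u,
    "gcv": lambda u: (1 - u) ** (-2),
    "aic": lambda u: math.exp(2 * u),
}

u = 1e-3
deviations = {
    name: abs(xi(u) - (1 + 2 * u)) for name, xi in PENALIZING_FUNCTIONS.items()
}
# Each deviation from 1 + 2u is of order u^2, i.e. about 1e-6 here.
```

The functions differ in the second-order term (0, 3u² and 2u² respectively), which is why they penalize small bandwidths with different strength while being first-order equivalent.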
SPM
Nonparametric Regression 4-44
Penalizing Functions
[Figure: x-axis: Bandwidth h, y-axis: Xi(1/h)]
Figure 63: Penalizing functions $\Xi(h^{-1})$ as a function of h (from left to right: Shibata's model selector, Akaike's Information Criterion, Finite Prediction Error, Generalized cross-validation, Rice's T)
SPMpenalize
SPM
Nonparametric Regression 4-45
Multivariate Kernel Regression
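$$E(Y|X) = E(Y|X_1, \dots, X_d) = m(X)$$
Multivariate Nadaraya-Watson estimator:
$$\hat m_H(x) = \frac{\sum_{i=1}^n \mathcal{K}_H(X_i - x)\,Y_i}{\sum_{i=1}^n \mathcal{K}_H(X_i - x)}$$
Multivariate local linear regression:
$$\min_{\beta_0, \beta_1}\sum_{i=1}^n \left\{Y_i - \beta_0 - (X_i - x)^\top\beta_1\right\}^2 \mathcal{K}_H(X_i - x)$$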
SPM
Nonparametric Regression 4-46
Notations:
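$$\mathbf{X} = \begin{pmatrix}
1 & X_1 - x & (X_1 - x)^2 & \dots & (X_1 - x)^p \\
1 & X_2 - x & (X_2 - x)^2 & \dots & (X_2 - x)^p \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_n - x & (X_n - x)^2 & \dots & (X_n - x)^p
\end{pmatrix}$$
$$\mathbf{Y} = (Y_1, \dots, Y_n)^\top, \qquad \mathbf{W} = \mathrm{diag}\left(\{\mathcal{K}_H(X_i - x)\}_{i=1}^n\right)$$
Multivariate local polynomial estimator:
$$\hat\beta(x) = \left(\hat\beta_0, \hat\beta_1^\top\right)^\top = \left(\mathbf{X}^\top\mathbf{W}\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{W}\mathbf{Y}$$
Multivariate local linear estimator:
$$\hat m_{1,H}(x) = \hat\beta_0(x)$$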
SPM
Nonparametric Regression 4-47
[Figure: three surface plots, "True Function", "Nadaraya-Watson" and "Local Linear"]
Figure 64: Two-dimensional Nadaraya-Watson and Local Linear Estimate
SPMtruenadloc
SPM
Nonparametric Regression 4-48
Bias, Variance and Asymptotics
Theorem
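The conditional asymptotic bias and variance of the multivariate Nadaraya-Watson kernel regression estimator are
$$\mathrm{Bias}\{\hat m_H\,|\,X_1, \dots, X_n\} \approx \mu_2(\mathcal{K})\,\frac{\nabla_m(x)^\top H H^\top \nabla_f(x)}{f_X(x)} + \frac12\,\mu_2(\mathcal{K})\,\mathrm{tr}\{H^\top \mathcal{H}_m(x) H\}$$
$$\mathrm{Var}\{\hat m_H\,|\,X_1, \dots, X_n\} \approx \frac{1}{n\det(H)}\,\|\mathcal{K}\|_2^2\,\frac{\sigma^2(x)}{f_X(x)}$$
in the interior of the support of $f_X$.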
SPM
Nonparametric Regression 4-49
Theorem
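The conditional asymptotic bias and variance of the multivariate local linear regression estimator are
$$\mathrm{Bias}\{\hat m_{1,H}\,|\,X_1, \dots, X_n\} \approx \frac12\,\mu_2(\mathcal{K})\,\mathrm{tr}\{H^\top \mathcal{H}_m(x) H\}$$
$$\mathrm{Var}\{\hat m_{1,H}\,|\,X_1, \dots, X_n\} \approx \frac{1}{n\det(H)}\,\|\mathcal{K}\|_2^2\,\frac{\sigma^2(x)}{f_X(x)}$$
in the interior of the support of $f_X$.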
SPM
Nonparametric Regression 4-50
Curse of dimensionality
As in multivariate kernel density estimation, we have a curse of dimensionality, since the asymptotic MSE depends on d:
$$\mathrm{AMSE}(n, h) = \frac{1}{nh^d}\,C_1 + h^4 C_2.$$
Consequently, the optimal bandwidth is $h_{opt} \sim n^{-1/(4+d)}$ and hence the rate of convergence for AMSE is $n^{-4/(4+d)}$.
SPM
Nonparametric Regression 4-51
Cross-Validation Bandwidth Choice
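Motivated by
$$\mathrm{RSS}(H) = n^{-1}\sum_{i=1}^n \{Y_i - \hat m_H(X_i)\}^2,$$
Cross-validation function
$$\mathrm{CV}(H) = n^{-1}\sum_{i=1}^n \{Y_i - \hat m_{H,-i}(X_i)\}^2$$
with the leave-one-out estimator
$$\hat m_{H,-i}(X_i) = \frac{\sum_{j\ne i} \mathcal{K}_H(X_i - X_j)\,Y_j}{\sum_{j\ne i} \mathcal{K}_H(X_i - X_j)}.$$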
SPM
Nonparametric Regression 4-52
For the Nadaraya-Watson Estimator:
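$$\begin{aligned}
\mathrm{CV}(H) &= n^{-1}\sum_{i=1}^n \{Y_i - \hat m_{H,-i}(X_i)\}^2 \\
&= n^{-1}\sum_{i=1}^n \{Y_i - \hat m_H(X_i)\}^2\left\{\frac{Y_i - \hat m_{H,-i}(X_i)}{Y_i - \hat m_H(X_i)}\right\}^2
\end{aligned}$$
and
$$\frac{Y_i - \hat m_H(X_i)}{Y_i - \hat m_{H,-i}(X_i)} = 1 - \frac{\mathcal{K}_H(0)}{\sum_j \mathcal{K}_H(X_i - X_j)}$$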
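The identity means that the leave-one-out residuals follow from a single full fit, with no refitting. A quick numerical check for scalar X, assuming a Gaussian kernel (the identity itself does not depend on the kernel); the function name is hypothetical.

```python
import numpy as np


def nw_full_and_loo(X: np.ndarray, Y: np.ndarray, h: float):
    """Full and leave-one-out Nadaraya-Watson fits at the observations."""
    K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)
    m_full = (K @ Y) / K.sum(axis=1)
    # Shortcut factor 1 - K_H(0) / sum_j K_H(X_i - X_j).
    factor = 1 - np.diag(K) / K.sum(axis=1)
    K_loo = K.copy()
    np.fill_diagonal(K_loo, 0.0)
    m_loo = (K_loo @ Y) / K_loo.sum(axis=1)
    return m_full, m_loo, factor


rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=80)
Y = X**2 + rng.normal(scale=0.1, size=80)
m_full, m_loo, factor = nw_full_and_loo(X, Y, h=0.1)
# Residuals of the full fit equal factor times the leave-one-out residuals.
max_gap = np.max(np.abs((Y - m_full) - factor * (Y - m_loo)))
```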
SPM
Nonparametric Regression 4-53
For the Local Linear Estimator:
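Consider for fixed points x the sums
$$S_0 = \sum_{i=1}^n \mathcal{K}_H(X_i - x)$$
$$S_1 = \sum_{i=1}^n \mathcal{K}_H(X_i - x)\,(X_i - x)$$
$$S_2 = \sum_{i=1}^n \mathcal{K}_H(X_i - x)\,(X_i - x)(X_i - x)^\top$$
then
$$\frac{Y_i - \hat m_H(X_i)}{Y_i - \hat m_{H,-i}(X_i)} = 1 - \frac{\mathcal{K}_H(0)}{S_0 - S_1^\top S_2^{-1} S_1}$$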
SPM
Nonparametric Regression 4-54
Summary: Nonparametric Regression
The regression function $m(\bullet)$ which relates an independent variable X and a dependent variable Y is the conditional expectation $m(x) = E(Y|X = x)$.
A natural kernel regression estimate for random design can be obtained by replacing the unknown densities in $E(Y|X = x)$ by kernel density estimates. This yields the Nadaraya-Watson estimator
$$\hat m_h(x) = \frac{\sum_{i=1}^n K_h(x - X_i)\,Y_i}{\sum_{j=1}^n K_h(x - X_j)} = \frac1n\sum_{i=1}^n W_{hi}(x)\,Y_i.$$
For fixed design we can use the Gasser-Müller estimator with
$$W^{GM}_{hi}(x) = n\int_{s_{i-1}}^{s_i} K_h(x - u)\,du.$$
SPM
Nonparametric Regression 4-55
The asymptotic MSE of the Nadaraya-Watson estimator is
$$\mathrm{AMSE}\{\hat m_h(x)\} = \frac{1}{nh}\,\frac{\sigma^2(x)}{f_X(x)}\,\|K\|_2^2 + \frac{h^4}{4}\left\{m''(x) + 2\,\frac{m'(x) f'_X(x)}{f_X(x)}\right\}^2\mu_2^2(K),$$
the asymptotic MSE of the Gasser-Müller estimator is missing the $2\,\frac{m'(x) f'_X(x)}{f_X(x)}$ term.
The Nadaraya-Watson estimator is a local constant estimator. Extending to local polynomials of degree p yields the minimization problem
$$\min_\beta \sum_{i=1}^n \left\{Y_i - \beta_0 - \beta_1(X_i - x) - \dots - \beta_p(X_i - x)^p\right\}^2 K_h(x - X_i).$$
Odd order local polynomial estimators outperform even order fits.
SPM
Nonparametric Regression 4-56
In particular, the asymptotic MSE for the local linear kernel regression estimator is
$$\mathrm{AMSE}\{\hat m_{1,h}(x)\} = \frac{1}{nh}\,\frac{\sigma^2(x)}{f_X(x)}\,\|K\|_2^2 + \frac{h^4}{4}\,\{m''(x)\}^2\,\mu_2^2(K).$$
The bias does not depend on $f_X$, $f'_X$ and $m'$, which makes the local linear estimator more design adaptive and improves the behavior in boundary regions.
Local polynomial estimation allows an easy estimation of regression function derivatives.
Other smoothing methods are: k-NN estimation, splines, median smoothing, orthogonal series regression and wavelets.
SPM
Semiparametric and Generalized Regression Models 5-1
Chapter 5:
Semiparametric and Generalized
Regression Models
SPM
Semiparametric and Generalized Regression Models 5-2
Semiparametric and Generalized Regression
Models
possibilities to avoid the curse of dimensionality
variable selection in nonparametric regression
nonparametric link function g, but one-dimensional (parametric) index
  single index model (SIM)
  $$E(Y|X) = m(X) = g(X^\top\beta)$$
generalization to semi- or nonparametric index
  generalized partial linear model (GPLM)
  $$E(Y|X, T) = G\{X^\top\beta + m(T)\}$$
  generalized additive model (GAM)
  $$E(Y|X, T) = G\{X^\top\beta + f_1(T_1) + \dots + f_d(T_d)\}$$
SPM
Semiparametric and Generalized Regression Models 5-3
components \ link        known        unknown
linear                   GLM          SIM
partial nonparametric    GPLM, GAM
Table 6: Parametric → semiparametric
SPM
Semiparametric and Generalized Regression Models 5-4
Generalized Linear Models
model
$$E(Y|x) = G(x^\top\beta) = G(\eta)$$
components of a GLM
distribution of Y (exponential family)
link function G (note that other authors call $G^{-1}$ the link function)
SPM
Semiparametric and Generalized Regression Models 5-5
Exponential Family
$$f(y, \theta, \psi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\psi)} + c(y, \psi)\right\}$$
where
f is the density of Y (continuous)
f is the probability function of Y (discrete)
note that
in GLM $\theta = \theta(\beta)$
$\theta$ is the parameter of interest
$\psi$ is a nuisance parameter (as $\sigma$ in regression)
SPM
Semiparametric and Generalized Regression Models 5-6
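Example: $Y \sim N(\mu, \sigma^2)$
$$\begin{aligned}
f(y) &= \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2\sigma^2}(y - \mu)^2\right\} \\
&= \exp\left\{y\,\frac{\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \frac{y^2}{2\sigma^2} - \log(\sqrt{2\pi}\,\sigma)\right\}
\end{aligned}$$
exponential family with
$$a(\psi) = \sigma^2, \qquad b(\theta) = \frac{\mu^2}{2}, \qquad c(y, \psi) = -\frac{y^2}{2\sigma^2} - \log(\sqrt{2\pi}\,\sigma)$$
where $\psi = \sigma$ and $\theta = \mu$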
SPM
Semiparametric and Generalized Regression Models 5-7
Example: Y is Bernoulli
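$$f(y) = P(Y = y) = p^y(1 - p)^{1-y} = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{if } y = 0 \end{cases}$$
transform into
$$P(Y = y) = \left(\frac{p}{1 - p}\right)^y(1 - p) = \exp\left\{y\log\left(\frac{p}{1 - p}\right)\right\}(1 - p)$$
define logit
$$\theta = \log\left(\frac{p}{1 - p}\right) \iff p = \frac{e^\theta}{1 + e^\theta}$$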
SPM
Semiparametric and Generalized Regression Models 5-8
thus, we arrive at
$$P(Y = y) = \exp\{y\theta + \log(1 - p)\}$$
the parameters of the exponential family are
$$a(\psi) = 1, \qquad b(\theta) = -\log(1 - p) = \log(1 + e^\theta), \qquad c(y, \psi) = 0.$$
this is a one-parameter exponential family as there is no nuisance parameter
SPM
Semiparametric and Generalized Regression Models 5-9
Properties of the exponential family
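f is a density function:
$$\int f(y, \theta, \psi)\,dy = 1$$
The partial derivative with respect to $\theta$:
$$\begin{aligned}
0 &= \frac{\partial}{\partial\theta}\int f(y, \theta, \psi)\,dy \\
&= \int \left\{\frac{\partial}{\partial\theta}\log f(y, \theta, \psi)\right\} f(y, \theta, \psi)\,dy \\
&= E\left\{\frac{\partial}{\partial\theta}\ell(y, \theta, \psi)\right\},
\end{aligned}$$
where $\ell(y, \theta, \psi) = \log f(y, \theta, \psi)$ denotes the log-likelihood.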
SPM
Semiparametric and Generalized Regression Models 5-10
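We know about the score function $\frac{\partial}{\partial\theta}\ell(y, \theta, \psi)$ that
$$E\left\{\frac{\partial^2}{\partial\theta^2}\ell(y, \theta, \psi)\right\} = -E\left\{\frac{\partial}{\partial\theta}\ell(y, \theta, \psi)\right\}^2.$$
So we get the following properties:
$$0 = E\left\{\frac{Y - b'(\theta)}{a(\psi)}\right\}$$
$$E\left\{\frac{-b''(\theta)}{a(\psi)}\right\} = -E\left\{\frac{Y - b'(\theta)}{a(\psi)}\right\}^2$$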
SPM
Semiparametric and Generalized Regression Models 5-11
from this we conclude
$$E(Y) = \mu = b'(\theta)$$
$$\mathrm{Var}(Y) = V(\mu)\,a(\psi) = b''(\theta)\,a(\psi)$$
the expectation of Y only depends on $\theta$ whereas the variance of Y depends on $\theta$ and $\psi$
we assume that $a(\psi)$ is constant (there are modifications using prior weights)
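For the Bernoulli case, where $b(\theta) = \log(1 + e^\theta)$ and $a(\psi) = 1$, the two moment identities can be verified by numerical differentiation. This is only a sanity check; the step size and function names are choices made for this sketch.

```python
import numpy as np


def b(theta: float) -> float:
    """Cumulant function of the Bernoulli family: log(1 + exp(theta))."""
    return float(np.log1p(np.exp(theta)))


def numerical_derivatives(theta: float, eps: float = 1e-4):
    """Central finite differences for b'(theta) and b''(theta)."""
    first = (b(theta + eps) - b(theta - eps)) / (2 * eps)
    second = (b(theta + eps) - 2 * b(theta) + b(theta - eps)) / eps**2
    return first, second


p = 0.3
theta = np.log(p / (1 - p))                 # logit of p
mean_from_b, variance_from_b = numerical_derivatives(theta)
# Expected: E(Y) = p = 0.3 and Var(Y) = p (1 - p) = 0.21.
```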
SPM
Semiparametric and Generalized Regression Models 5-12
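Link Functions
$\mu = G(\eta)$, $\eta = x^\top\beta$
the canonical link is given when
$$x^\top\beta = \eta = \theta$$
Example: Y Bernoulli
logit: $\mu = 1/\{1 + \exp(-\eta)\}$ (canonical)
probit: $\mu = \Phi(\eta)$ (not in the exponential family!)
complementary log-log link: $\eta = \log\{-\log(1 - \mu)\}$.
Example: Power link
$\eta = \mu^\lambda$ (if $\lambda \ne 0$) and $\eta = \log\mu$ (if $\lambda = 0$)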


SPM
Semiparametric and Generalized Regression Models 5-13
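Each entry lists: notation; range of y; $b(\theta)$; $\mu(\theta)$; canonical link $\theta(\mu)$; variance $V(\mu)$; $a(\psi)$
Bernoulli $B(1, \mu)$; $\{0, 1\}$; $\log(1 + e^\theta)$; $e^\theta/(1 + e^\theta)$; logit; $\mu(1 - \mu)$; 1
Binomial $B(k, \mu)$; $\{0, 1, \dots, k\}$; $k\log(1 + e^\theta)$; $ke^\theta/(1 + e^\theta)$; logit; $\mu\left(1 - \frac{\mu}{k}\right)$; 1
Poisson $P(\mu)$; $\{0, 1, 2, \dots\}$; $\exp(\theta)$; $\exp(\theta)$; log; $\mu$; 1
Geometric $GE(\mu)$; $\{0, 1, 2, \dots\}$; $-\log(1 - e^\theta)$; $e^\theta/(1 - e^\theta)$; $\log\left(\frac{\mu}{1 + \mu}\right)$; $\mu + \mu^2$; 1
Negative Binomial $NB(\mu, k)$; $\{0, 1, 2, \dots\}$; $-k\log(1 - e^\theta)$; $ke^\theta/(1 - e^\theta)$; $\log\left(\frac{\mu}{k + \mu}\right)$; $\mu + \frac{\mu^2}{k}$; 1
Normal $N(\mu, \sigma^2)$; $(-\infty, \infty)$; $\theta^2/2$; $\theta$; identity; 1; $\sigma^2$
Exponential $Exp(\mu)$; $(0, \infty)$; $-\log(-\theta)$; $-1/\theta$; reciprocal; $\mu^2$; 1
Gamma $G(\mu, \nu)$; $(0, \infty)$; $-\log(-\theta)$; $-1/\theta$; reciprocal; $\mu^2$; $1/\nu$
Inverse Gaussian $IG(\mu, \sigma^2)$; $(0, \infty)$; $-(-2\theta)^{1/2}$; $(-2\theta)^{-1/2}$; squared reciprocal; $\mu^3$; $\sigma^2$
Table 7: Characteristics of some GLM distributions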
SPM
Semiparametric and Generalized Regression Models 5-14
Maximum-Likelihood Algorithm
observations $Y = (Y_1, \dots, Y_n)^\top$,
$$E(Y_i|x_i) = \mu_i = G(x_i^\top\beta)$$
we maximize the log-likelihood of the sample, which is
$$\ell(Y, \mu, \psi) = \sum_{i=1}^n \log f(Y_i, \theta_i, \psi)$$
where $\theta_i = \theta(\eta_i) = \theta(x_i^\top\beta)$.
Alternatively, one may minimize the deviance function
$$D(Y, \mu, \psi) = 2\{\ell(Y, Y, \psi) - \ell(Y, \mu, \psi)\}$$
SPM
Semiparametric and Generalized Regression Models 5-15
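For $Y_i \sim N(\mu_i, \sigma^2)$ we have
$$\ell(Y_i, \theta_i, \psi) = \log\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right) - \frac{1}{2\sigma^2}(Y_i - \mu_i)^2.$$
This gives the sample log-likelihood
$$\ell(Y, \mu, \sigma) = n\log\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right) - \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \mu_i)^2.$$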
SPM
Semiparametric and Generalized Regression Models 5-16
Log-Likelihood and Exponential Family
The deviance function:
$$D(Y, \mu, \psi) = 2\left\{\ell(Y, \mu^{max}, \psi) - \ell(Y, \mu, \psi)\right\}$$
$\mu^{max}$ is the non-restricted vector maximizing $\ell(Y, \bullet, \psi)$
The first term does not depend on the model (not on $\beta$), so the minimization of the deviance corresponds to the maximization of the log-likelihood.
$$\ell(Y, \mu, \psi) = \sum_{i=1}^n \left\{\frac{Y_i\theta_i - b(\theta_i)}{a(\psi)} - c(Y_i, \psi)\right\}$$
neither $a(\psi)$ nor $c(Y_i, \psi)$ has an influence on the maximization w.r.t. $\beta$, hence we maximize only
$$\tilde\ell(Y, \mu, \psi) = \sum_{i=1}^n \{Y_i\theta_i - b(\theta_i)\}$$
gradient of $\tilde\ell$
$$\mathcal{D}(\beta) = \frac{\partial}{\partial\beta}\,\tilde\ell(Y, \mu, \psi) = \sum_{i=1}^n \{Y_i - b'(\theta_i)\}\,\frac{\partial}{\partial\beta}\theta_i$$
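we have to solve
$$\mathcal{D}(\beta) = 0$$
Newton-Raphson step
$$\hat\beta^{new} = \hat\beta^{old} - \left\{\mathcal{H}(\hat\beta^{old})\right\}^{-1}\mathcal{D}(\hat\beta^{old})$$
Fisher scoring step
$$\hat\beta^{new} = \hat\beta^{old} - \left\{E\,\mathcal{H}(\hat\beta^{old})\right\}^{-1}\mathcal{D}(\hat\beta^{old})$$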
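Gradient and Hessian after some calculation
$$\mathcal{D}(\beta) = \sum_{i=1}^n \{Y_i - \mu_i\}\,\frac{G'(\eta_i)}{V(\mu_i)}\,x_i$$
Newton-Raphson
$$\mathcal{H}(\beta) = -\sum_{i=1}^n \left\{ b''(\theta_i)\left(\frac{\partial\theta_i}{\partial\beta}\right)\left(\frac{\partial\theta_i}{\partial\beta}\right)^\top - \{Y_i - b'(\theta_i)\}\,\frac{\partial^2\theta_i}{\partial\beta\,\partial\beta^\top} \right\}$$
$$= \sum_{i=1}^n \left\{ -\frac{G'(\eta_i)^2}{V(\mu_i)} + \{Y_i - \mu_i\}\,\frac{G''(\eta_i)V(\mu_i) - G'(\eta_i)^2 V'(\mu_i)}{V(\mu_i)^2} \right\} x_i x_i^\top$$
Fisher scoring
$$E\,\mathcal{H}(\beta) = \sum_{i=1}^n \left\{ -\frac{G'(\eta_i)^2}{V(\mu_i)} \right\} x_i x_i^\top$$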
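Simpler presentation of algorithm (Fisher scoring)
$$W = \mathrm{diag}\left(\frac{G'(\eta_1)^2}{V(\mu_1)}, \ldots, \frac{G'(\eta_n)^2}{V(\mu_n)}\right)$$
$$\tilde Y = \left(\frac{Y_1 - \mu_1}{G'(\eta_1)}, \ldots, \frac{Y_n - \mu_n}{G'(\eta_n)}\right)^\top, \qquad X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix}$$
each iteration step for $\beta$ can be written as
$$\beta^{new} = \beta^{old} + (X^\top W X)^{-1} X^\top W \tilde Y = (X^\top W X)^{-1} X^\top W Z$$
with adjusted dependent variables
$$Z_i = x_i^\top\beta^{old} + (Y_i - \mu_i)\{G'(\eta_i)\}^{-1} .$$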
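The Fisher scoring iteration in its weighted least squares form can be sketched for the logit link as follows. This is a minimal illustration, not the implementation behind the slides: the function names `solve` and `irls_logit` are hypothetical, the linear system is solved by a small hand-written Gaussian elimination, and the starting values use $\mu_{i,0} = (Y_i + \frac{1}{2})/2$, i.e. the binomial rule with weight 1.

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def irls_logit(X, y, tol=1e-10, max_iter=50):
    """Iteratively reweighted least squares for a logit model with Bernoulli responses."""
    n, p = len(X), len(X[0])
    mu = [(yi + 0.5) / 2.0 for yi in y]                 # initial values, weight 1
    eta = [math.log(m / (1.0 - m)) for m in mu]
    beta = [0.0] * p
    for _ in range(max_iter):
        gprime = [m * (1.0 - m) for m in mu]            # G'(eta) for the logit link
        w = gprime[:]                                   # G'(eta)^2 / V(mu) with V(mu) = mu (1 - mu)
        z = [eta[i] + (y[i] - mu[i]) / gprime[i] for i in range(n)]   # adjusted dependent variable
        XtWX = [[sum(w[i] * X[i][a] * X[i][c] for i in range(n)) for c in range(p)]
                for a in range(p)]
        XtWz = [sum(w[i] * X[i][a] * z[i] for i in range(n)) for a in range(p)]
        new_beta = solve(XtWX, XtWz)                    # weighted least squares step
        eta = [sum(X[i][a] * new_beta[a] for a in range(p)) for i in range(n)]
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        change = max(abs(new_beta[a] - beta[a]) for a in range(p))
        beta = new_beta
        if change < tol * (1.0 + max(abs(v) for v in beta)):
            break
    return beta

# Two groups of ten observations: 3 successes without and 6 successes with the dummy regressor
X = [[1.0, 0.0]] * 10 + [[1.0, 1.0]] * 10
y = [1] * 3 + [0] * 7 + [1] * 6 + [0] * 4
beta_hat = irls_logit(X, y)
print(beta_hat)
```

With an intercept and a single dummy regressor the fitted probabilities equal the group proportions 0.3 and 0.6, so the estimate should be close to $(\log(3/7), \log(3.5))$, which gives a simple check of the iteration.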
Remarks
canonical link: Newton-Raphson = Fisher scoring
initial values:
For all but binomial models: $\mu_{i,0} = Y_i$ and $\eta_{i,0} = G^{-1}(\mu_{i,0})$
For binomial models: $\mu_{i,0} = (Y_i + \frac{1}{2})/(w_{x_i} + 1)$ and $\eta_{i,0} = G^{-1}(\mu_{i,0})$. ($w_{x_i}$ denotes the binomial weights, i.e. $w_{x_i} = 1$ in the Bernoulli case.)
convergence is controlled by checking the relative change in $\beta$ and/or the deviance
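Asymptotics
Theorem
Denote $\hat\mu = (\hat\mu_1, \ldots, \hat\mu_n)^\top = \{G(x_1^\top\hat\beta), \ldots, G(x_n^\top\hat\beta)\}^\top$, then
$$\sqrt{n}(\hat\beta - \beta) \xrightarrow{L} N_p(0, \Sigma) \quad (n \to \infty)$$
and it holds approximately $D(Y, \hat\mu, \psi) \sim \chi^2_{n-p}$ and $2\{\ell(Y, \hat\mu, \psi) - \ell(Y, \mu, \psi)\} \sim \chi^2_p$.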
the asymptotic covariance of

can be estimated by

=
1
n

= EH(
last
) =
n

i =1
_
G

(
i ,last
)
2
V(
i ,last
)
_
x
i
x

i
with last denoting the values from the last iteration step
SPM
Semiparametric and Generalized Regression Models 5-23
Example: Migration

| | | Yes | No (in %) |
|---|---|---|---|
| $Y$ | migration intention | 38.5 | 61.5 |
| $X_1$ | family/friends in west | 85.6 | 11.2 |
| $X_2$ | unemployed/job loss certain | 19.7 | 78.9 |
| $X_3$ | city size 10,000-100,000 | 29.3 | 64.2 |
| $X_4$ | female | 51.1 | 49.8 |

| | | Min | Max | Mean | S.D. |
|---|---|---|---|---|---|
| $X_5$ | age (years) | 18 | 65 | 39.84 | 12.61 |
| $T$ | household income (DM) | 200 | 4000 | 2194.30 | 752.45 |

Table 8: Descriptive statistics for migration data, n = 3235
Example: Migration in Mecklenburg-Vorpommern

| | | Yes | No (in %) |
|---|---|---|---|
| $Y$ | migration intention | 39.1 | 60.9 |
| $X_1$ | family/friends in west | 88.8 | 11.2 |
| $X_2$ | unemployed/job loss certain | 21.1 | 78.9 |
| $X_3$ | city size 10,000-100,000 | 35.8 | 64.2 |
| $X_4$ | female | 50.2 | 49.8 |

| | | Min | Max | Mean | S.D. |
|---|---|---|---|---|---|
| $X_5$ | age (years) | 18 | 65 | 39.94 | 12.89 |
| $T$ | household income (DM) | 400 | 4000 | 2262.20 | 769.82 |

Table 9: Descriptive statistics for migration data in Mecklenburg-Vorpommern, n = 402
SPMmigmvdesc
the dependent variable is the response
$$Y = \begin{cases} 1 & \text{move to the West} \\ 0 & \text{stay in the East} \end{cases}$$
for a binary dependent variable it holds that
$$E(Y|x) = 1 \cdot P(Y = 1|x) + 0 \cdot P(Y = 0|x) = P(Y = 1|x)$$
using a logit link we have
$$P(Y = 1|x) = G(x^\top\beta) = \frac{\exp(x^\top\beta)}{1 + \exp(x^\top\beta)}$$
Example: Results for the migration data

| Logit model | Coefficients | (t-value) |
|---|---|---|
| const. | 0.512 | (2.39) |
| family/friends in west | 0.599 | (5.20) |
| unemployed/job loss | 0.221 | (2.31) |
| city size 10-100,000 | 0.311 | (3.77) |
| female | -0.240 | (3.15) |
| age | $-4.69 \cdot 10^{-2}$ | (14.56) |
| household income | $1.42 \cdot 10^{-4}$ | (2.73) |

Table 10: Logit coefficients for migration data (t-values in parentheses)
Logit Model
[Figure: logit fit over the index range -3 to 2; vertical axis: Link Function, Responses (0 to 1)]
Figure 65: Logit model for migration, sample from Mecklenburg-Vorpommern, n = 402
SPMlogit
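More on Index Models
Binary Responses
$$Y = \begin{cases} 1 & \text{if } Y^* = X^\top\beta + \varepsilon \ge 0, \\ 0 & \text{otherwise,} \end{cases}$$
latent variable
$$Y^* = X^\top\beta + \varepsilon$$
assume that $\varepsilon$ is independent of $X$
$$E(Y|X = x) = P(\varepsilon \ge -x^\top\beta) = 1 - P(\varepsilon < -x^\top\beta) = 1 - G(-x^\top\beta) = G(x^\top\beta)$$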
Multicategorical Responses
utility $Y_j^*$
$$Y_j^* = X^\top\beta_j + u_j$$
non-ordered case
$$Y = j \iff Y_j^* > Y_k^* \text{ for all } k \ne j,\; k \in \{0, 1, \ldots, J\}.$$
differenced utilities
$$Y_j^* - Y_k^* = X^\top(\beta_j - \beta_k) + (u_j - u_k) = X^\top\beta_{jk} + \varepsilon_{jk} .$$
corresponding probability is
$$P(Y = j|X = x) = P(-\varepsilon_{j0} < x^\top\beta_{j0}, \ldots, -\varepsilon_{jJ} < x^\top\beta_{jJ}) = G(x^\top\beta_{j0}, \ldots, x^\top\beta_{jJ}).$$
Multinomial logit
$$P(Y = 0|X = x) = \frac{1}{1 + \sum_{k=1}^J \exp(x^\top\beta_k)}$$
$$P(Y = j|X = x) = \frac{\exp(x^\top\beta_j)}{1 + \sum_{k=1}^J \exp(x^\top\beta_k)}$$
Conditional logit
$$P(Y = j|X = x) = \frac{\exp(x_j^\top\beta)}{\sum_{k=0}^J \exp(x_k^\top\beta)}$$
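The multinomial logit probabilities can be evaluated with a few lines of code. The sketch below is illustrative; the function name and the numerical values of the regressor and coefficient vectors are hypothetical. Category 0 serves as the reference category with coefficient vector zero.

```python
import math

def multinomial_logit_probs(x, betas):
    """Return [P(Y=0|x), ..., P(Y=J|x)]; betas holds beta_1, ..., beta_J."""
    scores = [0.0] + [sum(xk * bk for xk, bk in zip(x, beta)) for beta in betas]
    top = max(scores)                       # subtract the maximum for numerical stability
    expo = [math.exp(s - top) for s in scores]
    total = sum(expo)
    return [e / total for e in expo]

x = [1.0, 0.5]                              # hypothetical regressor vector (with constant)
betas = [[0.2, -1.0], [-0.4, 0.8]]          # hypothetical beta_1, beta_2
probs = multinomial_logit_probs(x, betas)
print(probs, sum(probs))
```

The ratio of any two category probabilities depends only on the difference of the corresponding indices, e.g. $P(Y = j|x)/P(Y = 0|x) = \exp(x^\top\beta_j)$.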
Ordered model
$$Y^* = x^\top\beta + \varepsilon .$$
assume thresholds $c_1, \ldots, c_{J-1}$ which divide the real axis into subintervals
$$Y = \begin{cases} 0 & \text{if } -\infty < Y^* \le 0, \\ 1 & \text{if } 0 < Y^* \le c_1, \\ 2 & \text{if } c_1 < Y^* \le c_2, \\ \vdots & \\ J & \text{if } c_{J-1} < Y^* < \infty \end{cases}$$
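Tobit Models
$$Y^* = X^\top\beta + \varepsilon$$
observe $Y$ only for positive utility
$$Y = Y^* \cdot I(Y^* > 0) = \begin{cases} 0 & \text{if } Y^* \le 0, \\ Y^* & \text{if } Y^* > 0. \end{cases}$$
$\Longrightarrow$ model which depends on the index $X^\top\beta$
$$Y = (X^\top\beta + \varepsilon) \cdot I(X^\top\beta + \varepsilon > 0).$$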
Sample Selection Models
(regression) $Y^* = X^\top\beta + \varepsilon$
(selection) $U^* = Z^\top\gamma + \delta$
$$Y = Y^* \cdot I(U^* > 0) = \begin{cases} 0 & \text{if } U^* \le 0, \\ Y^* & \text{if } U^* > 0. \end{cases}$$
Summary: Semiparametric Regression and GLM
The basis for many semiparametric regression models is the generalized linear model (GLM), which is given by
$$E(Y|X) = G(X^\top\beta) .$$
Here, $\beta$ denotes the parameter vector to be estimated and $G$ denotes a known link function. Prominent examples of this type of regression are binary choice models (logit or probit) or count data models (Poisson regression).
The GLM can be generalized in several ways: Considering an unknown smooth link function $g(\bullet)$ (instead of $G$) leads to the single index model (SIM). Assuming a nonparametric additive argument of $G$ leads to the generalized additive model (GAM), whereas a combination of additive linear and nonparametric components in the argument of $G$ gives a generalized partial linear model (GPLM) or generalized partial linear partial additive model (GAPLM). If there is no link function (or $G$ is the identity function) then we speak of additive models (AM), partial linear models (PLM) or additive partial linear models (APLM).
The estimation of the GLM is performed through an iterative algorithm. This algorithm, the iteratively reweighted least squares (IRLS) algorithm, applies weighted least squares to the adjusted dependent variable $Z$ in each iteration step:
$$\hat\beta^{new} = (X^\top W X)^{-1} X^\top W Z$$
This numerical approach needs to be appropriately modified for estimating the semiparametric modifications of the GLM.
Chapter 6:
Single Index Models
Single Index Models
aim: summarize all information of the regressors in one single number $X^\top\beta$
parametric single index model (GLM)
$$E(Y|X) = m(X) = G(X^\top\beta) \qquad (6.1)$$
semiparametric single index model (SIM)
$$E(Y|X) = m(X) = g(X^\top\beta) \qquad (6.2)$$
$\beta = (\beta_1, \ldots, \beta_p)^\top$ is the unknown parameter vector
$G(\bullet)$ is a known link function
$g(\bullet)$ is an unknown smooth link function
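Misspecified parametric SIM
Example: Binary Response Models
$$Y = \begin{cases} 1 & \text{if } Y^* = X^\top\beta - \varepsilon \ge 0, \\ 0 & \text{otherwise,} \end{cases}$$
M1: $\varepsilon \sim$ standard logistic, independent of $X$
$$E(Y|X) = G(X^\top\beta) = 1/\{1 + \exp(-X^\top\beta)\}$$
M2: $\varepsilon \sim N(0, 1)$, independent of $X$
$$E(Y|X) = G(X^\top\beta) = \Phi(X^\top\beta)$$
M3: heteroscedastic logit
$$E(Y|X) = G(X^\top\beta) = 1/[1 + \exp\{-X^\top\beta/h(X)\}]$$
where $h(x) = 0.25\,[1 + 2(x^\top\beta)^2 + 4(x^\top\beta)^4]$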
Parametric link functions
[Figure: curves labelled logit, probit and heteroskedastic; horizontal axis: index (xbeta); vertical axis: G(xbeta), from 0 to 1]
Figure 66: Link functions for M1, M2 and M3
SPMparametriclinkfunctions
results from Horowitz (1993):
Suppose $\beta_1 = 1$ and $\beta_2 = 1$ $\Longrightarrow$ $\beta_2/\beta_1 = 1$
$\tilde\beta_1$ and $\tilde\beta_2$ are probability limits of MLE

| True Model | Estimated Model | Max error of $E(Y \mid X)$ | $\tilde\beta_2/\tilde\beta_1$ |
|---|---|---|---|
| M1 | M1 | 0 | 1 |
| M2 | M1 | 0.02 | 1 |
| M3 | M1 | 0.37 | 0.45 |

$\Longrightarrow$ the misspecification error is big (small) if the true link is really different (not very different) from the link used for estimation
Estimation of Single Index Models
single index model (SIM)
$$E(Y|X) = m(X) = g(X^\top\beta) \qquad (6.3)$$
where
$\beta = (\beta_1, \ldots, \beta_p)^\top$ is a finite dimensional parameter vector
$g(\bullet)$ is a smooth link function
SIM Algorithm
estimate $\beta$ by $\hat\beta$
compute index values $\hat\eta = X^\top\hat\beta$
estimate the link function $g(\bullet)$ by using a (univariate) nonparametric method for the regression of $Y$ on $\hat\eta$
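The second and third step of the SIM algorithm can be sketched as follows, taking an estimate of $\beta$ as given. This is an illustration only: the function names are hypothetical, a Nadaraya-Watson estimator with Gaussian kernel is used as the univariate smoother, and the data are a noise-free toy example in which `beta_hat` stands in for an estimate obtained elsewhere.

```python
import math

def nadaraya_watson(v0, vs, ys, h):
    """Kernel regression estimate at v0 with a Gaussian kernel and bandwidth h."""
    weights = [math.exp(-0.5 * ((v0 - v) / h) ** 2) for v in vs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

def estimate_link(X, Y, beta_hat, h):
    """Compute the index values and return them with an estimator of the link g."""
    index = [sum(xk * bk for xk, bk in zip(x, beta_hat)) for x in X]
    return index, (lambda v: nadaraya_watson(v, index, Y, h))

# toy design on a symmetric grid, responses generated from a logistic-type link
X = [[a / 4.0, b / 4.0] for a in range(-4, 5) for b in range(-4, 5)]
beta_hat = [1.0, 0.5]                       # assumed to be given
Y = [1.0 / (1.0 + math.exp(-3.0 * (x[0] + 0.5 * x[1]))) for x in X]
index, g_hat = estimate_link(X, Y, beta_hat, h=0.2)
print(g_hat(-1.0), g_hat(0.0), g_hat(1.0))
```

Because only the index enters the smoother, the nonparametric step is one-dimensional regardless of the number of regressors.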
Identification Issues
consider
model (A) $E(Y|X) = g(X^\top\beta + c)$
model (B) $E(Y|X) = \tilde g(X^\top\beta)$
where
$$\tilde g(\bullet) = g(\bullet + c)$$
$\Longrightarrow$ an intercept in the model cannot be identified
consider
model (A) $E(Y|X) = g(X^\top\beta)$
model (B) $E(Y|X) = \tilde g(X^\top\tilde\beta)$
where
$$\tilde g(\bullet) = g\left(\frac{\bullet}{c}\right), \qquad \tilde\beta = c\,\beta$$
$\Longrightarrow$ $\beta$ is identified only up to a scale factor
Two Link Functions
[Figure: E(Y|X) = G(Index) and E(Y|X) = G(-Index) plotted against the index (range -3 to 4); vertical axis from 0 to 1]
Figure 67: Two link functions
Consider the latent variable
$$Y^* = v_\beta(X) - \varepsilon, \qquad \varepsilon = \varpi\{v_\beta(X)\} \cdot \zeta$$
$\varpi(\bullet)$ an unknown function
$\zeta$ standard logistic error term (independent of $X$)
After some calculation
$$P(Y = 1|X = x) = E(Y|X = x) = G_\zeta\left[\frac{v_\beta(x)}{\varpi\{v_\beta(x)\}}\right]$$
The regression function $E(Y|X)$ is unknown because $\varpi(\bullet)$ is unknown, although the functional form of the link $G_\zeta$ is known.
The resulting, not necessarily monotone, link function is:
$$g(\bullet) = G_\zeta\left[\frac{\bullet}{\varpi(\bullet)}\right].$$
To compare the effects of two particular explanatory variables, we can only use the ratio of their coefficients.
Estimation
Pseudo-Maximum Likelihood Estimation
use a nonparametric estimate of the link function $g(\bullet)$ (or its derivative) in the log-likelihood
(Weighted) Semiparametric Least Squares (SLS, WSLS)
Construct an objective function that estimates $\beta$ at $\sqrt{n}$-rate. The objective function uses the conditional distribution of $Y$ (or nonparametric estimates of it) and is minimized w.r.t. the parameter $\beta$.
Average Derivative Methods
direct estimation of the parametric coefficients via the average derivative of the regression function $m(\bullet)$
Related methods
nonlinear index functions
$$E(Y|X) = m(X) = g\{v_\beta(X)\}$$
multiple indices: Projection Pursuit, Sliced Inverse Regression
series expansion for $g$: Gallant and Nychka (1987) estimator
maximum score estimators: Horowitz (smoothed version), Manski (1985)
Average Derivatives (ADE)
assume
$$X = (T, U),$$
i.e. split the explanatory variables into a continuous part $T = (T_1, \ldots, T_q)^\top$ and a discrete part $U = (U_1, \ldots, U_p)^\top$
single index model with only continuous variables
$$E(Y|T) = m(T) = g(T^\top\gamma)$$
vector of average derivatives
$$\delta = E\{\nabla_m(T)\} = E\{g'(T^\top\gamma)\}\,\gamma$$
where
$$\nabla_m(t) = \left(\frac{\partial m(t)}{\partial t_1}, \ldots, \frac{\partial m(t)}{\partial t_q}\right)^\top$$
Transfer Derivative of m onto Density f
score vector
$$s(t) \stackrel{def}{=} -\nabla \log f(t) = -\frac{\nabla_f(t)}{f(t)}$$
consider
$$\delta = E\{\nabla_m(T)\} = \int \nabla_m(t)\, f(t)\, dt$$
$$= -\int \frac{\nabla_f(t)}{f(t)}\, f(t)\, m(t)\, dt \quad \text{(integration by parts)}$$
$$= \int s(t)\, m(t)\, f(t)\, dt = E\{s(T)\, m(T)\} = E\{s(T)\, Y\}$$
estimation by sample analog
$$\hat\delta = \frac{1}{n}\sum_{i=1}^n \hat s_H(T_i)\, Y_i, \qquad H = \text{bandwidth matrix}$$
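Score estimator with trimming
estimation of $s$
$$\hat s_H(t) = -\hat f_H(t)^{-1}\left(\frac{\partial \hat f_H(t)}{\partial t_1}, \ldots, \frac{\partial \hat f_H(t)}{\partial t_q}\right)^\top$$
$\hat f_H(t)$ is a (multivariate) density estimator and
$$\frac{\partial}{\partial t_k}\hat f_H(t) = \frac{1}{n\det(H)}\sum_{j=1}^n \frac{\partial}{\partial t_k} K_H(t - T_j), \qquad k = 1, \ldots, q$$
trimming away small values of $\hat f_H(\bullet)$ in the denominator
$$\hat\delta = n^{-1}\sum_{i=1}^n \hat s_H(T_i)\, Y_i\, \underbrace{I\{\hat f_H(T_i) > b_n\}}_{\text{trimming factor}}$$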
Asymptotic properties
$$\sqrt{n}(\hat\delta - \delta) \xrightarrow{L} N(0, \Sigma_\delta)$$
where $\Sigma_\delta$ is the covariance matrix of
$$s(T)\,Y + \{\nabla_m(T) - s(T)\,m(T)\}$$
$\Longrightarrow$ parametric rate of convergence
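Final estimators
$$\hat\delta = n^{-1}\sum_{i=1}^n \hat s_H(T_i)\, Y_i\, I\{\hat f_H(T_i) > b_n\}$$
$$\hat m_h(t) = \hat g_h(t^\top\hat\delta) = \frac{\sum_{i=1}^n K_h(t^\top\hat\delta - T_i^\top\hat\delta)\, Y_i}{\sum_{i=1}^n K_h(t^\top\hat\delta - T_i^\top\hat\delta)}$$
$\hat g_h(\bullet)$ converges for $h \sim n^{-1/5}$ at rate $n^{2/5}$, i.e. like a univariate kernel regression estimator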
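Weighted ADE
introduce weights instead of trimming
$$\delta = E\{\nabla_m(T)\, w(T)\} = E\{g'(T^\top\gamma)\, w(T)\}\,\gamma$$
density weighted ADE (WADE) means $w(t) = f(t)$
$$\delta = \int \nabla_m(t)\, f^2(t)\, dt = -2\int m(t)\, \nabla_f(t)\, f(t)\, dt = -2\,E\{Y\,\nabla_f(T)\}$$
$$\hat\delta = -\frac{2}{n}\sum_{i=1}^n Y_i \left(\frac{\partial \hat f_H(t)}{\partial t_1}, \ldots, \frac{\partial \hat f_H(t)}{\partial t_q}\right)^\top \Bigg|_{t = T_i}$$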
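The density weighted average derivative estimator can be sketched in a few lines. The code below is an illustration under simplifying assumptions that are not taken from the slides: a product Gaussian kernel, a scalar bandwidth (i.e. $H = h \cdot I$), and leave-one-out density derivative estimates. The function name `wade` is hypothetical.

```python
import math

def wade(T, Y, h):
    """Density weighted ADE: -(2/n) * sum_i Y_i * grad f_h(T_i), Gaussian product kernel."""
    n, q = len(T), len(T[0])
    norm = (2.0 * math.pi) ** (q / 2.0)
    delta = [0.0] * q
    for i in range(n):
        grad = [0.0] * q                     # unnormalized leave-one-out gradient of the density
        for j in range(n):
            if j == i:
                continue
            u = [(T[i][k] - T[j][k]) / h for k in range(q)]
            kern = math.exp(-0.5 * sum(v * v for v in u)) / norm
            for k in range(q):
                # derivative of K((t - T_j)/h) / h^q with respect to t_k
                grad[k] += -u[k] * kern / h ** (q + 1)
        for k in range(q):
            delta[k] += -2.0 * Y[i] * grad[k] / (n * (n - 1))
    return delta

# regular 5 x 5 design and a linear regression function m(t) = 2 t_1 + t_2
T = [[float(a), float(b)] for a in range(5) for b in range(5)]
Y = [2.0 * t[0] + 1.0 * t[1] for t in T]
delta_hat = wade(T, Y, h=1.0)
print(delta_hat, delta_hat[1] / delta_hat[0])
```

Since the index coefficients are identified only up to scale, the quantity to look at is the ratio of the components of `delta_hat`; for this symmetric design and linear $m$ the ratio reproduces the ratio of the true coefficients.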
Example: Unemployment after completion of an apprenticeship in West Germany

| | GLM (logit) | WADE h = 1 | WADE h = 1.25 | WADE h = 1.5 | WADE h = 1.75 | WADE h = 2 |
|---|---|---|---|---|---|---|
| constant | -5630 | | | | | |
| EARNINGS | -1.00 | -1.00 | -1.00 | -1.00 | -1.00 | -1.00 |
| CITY SIZE | -0.72 | -0.23 | -0.47 | -0.66 | -0.81 | -0.91 |
| DEGREE | -1.79 | -1.06 | -1.52 | -1.93 | -2.25 | -2.47 |
| URATE | 363.03 | 169.63 | 245.75 | 319.48 | 384.46 | 483.31 |

Table 11: WADE fit of unemployment data
Discrete Covariates
assume now continuous and discrete covariates
$$E(Y|T, U) = g(U^\top\beta + T^\top\gamma)$$
how to extend ADE/WADE in this case?
simplest case: binary $U$ (i.e. $U \in \{0, 1\}$)
$$E(Y|T, U) = g(T^\top\gamma) \quad \text{if } U = 0$$
$$E(Y|T, U) = g(T^\top\gamma + \beta) \quad \text{if } U = 1$$
[Figure: link functions F(t) and F(t + $\beta$) plotted against the index values t (range -3 to 9); the horizontal distance between the two curves is marked]
Figure 68: Coefficient of the variable U
Integral Estimator
split sample: the superscripts $(0)$ and $(1)$ denote the observations coming from the subsamples with $U_i = 0$ and $U_i = 1$; then use the integral difference as an estimator for $\beta$
$$\hat J^{(1)} - \hat J^{(0)}$$
[Figure: link functions F(t) and F(t + $\beta$) plotted against the index values t (range -3 to 9), with the region between the curves marked "Integral"]
Figure 69: The integral estimator
How to estimate the integrals?
$$\hat J^{(0)} = \sum_{i=0}^{n_0} Y_i^{(0)}\left(T_{i+1}^{(0)} - T_i^{(0)}\right)$$
$$\hat J^{(1)} = \sum_{i=0}^{n_1} Y_i^{(1)}\left(T_{i+1}^{(1)} - T_i^{(1)}\right)$$
the estimator is $\sqrt{n}$-consistent and can be improved for efficiency by a one-step estimator
Multicategorical U
thresholded link function
$$\tilde g = c_0\, I(g < c_0) + g\, I(c_0 \le g \le c_1) + c_1\, I(g > c_1).$$
integrated link function conditional on $u^{(k)}$
$$J^{(k)} = \int_{v_0}^{v_1} \tilde g(v + \beta^\top u^{(k)})\, dv,$$
$k = 1, \ldots, M$ categories of $U$
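Multiple integral comparisons
$$J^{(k)} - J^{(0)} = (c_1 - c_0)\left\{u^{(k)} - u^{(0)}\right\}^\top\beta,$$
or simply
$$\Delta J = (c_1 - c_0)\, \Delta u\, \beta$$
where
$$\Delta J = \begin{pmatrix} J^{(1)} - J^{(0)} \\ \vdots \\ J^{(M)} - J^{(0)} \end{pmatrix}, \qquad \Delta u = \begin{pmatrix} (u^{(1)} - u^{(0)})^\top \\ \vdots \\ (u^{(M)} - u^{(0)})^\top \end{pmatrix}$$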
Estimator for $\beta$
$$\hat\beta = (c_1 - c_0)^{-1}(\Delta u^\top \Delta u)^{-1}\Delta u^\top \Delta\hat J$$
replace $J^{(k)}$ by
$$\hat J^{(k)} = \int_{v_0}^{v_1} \hat{\tilde g}^{(k)}(v)\, dv$$
$\hat g^{(k)}$ is obtained by a univariate regression of $Y_i^{(k)}$ on the estimated continuous indices $\hat\gamma^\top T_i^{(k)}$
using a $\sqrt{n}$-consistent estimate $\hat\gamma$ and Nadaraya-Watson estimators $\hat g^{(k)}$ for $g^{(k)}$, the estimated coefficient $\hat\beta$ is itself $\sqrt{n}$-consistent and asymptotically normal
Testing model (mis-)specification
aim: check if the parametric model is appropriate by comparing it with a non- or semiparametric model
Härdle-Mammen test (parametric vs. nonparametric)
$$Y_i = m(X_i) + \varepsilon_i, \qquad X_i \in \mathbb{R}^d$$
notation
$$\hat m_h(\bullet) = \frac{\sum_{i=1}^n K_h(\bullet - X_i)\, Y_i}{\sum_{i=1}^n K_h(\bullet - X_i)}, \qquad \mathcal{K}_{h,n}\, g(\bullet) = \frac{\sum_{i=1}^n K_h(\bullet - X_i)\, g(X_i)}{\sum_{i=1}^n K_h(\bullet - X_i)}$$
Test
$$H_0: m \in \{m_\theta : \theta \in \Theta\}$$
test statistic (with weight function $w$)
$$T_n = n h^{d/2} \int \left\{\hat m_h(x) - \mathcal{K}_{h,n}\, m_{\hat\theta}(x)\right\}^2 w(x)\, dx$$
$h \sim n^{-1/(d+4)}$ $\Longrightarrow$ asymptotically normally distributed
BUT with bias and variance containing unknown expressions
$\Longrightarrow$ use wild bootstrap
Horowitz-Härdle test / Werwatz test (GLM vs. SIM)
$$E(Y|X) = G\{v_\beta(X)\}, \qquad X \in \mathbb{R}^d, \; \beta \in \mathbb{R}^k$$
idea: compare $E\{Y|v_\beta(X) = v\}$ (nonparametrically estimated) with $G(v)$ ($G$ known link)
$$H_0: E\{Y|v_\beta(X) = v\} = G(v)$$
$$H_1: E\{Y|v_\beta(X) = v\} = \text{smooth function in } v$$
test statistic
$$T = \sqrt{h}\sum_{i=1}^n \left[Y_i - G\{v_{\hat\beta}(X_i)\}\right]\left[\hat G_{-i}\{v_{\hat\beta}(X_i)\} - G\{v_{\hat\beta}(X_i)\}\right] w\{v_{\hat\beta}(X_i)\}$$
Nonparametric Estimator for the Link
$$\hat G_{-i}(\bullet) = \frac{\hat G_{h,-i}(\bullet) - (h/\tilde h)^d\, \hat G_{\tilde h,-i}(\bullet)}{1 - (h/\tilde h)^d}$$
where $\hat G_{h,-i}$ is a (Bierens-corrected) leave-one-out estimator with bandwidth $h$
$$h \sim n^{-1/(2d+1)}, \qquad \tilde h \sim n^{-\delta/(2d+1)}, \qquad 0 < \delta < 1$$
$\hat G_{-i}$ is asymptotically unbiased with optimal rate
$T$ is asymptotically $N(0, \sigma_T^2)$
Summary: Single Index Models
A single index model (SIM) is of the form
$$E(Y|X) = m(X) = g\{v_\beta(X)\},$$
where $v_\beta(\bullet)$ is an index function that is known up to the parameter vector $\beta$, and $g(\bullet)$ is an unknown smooth link function. In most applications the index function is of linear form, i.e., $v_\beta(x) = x^\top\beta$.
Due to the nonparametric form of the link function, neither an intercept nor a scale parameter can be identified. For example, if the index is linear, we have to estimate
$$E(Y|X) = g(X^\top\beta).$$
There is no intercept parameter and $\beta$ can only be estimated up to an unknown scale factor. To identify the slope parameter of interest, $\beta$ is usually assumed to have one component identical to 1 or to be a vector of length 1.
The estimation of a SIM usually proceeds in the following steps: First the parameter $\beta$ is estimated. Then, using the index values $\hat\eta_i = X_i^\top\hat\beta$, the nonparametric link function $g$ is estimated by a univariate nonparametric regression method.
For the estimation of $\beta$, two approaches are available: iterative methods such as semiparametric least squares (SLS) or pseudo-maximum likelihood estimation (PMLE), and direct methods such as (weighted) average derivative estimation (WADE/ADE). Iterative methods can easily handle mixed discrete-continuous regressors but may need to employ sophisticated routines to deal with possible local optima of the optimization criterion. Direct methods such as WADE/ADE avoid the technical difficulties of an optimization but require continuous explanatory variables. An extension of the direct approach is only possible for a small number of additional discrete variables.
There are specification tests available to test whether we have a GLM or a true SIM.
Chapter 7:
Generalized Partial Linear Models
Generalized Partial Linear Models
PLM
$$E(Y|U, T) = U^\top\beta + m(T)$$
GPLM
$$E(Y|U, T) = G\{U^\top\beta + m(T)\}$$
$\beta = (\beta_1, \ldots, \beta_p)^\top$ is a finite dimensional parameter
$m(\bullet)$ is a smooth function
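Partial Linear Models (PLM)
$$Y = \beta^\top U + m(T) + \varepsilon$$
expectations conditional on $T$
$$E(Y|T) = \beta^\top E(U|T) + E\{m(T)|T\} + E(\varepsilon|T)$$
difference
$$\underbrace{Y - E(Y|T)}_{\tilde Y} = \beta^\top \underbrace{\{U - E(U|T)\}}_{\tilde U} + \underbrace{\varepsilon - E(\varepsilon|T)}_{\tilde\varepsilon}$$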
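Speckman estimators
for the parameter $\beta$: linear regression of $\tilde Y_i$ on $\tilde U_i$
$$\hat\beta = (\tilde U^\top\tilde U)^{-1}\tilde U^\top\tilde Y$$
for the function $m(\bullet)$: nonparametric regression of $Y_i - U_i^\top\hat\beta$ on $T_i$
$$\hat m = S(Y - U\hat\beta)$$
where the smoother matrix $S$ is defined by
$$(S)_{ij} = \frac{\mathcal{K}_H(T_i - T_j)}{\sum_i \mathcal{K}_H(T_i - T_j)}$$
and $\tilde U = U - SU$, $\tilde Y = Y - SY$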
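The two steps of the Speckman estimator can be sketched for the simplest case of one linear regressor $U$ and a scalar $T$. This is an illustration only: the function names and the toy data are hypothetical, a Gaussian kernel is used, and the smoother weights are normalized so that each row of the weight matrix sums to one (Nadaraya-Watson weights).

```python
import math

def smoother_matrix(T, h):
    """Row-normalized Gaussian kernel weights: S[i][j] is the weight of observation j at T[i]."""
    S = []
    for ti in T:
        w = [math.exp(-0.5 * ((ti - tj) / h) ** 2) for tj in T]
        total = sum(w)
        S.append([wj / total for wj in w])
    return S

def speckman_plm(U, T, Y, h):
    """Speckman estimator for Y = beta * U + m(T) + eps with a single linear regressor U."""
    S = smoother_matrix(T, h)
    smooth = lambda v: [sum(s * vj for s, vj in zip(row, v)) for row in S]
    SU, SY = smooth(U), smooth(Y)
    U_t = [u - su for u, su in zip(U, SU)]               # U tilde = U - S U
    Y_t = [y - sy for y, sy in zip(Y, SY)]               # Y tilde = Y - S Y
    beta = sum(u * y for u, y in zip(U_t, Y_t)) / sum(u * u for u in U_t)
    m_hat = smooth([y - beta * u for y, u in zip(Y, U)])  # m = S (Y - U beta)
    return beta, m_hat

# toy data: beta = 2 and m(t) = sin(2 pi t), no noise
T = [i / 49.0 for i in range(50)]
U = [((7 * i) % 10) / 10.0 for i in range(50)]
Y = [2.0 * u + math.sin(2.0 * math.pi * t) for u, t in zip(U, T)]
beta_hat, m_hat = speckman_plm(U, T, Y, h=0.05)
print(beta_hat)
```

For several linear regressors the scalar ratio in `speckman_plm` becomes the least squares solution $(\tilde U^\top\tilde U)^{-1}\tilde U^\top\tilde Y$.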
Backfitting
denote by $P$ the projection matrix $P = U(U^\top U)^{-1}U^\top$ and by $S$ the smoother matrix
backfitting means to solve
$$U\beta = P(Y - m)$$
$$m = S(Y - U\beta).$$
explicit solution is given by
$$\hat\beta = \{U^\top(I - S)U\}^{-1}U^\top(I - S)Y,$$
$$\hat m = S(Y - U\hat\beta).$$
Introducing a link function
possible estimation methods for GPLM
profile likelihood
generalized Speckman estimator
backfitting for GPLM
Profile Likelihood
fix the parameter $\beta$ and estimate the nonparametric function in dependence of this fixed $\beta$
the resulting estimate for $m_\beta(\bullet)$ is then used to construct the profile likelihood for $\beta$
as a consequence of the profile likelihood method, $\hat\beta$ is asymptotically efficient
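Details on Profile likelihood
requires: the distribution of $Y$ given $U$, $T$ is parametric in $\beta$, $m(T)$
least favorable curve $m_\beta(\bullet)$
$$\max_{m_\beta(t)} E\big[\underbrace{\ell\{Y, U^\top\beta + m_\beta(t), \psi\}}_{\text{likelihood}} \,\big|\, T = t\big]$$
conditional expectation $\Longrightarrow$ can be estimated by
$$\max_{m_\beta(t)} \sum_i \ell\{Y_i, U_i^\top\beta + m_\beta(t), \psi\}\, \mathcal{K}_H(T_i - t) \qquad (7.1)$$
profile likelihood estimator for $\beta$
$$\max_\beta \sum_i \ell_i\{U_i^\top\beta + m_\beta(T_i)\}$$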
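denote by $\ell_i$ the individual log-likelihood for observation $i$ and by $\ell_i'$, $\ell_i''$ its first and second derivatives w.r.t. $\eta_i = U_i^\top\beta + m(T_i)$
Example: Profile likelihood for normal PLM
$$Y_i = U_i^\top\beta + m(T_i) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \text{ i.i.d.}$$
likelihood
$$\ell_i\{U_i^\top\beta + m(T_i)\} = -\frac{1}{2\sigma^2}\{Y_i - U_i^\top\beta - m(T_i)\}^2 + \ldots$$
least favorable curve
$$\max_{m_\beta(t)} \; -\sum_i \{Y_i - U_i^\top\beta - m_\beta(t)\}^2\, \mathcal{K}_H(t - T_i) \qquad (7.2)$$
$$m_\beta(t) = \frac{\sum_i \mathcal{K}_H(t - T_i)\,(Y_i - U_i^\top\beta)}{\sum_i \mathcal{K}_H(t - T_i)}$$
$$m_\beta = S(Y - U\beta)$$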
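profile likelihood estimator for $\beta$
$$\max_\beta \sum_i \ell_i\{U_i^\top\beta + m_\beta(T_i)\}$$
$\Longrightarrow$ need to solve
$$0 = \sum_{i=1}^n \ell_i'\{U_i^\top\beta + m_\beta(T_i)\}\,\{U_i + \nabla_{m_\beta}(T_i)\}$$
where $\nabla_{m_\beta}$ denotes the gradient of $m_\beta$ with respect to $\beta$
finding the maximum in (7.1) resp. (7.2) means to solve
$$0 = \sum_{i=1}^n \ell_i'\{U_i^\top\beta + m_\beta(t)\}\, \mathcal{K}_H(t - T_i) \qquad (7.3)$$
taking the derivative of (7.3) with respect to $\beta$ gives
$$\nabla_{m_\beta}(t) = -\frac{\sum_{i=1}^n \mathcal{K}_H(t - T_i)\, \ell_i''\{U_i^\top\beta + m_\beta(t)\}\, U_i}{\sum_{i=1}^n \mathcal{K}_H(t - T_i)\, \ell_i''\{U_i^\top\beta + m_\beta(t)\}}$$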
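for the PLM we have $\ell_i'' \equiv -1$ (up to the constant factor $1/\sigma^2$), hence at $t = T_i$
$$U_i + \nabla_{m_\beta}(T_i) = U_i - \frac{\sum_{j=1}^n \mathcal{K}_H(T_i - T_j)\, U_j}{\sum_{j=1}^n \mathcal{K}_H(T_i - T_j)} = \tilde U_i = [(I - S)U]_i$$
$\Longrightarrow$ need to solve
$$0 = \sum_{i=1}^n \{Y_i - U_i^\top\beta - m_\beta(T_i)\}\, [(I - S)U]_i = \tilde U^\top(\tilde Y - \tilde U\beta)$$
since
$$Y_i - U_i^\top\beta - m_\beta(T_i) = [Y - U\beta - S(Y - U\beta)]_i = [(I - S)(Y - U\beta)]_i$$
$\Longrightarrow$ Speckman solution
$$\hat\beta = (\tilde U^\top\tilde U)^{-1}\tilde U^\top\tilde Y$$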
Profile Likelihood Algorithm (P)
maximize usual likelihood
$$0 = \sum_{i=1}^n \ell_i'\{U_i^\top\beta + m_\beta(T_i)\}\,\{U_i + \nabla_{m_\beta}(T_i)\}$$
maximize smoothed quasi-likelihood
$$0 = \sum_{i=1}^n \ell_i'\{U_i^\top\beta + m_\beta(T_j)\}\, \mathcal{K}_H(T_i - T_j)$$
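Algorithm (P)
updating step for $\beta$
$$\beta^{new} = \beta - B^{-1}\sum_{i=1}^n \ell_i'(U_i^\top\beta + m_i)\, \tilde U_i$$
$$B = \sum_{i=1}^n \ell_i''(U_i^\top\beta + m_i)\, \tilde U_i \tilde U_i^\top$$
$$\tilde U_j = U_j - \frac{\sum_{i=1}^n \ell_i''(U_i^\top\beta + m_j)\, \mathcal{K}_H(T_i - T_j)\, U_i}{\sum_{i=1}^n \ell_i''(U_i^\top\beta + m_j)\, \mathcal{K}_H(T_i - T_j)} .$$
updating step for $m_j$
$$m_j^{new} = m_j - \frac{\sum_{i=1}^n \ell_i'(U_i^\top\beta + m_j)\, \mathcal{K}_H(T_i - T_j)}{\sum_{i=1}^n \ell_i''(U_i^\top\beta + m_j)\, \mathcal{K}_H(T_i - T_j)} .$$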
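define the smoother matrix $S^P$ by its elements
$$(S^P)_{ij} = \frac{\ell_i''(U_i^\top\beta + m_j)\, \mathcal{K}_H(T_i - T_j)}{\sum_{i=1}^n \ell_i''(U_i^\top\beta + m_j)\, \mathcal{K}_H(T_i - T_j)}$$
Algorithm (P)
updating step for $\beta$
$$\beta^{new} = (\tilde U^\top W \tilde U)^{-1}\tilde U^\top W \tilde Z$$
with
$$\tilde U = (I - S^P)U, \qquad \tilde Z = \tilde U\beta - W^{-1}v.$$
$U$ design, $I$ identity, $v = (\ell_i')$, $W = \mathrm{diag}(\ell_i'')$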
Backfitting Algorithm (B)
generalized partial linear model
$$E(Y|U, T) = G\{U^\top\beta + m(T)\}$$
$\Longrightarrow$ backfitting for the adjusted dependent variable
use now the smoother matrix $S$ defined through
$$(S)_{ij} = \frac{\ell_i''(U_i^\top\beta + m_i)\, \mathcal{K}_H(T_i - T_j)}{\sum_{i=1}^n \ell_i''(U_i^\top\beta + m_i)\, \mathcal{K}_H(T_i - T_j)}$$
Algorithm (B)
updating step for $\beta$
$$\beta^{new} = (U^\top W \tilde U)^{-1}U^\top W \tilde Z,$$
updating step for $m$
$$m^{new} = S(Z - U\beta),$$
using the notations
$$\tilde U = (I - S)U, \qquad \tilde Z = (I - S)Z = \tilde U\beta - W^{-1}v.$$
$U$ design, $I$ identity, $v = (\ell_i')$, $W = \mathrm{diag}(\ell_i'')$
note that the update of the index $U\beta + m$ can be expressed by a linear estimation matrix $R^B$:
$$U\beta^{new} + m^{new} = R^B Z$$
with
$$R^B = \tilde U\{U^\top W \tilde U\}^{-1}U^\top W(I - S) + S.$$
Generalized Speckman Estimator (S)
generalized partial linear model
$$E(Y|U, T) = G\{U^\top\beta + m(T)\}$$
$\Longrightarrow$ Speckman approach for the adjusted dependent variable
recall the smoother matrix from backfitting with elements
$$(S)_{ij} = \frac{\ell_i''(U_i^\top\beta + m_i)\, \mathcal{K}_H(T_i - T_j)}{\sum_{i=1}^n \ell_i''(U_i^\top\beta + m_i)\, \mathcal{K}_H(T_i - T_j)}$$
Algorithm (S)
updating step for $\beta$
$$\beta^{new} = (\tilde U^\top W \tilde U)^{-1}\tilde U^\top W \tilde Z,$$
updating step for $m$
$$m^{new} = S(Z - U\beta)$$
using the notations
$$\tilde U = (I - S)U, \qquad \tilde Z = (I - S)Z = \tilde U\beta - W^{-1}v.$$
$U$ design, $I$ identity, $v = (\ell_i')$, $W = \mathrm{diag}(\ell_i'')$
the generalized Speckman estimator (S) shares the property of being linear in the variable $Z$:
$$U\beta^{new} + m^{new} = R^S Z$$
with
$$R^S = \tilde U\{\tilde U^\top W \tilde U\}^{-1}\tilde U^\top W(I - S) + S$$
note that here, in contrast to (B), $\tilde U$ is always used
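Comparison of (P), (S), (B)
(P)
$$(S^P)_{ij} = \frac{\ell_i''(U_i^\top\beta + m_j)\, K_H(T_i - T_j)}{\sum_{i=1}^n \ell_i''(U_i^\top\beta + m_j)\, K_H(T_i - T_j)}$$
$$\beta^{new} = (\tilde U^\top W \tilde U)^{-1}\tilde U^\top W \tilde Z, \qquad m^{new} = \ldots$$
(S)
$$(S)_{ij} = \frac{\ell_i''(U_i^\top\beta + m_i)\, K_H(T_i - T_j)}{\sum_{i=1}^n \ell_i''(U_i^\top\beta + m_i)\, K_H(T_i - T_j)}$$
$$\beta^{new} = (\tilde U^\top W \tilde U)^{-1}\tilde U^\top W \tilde Z, \qquad m^{new} = S(Z - U\beta)$$
(B)
$$(S)_{ij} = \frac{\ell_i''(U_i^\top\beta + m_i)\, K_H(T_i - T_j)}{\sum_{i=1}^n \ell_i''(U_i^\top\beta + m_i)\, K_H(T_i - T_j)}$$
$$\beta^{new} = (U^\top W \tilde U)^{-1}U^\top W \tilde Z, \qquad m^{new} = S(Z - U\beta)$$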
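Testing parametric versus semiparametric
likelihood ratio (LR) test statistic
$$LR = 2\sum_{i=1}^n \left\{\ell(Y_i, \hat\mu_i, \hat\psi) - \ell(Y_i, \tilde\mu_i, \tilde\psi)\right\} = 2\left\{\ell(Y, \hat\mu, \hat\psi) - \ell(Y, \tilde\mu, \tilde\psi)\right\}$$
semiparametric: $\hat\mu_i = G\{U_i^\top\hat\beta + \hat m(T_i)\}$
parametric: $\tilde\mu_i = G\{U_i^\top\tilde\beta + T_i^\top\tilde\gamma + \tilde\gamma_0\}$
alternatively use the deviance
$$D(Y, \mu, \psi) = 2\left\{\ell(Y, \mu^{max}, \psi) - \ell(Y, \mu, \psi)\right\}$$
$$\Longrightarrow LR = D(Y, \tilde\mu, \tilde\psi) - D(Y, \hat\mu, \hat\psi)$$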
if at convergence of the iterative estimation
$$\hat\eta = RZ = R(\hat\eta - W^{-1}v)$$
then
$$D(Y, \hat\mu) = a(\psi)\, D(Y, \hat\mu, \psi) \approx (Z - \hat\eta)^\top W^{-1}(Z - \hat\eta)$$
we need approximate degrees of freedom for the semiparametric estimator
$$df^{err}(\hat\mu) = n - \mathrm{tr}\left(2R - R^\top W R W^{-1}\right)$$
or simpler
$$df^{err}(\hat\mu) = n - \mathrm{tr}(R)$$
for backfitting (B) and algorithm (S) we have
$$\hat\eta = RZ$$
for profile likelihood (P) we can use approximately
$$R^P = \tilde U\{\tilde U^\top W \tilde U\}^{-1}\tilde U^\top W(I - S^P) + S$$
where $\tilde U$ denotes $(I - S^P)U$
Modified LR test
problem: under the parametric hypothesis, the parametric estimate is unbiased whereas the non-/semiparametric estimates are always biased
$\Longrightarrow$ use bias-adjusted parametric estimate
m(T
j
)
from
G(U

i

+ T

i
+
0
), U
i
, T
i
, i = 1, . . . , n
modied LR test statistic

LR = 2
n

i =1
_
(
i
,
i
,

) (
i
,
i
,

)
_
= 2
_
( mu, mu,

) (mu, mu,

)
_
SPM
Generalized Partial Linear Models 7-26
asymptotically equivalent

LR =
n

i =1
w
i
_
U

i
(

) + m(T
i
) m(T
i
)
_
2
with
w
i
=
[G

i

+ m(T
i
)]
2
V[GU

i

+ m(T
i
)]
.
Theorem
Under the linearity hypothesis holds:


LR =

LR + o
p
(v
n
),
v
1
n
(

LR b
n
)
L
N(0, 1)
SPM
Generalized Partial Linear Models 7-27
Bootstrapping the modied LR test
Theorem
It holds
d
K
(

LR

,

LR)
L
0
where d
K
denotes the Kolmogorov distance.
bootstrap algorithm
(a) generate n
boot
samples Y

1
, . . . , Y

n
with
E

(Y

i
) =
i
= G(U

i

+ T

i
+
0
)
Var

(Y

i
) = a(

) V(
i
)
(b) calculate estimates based on each of the bootstrap samples and
from these the bootstrap test statistics

LR

(c) the critical values (the quantiles of the distribution of



LR) are then
estimated by the empirical quantiles of the n
boot

LR

values
SPM
Generalized Partial Linear Models 7-28
Comparison of methods
backfitting (B) is best under independence; (P), (S) seem better otherwise
for large $n$: (P) $\approx$ (S)
in testing parametric versus nonparametric, (P) and (S) work well with approximate degrees of freedom; bootstrapping the modified LR statistic gives a further improvement
(S) seems a good compromise between accuracy and computational efficiency in estimation and specification testing
Example: Migration

| | | Yes | No (in %) |
|---|---|---|---|
| $Y$ | migration intention | 38.5 | 61.5 |
| $U_1$ | family/friends in west | 85.6 | 11.2 |
| $U_2$ | unemployed/job loss certain | 19.7 | 78.9 |
| $U_3$ | city size 10,000-100,000 | 29.3 | 64.2 |
| $U_4$ | female | 51.1 | 49.8 |

| | | Min | Max | Mean | S.D. |
|---|---|---|---|---|---|
| $U_5$ | age (years) | 18 | 65 | 39.84 | 12.61 |
| $T$ | household income (DM) | 200 | 4000 | 2194.30 | 752.45 |

Table 12: Descriptive statistics for migration data, n = 3235
[Figure: four panels (h = 10%, h = 20%, h = 30%, h = 40%); horizontal axis: household income (1000 to 4000); vertical axis: m(household income)]
Figure 70: The impact of household income on migration intention, nonparametric, linear and bias-adjusted parametric estimate, bandwidths h = 10%, 20%, 30%, 40% of the range
| | Linear (logit) Coeff. | (t-value) | Part. Linear Coeff. | (t-value) |
|---|---|---|---|---|
| const. | 0.512 | (2.39) | | |
| family/friends in west | 0.599 | (5.20) | 0.598 | (5.14) |
| unemployed/job loss | 0.221 | (2.31) | 0.230 | (2.39) |
| city size 10-100,000 | 0.311 | (3.77) | 0.302 | (3.63) |
| female | -0.240 | (3.15) | -0.249 | (3.26) |
| age | $-4.69 \cdot 10^{-2}$ | (14.56) | $-4.74 \cdot 10^{-2}$ | (14.59) |
| household income | $1.42 \cdot 10^{-4}$ | (2.73) | | |

Table 13: Logit coefficients and GPLM coefficients for migration data (t-values in parentheses), h = 20% for the GPLM
testing GLM vs. GPLM
h 10% 20% 30% 40%
R 0.001 0.001 0.116 0.516

R 0.001 0.002 0.109 0.488


Table 14: Observed significance levels for linearity test for migration data, normal approximation of the quantiles
bootstrap approximation of quantiles: all computed significance levels for rejection are below 0.01
SPM
Generalized Partial Linear Models 7-33
[Figure: four panels (h = 10%, h = 20%, h = 30%, h = 40%); horizontal axis: test statistic R; vertical axis: density estimate for R*, normal density]
Figure 71: Density estimates of the bootstrapped LR statistic (thick) and limiting normal distribution (thin), $n_{boot} = 200$ bootstrap replications, bandwidths h = 10%, 20%, 30%, 40% of the range
Example: Credit Scoring
$$Y = \begin{cases} 1 & \text{loan is repaid} \\ 0 & \text{loan has defaulted} \end{cases}$$
n = 1000, 300 of them defaulted
the data set contains observations from three continuous variables (duration and amount of credit, age of client) and 17 discrete variables
| | Linear Coeff. | (t-val.) | Quadratic Coeff. | (t-val.) | Part. Linear Coeff. | (t-val.) |
|---|---|---|---|---|---|---|
| constant | -17.605 | (-1.91) | -34.909 | (-2.51) | | |
| duration | -0.036 | (-3.85) | -0.033 | (-3.48) | -0.037 | (-4.23) |
| log(amount) | 1.654 | (1.41) | 4.847 | (2.57) | | |
| log(amount) square | | | -0.229 | (-2.26) | | |
| log(age) | 4.119 | (1.59) | 6.949 | (1.11) | | |
| log(age) square | | | -0.501 | (-0.60) | | |
| log(age) $\cdot$ log(amount) | -0.484 | (-1.47) | -0.384 | (-1.17) | | |
| . . . | . . . | . . . | . . . | . . . | . . . | . . . |

Table 15: Parametric logit and GPLM coefficients (t-values in parentheses), bandwidths are 40% of range
[Figure: left panel "Influence: amount & age", surface of the influence over log(amount) and log(age); right panel "Scatterplot: amount & age", log(age) against log(amount)]
Figure 72: Two-dimensional nonparametric function of amount and age (left), bandwidths are 40% of range; scatterplot of amount and age (right)
[Figure: "Contours: influence of amount & age"; horizontal axis: log(amount); vertical axis: log(age)]
Figure 73: Contours for function of amount and age, bandwidths are 40% of the range
testing GLM vs. GPLM

| h | 20% | 30% | 40% | 50% | 60% |
|---|---|---|---|---|---|
| linear | <0.01 | <0.01 | <0.01 | 0.01 | 0.29 |
| linear & interaction | <0.01 | <0.01 | <0.01 | 0.07 | 0.40 |
| quadratic | <0.01 | <0.01 | <0.01 | 0.35 | 0.55 |

Table 16: Observed significance levels for test of GLM against GPLM, bootstrap sample size $n_{boot} = 100$
Summary: Generalized Partial Linear
Models
A partial linear model (PLM) is given by
E(Y[X) = X

+ m(T),
where is an unknown parameter vector and m() is an
unknown smooth function of a multidimensional argument T.
A generalized partial linear model (GPLM) is of the form
E(Y[X) = GX

+ m(T),
where G is a know link function, is an unknown parameter
vector and m() is an unknown smooth function of a
multidimensional argument T.
SPM
Generalized Partial Linear Models 7-40
Partial linear models are usually estimated by Speckman's estimator. This estimator first determines the parametric component by applying an OLS estimator to a nonparametrically modified design matrix and response vector. In a second step the nonparametric component is estimated by smoothing the residuals with respect to the parametric part.
Generalized partial linear models should be estimated by the profile likelihood method or by a generalized Speckman estimator.
The profile likelihood approach is based on the fact that the conditional distribution of $Y$ given $U$ and $T$ is parametric. Its idea is to estimate the least favorable nonparametric function $m_\beta(\bullet)$ in dependence of $\beta$. The resulting estimate for $m_\beta(\bullet)$ is then used to construct the profile likelihood for $\beta$.
Generalized Partial Linear Models 7-41
The generalized Speckman estimator can be seen as a simplification of the profile likelihood method. It is based on a combination of a parametric IRLS estimator (applied to a nonparametrically modified design matrix and response vector) and a nonparametric smoothing method (applied to the adjusted dependent variable reduced by its parametric component).
To check whether the underlying true model is a parametric GLM or a semiparametric GPLM, one can use specification tests that are modifications of the classical likelihood ratio test. In the semiparametric setting, either an approximate number of degrees of freedom is used or the test statistic itself is modified such that bootstrapping its distribution leads to appropriate critical values.
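The two-step Speckman estimator for the PLM summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a scalar linear covariate $U$, a scalar $T$, a Nadaraya-Watson smoother with Quartic kernel as the smoother matrix $S$, and simulated data; the names `smooth` and `speckman_plm`, the bandwidth and the data-generating process are all hypothetical choices.

```python
import math
import random

def quartic(u):
    """Quartic kernel, supported on [-1, 1]."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def smooth(t, v, h):
    """Nadaraya-Watson smooth of v on t, evaluated at the sample points (S v)."""
    out = []
    for t0 in t:
        w = [quartic((ti - t0) / h) for ti in t]
        out.append(sum(wi * vi for wi, vi in zip(w, v)) / sum(w))
    return out

def speckman_plm(u, t, y, h):
    """Speckman-type fit of Y = beta * U + m(T) + error for scalar U and T."""
    su, sy = smooth(t, u, h), smooth(t, y, h)
    u_tilde = [a - b for a, b in zip(u, su)]  # modified design, (I - S) U
    y_tilde = [a - b for a, b in zip(y, sy)]  # modified response, (I - S) Y
    # step 1: OLS of the modified response on the modified design
    beta = sum(a * b for a, b in zip(u_tilde, y_tilde)) / sum(a * a for a in u_tilde)
    # step 2: smooth the residuals w.r.t. the parametric part
    m_hat = smooth(t, [yi - beta * ui for yi, ui in zip(y, u)], h)
    return beta, m_hat

# simulated example (assumed design): true beta = 1.5, m(t) = sin(2 pi t)
random.seed(1)
n = 200
t = [random.uniform(0.0, 1.0) for _ in range(n)]
u = [ti + random.gauss(0.0, 1.0) for ti in t]
y = [1.5 * ui + math.sin(2.0 * math.pi * ti) + random.gauss(0.0, 0.3)
     for ui, ti in zip(u, t)]
beta_hat, m_hat = speckman_plm(u, t, y, h=0.15)
print(round(beta_hat, 3))
```

With a vector-valued $U$, step 1 becomes an ordinary least squares problem in the smoothed-out design matrix; the scalar case is used here only to keep the sketch free of matrix code.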
Additive Models 8-1
Chapter 8:
Additive Models
Additive Models 8-2
Additive Models
to avoid the curse of dimensionality and for better interpretability we assume
$$m(x) = E(Y \mid X = x) = c + \sum_{j=1}^{d} g_j(x_j)$$
$\Longrightarrow$ the additive functions $g_j$ can be estimated with the optimal one-dimensional rate
Additive Models 8-3
two possible methods for estimating an additive model:
backfitting estimator
marginal integration estimator
identification conditions for both methods
$$E_{X_j}\{g_j(X_j)\} = 0, \quad j = 1, \ldots, d \quad \Longrightarrow \quad E(Y) = c$$
Additive Models 8-4
Backfitting
consider the following optimization problem
$$\min_{m} E\{Y - m(X)\}^2 \quad \text{such that} \quad m(x) = \sum_{j=1}^{d} g_j(x_j)$$
we aim to find the best projection of $Y$ onto the space spanned by additive univariate smooth functions
Additive Models 8-5
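formulation in Hilbert space framework:
let $\mathcal{H}_{YX}$ be the Hilbert space of random variables which are functions of $Y$, $X$
let $\langle U, V \rangle = E(UV)$ be the scalar product
define $\mathcal{H}_X$ and $\mathcal{H}_{X_j}$, $j = 1, \ldots, d$, as the corresponding subspaces
$\Longrightarrow$ we aim to find the element of $\mathcal{H}_{X_1} \oplus \cdots \oplus \mathcal{H}_{X_d}$ closest to $Y \in \mathcal{H}_{YX}$ or $m \in \mathcal{H}_X$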
Additive Models 8-6
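by the projection theorem, there exists a unique solution with
$$E[\{Y - m(X)\} \mid X_\alpha] = 0 \iff g_\alpha(X_\alpha) = E\Big[\Big\{Y - \sum_{j \neq \alpha} g_j(X_j)\Big\} \,\Big|\, X_\alpha\Big], \quad \alpha = 1, \ldots, d$$
denote projection $P_\alpha(\bullet) = E(\bullet \mid X_\alpha)$
$$\Longrightarrow \begin{pmatrix} I & P_1 & \cdots & P_1 \\ P_2 & I & \cdots & P_2 \\ \vdots & & \ddots & \vdots \\ P_d & \cdots & P_d & I \end{pmatrix} \begin{pmatrix} g_1(X_1) \\ g_2(X_2) \\ \vdots \\ g_d(X_d) \end{pmatrix} = \begin{pmatrix} P_1 Y \\ P_2 Y \\ \vdots \\ P_d Y \end{pmatrix}$$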
Additive Models 8-7
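denote by $S_\alpha$ the $(n \times n)$ smoother matrix such that $S_\alpha Y$ is an estimate of the vector $\{E(Y_1 \mid X_{1\alpha}), \ldots, E(Y_n \mid X_{n\alpha})\}^\top$
$$\Longrightarrow \underbrace{\begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & & \ddots & \vdots \\ S_d & \cdots & S_d & I \end{pmatrix}}_{nd \times nd} \begin{pmatrix} \boldsymbol{g}_1 \\ \boldsymbol{g}_2 \\ \vdots \\ \boldsymbol{g}_d \end{pmatrix} = \begin{pmatrix} S_1 Y \\ S_2 Y \\ \vdots \\ S_d Y \end{pmatrix}$$
note: in finite samples the matrix on the left side can be singular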
Additive Models 8-8
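Backfitting algorithm
in practice, the following backfitting algorithm (a simplification of the Gauss-Seidel procedure) is used:
initialize $\hat g_\alpha^{(0)} \equiv 0$ for all $\alpha$, $\hat c = \bar Y$
repeat for $\alpha = 1, \ldots, d$
$$r_\alpha = Y - \hat c - \sum_{j=1}^{\alpha-1} \hat g_j^{(\ell+1)} - \sum_{j=\alpha+1}^{d} \hat g_j^{(\ell)}$$
$$\hat g_\alpha^{(\ell+1)}(\bullet) = S_\alpha(r_\alpha)$$
proceed until convergence is reached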
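The backfitting cycle above can be sketched in a few lines. This is an illustration under stated assumptions, not the deck's own code: the smoother $S_\alpha$ is taken to be a Nadaraya-Watson smoother with Quartic kernel (the examples in these slides use local linear fits), each component is re-centered after smoothing to enforce the identification condition, and the simulated two-component model, the bandwidth and the function names `nw_smooth` and `backfit` are hypothetical.

```python
import random

def quartic(u):
    """Quartic kernel, supported on [-1, 1]."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def nw_smooth(x, r, h):
    """Nadaraya-Watson fit of r on x, evaluated at the sample points."""
    fitted = []
    for x0 in x:
        w = [quartic((xi - x0) / h) for xi in x]
        fitted.append(sum(wi * ri for wi, ri in zip(w, r)) / sum(w))
    return fitted

def backfit(columns, y, h, max_iter=20, tol=1e-4):
    """columns: list of d regressor columns; returns (c_hat, list of d fitted components)."""
    n, d = len(y), len(columns)
    c = sum(y) / n                         # c_hat = mean of Y
    g = [[0.0] * n for _ in range(d)]      # g_alpha^(0) = 0
    for _ in range(max_iter):
        largest_change = 0.0
        for a in range(d):
            # partial residuals: already-updated components enter with their new values
            r = [y[i] - c - sum(g[j][i] for j in range(d) if j != a) for i in range(n)]
            fit = nw_smooth(columns[a], r, h)
            mean_fit = sum(fit) / n
            fit = [v - mean_fit for v in fit]  # centering (added here for identification)
            largest_change = max(largest_change,
                                 max(abs(p - q) for p, q in zip(fit, g[a])))
            g[a] = fit
        if largest_change < tol:
            break
    return c, g

# simulated example (assumed design): g1(x) = x^2 - E(X^2), g2(x) = x, X_j ~ U[-2.5, 2.5]
random.seed(0)
n = 150
x1 = [random.uniform(-2.5, 2.5) for _ in range(n)]
x2 = [random.uniform(-2.5, 2.5) for _ in range(n)]
y = [x1[i] ** 2 - 25.0 / 12.0 + x2[i] + random.gauss(0.0, 0.5) for i in range(n)]
c_hat, g_hat = backfit([x1, x2], y, h=1.0)
```

Because the loop overwrites `g[a]` immediately, component `a + 1` is smoothed against the newest estimate of component `a`, which is the Gauss-Seidel ordering described on the slide.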
Additive Models 8-9
Example: Smoother performance in additive models
simulated sample of $n = 75$ regression observations with regressors $X_j$ i.i.d. uniform on $[-2.5, 2.5]$, generated from
$$Y = \sum_{j=1}^{4} g_j(X_j) + \varepsilon, \quad \varepsilon \sim N(0, 1)$$
where
$$g_1(X_1) = \sin(2X_1), \quad g_2(X_2) = X_2^2 - E(X_2^2), \quad g_3(X_3) = X_3, \quad g_4(X_4) = \exp(X_4) - E\{\exp(X_4)\}$$
Additive Models 8-10
Figure 74: Estimated (solid lines) versus true additive component functions (circles at the input values), local linear estimator with Quartic kernel, bandwidths $h = 1.0$ (panels: $g_1$ against X1, $g_2$ against X2, $g_3$ against X3, $g_4$ against X4)
Additive Models 8-11
Example: Boston housing prices
$Y$: median value of owner-occupied homes in 1000 USD
$X_1$: per capita crime rate by town
$X_2$: proportion of non-retail business acres per town
$X_3$: nitric oxides concentration (parts per 10 million)
$X_4$: average number of rooms per dwelling
$X_5$: proportion of owner-occupied units built prior to 1940
$X_6$: weighted distances to five Boston employment centers
$X_7$: full-value property tax rate per 10,000 USD
$X_8$: pupil-teacher ratio by town
$X_9$: $1000(Bk - 0.63)^2$ where $Bk$ is the proportion of people of Afro-American descent by town
$X_{10}$: percent lower status of the population
Additive Models 8-12
fitted model:
$$E(Y \mid x) = c + \sum_{j=1}^{10} g_j\{\log(X_j)\}$$
Figure 75: Function estimates of additive components $g_1$ to $g_6$ (left), $g_7$ to $g_{10}$ (right) and partial residuals, local linear estimator with Quartic kernel, $h_j = 0.5\,\hat\sigma_j(X_j)$ (one panel per component, $g_j$ against X1 to X10)
Additive Models 8-13
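Marginal Integration
motivated by the identifiability conditions
$$E_{X_j}\{g_j(X_j)\} = 0, \quad EY = c,$$
it follows
$$E_{X_{\underline\alpha}}\{m(x_\alpha, X_{\underline\alpha})\} = E_{X_{\underline\alpha}}\Big\{c + g_\alpha(x_\alpha) + \sum_{j \neq \alpha} g_j(X_j)\Big\} = c + g_\alpha(x_\alpha)$$
where $\underline\alpha = \{1, \ldots, \alpha - 1, \alpha + 1, \ldots, d\}$
$\Longrightarrow$ estimate hyperdimensional surface and integrate out the nuisance directions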
Additive Models 8-14
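Example: Illustration of the marginal integration principle
consider the model
$$Y = 4 + X_1^2 + 2\sin(X_2) + \varepsilon,$$
where $X_1 \sim U[-2, 2]$, $X_2 \sim U[-3, 3]$, $E\varepsilon = 0$
$$\Longrightarrow m(x_1, x_2) = E(Y \mid X = x) = 4 + x_1^2 + 2\sin(x_2)$$
we find
$$E_{X_2}\{m(x_1, X_2)\} = \int_{-3}^{3} \{4 + x_1^2 + 2\sin(u)\}\, \frac{1}{6}\, du = 4 + x_1^2,$$
$$E_{X_1}\{m(X_1, x_2)\} = \int_{-2}^{2} \{4 + u^2 + 2\sin(x_2)\}\, \frac{1}{4}\, du = \frac{16}{3} + 2\sin(x_2),$$
since $E(X_1^2) = 4/3$ for $X_1 \sim U[-2, 2]$; in terms of the identified model, $c = 16/3$, $g_1(x_1) = x_1^2 - 4/3$ and $g_2(x_2) = 2\sin(x_2)$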
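The integration principle of this example can also be checked empirically. The sketch below is an illustration only: it simulates from the model above with an assumed error distribution $\varepsilon \sim N(0, 0.5^2)$ (the slide only requires $E\varepsilon = 0$), uses a bivariate Nadaraya-Watson pre-estimator with a product Quartic kernel, and averages it over the observed $X_2$ values; the function names and bandwidths are hypothetical.

```python
import math
import random

def quartic(u):
    """Quartic kernel, supported on [-1, 1]."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def pre_estimate(x1, x2, data, h1, h2):
    """Bivariate Nadaraya-Watson estimate of m(x1, x2) with a product kernel."""
    num = den = 0.0
    for a, b, yv in data:
        w = quartic((a - x1) / h1) * quartic((b - x2) / h2)
        num += w * yv
        den += w
    return num / den if den > 0.0 else float("nan")

def marginal_integration(x1, data, h1, h2):
    """Average the pre-estimate over the observed X2 values: estimates c + g1(x1)."""
    vals = [pre_estimate(x1, b, data, h1, h2) for _, b, _ in data]
    vals = [v for v in vals if not math.isnan(v)]  # skip empty kernel windows
    return sum(vals) / len(vals)

random.seed(2)
n = 400
data = []
for _ in range(n):
    a = random.uniform(-2.0, 2.0)
    b = random.uniform(-3.0, 3.0)
    data.append((a, b, 4.0 + a * a + 2.0 * math.sin(b) + random.gauss(0.0, 0.5)))

at_0 = marginal_integration(0.0, data, h1=0.5, h2=1.0)  # target: 4 + 0^2
at_1 = marginal_integration(1.0, data, h1=0.5, h2=1.0)  # target: 4 + 1^2
print(round(at_0, 2), round(at_1, 2))
```

The two values should lie near $4$ and $5$, i.e. near $4 + x_1^2$, up to smoothing bias and sampling noise.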
Additive Models 8-15
Example: Smoother performance of the marginal integration estimator
simulated sample of $n = 150$ regression observations with regressors $X_j$ i.i.d. uniform on $[-2.5, 2.5]$, generated from
$$Y = \sum_{j=1}^{4} g_j(X_j) + \varepsilon, \quad \varepsilon \sim N(0, 1),$$
where
$$g_1(X_1) = \sin(2X_1), \quad g_2(X_2) = X_2^2 - E(X_2^2), \quad g_3(X_3) = X_3, \quad g_4(X_4) = \exp(X_4) - E\{\exp(X_4)\}$$
Additive Models 8-16
Figure 76: Estimated local linear (solid line) versus true additive component functions (circles at the input values), local linear estimator with Quartic kernel, $h_1 = 1$, $h_j = 1.5$ ($j = 2, 3, 4$), $\tilde h$ for the nuisance directions (panels: $g_1$ against X1, $g_2$ against X2, $g_3$ against X3, $g_4$ against X4)
Additive Models 8-17
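Details on Marginal Integration
$$Y_i = c + \sum_{j=1}^{d} g_j(X_{ij}) + \varepsilon_i$$
assumptions on $\varepsilon_i$:
$E\varepsilon_i = 0$, $\mathrm{Var}(\varepsilon_i) = \sigma^2(X_i)$, mutually independent conditional on the $X_j$
denote matrices
$$Z_\alpha = (Z_{ik}) = \{(X_{i\alpha} - x_\alpha)^k\}, \quad k = 0, \ldots, p$$
$$W_{\ell,\alpha} = \mathrm{diag}(W_{i\ell,\alpha}) = \mathrm{diag}\{K_h(X_{i\alpha} - x_\alpha)\, \mathcal{K}_{\tilde h}(X_{i\underline\alpha} - X_{\ell\underline\alpha})\}_{i=1}^{n}$$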
Additive Models 8-18
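Local Polynomial Estimates of $m(x_\alpha, X_{\ell\underline\alpha})$
$$\arg\min_{\beta} \sum_{i=1}^{n} \{Y_i - \beta_0 - \beta_1 (X_{i\alpha} - x_\alpha) - \ldots\}^2\, W_{i\ell,\alpha}$$
Derivative estimation for the Marginal Effects
$$\hat g_\alpha^{(\nu)}(x_\alpha) = \frac{\nu!}{n} \sum_{\ell=1}^{n} e_\nu^\top (Z_\alpha^\top W_{\ell,\alpha} Z_\alpha)^{-1} Z_\alpha^\top W_{\ell,\alpha} Y,$$
where $\nu > 0$, $e_\nu = (0, \ldots, 1, \ldots, 0)^\top$, and $p - \nu$ odd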
Additive Models 8-19
Assumptions
(A1) $K(\bullet)$, $\mathcal{K}(\bullet)$ are positive, bounded, symmetric, compactly supported and Lipschitz continuous, $\mathcal{K}(\bullet)$ is of order $q > 2$
(A2) $\dfrac{n h \tilde h^{(d-1)}}{\log^2(n)} \to \infty$, $\dfrac{\tilde h^{q}}{h^{p+1}} \to 0$, $h = h_0\, n^{-\frac{1}{2p+3}}$
(A3) $g_j(\bullet)$ have bounded Lipschitz continuous $(p + 1)$th derivatives
(A4) $\sigma^2(\bullet)$ is bounded and Lipschitz continuous
(A5) the densities $f_X$ and $f_{X_{\underline\alpha}}$ are uniformly bounded away from zero and infinity and are Lipschitz continuous
Additive Models 8-20
the proof mainly uses that the local polynomial estimator is asymptotically equivalent to kernel estimation with kernel
$$K_\nu^*(u) = \sum_{t=0}^{p} s_{\nu t}\, u^t K(u)$$
$$\mathbf{K} = \Big\{\int u^{t+s} K(u)\, du\Big\}_{0 \le t, s \le p}, \quad s_{\nu t} = (\mathbf{K}^{-1})_{\nu t}$$
(equivalent kernel of higher order)
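As a worked special case of the equivalent kernel (added here for illustration, assuming a symmetric kernel with $\int K(u)\,du = 1$ and writing $\mu_2(K) = \int u^2 K(u)\,du$), consider the local linear fit $p = 1$. The odd moments of $K$ vanish, so the moment matrix is diagonal:

```latex
\mathbf{K} = \begin{pmatrix} 1 & 0 \\ 0 & \mu_2(K) \end{pmatrix},
\qquad
\mathbf{K}^{-1} = \begin{pmatrix} 1 & 0 \\ 0 & 1/\mu_2(K) \end{pmatrix},
\qquad
K_0^*(u) = K(u),
\qquad
K_1^*(u) = \frac{u\,K(u)}{\mu_2(K)} .

% Quartic kernel K(u) = (15/16)(1 - u^2)^2 on [-1, 1]:
\mu_2(K) = \frac{15}{16} \int_{-1}^{1} (u^2 - 2u^4 + u^6)\, du
         = \frac{15}{8}\Big(\frac{1}{3} - \frac{2}{5} + \frac{1}{7}\Big)
         = \frac{1}{7},
\qquad
K_1^*(u) = 7\,u\,K(u) .
```

So for the function itself ($\nu = 0$) the local linear fit behaves like the original kernel, while for the first derivative ($\nu = 1$) the equivalent kernel is odd, as expected for a derivative estimator.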
Additive Models 8-21
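Theorem
Under conditions (A1)-(A5), $p - \nu$ odd and if $x$ is in the interior of the support of $f_X$, the asymptotic bias and variance of the estimator can be expressed as $n^{-\frac{p+1-\nu}{2p+3}}\, b_\alpha(x_\alpha)$ and $n^{-\frac{2(p+1-\nu)}{2p+3}}\, v_\alpha(x_\alpha)$, where
$$b_\alpha(x_\alpha) = \frac{\nu!\, h_0^{p+1-\nu}}{(p + 1)!}\, \mu_{p+1}(K_\nu^*)\, \{g_\alpha^{(p+1)}(x_\alpha)\}$$
and
$$v_\alpha(x_\alpha) = \frac{(\nu!)^2}{h_0^{2\nu+1}}\, \|K_\nu^*\|_2^2 \int \sigma^2(x_\alpha, x_{\underline\alpha})\, \frac{f_{\underline\alpha}^2(x_{\underline\alpha})}{f(x_\alpha, x_{\underline\alpha})}\, dx_{\underline\alpha}.$$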
Additive Models 8-22
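Theorem
For the additive components or derivative estimators holds:
$$n^{\frac{p+1-\nu}{2p+3}} \{\hat g_\alpha^{(\nu)}(x_\alpha) - g_\alpha^{(\nu)}(x_\alpha)\} \stackrel{L}{\longrightarrow} N\{b_\alpha(x_\alpha), v_\alpha(x_\alpha)\}.$$
Theorem
For the regression function estimator $\hat m = \bar Y + \sum_\alpha \hat g_\alpha$ holds:
$$n^{\frac{p+1}{2p+3}} \{\hat m(x) - m(x)\} \stackrel{L}{\longrightarrow} N\{b(x), v(x)\},$$
where $b(x) = \sum_{\alpha=1}^{d} b_\alpha(x_\alpha)$ and $v(x) = \sum_{\alpha=1}^{d} v_\alpha(x_\alpha)$.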
Additive Models 8-23
Example: Production function estimation
assumption that separability (additivity) holds, i.e.
$$\log(Y) = c + \sum_{\alpha=1}^{d} g_\alpha\{\log(X_\alpha)\}$$
$\Longrightarrow$ model is a generalization of the Cobb-Douglas model
$$\log(Y) = c + \sum_{\alpha=1}^{d} \beta_\alpha \log(X_\alpha)$$
$\Longrightarrow$ scale elasticities correspond to $g_\alpha'(\log X_\alpha)$, returns to scale to $\sum_j g_j'(\log X_j)$
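To see how the additive model nests the Cobb-Douglas case, the elasticities can be written out explicitly. The short derivation below is added for illustration; it only uses the two model equations above.

```latex
% Elasticity of output with respect to input alpha in the additive model:
\frac{\partial \log(Y)}{\partial \log(X_\alpha)} = g_\alpha'\{\log(X_\alpha)\}

% Cobb-Douglas special case g_alpha(z) = beta_alpha z:
g_\alpha'(z) = \beta_\alpha
\quad\Longrightarrow\quad
\sum_{\alpha=1}^{d} g_\alpha'\{\log(X_\alpha)\} = \sum_{\alpha=1}^{d} \beta_\alpha
```

In the Cobb-Douglas model the elasticities and the returns to scale are constants (constant returns to scale correspond to $\sum_\alpha \beta_\alpha = 1$), whereas in the additive model both may vary with the input levels.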
Additive Models 8-24
livestock data of (middle sized) Wisconsin farms, 250 observations from 1987
$Y$: livestock
$X_1$: family labor force
$X_2$: hired labor force
$X_3$: miscellaneous inputs, e.g. repairs, rent, custom hiring, supplies, insurance, gas, oil, or utilities
$X_4$: animal inputs, e.g. purchased feed, breeding, or veterinary services
$X_5$: intermediate run assets, that is assets with a useful life of one to ten years
(all measured in US dollars)
Additive Models 8-25
Figure 77: Density estimates for the regressors $X_1, \ldots, X_5$ (panels: Family Labor, Hired Labor, Miscellaneous Inputs, Animal Inputs, Intermediate Run Assets; horizontal axes ln(X1) to ln(X5), vertical axes density), using the quartic kernel with bandwidth $h = 0.75\,\hat\sigma_j(X_j)$
Additive Models 8-26
Figure 78: Function estimates for the additive components $X_1$, $X_2$ (left), parametric and nonparametric derivative estimates (right); panels: Family Labor against log(X1), Hired Labor against log(X2), vertical axes log(Y)
Additive Models 8-27
Figure 79: Function estimates for the additive components $X_3$, $X_4$ (left), parametric and nonparametric derivative estimates (right); panels: Miscellaneous Inputs against log(X3), Animal Inputs against log(X4), vertical axes log(Y)
Additive Models 8-28
Figure 80: Function estimate for the additive component $X_5$ (left), parametric and nonparametric derivative estimates (right); panels: Intermediate Run Assets against log(X5), vertical axes log(Y)
Additive Models 8-29
Backfitting versus Marginal Integration
backfitting:
projection into space of additive models, looks for an optimal fit
marginal integration:
estimates marginal effects by integrating out other directions, does not calculate optimal regression fit
$\Longrightarrow$ different to interpret, but if the model is correctly specified, they estimate the same
Additive Models 8-30
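Additive Partial Linear Models (APLM)
$$m(U, T) = E(Y \mid U, T) = c + \sum_{j=1}^{d} g_j(T_j) + U^\top \beta$$
first approach:
$$\hat g_\alpha(t_\alpha) = \frac{1}{n} \sum_{j=1}^{n} \tilde m(t_\alpha, T_{j\underline\alpha}, U_j) - \hat c$$
using $\tilde m$ from
$$\arg\min_{\gamma} \sum_{i=1}^{n} \{Y_i - \gamma_0 - \gamma_1 (T_{i\alpha} - t_\alpha)\}^2\, W_{ij}\, I\{U_i = U_j\}$$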
Additive Models 8-31
second approach:
add $U_i^\top \beta$ inside the brackets and skip $I\{U_i = U_j\}$
$\Longrightarrow$ asymptotics for $\hat\beta$ get complicated
no additional assumptions needed, only a modification for the densities because of $U$
$\beta$ can be estimated with parametric rate
Additive Models 8-32
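Additive Models with Interaction terms
$$Y = m(X) + \sigma(X)\,\varepsilon$$
allowing now for interaction
$$m(X) = c + \sum_{\alpha=1}^{d} g_\alpha(X_\alpha) + \sum_{1 \le \alpha < j \le d} g_{\alpha j}(X_\alpha, X_j)$$
for identification
$$E_{X_\alpha}\{g_\alpha(X_\alpha)\} = E_{X_\alpha}\{g_{\alpha j}(X_\alpha, X_j)\} = E_{X_j}\{g_{\alpha j}(X_\alpha, X_j)\} = 0$$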
Additive Models 8-33
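consider marginal integration as used before
$$\theta_\alpha(x_\alpha) = \int m(x_\alpha, x_{\underline\alpha})\, f_{\underline\alpha}(x_{\underline\alpha})\, dx_{\underline\alpha}, \quad 1 \le \alpha \le d,$$
and in addition
$$\theta_{\alpha j}(x_\alpha, x_j) = \int m(x_\alpha, x_j, x_{\underline{\alpha j}})\, f_{\underline{\alpha j}}(x_{\underline{\alpha j}})\, dx_{\underline{\alpha j}},$$
$$c_{\alpha j} = \int g_{\alpha j}(x_\alpha, x_j)\, f_{\alpha j}(x_\alpha, x_j)\, dx_\alpha\, dx_j, \quad 1 \le \alpha < j \le d,$$
it can be shown that
$$\theta_{\alpha j}(x_\alpha, x_j) - \theta_\alpha(x_\alpha) - \theta_j(x_j) + \int m(x) f(x)\, dx = g_{\alpha j}(x_\alpha, x_j) + c_{\alpha j}.$$
$\Longrightarrow$ centering this function in an appropriate way would hence give us the interaction function of interest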
Additive Models 8-34
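Local Linear Estimator
$$\widehat{g_{\alpha j} + c_{\alpha j}} = \hat\theta_{\alpha j} - \hat\theta_\alpha - \hat\theta_j + \hat c,$$
where
$$\hat\theta_{\alpha j} = \frac{1}{n} \sum_{l=1}^{n} e_0^\top (X_{\alpha j}^\top W_{l \alpha j} X_{\alpha j})^{-1} X_{\alpha j}^\top W_{l \alpha j} Y,$$
and
$$W_{l \alpha j} = \mathrm{diag}\Big[\Big\{\frac{1}{n}\, \mathcal{K}_H(X_{i\alpha} - x_\alpha, X_{ij} - x_j)\, \tilde{\mathcal{K}}_{\tilde H}(X_{i\underline{\alpha j}} - X_{l\underline{\alpha j}})\Big\}_{i=1,\ldots,n}\Big],$$
$$X_{\alpha j} = \begin{pmatrix} 1 & X_{1\alpha} - x_\alpha & X_{1j} - x_j \\ \vdots & \vdots & \vdots \\ 1 & X_{n\alpha} - x_\alpha & X_{nj} - x_j \end{pmatrix}$$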
Additive Models 8-35
Example: Production function estimation
we can evaluate the validity of the additivity assumption by introducing second order interaction terms $g_{\alpha j}$:
$$\log(Y) = c + \sum_{\alpha=1}^{d} g_\alpha\{\log(X_\alpha)\} + \sum_{1 \le \alpha < j \le d} g_{\alpha j}\{\log(X_\alpha), \log(X_j)\}$$
livestock data of (middle sized) Wisconsin farms, 250 observations from 1987, continued
Additive Models 8-36
Figure 81: Estimates for interaction terms, Quartic kernel, $h_j = 1.7\,\hat\sigma_j(X_j)$ and $\tilde h = 4h$ (panels: Family Labor versus Hired Labor, Family Labor versus Animal Inputs, Hired Labor versus Miscellaneous Inputs, Family Labor versus Miscellaneous Inputs, Family Labor versus Intermediate Run Assets, Hired Labor versus Animal Inputs, Hired Labor versus Intermediate Run Assets, Miscellaneous Inputs versus Intermediate Run Assets, Miscellaneous Inputs versus Animal Inputs, Animal Inputs versus Intermediate Run Assets)
Additive Models 8-37
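Testing for Interaction
different approaches
looking at the interaction directly:
$$\int g_{\alpha j}^2(x_\alpha, x_j)\, f_{\alpha j}(x_\alpha, x_j)\, dx_\alpha\, dx_j$$
looking at the mixed derivative:
$$\int \{\theta_{\alpha j}^{(1,1)}(x_\alpha, x_j)\}^2\, f_{\alpha j}(x_\alpha, x_j)\, dx_\alpha\, dx_j$$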
Additive Models 8-38
Example: Production function estimation
applying a special model selection procedure, we tested stepwise for each interaction
as a result we found significance for
direct method:
$g_{13}$ (family labor and miscellaneous inputs) with p-value of about 2%;
$g_{15}$ and $g_{35}$ came closest
derivative method:
$g_{15}$ (family labor and intermediate run assets)
$g_{35}$ (miscellaneous inputs and intermediate run assets)
Additive Models 8-39
Summary: Additive Models
Additive models are of the form
$$E(Y \mid X) = m(X) = c + \sum_{\alpha=1}^{d} g_\alpha(X_\alpha).$$
In estimation they can combine flexible nonparametric modeling of many variables with statistical precision that is typical of just one explanatory variable, i.e. they circumvent the curse of dimensionality.
In practice, there exist mainly two estimation procedures, backfitting and marginal integration. If the real model is additive, then there are many similarities in terms of what they do to the data. Otherwise their behavior and interpretation are rather different.
Additive Models 8-40
The backfitting estimator is an iterative procedure of the kind:
$$\hat g_\alpha^{(l)} = S_\alpha\Big(Y - \sum_{j \neq \alpha} \hat g_j^{(l-1)}\Big), \quad l = 1, 2, 3, \ldots$$
until some prespecified tolerance is reached. This is a successive one-dimensional regression of the partial residuals on the corresponding $X_\alpha$. As a regression estimator it generally fits the regression better than the integration estimator. But it pays for a low MSE (or MASE) for the regression with high MSE (MASE respectively) in the additive function estimation. Furthermore, it is rather sensitive to the bandwidth choice. An increase in the correlation of the design leads to a worse estimate.
Additive Models 8-41
The marginal integration estimator is based on the idea that
$$E_{X_{\underline\alpha}}\{m(X_\alpha, X_{\underline\alpha})\} = c + g_\alpha(X_\alpha).$$
Replacing $m$ by a pre-estimator and the expectation by averaging defines a consistent estimate. This estimator suffers more from sparseness of observations than the backfitting estimator does. So, for example, the boundary effects are worse in the integration estimator. In the center of the support of $X$ this estimator mostly has lower MASE for the estimators of the additive component functions. An increasing covariance of the explanatory variables affects the MASE strongly in a negative sense. Regarding the bandwidth this estimator seems to be quite robust.
Additive Models 8-42
If the real model is not additive, the integration estimator is estimating the marginals by integrating out the directions of no interest. In contrast, the backfitting estimator is looking in the space of additive models for the best fit of the response $Y$ on $X$.
Generalized Additive Models 9-1
Chapter 9:
Generalized Additive Models
Generalized Additive Models 9-2
Generalized Additive Models
GAM
$$E(Y \mid X) = G\Big\{c + \sum_{j=1}^{d} g_j(X_j)\Big\}$$
with known link function $G$
identification conditions for both methods
$$E_{X_j}\{g_j(X_j)\} = 0, \quad j = 1, \ldots, d$$
Generalized Additive Models 9-3
Estimation via Backfitting and Local Scoring
consider maximum likelihood estimation as in GLM, but apply backfitting to adjusted dependent variables $Z_i$
algorithm consists of two loops
inner loop: backfitting
outer loop: local scoring
$\Longrightarrow$ local scoring instead of Fisher scoring
Generalized Additive Models 9-4
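Local Scoring Algorithm
initialization $\hat c = \bar Y$, $\hat g_\alpha^{(0)} \equiv 0$ for $\alpha = 1, \ldots, d$,
loop over outer iteration counter $m$
$$\hat\eta_i^{(m)} = \hat c^{(m)} + \sum_\alpha \hat g_\alpha^{(m)}(x_{i\alpha}), \quad \hat\mu_i^{(m)} = G(\hat\eta_i^{(m)})$$
$$Z_i = \hat\eta_i^{(m)} + (Y_i - \hat\mu_i^{(m)}) \{G'(\hat\eta_i^{(m)})\}^{-1}$$
$$w_i = \{G'(\hat\eta_i^{(m)})\}^2 \{V(\hat\mu_i^{(m)})\}^{-1}$$
obtain $\hat c^{(m+1)}$, $\hat g_\alpha^{(m+1)}$ by applying backfitting to $Z_i$ with regressors $x_i$ and weights $w_i$
until convergence is reached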
Generalized Additive Models 9-5
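Backfitting Algorithm
gamback
initialization $\bar Z = \frac{1}{n} \sum_{i=1}^{n} Z_i$, $\hat g_\alpha^{(0)} \equiv 0$ for $\alpha = 1, \ldots, d$,
repeat for $\alpha = 1, \ldots, d$ the cycles:
$$r_\alpha = Z - \bar Z - \sum_{k=1}^{\alpha-1} \hat g_k^{(l+1)} - \sum_{k=\alpha+1}^{d} \hat g_k^{(l)},$$
$$\hat g_\alpha^{(l+1)}(\bullet) = S_\alpha(r_\alpha \mid w)$$
until convergence is reached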
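The two nested loops (outer local scoring, inner weighted backfitting) can be sketched for the logit link, where $G'(\eta) = \mu(1 - \mu)$ and $V(\mu) = \mu(1 - \mu)$, so that $w_i = \mu_i(1 - \mu_i)$. This is a hedged illustration and not the `gamback` routine named on the slide: it assumes a weighted Nadaraya-Watson smoother with Quartic kernel for $S_\alpha(\bullet \mid w)$, starts from $G^{-1}(\bar Y)$ rather than $\bar Y$, centers with weighted means, clips $\hat\mu_i$ away from 0 and 1, and runs a fixed number of iterations; all function names and the simulated data are hypothetical.

```python
import math
import random

def quartic(u):
    """Quartic kernel, supported on [-1, 1]."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def logistic(e):
    """Logit link G(eta), with eta clamped to avoid overflow."""
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, e))))

def wmean(v, w):
    return sum(a * b for a, b in zip(v, w)) / sum(w)

def local_scoring_logit(columns, y, h, outer=8, inner=3):
    n, d = len(y), len(columns)
    # kernel weights between all pairs of sample points, one matrix per regressor
    kern = [[[quartic((xj - xi) / h) for xj in col] for xi in col] for col in columns]
    ybar = sum(y) / n
    c = math.log(ybar / (1.0 - ybar))          # assumption: start at G^{-1}(mean of Y)
    g = [[0.0] * n for _ in range(d)]
    for _ in range(outer):                      # outer loop: local scoring
        eta = [c + sum(g[a][i] for a in range(d)) for i in range(n)]
        mu = [min(max(logistic(e), 1e-4), 1.0 - 1e-4) for e in eta]
        w = [m * (1.0 - m) for m in mu]         # G'(eta)^2 / V(mu) for the logit link
        z = [e + (yi - m) / wi for e, yi, m, wi in zip(eta, y, mu, w)]
        c = wmean(z, w)
        for _ in range(inner):                  # inner loop: weighted backfitting on Z
            for a in range(d):
                r = [z[i] - c - sum(g[j][i] for j in range(d) if j != a) for i in range(n)]
                fit = []
                for row in kern[a]:
                    k = [kij * wj for kij, wj in zip(row, w)]
                    fit.append(sum(ki * ri for ki, ri in zip(k, r)) / sum(k))
                shift = wmean(fit, w)
                g[a] = [v - shift for v in fit]
    return c, g

# simulated binary data (assumed design)
random.seed(3)
n = 250
x1 = [random.uniform(-2.0, 2.0) for _ in range(n)]
x2 = [random.uniform(-2.0, 2.0) for _ in range(n)]
eta_true = [1.2 * a + 0.8 * (b * b - 4.0 / 3.0) for a, b in zip(x1, x2)]
y = [1.0 if random.random() < logistic(e) else 0.0 for e in eta_true]
c_hat, g_hat = local_scoring_logit([x1, x2], y, h=1.0)
p_hat = [logistic(c_hat + g_hat[0][i] + g_hat[1][i]) for i in range(n)]
```

A production implementation would replace the fixed iteration counts by convergence checks on the deviance (outer loop) and on the change of the component functions (inner loop).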
Generalized Additive Models 9-6
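Estimation using Marginal Integration
without linear part in index (neglecting constant $c$):
$$\hat g_\alpha(t_\alpha) = \frac{1}{n} \sum_{i=1}^{n} G^{-1}\{\tilde m(t_\alpha, T_{i\underline\alpha})\}$$
with linear part in index:
$$E(Y \mid U, T) = G\{U^\top \beta + m(T)\}$$
where
$$m(T) = c + \sum_{j=1}^{d} g_j(T_j)$$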
Generalized Additive Models 9-7
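estimate GPLM:
$$\tilde m_\beta(t) = \arg\max_{m_\beta(t)} \{\text{smoothed log-likelihood}\}$$
$$\hat\beta = \arg\max_{\beta} \{\text{usual log-likelihood}\}$$
use multivariate estimate $\tilde m_{\hat\beta}(\bullet)$ to obtain marginal functions by marginal integration
$$\Longrightarrow \hat g_\alpha(t_\alpha) = \frac{1}{n} \sum_{i=1}^{n} \tilde m_{\hat\beta}(t_\alpha, T_{i\underline\alpha}) - \hat c$$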
Generalized Additive Models 9-8
Example: Migration data (for Saxony)
$Y$: migration intention
$U_1$: family/friend in West
$U_2$: unemployed/job loss certain
$U_3$: middle sized city (10,000-100,000 inhabitants)
$U_4$: female (1 if yes)
$T_1$: age of person (in years)
$T_2$: household income (in DM)
Generalized Additive Models 9-9
Figure 82: Density plots for migration data (subsample from Sachsen), AGE on the left (panel "Density of Age"), HOUSEHOLD INCOME on the right (panel "Density of Income")
Generalized Additive Models 9-10
                               Yes     No      (in %)
Y    MIGRATION INTENTION       39.6    60.4
U_1  FAMILY/FRIENDS            82.4    27.6
U_2  UNEMPLOYED/JOB LOSS       18.3    81.7
U_3  CITY SIZE                 26.0    74.0
U_4  FEMALE                    51.6    48.4
                               Min     Max     Mean      S.D.
T_1  AGE                       18      65      40.37     12.69
T_2  INCOME                    200     4000    2136.31   738.72
Table 17: Descriptive statistics for migration data (subsample from Sachsen, $n = 955$)
Generalized Additive Models 9-11
                       GLM                               GAPLM
                       Coefficients  S.E.    p-values    Coefficients
                                                         h = 0.75   h = 1.00
FAMILY/FRIENDS          0.7604       0.1972  <0.001       0.7137     0.7289
UNEMPLOYED/JOB LOSS     0.1354       0.1783   0.447       0.1469     0.1308
CITY SIZE               0.2596       0.1556   0.085       0.3134     0.2774
FEMALE                 -0.1868       0.1382   0.178      -0.1898    -0.1871
AGE (stand.)           -0.5051       0.0728  <0.001
INCOME (stand.)         0.0936       0.0707   0.187
constant               -1.0924       0.2003  <0.001      -1.1045    -1.1007
Table 18: Logit and GAPLM coefficients for migration data
Generalized Additive Models 9-12
Figure 83: Additive curve estimates for AGE (left, panels "logit(Migration) <-- Age", $g_1$ against T1) and INCOME (right, panels "logit(Migration) <-- Household Income", $g_2$ against T2) in Sachsen (upper plots with $h = 0.75$, lower with $h = 1.0$)
Generalized Additive Models 9-13
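Testing the GAPLM
hypothesis
$$H_0: g_1(t_1) = \gamma_1 t_1, \quad \text{for all } t_1$$
analogy to GPLM test statistic
$$\widehat{LR} = \sum_{i=1}^{n} w(T_i)\, \frac{[G'\{U_i^\top \hat\beta + \tilde m(T_i)\}]^2}{V(\hat\mu_i)}\, \{\tilde g_1(T_{i1}) - E^*\hat\gamma_1 T_{i1}\}^2$$
where $\tilde m(T_i) = \hat c + \tilde g_1(T_{i1}) + \ldots + \tilde g_q(T_{iq})$, $\hat\mu_i = G\{U_i^\top \hat\beta + \tilde m(T_i)\}$ and bias-adjusted bootstrap estimates $E^*\hat\gamma_1$ of the linear estimate for component $T_1$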
Generalized Additive Models 9-14
Example: Migration data (for Saxony)
as a test statistic we compute $\widehat{LR}$ and derive its critical values from the bootstrap test statistics $\widehat{LR}^*$
bootstrap sample size is set to $n_{boot} = 499$ replications
results for AGE: linearity is always rejected at the 1% level
results for INCOME: linearity is rejected at the 2% level for $h = 0.75$ and at the 1% level for $h = 1.0$
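Deriving critical values from bootstrap test statistics, as described above, amounts to comparing the observed statistic with the empirical distribution of the bootstrap replicates. The sketch below shows only this last step and makes no claim about how the replicates are generated; the helper names are hypothetical, and the numbers are synthetic placeholders, not results from the migration data.

```python
import math

def bootstrap_p_value(observed, boot_stats):
    """Share of bootstrap statistics that are at least as large as the observed one."""
    exceed = sum(1 for s in boot_stats if s >= observed)
    return exceed / len(boot_stats)

def bootstrap_critical_value(boot_stats, level):
    """Empirical (1 - level) quantile of the bootstrap statistics."""
    ordered = sorted(boot_stats)
    k = int(math.ceil((1.0 - level) * len(ordered))) - 1
    return ordered[min(max(k, 0), len(ordered) - 1)]

# synthetic placeholder values standing in for 499 bootstrap replicates
boot = list(range(1, 500))
p = bootstrap_p_value(495, boot)               # 5 of 499 replicates are >= 495
crit = bootstrap_critical_value(boot, 0.05)    # empirical 95% quantile
print(p, crit)
```

The hypothesis is rejected at a given level when the observed statistic exceeds the corresponding critical value, or equivalently when the bootstrap p-value falls below that level.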
Generalized Additive Models 9-15
Example: Unemployed after apprenticeship?
subsample of $n = 462$ from the first 9 waves of the GSOEP, including all individuals who have completed an apprenticeship in the years between 1985 and 1992
$Y$: unemployed after apprenticeship (1 if yes)
$U_1$: female (1 if yes)
$U_2$: years of school education
$U_3$: firm size (1 if large firm)
$T_1$: age of the person
$T_2$: earnings as an apprentice (in DM)
$T_3$: city size (in 100,000 inhabitants)
$T_4$: percentage of people apprenticed in a certain occupation, divided by the percentage of people employed in this occupation in the entire economy
$T_5$: unemployment rate in the particular country the apprentice is living in
Generalized Additive Models 9-16
              GLM (logit)              GAPLM
              Coefficients   S.E.      Coefficients
FEMALE        -0.3651        0.3894    -0.3962
AGE            0.0311        0.1144
SCHOOL         0.0063        0.1744     0.0452
EARNINGS      -0.0009        0.0010
CITY SIZE     -5.e-07        4.e-07
FIRM SIZE     -0.0120        0.4686    -0.1683
DEGREE        -0.0017        0.0021
URATE          0.2383        0.0656
constant      -3.9849        2.2517    -2.8949
Table 19: Logit and GAPLM coefficients for unemployment data
Generalized Additive Models 9-17
Figure 84: Density plots (left) and additive component estimates (right) for some of the explanatory variables (density panels: Age T1, Earnings T2, City Size T3, Degree T4, Unemployment Rate T5; component panels "logit(Unemployment) <-- Age", "<-- Earnings", "<-- City Size", "<-- Degree", "<-- Unemployment Rate", showing $g_1$ to $g_5$)
Generalized Additive Models 9-18
test results show that the linearity hypothesis cannot be rejected for any of the variables at significance levels from 1% to 20%
parametric logit coefficients for all variables (except the constant and URATE) are already insignificant
it seems that the data can be explained by neither the parametric logit model nor the semiparametric GAPLM
Generalized Additive Models 9-19
Summary: Generalized Additive Models
The nonparametric additive components in all these extensions of simple additive models can be estimated with the rate that is typical for one-dimensional smoothing.
An additive partial linear model (APLM) is of the form
$$E(Y \mid X, T) = X^\top \beta + c + \sum_{\alpha=1}^{d} g_\alpha(T_\alpha).$$
Here, $\beta$ and $c$ can be estimated with the parametric rate $\sqrt{n}$. While for the marginal integration estimator in the suggested procedure it is necessary to undersmooth, it is still not clear for the backfitting what to do when $d > 1$.
Generalized Additive Models 9-20
A generalized additive model (GAM) has the form
$$E(Y \mid T) = G\Big\{c + \sum_{\alpha=1}^{d} g_\alpha(T_\alpha)\Big\}$$
with a (known) link function $G$. To estimate this using backfitting we combine the local scoring and the Gauss-Seidel algorithm. Theory is lacking here. Using marginal integration we get a closed formula for the estimator, for which asymptotic theory can also be derived.
The generalized additive partial linear model (GAPLM) is of the form
$$E(Y \mid X, T) = G\Big\{X^\top \beta + c + \sum_{\alpha=1}^{d} g_\alpha(T_\alpha)\Big\}$$
with a (known) link function $G$. In the parametric part, $\beta$ and $c$ can again be estimated with the $\sqrt{n}$-rate.
Generalized Additive Models 9-21
For the backfitting we combine estimation in APLM with the local scoring algorithm. But again the case $d > 1$ is not clear and no theory has been provided. For the marginal integration approach we combine the quasi-likelihood procedure with the marginal integration afterwards.
In all considered models, we have only developed theory for the marginal integration so far. Interpretation, advantages and drawbacks stay the same as mentioned in the context of additive models (AM).
We can perform test procedures on the additive components separately. Due to the complex structure of the estimation procedures we have to apply (wild) bootstrap methods.