
# Univariate Kernel Density Estimation

Zhi Ouyang
August, 2005
1 Questions:
• What are the statistical properties of kernel functions on estimators?
• What influence does the shape/scaling of the kernel function have on the estimators?
• How should the scaling parameter be chosen in practice?
• How can kernel smoothing ideas be used in making confidence statements?
• How do dependencies in the data affect the kernel regression estimator?
• How can one best deal with multiple predictor variables?
2 Histogram is a kind of density estimation.
• Binwidth is the smoothing parameter.
• Sensitivity to the placement of the bin edges is a problem not shared by other density estimators. One solution is the averaged shifted histogram [10], which is an appealing motivation for kernel methods.
• Drawback: the estimate is a step function.
• Multivariate histogram.
• Histograms do not use the data as efficiently as kernel estimators.
3 Why univariate kernel density estimator?
• Effective way to show the structure of the data when we do not want to impose a specific parametric form on the density.
Kernel Smoothing Zhi Ouyang Augest, 2005
4 The estimator
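Suppose we have a random sample $X_1, \dots, X_n$ taken from a continuous, univariate density $f$. The kernel density estimator is
\[
\hat f(x; h) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right),
\]
or, equivalently,
\[
\hat f(x; h) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i), \quad \text{where } K_h(u) = \frac{1}{h} K\left(\frac{u}{h}\right).
\]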
As we shall see, the choice of the shape of the kernel function is not particularly important, but the choice of the value of the bandwidth is very important.
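The estimator above can be sketched directly in code. This is a minimal illustration, assuming a Gaussian kernel; the function names `gaussian_kernel` and `kde` and the toy sample are made up for this example and are not part of the notes.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h, kernel=gaussian_kernel):
    """Evaluate f_hat(x; h) = (1 / (n h)) * sum_i K((x - X_i) / h)."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    n = len(sample)
    return sum(kernel((x - xi) / h) for xi in sample) / (n * h)

# Toy data, made up for illustration only.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]
for h in (0.2, 0.5, 1.0):
    # The same data give visibly different estimates at x = 0 as h changes.
    print(h, kde(0.0, sample, h))
```

Varying `h` while keeping the kernel fixed is a quick way to see why the bandwidth matters more than the kernel shape.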
5 MSE and MISE criteria
5.1 MSE
If we want to estimate $f$ at a particular point $x$, consider
\[
E\hat f(x; h) = E K_h(x - X) = (K_h * f)(x);
\]
\[
\begin{aligned}
V\hat f(x; h) &= E\Big\{ n^{-1} \sum_{i} K_h(x - X_i) \Big\}^2 - (K_h * f)^2(x) \\
&= n^{-1} E K_h(x - X)^2 + n^{-2} E \sum_{i \neq j} K_h(x - X_i) K_h(x - X_j) - (K_h * f)^2(x) \\
&= n^{-1} (K_h^2 * f)(x) + \Big\{ \frac{n(n-1)}{n^2} - 1 \Big\} (K_h * f)^2(x) \\
&= n^{-1} \{ (K_h^2 * f)(x) - (K_h * f)^2(x) \}.
\end{aligned}
\]
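Then the mean squared error (MSE) can be written as variance plus squared bias,
\[
\begin{aligned}
\mathrm{MSE}\{\hat f(x; h)\} &= E\{\hat f(x; h) - f(x)\}^2 \\
&= V\hat f(x; h) + \{E\hat f(x; h) - f(x)\}^2 \\
&= n^{-1} \{ (K_h^2 * f)(x) - (K_h * f)^2(x) \} + \{ (K_h * f)(x) - f(x) \}^2 .
\end{aligned}
\]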
5.2 MISE
However, we are more interested in estimating $f$ over the whole real line. Consider the integrated squared error (ISE) and the mean integrated squared error (MISE),
\[
\mathrm{ISE}\{\hat f(\cdot; h)\} = \int \{\hat f(x; h) - f(x)\}^2 \, dx;
\]
\[
\begin{aligned}
\mathrm{MISE}\{\hat f(\cdot; h)\} &= E \int \{\hat f(x; h) - f(x)\}^2 \, dx = \int E\{\hat f(x; h) - f(x)\}^2 \, dx \\
&= n^{-1} \int \{ (K_h^2 * f)(x) - (K_h * f)^2(x) \} \, dx + \int \{ (K_h * f)(x) - f(x) \}^2 \, dx .
\end{aligned}
\]
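Notice that
\[
\begin{aligned}
\int (K_h^2 * f)(x) \, dx &= \iint \frac{1}{h^2} K^2\Big(\frac{x - y}{h}\Big) f(y) \, dy \, dx \\
&= \iint \frac{1}{h} K^2(z) f(x - hz) \, dz \, dx \\
&= \int \frac{1}{h} K^2(z) \int f(x - hz) \, dx \, dz \\
&= \frac{1}{h} \int K^2(x) \, dx .
\end{aligned}
\]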
Then the MISE can be written as
\[
\mathrm{MISE}\{\hat f(\cdot; h)\} = \frac{1}{nh} \int K^2(x) \, dx + \Big(1 - \frac{1}{n}\Big) \int (K_h * f)^2(x) \, dx - 2 \int (K_h * f)(x) f(x) \, dx + \int f^2(x) \, dx .
\]
5.3 MIAE
We could also work with other criteria, such as the mean integrated absolute error (MIAE),
\[
\mathrm{MIAE}\{\hat f(\cdot; h)\} = E \int |\hat f(x; h) - f(x)| \, dx .
\]
MIAE is always defined whenever $\hat f(x; h)$ is a density, and it is invariant under monotone transformations, but it is more complicated to analyse.
6 Order and Taylor expansions
6.1 Taylor’s Theorem
Suppose $f$ is a real-valued function defined on $\mathbb{R}$ and let $x \in \mathbb{R}$. Assume that $f$ has $p$ continuous derivatives in an interval $(x - \delta, x + \delta)$ for some $\delta > 0$. Then for any sequence $\alpha_n$ converging to zero,
\[
f(x + \alpha_n) = \sum_{j=0}^{p} \frac{\alpha_n^j}{j!} f^{(j)}(x) + o(\alpha_n^p).
\]
6.2 Example
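Suppose we have a random sample $X_1, \dots, X_n$ from the $N(\mu, \sigma^2)$ distribution, and we are interested in estimating $e^{\mu}$. It is known that the maximum likelihood estimator is $e^{\bar X}$, and
\[
E\{e^{\bar X}\} = e^{\mu + \frac{\sigma^2}{2n}}; \quad E\{e^{2\bar X}\} = e^{2\mu + \frac{2\sigma^2}{n}}; \quad V\{e^{\bar X}\} = e^{2\mu + \frac{\sigma^2}{n}} \big( e^{\frac{\sigma^2}{n}} - 1 \big).
\]
Then the MSE can be approximated as
\[
\begin{aligned}
\mathrm{MSE}(e^{\bar X}) &= e^{2\mu} \big( e^{\frac{2\sigma^2}{n}} - 2 e^{\frac{\sigma^2}{2n}} + 1 \big) \\
&= e^{2\mu} \Big[ \Big\{ 1 + \frac{2\sigma^2}{n} + \frac{1}{2} \Big( \frac{2\sigma^2}{n} \Big)^2 + \dots \Big\} - 2 \Big\{ 1 + \frac{\sigma^2}{2n} + \frac{1}{2} \Big( \frac{\sigma^2}{2n} \Big)^2 + \dots \Big\} + 1 \Big] \\
&\sim \frac{1}{n} \sigma^2 e^{2\mu}.
\end{aligned}
\]
It is typical for the MSE of a parametric estimator to have a rate of convergence of order $n^{-1}$, and we shall see that the rates of convergence of nonparametric kernel estimators are typically slower than $n^{-1}$.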
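The quality of the first-order approximation in this example can be checked numerically. The sketch below compares the exact MSE with the leading term $\sigma^2 e^{2\mu}/n$; the parameter values $\mu = 0.5$, $\sigma^2 = 2$ and the function names are arbitrary choices for illustration.

```python
import math

def exact_mse(mu, sigma2, n):
    """Exact MSE of exp(X_bar): e^{2 mu} (e^{2 s2 / n} - 2 e^{s2 / (2n)} + 1)."""
    return math.exp(2 * mu) * (math.exp(2 * sigma2 / n)
                               - 2 * math.exp(sigma2 / (2 * n)) + 1)

def leading_term(mu, sigma2, n):
    """First-order approximation sigma^2 e^{2 mu} / n."""
    return sigma2 * math.exp(2 * mu) / n

for n in (10, 100, 1000, 10000):
    # The ratio should approach 1 as n grows.
    print(n, exact_mse(0.5, 2.0, n) / leading_term(0.5, 2.0, n))
```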
7 Asymptotic MSE and MISE approximations
7.1 Assumptions and notations
Assume that
(i) The density $f$ has a second derivative $f''$ which is continuous, square integrable and monotone.
(ii) The bandwidth $h = h_n$ is a non-random sequence of positive numbers such that
\[
\lim_{n \to \infty} h = 0, \quad \text{and} \quad \lim_{n \to \infty} nh = \infty.
\]
(iii) The kernel $K$ is a bounded probability density function having finite fourth moment and symmetric about the origin, so that
\[
\int K(z) \, dz = 1, \quad \int z K(z) \, dz = 0, \quad \mu_2(K) := \int z^2 K(z) \, dz < \infty.
\]
Also, denote $R(g) := \int g^2(z) \, dz$.
7.2 Calculations
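Recall that
\[
E\hat f(x; h) = (K_h * f)(x) = \int \frac{1}{h} K\Big( \frac{x - y}{h} \Big) f(y) \, dy = \int K(z) f(x - hz) \, dz.
\]
Expanding $f(x - hz)$ about $x$, we obtain
\[
f(x - hz) = f(x) - hz f'(x) + \frac{1}{2} h^2 z^2 f''(x) + o(h^2),
\]
uniformly in $z$. Since $\int z K(z) \, dz = 0$, the first-order term integrates to zero, hence
\[
E\hat f(x; h) - f(x) = \frac{1}{2} h^2 \mu_2(K) f''(x) + o(h^2).
\]
Similarly,
\[
\begin{aligned}
V\hat f(x; h) &= n^{-1} \{ (K_h^2 * f)(x) - (K_h * f)^2(x) \} \\
&= \frac{1}{nh} \int K^2(z) f(x - hz) \, dz - n^{-1} \Big\{ \int K(z) f(x - hz) \, dz \Big\}^2 \\
&= \frac{1}{nh} \int K^2(z) \{ f(x) + o(1) \} \, dz - n^{-1} \Big[ \int K(z) \{ f(x) + o(1) \} \, dz \Big]^2 \\
&= \frac{1}{nh} R(K) f(x) + o\Big\{ \frac{1}{nh} \Big\}.
\end{aligned}
\]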
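Therefore,
\[
\mathrm{MSE}\{\hat f(x; h)\} = \frac{1}{nh} R(K) f(x) + \frac{1}{4} h^4 \mu_2(K)^2 f''(x)^2 + o\Big\{ \frac{1}{nh} + h^4 \Big\};
\]
\[
\mathrm{MISE}\{\hat f(\cdot; h)\} = \mathrm{AMISE}\{\hat f(\cdot; h)\} + o\Big\{ \frac{1}{nh} + h^4 \Big\};
\]
\[
\mathrm{AMISE}\{\hat f(\cdot; h)\} = \frac{1}{nh} R(K) + \frac{1}{4} h^4 \mu_2(K)^2 R(f'').
\]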
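Notice that the two leading terms, $(nh)^{-1} R(K)$ (integrated variance) and $\frac{1}{4} h^4 \mu_2(K)^2 R(f'')$ (integrated squared bias), show the variance-bias trade-off: a small $h$ reduces the bias but inflates the variance, and vice versa. AMISE is minimized at
\[
h_{\mathrm{AMISE}} = \Big[ \frac{R(K)}{\mu_2(K)^2 R(f'') n} \Big]^{1/5}, \quad \inf_{h > 0} \mathrm{AMISE}\{\hat f(\cdot; h)\} = \frac{5}{4} \{ \mu_2(K)^2 R(K)^4 R(f'') \}^{1/5} n^{-4/5}.
\]
Equivalently, as $n$ goes to infinity, we can write
\[
h_{\mathrm{MISE}} \sim \Big[ \frac{R(K)}{\mu_2(K)^2 R(f'') n} \Big]^{1/5}, \quad \inf_{h > 0} \mathrm{MISE}\{\hat f(\cdot; h)\} \sim \frac{5}{4} \{ \mu_2(K)^2 R(K)^4 R(f'') \}^{1/5} n^{-4/5}.
\]
Aside from its dependence on the known $K$ and $n$, the expression shows that the optimal $h$ decreases as the total curvature of $f$, measured by $R(f'')$, increases (specifically, $h \propto R(f'')^{-1/5}$). The problem is that we do not know the curvature, but there are ways to estimate it.
Another thing to note is that the best obtainable rate of convergence of the MISE of the kernel estimator is of order $n^{-4/5}$, which is slower than the typical parametric MSE rate of $n^{-1}$.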
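As a worked instance of the bandwidth formula, the sketch below evaluates $h_{\mathrm{AMISE}}$ for a Gaussian kernel under the assumption that the true density is $N(0, \sigma^2)$, for which $R(f'') = 3/(8\sqrt{\pi}\sigma^5)$. The normal target, the sample size and the function names are illustrative assumptions, not part of the notes.

```python
import math

def h_amise(n, R_K, mu2_K, R_f2):
    """h_AMISE = [R(K) / (mu_2(K)^2 R(f'') n)]^(1/5)."""
    return (R_K / (mu2_K ** 2 * R_f2 * n)) ** 0.2

def amise(h, n, R_K, mu2_K, R_f2):
    """AMISE = R(K) / (n h) + h^4 mu_2(K)^2 R(f'') / 4."""
    return R_K / (n * h) + 0.25 * h ** 4 * mu2_K ** 2 * R_f2

# Gaussian kernel: R(K) = 1 / (2 sqrt(pi)), mu_2(K) = 1.
R_K, mu2_K = 1.0 / (2.0 * math.sqrt(math.pi)), 1.0
# Assumed target density N(0, sigma^2): R(f'') = 3 / (8 sqrt(pi) sigma^5).
sigma, n = 1.0, 500
R_f2 = 3.0 / (8.0 * math.sqrt(math.pi) * sigma ** 5)

h_opt = h_amise(n, R_K, mu2_K, R_f2)
print(h_opt, amise(h_opt, n, R_K, mu2_K, R_f2))
```

In this special case the formula reduces to $h = (4/3)^{1/5} \sigma n^{-1/5}$, which is one common way of replacing the unknown curvature by a reference value.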
7.3 Comparison with histogram
Suppose the knots are $x_k = x_0 + kb$, $k = 0, 1, \dots$, so that $b$ is the binwidth. Since $f$ is a density function, denote its c.d.f. by $F(x) = \int_{-\infty}^{x} f(t) \, dt$. The histogram can be written as
\[
\hat f(x; b) = \frac{1}{nb} \sum_{i=1}^{n} I_{(x_k, x_k + b]}(X_i), \quad \text{where } x \in (x_k, x_k + b].
\]
Then
\[
E\hat f(x; b) = \frac{1}{b} \int_{x_k}^{x_k + b} f(t) \, dt = \frac{F(x_k + b) - F(x_k)}{b} = f(x_k) + \frac{b}{2} f'(x_k) + o(b);
\]
\[
\begin{aligned}
\mathrm{Bias}\{\hat f(x; b)\} &= f(x_k) - \{ f(x_k) + (x - x_k) f'(x_k) + o(x - x_k) \} + \frac{b}{2} f'(x_k) + o(b) \\
&= \Big\{ \frac{b}{2} - (x - x_k) \Big\} f'(x_k) + o(b);
\end{aligned}
\]
\[
E\hat f^2(x; b) = \frac{1}{nb^2} \{ F(x_k + b) - F(x_k) \} + \frac{n(n-1)}{n^2 b^2} \{ F(x_k + b) - F(x_k) \}^2;
\]
\[
V\hat f(x; b) = \frac{1}{nb} \{ f(x_k) + o(1) \} - \frac{1}{n} \{ f(x_k) + o(1) \}^2 .
\]
Letting $x$ vary over the different bins and integrating, we have
\[
\mathrm{MISE}\{\hat f(\cdot; b)\} = \mathrm{AMISE}\{\hat f(\cdot; b)\} + o\{ (nb)^{-1} + b^2 \};
\]
\[
\mathrm{AMISE}\{\hat f(\cdot; b)\} = \frac{1}{nb} + \frac{b^2}{12} R(f').
\]
Therefore, the MISE is asymptotically minimized at
\[
b_{\mathrm{MISE}} \sim \{ 6 / R(f') \}^{1/3} n^{-1/3}, \quad \inf_{b > 0} \mathrm{MISE}\{\hat f(\cdot; b)\} \sim \frac{1}{4} \{ 36 R(f') \}^{1/3} n^{-2/3}.
\]
In other words, the histogram is asymptotically inferior to the kernel estimator in terms of MISE, since its convergence rate is $O(n^{-2/3})$ compared to the kernel estimator's $O(n^{-4/5})$ rate. For more details see Scott [9].
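To get a feel for the gap between the two rates, the following sketch evaluates both asymptotic minima for an assumed $N(0, 1)$ target and a Gaussian kernel, using $R(f') = 1/(4\sqrt{\pi})$ and $R(f'') = 3/(8\sqrt{\pi})$. The choice of target and the function names are assumptions made only for this illustration.

```python
import math

def kernel_min_amise(n, R_K, mu2_K, R_f2):
    """(5/4) {mu_2(K)^2 R(K)^4 R(f'')}^(1/5) n^(-4/5)."""
    return 1.25 * (mu2_K ** 2 * R_K ** 4 * R_f2) ** 0.2 * n ** -0.8

def hist_min_amise(n, R_f1):
    """(1/4) {36 R(f')}^(1/3) n^(-2/3)."""
    return 0.25 * (36.0 * R_f1) ** (1.0 / 3.0) * n ** (-2.0 / 3.0)

sqrt_pi = math.sqrt(math.pi)
R_K, mu2_K = 1.0 / (2.0 * sqrt_pi), 1.0   # Gaussian kernel
R_f1 = 1.0 / (4.0 * sqrt_pi)              # R(f') for N(0, 1)
R_f2 = 3.0 / (8.0 * sqrt_pi)              # R(f'') for N(0, 1)

for n in (100, 1000, 10000):
    k = kernel_min_amise(n, R_K, mu2_K, R_f2)
    hgm = hist_min_amise(n, R_f1)
    # The ratio histogram / kernel grows like n^(2/15).
    print(n, k, hgm, hgm / k)
```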
8 Exact MISE calculations
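Recall that $\phi_\sigma(x - \mu)$ is the density of the $N(\mu, \sigma^2)$ distribution. We know that
\[
\int \phi_\sigma(x - \mu) \phi_{\sigma'}(x - \mu') \, dx = \phi_{\sqrt{\sigma^2 + \sigma'^2}}(\mu - \mu').
\]
Also recall that
\[
\mathrm{MISE}\{\hat f(\cdot; h)\} = \frac{1}{nh} \int K^2(x) \, dx + \Big(1 - \frac{1}{n}\Big) \int (K_h * f)^2(x) \, dx - 2 \int (K_h * f)(x) f(x) \, dx + \int f^2(x) \, dx.
\]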
8.1 MISE for a single normal distribution
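Take $K$ to be the $N(0, 1)$ density and $f$ to be the $N(0, \sigma^2)$ density, so that
\[
K_h(x) = \phi_h(x), \quad \text{and} \quad f(x) = \phi_\sigma(x).
\]
It is easy to show that
\[
\begin{aligned}
\int K^2(x) \, dx &= \phi_{\sqrt{2}}(0) = \frac{1}{2\sqrt{\pi}}; \\
\int (K_h * f)^2(x) \, dx &= \int \phi^2_{\sqrt{h^2 + \sigma^2}}(x) \, dx = \phi_{\sqrt{2(h^2 + \sigma^2)}}(0) = \frac{1}{2\sqrt{\pi (h^2 + \sigma^2)}}; \\
\int (K_h * f)(x) f(x) \, dx &= \int \phi_{\sqrt{h^2 + \sigma^2}}(x) \phi_\sigma(x) \, dx = \phi_{\sqrt{h^2 + 2\sigma^2}}(0) = \frac{1}{\sqrt{2\pi (h^2 + 2\sigma^2)}}; \\
\int f^2(x) \, dx &= \phi_{\sqrt{2}\sigma}(0) = \frac{1}{2\sqrt{\pi}\sigma}.
\end{aligned}
\]
Therefore
\[
\mathrm{MISE}\{\hat f(\cdot; h)\} = \frac{1}{2\sqrt{\pi}} \Big\{ \frac{1}{nh} + \frac{1 - n^{-1}}{\sqrt{h^2 + \sigma^2}} - \frac{2^{3/2}}{\sqrt{h^2 + 2\sigma^2}} + \frac{1}{\sigma} \Big\}.
\]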
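Because this MISE is available in closed form, the exact optimal bandwidth can be compared with the asymptotic one. The sketch below uses a crude grid search; the grid limits, the sample size $n = 1000$ and the helper names are arbitrary choices for illustration.

```python
import math

def exact_mise_normal(h, n, sigma=1.0):
    """Exact MISE for a Gaussian kernel and N(0, sigma^2) data (Section 8.1)."""
    c = 1.0 / (2.0 * math.sqrt(math.pi))
    return c * (1.0 / (n * h)
                + (1.0 - 1.0 / n) / math.sqrt(h * h + sigma * sigma)
                - 2.0 ** 1.5 / math.sqrt(h * h + 2.0 * sigma * sigma)
                + 1.0 / sigma)

def argmin_on_grid(func, lo, hi, steps=20000):
    """Crude grid search; adequate for a smooth one-dimensional criterion."""
    best_h, best_val = lo, func(lo)
    for i in range(1, steps + 1):
        h = lo + (hi - lo) * i / steps
        val = func(h)
        if val < best_val:
            best_h, best_val = h, val
    return best_h, best_val

n = 1000
h_exact, mise_min = argmin_on_grid(lambda h: exact_mise_normal(h, n), 0.01, 2.0)
h_asym = (4.0 / (3.0 * n)) ** 0.2   # h_AMISE for the same kernel and density
print(h_exact, h_asym, mise_min)
```

For a target this close to the assumptions behind AMISE, the two bandwidths should be of similar size; the gap is expected to be larger for densities with more structure.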
8.2 MISE for normal mixtures
Continue using $K(x) = \phi_1(x)$. Suppose the density can be written as a mixture of normal distributions,
\[
f(x) = \sum_{l=1}^{k} w_l \phi_{\sigma_l}(x - \mu_l),
\]
where $k \in \mathbb{Z}_+$, $w_1, \dots, w_k$ are positive numbers that sum to one, and for each $l$, $\mu_l \in \mathbb{R}$ and $\sigma_l^2 > 0$. Similarly (almost trivial to verify),
\[
\mathrm{MISE}\{\hat f(\cdot; h)\} = \frac{1}{2\sqrt{\pi} n h} + w^T \{ (1 - n^{-1}) \Omega_2 - 2 \Omega_1 + \Omega_0 \} w,
\]
where $w = (w_1, \dots, w_k)^T$, and
\[
\Omega_a[l, l'] = \phi_{\sqrt{a h^2 + \sigma_l^2 + \sigma_{l'}^2}}(\mu_l - \mu_{l'}).
\]
If we plot the exact and the asymptotic MISE, IV (integrated variance) and ISB (integrated squared bias) against the bandwidth, we see that IV/AIV decreases fairly "uniformly" as $\log h$ increases, but ISB/AISB increases very "non-uniformly". This is because the bias approximation is based on the assumption that $h \to 0$. Overall, for densities close to normality, the bias approximation tends to be quite reasonable, while for densities with more features, such as multiple modes, the approximation becomes worse; see Marron and Wand [7].
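The normal mixture formula of Section 8.2 is straightforward to evaluate. The following is a minimal sketch; the bimodal example (two equally weighted components) and all function names are hypothetical and chosen only to illustrate the computation.

```python
import math

def phi(x, sd):
    """Density of N(0, sd^2) evaluated at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def mise_normal_mixture(h, n, w, mu, sigma):
    """Exact MISE of a Gaussian-kernel estimator for a normal mixture target."""
    k = len(w)

    def omega(a, l, m):
        # Omega_a[l, m] = phi_{sqrt(a h^2 + sigma_l^2 + sigma_m^2)}(mu_l - mu_m)
        sd = math.sqrt(a * h * h + sigma[l] ** 2 + sigma[m] ** 2)
        return phi(mu[l] - mu[m], sd)

    quad = 0.0
    for l in range(k):
        for m in range(k):
            entry = ((1.0 - 1.0 / n) * omega(2, l, m)
                     - 2.0 * omega(1, l, m) + omega(0, l, m))
            quad += w[l] * entry * w[m]
    return 1.0 / (2.0 * math.sqrt(math.pi) * n * h) + quad

# Hypothetical bimodal target: equal mixture of N(-1.5, 0.5^2) and N(1.5, 0.5^2).
w, mu, sigma = [0.5, 0.5], [-1.5, 1.5], [0.5, 0.5]
for h in (0.1, 0.2, 0.4):
    print(h, mise_normal_mixture(h, 1000, w, mu, sigma))
```

With a single component ($k = 1$) the function reduces to the single-normal formula of Section 8.1, which gives a convenient sanity check.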
8.3 Using characteristic functions to simplify MISE
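For a real-valued function $g$, denote its Fourier transform by $\varphi_g(t) = \int e^{itx} g(x) \, dx$; then
\[
\varphi_{f * g}(t) = \varphi_f(t) \varphi_g(t).
\]
Also, recall the well-known Parseval identity
\[
\int f(x) g(x) \, dx = \frac{1}{2\pi} \int \varphi_f(t) \overline{\varphi_g(t)} \, dt.
\]
From these properties, we can rewrite the MISE as
\[
\mathrm{MISE}\{\hat f(\cdot; h)\} = \frac{1}{2\pi n h} \int \kappa(t)^2 \, dt + \frac{1}{2\pi} \int \Big\{ \Big(1 - \frac{1}{n}\Big) \kappa(ht)^2 - 2 \kappa(ht) + 1 \Big\} |\varphi_f(t)|^2 \, dt,
\]
where $\kappa(t) = \varphi_K(t)$.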
8.3.1 sinc kernel
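The sinc kernel and its characteristic function are given below:
\[
K(x) = \frac{\sin x}{\pi x}, \quad \kappa(t) = \int e^{itx} \frac{\sin x}{\pi x} \, dx = 1_{\{|t| \le 1\}}.
\]
Note that $|\varphi_f(t)|^2 = \varphi_f(t) \varphi_f(-t)$ is symmetric about 0, hence
\[
\mathrm{MISE}\{\hat f(\cdot; h)\} = \frac{1}{\pi n h} - \frac{1 + n^{-1}}{\pi} \int_0^{1/h} |\varphi_f(t)|^2 \, dt + \int f^2(x) \, dx.
\]
Davis [3] showed that the MISE-optimal bandwidth satisfies
\[
\Big| \varphi_f\Big( \frac{1}{h_{\mathrm{MISE}}} \Big) \Big|^2 = \frac{1}{n + 1},
\]
provided $|\varphi_f(t)| > 0$ for all $t$. Heuristically, if we treat $|\varphi_f(t)|^2$ as roughly constant over $[0, 1/h]$, this amounts to setting
\[
\frac{1}{\pi n h} = \frac{n + 1}{n \pi} \, \frac{1}{h} \, \Big| \varphi_f\Big( \frac{1}{h} \Big) \Big|^2 .
\]
If $f$ is the normal density, it can be shown that for the sinc kernel,
\[
\inf_{h > 0} \mathrm{MISE}\{\hat f(\cdot; h)\} = O\{ (\log n)^{1/2} n^{-1} \},
\]
which is faster than any rate of order $n^{-\alpha}$, $0 < \alpha < 1$. The MISE is still not $O(n^{-1})$, because the optimal bandwidth has to tend to zero, so the variance term $1/(\pi n h)$ is of larger order than $n^{-1}$. This is an example of a higher-order kernel with "infinite" order, i.e. $\mu_j(K) = 0$ for all $j \in \mathbb{N}$.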
8.3.2 Laplace kernel, with exponential density
The Laplace kernel and its characteristic function are given below:
\[
K(x) = \frac{1}{2} e^{-|x|}, \quad \kappa(t) = \int e^{itx} \frac{1}{2} e^{-|x|} \, dx = \frac{1}{1 + t^2}.
\]
Using the exponential density
\[
f(x) = e^{-x} 1_{\{x > 0\}}, \quad \varphi_f(t) = \frac{1}{1 - it},
\]
standard calculations give
\[
\mathrm{MISE}\{\hat f(\cdot; h)\} = \frac{1}{4nh} + \frac{2nh^2 + (n - 1)h - 2}{4n(1 + h)^2}.
\]
Taking the derivative with respect to $h$, it is easy to find $h_{\mathrm{MISE}} = 1/\sqrt{n}$, which attains the minimal MISE of $1/(2 + 2\sqrt{n})$.
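The closed form for the Laplace kernel with exponential data makes the claimed optimum easy to verify numerically. The sketch below evaluates the MISE at $h = 1/\sqrt{n}$ and compares it with $1/(2 + 2\sqrt{n})$; the function name is made up for this example.

```python
import math

def mise_laplace_exponential(h, n):
    """Closed-form MISE for the Laplace kernel and the Exp(1) density."""
    return 1.0 / (4.0 * n * h) + (2.0 * n * h * h + (n - 1.0) * h - 2.0) / (
        4.0 * n * (1.0 + h) ** 2)

for n in (4, 100, 10000):
    h_star = 1.0 / math.sqrt(n)
    claimed = 1.0 / (2.0 + 2.0 * math.sqrt(n))
    # The two printed values should agree up to rounding error.
    print(n, mise_laplace_exponential(h_star, n), claimed)
```

Note that the minimal MISE here is only of order $n^{-1/2}$, much slower than $n^{-4/5}$; the exponential density is discontinuous at zero, so the smoothness assumptions of Section 7.1 do not hold.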
9 Canonical kernels and optimal kernel theory
Now we investigate how the shape of the kernel influences the estimator. In order to obtain admissible estimators, Cline [2] showed that the kernel should be symmetric and unimodal.
Recall the two components of AMISE,
\[
\mathrm{AMISE}\{\hat f(\cdot; h)\} = \frac{1}{nh} R(K) + \frac{1}{4} h^4 \mu_2(K)^2 R(f'').
\]
If we want to separate $h$ and $K$, we need $R(K) = \mu_2(K)^2$. Consider rescalings of $K$ of the form
\[
K_\delta(\cdot) = K(\cdot / \delta) / \delta,
\]
for which
\[
R(K_\delta) = \delta^{-1} R(K), \quad \mu_2(K_\delta)^2 = \delta^4 \mu_2(K)^2.
\]
Requiring $R(K_\delta) = \mu_2(K_\delta)^2$ gives
\[
\delta_0 = \{ R(K) / \mu_2(K)^2 \}^{1/5}.
\]
Using the kernel $K_{\delta_0}$, we can rewrite AMISE as
\[
\mathrm{AMISE}\{\hat f(\cdot; h)\} = C(K) \Big\{ \frac{1}{nh} + \frac{1}{4} h^4 R(f'') \Big\}, \quad \text{where } C(K) = \{ R(K)^4 \mu_2(K)^2 \}^{1/5}.
\]
Notice that $C(K)$ is invariant to rescaling of $K$. We call $K^c = K_{\delta_0}$ the canonical kernel for the class $\{ K_\delta : \delta > 0 \}$. It is the unique member of this class that permits the "decoupling" of $K$ and $h$, see Marron and Nolan [6].
For example, let $K = \phi$, the standard normal density; then $\delta_0 = (4\pi)^{-1/10}$, so
\[
\phi^c(x) = \phi_{(4\pi)^{-1/10}}(x), \quad C(\phi) = (4\pi)^{-2/5}.
\]
Canonical kernels are very useful for pictorial comparison of density estimates based on differently shaped kernels, since they are defined in such a way that a particular single choice of bandwidth gives roughly the same amount of smoothing.
The problem of choosing the kernel shape thus becomes that of choosing $K$ to minimize $C(K_{\delta_0})$. Since $C(K)$ is scale invariant, the optimal $K$ is the one that minimizes $C(K)$ subject to
\[
\int K(x) \, dx = 1, \quad \int x K(x) \, dx = 0, \quad \int x^2 K(x) \, dx = a^2 < \infty, \quad K(x) \ge 0 \text{ for all } x.
\]
Hodges and Lehmann [4] showed that the solution can be written as
\[
K^a(x) = \frac{3}{4} \Big\{ 1 - \frac{x^2}{5a^2} \Big\} \Big/ (5^{1/2} a) \; 1_{\{|x| < 5^{1/2} a\}}.
\]
A special case is $a = 1/\sqrt{5}$ (the Epanechnikov kernel), for which
\[
K^*(x) = \frac{3}{4} (1 - x^2) 1_{\{|x| < 1\}}.
\]
The efficiency of a kernel $K$ relative to $K^*$ is defined as $\{ C(K^*) / C(K) \}^{5/4}$. Consider the family
\[
K(x; p) = \{ 2^{2p+1} B(p + 1, p + 1) \}^{-1} (1 - x^2)^p 1_{\{|x| < 1\}},
\]
where $B(\cdot, \cdot)$ is the beta function; these are symmetric beta densities rescaled to $[-1, 1]$.
Table 1 shows that the efficiency changes very little across different kernel shapes. The uniform kernel is not very popular in practice since the resulting estimate is piecewise constant. Even the Epanechnikov kernel is not so attractive, because the estimator has a discontinuous first derivative.
Table 1: Efficiencies of several kernels compared to the optimal kernel.

| Kernel | Form | $\{C(K^*)/C(K)\}^{5/4}$ |
|---|---|---|
| Epanechnikov | $K(x; 1)$ | 1.000 |
| Biweight | $K(x; 2)$ | 0.994 |
| Triweight | $K(x; 3)$ | 0.987 |
| Normal | $K(x; \infty)$ | 0.951 |
| Triangular | | 0.986 |
| Uniform | $K(x; 0)$ | 0.930 |
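The efficiencies in Table 1 can be reproduced from $C(K) = \{R(K)^4 \mu_2(K)^2\}^{1/5}$. The sketch below plugs in the standard closed-form values of $R(K)$ and $\mu_2(K)$ for each kernel in its usual parametrization; these constants are supplied here as known facts about the kernels, not taken from the notes, and the dictionary and function names are illustrative.

```python
import math

# (R(K), mu_2(K)) for each kernel in its usual parametrization.
KERNELS = {
    "Epanechnikov": (3.0 / 5.0, 1.0 / 5.0),
    "Biweight": (5.0 / 7.0, 1.0 / 7.0),
    "Triweight": (350.0 / 429.0, 1.0 / 9.0),
    "Normal": (1.0 / (2.0 * math.sqrt(math.pi)), 1.0),
    "Triangular": (2.0 / 3.0, 1.0 / 6.0),
    "Uniform": (1.0 / 2.0, 1.0 / 3.0),
}

def C(R_K, mu2_K):
    """C(K) = {R(K)^4 mu_2(K)^2}^(1/5); invariant to rescaling of K."""
    return (R_K ** 4 * mu2_K ** 2) ** 0.2

def efficiency(name):
    """{C(K*) / C(K)}^(5/4), with K* the Epanechnikov kernel."""
    c_star = C(*KERNELS["Epanechnikov"])
    return (c_star / C(*KERNELS[name])) ** 1.25

for name in KERNELS:
    print(f"{name:13s} {efficiency(name):.3f}")
```

Because $C(K)$ is scale invariant, the same numbers result whichever scaling of each kernel is used.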
10 Higher-order kernels
10.1 Why higher-order kernels?
We know that the best obtainable rate of convergence of the kernel estimator is of order $n^{-4/5}$. If we relax the condition that $K$ must be a density, the convergence rate can be faster. For example, recall that the asymptotic bias is given by
\[
E\hat f(x; h) - f(x) = \frac{1}{2} h^2 \mu_2(K) f''(x) + o(h^2).
\]
If we set $\mu_2(K) = 0$, the leading bias term vanishes, and the MSE and MISE will have an optimal convergence rate of order $n^{-8/9}$.
10.2 What are higher-order kernels?
We insist that $K$ be symmetric, and say that $K$ is a $k$-th order kernel if
\[
\mu_0(K) = 1, \quad \mu_j(K) = 0 \text{ for } j = 1, \dots, k - 1, \quad \mu_k(K) \neq 0,
\]
where $\mu_j(K) := \int z^j K(z) \, dz$.
10.3 How to get higher-order kernels?
One way to generate higher-order kernels is recursively from lower-order kernels,
\[
K_{[k+2]}(x) = \frac{3}{2} K_{[k]}(x) + \frac{1}{2} x K'_{[k]}(x).
\]
For example, set $K_{[2]}(x) = \phi(x)$; then $K_{[4]}(x) = \frac{1}{2} (3 - x^2) \phi(x)$.
Another class of higher-order kernels, developed for the case where $f$ is a normal mixture density, is given by Marron and Wand [7]:
\[
G_{[k]}(x) = \sum_{l=0}^{k/2 - 1} \frac{(-1)^l}{2^l \, l!} \phi^{(2l)}(x), \quad k = 2, 4, 6, \dots .
\]
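The moment conditions for the example kernel $K_{[4]}(x) = \frac{1}{2}(3 - x^2)\phi(x)$ can be checked numerically. The sketch below approximates $\mu_j(K)$ with a midpoint Riemann sum; the integration range, step count and function names are arbitrary choices for illustration.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def k4(x):
    """Fourth-order kernel K_[4](x) = (3 - x^2) phi(x) / 2."""
    return 0.5 * (3.0 - x * x) * phi(x)

def moment(kernel, j, lo=-12.0, hi=12.0, steps=24000):
    """mu_j(K) approximated by a midpoint Riemann sum on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += x ** j * kernel(x)
    return total * dx

# mu_0 should be close to 1, mu_1 to mu_3 close to 0, and mu_4 nonzero.
for j in range(5):
    print(j, moment(k4, j))

# The kernel takes negative values for |x| > sqrt(3).
print(k4(2.0))
```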
10.4 Misc topics about higher-order kernels.
The convergence rate can be made arbitrarily close to the parametric rate $n^{-1}$ as the order increases, which means that a higher-order kernel estimator will eventually dominate second-order kernel estimators for large $n$. However, this requires a large sample size ($K_{[4]}$ would require several thousand observations in order to reduce the MISE compared to the normal kernel). Another price to be paid for higher-order kernels is that the negative parts of the kernel may make the estimated density not a density itself.
An extreme case of higher-order kernels is the "infinite"-order kernel, such as the sinc kernel $K(x) = \sin x / (\pi x)$. The sinc kernel estimator suffers from the same drawbacks as other higher-order kernel estimators, and the good asymptotic performance is not guaranteed to carry over to finite sample sizes in practice.
11 Measuring how difficult a density is to estimate
Recall that, for $K$ a symmetric probability density kernel,
\[
\inf_{h > 0} \mathrm{MISE}\{\hat f(\cdot; h)\} \sim \frac{5}{4} C(K) R(f'')^{1/5} n^{-4/5},
\]
so the magnitude of $R(f'')$ tells us how well $f$ can be estimated even when $h$ is chosen optimally.
First, we cannot have $R(f'') = 0$, since $f$ cannot be constant over the real line, and the uniform density is difficult to estimate at its boundaries. Second, $R(f'')$ is not scale invariant. To appreciate this, suppose $X$ has density $f_X$, and $Y = X/a$ has density $f_Y(x) = a f_X(ax)$ for some positive $a$. Then $R(f''_Y) = a^5 R(f''_X)$, but
\[
D(f) = \{ \sigma(f)^5 R(f'') \}^{1/4}
\]
is scale invariant, where $\sigma(f)$ is the population standard deviation of $f$. The inclusion of the $\frac{1}{4}$ power allows an equivalent sample size interpretation, as was done for the comparison of kernel shapes.
One result for $D(f)$ is that it attains its minimum for the Beta(4, 4) shape,
\[
f^*(x) = \frac{35}{32} (1 - x^2)^3 1_{\{|x| < 1\}}.
\]
A more general result is that densities close to normality appear to be easier for the kernel estimator to estimate. The degree of estimation difficulty increases with skewness, kurtosis, and multimodality.
12 Modification of kernel density estimator
12.1 Local kernel density estimators
One modification is to let the bandwidth $h$ vary along the $x$-axis,
\[
\hat f_L(x; h(x)) = \{ n h(x) \}^{-1} \sum_{i=1}^{n} K\{ (x - X_i) / h(x) \}.
\]
The optimal $h$ for the asymptotic MSE at $x$ is
\[
h_{\mathrm{AMSE}}(x) = \Big[ \frac{R(K) f(x)}{\mu_2(K)^2 f''(x)^2 n} \Big]^{1/5}, \quad \text{provided } f''(x) \neq 0.
\]
When $f''(x) = 0$, additional terms should be taken into account, see Schucany [8]. If this $h_{\mathrm{AMSE}}(x)$ is chosen for every $x$, then the corresponding value of $\mathrm{AMISE}\{\hat f_L(\cdot; h(\cdot))\}$ can be shown to be
\[
\frac{5}{4} \{ \mu_2(K)^2 R(K)^4 \}^{1/5} R\big( (f^2 f'')^{1/5} \big) n^{-4/5}.
\]
Although the rate $n^{-4/5}$ shows no improvement, the constant is never worse, since
\[
R\big( (f^2 f'')^{1/5} \big) \le R(f'')^{1/5}
\]
holds for all $f$. The sample size efficiency depends on the magnitude of the ratio
\[
\big\{ R\big( (f^2 f'')^{1/5} \big) / R(f'')^{1/5} \big\}^{5/4}.
\]
There are other approaches, such as the nearest neighbor density estimator, see Loftsgaarden and Quesenberry [5]. It uses the distance from $x$ to the data point that is the $k$th nearest to $x$ (for some suitable $k$) in a pilot estimation step that is essentially equivalent to $h(x) \propto 1/f(x)$. However, $1/f(x)$ is usually not a satisfactory surrogate for the optimal $\{ f(x) / f''(x)^2 \}^{1/5}$.
12.2 Variable kernel density estimators
Instead of a single $h$ or a function $h(x)$, the idea here is to use $n$ values $\alpha(X_i)$, $i = 1, \dots, n$. The kernel centered on $X_i$ thus has its own scale parameter $\alpha(X_i)$, allowing different degrees of smoothing depending on the location. The estimator has the form
\[
\hat f_V(x; \alpha) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\alpha(X_i)} K\Big( \frac{x - X_i}{\alpha(X_i)} \Big).
\]
Intuition suggests that each $\alpha(X_i)$ should depend on the true density in a roughly inverse way. A theoretical result by Abramson [1] shows that taking $\alpha(X_i) = h f^{-1/2}(X_i)$ is a particularly good choice, since one can then achieve a bias of order $h^4$ rather than $h^2$. Pilot estimation to obtain $\alpha(X_i)$ is necessary for a practical implementation of $\hat f_V(\cdot; \alpha)$; this is the extra expense compared with using a fourth-order kernel, but no negativity is produced. The optimal MSE can be improved from order $n^{-4/5}$ to $n^{-8/9}$.
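A practical version of the square-root rule needs a pilot estimate in place of the unknown $f$. The sketch below uses an ordinary fixed-bandwidth Gaussian estimate as the pilot; the pilot bandwidth, the toy data and all function names are assumptions made for illustration, not a prescription from the notes.

```python
import math

def phi(u):
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def fixed_kde(x, sample, h):
    """Ordinary fixed-bandwidth estimate, used here as the pilot."""
    return sum(phi((x - xi) / h) for xi in sample) / (len(sample) * h)

def abramson_scales(sample, h, pilot_h):
    """alpha(X_i) = h * pilot(X_i)^(-1/2), one scale per observation."""
    return [h / math.sqrt(fixed_kde(xi, sample, pilot_h)) for xi in sample]

def variable_kde(x, sample, scales):
    """f_V(x; alpha) = (1/n) sum_i K((x - X_i) / alpha_i) / alpha_i."""
    return sum(phi((x - xi) / a) / a
               for xi, a in zip(sample, scales)) / len(sample)

# Made-up data with one isolated observation at 3.0.
sample = [-0.3, -0.1, 0.0, 0.2, 0.4, 3.0]
scales = abramson_scales(sample, h=0.3, pilot_h=0.5)
print([round(a, 3) for a in scales])
print(variable_kde(0.0, sample, scales), variable_kde(3.0, sample, scales))
```

The isolated observation receives the largest scale, so the estimate is smoothed more heavily where the data are sparse, while each term remains a proper density and the estimate stays nonnegative.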
12.3 Transformation kernel density estimator
If the random sample $X_i$ has a density $f$ that is difficult to estimate, then apply a transformation such that $Y_i$ has a density $g$ which is easier to estimate. One can "backtransform" the estimate of $g$ to obtain the estimate of $f$.
Suppose that $Y_i = t(X_i)$, where $t$ is an increasing differentiable function defined on the support of $f$. Then $f(x) = g(t(x)) t'(x)$, and the density can be estimated by
\[
\hat f_T(x; h, t) = \frac{1}{n} \sum_{i=1}^{n} K_h\{ t(x) - t(X_i) \} t'(x).
\]
Applying the mean value theorem, we can see that this estimator lies somewhere in between $\hat f_L$ and $\hat f_V$,
\[
\hat f_T(x; h, t) = \frac{1}{n} \sum_{i=1}^{n} \frac{t'(x)}{h} K\Big\{ \frac{t'(\xi_i)(x - X_i)}{h} \Big\},
\]
where $\xi_i$ lies between $x$ and $X_i$.
The best choice of the transformation $t$ depends quite heavily on the shape of $f$. If $f$ is a skewed unimodal density, then $t$ is suggested to be a convex function on the support of $f$ in order to reduce the skewness of $f$ in some sense. If $f$ is close to being symmetric, but has a high amount of kurtosis, then $t$ should be concave to the left and convex to the right of the center of symmetry of $f$.
One approach is to apply the following family, called the shifted power family, to heavily skewed data:
\[
t(x; \lambda_1, \lambda_2) =
\begin{cases}
(x + \lambda_1)^{\lambda_2} \, \mathrm{sign}(\lambda_2), & \lambda_2 \neq 0, \\
\ln(x + \lambda_1), & \lambda_2 = 0,
\end{cases}
\]
where $\lambda_1 > -\min(X)$.
Another approach is to estimate $t$ nonparametrically. If $F$ and $G$ are the c.d.f.s of the densities $f$ and $g$, then $Y = G^{-1}(F(X))$ has density $g$. One could choose $G$ so that $g$ is easy to estimate, and take $t(x) = G^{-1}(\hat F(x))$.
13 Density estimation at boundaries
It becomes difficult to estimate the density at the boundary. Suppose $f$ is a density such that $f(x) = 0$ for $x < 0$ and $f(x) > 0$ for $x \ge 0$, and $f''$ is continuous away from $x = 0$. Also, let $K$ be a kernel with support confined to $[-1, 1]$.
For $x > 0$,
\[
E\hat f(x; h) = \int_{-1}^{x/h} K(z) f(x - hz) \, dz.
\]
Be aware that we still have a large bias at the boundary, but the performance is greatly improved.
14 Density derivative estimation
A natural estimator of the $r$-th derivative $f^{(r)}(x)$ is
\[
\hat f^{(r)}(x; h) = \frac{1}{n h^{r+1}} \sum_{i=1}^{n} K^{(r)}\Big( \frac{x - X_i}{h} \Big),
\]
provided $K$ is sufficiently differentiable. The MSE can be shown to be
\[
\mathrm{MSE}\{\hat f^{(r)}(x; h)\} = \frac{1}{n h^{2r+1}} R(K^{(r)}) f(x) + \frac{1}{4} h^4 \mu_2(K)^2 f^{(r+2)}(x)^2 + o\Big\{ \frac{1}{n h^{2r+1}} + h^4 \Big\}.
\]
It follows that the MSE-optimal bandwidth for estimating $f^{(r)}(x)$ is of order $n^{-1/(2r+5)}$. Therefore, the estimation of $f'(x)$ requires a bandwidth of order $n^{-1/7}$, compared to the optimal $n^{-1/5}$ for estimating $f$ itself. This reveals the increasing difficulty of estimating higher derivatives.
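For $r = 1$ and a Gaussian kernel, $K'(u) = -u\phi(u)$, so the derivative estimator is easy to write down. The sketch below is a minimal illustration; the toy data, the bandwidth and the function names are made up for this example.

```python
import math

def phi(u):
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h):
    """f_hat(x; h) with a Gaussian kernel."""
    return sum(phi((x - xi) / h) for xi in sample) / (len(sample) * h)

def kde_first_derivative(x, sample, h):
    """f_hat^(1)(x; h) = (1 / (n h^2)) sum_i K'((x - X_i) / h), K'(u) = -u phi(u)."""
    n = len(sample)
    return sum(-((x - xi) / h) * phi((x - xi) / h) for xi in sample) / (n * h * h)

sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]   # toy data for illustration
h = 0.6
for x in (-1.0, 0.0, 1.0):
    # Compare with a central finite difference of the density estimate itself.
    eps = 1e-5
    fd = (kde(x + eps, sample, h) - kde(x - eps, sample, h)) / (2 * eps)
    print(x, kde_first_derivative(x, sample, h), fd)
```

The derivative estimator is exactly the derivative of $\hat f(\cdot; h)$, so the two printed columns should agree up to finite-difference error.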
References
[1] I.S. Abramson. On bandwidth variation in kernel estimates - a square root law. Annals of Statistics, 9:168–76, 1982.
[2] D.B.H. Cline. Admissible kernel estimators of a multivariate density. Annals of Statistics, 16:1421–7, 1988.
[3] K.B. Davis. Mean square error properties of density estimates. Annals of Statistics, 75:1025–30, 1975.
[4] J.L. Hodges and E.L. Lehmann. The efficiency of some nonparametric competitors to the t-test. Annals of Mathematical Statistics, 13:324–35, 1956.
[5] D.O. Loftsgaarden and C.P. Quesenberry. A nonparametric density estimate of a multivariate density function. Annals of Mathematical Statistics, 36:1049–51, 1965.
[6] J.S. Marron and D. Nolan. Canonical kernel for density estimation. Statist. Probab. Lett., 7:195–9, 1989.
[7] J.S. Marron and M.P. Wand. Exact mean integrated squared error. Annals of Statistics, 20:712–36, 1992.
[8] W.R. Schucany. Locally optimal window width for kernel density estimation with large samples. Statist. Probab. Lett., 7:401–5, 1989.
[9] D.W. Scott. On optimal and data-based histograms. Biometrika, 66:605–10, 1979.
[10] D.W. Scott. Average shifted histograms: effective nonparametric density estimators in several dimensions. Annals of Statistics, 13:1024–40, 1985.

h) − f (x)}2 2 = n−1 {(Kh ∗ f )(x) − (Kh ∗ f )2 (x)} + {(Kh ∗ f )(x) − f (x)}2 . Consider the integrated squared error (ISE) and the mean integrated squared error (MISE) ˆ ISE{f (x. the choice of the shape of the kernel function is not a particular important one. 5.2 MISE However. h) − f (x)}2 dx = ˆ E{f (x. . h) = EKh (x − X) = (Kh ∗ f )(x). h) = n Kh (x − Xi ). h)} = E = n−1 ˆ {f (x. h)} = E{f (x.Kernel Smoothing Zhi Ouyang Augest. 1 ˆ f (x. where Kh (u) = i=1 1 u K . . . consider ˆ Ef (x. h h We shall see. h) − f (x)}2 dx {(Kh ∗ f )(x) − f (x)}2 dx. Xn taken from a continuous. h) = nh K i=1 x − Xi h . h) − f (x)}2 ˆ ˆ = Vf (x. h) − f (x)}2 dx. we are more interested in estimating f on a real line. 2 {(Kh ∗ f )(x) − (Kh ∗ f )2 (x)} dx + 2 . h) = E{n−1 Kh (x − Xi )2 } − (Kh ∗ f )2 (x) i = n −1 EKh (x − X)2 − n−2 E i=j Kh (x − Xi )Kh (x − Xj ) − (Kh ∗ f )2 (x) 2 = n−1 (Kh ∗ f )(x) + n(n − 1) − 1 (Kh ∗ f )2 (x) n2 2 = n−1 {(Kh ∗ f )(x) − (Kh ∗ f )2 (x)}. ˆ Vf (x. univariate density f . 2005 4 The estimator n n Suppose we have a random sample X1 . h)} = ˆ M ISE{f (. ˆ {f (x.1 MSE and MISE criteria MSE If we want to estimate f at a particular point x.. Then the mean squared error (MSE) can be written as ˆ ˆ M SE{f (x. but the choice of the value of the bandwidth is very important. or 1 ˆ f (x. . h) − {Ef (x. 5 5.

.3 MIAE We could also work with other criterions such as mean integrated absolute error (MIAE) ˆ M IAE{f (.. h)} = K 2 (x) dx + (1 − ) nh n (Kh ∗ f )2 (x) dx − 2 (Kh ∗ f )(x)f (x) dx + f 2 (x) dx 5. h)} = E ˆ |f (x. . p f (x + αn ) = j=0 j αn (j) p f (x) + o(αn ). +1 1 2 2µ σ e . Known that the maximum likelihood estimator is eX .. and E{eX } = eµ+ 2n . . V{eX } = e2µ+ n (e n −1 ). h) − f (x)| dx. . and we shall see that rates of convergence of nonparametric kernel estimators are typically slower than n−1 . Then for any sequence αn converging to zero. . n It is typical for MSE that has a rate of convergence of order n−1 . ¯ σ2 σ2 Then the MSE can be approximated as M SE(eX ) = e2µ (e = e2µ ∼ ¯ − 2e 2n + 1) 2σ 2 1 + n 2 2σ 2 n 2 1+ + . 3 . but more complicated. 2005 Notice that 2 (Kh ∗ f )(x) dx = = = = 1 h 1 2 x−y K ( )f (y) dy dx h2 h 1 2 K (z)f (x − hz) dz dx h 1 2 K (z) h f (x − hz) dx dz K 2 (x) dx Then MISE could be written as 1 1 ˆ M ISE{f (.Kernel Smoothing Zhi Ouyang Augest. ˆ MIAE is always deﬁned whenever f (x.2 Example Suppose we have a random sample X1 . Assume that f has p continuous derivatives in an interval (x − δ..1 Order and Taylor expansions Taylor’s Theorem Suppose f is a real-valued function deﬁned on R and let x ∈ R. −2 1+ σ2 1 + 2n 2 σ2 2n 2 + . 6 6. ¯ σ2 E{e2X } = e2µ+ 2σ 2 n σ2 ¯ 2σ 2 n . σ 2 ).. Xn from the N (µ. j! 6. h) is a density. x + δ) for some δ > 0. and it is invariant under monotone transformation.. and we are interested in estimating ¯ eµ .

we obtain that 1 f (x − hz) = f (x) − hzf (x) + h2 z 2 f (x) + o(h2 ).1 Asymptotic MSE and MISE approximations Assumptions and notations Assume that (i) The density f has second derivative f which is continuous. h)} = 4 . i. 2 nh 4 ˆ M SE{f (x. denote R(g) := g 2 (z) dz. K(z) dz = 1. 2005 7 7.. Expand f (x − hz) about x.Kernel Smoothing Zhi Ouyang Augest. 7. h) = n−1 {(Kh ∗ f )(x) − (Kh ∗ f )2 (x)} 1 K 2 (z)f (x − hz) dz − n−1 K(z)f (x − hz) dz = nh 1 = K 2 (z){f (x) + o(1)} dz − n−1 K(z){f (x) + o(1)} dz nh 1 1 R(K)f (x) + o{ }. Also. 2 nh 4 nh 1 ˆ ˆ M ISE{f (.. 1 1 1 R(K)f (x) + h4 µ2 (K)f 2 (z) + o{ + h4 }. h)} = AM ISE{f (. 2 Similarly. hence 1 ˆ Ef (x. h)} = R(K) + h4 µ2 (K)R(f ). such that ¯ n→∞ lim h = 0. h) − f (x) = h2 µ2 (K)f (x) + o(h2 ). 2 which is uniformly in z. µ2 (K) := z 2 K(z) dz < ∞. nh 1 1 ˆ AM ISE{f (. (iii) The kernel K is a bounded probability density function having ﬁnite fourth moment and ¯ symmetric about the origin.. and n→∞ lim nh = ∞. = nh nh Therefore.e. square integrable and monotone.2 Calculations 1 x−y K( )f (y) dy = h h Recall that ˆ Ef (x. ¯ (ii) The bandwidth hn is non-random sequence of positive numbers. h) = (Kh ∗ f )(x) = K(z)f (x − hz)dz. 2 ˆ Vf (x. zK(z) dz = 0. h)} + o{ + h4 }.

h)} ∼ {µ2 (K)R(K)4 R(f )}1/5 n−4/5 . 2005 Notice that the tail term o{(nh)−1 + h4 } shows the variance-bias trade-oﬀ. Vary x in diﬀerent bins. x F (x) = −∞ f (x)dx. 1 b2 ˆ AM ISE{f (. 4 2 Aside from its dependence on the known K and n. xn .. b) = ˆ Bias{f (x. h} = AM ISE{f (.3 Comparison with histogram Suppose the knots are x0 .R(f ). b) = ˆ Vf (x. i=1 where x ∈ (xk . Since f is a density function. More reference see Scott [9]. 2 1 n(n − 1) {F (xk + b) − F (xk )} + {F (xk + b) − F (xk )}2 . we can rewrite hMISE ∼ R(K) nµ2 (K)R(f ) 2 1/5 . b 2 xk b f (xk ) − {f (xk ) + (x − xk )f (xk ) + o(x − xk )} + f (xk ) + o(b) 2 b { − (x − xk )}f (xk ) + o(b).d. 4 2 Equivalently. as n goes to inﬁnity. i.Kernel Smoothing Zhi Ouyang Augest. which is less eﬃcient than that of MSE(n−1 ). denote the c. The problem is that we do not know the curvature. h)} + o{(nb)−1 + b2 }. the best obtainable rate of convergence of the MISE of the kernel estimator is of order n−4/5 . b>0 4 In other words.xk +b] (Xi ). 7. the MISE of histogram is asymptotically inferior to the kernel estimator.. h} = + R(f ). .f. M ISE is asymptotically minimized at bMISE ∼ {6/R(f )}1/3 n−1/3 . b) = nb Then ˆ Ef (x.e. while AMISE could be minimized at hAMISE R(K) = 2 (K)R(f ) nµ2 1/5 .. b)} = = ˆ Ef 2 (x. b) = F (xk + b) − F (xk ) b = f (xk ) + f (xk ) + o(b). . . h>0 5 ˆ inf AMISE{f (x. the expression shows us the optimal h is inversely proportional to the curvature of f . 1 ˆ inf MISE{f (x. Another thing could be seen is that. . since its convergence rate is O(n−2/3 ) compared to the kernel estimator’s O(n−4/5 ) rate. h>0 5 ˆ inf MISE{f (x. nb n 1 b f (x) dx = xk +b n I(xk . 2 nb n2 b2 1 1 {f (xk ) + o(1)} − {f (xk ) + o(1)}2 . we have ˆ ˆ M ISE{f (. and take integration. The histogram could be written as 1 ˆ f (x. where xk = x0 + kb. nb 12 Therefore. h)} ∼ {36R(f )}1/3 n−2/3 . 5 . h)} = {µ2 (K)R(K)4 R(f )}1/5 n−4/5 . but there are ways to estimate it. xk + b].

2 πσ Therefore 1 ˆ M ISE{f (. σ 2 ) distribution. Similarly. 2 πnh where w = (w1 . 1) density and f to be the N (0. see Marron and Wand [7]. . then Kh (x) = φh (x). If we plot the exact and the asymptotic MISE/IV(integrated variance)/ISB(integrated squared bias) according to diﬀerent bandwidth. It is very easy to show that 1 K 2 (x) dx = φ√2 (0) = √ . f (x) = l=1 wl φσl (x − µl ). . we know that φσ (x − µ)φσ (x − µ ) dx = φ√σ2 +σ 2 (µ − µ ). wk are positive numbers that sum to one. 2 π (Kh ∗ f )2 (x) dx = (Kh ∗ f )(x)f (x) dx = √ φ2 h2 +σ2 (x) dx = φ√2(h2 +σ2 ) (0) = 1 2 π(h2 + σ 2 ) 1 .2 MISE for normal mixtures k Continue using K(x) = φ1 (x). . Also recall that 1 ˆ M ISE{f (.. 6 . . this approximation becomes worse. . σ 2 ) density. where k ∈ w1 . h)} = √ 2 π 1 1 − n−1 23/2 1 +√ −√ + 2 + σ2 2 + 2σ 2 nh σ h h . This is because the bias approximation is based on the assumption that h → 0. and f (x) = φσ (x). for densities close to normality. while for densities with more features such as multiple modes. Overall. 2005 8 Exact MISE calculations Recall that φσ (x − µ) is the density of the N (µ.Kernel Smoothing Zhi Ouyang Augest. 8. h)} = √ + wT ((1 − n−1 )Ω2 − 2Ω1 + Ω0 )w.1 MISE for a single normal distribution Take K to be the N (0. l ] = φq 2 2 ah2 +σl +σl Z+ . . the bias approximation tends to be quite reasonable. Suppose the density can be written as mixture of normal distributions. and Ωa [l. . but ISB/AISB increases very ”non-uniformly”. . (µl − µl ). .. 8. h)} = nh K 2 (x) dx + (1 − 1 ) n (Kh ∗ f )2 (x) dx − 2 (Kh ∗ f )(x)f (x) dx + f 2 (x) dx. wk )T . an for each l. (almost trivial to verify) 1 ˆ M ISE{f (. φ√h2 +σ2 (x)φσ (x) dx = φ√h2 +2σ2 (0) = 2π(h2 + 2σ 2 ) 1 f 2 (x) dx = φ√2σ (0) = √ .. µl ∈ R and σl2 > 0. we shall see that IV/AIV decreases fairly ”uniformly” as log h increases.

κ(t) = eitx e−|x| dx = .3. This is an example of a higher-order kernel with ”inﬁnite” order. h)} = + . f (x) = e−x 1{x>0} . 4nh 4n(1 + h)2 Take the derivative on h.. hence 1 1 + n−1 1/h − |ϕf (t)|2 dt + f 2 (x) dx. ˆ inf M ISE{f (. 8. κ(t) = eitx dx = 1{|t|≤1} . h)} = κ2 (t) dt + 2πnh 2π n where κ(t) = ϕK (t). then set n+11 1 1 = |ϕf ( )|2 . denote its Fourier transformation as ϕg (t) = Also. hM ISE n+1 provided |ϕf (t)| > 0 for all t.3. h)} = 1 .. µj (K) = 0 for all j ∈ N.3 Using characteristic functions to simplify MISE eitx g(x) dx. we could easily rewrite the MISE as 1 1 1 ˆ (1 − )κ2 (ht) − 2κ(ht) + 1 M ISE{f (. it can be shown that for sinc kernel. and it is easy to ﬁnd hM ISE = 7 1 √ . For real-valued function g. with exponential density The Laplace kernel and its characteristic function is given below 1 1 1 K(x) = e−|x| . πnh nπ h h If f is the normal density. but the MISE is not O(n−1 ) since R(K) is inﬁnite. The sinc kernel and its characteristic function is given below sin x sin x .Kernel Smoothing Zhi Ouyang Augest. h)} = O{(log n)1/2 n−1 }. ϕf = h>0 1 2 which is faster than any rate of order n−α . 2005 8.. 8. recall the well-known Paseval’s identity 1 f (x)g(x) dx = ϕf (t)ϕg (t) dt. 2 2 1 + t2 Using the exponential density 1 . 0 < α < 1. K(x) = πx πx Note that |ϕf (t)2 | = ϕf (t)ϕf (−t) is symmetric about 0. 2π From those properties. πnh π 0 Davis [3] showed that the MISE-optimal bandwidth satisﬁes ˆ M ISE{f (. we got 1 2nh2 + (n − 1)h − 2 ˆ M ISE{f (. then ϕf ∗g (t) = ϕf (t)ϕg (t).e. ϕf (t) = 1 − it After standard calculations. 2+2 n . n which attains minimal M ISE = 1√ . i.1 sinc kernel |ϕf (t)|2 dt. We might think of ϕf (t) is a constant near 0..2 Laplace kernel.

9 Canonical kernels and optimal kernel theory

Now we investigate how the shape of the kernel influences the estimator. Recall the two components of the AMISE,
$$\mathrm{AMISE}\{\hat f(\cdot;h)\} = \frac{1}{nh}R(K) + \frac{1}{4}h^4\mu_2(K)^2R(f'').$$
If we want to separate $h$ and $K$, we need $R(K) = \mu_2(K)^2$. Consider the scaling of $K$ of the form
$$K_\delta(\cdot) = K(\cdot/\delta)/\delta.$$
Then
$$R(K_\delta) = \delta^{-1}R(K), \qquad \mu_2(K_\delta)^2 = \delta^4\mu_2(K)^2,$$
so we need $\delta_0 = \{R(K)/\mu_2(K)^2\}^{1/5}$. Plugging $K_{\delta_0}$ into the equation above, we can rewrite the AMISE as
$$\mathrm{AMISE}\{\hat f(\cdot;h)\} = C(K)\Big\{\frac{1}{nh} + \frac{1}{4}h^4R(f'')\Big\}, \qquad \text{where } C(K) = \{R(K)^4\mu_2(K)^2\}^{1/5}.$$
Notice that $C(K)$ is invariant to scaling of $K$. We call $K^c = K_{\delta_0}$ the canonical kernel for the class $\{K_\delta : \delta > 0\}$. It is the unique member of this class that permits the "decoupling" of $K$ and $h$, see Marron and Nolan [6]. For example, let $K = \phi$, the standard normal density; then $\delta_0 = (4\pi)^{-1/10}$, so $\phi^c(x) = \phi_{(4\pi)^{-1/10}}(x)$ and $C(\phi) = (4\pi)^{-2/5}$.

Canonical kernels are very useful for pictorial comparison of density estimates based on differently shaped kernels, since they are defined in such a way that a particular single choice of bandwidth gives roughly the same amount of smoothing.

The problem has now changed to choosing $K$ to minimise $C(K_{\delta_0})$. Since $C(K)$ is scale invariant, the optimal $K$ is the one that minimises $C(K)$ subject to
$$\int K(x)\,dx = 1, \qquad \int xK(x)\,dx = 0, \qquad \int x^2K(x)\,dx = a^2 < \infty, \qquad K(x) \ge 0 \text{ for all } x.$$
Hodges and Lehmann [4] showed that the solution can be written as
$$K^a(x) = \frac{3}{4}\,\frac{1}{5^{1/2}a}\Big\{1 - \frac{x^2}{5a^2}\Big\}1_{\{|x| < 5^{1/2}a\}}.$$
A special case is $a = 1/\sqrt{5}$ (the Epanechnikov kernel), which gives
$$K^*(x) = \frac{3}{4}(1-x^2)1_{\{|x|<1\}}.$$
The efficiency of the kernel $K$ relative to $K^*$ is defined as $\{C(K^*)/C(K)\}^{5/4}$. Table 1 shows that the efficiency changes very little across different kernel shapes. The form $K(x;p)$ in Table 1 refers to the family
$$K(x;p) = \{2^{2p+1}B(p+1,p+1)\}^{-1}(1-x^2)^p1_{\{|x|<1\}},$$
supported on $[-1,1]$, where $B(\cdot,\cdot)$ is the beta function.

Uniform kernels are not very popular in practice since they correspond to piecewise constant estimates. Even the Epanechnikov kernel is not so attractive, because the estimator has a discontinuous first derivative. In order to obtain admissible estimators, Cline [2] showed that the kernel should be symmetric and unimodal.
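The quantities $\delta_0$, $C(K)$ and the efficiency $\{C(K^*)/C(K)\}^{5/4}$ are simple functions of $R(K)$ and $\mu_2(K)$, so they can be computed directly. The sketch below assumes the standard closed-form values of $R(K)$ and $\mu_2(K)$ for six common kernels; the dictionary `KERNELS` and the function names are illustrative choices, not taken from the notes.

```python
import math

# (R(K), mu_2(K)) for some standard kernels, using their usual closed forms.
KERNELS = {
    "Epanechnikov": (3.0 / 5.0, 1.0 / 5.0),
    "Biweight": (5.0 / 7.0, 1.0 / 7.0),
    "Triweight": (350.0 / 429.0, 1.0 / 9.0),
    "Normal": (1.0 / (2.0 * math.sqrt(math.pi)), 1.0),
    "Triangular": (2.0 / 3.0, 1.0 / 6.0),
    "Uniform": (1.0 / 2.0, 1.0 / 3.0),
}

def delta0(R, mu2):
    """Canonical rescaling factor delta_0 = {R(K) / mu_2(K)^2}^(1/5)."""
    return (R / mu2**2) ** 0.2

def C(R, mu2):
    """Scale-invariant constant C(K) = {R(K)^4 mu_2(K)^2}^(1/5)."""
    return (R**4 * mu2**2) ** 0.2

def efficiency(name):
    """Efficiency of the named kernel relative to the Epanechnikov kernel."""
    return (C(*KERNELS["Epanechnikov"]) / C(*KERNELS[name])) ** 1.25

for name in KERNELS:
    print(f"{name:12s} delta0 = {delta0(*KERNELS[name]):.4f}  "
          f"efficiency = {efficiency(name):.3f}")
```

Because $C(K)$ is scale invariant, the efficiencies do not depend on how each kernel happens to be parameterised.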

Table 1: Efficiencies of several kernels compared to the optimal kernel.

| Kernel | Form | $\{C(K^*)/C(K)\}^{5/4}$ |
|---|---|---|
| Epanechnikov | $K(x;1)$ | 1.000 |
| Biweight | $K(x;2)$ | 0.994 |
| Triweight | $K(x;3)$ | 0.987 |
| Normal | $K(x;\infty)$ | 0.951 |
| Triangular | | 0.986 |
| Uniform | $K(x;0)$ | 0.930 |

10 Higher-order kernels

10.1 Why higher-order kernels?

We know that the best obtainable rate of convergence of the kernel estimator is of order $n^{-4/5}$. If we relax the condition that $K$ must be a density, the convergence rate can be faster. For example, recall that the asymptotic bias is given by
$$E\hat f(x;h) - f(x) = \frac{1}{2}h^2\mu_2(K)f''(x) + o(h^2).$$
If we set $\mu_2(K) = 0$, the MSE and MISE will have an optimal convergence rate of order $n^{-8/9}$.

10.2 What are higher-order kernels?

We insist that $K$ be symmetric, and say that $K$ is a $k$-th order kernel if
$$\mu_0(K) = 1, \qquad \mu_j(K) = 0 \text{ for } j = 1, \ldots, k-1, \qquad \mu_k(K) \ne 0.$$

10.3 How to get higher-order kernels?

One way to generate higher-order kernels is recursively from lower-order kernels:
$$K_{[k+2]}(x) = \frac{3}{2}K_{[k]}(x) + \frac{1}{2}xK'_{[k]}(x).$$
For example, set $K_{[2]}(x) = \phi(x)$; then $K_{[4]}(x) = \frac{1}{2}(3-x^2)\phi(x)$.

Another way was developed for the case where $f$ is a normal mixture density, using a certain class of higher-order kernels, see Marron and Wand [7]:
$$G_{[k]}(x) = \sum_{l=0}^{k/2-1}\frac{(-1)^l}{2^l\,l!}\phi^{(2l)}(x), \qquad k = 2, 4, \ldots$$

10.4 Misc topics about higher-order kernels

The convergence rate can be made arbitrarily close to the parametric rate $n^{-1}$ as the order increases, which means that higher-order kernel estimators will eventually dominate second-order kernel estimators for large $n$. However, this requires a larger sample size ($K_{[4]}$ would require several thousand observations in order to reduce the MISE compared to the normal kernel). Another price to be paid for higher-order kernels is that the negative contributions of the kernel may make the estimated density not a density itself.
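The recursion $K_{[k+2]}(x) = \frac{3}{2}K_{[k]}(x) + \frac{1}{2}xK'_{[k]}(x)$ and the moment conditions of a fourth-order kernel can be verified numerically. This is a minimal sketch starting from $K_{[2]} = \phi$; the helper names `next_order` and `moment` are illustrative, and the moments are approximated with a plain trapezoidal rule on a truncated range.

```python
import math

def phi(x):
    """Standard normal density, used as the second-order kernel K_[2]."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def dphi(x):
    """Derivative of the standard normal density."""
    return -x * phi(x)

def next_order(K, dK):
    """Return K_[k+2](x) = (3/2) K_[k](x) + (1/2) x K_[k]'(x)."""
    return lambda x: 1.5 * K(x) + 0.5 * x * dK(x)

def moment(K, j, lim=12.0, step=1e-3):
    """mu_j(K) = integral of x^j K(x) dx, trapezoidal rule on [-lim, lim]."""
    m = int(round(2 * lim / step))
    total = 0.0
    for i in range(m + 1):
        x = -lim + i * step
        weight = 0.5 if i in (0, m) else 1.0
        total += weight * x**j * K(x)
    return total * step

K4 = next_order(phi, dphi)                       # fourth-order kernel K_[4]
moments = [moment(K4, j) for j in range(5)]      # mu_0, ..., mu_4
print(["%.6f" % m for m in moments])
```

The first nonzero moment beyond $\mu_0$ is $\mu_4$, and it is negative, which reflects the negative lobes that allow the estimate to dip below zero.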

An extreme case of higher-order kernels is given by the "infinite"-order kernels, such as the sinc kernel $K(x) = \sin x/(\pi x)$. The sinc kernel estimator suffers from the same drawbacks as other higher-order kernel estimators, and its good asymptotic performance is not guaranteed to carry over to finite sample sizes in practice.

11 Measuring how difficult a density is to estimate

Recall that, for $K$ a symmetric probability density function,
$$\inf_{h>0}\mathrm{MISE}\{\hat f(\cdot;h)\} \sim \frac{5}{4}C(K)R(f'')^{1/5}n^{-4/5},$$
so the magnitude of $R(f'')$ tells us how well $f$ can be estimated even when $h$ is chosen optimally. Two points should be noted. First, we cannot set $R(f'') = 0$, since that would require $f'$ to be constant over the real line, which is impossible for a density. Second, $R(f'')$ is not scale invariant. To appreciate this, suppose $X$ has the density $f_X$, and $Y = X/a$ has density $f_Y(x) = af_X(ax)$ for some positive $a$. Then $R(f_Y'') = a^5R(f_X'')$, but
$$D(f) = \{\sigma(f)^5R(f'')\}^{1/4}$$
is scale invariant, where $\sigma(f)$ is the population standard deviation of $f$. The inclusion of the power $1/4$ allows an equivalent sample size interpretation, as was done for the comparison of kernel shapes.

One result is that $D(f)$ attains its minimum at the Beta(4, 4) shape,
$$f^*(x) = \frac{35}{32}(1-x^2)^31_{\{|x|<1\}}.$$
A more general result is that densities close to normality appear to be easier for the kernel estimator to estimate. The degree of estimation difficulty increases with skewness, kurtosis, and multimodality, and the uniform density is difficult to estimate at its boundaries.

12 Modification of kernel density estimator

12.1 Local kernel density estimators

One modification is to change the bandwidth $h$ along the $x$-axis,
$$\hat f_L(x;h(x)) = \{nh(x)\}^{-1}\sum_{i=1}^{n}K\{(x-X_i)/h(x)\}.$$
The optimal $h$ for the asymptotic MSE at $x$ is
$$h_{\mathrm{AMSE}}(x) = \Big[\frac{R(K)f(x)}{\mu_2(K)^2f''(x)^2n}\Big]^{1/5},$$
provided $f''(x) \ne 0$. When $f''(x) = 0$, additional terms should be taken into account, see Schucany [8].

If this $h_{\mathrm{AMSE}}(x)$ is chosen for every $x$, then the corresponding value of $\mathrm{AMISE}\{\hat f_L(\cdot;h_{\mathrm{AMSE}}(\cdot))\}$ can be shown to be
$$\frac{5}{4}\{\mu_2(K)^2R(K)^4\}^{1/5}R\big((f^2f'')^{1/5}\big)n^{-4/5}.$$
Although the rate $n^{-4/5}$ shows no improvement, the constant is never larger, because
$$R\big((f^2f'')^{1/5}\big) \le R(f'')^{1/5}$$
holds for all $f$.
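The local bandwidth $h_{\mathrm{AMSE}}(x)$ and the local estimator $\hat f_L$ can be sketched as follows. This is only an illustration under strong assumptions: the true density $f$ is taken to be the standard normal (so that $f$ and $f''$ are known, which is never the case in practice), the kernel is Gaussian, and the names `h_amse` and `local_kde` are hypothetical.

```python
import math

def phi(x):
    """Standard normal density; used both as kernel K and as the assumed true f."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

R_K = 1.0 / (2.0 * math.sqrt(math.pi))   # R(K) for the Gaussian kernel
MU2_K = 1.0                               # mu_2(K) for the Gaussian kernel

def h_amse(x, n):
    """h_AMSE(x) = [R(K) f(x) / (mu_2(K)^2 f''(x)^2 n)]^(1/5), f standard normal."""
    f = phi(x)
    f2 = (x * x - 1.0) * phi(x)           # f''(x) for the standard normal
    if f2 == 0.0:
        raise ValueError("f''(x) = 0: the leading-term formula does not apply")
    return (R_K * f / (MU2_K**2 * f2**2 * n)) ** 0.2

def local_kde(x, data, h_of_x):
    """f_L(x; h(x)) = {n h(x)}^(-1) sum_i K{(x - X_i) / h(x)}."""
    h = h_of_x(x)
    return sum(phi((x - xi) / h) for xi in data) / (len(data) * h)

n = 1000                                   # illustrative sample size
for x in (0.0, 0.5, 2.0, 3.0):
    print(x, h_amse(x, n))
```

At $x = 0$ the formula reduces to $(\sqrt{2}\,n)^{-1/5}$ under these assumptions, and the bandwidth grows in the tails, where the density is flatter.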

The sample size efficiency of the local estimator $\hat f_L$ depends on the magnitude of the ratio
$$\big\{R\big((f^2f'')^{1/5}\big)/R(f'')^{1/5}\big\}^{5/4}.$$
There are other approaches, such as the nearest neighbor density estimator, see Loftsgaarden and Quesenberry [5]. It uses the distance from $x$ to the data point that is the $k$th nearest to $x$ (for some suitable $k$) in a pilot estimation step that is essentially equivalent to $h(x) \propto 1/f(x)$. However, $1/f(x)$ is usually not a satisfactory surrogate for the optimal $\{f(x)/f''(x)^2\}^{1/5}$.

12.2 Variable kernel density estimators

Instead of a single $h$ or a function $h(x)$, the idea here is to use $n$ values $\alpha(X_i)$, $i = 1, \ldots, n$. The estimator has the form
$$\hat f_V(x;\alpha) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{\alpha(X_i)}K\Big(\frac{x-X_i}{\alpha(X_i)}\Big).$$
Therefore, the kernel centered on $X_i$ has its own scale parameter $\alpha(X_i)$, allowing different degrees of smoothing depending on the location. Intuition suggests that each $\alpha(X_i)$ should depend on the true density in roughly an inverse way. A theoretical result by Abramson [1] shows that taking $\alpha(X_i) = hf^{-1/2}(X_i)$ is a particularly good choice, with which one can achieve a bias of order $h^4$ instead of $h^2$. The optimal MSE can be improved from order $n^{-4/5}$ to $n^{-8/9}$ without producing any negativity, which is the price paid when using a fourth-order kernel. Pilot estimation to obtain $\alpha(X_i)$ is necessary for a practical implementation of $\hat f_V(\cdot;\alpha)$.

12.3 Transformation kernel density estimator

If the random sample $X_i$ has a density $f$ that is difficult to estimate, we can apply a transformation such that $Y_i$ has a density $g$ which is easier to estimate. One can then "backtransform" the estimate of $g$ to obtain the estimate of $f$. Suppose that $Y_i = t(X_i)$, where $t$ is an increasing differentiable function defined on the support of $f$. Then $f(x) = g(t(x))t'(x)$, and the density can be estimated by
$$\hat f_T(x;h,t) = \frac{1}{n}\sum_{i=1}^{n}K_h\{t(x)-t(X_i)\}\,t'(x).$$
Applying the mean value theorem,
$$\hat f_T(x;h,t) = \frac{1}{n}\sum_{i=1}^{n}\frac{t'(x)}{h}K\Big(\frac{t'(\xi_i)(x-X_i)}{h}\Big),$$
where $\xi_i$ lies between $x$ and $X_i$. We can see that this estimator lies somewhere between $\hat f_L$ and $\hat f_V$.

The best choice of the transformation $t$ depends quite heavily on the shape of $f$. If $f$ is a skewed unimodal density, then $t$ is suggested to be a convex function on the support of $f$ in order to reduce the skewness of $f$ in some sense. If $f$ is close to being symmetric, but has a high amount of kurtosis, then $t$ should be concave to the left and convex to the right of the center of symmetry of $f$.

One approach is to apply the following family, called the shifted power family, to heavily skewed data:
$$t(x;\lambda_1,\lambda_2) = \begin{cases}(x+\lambda_1)^{\lambda_2}\,\mathrm{sign}(\lambda_2), & \lambda_2 \ne 0,\\ \ln(x+\lambda_1), & \lambda_2 = 0,\end{cases}$$
where $\lambda_1 > -\min(X)$.

Another approach is to estimate $t$ nonparametrically. If $F$ and $G$ are the c.d.f.s of the p.d.f.s $f$ and $g$, then $Y = G^{-1}(F(X))$ has density $g$. One could choose a $G$ that is easy to estimate, and take $t = G^{-1}(\hat F(\cdot))$.
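The transformation estimator $\hat f_T$ with the shifted power family can be sketched as below. This is a minimal illustration assuming a Gaussian kernel and fixed, user-supplied values of $\lambda_1$, $\lambda_2$ and $h$ (their selection is not addressed); the function names are hypothetical.

```python
import math

def phi(x):
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def shifted_power(x, lam1, lam2):
    """t(x; lam1, lam2) from the shifted power family (requires x + lam1 > 0)."""
    if lam2 == 0:
        return math.log(x + lam1)
    return math.copysign(1.0, lam2) * (x + lam1) ** lam2

def shifted_power_deriv(x, lam1, lam2):
    """t'(x; lam1, lam2); positive for every lam2, so t is increasing."""
    if lam2 == 0:
        return 1.0 / (x + lam1)
    return abs(lam2) * (x + lam1) ** (lam2 - 1.0)

def transformation_kde(x, data, h, lam1, lam2):
    """f_T(x; h, t) = n^(-1) sum_i K_h{t(x) - t(X_i)} t'(x)."""
    tx = shifted_power(x, lam1, lam2)
    total = sum(phi((tx - shifted_power(xi, lam1, lam2)) / h) / h for xi in data)
    return total / len(data) * shifted_power_deriv(x, lam1, lam2)

# Illustrative right-skewed sample; the log transform corresponds to lam2 = 0.
sample = [0.5, 1.0, 2.0, 4.0, 8.0]
for x in (0.5, 1.0, 3.0, 10.0):
    print(x, transformation_kde(x, sample, h=0.5, lam1=0.0, lam2=0))
```

With $\lambda_2 = 1$ the transformation is a pure shift, and the estimator reduces to the ordinary fixed-bandwidth kernel estimator, which gives a convenient sanity check.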

13 Density estimation at boundaries

It becomes difficult to estimate the density at the boundary. Suppose $f$ is a density such that $f(x) = 0$ for $x < 0$ and $f(x) > 0$ for $x \ge 0$, and $f$ is continuous away from $x = 0$. Also, let $K$ be a kernel with support confined to $[-1,1]$. For $x > 0$,
$$E\hat f(x;h) = \int_{-1}^{x/h}K(z)f(x-hz)\,dz.$$
For $x \ge h$ the upper limit can be replaced by 1 and the usual bias expansion applies; for $0 \le x < h$ the kernel is truncated, and for a symmetric $K$ we get $E\hat f(0;h) \to f(0)/2$ as $h \to 0$.

Be aware that we still have a large bias at the boundary, but the performance is greatly improved.

14 Density derivative estimation

A natural estimator of the $r$-th derivative $f^{(r)}(x)$ is
$$\hat f^{(r)}(x;h) = \frac{1}{nh^{r+1}}\sum_{i=1}^{n}K^{(r)}\Big(\frac{x-X_i}{h}\Big),$$
provided $K$ is sufficiently differentiable. The MSE can be shown to be
$$\mathrm{MSE}\{\hat f^{(r)}(x;h)\} = \frac{1}{nh^{2r+1}}R(K^{(r)})f(x) + \frac{1}{4}h^4\mu_2(K)^2f^{(r+2)}(x)^2 + o\{(nh^{2r+1})^{-1} + h^4\}.$$
It follows that the MSE-optimal bandwidth for estimating $f^{(r)}(x)$ is of order $n^{-1/(2r+5)}$. Therefore, the estimation of $f'(x)$ requires a bandwidth of order $n^{-1/7}$, compared to the optimal $n^{-1/5}$ for estimating $f$ itself. This reveals the increasing difficulty of estimating higher derivatives.
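The derivative estimator for $r = 1$ can be sketched with a Gaussian kernel, for which $K'(u) = -u\phi(u)$. This is a minimal illustration; the function names and the sample are hypothetical, and `bandwidth_order` returns only the rate $n^{-1/(2r+5)}$ with the constant omitted.

```python
import math

def phi(x):
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    """Ordinary kernel density estimate f(x; h)."""
    return sum(phi((x - xi) / h) for xi in data) / (len(data) * h)

def kde_first_derivative(x, data, h):
    """f^(1)(x; h) = (n h^2)^(-1) sum_i K'((x - X_i)/h), with K'(u) = -u phi(u)."""
    total = 0.0
    for xi in data:
        u = (x - xi) / h
        total += -u * phi(u)
    return total / (len(data) * h * h)

def bandwidth_order(n, r):
    """Rate n^(-1/(2r+5)) of the MSE-optimal bandwidth (constant omitted)."""
    return n ** (-1.0 / (2 * r + 5))

sample = [-1.3, -0.4, 0.0, 0.2, 0.9, 1.7]   # illustrative data
x0, h = 0.5, 0.6
eps = 1e-5
finite_diff = (kde(x0 + eps, sample, h) - kde(x0 - eps, sample, h)) / (2.0 * eps)
print(kde_first_derivative(x0, sample, h), finite_diff)
print(bandwidth_order(1000, 0), bandwidth_order(1000, 1))
```

The estimator is exactly the derivative of the ordinary kernel estimate, which is why it agrees with the finite difference; the slower rate for $r = 1$ shows that more smoothing is needed for derivatives.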

References

[1] I.S. Abramson. On bandwidth variation in kernel estimates - a square root law. Annals of Statistics, 9:168–76, 1982.
[2] D.B.H. Cline. Admissible kernel estimators of a multivariate density. Annals of Statistics, 16:1421–7, 1988.
[3] K.B. Davis. Mean square error properties of density estimates. Annals of Statistics, 75:1025–30, 1975.
[4] J.L. Hodges and E.L. Lehmann. The efficiency of some nonparametric competitors to the t-test. Annals of Mathematical Statistics, 13:324–35, 1956.
[5] D.O. Loftsgaarden and C.P. Quesenberry. A nonparametric density estimate of a multivariate density function. Annals of Mathematical Statistics, 36:1049–51, 1965.
[6] J.S. Marron and D. Nolan. Canonical kernel for density estimation. Statist. Probab. Lett., 7:195–9, 1989.
[7] J.S. Marron and M.P. Wand. Exact mean integrated squared error. Annals of Statistics, 20:712–36, 1992.
[8] W.R. Schucany. Locally optimal window width for kernel density estimation with large samples. Statist. Probab. Lett., 7:401–5, 1989.
[9] D.W. Scott. On optimal and data-based histograms. Biometrika, 66:605–10, 1979.
[10] D.W. Scott. Average shifted histograms: effective nonparametric density estimators in several dimensions. Annals of Statistics, 13:1024–40, 1985.