$\tau_{-p} = \tau_{-p+1} = \cdots = \tau_{-1} = \tau_0$ and another set of $p$ knots $\tau_{K+1} = \tau_{K+2} = \cdots = \tau_{K+p+1}$.
The B-spline basis functions are defined as
$$N_{j,1}(x) = \begin{cases} 1, & \tau_j \le x < \tau_{j+1},\\ 0, & \text{otherwise}, \end{cases} \qquad N_{j,p+1}(x) = \frac{x-\tau_j}{\tau_{j+p}-\tau_j}\,N_{j,p}(x) + \frac{\tau_{j+p+1}-x}{\tau_{j+p+1}-\tau_{j+1}}\,N_{j+1,p}(x),$$
for $j = -p, \ldots, K$. Thereby the convention $0/0 = 0$ is used. With the use of the additional knots, this gives precisely $K + p + 1$ independent basis functions.
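To make the construction concrete, here is a minimal numerical sketch (ours, not taken from the paper) of the recursion above; the function name bspline_design and all parameter values are illustrative assumptions. It builds the clamped knot sequence with the $p$ additional coincident knots at each end, and checks that exactly $K+p+1$ basis functions result and that they sum to one.

```python
import numpy as np

def bspline_design(x, inner_knots, a, b, p):
    """n x (K+p+1) matrix of N_{j,p+1}(x_i), j = -p, ..., K, for x in [a, b)."""
    # knot sequence tau_{-p}, ..., tau_{K+p+1} with p+1 coincident knots at a and at b
    tau = np.concatenate([np.full(p + 1, a), inner_knots, np.full(p + 1, b)])
    # order 1 (degree 0): indicator functions of [tau_j, tau_{j+1})
    N = np.array([(tau[j] <= x) & (x < tau[j + 1])
                  for j in range(len(tau) - 1)], dtype=float)
    for k in range(2, p + 2):                      # raise the order up to p + 1
        N_new = np.zeros((N.shape[0] - 1, len(x)))
        for j in range(N.shape[0] - 1):
            # the two terms of the recursion, with the 0/0 := 0 convention
            left = (x - tau[j]) / (tau[j + k - 1] - tau[j]) if tau[j + k - 1] > tau[j] else 0.0
            right = (tau[j + k] - x) / (tau[j + k] - tau[j + 1]) if tau[j + k] > tau[j + 1] else 0.0
            N_new[j] = left * N[j] + right * N[j + 1]
        N = N_new
    return N.T                                     # columns indexed j = -p, ..., K

K, p = 10, 3
x = np.linspace(0.0, 1.0, 200, endpoint=False)
B = bspline_design(x, np.linspace(0, 1, K + 2)[1:-1], 0.0, 1.0, p)
assert B.shape[1] == K + p + 1                     # K + p + 1 independent basis functions
assert np.allclose(B.sum(axis=1), 1.0)             # B-splines form a partition of unity
```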
We define the P-spline estimator as the minimizer of
$$\sum_{i=1}^n \Big\{Y_i - \sum_{j=-p}^K \beta_j N_{j,p+1}(x_i)\Big\}^2 + \lambda \int_a^b \Big[\Big\{\sum_{j=-p}^K \beta_j N_{j,p+1}(x)\Big\}^{(q)}\Big]^2 dx, \qquad (2)$$
where the sum of squared differences is penalized with the integrated squared $q$th order derivative of the spline function, which is assumed to be finite. Since the $(p+1)$st derivative of a spline function of order $p+1$ contains Dirac delta functions (see also Section 2.2), it is a natural condition to have $q \le p$. However, we give a separate treatment to the case of truncated polynomial basis functions where $q = p+1$, see Section 2.2 and the end of Section 4. The penalty constant $\lambda$ plays the role of a smoothing parameter. Note that letting $\lambda \to 0$ implies an unpenalized estimate, while $\lambda \to \infty$ forces convergence of the $q$th derivative of the spline function to zero, with the consequence that the limiting estimator is a $(q-1)$th degree polynomial. The derivative formula for B-spline functions, as given in de Boor (2001, ch. X), states that
$$\Big\{\sum_{j=-p}^K \beta_j N_{j,p+1}(x)\Big\}^{(q)} = \sum_{j=-p+q}^{K} N_{j,p+1-q}(x)\,\beta_j^{(q)},$$
where the coefficients $\beta_j^{(q)}$ are defined recursively via
$$\beta_j^{(1)} = p\,\frac{\beta_j - \beta_{j-1}}{\tau_{j+p} - \tau_j}, \qquad \beta_j^{(q)} = (p+1-q)\,\frac{\beta_j^{(q-1)} - \beta_{j-1}^{(q-1)}}{\tau_{j+p+1-q} - \tau_j}, \quad q = 2, 3, \ldots. \qquad (3)$$
Thus we can rewrite the penalty term in (2) as $\lambda\,\beta^t\Delta_q^t R\,\Delta_q\beta$, where the matrix $R$ has elements $R_{ij} = \int_a^b N_{j,p+1-q}(x)N_{i,p+1-q}(x)\,dx$, for $i, j = -p+q, \ldots, K$, and $\Delta_q$ denotes the matrix corresponding to the weighted difference operator defined in (3), i.e. $\beta^{(q)} = \Delta_q\beta$. Note that for the special case of equidistant knots, i.e. $\tau_j - \tau_{j-1} = \delta$ for any $j = 1, \ldots, K+1$, there is an explicit expression of the matrix $\Delta_q$ in terms of the $q$th backward difference operator $\nabla_q$. This latter matrix is defined recursively via $\nabla_q\beta = \nabla_1(\nabla_{q-1}\beta)$, $(\nabla_1\beta)_j = \beta_j - \beta_{j-1}$. For equidistant knots it holds that $\Delta_q = \delta^{-q}\nabla_q$. For the special case $p = q$, the matrix $R$ reduces to a diagonal matrix with diagonal elements $(\tau_j - \tau_{j-1})$, which further simplify to $\delta$ for equidistant knots. Another special case, $p = q+1$, results in a tridiagonal matrix $R$, with the integrals of the squared linear (order equal to 2) B-splines on the main diagonal.
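The recursion (3) and the backward difference representation are easy to check numerically; the following sketch (an illustration under our own naming conventions, not code from the paper) builds $\Delta_q$ from an arbitrary knot sequence and verifies $\Delta_q = \delta^{-q}\nabla_q$ for equidistant knots.

```python
import numpy as np

def Delta_q(tau_full, p, q):
    """Weighted difference matrix with beta^{(q)} = Delta_q @ beta, per (3).
    tau_full = (tau_{-p}, ..., tau_{K+p+1}); beta has K + p + 1 entries."""
    m = len(tau_full) - p - 1                      # K + p + 1 coefficients
    D = np.eye(m)
    for r in range(1, q + 1):                      # one difference step per order
        step = np.zeros((m - r, m - r + 1))
        for i in range(m - r):
            # beta_j^{(r)} = (p+1-r)(beta_j^{(r-1)} - beta_{j-1}^{(r-1)})/(tau_{j+p+1-r} - tau_j)
            h = tau_full[i + p + 1] - tau_full[i + r]
            step[i, i], step[i, i + 1] = -(p + 1 - r) / h, (p + 1 - r) / h
        D = step @ D
    return D

p, q, K, delta = 3, 2, 7, 0.1
tau = delta * np.arange(-p, K + p + 2)             # equidistant knots
nabla_q = np.diff(np.eye(K + p + 1), q, axis=0)    # q-th backward difference matrix
assert np.allclose(Delta_q(tau, p, q), nabla_q / delta ** q)   # Delta_q = delta^{-q} nabla_q
```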
Further, define the spline basis vector of dimension $1 \times (K+p+1)$ as $N(x) = \{N_{-p,p+1}(x), \ldots, N_{K,p+1}(x)\}$, the $n \times (K+p+1)$ spline design matrix $N = \{N(x_1)^t, \ldots, N(x_n)^t\}^t$, and let $D_q = \Delta_q^t R\,\Delta_q$. With this notation the P-spline estimator takes the form of a ridge regression estimator
$$\hat f = N(N^tN + \lambda D_q)^{-1}N^tY, \qquad (4)$$
where $\hat f = \{\hat f(x_1), \ldots, \hat f(x_n)\}^t$ and $Y = (Y_1, \ldots, Y_n)^t$.
Originally, Eilers and Marx (1996) simplified (4) by suggesting to use equidistant knots and a combination of cubic splines ($p = 3$) and a second order penalty ($q = 2$). Moreover, they only took into account the diagonal elements of $R$, resulting in the simpler penalty matrix $c\,\delta^{-4}\nabla_2^t\nabla_2$, with $c = \int_a^b \{N_{j,2}(x)\}^2 dx$. Since $c$ is a constant, it can be absorbed into the penalty constant $\lambda$. For our theoretical investigation we use the general definition of the penalty matrix $D_q$.
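As a small illustration of the ridge form (4) with this simplification, the sketch below fits a P-spline with equidistant knots, $p = 3$ and $q = 2$, absorbing the constant $c\,\delta^{-4}$ into $\lambda$. This is our own minimal example, assuming scipy $\ge$ 1.8 for BSpline.design_matrix; the data-generating model and all parameter values are invented for the illustration.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(1)
n, K, p, q, lam = 200, 35, 3, 2, 1.0
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

# clamped knot sequence tau_{-p}, ..., tau_{K+p+1} on [0, 1]
tau = np.concatenate([np.zeros(p), np.linspace(0, 1, K + 2), np.ones(p)])
N = BSpline.design_matrix(x, tau, p).toarray()        # n x (K+p+1) design matrix
nabla = np.diff(np.eye(K + p + 1), q, axis=0)         # q-th order differences
D = nabla.T @ nabla              # Eilers-Marx penalty; c * delta^{-4} absorbed into lam
beta = np.linalg.solve(N.T @ N + lam * D, N.T @ y)    # ridge normal equations
f_hat = N @ beta                 # the P-spline fit of (4)
```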
2.2 Penalized splines using truncated polynomial basis functions
Ruppert and Carroll (2000) used truncated polynomials as basis functions. In particular, with truncated polynomials of degree $p$ based on $K$ inner knots $a < \tau_1 < \ldots < \tau_K < b$, the penalized spline estimator is defined as the solution to the penalized least squares criterion
$$\sum_{i=1}^n \{Y_i - F(x_i)\beta\}^2 + \lambda_p\sum_{j=1}^K \beta_{j+p}^2,$$
with $F(x) = \{1, x, \ldots, x^p, (x-\tau_1)_+^p, \ldots, (x-\tau_K)_+^p\}$ and $\beta = (\beta_0, \ldots, \beta_{K+p})^t$, so that
$$F(x)\beta = \beta_0 + \beta_1x + \ldots + \beta_px^p + \sum_{k=1}^K \beta_{k+p}(x-\tau_k)_+^p.$$
The resulting estimator is a ridge regression estimator given by
$$\hat f_p = F(F^tF + \lambda_p\widetilde D_p)^{-1}F^tY, \qquad (5)$$
where $F = \{F(x_1)^t, \ldots, F(x_n)^t\}^t$ and $\widetilde D_p$ is the diagonal matrix $\mathrm{diag}(0_{p+1}, 1_K)$, indicating that only the spline coefficients are penalized. Note that $\lambda_p \to \infty$ results in the $p$th degree polynomial limiting fit.
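A matching sketch for the truncated polynomial estimator (5) follows; again, the data and all parameter values are only illustrative assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, p, lam_p = 200, 20, 2, 0.1
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
knots = np.linspace(0, 1, K + 2)[1:-1]                 # K inner knots

# F(x_i) = {1, x_i, ..., x_i^p, (x_i - tau_1)_+^p, ..., (x_i - tau_K)_+^p}
F = np.hstack([x[:, None] ** np.arange(p + 1),
               np.maximum(x[:, None] - knots[None, :], 0.0) ** p])
D_p = np.diag(np.r_[np.zeros(p + 1), np.ones(K)])      # diag(0_{p+1}, 1_K)
beta = np.linalg.solve(F.T @ F + lam_p * D_p, F.T @ y)
f_hat_p = F @ beta                                     # estimator (5)
```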
The ridge penalty imposed on the spline coefficients can also be viewed as a penalty containing the integrated squared $(p+1)$th derivative of the spline function, where the $(p+1)$th derivative is a generalized function. Indeed,
$$\{F(x)\beta\}^{(p)} = p!\,\beta_p + p!\sum_{j=1}^K \beta_{j+p}\,I_{[\tau_j,\infty)}(x).$$
Since the derivative of an indicator function is a Dirac delta function, one finds that
$$\int_a^b \big[\{F(x)\beta\}^{(p+1)}\big]^2dx = (p!)^2\sum_{j=1}^K \beta_{j+p}^2.$$
Truncated polynomials and B-splines are directly connected, which can be seen from the alternative definition of B-splines as appropriately scaled $(p+1)$th order divided differences of the truncated polynomials (see de Boor, 2001),
$$N_{j,p+1}(x) = (-1)^{p+1}(\tau_{j+p+1} - \tau_j)\,[\tau_j, \ldots, \tau_{j+p+1}](x-\cdot)_+^p, \quad j = -p, \ldots, K, \qquad (6)$$
where $[\tau_j, \ldots, \tau_{j+p+1}](x-\cdot)_+^p$ denotes the $(p+1)$th order divided difference of $(x-\cdot)_+^p$ as a function of the knots for fixed $x$. In case of equidistant knots, (6) simplifies to $N_{j,p+1}(x) = (-1)^{p+1}\nabla_{p+1}(x-\tau_{j+p+1})_+^p/(\delta^p\,p!)$. B-spline and truncated polynomial basis functions span the same set of spline functions (de Boor, 2001, ch. IX). Thus there exist coefficient vectors $\beta$ and $\gamma$ such that $N\beta = F\gamma$, implying equivalence of the unpenalized estimators. In other words, there exists a square and invertible transition matrix $L$ (see further), such that $N = FL$.
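Numerically, the transition matrix $L$ can be recovered by regressing the B-spline basis on the truncated one; the following check (a sketch under our own assumptions, using scipy's BSpline.design_matrix) confirms $N = FL$ with an invertible $L$.

```python
import numpy as np
from scipy.interpolate import BSpline

n, K, p = 400, 8, 2
x = np.linspace(0, 1, n, endpoint=False)
knots = np.linspace(0, 1, K + 2)[1:-1]

F = np.hstack([x[:, None] ** np.arange(p + 1),
               np.maximum(x[:, None] - knots[None, :], 0.0) ** p])
tau = np.concatenate([np.zeros(p), np.linspace(0, 1, K + 2), np.ones(p)])
N = BSpline.design_matrix(x, tau, p).toarray()

L, *_ = np.linalg.lstsq(F, N, rcond=None)      # (K+p+1) x (K+p+1) transition matrix
assert np.allclose(F @ L, N, atol=1e-6)        # N = F L on the design points
assert np.linalg.matrix_rank(L) == K + p + 1   # L is square and invertible
```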
The equivalence of the penalized spline estimators $\hat f$ and $\hat f_p$ is not automatic, but will follow when there is equality of the penalties. We work out the case of fitting with B-splines and obtaining the same penalized estimator as $\hat f_p$ in (5) with $\widetilde D_p$ as penalty matrix. Using the equality $N = FL$ for the penalized estimator $\hat f_p$ implies that we can write it as $\hat f_p = N(N^tN + \lambda_pL^t\widetilde D_pL)^{-1}N^tY$. Thus, fitting with B-splines yields an estimator equivalent to $\hat f_p$ if we use the penalty term $\lambda_pL^t\widetilde D_pL$ instead of $\lambda D_q$. This penalty matrix can be explicitly obtained for equidistant knots. By writing $\{N(x)\beta\}^{(p)} = \sum_{j=0}^K N_{j,1}(x)\beta_j^{(p)} = \sum_{j=1}^K I_{[\tau_j,\infty)}(x)(\beta_j^{(p)} - \beta_{j-1}^{(p)})$ we find that
$$\int_a^b \big[\{N(x)\beta\}^{(p+1)}\big]^2dx = \sum_{j=1}^K (\beta_j^{(p)} - \beta_{j-1}^{(p)})^2.$$
We formally generalize (3) to $\beta_j^{(p+1)} = (\beta_j^{(p)} - \beta_{j-1}^{(p)})/\delta$ and obtain that
$$(p!)^2\,\beta^tL^t\widetilde D_pL\beta = \sum_{j=1}^K (\delta\beta_j^{(p+1)})^2 = \delta^{-2p}\,\beta^t\nabla_{p+1}^t\nabla_{p+1}\beta.$$
Thus, in this case, for equivalence of the estimators, the penalty matrix using B-splines should be $L^t\widetilde D_pL = \delta^{-2p}\nabla_{p+1}^t\nabla_{p+1}/(p!)^2$. In addition, these calculations provide a method to obtain the transformation matrix $L$. In general, also for unequispaced knots, $L$ can be found from the equation
$$\beta^tL^t\widetilde D_pL\beta = \sum_{j=1}^K (\beta_j^{(p)} - \beta_{j-1}^{(p)})^2/(p!)^2.$$
2.3 Regression splines
An unpenalized estimator with $\lambda = 0$ in (4) is referred to as a regression spline estimator. More precisely, the regression spline estimator of order $(p+1)$ for $f(x)$ is the minimizer of
$$\sum_{i=1}^n \{Y_i - \hat f_{reg}(x_i)\}^2 = \min_{s(x)\in S(p+1;\tau)}\,\sum_{i=1}^n \{Y_i - s(x_i)\}^2,$$
where
$$S(p+1;\tau) = \big\{s(\cdot) \in C^{p-1}[a,b]:\ s \text{ is a $p$ degree polynomial on each } [\tau_j, \tau_{j+1}]\big\}, \quad p > 0,$$
is the set of spline functions of degree $p$ with knots $\tau = \{a = \tau_0 < \tau_1 < \ldots < \tau_K < \tau_{K+1} = b\}$ and $S(1,\tau)$ is the set of step functions with jumps at the knots. Since $N_{j,p+1}(\cdot)$, $j = -p, \ldots, K$, form a basis for $S(p+1,\tau)$ (see Schumaker, 1981), $\hat f_{reg}(x) = N(x)(N^tN)^{-1}N^tY \in S(p+1,\tau)$. Further we denote by $s_f(\cdot) = N(\cdot)\beta^* \in S(p+1,\tau)$ the best $L_2$ approximation of $f$, and we make the following assumptions.

(A1) Let $\delta = \max_{0\le j\le K}(\tau_{j+1} - \tau_j)$. There exists a constant $M > 0$, such that $\delta/\min_{0\le j\le K}(\tau_{j+1} - \tau_j) \le M$ and $\delta = o(K^{-1})$.
(A2) For deterministic design points $x_i \in [a,b]$, $i = 1, \ldots, n$, assume that there exists a distribution function $Q$ with corresponding positive continuous design density $\rho$ such that, with $Q_n$ the empirical distribution function of $x_1, \ldots, x_n$, $\sup_{x\in[a,b]}|Q_n(x) - Q(x)| = o(K^{-1})$.
(A3) The number of knots K = o(n).
Zhou et al. (1998) obtained the approximation bias and variance of the regression spline estimator as
$$E\{\hat f_{reg}(x)\} - f(x) = b_a(x) + o(\delta^{p+1}), \qquad (7)$$
$$\mathrm{Var}\{\hat f_{reg}(x)\} = \frac{\sigma^2}{n}N(x)G^{-1}N^t(x) + o\{(n\delta)^{-1}\}, \qquad (8)$$
where $G = \int_a^b N(x)^tN(x)\rho(x)\,dx$ and the approximation bias is
$$b_a(x) = -\frac{f^{(p+1)}(x)}{(p+1)!}\sum_{j=0}^K I_{[\tau_j,\tau_{j+1})}(x)\,(\tau_{j+1}-\tau_j)^{p+1}\,B_{p+1}\Big(\frac{x-\tau_j}{\tau_{j+1}-\tau_j}\Big), \qquad (9)$$
with $B_{p+1}(\cdot)$ denoting the $(p+1)$th Bernoulli polynomial (see Abramowitz and Stegun, 1972, p. 804).
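For reference, the approximation bias (9) can be evaluated directly; the sketch below uses our own helper functions, with $f(x) = \sin(2\pi x)$, $p = 3$ and equidistant knots chosen only for illustration, and builds $B_{p+1}$ from the Bernoulli numbers returned by scipy.special.bernoulli.

```python
import numpy as np
from math import comb, factorial
from scipy.special import bernoulli

def bernoulli_poly(m, t):
    """Bernoulli polynomial B_m(t) = sum_k C(m,k) B_k t^{m-k}."""
    B = bernoulli(m)                         # Bernoulli numbers B_0, ..., B_m
    return sum(comb(m, k) * B[k] * t ** (m - k) for k in range(m + 1))

def b_a(x, f_deriv, knots, p):
    """Approximation bias (9); f_deriv is the (p+1)th derivative of f."""
    j = np.clip(np.searchsorted(knots, x, side='right') - 1, 0, len(knots) - 2)
    h = knots[j + 1] - knots[j]
    return -f_deriv(x) / factorial(p + 1) * h ** (p + 1) \
        * bernoulli_poly(p + 1, (x - knots[j]) / h)

knots = np.linspace(0, 1, 21)                # tau_0, ..., tau_{K+1} with K = 19
x = np.linspace(0, 1, 500, endpoint=False)
# f(x) = sin(2 pi x) and p = 3, so f^{(4)}(x) = (2 pi)^4 sin(2 pi x)
bias = b_a(x, lambda u: (2 * np.pi) ** 4 * np.sin(2 * np.pi * u), knots, 3)
```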
We will come back to these results in Section 4 where they will be used for the case
of penalized regression splines.
3 Average mean squared error of a penalized spline estimator
In this section we investigate the average mean squared error (AMSE) of a penalized spline estimator and discuss the optimal choice of the smoothing parameter $\lambda$ and the number of knots $K$.
With the Demmler–Reinsch decomposition (Demmler and Reinsch, 1975) the average bias and variance can be expressed in terms of the eigenvalues obtained from the singular value decomposition
$$(N^tN)^{-1/2}D_q(N^tN)^{-1/2} = U\,\mathrm{diag}(s)\,U^t, \qquad (10)$$
where $U$ is the matrix of eigenvectors and $s$ is the vector of eigenvalues $s_j$. Denote $A = N(N^tN)^{-1/2}U$. This matrix is semi-orthogonal with $A^tA = I_{K+p+1}$ and $AA^t = N(N^tN)^{-1}N^t$. We can rewrite the penalized spline estimator (4) as
$$\hat f = A\{I_{K+p+1} + \lambda\,\mathrm{diag}(s)\}^{-1}A^tY \qquad (11)$$
$$= \{I_n + \lambda A\,\mathrm{diag}(s)A^t\}^{-1}AA^tY = \{I_n + \lambda A\,\mathrm{diag}(s)A^t\}^{-1}\hat f_{reg}. \qquad (12)$$
Equation (12) clearly shows the shrinkage effect of including the penalty term. Equality (11) provides an expression that is straightforward to use to obtain the average mean squared error
$$\mathrm{AMSE}(\hat f) = \frac{1}{n}E\{(\hat f - f)^t(\hat f - f)\} = \frac{\sigma^2}{n}\sum_{j=1}^{K+p+1}\frac{1}{(1+\lambda s_j)^2} + \frac{\lambda^2}{n}\sum_{j=1}^{K+p+1}\frac{s_j^2b_j^2}{(1+\lambda s_j)^2} + \frac{1}{n}f^t(I_n - AA^t)f,$$
where $f = \{f(x_1), \ldots, f(x_n)\}^t$ and $b = A^tf$ with components $b_j$. Since $AA^t$ is an idempotent matrix and $AA^tf = E(\hat f_{reg})$, we obtain that
$$\mathrm{AMSE}(\hat f) = \frac{\sigma^2}{n}\sum_{j=1}^{K+p+1}\frac{1}{(1+\lambda s_j)^2} + \frac{\lambda^2}{n}\sum_{j=1}^{K+p+1}\frac{s_j^2b_j^2}{(1+\lambda s_j)^2} + \frac{1}{n}\sum_{j=1}^n\big[E\{\hat f_{reg}(x_j)\} - f(x_j)\big]^2.$$
The first term in this equation is the average variance, the second term is the average squared shrinkage bias and the last component is the average squared approximation bias, which can be obtained from (7).
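The decomposition (10)-(12) is straightforward to reproduce; in the sketch below (our own construction, reusing the simplified difference penalty of Section 2.1 with invented data) the first $q$ eigenvalues vanish, as stated in Lemma 1 below, and (11) gives the fit.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(3)
n, K, p, q, lam = 300, 20, 3, 2, 5.0
x = np.sort(rng.uniform(0, 1, n))
Y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
tau = np.concatenate([np.zeros(p), np.linspace(0, 1, K + 2), np.ones(p)])
N = BSpline.design_matrix(x, tau, p).toarray()
nabla = np.diff(np.eye(K + p + 1), q, axis=0)
Dq = nabla.T @ nabla                             # simplified penalty matrix

w, V = np.linalg.eigh(N.T @ N)
M_half_inv = V @ np.diag(w ** -0.5) @ V.T        # (N^t N)^{-1/2}
s, U = np.linalg.eigh(M_half_inv @ Dq @ M_half_inv)   # decomposition (10)
A = N @ M_half_inv @ U                           # semi-orthogonal: A^t A = I
assert np.allclose(A.T @ A, np.eye(K + p + 1), atol=1e-8)
assert np.sum(s < 1e-9) == q                     # s_1 = ... = s_q = 0 numerically
f_hat = A @ (A.T @ Y / (1 + lam * s))            # representation (11)
```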
We now study the optimal orders of the smoothing parameter $\lambda$ and of the number of knots $K$. A study of asymptotic properties of spline estimators via eigenvalues goes back to at least Utreras (1980), see also Utreras (1981, 1983). Speckman (1981, 1985) extended these results, and we use a version of them below. Lemma 1 is adopted from Speckman (1985, eqn. 2.5d), see also Eubank (1999, p. 237).
Lemma 1 Under design condition (A2) and for the eigenvalues obtained in (10),
$$s_1 = \cdots = s_q = 0, \qquad s_j = n^{-1}(j-q)^{2q}c_1\{1+o(1)\}, \quad j = q+1, \ldots, K+p+1,$$
where $c_1$ is a constant that depends only on $q$ and the design density.
With a slightly different assumption on the design density, namely that the design is regular in the sense that $\int_a^{x_i}\rho(x)\,dx = (2i-1)/(2n)$ for $i = 1, \ldots, n$, Speckman (1985) obtained the exact expression of the constant as $c_1 = \pi^{2q}\big(\int_a^b\rho(x)^{1/(2q)}dx\big)^{-2q}$. Although originally obtained for smoothing splines, this result remains valid in this situation with $K < n$.
Using
$$\sum_{j=1}^{K+p+1}\frac{1}{(1+\lambda s_j)^2} = q + \sum_{j=q+1}^{K+p+1}\frac{1}{\{1+\lambda n^{-1}c_1(j-q)^{2q}\}^2} = \Big(\frac{n}{\lambda\tilde c_1}\Big)^{1/(2q)}\int_0^{K_q}\frac{du}{(1+u^{2q})^2} + q - 1 + r_q,$$
with
$$K_q = (K+p+1-q)(\lambda\tilde c_1)^{1/(2q)}n^{-1/(2q)}, \qquad (13)$$
$\tilde c_1 = c_1\{1+o(1)\}$ and $r_q = O(1)$ as the remainder term of the Euler–Maclaurin formula, we can rewrite
$$\mathrm{AMSE}(\hat f) = \frac{\sigma^2}{n}\Big(\frac{n}{\lambda\tilde c_1}\Big)^{1/(2q)}\int_0^{K_q}\frac{du}{(1+u^{2q})^2} + \frac{\lambda^2}{n}\sum_{j=q+1}^{K+p+1}\frac{b_j^2s_j^2}{(1+\lambda s_j)^2} + \frac{1}{n}\sum_{i=1}^n\{b_a(x_i)\}^2 + o(\delta^{2(p+1)}) + \frac{\sigma^2(q-1+r_q)}{n}. \qquad (14)$$
This expression is exact, not a bound. Depending on whether $K_q$ in (13) is smaller or larger than one (and thus depending on $K$, $\lambda$ and $q$), we obtain a different value for the integral in (14) and thus the following results.
Theorem 1 Under assumptions (A1)–(A3) and for $p \le 2q - 1$ it holds:

(a) If $K_q < 1$,
$$\mathrm{AMSE}(\hat f) = \frac{\sigma^2}{n}c_2(K+p+r_q) + \frac{\lambda^2}{n}c_3\,\beta^{*t}D_q(N^tN)^{-1}D_q\beta^* + o(\lambda^2n^{-2}\delta^{-2q}) + \frac{1}{n}\sum_{i=1}^n\{b_a(x_i)\}^2 + o(\delta^{2(p+1)})$$
$$= O\Big(\frac{K}{n}\Big) + O\Big(\frac{\lambda^2}{n^2}K^{2q}\Big) + O(K^{-2(p+1)}),$$
for $c_2 = c_2(K_q, q) \in (0.6, 1]$ for any $K_q \in (0, 1)$ and $q > 0$, and some constant $c_3 \in [1/4, 1]$. For $K = O(n^{1/(2p+3)})$ and $\lambda = O(n^\nu)$ with $\nu \le (p+2-q)/(2p+3)$ one gets $\mathrm{AMSE}(\hat f) = O(n^{-(2p+2)/(2p+3)})$.
(b) If $K_q > 1$,
$$\mathrm{AMSE}(\hat f) \le \frac{\sigma^2}{n}\Big\{q - 1 + r_q + \big(c_4 - K_q^{1-4q}c_5\big)\Big(\frac{n}{\lambda\tilde c_1}\Big)^{1/(2q)}\Big\} + \frac{\lambda}{4n}\beta^{*t}D_q\beta^* + o(\lambda n^{-1}) + \frac{1}{n}\sum_{i=1}^n\{b_a(x_i)\}^2 + o(\delta^{2(p+1)})$$
$$= O\big(n^{1/(2q)-1}\lambda^{-1/(2q)}\big) + O\Big(\frac{\lambda}{n}\Big) + O(K^{-2(p+1)}),$$
for constants $c_4 = \int_0^\infty(1+u^{2q})^{-2}du$ and $c_5 = c_5(K_q, q) \in (0, 1/3]$. For $\lambda = O(n^{1/(2q+1)})$ and $K = O(n^\gamma)$ with $\gamma > 1/(2q+1)$ one gets $\mathrm{AMSE}(\hat f) = O(n^{-2q/(1+2q)})$.
Note that for the case $K_q = 1$ there is a discontinuity in the calculation of the integral in (14) as part of the average variance (see the proof of this theorem), leading to a clear separation between the two cases. Case (a) with $K_q < 1$ results in an asymptotic scenario similar to that of regression splines. The AMSE is determined by the average asymptotic variance and the average approximation bias. The average shrinkage bias becomes negligible for small $\lambda$, that is, for $\nu < (p+2-q)/(2p+3)$. The asymptotically optimal number of knots has the same order as that for regression splines, that is, $K = O(n^{1/(2p+3)})$. Case (b) with $K_q > 1$ results in an asymptotic scenario close to that of smoothing splines. The AMSE is dominated by the average asymptotic variance and shrinkage bias, while the average approximation bias is negligible for the chosen asymptotic order of the number of knots $K$. The asymptotic order of the AMSE depends only on the order of the penalty $q$.

The bound on the average mean squared error obtained in case (b) of Theorem 1 is precisely the same as known from smoothing spline theory, up to the average squared approximation bias, which is negligible for $K_q > 1$. It is well known that the AMSE of a smoothing spline estimator is dominated by the bias at the boundaries, and the optimal rate of convergence cannot be reached unless boundary conditions are imposed on the true function (e.g. Speckman and Sun, 2001). If the boundary conditions are satisfied, the asymptotic order of the average squared shrinkage bias is found to be $O(\lambda^2n^{-2})$, implying that the smoothing parameter $\lambda = O(n^{(2q+1)/(4q+1)})$ and $\mathrm{AMSE}(\hat f) = O(n^{-4q/(4q+1)})$, which is optimal. However, in our general setting for penalized splines no boundary conditions on the true function are imposed, leading to inevitable domination of boundary effects in the AMSE, so that the optimal convergence rate cannot be reached.

The result of Theorem 1, saying that the AMSE is smaller (and optimal) for the case $K_q < 1$ ($K$ small) than for $K_q > 1$ ($K$ large), supports the simulation study of Ruppert (2002), who found examples where having too many knots degrades the performance of the spline estimator. This seems to be the first rigorous proof of those empirical findings.
4 Asymptotic bias and variance
4.1 Penalized splines with a small number of knots
In the case of a penalized spline estimator $\hat f(x)$ with a relatively small number of knots, determined by $K_q < 1$, we are able to derive expressions for the pointwise asymptotic bias and variance. First we relate the penalized spline estimator (4) with $q \le p$ to a regression spline estimator. We define $G_{K,n} = N^tN/n$ and make the following assumption.

(A4) The eigenvalues of $\lambda n^{-1}G_{K,n}^{-1}D_q$ are less than 1.
This allows us to relate regression and penalized spline estimators using a series expansion around $\lambda = 0$. We start with
$$\hat f(x) = N(x)\Big(\frac{1}{n}N^tN + \frac{\lambda}{n}D_q\Big)^{-1}\frac{1}{n}N^tY$$
$$= N(x)\Big\{I_{K+p+1} - \frac{\lambda}{n}G_{K,n}^{-1}D_q + \Big(\frac{\lambda}{n}G_{K,n}^{-1}D_q\Big)^2 - \ldots\Big\}G_{K,n}^{-1}\frac{1}{n}N^tY \qquad (15)$$
$$= \hat f_{reg}(x) - \frac{\lambda}{n}N(x)G_{K,n}^{-1}D_qG_{K,n}^{-1}\frac{1}{n}N^tY + r_n. \qquad (16)$$
The first term in (16) is the regression spline estimator and the second term gives the difference between the penalized and unpenalized (regression spline) estimators, which also contributes to the bias and variance. Assumption (A4) ensures convergence of the series in (15). Note that (A4) is equivalent to the assumption $K_q < 1$ in case (a) of Theorem 1, since $K_q^{2q}$ is the maximum eigenvalue of the matrix $\lambda n^{-1}G_{K,n}^{-1/2}D_qG_{K,n}^{-1/2}$, which is similar to $\lambda n^{-1}G_{K,n}^{-1}D_q$ and thus has the same eigenvalues.
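This equivalence can be checked directly: the largest eigenvalue of $\lambda n^{-1}G_{K,n}^{-1}D_q$ equals $K_q^{2q}$, so (A4) amounts to $K_q < 1$. A sketch (our own construction with illustrative values, reusing the simplified penalty of the earlier sketches):

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(4)
n, K, p, q, lam = 500, 10, 3, 2, 0.5
x = np.sort(rng.uniform(0, 1, n))
tau = np.concatenate([np.zeros(p), np.linspace(0, 1, K + 2), np.ones(p)])
N = BSpline.design_matrix(x, tau, p).toarray()
nabla = np.diff(np.eye(K + p + 1), q, axis=0)
Dq = nabla.T @ nabla                         # simplified penalty, as in Section 2.1
G_Kn = N.T @ N / n

eigs = np.linalg.eigvals(lam / n * np.linalg.solve(G_Kn, Dq))
Kq_2q = eigs.real.max()                      # equals K_q^{2q}
print("max eigenvalue:", Kq_2q, "-> (A4) holds:", Kq_2q < 1)
```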
From this expansion we obtain the following result.
Theorem 2 Suppose $f(\cdot) \in C^{p+1}[a,b]$, assumptions (A1)–(A4) hold and $p \le 2q - 1$. Then,
$$E\{\hat f(x)\} - f(x) = b_a(x) + b_\lambda(x) + o(\delta^{p+1}) + o(\lambda n^{-1}\delta^{-q}), \qquad (17)$$
$$\mathrm{Var}\{\hat f(x)\} = \frac{\sigma^2}{n}N(x)G^{-1}N^t(x) - c_7\frac{\sigma^2\lambda}{n^2}N(x)G^{-1}D_qG^{-1}N^t(x) + o\{(n\delta)^{-1}\} + o(\lambda n^{-2}\delta^{-(2q+1)}), \qquad (18)$$
with penalization or shrinkage bias
$$b_\lambda(x) = -\frac{\lambda}{n}c_6N(x)G^{-1}D_q\beta^* = -\frac{\lambda}{n}c_6N(x)G^{-1}\Delta_q^tWs_f^{(q)}(\xi),$$
where $c_6 \in [1/2, 3/2]$ and $c_7 \in [3/4, 2]$ are constants, $W = \mathrm{diag}\big\{\sum_{l=j}^{j+p-q}\int_{\tau_l}^{\tau_{l+1}}N_{j,p+1-q}(t)\,dt\big\}$ and $s_f^{(q)}(\xi) = \{s_f^{(q)}(\xi_{-p+q}), \ldots, s_f^{(q)}(\xi_K)\}^t$ with some $\xi_j \in [\tau_j, \tau_{j+p+1-q}]$, $j = -p+q, \ldots, K$.
Thus, apart from the approximation bias $b_a(x)$, the asymptotic bias of a penalized spline estimator has an additional component, the shrinkage bias $b_\lambda(x)$. In the case $p = q = 1$ it equals
$$b_\lambda(x) = -\frac{\lambda}{n}c_6\,s_f^{(1)}\sum_{j=0}^K I_{[\tau_j,\tau_{j+1})}(x)\Big[(\tau_{j+1}-x)\big\{(G^{-1})_{j+1,1} + (G^{-1})_{j+1,K+2}\big\} + (x-\tau_j)\big\{(G^{-1})_{j+2,1} + (G^{-1})_{j+2,K+2}\big\}\Big],$$
Figure 1: True function $\sin(2\pi x)$ (solid) and its estimate (bold) based on 35 equidistant knots and $p = q = 2$ (left), and the shrinkage bias (right).
where $s_f^{(1)}(x) = s_f^{(1)}$ is a constant for $s_f(\cdot) \in S(2, \tau)$. Since $|(G^{-1})_{k,l}| = \gamma^{|k-l|}O(\delta^{-1})$ for some $\gamma \in (0, 1)$ (see Lemma 6.3 in Zhou et al., 1998), the elements $(G^{-1})_{j,1}$ decrease exponentially with growing $j$, while the $(G^{-1})_{j,K+2}$ increase with growing $j$. Thus, for $j$ close to $[K/2]$, both $(G^{-1})_{j,K+2}$ and $(G^{-1})_{j,1}$ are small, implying that $b_\lambda(x)$ is negligible in the interior of $[a, b]$, so that the shrinkage bias is essentially a boundary effect, see Figure 1. Since
$$E\{\hat f(x)\} - f(x) = O(\delta^{p+1}) + O(\lambda n^{-1}\delta^{-q}), \qquad \mathrm{Var}\{\hat f(x)\} = O\{(n\delta)^{-1}\} - O(\lambda n^{-2}\delta^{-(2q+1)}),$$
we find for $K = O(n^{1/(2p+3)})$ and $\lambda = O(n^\nu)$ with $\nu \le (p+2-q)/(2p+3)$ that for $\nu = (p+2-q)/(2p+3)$ both bias components $b_a$ and $b_\lambda$ are of the same asymptotic order. For the truncated polynomial basis, replacing $D_q$ by the penalty matrix $\delta^{-2p}\nabla_{p+1}^t\nabla_{p+1}/(p!)^2$ from Section 2.2 and modifying (A4) accordingly, in full analogy we obtain that
$$E\{\hat f_p(x)\} - f(x) = b_a(x) - \frac{\lambda_p\delta^{-p}}{(p!)^2n}c_6N(x)G^{-1}\nabla_{p+1}^ts_f^{(p+1)}(\xi) + o(\delta^{p+1}) + o(\lambda_pn^{-1}\delta^{-p}) = O(\delta^{p+1}) + O(\lambda_pn^{-1}\delta^{-p}), \qquad (19)$$
$$\mathrm{Var}\{\hat f_p(x)\} = \frac{\sigma^2}{n}N(x)G^{-1}N^t(x) - \frac{\lambda_p\delta^{-2p}\sigma^2}{(p!\,n)^2}c_7N(x)G^{-1}\nabla_{p+1}^t\nabla_{p+1}G^{-1}N^t(x) + o\{(n\delta)^{-1}\} + o(\lambda_pn^{-2}\delta^{-(2p+2)}) = O\{(n\delta)^{-1}\} + O(\lambda_pn^{-2}\delta^{-(2p+2)}), \qquad (20)$$
where $s_f^{(p+1)}(\xi) = \{s_f^{(p)}(\xi_1), s_f^{(p)}(\xi_2) - s_f^{(p)}(\xi_1), \ldots, s_f^{(p)}(\xi_K) - s_f^{(p)}(\xi_{K-1})\}^t$. It follows that taking $K = O(n^{1/(2p+3)})$ and $\lambda_p = O(n^\nu)$ with $\nu \le 2/(2p+3)$ leads to the optimal rate of convergence. Moreover, in this setting not only are the approximation and shrinkage biases balanced, but both components of the asymptotic variance are also of the same asymptotic order. More details on the asymptotics of a generalized penalized spline estimator with truncated polynomial basis functions are available in Kauermann et al. (2007).
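The boundary concentration of the shrinkage bias (cf. Figure 1) can be reproduced without any asymptotics, since $E\{\hat f\} - E\{\hat f_{reg}\}$ is available in closed form; the following sketch (our own construction, with $p = q = 2$ and 35 equidistant knots as in the figure, and the simplified penalty of the earlier sketches) computes it exactly for a given $f$.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(5)
n, K, p, q, lam = 500, 35, 2, 2, 50.0
x = np.sort(rng.uniform(0, 1, n))
f = np.sin(2 * np.pi * x)                            # true function of Figure 1
tau = np.concatenate([np.zeros(p), np.linspace(0, 1, K + 2), np.ones(p)])
N = BSpline.design_matrix(x, tau, p).toarray()
nabla = np.diff(np.eye(K + p + 1), q, axis=0)
Dq = nabla.T @ nabla

E_f_pen = N @ np.linalg.solve(N.T @ N + lam * Dq, N.T @ f)   # E f_hat
E_f_reg = N @ np.linalg.solve(N.T @ N, N.T @ f)              # E f_hat_reg
shrinkage_bias = E_f_pen - E_f_reg     # expected to concentrate near x = 0 and x = 1
```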
4.2 Penalized splines with a large number of knots
A penalized spline estimator with a relatively large number of knots, determined by $K_q > 1$, behaves asymptotically similarly to a smoothing spline estimator. Thus, a study of the asymptotic bias and variance of a penalized spline estimator in this case can be carried out through derivation of an equivalent kernel, similar to the classical results on smoothing splines (see e.g. Eubank, 1999). We consider this task beyond the scope of our paper, and plan to address this issue in a separate work. However, we discuss here the available results of Li and Ruppert (2008), who studied the asymptotic properties of a penalized spline estimator by deriving an equivalent kernel.

Note that the general setting of Li and Ruppert (2008) differs from the one adopted in the current paper. Using B-spline basis functions of degree $p$ and equidistant knots, Li and Ruppert (2008) employ a simplified version of the penalty $D_q$, namely just $\Delta_q^t\Delta_q$. Also, the spline degree and the penalty order are decoupled, so that cases with $q > p$ are possible, in contrast to our setting with $q \le p$. Moreover, the true function is assumed to be $2q$ times continuously differentiable, compared to $f(\cdot) \in C^{p+1}$ adopted in our paper. However, the case of linear splines penalized with a first order penalty ($p = q = 1$) fits into both settings, the one of Li and Ruppert (2008) as well as that of the current paper, making a comparison between both results possible.
For interior points, Li and Ruppert (2008) obtain the asymptotic orders $K = O(n^\gamma)$, $\gamma > 1/5$, and $\lambda_n = O(n^{2\gamma-2/5})$, where $\lambda_n$ is connected with the smoothing parameter $\lambda$ corresponding to our setting through $\lambda_n = \lambda K^2/n$, resulting in $\lambda = O(n^{3/5})$. Clearly, $\gamma > 1/5$ implies $K_q = K_1 = (K+p)(\lambda\tilde c_1/n)^{1/2} = O(n^\gamma n^{-1/5}) > 1$ for $n$ sufficiently large. The optimal rates for $K$ and $\lambda$ of Li and Ruppert (2008) are given for the non-boundary region and differ from those obtained in case (b) of Theorem 1, since the average shrinkage bias (and thus the AMSE) is dominated by the boundary effects.

Appendix

The following results are used in the proofs.

(R1) $\|G^{-1}\| = O(\delta^{-1})$, $\|G_{K,n}^{-1} - G^{-1}\| = o(\delta^{-1})$ and $\|G_{K,n} - G\| = o(\delta)$ (Lemmas 6.3 and 6.4 in Zhou et al., 1998).
(R2) $\|D_q\| = O(\delta^{1-2q})$ (Lemma 6.1 in Cardot, 2000).

(R3) Under (A1)–(A3) it holds that $\max_{-p\le j\le K}\int_a^b N_{j,p+1}(u)\{f(u) - s_f(u)\}\,dQ_n(u) = o(\delta^{p+2})$ (Lemma 6.10 in Agarwal and Studden, 1980) and thus
$$E\{\hat f_{reg}(x) - s_f(x)\} = N(x)G_{K,n}^{-1}\frac{1}{n}N^t(f - s_f) = o(\delta^{p+1}),$$
with $f = \{f(x_1), \ldots, f(x_n)\}^t$ and $s_f = \{s_f(x_1), \ldots, s_f(x_n)\}^t$.
Lemma 2 $\max_{1\le i,j\le K}\big|(G_{K,n}^{-1}D_qG_{K,n}^{-1})_{i,j} - (G^{-1}D_qG^{-1})_{i,j}\big| = o(\delta^{-(2q+1)})$.
Proof. Rewriting
$$G_{K,n}^{-1}D_qG_{K,n}^{-1} - G^{-1}D_qG^{-1} = G_{K,n}^{-1}D_q(G_{K,n}^{-1} - G^{-1}) + G^{-1}D_q(G_{K,n}^{-1} - G^{-1}) + G_{K,n}^{-1}D_qG^{-1} - G^{-1}D_qG_{K,n}^{-1}$$
and using (R1)–(R2) we obtain
$$G_{K,n}^{-1}D_qG^{-1} - G^{-1}D_qG_{K,n}^{-1} = G_{K,n}^{-1}\big(D_qG^{-1} - G_{K,n}G^{-1}D_qG_{K,n}^{-1}\big) = G_{K,n}^{-1}\big\{D_qG^{-1} - (G_{K,n} - G)G^{-1}D_qG_{K,n}^{-1} - D_qG_{K,n}^{-1}\big\}$$
$$= G_{K,n}^{-1}\big\{D_q(G^{-1} - G_{K,n}^{-1}) + o(\delta^{-2q})\big\} = o(\delta^{-(2q+1)}),$$
from which the result follows.
Lemma 3 For any $K+p+1$ dimensional vectors $v$ and $w$ such that $v^tw \ge 0$ it holds
$$0 \le v^t(N^tN)^{-1}D_qw \le \frac{\tilde c_1}{n}(K+p+1-q)^{2q}\,v^tw.$$
Proof. Since the matrices $(N^tN)^{-1}D_q$ and $(N^tN)^{-1/2}D_q(N^tN)^{-1/2}$ are similar, they have the same eigenvalues. Thus, $(N^tN)^{-1}D_q = \widetilde U\,\mathrm{diag}(s)\,\widetilde U^t$, with $\widetilde U$ a matrix of eigenvectors and where $s$ is defined in Lemma 1, so that $\min_{1\le j\le K+p+1}(s_j) = 0$ and $\max_{q+1\le j\le K+p+1}(s_j) = (K+p+1-q)^{2q}\tilde c_1n^{-1}$. Since $(N^tN)^{-1}D_q$ and $wv^t$ are positive semidefinite and $v^t(N^tN)^{-1}D_qw = \mathrm{tr}\{(N^tN)^{-1}D_qwv^t\} = \mathrm{tr}\{\mathrm{diag}(s)\widetilde U^twv^t\widetilde U\}$, the result follows from
$$0 \le \mathrm{tr}\{\mathrm{diag}(s)\widetilde U^twv^t\widetilde U\} \le \frac{\tilde c_1}{n}(K+p+1-q)^{2q}\,v^tw,$$
see Fang et al. (1994) or Xing et al. (2000).

If $v^tw \le 0$, the lemma leads to the following result:
$$0 \le |v^t(N^tN)^{-1}D_qw| \le \frac{\tilde c_1}{n}(K+p+1-q)^{2q}\,|v^tw|.$$
A.1 Proof of Theorem 1
Let us first consider case (a), that is, $K_q < 1$. Using the series expansion around zero $(1+x)^{-2} = \sum_{j=0}^\infty(-1)^j(j+1)x^j$ for $|x| < 1$ we easily find
$$\int_0^{K_q}\frac{du}{(1+u^{2q})^2} = K_q\sum_{j=0}^\infty(-1)^j(j+1)\frac{K_q^{2qj}}{2qj+1} = K_q\,{}_2F_1\Big(2, \frac{1}{2q}; \frac{1+2q}{2q}; -K_q^{2q}\Big) =: K_qc_2,$$
where ${}_2F_1\{2, 1/(2q); 1+1/(2q); -K_q^{2q}\}$ denotes the hypergeometric series (see Ch. 15 of Abramowitz and Stegun, 1972), which for any $K_q < 1$ and $q > 0$ converges to some $c_2 = c_2(K_q, q) \in (0.6, 1]$. With this we obtain that the average variance in case (a) has the asymptotic order $O(K/n)$.
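The hypergeometric identity is easy to confirm numerically; a sketch using scipy, with illustrative values of our choosing:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import hyp2f1

q, Kq = 2, 0.8                                    # illustrative values with K_q < 1
integral, _ = quad(lambda u: (1 + u ** (2 * q)) ** -2, 0, Kq)
c2 = hyp2f1(2, 1 / (2 * q), 1 + 1 / (2 * q), -Kq ** (2 * q))
assert np.isclose(integral, Kq * c2)              # matches K_q * c_2
print("c_2 =", c2)                                # lies in (0.6, 1] for K_q in (0, 1)
```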
Using the definitions of $b$ and $s$, and noting that $1 - \lambda s_j(2+\lambda s_j)/(1+\lambda s_j)^2 = (1+\lambda s_j)^{-2}$, we can represent the average squared shrinkage bias as
$$\frac{\lambda^2}{n}\sum_{j=q+1}^{K+p+1}b_j^2s_j^2\Big\{1 - \frac{\lambda s_j(2+\lambda s_j)}{(1+\lambda s_j)^2}\Big\} = \frac{\lambda^2}{n}\big\{\hat\beta_f^tD_q(N^tN)^{-1}D_q\hat\beta_f - r_b\big\},$$
where $r_b = \sum_{j=1}^\infty\lambda^j(-1)^{j+1}(j+1)\,\hat\beta_f^t\{D_q(N^tN)^{-1}\}^jD_q(N^tN)^{-1}D_q\hat\beta_f$ and $\hat\beta_f = (N^tN)^{-1}N^tf$. With Lemma 3 we can bound
$$0 \le r_b \le \frac{K_q^{2q}(2+K_q^{2q})}{(1+K_q^{2q})^2}\,\hat\beta_f^tD_q(N^tN)^{-1}D_q\hat\beta_f,$$
so that for $K_q < 1$ there exists a constant $c_3 \in [1/4, 1]$ such that the average squared shrinkage bias equals $\lambda^2c_3\,\hat\beta_f^tD_q(N^tN)^{-1}D_q\hat\beta_f/n$. Adding and subtracting $s_f$ from $f$ in $\hat\beta_f$ we find
$$\hat\beta_f^tD_q(N^tN)^{-1}D_q\hat\beta_f = \beta^{*t}D_q(N^tN)^{-1}D_q\beta^* + 2(f-s_f)^tN(N^tN)^{-1}D_q(N^tN)^{-1}D_q(N^tN)^{-1}N^ts_f$$
$$+ (f-s_f)^tN(N^tN)^{-1}D_q(N^tN)^{-1}D_q(N^tN)^{-1}N^t(f-s_f) = \beta^{*t}D_q(N^tN)^{-1}D_q\beta^* + o(\delta^{p+1-4q}n^{-1}) + o(\delta^{2p+2-4q}n^{-1}),$$
where (R3) and Lemma 3 were applied to obtain the orders of the two last terms. To find the asymptotic order of $\beta^{*t}D_q(N^tN)^{-1}D_q\beta^*$ we again apply Lemma 3 and obtain
$$\beta^{*t}D_q(N^tN)^{-1}D_q\beta^* \le \frac{\tilde c_1(K+p+1-q)^{2q}}{n}\,\beta^{*t}D_q\beta^* = \frac{\tilde c_1(K+p+1-q)^{2q}}{n}\int_a^b\big[\{N(x)\beta^*\}^{(q)}\big]^2dx.$$
Since the penalty was assumed to be finite (see below eqn. (2)) and $p \le 2q-1$, we obtain that
$$\frac{\lambda^2}{n}c_3\,\hat\beta_f^tD_q(N^tN)^{-1}D_q\hat\beta_f = \frac{\lambda^2}{n}c_3\,\beta^{*t}D_q(N^tN)^{-1}D_q\beta^* + o(\lambda^2n^{-2}\delta^{-2q}) = O\Big(\frac{\lambda^2}{n^2}K^{2q}\Big).$$
Finally, the average squared approximation bias has the asymptotic order $O(K^{-2(p+1)})$, as follows from (9). We now choose the orders of $K$ and $\lambda$ so that they ensure the best possible rate of convergence. As shown in Stone (1982), for a $p+1$ times continuously differentiable function the optimal rate of convergence is $n^{-(2p+2)/(2p+3)}$. It is straightforward to see that choosing $K = n^{1/(2p+3)}$ implies that the average variance and the average squared approximation bias have the same order $O(n^{-(2p+2)/(2p+3)})$. The shrinkage bias is controlled by the smoothing parameter $\lambda$. Choosing $\lambda = O(n^{(p+2-q)/(2p+3)})$ balances both bias components, while values of a smaller asymptotic order make the shrinkage bias negligible.
For values $K_q > 1$ we can find the integral in (14) as the difference
$$\int_0^\infty\frac{du}{(1+u^{2q})^2} - \int_{K_q}^\infty\frac{du}{(1+u^{2q})^2} =: c_4 - \int_{K_q}^\infty\frac{du}{(1+u^{2q})^2}.$$
Changing the integration variable to its reciprocal in the second integral and using the series expansion of $(1+x)^{-2}$ again, one easily obtains that
$$\int_{K_q}^\infty\frac{du}{(1+u^{2q})^2} = \sum_{j=0}^\infty(-1)^j(j+1)\frac{K_q^{-2q(j+2)+1}}{2q(j+2)-1} = K_q^{1-4q}\,{}_2F_1\Big(2, \frac{4q-1}{2q}; \frac{6q-1}{2q}; -K_q^{-2q}\Big)(4q-1)^{-1} =: K_q^{1-4q}c_5,$$
where ${}_2F_1\{2, (4q-1)(2q)^{-1}; (6q-1)(2q)^{-1}; -K_q^{-2q}\}$ is a converging hypergeometric series, with $c_5 = c_5(K_q, q) \in (0, 1/3]$ for any $K_q > 1$ and $q > 0$. Thus, for case (b) the average variance has the asymptotic order $O(n^{1/(2q)-1}\lambda^{-1/(2q)})$.
Since the function $x(1+x)^{-2} \le 1/4$ for any $x > 0$, one can bound the average squared shrinkage bias
$$\frac{\lambda}{n}\sum_{j=q+1}^{K+p+1}b_j^2s_j\frac{\lambda s_j}{(1+\lambda s_j)^2} \le \frac{\lambda}{4n}\sum_{j=q+1}^{K+p+1}b_j^2s_j = \frac{\lambda}{4n}\hat\beta_f^tD_q\hat\beta_f.$$
Again, adding and subtracting $s_f$ from $f$ in $\hat\beta_f$ we find $\hat\beta_f^tD_q\hat\beta_f = \beta^{*t}D_q\beta^* + o(\delta^{p+1-2q})$. Thus, the average squared shrinkage bias is of order $O(\lambda/n)$. It is straightforward to see that $\lambda = O(n^{1/(2q+1)})$ balances the average squared shrinkage bias and the average variance. For the condition $K_q > 1$ to be fulfilled, the number of knots should satisfy $K = O(n^\gamma)$, with $\gamma > 1/(2q+1)$. This implies that the average squared approximation bias is negligible, with order $O(n^{-(2p+2)/(2q+1)})$. Thus, $\mathrm{AMSE}(\hat f) = O(n^{-2q/(1+2q)})$.
A.2 Proof of Theorem 2
From (16) we obtain that
$$E\{\hat f(x)\} - f(x) = E\{\hat f_{reg}(x)\} - f(x) - \frac{\lambda}{n}N(x)G_{K,n}^{-1}D_qG_{K,n}^{-1}\frac{1}{n}N^tf + E(r_n).$$
Rewriting
$$E(r_n) = \sum_{j=1}^\infty(-1)^{j+1}\Big(\frac{\lambda}{n}\Big)^jN(x)(G_{K,n}^{-1}D_q)^j\,\frac{\lambda}{n}G_{K,n}^{-1}D_qG_{K,n}^{-1}\frac{1}{n}N^tf$$
and using Lemma 3 we find
$$|E(r_n)| \le \frac{K_q^{2q}}{1+K_q^{2q}}\,\frac{\lambda}{n}\Big|N(x)G_{K,n}^{-1}D_qG_{K,n}^{-1}\frac{1}{n}N^tf\Big|.$$
Since $K_q < 1$ according to assumption (A4), $K_q^{2q}/(1+K_q^{2q}) < 1/2$ and thus there exists some constant $c_6 \in [1/2, 3/2]$, such that
$$E\{\hat f(x)\} - f(x) = E\{\hat f_{reg}(x) - s_f(x)\} + \{s_f(x) - f(x)\} - \frac{\lambda}{n}c_6N(x)G_{K,n}^{-1}D_qG_{K,n}^{-1}\frac{1}{n}N^ts_f - \frac{\lambda}{n}c_6N(x)G_{K,n}^{-1}D_qG_{K,n}^{-1}\frac{1}{n}N^t(f - s_f).$$
The order of the first component is given by (R3). According to Barrow and Smith (1978) it holds that $s_f(x) - f(x) = b_a(x) + o(\delta^{p+1})$. With Lemma 3, assumption (A4) and result (R3) we obtain
$$\frac{\lambda}{n}c_6\Big|N(x)G_{K,n}^{-1}D_qG_{K,n}^{-1}\frac{1}{n}N^t(f - s_f)\Big| \le \tilde c_1\frac{\lambda}{n}c_6(K+p+1-q)^{2q}\,o(\delta^{p+1}) = o(\delta^{p+1}).$$
Thus,
$$E\{\hat f(x)\} - f(x) = b_a(x) - \frac{\lambda}{n}c_6N(x)G_{K,n}^{-1}D_q\beta^* + o(\delta^{p+1}),$$
with $\beta^* = G_{K,n}^{-1}N^ts_f/n = (N^tN)^{-1}N^ts_f$. Using the definition of the penalty matrix $D_q$ and noting that $s_f^{(q)}(x) = \{N(x)\beta^*\}^{(q)} = N_q(x)\Delta_q\beta^*$, with $N_q(x) = \{N_{-p+q,p+1-q}(x), \ldots, N_{K,p+1-q}(x)\}$, we can apply the mean value theorem and rewrite
$$\frac{\lambda}{n}c_6N(x)G_{K,n}^{-1}D_q\beta^* = \frac{\lambda}{n}c_6N(x)G_{K,n}^{-1}\Delta_q^t\int_a^bN_q^t(u)s_f^{(q)}(u)\,du = \frac{\lambda}{n}c_6N(x)G_{K,n}^{-1}\Delta_q^tWs_f^{(q)}(\xi),$$
where $W = \mathrm{diag}\big\{\sum_{l=j}^{j+p-q}\int_{\tau_l}^{\tau_{l+1}}N_{j,p+1-q}(u)\,du\big\}$ and $\xi = (\xi_{-p+q}, \ldots, \xi_K)^t$ with some $\xi_j \in [\tau_j, \tau_{j+p+1-q}]$, $j = -p+q, \ldots, K$.
Let us consider the shrinkage bias term in more detail. First, we decompose $\Delta_q^t = [\Delta_{q,1}, \Delta_{q,2}, \Delta_{q,3}]^t$ with $\Delta_{q,1}^t$ and $\Delta_{q,3}^t$ as $q \times (K+p+1-q)$ dimensional matrices and $\Delta_{q,2}^t$ as a $(K+p+1-2q) \times (K+p+1-q)$ matrix of weighted differences as defined in (3), so that $\Delta_q^tWs_f^{(q)}(\xi) = [\{\Delta_{q,1}^tWs_f^{(q)}(\xi)\}^t, \{\Delta_{q,2}^tWs_f^{(q)}(\xi)\}^t, \{\Delta_{q,3}^tWs_f^{(q)}(\xi)\}^t]^t$. Since for equidistant knots $\{W\}_{jj} = \delta$, a constant, for all $j$, and $\Delta_{q,2}^t$ is the matrix corresponding to a divided difference operator of order $q$, we can rewrite
$$\Delta_q^tWs_f^{(q)}(\xi) = \delta\,[\{\Delta_{q,1}^ts_f^{(q)}(\xi)\}^t, \{s_f^{(2q)}(\tilde\xi)/q!\}^t, \{\Delta_{q,3}^ts_f^{(q)}(\xi)\}^t]^t,$$
for some $\tilde\xi = (\tilde\xi_{-p+q}, \ldots, \tilde\xi_{K-q})$ with $\tilde\xi_j \in [\xi_j, \xi_{j+q}]$, $j = -p+q, \ldots, K-q$. Since $s_f(x)$ is a piecewise polynomial of degree $p$, $s_f^{(2q)}(x)$ disappears if $p \le 2q-1$. For $p > 2q-1$ we find $s_f^{(2q)}(x) \ne 0$ and thus the order of the shrinkage bias would increase with $\delta^{-1}$. Using assumption (A1) one can show that $\{\Delta_{q,2}^tWs_f^{(q)}(\xi)\}^t$ is asymptotically small for $p \le 2q-1$. These considerations justify the assumption on the relationship between $p$ and $q$ made in Theorem 1.
Further, since $N_{j,p+1-q}(\cdot) \le 1$, one finds $\|W\| = O(\delta)$ (see also Lemma 6.1 in Cardot, 2000). Thus, using (R1) and noting that $s_f^{(q)}(\xi) = O(1)$ and $s_f^{(2q)}(x) = 0$, we obtain the shrinkage bias $b_\lambda(x)$ as
$$-\frac{\lambda}{n}c_6N(x)G_{K,n}^{-1}\Delta_q^tWs_f^{(q)}(\xi) = -\frac{\lambda}{n}c_6N(x)G^{-1}\Delta_q^tWs_f^{(q)}(\xi) - \frac{\lambda}{n}c_6N(x)(G_{K,n}^{-1} - G^{-1})\Delta_q^tWs_f^{(q)}(\xi)$$
$$= -\frac{\lambda}{n}c_6N(x)G^{-1}\Delta_q^tWs_f^{(q)}(\xi) + o(\lambda n^{-1}\delta^{-q}) \overset{\mathrm{def}}{=} b_\lambda(x) + o(\lambda n^{-1}\delta^{-q}).$$
From (15) we find
$$\mathrm{Var}\{\hat f(x)\} = \frac{\sigma^2}{n}N(x)\Big\{I_{K+p+1} - \frac{\lambda}{n}G_{K,n}^{-1}D_q + \ldots\Big\}^2G_{K,n}^{-1}N^t(x)$$
$$= \mathrm{Var}\{\hat f_{reg}(x)\} - 2\frac{\sigma^2\lambda}{n^2}N(x)G_{K,n}^{-1}D_qG_{K,n}^{-1}N^t(x) + 3\frac{\sigma^2\lambda^2}{n^3}N(x)G_{K,n}^{-1}D_qG_{K,n}^{-1}D_qG_{K,n}^{-1}N^t(x) + \ldots,$$
with $\mathrm{Var}\{\hat f_{reg}(x)\}$ as given in (8). Applying Lemma 3 again, we can rewrite
$$\mathrm{Var}\{\hat f(x)\} = \mathrm{Var}\{\hat f_{reg}(x)\} - \frac{\sigma^2\lambda}{n^2}c_7N(x)G_{K,n}^{-1}D_qG_{K,n}^{-1}N^t(x),$$
with some $c_7 \in [3/4, 2]$. Finally, using Lemma 2 we obtain that
$$\Big|\frac{\sigma^2\lambda}{n^2}c_7N(x)G_{K,n}^{-1}D_qG_{K,n}^{-1}N^t(x) - \frac{\sigma^2\lambda}{n^2}c_7N(x)G^{-1}D_qG^{-1}N^t(x)\Big| \le \frac{\sigma^2\lambda}{n^2}c_7N(x)\big|G_{K,n}^{-1}D_qG_{K,n}^{-1} - G^{-1}D_qG^{-1}\big|N^t(x),$$
so that
$$\frac{\sigma^2\lambda}{n^2}c_7N(x)G_{K,n}^{-1}D_qG_{K,n}^{-1}N^t(x) = \frac{\sigma^2\lambda}{n^2}c_7N(x)G^{-1}D_qG^{-1}N^t(x) + o(\lambda n^{-2}\delta^{-(2q+1)}),$$
proving Theorem 2.
Acknowledgements
The authors wish to thank M. Jansen for helpful hints concerning some of the calculations.
References
Abramowitz, M. and Stegun, I. (1972). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York.
Agarwal, G. and Studden, W. (1980). Asymptotic integrated mean square error using least squares and bias minimizing splines. Ann. Statist., 8:1307–1325.
Barrow, D. L. and Smith, P. W. (1978). Asymptotic properties of best $L_2[0,1]$ approximation by splines with variable knots. Quart. Appl. Math., 36(3):293–304.
Brumback, B. A., Ruppert, D., and Wand, M. P. (1999). Comment on Shively, Kohn and Wood. J. Amer. Statist. Assoc., 94:794–797.
Cardot, H. (2000). Nonparametric estimation of smoothed principal components analysis of sampled noisy functions. J. Nonpar. Statist., 12:503–538.
Carter, C. K. and Kohn, R. (1996). Markov chain Monte Carlo in conditionally Gaussian state space models. Biometrika, 83(3):589–601.
Cox, D. D. (1983). Asymptotics for M-type smoothing splines. Ann. Statist., 11(2):530–551.
Craven, P. and Wahba, G. (1978). Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math., 31(4):377–403.
de Boor, C. (2001). A Practical Guide to Splines. Springer, New York. Revised edition.
Demmler, A. and Reinsch, C. (1975). Oscillation matrices with spline smoothing. Numer. Math., 24(5):375–382.
Eilers, P. H. C. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Stat. Science, 11(2):89–121. With comments and a rejoinder by the authors.
Eubank, R. L. (1999). Nonparametric regression and spline smoothing, volume 157 of Statistics: Textbooks and Monographs. Marcel Dekker Inc., New York, second edition.
Fang, Y., Loparo, K., and Feng, X. (1994). Inequalities for the trace of matrix product. IEEE Trans. Automat. Control, 39(12):2489–2490.
Gasser, T. and Müller, H. (1984). Estimating regression functions and their derivatives by the kernel method. Scand. J. Statist., 11:171–185.
Green, P. J. and Silverman, B. W. (1994). Nonparametric regression and generalized linear models: A roughness penalty approach, volume 58 of Monographs on Statistics and Applied Probability. Chapman & Hall, London.
Hall, P. and Opsomer, J. (2005). Theory for penalized spline regression. Biometrika, 92:105–118.
Huang, J. Z. (2003a). Asymptotics for polynomial spline regression under weak conditions. Statist. Probab. Lett., 65(3):207–216.
Huang, J. Z. (2003b). Local asymptotics for polynomial spline regression. Ann. Statist., 31(5):1600–1635.
Kauermann, G., Krivobokova, T., and Fahrmeir, L. (2007). Some asymptotic results on
generalized penalized spline smoothing. Research report 0733, ORSTAT, K.U.Leuven.
Li, Y. and Ruppert, D. (2008). On the asymptotics of penalized splines. Biometrika. To
appear.
Nychka, D. (1995). Splines as local smoothers. Ann. Statist., 23(4):1175–1197.
Oehlert, G. W. (1992). Relaxed boundary smoothing splines. Ann. Statist., 20(1):146–160.
O'Sullivan, F. (1986). A statistical perspective on ill-posed inverse problems. Stat. Science, 1:505–527. With discussion.
Rice, J. and Rosenblatt, M. (1981). Integrated mean squared error of a smoothing spline. J. Approx. Theory, 33(4):353–369.
Rice, J. and Rosenblatt, M. (1983). Smoothing splines: regression, derivatives and deconvolution. Ann. Statist., 11(1):141–156.
Ruppert, D. (2002). Selecting the number of knots for penalized splines. J. Comput. Graph. Statist., 11(4):735–757.
Ruppert, D. and Carroll, R. (2000). Spatially-adaptive penalties for spline fitting. Aust. N. Z. J. Statist., 42:205–224.
Ruppert, D., Wand, M., and Carroll, R. (2003). Semiparametric Regression. Cambridge
University Press, Cambridge, UK.
Schumaker, L. L. (1981). Spline Functions: Basic Theory. Wiley, New York.
Schwetlick, H. and Kunert, V. (1993). Spline smoothing under constraints on derivatives. BIT, 33(3):512–528.
Speckman, P. (1981). The asymptotic integrated mean square error for smoothing noisy data by splines. Technical report, University of Oregon.
Speckman, P. (1985). Spline smoothing and optimal rates of convergence in nonparametric regression models. Ann. Statist., 13(3):970–983.
Speckman, P. and Sun, D. (2001). Asymptotic properties of smoothing parameter selection in spline smoothing. Technical report, University of Missouri.
Speckman, P. L. and Sun, D. (2003). Fully Bayesian spline smoothing and intrinsic autoregressive priors. Biometrika, 90(2):289–302.
Speed, T. (1991). Comment on "That BLUP is a good thing: the estimation of random effects", by G. K. Robinson. Statist. Science, 6:42–44.
Stone, C. J. (1982). Optimal rate of convergence for nonparametric regression. Ann. Statist., 10(4):1040–1053.
Utreras, F. (1980). Sur le choix du paramètre d'ajustement dans le lissage par fonctions spline. Numer. Math., 34:15–28.
Utreras, F. (1981). Optimal smoothing of noisy data using spline functions. SIAM J. Sci. Stat. Comput., 2(3):349–362.
Utreras, F. (1983). Natural spline functions, their associated eigenvalue problem. Numer. Math., 42(1):107–117.
Utreras, F. (1988). Boundary effects on convergence rates for Tikhonov regularization. J. Approx. Theory, 54(3):235–249.
Wahba, G. (1975). Smoothing noisy data with spline functions. Numer. Math., 24(5):383–393.
Wahba, G. (1990). Spline models for observational data, volume 59 of CBMS-NSF Re-
gional Conference Series in Applied Mathematics. Society for Industrial and Applied
Mathematics (SIAM), Philadelphia, PA.
Xing, W., Zhang, Q., and Wang, Q. (2000). A trace bound for a general square matrix product. IEEE Trans. Automat. Control, 45(8):1563–1565.
Zhou, S., Shen, X., and Wolfe, D. A. (1998). Local asymptotics for regression splines and confidence regions. Ann. Statist., 26(5):1760–1782.
Author addresses.
Gerda Claeskens. Katholieke Universiteit Leuven, OR & Business Statistics and Leuven
Statistics Research Center, Naamsestraat 69, B-3000 Leuven, Belgium.
E-mail: gerda.claeskens@econ.kuleuven.be
Tatyana Krivobokova. Katholieke Universiteit Leuven, OR & Business Statistics and
Leuven Statistics Research Center, Naamsestraat 69, B-3000 Leuven, Belgium.
E-mail: tatyana.krivobokova@econ.kuleuven.be
Jean D. Opsomer. Colorado State University Department of Statistics, Fort Collins, CO
80523, USA.
E-mail: jopsomer@stat.colostate.edu.