Biometrika, doi: 10.1093/biomet/asy041
Printed in Great Britain. Advance Access publication 26 September 2018

Averaging estimated functional linear regression models
JENG-MIN CHIOU
Institute of Statistical Science, Academia Sinica, 128 Academia Road, Nankang District,
Taipei 11529, Taiwan
jmchiou@stat.sinica.edu.tw
AND YANYUAN MA
Department of Statistics, Pennsylvania State University, University Park,
Pennsylvania 16802, U.S.A.
yzm63@psu.edu
Summary
Prediction is often the primary goal of data analysis. In this work, we propose a novel model
averaging approach to the prediction of a functional response variable. We develop a crossvalidation model averaging estimator based on functional linear regression models in which the
response and the covariate are both treated as random functions. We show that the weights cho-
sen by the method are asymptotically optimal in the sense that the squared error loss of the
predicted function is as small as that of the infeasible best possible averaged function. When the
true regression relationship belongs to the set of candidate functional linear regression models,
the averaged estimator converges to the true model and can estimate the regression parameter
functions at the same rate as under the true model. Monte Carlo studies and a data example
indicate that in most cases the approach performs better than model selection.
Some key words: Asymptotic optimality; Crossvalidation; Functional data; Model averaging; Weighting.
1. Introduction
Functional data analysis is an active research area that has experienced rapid growth in recent
decades. Comprehensive works on this topic include monographs (Ramsay & Silverman, 2005;
Ferraty & Vieu, 2006; Horváth & Kokoszka, 2012; Zhang, 2013; Hsing & Eubank, 2015) and
review articles (Müller, 2005; Cuevas, 2014; Wang et al., 2016).
Functional regression analysis relates the response to the covariates when either or both contain random functions. Among various types of functional regression models, those with scalar
responses and functional covariates are the most intensively studied; see, for example, Cardot
et al. (2003), James (2002), Müller & Stadtmüller (2005), and Ferraty & Vieu (2009). Ramsay
& Dalzell (1991) introduced functional regression models in which the response variable is a
© 2018 Biometrika Trust
946 X. Zhang, J.-M. Chiou AND Y. Ma
function and the covariates can be vectors or functions. Faraway (1997) and Chiou et al. (2003)
developed functional response models with scalar covariates. Yao et al. (2005b) and Ferraty
et al. (2012) investigated functional linear regression models in which both the responses and the
covariates were functionals, with various extensions such as functional additive models (Müller
& Yao, 2008) and functional mixture prediction models (Chiou, 2012), among others. Although
the practice of regression analysis often concerns the best-fitting relations between the covariates
and response, its ultimate purpose is often prediction.
Prediction is challenging, especially when the target is functional. The accuracy of prediction
where the errors are assumed to be independent and identically distributed with $E(\epsilon_{il}) = 0$, $E(\varepsilon_{ij}) = 0$, $E(\epsilon_{il}^2) = \sigma_X^2$ and $E(\varepsilon_{ij}^2) = \sigma_Y^2$, and $\{X_i(\cdot), Y_i(\cdot)\}$ is independent of $\epsilon_{il}$ and $\varepsilon_{ij}$,
where Xi represents the entire curve Xi (s) on S and β(s, t) is the coefficient function evaluated
at (s, t). Using the expansion of Yi (t), we have
$$E\{Y_i(t) \mid X_i\} = E\Big\{\mu_Y(t) + \sum_{k=1}^{\infty} \xi_{Y,i,k}\,\varphi_{Y,k}(t) \,\Big|\, X_i\Big\} = \mu_Y(t) + \sum_{k=1}^{\infty} E(\xi_{Y,i,k} \mid X_i)\,\varphi_{Y,k}(t). \qquad (2)$$
$$\int_S \beta(s,t)\,\{X_i(s) - \mu_X(s)\}\,ds = \sum_{k=1}^{\infty}\sum_{l=1}^{\infty} \beta_{k,l}\,\varphi_{Y,k}(t) \sum_{j=1}^{\infty} \xi_{X,i,j} \int_S \varphi_{X,l}(s)\,\varphi_{X,j}(s)\,ds. \qquad (3)$$
Inserting (2) and (3) into (1) and using the orthonormality of the eigenfunctions $\{\varphi_{X,l}\}$, we obtain

$$\sum_{k=1}^{\infty} E(\xi_{Y,i,k} \mid X_i)\,\varphi_{Y,k}(t) = \sum_{k=1}^{\infty}\sum_{l=1}^{\infty} \beta_{k,l}\,\varphi_{Y,k}(t) \sum_{j=1}^{\infty} \xi_{X,i,j} \int_S \varphi_{X,l}(s)\,\varphi_{X,j}(s)\,ds = \sum_{k=1}^{\infty}\sum_{j=1}^{\infty} \beta_{k,j}\,\varphi_{Y,k}(t)\,\xi_{X,i,j},$$
or equivalently

$$E(\xi_{Y,i,k} \mid X_i) = \sum_{j=1}^{\infty} \beta_{k,j}\,\xi_{X,i,j} \qquad (4)$$

for all $k$. At each $k$, (4) is a linear regression model without an intercept but with an infinite number of covariates, i.e., $Y = \sum_{j=1}^{\infty} \beta_j X_j + e$ in the familiar notation. At the $i$th observation, $\xi_{Y,i,k}$ is the response variable, $\{\xi_{X,i,j}\}_{j=1}^{\infty}$ are the covariates, and $\{\beta_{k,j}\}_{j=1}^{\infty}$ are the regression parameters.
Let $\{(\hat\xi_{X,i,j})_{j=1}^{\infty}, \hat\xi_{Y,i,k}\}$, $\hat\varphi_{X,j}(s)$ and $\hat\varphi_{Y,k}(t)$ denote the estimates of $\{(\xi_{X,i,j})_{j=1}^{\infty}, \xi_{Y,i,k}\}$, $\varphi_{X,j}(s)$ and $\varphi_{Y,k}(t)$ based on the observed $U_i(s_{il})$s and $V_i(t_{ij})$s. Solving the linear regression model (4) using $\{(\hat\xi_{X,i,j})_{j=1}^{\infty}, \hat\xi_{Y,i,k}\}$ and the estimates $\hat\varphi_{Y,k}(t)$, we can obtain the estimates $\hat\beta_{k,j}$ of $\beta_{k,j}$, and let

$$\hat\beta(s,t) = \sum_{k=1}^{\infty}\sum_{j=1}^{\infty} \hat\beta_{k,j}\,\hat\varphi_{X,j}(s)\,\hat\varphi_{Y,k}(t).$$
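On a finite grid and with the expansions truncated, this estimation step reduces to $K$ separate no-intercept least-squares fits of the response scores on the predictor scores, followed by assembling the surface from the basis products. A minimal numerical sketch (the function name and array layout are our own, not from the paper):

```python
import numpy as np

def estimate_beta_surface(xi_X, xi_Y, phi_X, phi_Y):
    """Estimate the coefficient surface beta(s, t) from FPCA scores.

    xi_X : (n, J) estimated predictor scores
    xi_Y : (n, K) estimated response scores
    phi_X: (J, nS) predictor eigenfunctions evaluated on an s-grid
    phi_Y: (K, nT) response eigenfunctions evaluated on a t-grid
    """
    # Solve the K separate no-intercept regressions xi_Y[:, k] ~ xi_X,
    # i.e. model (4) truncated at J covariates; B has shape (J, K).
    B, *_ = np.linalg.lstsq(xi_X, xi_Y, rcond=None)
    # beta_hat(s, t) = sum_{k,j} beta_hat_{k,j} phi_{X,j}(s) phi_{Y,k}(t)
    return phi_X.T @ B @ phi_Y   # (nS, nT) grid of beta_hat values
```

With noiseless scores and $n \geqslant J$ the fits recover the coefficients exactly; in practice the scores are themselves estimates.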
Assume we also have a new functional covariate X0 (s), but the corresponding response function
Y0 (t) is unknown and must be predicted, assuming that {X0 (s), Y0 (t)} has the same distribution
as {Xi (s), Yi (t)} (i = 1, . . . , n). Like Xi (s), X0 (s) is observed with measurement error, i.e., only
the $U_0(s_{0l})$ are observed. The relation in (1) leads to

$$E\{Y_0(t) \mid X_0\} = \mu_Y(t) + \int_S \beta(s,t)\,\{X_0(s) - \mu_X(s)\}\,ds = \mu_Y(t) + \int_S \sum_{k=1}^{\infty}\sum_{j=1}^{\infty} \beta_{k,j}\,\varphi_{X,j}(s)\,\varphi_{Y,k}(t) \sum_{j'=1}^{\infty} \xi_{X,0,j'}\,\varphi_{X,j'}(s)\,ds = \mu_Y(t) + \sum_{k=1}^{\infty}\sum_{j=1}^{\infty} \beta_{k,j}\,\xi_{X,0,j}\,\varphi_{Y,k}(t).$$

Using this and the estimates $\hat\beta_{k,j}$, $\hat\xi_{X,0,j}$ and $\hat\varphi_{Y,k}(t)$, we aim to predict $Y_0(t)$ by considering

$$\hat Y_0(t) = \hat\mu_Y(t) + \int_S \hat\beta(s,t)\,\{\hat X_0(s) - \hat\mu_X(s)\}\,ds = \hat\mu_Y(t) + \sum_{k=1}^{\infty}\sum_{j=1}^{\infty} \hat\beta_{k,j}\,\hat\xi_{X,0,j}\,\hat\varphi_{Y,k}(t),$$

where $\hat\xi_{X,0,j}$ is estimated using the $U_0(s_{0l})$s and $\hat X_0(s) = \sum_{j=1}^{\infty} \hat\xi_{X,0,j}\,\hat\varphi_{X,j}(s) + \hat\mu_X(s)$.
We omit the detailed calculations, which are well documented in the functional data analysis
literature. Specifically, see Yao et al. (2005a,b) for the estimation procedure for the mean func-
tions μX (·) and μY (·), the covariance functions CX (·, ·) and CY (·, ·) with their corresponding
eigenvalues and eigenfunctions {λX, j , ϕX, j (s)} and {λY,k , ϕY,k (t)}, and the random coefficients
ξX,i, j and ξY,i, j based on the observations Ui (sil ) and Vi (tij ).
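For intuition, when the curves are densely and regularly observed, those estimation steps reduce to an eigendecomposition of the sample covariance matrix on the grid; the sparse, irregular case requires the smoothing-based machinery of Yao et al. (2005a) instead. A simplified dense-grid sketch (names and conventions are illustrative):

```python
import numpy as np

def fpca_dense(curves, grid, n_comp):
    """FPCA for fully observed curves: rows of `curves` are X_i on `grid`."""
    h = grid[1] - grid[0]                  # equally spaced grid step
    mu = curves.mean(axis=0)               # estimate of the mean function
    centred = curves - mu
    # sample covariance C(s, s') evaluated on the grid
    C = centred.T @ centred / len(curves)
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1][:n_comp]
    lam = evals[order] * h                 # eigenvalues of the integral operator
    phi = evecs[:, order].T / np.sqrt(h)   # eigenfunctions with int phi_j^2 = 1
    scores = centred @ phi.T * h           # xi_{i,j} = int {X_i - mu} phi_j
    return mu, lam, phi, scores
```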
where Kn and Jn can increase to infinity as n → ∞. The linear regression model (5) is then solved
to obtain β̂k, j for k = 1, . . . , Kn and j = 1, . . . , Jn , and conceptually form
$$\hat\beta_{J_n,K_n}(s,t) = \sum_{k=1}^{K_n}\sum_{j=1}^{J_n} \hat\beta_{k,j}\,\hat\varphi_{X,j}(s)\,\hat\varphi_{Y,k}(t).$$
We still obtain μ̂X (s) and μ̂Y (t) using the smoothed means of the Ui (sil )s and Vi (tij )s, as
in § 2·2.
Thus, inserting the results into (1), we can predict $Y_0(t)$ for $t \in T$ via

$$\hat Y_{0,J_n,K_n}(t) = \hat\mu_Y(t) + \int_S \hat\beta_{J_n,K_n}(s,t)\,\{\hat X_{0,K_n}(s) - \hat\mu_X(s)\}\,ds = \hat\mu_Y(t) + \sum_{k=1}^{K_n}\sum_{j=1}^{J_n} \hat\beta_{k,j}\,\hat\xi_{X,0,j}\,\hat\varphi_{Y,k}(t), \qquad (6)$$

where $\hat X_{0,K_n}(s) = \sum_{j=1}^{K_n} \hat\xi_{X,0,j}\,\hat\varphi_{X,j}(s) + \hat\mu_X(s)$. The predicted function $\hat Y_{0,J_n,K_n}(t)$ in (6) depends
on the total numbers of components Jn and Kn , so the prediction performance varies with Jn
and Kn .
where the second term is unrelated to $w$. Thus, minimizing $\tilde r(w)$ is equivalent to minimizing

$$r(w) \equiv E \int_T \big\{h_0(t) - \hat Y_0(t, w)\big\}^2\,dt. \qquad (8)$$
In the following development, we determine the optimal weights, design a procedure to predict
Y0 (t) by approximating and minimizing r̃(w), and establish the theoretical properties of the
procedure by examining its performance in terms of minimizing r(w).
where $\hat X_{i,K_n}(s) = \sum_{j=1}^{K_n} \hat\xi_{X,i,j}\,\hat\varphi_{X,j}(s) + \hat\mu_X(s)$. The computation of $\hat Y^{[-q]}_{i,J_n,K_n}(t)$ follows similarly to that of the last expression in (6); we omit the details. Then the weighted average prediction of $Y_i(t)$ is

$$\hat Y_i^{[-q]}(t, w) = \sum_{K_n \in \mathcal{K}} \sum_{J_n \in \mathcal{J}} w_{J_n,K_n}\,\hat Y^{[-q]}_{i,J_n,K_n}(t).$$
instead of $cv_Q(w)$. Nevertheless, we propose to work with $cv_Q(w)$ because

$$cv_Q(w) = \sum_{q=1}^{Q}\sum_{m=1}^{M} \int_T \big\{Y_{(q-1)M+m}(t) + \varepsilon_{(q-1)M+m}(t) - \hat Y^{[-q]}_{(q-1)M+m}(t, w)\big\}^2\,dt$$
$$= \sum_{q=1}^{Q}\sum_{m=1}^{M} \int_T \big\{Y_{(q-1)M+m}(t) - \hat Y^{[-q]}_{(q-1)M+m}(t, w)\big\}^2\,dt + n(T_2 - T_1)\sigma_Y^2$$
$$= \widetilde{cv}_Q(w) + n(T_2 - T_1)\sigma_Y^2 \qquad (11)$$

almost surely, where $n(T_2 - T_1)\sigma_Y^2$ is unrelated to $w$; see the Supplementary Material for the proof of (11). Therefore, minimizing $cv_Q(w)$ and $\widetilde{cv}_Q(w)$ are equivalent almost surely.
Second, the computation of (9) is a standard quadratic programming problem and is extremely simple. Let $\tilde V_{q,m,J_n,K_n}(t) = V_{(q-1)M+m}(t) - \hat Y^{[-q]}_{(q-1)M+m,J_n,K_n}(t)$, define the inner product of two functions on a domain $T$ by $\langle f_1, f_2\rangle = \int_T f_1(t) f_2(t)\,dt$ and the norm by $\|f_1\| = \{\int_T f_1^2(t)\,dt\}^{1/2}$, and let $\tilde V^*_{J_n,K_n,J_n',K_n'} = \sum_{q=1}^{Q}\sum_{m=1}^{M} \langle \tilde V_{q,m,J_n,K_n}, \tilde V_{q,m,J_n',K_n'}\rangle$. Then

$$cv_Q(w) = \sum_{q=1}^{Q}\sum_{m=1}^{M} \Big\| \sum_{J_n \in \mathcal{J}}\sum_{K_n \in \mathcal{K}} w_{J_n,K_n}\,\tilde V_{q,m,J_n,K_n} \Big\|^2 = \sum_{q=1}^{Q}\sum_{m=1}^{M} \Big\langle \sum_{J_n \in \mathcal{J}}\sum_{K_n \in \mathcal{K}} w_{J_n,K_n}\,\tilde V_{q,m,J_n,K_n},\ \sum_{J_n' \in \mathcal{J}}\sum_{K_n' \in \mathcal{K}} w_{J_n',K_n'}\,\tilde V_{q,m,J_n',K_n'} \Big\rangle$$
$$= \sum_{J_n \in \mathcal{J}}\sum_{K_n \in \mathcal{K}}\sum_{J_n' \in \mathcal{J}}\sum_{K_n' \in \mathcal{K}} w_{J_n,K_n}\,w_{J_n',K_n'}\,\tilde V^*_{J_n,K_n,J_n',K_n'} = w^{\mathrm T} H w, \qquad (12)$$

where $H = (H_{r,s})$ and the $(r,s)$ element of $H$ is $H_{r,s} = \tilde V^*_{J_n,K_n,J_n',K_n'}$ with $r = (J_n - 1)|\mathcal{K}| + K_n$ and $s = (J_n' - 1)|\mathcal{K}| + K_n'$, where $J_n, J_n' = 1, \ldots, |\mathcal{J}|$ and $K_n, K_n' = 1, \ldots, |\mathcal{K}|$; here $|\mathcal{J}|$ and $|\mathcal{K}|$ stand for the cardinalities of the sets $\mathcal{J}$ and $\mathcal{K}$, respectively. It is now clear that (9) is a standard quadratic programming problem of the form

$$\min_{w}\ w^{\mathrm T} H w \quad \text{subject to} \quad l^{\mathrm T} w = 1, \quad w \geqslant 0,$$

where $l$ is a column vector whose entries are all 1, and $P = |\mathcal{J}||\mathcal{K}|$. This problem can be rapidly solved even when $P$ is very large.
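As an illustration, the minimization can be delegated to any quadratic programming routine; the sketch below uses scipy's general-purpose SLSQP solver (a dedicated QP solver would be preferable for very large $P$):

```python
import numpy as np
from scipy.optimize import minimize

def optimal_weights(H):
    """Minimise w' H w subject to sum(w) = 1 and w >= 0."""
    P = H.shape[0]
    w0 = np.full(P, 1.0 / P)                # start from equal weights
    res = minimize(
        lambda w: w @ H @ w,                # quadratic objective of (12)
        w0,
        jac=lambda w: 2.0 * H @ w,
        bounds=[(0.0, None)] * P,           # w >= 0 componentwise
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x
```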
Define the average leave-one-out prediction to be $\hat Y_i^{(-i)}(t, w) = \sum_{K_n \in \mathcal{K}}\sum_{J_n \in \mathcal{J}} w_{J_n,K_n}\,\hat Y^{(-i)}_{i,J_n,K_n}(t)$, and

$$\bar R(w) = \frac{1}{n}\sum_{i=1}^{n} E \int_T \big\{h_i(t) - \hat Y_i^{(-i)}(t, w)\big\}^2\,dt. \qquad (14)$$
Below we study the theoretical properties of the prediction risk r(w) in (8), which is associated
with the leave-one-out crossvalidation prediction risk R(w) in (13) defined above.
Condition 1. There exist functions $\mu_Y^*(t)$, $\mu_X^*(s)$, $\beta^*_{J_n,K_n}(s,t)$, $X^*_{i,K_n}(s)$ and a sequence $c_n \to 0$ such that as $n \to \infty$, $\hat\mu_Y(t) - \mu_Y^*(t) = O_p(c_n)$, $\hat\mu_X(s) - \mu_X^*(s) = O_p(c_n)$, $\hat\beta_{J_n,K_n}(s,t) - \beta^*_{J_n,K_n}(s,t) = O_p(c_n)$ and $\hat X_{i,K_n}(s) - X^*_{i,K_n}(s) = O_p(c_n)$ hold uniformly for $s \in S$, $t \in T$, $J_n \in \mathcal{J}$, $K_n \in \mathcal{K}$ and $i \in \{0, 1, \ldots, n\}$.
Let $Y^*_{i,J_n,K_n}(t) = \mu_Y^*(t) + \int_S \beta^*_{J_n,K_n}(s,t)\,\{X^*_{i,K_n}(s) - \mu_X^*(s)\}\,ds$,

$$Y_i^*(t, w) = \sum_{K_n \in \mathcal{K}}\sum_{J_n \in \mathcal{J}} w_{J_n,K_n}\,Y^*_{i,J_n,K_n}(t), \qquad R^*(w) = \frac{1}{n}\sum_{i=1}^{n} \int_T \big\{h_i(t) - Y_i^*(t, w)\big\}^2\,dt. \qquad (15)$$
Condition 2. As $n \to \infty$, $h_i(t) = O_p(1)$, $Y^*_{i,J_n,K_n}(t) = O_p(1)$ and $Y_i(t) = O_p(1)$ hold uniformly for $i \in \{0, 1, \ldots, n\}$, $t \in T$, $J_n \in \mathcal{J}$ and $K_n \in \mathcal{K}$.
Condition 5. The prediction risk $r(w)$ and the crossvalidation risk $R(w)$ satisfy $\sup_{w \in \mathcal{W}} |r(w)/R(w) - 1| \to 0$ as $n \to \infty$.
Condition 1 requires the convergence of the estimators $\hat\mu_Y(t)$ and $\hat\mu_X(s)$ and of the estimator $\hat\beta_{J_n,K_n}(s,t)$ under each model, and describes their convergence rates. They need not have the same
convergence rate. For example, when $\hat\mu_Y(t) - \mu_Y^*(t) = O_p(n^{-a_1})$, $\hat\mu_X(s) - \mu_X^*(s) = O_p(n^{-a_2})$, $\hat\beta_{J_n,K_n}(s,t) - \beta^*_{J_n,K_n}(s,t) = O_p(n^{-a_3})$ and $\hat X_{i,K_n}(s) - X^*_{i,K_n}(s) = O_p(n^{-a_4})$, $c_n$ can be $n^{-\min(a_1, a_2, a_3, a_4)}$.
This very mild condition is satisfied by all the valid functional data estimators that we know;
for example, Yao et al. (2005b) provide the rates of convergence and the conditions for these
rates in their (41), (A1), (A2) and the assumption in Lemma A.1. Since these conditions are
lengthy, we omit them. Condition 2 excludes some pathological cases in which the limiting
value explodes. Condition 3 contains more essential requirements in situations in which our
asymptotic results apply. The first part requires that $\eta_n$ grow at a rate no slower than $n^{1/2}$. This
Here, $\hat Y_0(t, w)$ and $\hat Y_0^{(-1)}(t, w)$ should be very close because, although they use the observations $\{U_0(s_{0l});\ U_1(s_{1l}), V_1(t_{1j}), \ldots, U_n(s_{nl}), V_n(t_{nj})\}$ and $\{U_0(s_{0l});\ U_2(s_{2l}), V_2(t_{2j}), \ldots, U_n(s_{nl}), V_n(t_{nj})\}$, respectively, they are otherwise constructed in an identical manner. Thus, Condition 5 is also reasonable.
We are now ready to describe our theoretical findings.
in probability as n → ∞.
Theorem 1 shows that the prescribed model averaging procedure is asymptotically optimal in
the sense that its squared error loss is asymptotically identical to that of the infeasible best possible
model averaging estimator. The proof of Theorem 1 is in the Supplementary Material, with all
other proofs. Condition 3 used in Theorem 1 requires that all candidate models be misspecified.
This excludes the lucky situation in which one of these candidate models happens to be the true
model. Indeed, if the candidate model set contains a true model, then the more suitable practice
would be to perform model selection instead of model averaging. Of course, in practice, it is not
known whether the candidate model set happens to contain the true model, and without this prior
Let (J ∗ , K ∗ ) correspond to a true model; this may not be the only true model. For example, any
model containing model (J ∗ , K ∗ ) is also true. Assume that uniformly for s ∈ S and t ∈ T ,
This simply states that the estimator of the regression coefficient function βJ ∗ ,K ∗ (s, t) is consistent,
which is true for all valid estimators in the literature. Depending on the situation, the convergence
rate dn can vary. In the ideal case in which the complete curves Ui (s) and Vi (t) are observed,
dn = n−1/2 for fixed J ∗ and K ∗ . In more common cases in which the curves are observed at
some discrete points, because smoothing methods are typically implemented first to estimate the
curves μX (s) and μY (t) and to estimate their covariance processes CX (s, s ) and CY (t, t ), the
convergence rate of dn is slower than n−1/2 . For increasing J ∗ and K ∗ , dn depends on J ∗ and K ∗ ,
the covariance processes, the eigenvalue gaps and decay rates, and the bandwidths and has very
complex form (Yao et al., 2005b). We further impose the following conditions.
Condition 7. There exists $\delta(t) \geqslant c > 0$ such that uniformly for $w \in \mathcal{W}$ and $t \in T$ and for almost all $i \in \{1, \ldots, n\}$,

$$\Big[\int_S \{\hat\beta_w(s,t) - \beta(s,t)\}\{\hat X_i(s) - \hat\mu_X(s)\}\,ds\Big]^2 \geqslant \delta(t) \int_S \{\hat\beta_w(s,t) - \beta(s,t)\}^2\,ds,$$
Condition 6 resembles and often strengthens Condition 4. It requires the difference between the regular prediction $\hat Y_{(q-1)M+m}(t, w)$ and the leave-$M$-out prediction $\hat Y^{[-q]}_{(q-1)M+m}(t, w)$ to decrease as the sample size increases. As mentioned, we show that Condition 6 holds quite naturally and
give an expression for gn in Corollary 1. Condition 7 seems complex; essentially it states that
most X̂i (s)s do not degenerate, in the sense that their inner products with β̂w (s, t) − β(s, t) do not
approach zero. We now describe the performance of the weighted estimation results when the
true model is among the candidate models.
Theorem 2. When the true model $(J^*, K^*)$ is one of the candidate models, under Conditions 1, 2, 6 and 7, the weighted estimator satisfies $\int_T \int_S \{\hat\beta_{\hat w}(s,t) - \beta(s,t)\}^2\,ds\,dt = O_p(d_n^2 + n^{-1} + g_n)$, where $g_n$ is defined in Condition 6.
Condition 8. As n → ∞, X̂i (s) and μ̂X (s) are Op (1) uniformly for i ∈ {1, . . . , n} and s ∈ S .
4. Data example
Traffic flow prediction is important in transportation studies because traffic flow is a crucial
macroscopic traffic characteristic. Here, we aim to predict a future or unobserved short-term
traffic flow trajectory using the current or partially observed traffic flow trajectory (Chiou, 2012).
We use historical records of daily traffic flow rates, i.e., vehicle counts per five-minute interval,
collected in 2014 by a dual-loop vehicle detector located southbound near Shea-Shan Tunnel on
Taiwan Freeway 5. The 354 daily trajectories used in the analysis exclude 11 trajectories with
massive missing values due to detector malfunction. We randomly sample 219 days as the training
data to build the functional regression model, and use the remaining 135 days as the test data to
evaluate the prediction performance.
We set the time domains $S$ and $T$ for $X$ and $Y$ to $([0, 10], [10, 20])$ for Scenario I and $([6, 14], [14, 22])$ for Scenario II, and use fraction-of-variance-explained and the Akaike and
Bayesian information criteria to select the numbers of functional principal components of X and
Y , as shown in Table 1. The results selected using the Akaike and Bayesian information criteria
are quite close and are much larger than those selected using the fraction-of-variance criterion.
Table 1. Number of principal components selected for traffic flow prediction under two scenarios of $(S, T)$

Scenario  $(S, T)$              FVE(0·90)  FVE(0·95)  aic       bic
I         ([0, 10], [10, 20])   (2, 3)     (3, 5)     (15, 20)  (14, 20)
II        ([6, 14], [14, 22])   (2, 2)     (4, 3)     (18, 15)  (16, 13)

aic, Akaike information criterion; bic, Bayesian information criterion; FVE, fraction-of-variance-explained.
[Figure 1 appears here: three line-chart panels, rmipe (vertical axis, 24 to 27) plotted against RW and MA(0), MA(1), MA(2), MA(4), MA(6), MA(8), MA(10) for each criterion.]
Fig. 1. Line charts for the root mean integrated squared prediction errors, rmipe, of traffic flow rates for X in S = [0, 10] and Y in T = [10, 20] with Q = 2 (a), 4 (b) and 10 (c). The (Kn, Jn) for the reweighting method, RW, and the selection method, MA(0), and the candidate sets (J, K) for the model averaging method, MA(d), are determined by the fraction-of-variance-explained (0·90) (dashed and ◦), fraction-of-variance-explained (0·95) (dashed), Akaike (solid) and Bayesian (solid and •) information criteria.
We use these criteria to determine the range of component numbers for model averaging. We also
investigate how averaging different numbers of models impacts the prediction results.
To perform prediction using model averaging, we choose the candidate sets $\mathcal{J}$ and $\mathcal{K}$ by setting $\mathcal{J} = \{j \in \mathbb{N} : |\hat L - j| \leqslant d\}$ and $\mathcal{K} = \{k \in \mathbb{N} : |\hat P - k| \leqslant d\}$, where $\hat L$ and $\hat P$ are obtained using the Akaike or Bayesian information criterion or fraction-of-variance-explained. We call the resulting procedure $d$-divergence model averaging. Here, $d$ is a small positive integer; when $d = 0$, the procedure reduces to the selection criterion without model averaging. We then obtain the optimal
weights through $Q$-fold crossvalidation, where $Q$ is chosen to be 2, 4 and 10. In addition to the model averaging approach, we further include a reweighting method, as suggested by a referee. The reweighting method is defined as $\hat Y_0(t, p) = \hat\mu_Y(t) + \sum_{k=1}^{K_n}\sum_{j=1}^{J_n} p_{k,j}\,\hat\beta_{k,j}\,\hat\xi_{X,0,j}\,\hat\varphi_{Y,k}(t)$, where $K_n$ and $J_n$ are determined by one of the above three criteria and $p = (p_{1,1}, \ldots, p_{K_n,J_n})$ is selected by minimizing the $Q$-fold crossvalidation criterion without any constraint on $p$.
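Because $\hat Y_0(t, p)$ is linear in $p$, the unconstrained reweighting step amounts to a linear least-squares fit of the held-out response curves on the individual prediction terms. A single-fold sketch on a common t-grid (array layout and names are ours; the actual procedure pools all $Q$ folds):

```python
import numpy as np

def reweight_unconstrained(B, xi_X_val, phi_Y, Y_val_centred):
    """Select p = (p_{k,j}) by unconstrained least squares on held-out curves.

    B            : (J, K) estimated coefficients beta_hat_{k,j}
    xi_X_val     : (m, J) predictor scores of the validation curves
    phi_Y        : (K, nT) response eigenfunctions on the t-grid
    Y_val_centred: (m, nT) validation response curves minus mu_hat_Y
    """
    m, J = xi_X_val.shape
    K, nT = phi_Y.shape
    # terms[i, j, k, t] = beta_hat_{k,j} * xi_{X,i,j} * phi_{Y,k}(t)
    terms = (xi_X_val[:, :, None, None] * B[None, :, :, None]
             * phi_Y[None, None, :, :])
    # rows indexed by (curve, grid point), columns by (j, k)
    design = terms.transpose(0, 3, 1, 2).reshape(m * nT, J * K)
    target = Y_val_centred.reshape(m * nT)
    p, *_ = np.linalg.lstsq(design, target, rcond=None)
    return p.reshape(J, K)
```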
Figure 1 illustrates the root mean integrated prediction errors in Scenario I,
$$\mathrm{rmipe}(w) = \Big[ n_0^{-1} |T|^{-1} \sum_{i=1}^{n_0} \int_T \big\{Y_{0,i}(t) - \hat Y_{0,i}(t, w)\big\}^2\,dt \Big]^{1/2}, \qquad (18)$$
where |T | is the interval length of T and n0 is the number of test data. The results suggest that
the reweighting method does not perform well and that a small number of components (3,5) for
(X , Y ) selected by the fraction-of-variance-explained (0·95) criterion almost reaches the minimal
root mean integrated squared prediction errors. In addition, the performance depends little on
Q. The 1-divergence model averaging coupled with fraction-of-variance-explained (0·90) also
performs well, and the prediction error decreases only slightly as d increases. The results using
the Akaike and Bayesian information criteria with larger numbers of components have larger
[Figure 2 appears here: three line-chart panels (a)-(c), rmipe (vertical axis, 22 to 29) plotted against RW and MA(0), MA(1), MA(2), MA(4), MA(6), MA(8), MA(10) for each criterion.]
Fig. 2. Line charts for the root mean integrated squared prediction error of traffic flow rates for X in S = [6, 14] and
Y in T = [14, 22] with Q = 2 (a), 4 (b), and 10 (c). The abbreviations are the same as in Fig. 1.
prediction errors, but these results also improve when using model averaging with d = 8 or
bigger.
Figure 2 suggests that Scenario II requires more components, since the models selected by
the Akaike information criterion (18,15) and the Bayesian information criterion (16,13) yield
relatively small prediction errors, and the model averaging approaches only slightly reduce
them. While fraction-of-variance-explained (0·90) and fraction-of-variance-explained (0·95) both
underestimate the numbers of components, model averaging is able to compensate because the
prediction errors decrease as d increases to 6 or 8.
We display the weights for Q = 2 for each selection criterion with d = 1 and d = 2 in Tables 2
and 3. Most of the (Jn , Kn ) combinations shrink to zero and only a few play a role in prediction.
For example, Table 3 indicates that fraction-of-variance-explained (0·95) with d = 2 results in
the optimal weights 0·36, 0·30, 0·23, and 0·11, respectively, for the models with the component
numbers (3,5), (5,3), (5,7) and (3,6), and all other weights are zero.
The data example demonstrates that when a criterion selects the right number of components, further employing model averaging causes no additional loss, and when a criterion does not select the right number of components, model averaging can guard against prediction loss. Model averaging provides the optimal weights to improve prediction performance and protects against the potential prediction loss incurred by relying on a single selection criterion.
Table 3. Model averaging weights (%) obtained by Q-fold (Q = 2) crossvalidation, with rows corresponding to Jn and columns to Kn, both chosen with d = 2

FVE(0·90): (L̂, P̂) = (2, 3)
Jn \ Kn    1     2    3     4    5
1          0·0   0·0  0·0   0·0  0·0
2          0·0   0·0  0·0   0·0  0·0
3          3·1   0·0  0·0   0·0  53·2
4          0·0   0·0  29·5  0·0  14·2

FVE(0·95): (L̂, P̂) = (3, 5)
Jn \ Kn    3     4    5     6     7
1          0·0   0·0  0·0   0·0   0·0
2          0·0   0·0  0·0   0·0   0·0
3          0·0   0·0  36·0  11·2  0·0
4          0·0   0·0  0·0   0·0   0·0
5          29·8  0·0  0·0   0·0   23·0
5. Simulation studies
5·1. Simulation designs
We assume that pairs of the smooth stochastic processes $X_i$ and $Y_i$ are observed at time-points $S_{i,l}$ and $T_{i,m}$, respectively, with measurement errors, for $l = 1, \ldots, n_{X_i}$ and $m = 1, \ldots, n_{Y_i}$, assuming that $n_{X_i} \geqslant 2$ and $n_{Y_i} \geqslant 2$. We consider the following simulation designs.
(a) Consider the sparse designs. Set nT = 400 as the number of individual trajectories of
a dataset, among which the first n = 300 trajectories are used as training data and the
other n0 = 100 as test data. Set nXi = nYi (i = 1, . . . , nT ), which are sampled from the
discrete uniform distribution on [10, 12]. For each i, the associated Si,l (l = 1, . . . , nXi ) and
Ti,m (m = 1, . . . , nYi ) are randomly sampled from 26 equally spaced grid points on [0, 1].
Generate 100 replicates of the simulation datasets.
(b) Consider generating the observations of $X_i$ through the basis expansion and the geometric Brownian motion, respectively. For the basis expansion, set the eigenfunctions $\varphi_{X,j}(s) = \sqrt{2}\cos(j\pi s)$ $(j = 1, \ldots, J_n)$ with $J_n = 8$, along with the eigenvalues $\lambda_{X,j} = 9\{1 - 0{\cdot}1(j-1)\}$ or $\lambda_{X,j} = 9 j^{-1{\cdot}2}$.
For geometric Brownian motion dX (t) = μG dt + σG X (t)dW (t), we set μG = 3 and σG =
0·75 and let W (t) be a Wiener process. The eigenvalue-eigenfunction pairs {λX, j , ϕX, j (s)}
(j = 1, . . . , Jn ) are obtained by the functional principal component analysis using the
generated observations of geometric Brownian motion.
(c) Generate the random coefficients ξX,i, j (i = 1, . . . , nT ) such that ξX,i, j is a N (0, λX, j ) variate
for each j = 1, . . . , Jn .
(d) Set $\mu_X(s) = 0$ and obtain the observations of $X_i$ at $S_{i,l}$ by $X_i(S_{i,l}) = \sum_{j=1}^{J_n} \xi_{X,i,j}\,\varphi_{X,j}(S_{i,l})$ $(l = 1, \ldots, n_{X_i};\ i = 1, \ldots, n_T)$.
(e) Set the eigenfunctions $\varphi_{Y,k}(t) = \sqrt{2}\sin(k\pi t)$ $(k = 1, \ldots, K_n)$ with $K_n = 8$, and the $J_n \times K_n$ matrix $P_{J_n K_n} = \{\beta_{j,k}\}_{1 \leqslant j \leqslant J_n, 1 \leqslant k \leqslant K_n}$ with

$$\beta_{j,k} = \begin{cases} 0{\cdot}8^{j}, & 1 \leqslant j = k \leqslant 6, \\ 0{\cdot}5^{\max(j,k)}, & 1 \leqslant j \neq k \leqslant 6, \\ 0, & j > 6 \text{ or } k > 6. \end{cases}$$
(f) Set the diagonal matrix $\Sigma_\xi = \mathrm{diag}\{(k^{-1{\cdot}5})_{k=1,\ldots,K_n}\}$ and generate $\varepsilon_{\xi,i} = (\varepsilon_{\xi,i,1}, \ldots, \varepsilon_{\xi,i,K_n})^{\mathrm T}$ from the multivariate normal distribution with mean zero and covariance $\Sigma_\xi$.
(g) Let $\xi_{Y,i} = (\xi_{Y,i,1}, \ldots, \xi_{Y,i,K_n})^{\mathrm T}$ and form $\xi_{Y,i} = P^{\mathrm T}_{J_n,K_n} \xi_{X,i} + \varepsilon_{\xi,i}$ $(i = 1, \ldots, n_T)$.
(h) Set $\mu_Y(t) = 0$ and generate the observation of $Y_i$ at $T_{i,m}$ by $Y_i(T_{i,m}) = \sum_{k=1}^{K_n} \xi_{Y,i,k}\,\varphi_{Y,k}(T_{i,m})$ $(i = 1, \ldots, n_T)$.
(i) Obtain the observations $U_{il}$ and $V_{im}$ by adding measurement errors $\epsilon_{il}$ and $\varepsilon_{im}$ to $X_i(S_{i,l})$ and $Y_i(T_{i,m})$, where $\epsilon_{il}$ and $\varepsilon_{im}$ are independent $N(0, 0{\cdot}8)$ variates.
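Steps (a)-(i) for the basis-expansion design with the linearly decaying eigenvalues can be sketched as follows (function name and output format are ours; the geometric Brownian motion variant is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_dataset(nT=400, J=8, K=8, n_grid=26, sigma2=0.8):
    """Sparse design (a)-(i) with the cosine/sine basis expansion."""
    grid = np.linspace(0.0, 1.0, n_grid)
    lam_X = 9.0 * (1.0 - 0.1 * np.arange(J))          # eigenvalues of X, (b)
    # coefficient matrix of (e): 0.8^j on the diagonal, 0.5^{max(j,k)}
    # off the diagonal for j, k <= 6, and zero beyond
    B = np.zeros((J, K))
    for j in range(1, 7):
        for k in range(1, 7):
            B[j - 1, k - 1] = 0.8**j if j == k else 0.5**max(j, k)
    data = []
    for _ in range(nT):
        nX = rng.integers(10, 13)                      # n_Xi = n_Yi in {10,...,12}
        S = rng.choice(grid, size=nX, replace=False)   # observation times for X
        T = rng.choice(grid, size=nX, replace=False)   # observation times for Y
        xi_X = rng.normal(0.0, np.sqrt(lam_X))         # (c) predictor scores
        eps_xi = rng.normal(0.0, np.arange(1, K + 1)**-0.75)  # (f) sd k^{-0.75}
        xi_Y = B.T @ xi_X + eps_xi                     # (g) response scores
        phi_X = np.sqrt(2) * np.cos(np.outer(np.arange(1, J + 1), S) * np.pi)
        phi_Y = np.sqrt(2) * np.sin(np.outer(np.arange(1, K + 1), T) * np.pi)
        U = xi_X @ phi_X + rng.normal(0, np.sqrt(sigma2), nX)  # (d) + (i)
        V = xi_Y @ phi_Y + rng.normal(0, np.sqrt(sigma2), nX)  # (h) + (i)
        data.append((S, U, T, V))
    return data
```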
Step 1. Assign $\mathcal{A} = \{Y_i(t), X_i(s)\}_{1 \leqslant i \leqslant n}$ as the training set and the remaining $n_0$ pairs as the testing set, $\mathcal{B}$. In the simulation, we set $n = 300$ and $n_0 = 100$.
Step 3. Obtain the optimal weights $\hat w$ by performing the $Q$-fold crossvalidation procedure based on $\mathcal{A}$.
(i) Divide $\mathcal{A}$ into $Q$ subsets, $\{\mathcal{Q}_q\}_{1 \leqslant q \leqslant Q}$, each with sample size $M_q = n/Q$.
(ii) For each q ∈ {1, . . . , Q}, perform the functional linear regression estimation on the leave-
Mq -out set, A\Qq , for each Jn ∈ J and Kn ∈ K.
(iii) Obtain the estimated $\hat\beta^{[-q]}_{J_n,K_n}(s,t)$ and the predicted $\hat Y^{[-q]}_{i,J_n,K_n}(t)$ for $\{X_i(s) \in \mathcal{Q}_q\}$, $J_n \in \mathcal{J}$ and $K_n \in \mathcal{K}$.
(iv) Arrange $\{\hat Y^{[-q]}_{i,J_n,K_n}(t) - Y_i(t)\}_{J_n \in \mathcal{J}, K_n \in \mathcal{K}}$ in a $1 \times |\mathcal{J}||\mathcal{K}|$ array corresponding to the weight vector $w$, and form the matrix $H$ as in (12).
(v) Obtain the optimal weight $\hat w_{J_n,K_n}$ by solving $\arg\min_w cv_Q(w) = w^{\mathrm T} H w$ subject to $l^{\mathrm T} w = 1$ and $w \geqslant 0$, using quadratic programming.
(vi) Obtain $\hat\beta_{\hat w}(s,t) = \sum_{J_n \in \mathcal{J}, K_n \in \mathcal{K}} \hat w_{J_n,K_n}\,\hat\beta_{J_n,K_n}(s,t)$, where $\hat w_{J_n,K_n}$ is an element of the optimal weight $\hat w$ derived above.
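Steps (iv) and (v) hinge on assembling $H$ from the held-out residual curves; when all curves are evaluated on a common t-grid, the inner products in (12) collapse into a single matrix product. A sketch (the array layout is an assumption):

```python
import numpy as np

def build_H(residuals, h):
    """Form H of (12): `residuals` is a (P, n, n_grid) array holding the
    residual curve V_tilde for each of the P = |J||K| candidate models and
    each training curve, on a t-grid with spacing h."""
    flat = residuals.reshape(residuals.shape[0], -1)   # (P, n * n_grid)
    # H[r, s] = sum over curves of <V_tilde_r, V_tilde_s>, with the
    # integrals approximated by Riemann sums on the grid
    return flat @ flat.T * h
```

The resulting $H$ is symmetric positive semi-definite, so $cv_Q(w) = w^{\mathrm T} H w$ is a convex quadratic in $w$.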
A large dimension of the weight vector w can cause numerical instability when searching
for the optimal weight vector w by minimizing cvQ (w) in (9) over w. For example, if we set
$J_L = K_L = 1$ and let $\mathcal{J} = \{1, \ldots, J_U\}$ and $\mathcal{K} = \{1, \ldots, K_U\}$, where $J_U \leqslant \max_i\{n_{X_i}\}$ and $K_U \leqslant \max_i\{n_{Y_i}\}$, then the number of weight parameters is $J_U K_U - 1$, which can be large. In
practice, we can determine the sets J and K by making use of selection criteria such as in Step 2.
By setting J and K more carefully, we can reduce the number of elements in the weight vector
w, which helps to simplify the numerical implementation.
[Figure 3 appears here: two panels of boxplots of rmipe (approximately 3·0 to 8·0) for the FVE-, aic- and bic-based selection, reweighting and model averaging methods, and one panel of boxplots of the numbers of components (up to about 20) selected for X and Y by each criterion.]
Fig. 3. Outcome for the basis expansion design with λX, j = 9{1 − 0·1( j − 1)}: boxplots of rmipe with Q =
1, 2, 4 and the selected number of components based on 100 simulation datasets. rmipe, root mean integrated
squared prediction errors; FVE90, fraction-of-variance-explained (0·90); FVE95, fraction-of-variance-
explained (0·95); aic, Akaike information criterion; bic, Bayesian information criterion; RW(FVE90,Qb),
reweighting method with Jn and Kn determined by FVE90 and with p selected by CVQ=b ; RW(FVE95,Qb),
RW(aic,Qb) and RW(bic,Qb) have similar definitions; MA(FVE90±a,Qb), model averaging method with
Jn and Kn determined by FVE90, with d = a, and with weights selected by CVQ=b ; MA(FVE95±a,Qb),
MA(aic±a,Qb) and MA(bic±a,Qb) have similar definitions; FVE90(X ), number of components for Xi (s)
determined by FVE90; FVE95(X ), aic(X ), bic(X ), FVE90(Y ), FVE95(Y ), aic(Y ) and bic(Y ) have similar
definitions.
5·3. Simulation results
We compare the performance of various settings of J and K in terms of the predicted Y0,i ,
using discrete approximations for the root mean integrated squared prediction errors defined
by (18).
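On a common, equally spaced grid, the discrete approximation of (18) is immediate: the $|T|^{-1}$ factor and the integral together become an average over grid points. A sketch:

```python
import numpy as np

def rmipe(Y_true, Y_pred):
    """Discrete approximation of (18): Y_true and Y_pred are (n0, n_grid)
    arrays of test curves on a common equally spaced t-grid, so
    |T|^{-1} * integral of the squared error becomes the mean over grid
    points, averaged over the n0 test curves."""
    return float(np.sqrt(np.mean((Y_true - Y_pred) ** 2)))
```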
In each design, we present the results with Q = 1, 2, 4 for the Q-fold crossvalidation method
to obtain the optimal weights. The results with Q = 8 perform similarly to those with Q = 4
in our numerical study and are omitted. The prediction performance is compared on the basis of
rmipe(w) under various settings of J and K with d = 1, 2, 4 coupled with a selection method
6. Discussion
To select the model weights using cvQ (w) in (9), we need to determine Q. In our Monte Carlo
study, we compare Q = 2 and Q = 4 for n = 300. The performance of Q = 8 is similar to that of
Q = 4 in our simulation study. Although it is possible to set an additional criterion to choose the
optimal value of Q, this may increase the computational burden. Our numerical results indicate
that a small value of Q is usually sufficient to choose the model weights for moderate sample
sizes.
We consider functional linear regression in which both the covariate and the response are
random functions, where the covariate and the response functions can be measured with error
and can be observed densely or sparsely, regularly or irregularly. The proposed approach can also
be used when either the response or the predictor variable contains a random function. It helps
to reduce prediction errors and the possibility of producing poor results using a single model
selection criterion.
One aspect omitted here is the variability of the resulting averaged prediction. Research on the variability of optimal model averaging methods has been missing from the literature, owing to the difficulties caused by the random weights and model misspecification. A good starting point would be linear models, and much effort is needed to push this research forward.
Acknowledgement
This work was supported by the National Natural Science Foundation of China, the U.S. National Science Foundation, the U.S. National Institute of Neurological Disorders and Stroke,
and the Academia Sinica and the Ministry of Science and Technology of Taiwan. The authors
wish to thank the editor, associate editor, and referees for helpful remarks.
Supplementary material
Supplementary material available at Biometrika online includes the detailed proofs of (11),
Theorems 1 and 2, and Corollary 1 as well as additional simulation results.