
Biometrika (2018), 105, 4, pp. 945–962. doi: 10.1093/biomet/asy041
Printed in Great Britain. Advance Access publication 26 September 2018.

Functional prediction through averaging estimated functional linear regression models

By XINYU ZHANG



Academy of Mathematics and Systems Science, Chinese Academy of Sciences,
52 Sanlihe Road, Beijing 100190, China
xinyu@amss.ac.cn

JENG-MIN CHIOU
Institute of Statistical Science, Academia Sinica, 128 Academia Road, Nankang District,
Taipei 11529, Taiwan
jmchiou@stat.sinica.edu.tw

AND YANYUAN MA
Department of Statistics, Pennsylvania State University, University Park,
Pennsylvania 16802, U.S.A.
yzm63@psu.edu

Summary
Prediction is often the primary goal of data analysis. In this work, we propose a novel model
averaging approach to the prediction of a functional response variable. We develop a crossval-
idation model averaging estimator based on functional linear regression models in which the
response and the covariate are both treated as random functions. We show that the weights cho-
sen by the method are asymptotically optimal in the sense that the squared error loss of the
predicted function is as small as that of the infeasible best possible averaged function. When the
true regression relationship belongs to the set of candidate functional linear regression models,
the averaged estimator converges to the true model and can estimate the regression parameter
functions at the same rate as under the true model. Monte Carlo studies and a data example
indicate that in most cases the approach performs better than model selection.

Some key words: Asymptotic optimality; Crossvalidation; Functional data; Model averaging; Weighting.

1. Introduction
Functional data analysis is an active research area that has experienced rapid growth in recent
decades. Comprehensive works on this topic include monographs (Ramsay & Silverman, 2005;
Ferraty & Vieu, 2006; Horváth & Kokoszka, 2012; Zhang, 2013; Hsing & Eubank, 2015) and
review articles (Müller, 2005; Cuevas, 2014; Wang et al., 2016).
Functional regression analysis relates the response to the covariates when either or both con-
tain random functions. Among various types of functional regression models, those with scalar
responses and functional covariates are the most intensively studied; see, for example, Cardot
et al. (2003), James (2002), Müller & Stadtmüller (2005), and Ferraty & Vieu (2009). Ramsay
& Dalzell (1991) introduced functional regression models in which the response variable is a


function and the covariates can be vectors or functions. Faraway (1997) and Chiou et al. (2003)
developed functional response models with scalar covariates. Yao et al. (2005b) and Ferraty
et al. (2012) investigated functional linear regression models in which both the responses and the
covariates were functionals, with various extensions such as functional additive models (Müller
& Yao, 2008) and functional mixture prediction models (Chiou, 2012), among others. Although
the practice of regression analysis often concerns the best-fitting relations between the covariates
and response, its ultimate purpose is often prediction.
Prediction is challenging, especially when the target is functional. The accuracy of prediction often hinges on the suitability of the regression model. Model selection and model averaging can
be used to deal with model uncertainty and improve prediction accuracy. Model selection aims
to find the best model among a set of candidate models. There are a variety of model selection
criteria, including the classical Akaike information criterion (Akaike, 1970) and the Bayesian
information criterion (Schwarz, 1978). Model averaging combines a set of candidate models by
taking a weighted average of them to guard against the risk of selecting a single inferior model. In
general, model selection and model averaging are competing approaches, and neither performs
universally best. Intuitively, because any model selected by the model selection procedure can be
viewed as a particular averaged model result, model averaging is more flexible and hence serves
as a more cautious alternative.
Model averaging has been studied intensively (Hoeting et al., 1999; Claeskens & Hjort,
2008). Hansen & Racine (2012) proposed a jackknife model averaging estimator that selects
the weights by minimizing a crossvalidation criterion. The idea was previously investigated by
Wolpert (1992) and Breiman (1996). Hansen & Racine (2012) showed that the jackknife model
averaging estimator has an advantage over its competitors in that it produces small asymptotic
expected squared errors within a large class of linear estimators. Their finite-sample results also
suggest that the estimator is preferable to several other model selection and averaging criteria,
especially when the errors are heteroscedastic. Zhang et al. (2013) considered a nondiagonal
error covariance structure and lagged dependent variables. Cheng & Hansen (2015) extended the
jackknife model averaging method to leave-h-out crossvalidation criteria for forecasting in com-
bination with factor-augmented regression, and Gao et al. (2016) extended it to leave-subject-out
crossvalidation under a longitudinal data setting.
In this paper, we investigate functional linear regression analysis using model averaging,
when both the response and the covariates are functions. Because functional data are intrinsically
infinite-dimensional, a procedure to reduce the dimensionality to a finite number is required, and
the total numbers of components retained for the response and covariates will affect prediction
performance. Model selection methods provide one means of selecting the number of components,
although they may not yield the best results in terms of prediction error, and there is a risk
of selecting an inferior model. We propose model averaging, selecting the model weights that
minimize the expected crossvalidation prediction errors. We show that in large samples the
proposed method minimizes the squared error loss of the predicted function, and consistently
estimates the model parameters. The model averaging procedure for a functional linear regression
model is easy to implement. Our Monte Carlo study indicates that the proposed approach typically
performs better than other model selection methods.

2. Functional linear regression model


2·1. Models and Karhunen–Loève expansions
Denote independently and identically distributed pairs of random functions by {Xi (s), Yi (t)},
for s ∈ S ≡ [S1 , S2 ] and t ∈ T ≡ [T1 , T2 ] (i = 1, . . . , n). Here, Xi (s) in L2 (S ) is the covariate
function and Yi (t) in L2 (T ) is the corresponding response function, where L2 (·) represents the
Hilbert space of square-integrable random functions. We further model the observed data, which
may contain additional measurement errors, as
$$U_{il} \equiv U_i(s_{il}) = X_i(s_{il}) + \epsilon_{il}, \quad s_{il} \in \mathcal{S}, \qquad\qquad V_{ij} \equiv V_i(t_{ij}) = Y_i(t_{ij}) + \varepsilon_{ij}, \quad t_{ij} \in \mathcal{T},$$

where the errors are assumed to be independent and identically distributed with $E(\epsilon_{il}) = 0$, $E(\varepsilon_{ij}) = 0$, $E(\epsilon_{il}^2) = \sigma_X^2$ and $E(\varepsilon_{ij}^2) = \sigma_Y^2$, and $\{X_i(\cdot), Y_i(\cdot)\}$ is independent of $\epsilon_{il}$ and $\varepsilon_{ij}$
(i = 1, . . . , n). We assume that the random functions Xi (s) and Yi (t) have unknown smooth
mean functions E {Xi (s)} = μX (s) and E {Yi (t)} = μY (t) and unknown covariance func-
tions $\mathrm{cov}\{X_i(s_1), X_i(s_2)\} = C_X(s_1, s_2)$ and $\mathrm{cov}\{Y_i(t_1), Y_i(t_2)\} = C_Y(t_1, t_2)$. We assume that orthogonal expansions exist such that $C_X(s_1, s_2) = \sum_{j=1}^{\infty} \lambda_{X,j}\, \varphi_{X,j}(s_1)\varphi_{X,j}(s_2)$ and $C_Y(t_1, t_2) = \sum_{k=1}^{\infty} \lambda_{Y,k}\, \varphi_{Y,k}(t_1)\varphi_{Y,k}(t_2)$, where $\{\lambda_{X,j}, \varphi_{X,j}(\cdot)\}$ and $\{\lambda_{Y,k}, \varphi_{Y,k}(\cdot)\}$ are eigenvalue–eigenfunction pairs of the covariance operators $C_X(\cdot,\cdot)$ and $C_Y(\cdot,\cdot)$, with the eigenvalues $\{\lambda_{X,j}\}$ and $\{\lambda_{Y,k}\}$ in nonincreasing order. The orthonormal eigenfunctions $\{\varphi_{X,j}\}$ and $\{\varphi_{Y,k}\}$ satisfy $\int_{\mathcal{S}} \varphi_{X,j}(s)\varphi_{X,l}(s)\,ds = \delta_{jl}$ and $\int_{\mathcal{T}} \varphi_{Y,k}(t)\varphi_{Y,l}(t)\,dt = \delta_{kl}$, where $\delta_{jl}$ and $\delta_{kl}$ are Kronecker deltas. Note that $\{\varphi_{X,j}(s)\}_{j=1}^{\infty}$ and $\{\varphi_{Y,k}(t)\}_{k=1}^{\infty}$ form orthonormal bases of $L^2(\mathcal{S})$ and $L^2(\mathcal{T})$, respectively.
We write the Karhunen–Loève expansions of Xi (s) and Yi (t) as
$$X_i(s) = \mu_X(s) + \sum_{j=1}^{\infty} \xi_{X,i,j}\,\varphi_{X,j}(s), \qquad Y_i(t) = \mu_Y(t) + \sum_{k=1}^{\infty} \xi_{Y,i,k}\,\varphi_{Y,k}(t),$$

where $\{\xi_{X,i,j}\}_{j=1}^{\infty}$ and $\{\xi_{Y,i,k}\}_{k=1}^{\infty}$ are sets of random coefficients such that $\xi_{X,i,j} = \int_{\mathcal{S}} \{X_i(s) - \mu_X(s)\}\varphi_{X,j}(s)\,ds$ has mean zero and variance $\lambda_{X,j}$, and $\xi_{Y,i,k} = \int_{\mathcal{T}} \{Y_i(t) - \mu_Y(t)\}\varphi_{Y,k}(t)\,dt$ has mean zero and variance $\lambda_{Y,k}$. These stochastic expansions are useful for deriving the functional
linear regression model.
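
To make the use of these expansions concrete, the following Python sketch (not part of the original paper) illustrates one way to approximate the mean function, the eigenvalue–eigenfunction pairs and the Karhunen–Loève scores by discretizing the covariance operator, assuming densely observed curves on a common grid; for sparsely or irregularly observed curves with measurement error, conditional-expectation estimators of the scores (Yao et al., 2005a) would be used instead.

import numpy as np

def kl_decomposition(curves, grid, n_components):
    # curves: (n, m) array of curves observed on the common grid (m points).
    # Returns the estimated mean, eigenvalues, eigenfunctions and scores.
    mu = curves.mean(axis=0)                       # estimate of the mean function
    centred = curves - mu
    cov = centred.T @ centred / curves.shape[0]    # empirical covariance surface
    w = np.gradient(grid)                          # approximate quadrature weights
    sw = np.sqrt(w)
    sym = cov * sw[None, :] * sw[:, None]          # discretized covariance operator, symmetrized
    vals, vecs = np.linalg.eigh(sym)
    order = np.argsort(vals)[::-1][:n_components]
    eigvals = vals[order]
    eigfuns = vecs[:, order] / sw[:, None]         # rescaled so that int phi_j(s)^2 ds is about 1
    scores = centred @ (eigfuns * w[:, None])      # xi_{i,j} approximates int {X_i(s) - mu(s)} phi_j(s) ds
    return mu, eigvals, eigfuns, scores

The same routine applied to the response curves would yield the response scores used below.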

2·2. Functional linear regression model


A functional linear regression model with Xi centring on its mean function μX implies that

$$E\{Y_i(t) \mid X_i\} = \mu_Y(t) + \int_{\mathcal{S}} \beta(s, t)\,\{X_i(s) - \mu_X(s)\}\,ds, \qquad (1)$$

where Xi represents the entire curve Xi (s) on S and β(s, t) is the coefficient function evaluated
at (s, t). Using the expansion of Yi (t), we have
$$E\{Y_i(t) \mid X_i\} = E\Big\{\mu_Y(t) + \sum_{k=1}^{\infty} \xi_{Y,i,k}\,\varphi_{Y,k}(t) \,\Big|\, X_i\Big\} = \mu_Y(t) + \sum_{k=1}^{\infty} E(\xi_{Y,i,k} \mid X_i)\,\varphi_{Y,k}(t). \qquad (2)$$

Following He et al. (2000), we can write


$$\beta(s, t) = \sum_{k=1}^{\infty} \sum_{l=1}^{\infty} \beta_{k,l}\,\varphi_{X,l}(s)\,\varphi_{Y,k}(t),$$

where $\beta_{k,l} = E(\xi_{X,i,l}\,\xi_{Y,i,k}) / E(\xi_{X,i,l}^2)$; hence

$$\int_{\mathcal{S}} \beta(s, t)\,\{X_i(s) - \mu_X(s)\}\,ds = \sum_{k=1}^{\infty} \sum_{l=1}^{\infty} \beta_{k,l}\,\varphi_{Y,k}(t) \sum_{j=1}^{\infty} \xi_{X,i,j} \int_{\mathcal{S}} \varphi_{X,l}(s)\,\varphi_{X,j}(s)\,ds. \qquad (3)$$
Inserting (2) and (3) into (1), we obtain
$$\sum_{k=1}^{\infty} E(\xi_{Y,i,k} \mid X_i)\,\varphi_{Y,k}(t) = \sum_{k=1}^{\infty}\sum_{l=1}^{\infty} \beta_{k,l}\,\varphi_{Y,k}(t) \sum_{j=1}^{\infty} \xi_{X,i,j} \int_{\mathcal{S}} \varphi_{X,l}(s)\,\varphi_{X,j}(s)\,ds = \sum_{k=1}^{\infty}\sum_{j=1}^{\infty} \beta_{k,j}\,\varphi_{Y,k}(t)\,\xi_{X,i,j},$$

or equivalently
$$E\big\{\xi_{Y,i,k} \mid (\xi_{X,i,j})_{j=1}^{\infty}\big\} = \sum_{j=1}^{\infty} \beta_{k,j}\,\xi_{X,i,j} \qquad (4)$$

for all $k$. At each $k$, (4) is a linear regression model without an intercept but with an infinite number of covariates, i.e., $Y = \sum_{j=1}^{\infty} \beta_j X_j + e$ in the familiar notation. At the $i$th observation, $\xi_{Y,i,k}$ is the response variable, $\{\xi_{X,i,j}\}_{j=1}^{\infty}$ are the covariates, and $\{\beta_{k,j}\}_{j=1}^{\infty}$ are the regression parameters.
Let $\{(\hat\xi_{X,i,j})_{j=1}^{\infty}, \hat\xi_{Y,i,k}\}$, $\hat\varphi_{X,j}(s)$ and $\hat\varphi_{Y,k}(t)$ denote the estimates of $\{(\xi_{X,i,j})_{j=1}^{\infty}, \xi_{Y,i,k}\}$, $\varphi_{X,j}(s)$ and $\varphi_{Y,k}(t)$ based on the observed $U_i(s_{il})$ and $V_i(t_{ij})$. Solving the linear regression model (4) using $\{(\hat\xi_{X,i,j})_{j=1}^{\infty}, \hat\xi_{Y,i,k}\}$ and the estimates $\hat\varphi_{Y,k}(t)$, we obtain the estimates $\hat\beta_{k,j}$ of $\beta_{k,j}$, and let

$$\hat\beta(s, t) = \sum_{k=1}^{\infty}\sum_{j=1}^{\infty} \hat\beta_{k,j}\,\hat\varphi_{X,j}(s)\,\hat\varphi_{Y,k}(t).$$
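
As an illustration (not from the paper), the closed-form expression $\beta_{k,l} = E(\xi_{X,i,l}\,\xi_{Y,i,k})/E(\xi_{X,i,l}^2)$ suggests the following Python sketch for estimating the coefficients from estimated scores; equivalently, each row of (4) could be fitted by least squares.

import numpy as np

def estimate_beta(scores_x, scores_y):
    # scores_x: (n, J) estimated covariate scores; scores_y: (n, K) estimated response scores.
    # Entry (k, j) of the result is beta_hat_{k,j} = mean(xi_{Y,k} xi_{X,j}) / mean(xi_{X,j}^2).
    num = scores_y.T @ scores_x / scores_x.shape[0]
    denom = np.mean(scores_x ** 2, axis=0)
    return num / denom[None, :]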

Assume we also have a new functional covariate X0 (s), but the corresponding response function
Y0 (t) is unknown and must be predicted, assuming that {X0 (s), Y0 (t)} has the same distribution
as {Xi (s), Yi (t)} (i = 1, . . . , n). Like Xi (s), X0 (s) is observed with measurement error, i.e., only
the U0 (s0l ) are observed. The relation in (1) leads to

$$E\{Y_0(t) \mid X_0\} = \mu_Y(t) + \int_{\mathcal{S}} \beta(s, t)\,\{X_0(s) - \mu_X(s)\}\,ds = \mu_Y(t) + \int_{\mathcal{S}} \Big\{\sum_{k=1}^{\infty}\sum_{j=1}^{\infty} \beta_{k,j}\,\varphi_{X,j}(s)\,\varphi_{Y,k}(t)\Big\}\Big\{\sum_{j=1}^{\infty} \xi_{X,0,j}\,\varphi_{X,j}(s)\Big\}\,ds = \mu_Y(t) + \sum_{k=1}^{\infty}\sum_{j=1}^{\infty} \beta_{k,j}\,\xi_{X,0,j}\,\varphi_{Y,k}(t).$$

Using this and the estimates $\hat\beta_{k,j}$, $\hat\xi_{X,0,j}$ and $\hat\varphi_{Y,k}(t)$, we aim to predict $Y_0(t)$ by considering

$$\hat Y_0(t) = \hat\mu_Y(t) + \int_{\mathcal{S}} \hat\beta(s, t)\,\big\{\hat X_0(s) - \hat\mu_X(s)\big\}\,ds = \hat\mu_Y(t) + \sum_{k=1}^{\infty}\sum_{j=1}^{\infty} \hat\beta_{k,j}\,\hat\xi_{X,0,j}\,\hat\varphi_{Y,k}(t),$$

where $\hat\xi_{X,0,j}$ is estimated using the $U_0(s_{0l})$ and $\hat X_0(s) = \sum_{j=1}^{\infty} \hat\xi_{X,0,j}\,\hat\varphi_{X,j}(s) + \hat\mu_X(s)$.
We omit the detailed calculations, which are well documented in the functional data analysis
literature. Specifically, see Yao et al. (2005a,b) for the estimation procedure for the mean func-
tions μX (·) and μY (·), the covariance functions CX (·, ·) and CY (·, ·) with their corresponding
eigenvalues and eigenfunctions {λX, j , ϕX, j (s)} and {λY,k , ϕY,k (t)}, and the random coefficients
ξX,i, j and ξY,i, j based on the observations Ui (sil ) and Vi (tij ).

2·3. Dimension-reduced functional linear regression model


Functional data are intrinsically infinite-dimensional, but in practice a functional linear model
must be truncated to finite dimensions. It is common practice to keep only the first Jn < ∞ bases
in {ϕX, j (·)} and Kn < ∞ bases in {ϕY,k (·)}, and to work with the truncated model
$$E\big\{\xi_{Y,i,k} \mid (\xi_{X,i,j})_{j=1}^{J_n}\big\} = \sum_{j=1}^{J_n} \beta_{k,j}\,\xi_{X,i,j} \qquad (k = 1, \ldots, K_n), \qquad (5)$$

where Kn and Jn can increase to infinity as n → ∞. The linear regression model (5) is then solved
to obtain β̂k, j for k = 1, . . . , Kn and j = 1, . . . , Jn , and conceptually form
$$\hat\beta_{J_n,K_n}(s, t) = \sum_{k=1}^{K_n}\sum_{j=1}^{J_n} \hat\beta_{k,j}\,\hat\varphi_{X,j}(s)\,\hat\varphi_{Y,k}(t).$$

We still obtain μ̂X (s) and μ̂Y (t) using the smoothed means of the Ui (sil )s and Vi (tij )s, as
in § 2·2.
Thus, inserting the results into (1), we can predict $Y_0(t)$ for $t \in \mathcal{T}$ via

$$\hat Y_{0,J_n,K_n}(t) = \hat\mu_Y(t) + \int_{\mathcal{S}} \hat\beta_{J_n,K_n}(s, t)\,\big\{\hat X_{0,K_n}(s) - \hat\mu_X(s)\big\}\,ds = \hat\mu_Y(t) + \sum_{k=1}^{K_n}\sum_{j=1}^{J_n} \hat\beta_{k,j}\,\hat\xi_{X,0,j}\,\hat\varphi_{Y,k}(t), \qquad (6)$$

where $\hat X_{0,K_n}(s) = \sum_{j=1}^{K_n} \hat\xi_{X,0,j}\,\hat\varphi_{X,j}(s) + \hat\mu_X(s)$. The predicted function $\hat Y_{0,J_n,K_n}(t)$ in (6) depends on the total numbers of components $J_n$ and $K_n$, so the prediction performance varies with $J_n$ and $K_n$.
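
For illustration only, a minimal Python sketch of the truncated prediction (6), assuming the truncated coefficient matrix, the covariate scores of the new curve and the response eigenfunctions on a grid of t-values are already available as numpy arrays:

def predict_truncated(beta_hat, scores_x0, eigfuns_y, mu_y):
    # beta_hat:  (K_n, J_n) estimated coefficients beta_hat_{k,j}.
    # scores_x0: (J_n,) estimated scores xi_hat_{X,0,j} of the new covariate curve.
    # eigfuns_y: (m, K_n) response eigenfunctions evaluated on a grid of t-values.
    # mu_y:      (m,) estimated response mean on the same grid.
    scores_y0 = beta_hat @ scores_x0              # predicted response scores
    return mu_y + eigfuns_y @ scores_y0           # predicted curve on the grid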

3. Model averaging for the functional linear regression model


3·1. Weighted average prediction
The choice of the best Jn and Kn in § 2·3 is usually made based on a model selection criterion
or by retaining a certain fraction of the total variance. Here, instead of selecting a single best
(Jn , Kn ) pair, we advocate averaging models corresponding to different (Jn , Kn ). The leading
Jn and Kn principal components of the covariate and response functions are retained in each
model, thus reflecting their relative importance. Using the notation of § 2, let the prediction of
Y0 (t) from a fixed (Jn , Kn ) choice be Ŷ0, Jn , Kn (t), where Jn ∈ J and Kn ∈ K for candidate
sets J and K. Typically, J = {JL , JL + 1, . . . , JU } and K = {KL , KL + 1, . . . , KU }, where
JL , KL and JU , KU are lower and upper bounds. For example, JL and KL can be set to small
integers, and JU and KU can be chosen as the maximum number of observed points on all
curves. Let wJn Kn be a weight assigned to the model with Jn eigenbases for the Ui (s) and Kn
eigenbases for the $V_i(t)$. Let $w$ be a vector formed by all such weights $w_{JK}$; for example, $w = (w_{J_L K_L}, w_{J_L (K_L+1)}, \ldots, w_{J_L K_U}, \ldots, w_{J_U K_L}, w_{J_U (K_L+1)}, \ldots, w_{J_U K_U})^{\mathrm{T}}$. Now let $\mathcal{W}$ be the set of all such weight vectors $w$, i.e., let $\mathcal{W} = \{w : \sum_{J_n \in \mathcal{J}, K_n \in \mathcal{K}} w_{J_n K_n} = 1,\ w_{J_n K_n} \geqslant 0,\ J_n \in \mathcal{J},\ K_n \in \mathcal{K}\}$.


We aim to find the values of $w_{J_n K_n}$ such that

$$\hat Y_0(t, w) = \sum_{K_n \in \mathcal{K}} \sum_{J_n \in \mathcal{J}} w_{J_n K_n}\,\hat Y_{0,J_n,K_n}(t)$$

minimizes the expected integrated squared error loss

$$\tilde r(w) \equiv E \int_{\mathcal{T}} \big\{Y_0(t) - \hat Y_0(t, w)\big\}^2\,dt. \qquad (7)$$
Let hi (t) = E{Yi (t) | Xi } for i = 0, 1, . . . , n. By exchanging the order of the expectation and
the integral, we have
$$\begin{aligned}
\tilde r(w) &= E \int_{\mathcal{T}} \big\{Y_0(t) - \hat Y_0(t, w)\big\}^2\,dt \\
&= E \int_{\mathcal{T}} \big\{Y_0^2(t) + \hat Y_0^2(t, w) - 2 Y_0(t)\hat Y_0(t, w)\big\}\,dt \\
&= E \int_{\mathcal{T}} \big[E\{Y_0^2(t) \mid X_0\} + E\{\hat Y_0^2(t, w) \mid X_0\} - 2 E\{Y_0(t)\hat Y_0(t, w) \mid X_0\}\big]\,dt \\
&= E \int_{\mathcal{T}} \big[h_0^2(t) + \hat Y_0^2(t, w) - 2 h_0(t) E\{\hat Y_0(t, w) \mid X_0\} + E\{Y_0^2(t) \mid X_0\} - h_0^2(t)\big]\,dt \\
&= E \int_{\mathcal{T}} \big\{h_0(t) - \hat Y_0(t, w)\big\}^2\,dt + E \int_{\mathcal{T}} \mathrm{var}\{Y_0(t) \mid X_0\}\,dt,
\end{aligned}$$

where the second term is unrelated to w. Thus, minimizing r̃(w) is equivalent to minimizing
$$r(w) \equiv E \int_{\mathcal{T}} \big\{h_0(t) - \hat Y_0(t, w)\big\}^2\,dt. \qquad (8)$$

In the following development, we determine the optimal weights, design a procedure to predict
Y0 (t) by approximating and minimizing r̃(w), and establish the theoretical properties of the
procedure by examining its performance in terms of minimizing r(w).

3·2. Weight choice criterion


We use Q-fold crossvalidation to choose the weights. Specifically, we divide the dataset into
Q ≥ 2 groups such that for each group we have M = n/Q observations. When Q = n, we
have M = 1, and Q-fold crossvalidation becomes leave-one-curve-out crossvalidation, which
is computationally feasible when the sample is not too large. We first estimate the mean curves
$\mu_Y(t)$ and $\mu_X(s)$ using all the data. Then, for a fixed $(J_n, K_n)$ pair, we consider the $q$th step $(q = 1, \ldots, Q)$, where we leave out the $q$th group and use all of the remaining observations to obtain the coefficient function estimate $\hat\beta^{[-q]}_{J_n,K_n}(s, t)$. Now, for $i = (q-1)M + 1, \ldots, qM$, we form the prediction of $Y_i(t)$,

$$\hat Y^{[-q]}_{i,J_n,K_n}(t) = \hat\mu_Y(t) + \int_{\mathcal{S}} \hat\beta^{[-q]}_{J_n,K_n}(s, t)\,\big\{\hat X_{i,K_n}(s) - \hat\mu_X(s)\big\}\,ds,$$

where $\hat X_{i,K_n}(s) = \sum_{j=1}^{K_n} \hat\xi_{X,i,j}\,\hat\varphi_{X,j}(s) + \hat\mu_X(s)$. The computation of $\hat Y^{[-q]}_{i,J_n,K_n}(t)$ follows similarly to that of the last expression in (6); we omit the details. Then the weighted average prediction of $Y_i(t)$ is

$$\hat Y^{[-q]}_i(t, w) = \sum_{K_n \in \mathcal{K}} \sum_{J_n \in \mathcal{J}} w_{J_n,K_n}\,\hat Y^{[-q]}_{i,J_n,K_n}(t).$$

Our Q-fold crossvalidation criterion is formulated as

$$\mathrm{cv}_Q(w) = \sum_{q=1}^{Q} \sum_{m=1}^{M} \int_{\mathcal{T}} \big\{V_{(q-1)M+m}(t) - \hat Y^{[-q]}_{(q-1)M+m}(t, w)\big\}^2\,dt. \qquad (9)$$
Averaging estimated functional linear regression models 951
The resulting weight vector is obtained as

$$\hat w = \arg\min_{w \in \mathcal{W}} \mathrm{cv}_Q(w),$$

so the proposed model average prediction is Ŷ0 (t, ŵ).


Two aspects of (9) are worth pointing out. First, a sensible criterion to minimize with respect
to w would have been
$$\widetilde{\mathrm{cv}}_Q(w) \equiv \sum_{q=1}^{Q} \sum_{m=1}^{M} \int_{\mathcal{T}} \big\{Y_{(q-1)M+m}(t) - \hat Y^{[-q]}_{(q-1)M+m}(t, w)\big\}^2\,dt \qquad (10)$$

instead of cvQ (w). Nevertheless, we propose to work with cvQ (w) because
$$\begin{aligned}
\mathrm{cv}_Q(w) &= \sum_{q=1}^{Q}\sum_{m=1}^{M} \int_{\mathcal{T}} \big\{Y_{(q-1)M+m}(t) + \varepsilon_{(q-1)M+m}(t) - \hat Y^{[-q]}_{(q-1)M+m}(t, w)\big\}^2\,dt \\
&= \sum_{q=1}^{Q}\sum_{m=1}^{M} \int_{\mathcal{T}} \big\{Y_{(q-1)M+m}(t) - \hat Y^{[-q]}_{(q-1)M+m}(t, w)\big\}^2\,dt + n(T_2 - T_1)\sigma_Y^2 \\
&= \widetilde{\mathrm{cv}}_Q(w) + n(T_2 - T_1)\sigma_Y^2 \qquad (11)
\end{aligned}$$

almost surely, where $n(T_2 - T_1)\sigma_Y^2$ is unrelated to $w$; see the Supplementary Material for the proof of (11). Therefore, minimizing $\mathrm{cv}_Q(w)$ and $\widetilde{\mathrm{cv}}_Q(w)$ are equivalent almost surely.
Second, the computation of (9) is a standard quadratic programming problem and is extremely
simple. Let $\tilde V_{q,m,J_n,K_n}(t) = V_{(q-1)M+m}(t) - \hat Y^{[-q]}_{(q-1)M+m,J_n,K_n}(t)$, define the inner product of two functions on a domain $\mathcal{T}$ by $\langle f_1, f_2 \rangle = \int_{\mathcal{T}} f_1(t) f_2(t)\,dt$ and the norm by $\|f_1\| = \{\int_{\mathcal{T}} f_1^2(t)\,dt\}^{1/2}$, and let $\tilde V^*_{J_n,K_n,J_n',K_n'} = \sum_{q=1}^{Q}\sum_{m=1}^{M} \langle \tilde V_{q,m,J_n,K_n}, \tilde V_{q,m,J_n',K_n'} \rangle$. Then

$$\begin{aligned}
\mathrm{cv}_Q(w) &= \sum_{q=1}^{Q}\sum_{m=1}^{M} \Big\| \sum_{J_n \in \mathcal{J}}\sum_{K_n \in \mathcal{K}} w_{J_n,K_n}\,\tilde V_{q,m,J_n,K_n} \Big\|^2 \\
&= \sum_{q=1}^{Q}\sum_{m=1}^{M} \Big\langle \sum_{J_n \in \mathcal{J}}\sum_{K_n \in \mathcal{K}} w_{J_n,K_n}\,\tilde V_{q,m,J_n,K_n},\; \sum_{J_n' \in \mathcal{J}}\sum_{K_n' \in \mathcal{K}} w_{J_n',K_n'}\,\tilde V_{q,m,J_n',K_n'} \Big\rangle \\
&= \sum_{J_n \in \mathcal{J}}\sum_{K_n \in \mathcal{K}}\sum_{J_n' \in \mathcal{J}}\sum_{K_n' \in \mathcal{K}} w_{J_n,K_n}\, w_{J_n',K_n'}\, \tilde V^*_{J_n,K_n,J_n',K_n'} = w^{\mathrm{T}} H w, \qquad (12)
\end{aligned}$$

where $H = (H_{r,s})$ and the $(r, s)$ element of $H$ is $H_{r,s} = \tilde V^*_{J_n,K_n,J_n',K_n'}$ with $r = (J_n - 1)|\mathcal{K}| + K_n$ and $s = (J_n' - 1)|\mathcal{K}| + K_n'$, where $J_n, J_n' = 1, \ldots, |\mathcal{J}|$ and $K_n, K_n' = 1, \ldots, |\mathcal{K}|$; here $|\mathcal{J}|$ and $|\mathcal{K}|$ stand for the cardinalities of the sets $\mathcal{J}$ and $\mathcal{K}$, respectively. It is now clear that (9) is a standard quadratic programming problem of the form

$$\text{minimize } \mathrm{cv}_Q(w) = w^{\mathrm{T}} H w \text{ over } w \in \mathbb{R}^{P} \text{ subject to } l^{\mathrm{T}} w = 1 \text{ and } w \geqslant 0,$$

where $l$ is a column vector whose entries are all 1, and $P = |\mathcal{J}||\mathcal{K}|$. This problem can be rapidly solved even when $P$ is very large.
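
To illustrate the computation, here is a Python sketch (not from the paper) that forms H from discretized residual curves and minimizes wᵀHw over the simplex with a general-purpose constrained optimizer; a dedicated quadratic programming solver would serve equally well. The array layout and quadrature weights are assumptions of the sketch.

import numpy as np
from scipy.optimize import minimize

def solve_weights(residual_curves, quad_weights):
    # residual_curves: (n, m, P) array; entry [i, :, p] holds V_i(t) - Yhat_i^{[-q]}(t)
    #                  for candidate model p, evaluated on a grid of m t-values.
    # quad_weights:    (m,) quadrature weights approximating integration over T.
    n, m, P = residual_curves.shape
    weighted = residual_curves * quad_weights[None, :, None]
    H = np.einsum('imp,imr->pr', weighted, residual_curves)   # H[p, r] as in (12)
    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)   # l'w = 1
    bounds = [(0.0, 1.0)] * P                                  # w >= 0
    w0 = np.full(P, 1.0 / P)
    res = minimize(lambda w: w @ H @ w, w0, jac=lambda w: 2.0 * H @ w,
                   method='SLSQP', bounds=bounds, constraints=cons)
    return res.x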

3·3. Asymptotic weight choice optimality


We first introduce notation and several rather mild conditions before we present an important
result on the optimality of the selected model weights.
Let $\hat\beta^{(-i)}_{J_n,K_n}(s, t)$, $\hat\mu_Y^{(-i)}(t)$, $\hat\mu_X^{(-i)}(s)$ and $\hat X^{(-i)}_{i,K_n}(s)$ be estimators of $\beta(s, t)$, $\mu_Y(t)$, $\mu_X(s)$ and $X_i(s)$ without using the $i$th observation, so the leave-one-out prediction of $Y_i(t)$ based on the $(J_n, K_n)$ model is

$$\hat Y^{(-i)}_{i,J_n,K_n}(t) = \hat\mu_Y^{(-i)}(t) + \int_{\mathcal{S}} \hat\beta^{(-i)}_{J_n,K_n}(s, t)\,\big\{\hat X^{(-i)}_{i,K_n}(s) - \hat\mu_X^{(-i)}(s)\big\}\,ds.$$

Define the average leave-one-out prediction to be $\hat Y^{(-i)}_i(t, w) = \sum_{K_n \in \mathcal{K}} \sum_{J_n \in \mathcal{J}} w_{J_n,K_n}\,\hat Y^{(-i)}_{i,J_n,K_n}(t)$, and let

$$R(w) = \frac{1}{n} \sum_{i=1}^{n} E \int_{\mathcal{T}} \big\{h_i(t) - \hat Y^{(-i)}_i(t, w)\big\}^2\,dt \qquad (13)$$

and

$$\bar R(w) = \frac{1}{n} \sum_{i=1}^{n} E \int_{\mathcal{T}} \big\{h_i(t) - \hat Y_i(t, w)\big\}^2\,dt. \qquad (14)$$

Below we study the theoretical properties of the prediction risk r(w) in (8), which is associated
with the leave-one-out crossvalidation prediction risk R(w) in (13) defined above.

Condition 1. There exist functions $\mu_Y^*(t)$, $\mu_X^*(s)$, $\beta^*_{J_n,K_n}(s, t)$, $X^*_{i,K_n}(s)$ and a sequence $c_n \to 0$ such that as $n \to \infty$, $\hat\mu_Y(t) - \mu_Y^*(t) = O_p(c_n)$, $\hat\mu_X(s) - \mu_X^*(s) = O_p(c_n)$, $\hat\beta_{J_n,K_n}(s, t) - \beta^*_{J_n,K_n}(s, t) = O_p(c_n)$ and $\hat X_{i,K_n}(s) - X^*_{i,K_n}(s) = O_p(c_n)$ hold uniformly for $s \in \mathcal{S}$, $t \in \mathcal{T}$, $J_n \in \mathcal{J}$, $K_n \in \mathcal{K}$ and $i \in \{0, 1, \ldots, n\}$.

Let $Y^*_{i,J_n,K_n}(t) = \mu_Y^*(t) + \int_{\mathcal{S}} \beta^*_{J_n,K_n}(s, t)\,\{X^*_{i,K_n}(s) - \mu_X^*(s)\}\,ds$,

$$Y_i^*(t, w) = \sum_{K_n \in \mathcal{K}} \sum_{J_n \in \mathcal{J}} w_{J_n,K_n}\,Y^*_{i,J_n,K_n}(t), \qquad R^*(w) = \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{T}} \big\{h_i(t) - Y_i^*(t, w)\big\}^2\,dt, \qquad (15)$$

and $\eta_n = n \inf_{w \in \mathcal{W}} R^*(w)$.

Condition 2. As $n \to \infty$, $h_i(t) = O_p(1)$, $Y^*_{i,J_n,K_n}(t) = O_p(1)$ and $Y_i(t) = O_p(1)$ hold uniformly for $i \in \{0, 1, \ldots, n\}$, $t \in \mathcal{T}$, $J_n \in \mathcal{J}$ and $K_n \in \mathcal{K}$.

Condition 3. As $n \to \infty$, $\eta_n^{-1} n^{1/2} |\mathcal{J}|^2 |\mathcal{K}|^2 = o_p(1)$ and $\eta_n^{-1} n c_n = o_p(1)$.

Condition 4. As $n \to \infty$, $\hat Y_{(q-1)M+m}(t, w) - \hat Y^{[-q]}_{(q-1)M+m}(t, w) = o_p(n^{-1}\eta_n)$ and $\hat Y^{[-q]}_{(q-1)M+m}(t, w) = O_p(1)$ hold uniformly for $w \in \mathcal{W}$, $t \in \mathcal{T}$, $q \in \{1, \ldots, Q\}$ and $m \in \{1, \ldots, M\}$.

Condition 5. The prediction risk $r(w)$ and the crossvalidation risk $R(w)$ satisfy $\sup_{w \in \mathcal{W}} |r(w)/R(w) - 1| \to 0$ as $n \to \infty$.

Condition 1 requires the convergence of the average estimators μ̂Y (t), μ̂X (s) and the estimator
β̂Jn ,Kn (s, t) under each model and describes their convergence rates. They need not have the same
convergence rate. For example, when $\hat\mu_Y(t) - \mu_Y^*(t) = O_p(n^{-a_1})$, $\hat\mu_X(s) - \mu_X^*(s) = O_p(n^{-a_2})$, $\hat\beta_{J_n,K_n}(s, t) - \beta^*_{J_n,K_n}(s, t) = O_p(n^{-a_3})$ and $\hat X_{i,K_n}(s) - X^*_{i,K_n}(s) = O_p(n^{-a_4})$, $c_n$ can be $n^{-\min(a_1, a_2, a_3, a_4)}$.
This very mild condition is satisfied by all the valid functional data estimators that we know;
for example, Yao et al. (2005b) provide the rates of convergence and the conditions for these
rates in their (41), (A1), (A2) and the assumption in Lemma A.1. Since these conditions are
lengthy, we omit them. Condition 2 excludes some pathological cases in which the limiting
value explodes. Condition 3 contains more essential requirements in situations in which our
asymptotic results apply. The first part requires that $\eta_n$ grow at a rate no slower than $n^{1/2}$. This
implies that all candidate models are misspecified. This condition is similar to condition (21) of
Zhang et al. (2013) and condition (7) of Ando & Li (2014). The second part of Condition 3 is
similar to the first part, and the two parts are identical when cn = n−1/2 , which is achievable
when the complete curves Ui (s) and Vi (t) are observed. When these curves are observed only
at discrete points, cn is much larger than n−1/2 and the second requirement in Condition 3
dominates the first requirement. Regardless of which situation we are dealing with, the key
is the order of ηn , which is determined by the degree of misspecification and is typically of
order n in the worst case; hence, Condition 3 is easily satisfied. Condition 4 requires that the
difference between $\hat Y_{(q-1)M+m}(t, w)$ and the leave-$M$-out prediction $\hat Y^{[-q]}_{(q-1)M+m}(t, w)$ decrease with sufficient speed. From (A29) and its proof in the Supplementary Material, we know that Condition 4 is implied by simpler conditions, i.e., Condition 8 and $M\eta_n^{-1} = o(1)$. Lastly, we discuss Condition 5. Let $\hat Y^{(-i)}_{0,J_n,K_n}(t) = \hat\mu_Y^{(-i)}(t) + \int_{\mathcal{S}} \hat\beta^{(-i)}_{J_n,K_n}(s, t)\,\{\hat X^{(-i)}_{0,K_n}(s) - \hat\mu_X^{(-i)}(s)\}\,ds$ and $\hat Y^{(-i)}_0(t, w) = \sum_{K_n \in \mathcal{K}} \sum_{J_n \in \mathcal{J}} w_{J_n,K_n}\,\hat Y^{(-i)}_{0,J_n,K_n}(t)$. As the observations $\{X_i(s), Y_i(t)\}$ $(i = 0, \ldots, n)$
Ŷ0 (t, w) = Kn ∈K Jn ∈J wJn , Kn Ŷ0, Jn , Kn (t). As the observations {Xi (s), Yi (t)} (i = 0, . . . , n)
are independently and identically distributed,
 2  2
1 n (−i) (−1)
R(w) = E hi (t) − Ŷi (t, w) dt = E h0 (t) − Ŷ0 (t, w) dt
n i=1 T T
 2  2
(−1)
=E h0 (t) − Ŷ0 (t, w) dt + E Ŷ0 (t, w) − Ŷ0 (t, w) dt
T T

(−1)
+ 2E h0 (t) − Ŷ0 (t, w) Ŷ0 (t, w) − Ŷ0 (t, w) dt
T
 2
(−1)
= r(w) + E Ŷ0 (t, w) − Ŷ0 (t, w) dt
T

(−1)
+ 2E h0 (t) − Ŷ0 (t, w) Ŷ0 (t, w) − Ŷ0 (t, w) dt.
T

Here, $\hat Y^{(-1)}_0(t, w)$ and $\hat Y_0(t, w)$ should be very close because, although they use the observations $\{U_0(s_{0l}); U_2(s_{2l}), V_2(t_{2j}), \ldots, U_n(s_{nl}), V_n(t_{nj})\}$ and $\{U_0(s_{0l}); U_1(s_{1l}), V_1(t_{1j}), \ldots, U_n(s_{nl}), V_n(t_{nj})\}$, respectively, they are otherwise constructed in an identical manner. Thus, Condition 5 is also reasonable.
We are now ready to describe our theoretical findings.

Theorem 1. Under Conditions 1–5,

$$r(\hat w)\big/\inf_{w \in \mathcal{W}} r(w) \to 1 \qquad (16)$$

in probability as n → ∞.
Theorem 1 shows that the prescribed model averaging procedure is asymptotically optimal in
the sense that its squared error loss is asymptotically identical to that of the infeasible best possible
model averaging estimator. The proof of Theorem 1 is in the Supplementary Material, with all
other proofs. Condition 3 used in Theorem 1 requires that all candidate models be misspecified.
This excludes the lucky situation in which one of these candidate models happens to be the true
model. Indeed, if the candidate model set contains a true model, then the more suitable practice
would be to perform model selection instead of model averaging. Of course, in practice, it is not
known whether the candidate model set happens to contain the true model, and without this prior
knowledge, the probability that the true model is contained in the candidate model set is zero.
Thus, model averaging is more prudent than model selection. In the next subsection, we show
that even in the unlikely event that the true model is included in the candidate set, our procedure
achieves the same convergence rate as regression estimation under the true model.

3·4. Estimation consistency


Define the model average estimator of β(s, t) as
 
$$\hat\beta_w(s, t) = \sum_{K_n \in \mathcal{K}} \sum_{J_n \in \mathcal{J}} w_{J_n,K_n}\,\hat\beta_{J_n,K_n}(s, t).$$

Let (J ∗ , K ∗ ) correspond to a true model; this may not be the only true model. For example, any
model containing model (J ∗ , K ∗ ) is also true. Assume that uniformly for s ∈ S and t ∈ T ,

$$\hat\beta_{J^*,K^*}(s, t) - \beta(s, t) = O_p(d_n), \qquad d_n \to 0. \qquad (17)$$

This simply states that the estimator of the regression coefficient function βJ ∗ ,K ∗ (s, t) is consistent,
which is true for all valid estimators in the literature. Depending on the situation, the convergence
rate dn can vary. In the ideal case in which the complete curves Ui (s) and Vi (t) are observed,
dn = n−1/2 for fixed J ∗ and K ∗ . In more common cases in which the curves are observed at
some discrete points, because smoothing methods are typically implemented first to estimate the
curves μX (s) and μY (t) and to estimate their covariance processes CX (s, s′) and CY (t, t′), the
convergence rate of dn is slower than n−1/2 . For increasing J ∗ and K ∗ , dn depends on J ∗ and K ∗ ,
the covariance processes, the eigenvalue gaps and decay rates, and the bandwidths and has very
complex form (Yao et al., 2005b). We further impose the following conditions.

Condition 6. As $n \to \infty$, there exists a sequence $g_n \to 0$ such that $\hat Y_{(q-1)M+m}(t, w) - \hat Y^{[-q]}_{(q-1)M+m}(t, w) = O_p(g_n)$ holds uniformly for $w \in \mathcal{W}$, $t \in \mathcal{T}$, $q \in \{1, \ldots, Q\}$ and $m \in \{1, \ldots, M\}$.

Condition 7. There exists $\delta(t) \geqslant c > 0$ such that uniformly for $w \in \mathcal{W}$ and $t \in \mathcal{T}$ and for almost all $i \in \{1, \ldots, n\}$,

$$\Big[\int_{\mathcal{S}} \big\{\hat\beta_w(s, t) - \beta(s, t)\big\}\big\{\hat X_i(s) - \hat\mu_X(s)\big\}\,ds\Big]^2 \Big/ \int_{\mathcal{S}} \big\{\hat\beta_w(s, t) - \beta(s, t)\big\}^2\,ds \;\geqslant\; \delta(t),$$

where $\hat X_i(s)$ is defined similarly to $\hat X_0(s)$.

Condition 6 resembles and often strengthens Condition 4. It requires the difference between the
regular prediction $\hat Y_{(q-1)M+m}(t, w)$ and the leave-$M$-out prediction $\hat Y^{[-q]}_{(q-1)M+m}(t, w)$ to decrease
as the sample size increases. As mentioned, we show that Condition 6 holds quite naturally and
give an expression for gn in Corollary 1. Condition 7 seems complex; essentially it states that
most X̂i (s)s do not degenerate, in the sense that their inner products with β̂w (s, t) − β(s, t) do not
approach zero. We now describe the performance of the weighted estimation results when the
true model is among the candidate models.

Theorem 2. When the true model $(J^*, K^*)$ is one of the candidate models, under Conditions 1, 2, 6 and 7, the weighted estimator satisfies $\int_{\mathcal{T}} \int_{\mathcal{S}} \{\hat\beta_{\hat w}(s, t) - \beta(s, t)\}^2\,ds\,dt = O_p(d_n^2 + n^{-1} + g_n)$, where $g_n$ is defined in Condition 6.

Theorem 2 shows that when the true model (J ∗ , K ∗ ) is one of the candidate models, the
prescribed model averaging procedure can achieve the convergence rate of β̂ŵ (s, t). The rate is
determined by the convergence rate of the β(s, t) estimation under the true model dn , as well
as $g_n$, which is the difference between the regular prediction $\hat Y_{(q-1)M+m}(t, w)$ and the leave-$M$-out prediction $\hat Y^{[-q]}_{(q-1)M+m}(t, w)$. Although $d_n$ and $n^{-1}$ are determined by the performance of
the functional linear regression and the sample size, and hence are not within our control in
designing the model averaging procedure, we can adjust gn . Next we analyse gn and provide
more explicit results under an additional condition.

Condition 8. As n → ∞, X̂i (s) and μ̂X (s) are Op (1) uniformly for i ∈ {1, . . . , n} and s ∈ S .

Corollary 1. If the conditions in Theorem 2 and Condition 8 are satisfied, then $g_n = |\mathcal{J}||\mathcal{K}| M n^{-1}$ and $\int_{\mathcal{T}} \int_{\mathcal{S}} \{\hat\beta_{\hat w}(s, t) - \beta(s, t)\}^2\,ds\,dt = O_p(d_n^2 + |\mathcal{J}||\mathcal{K}| M n^{-1})$.

As mentioned, dn = n−1/2 or slower. However, if M , |J | and |K| do not increase


with $n$, the second term in Corollary 1 is of order $n^{-1}$, so the first term dominates and $\int_{\mathcal{T}} \int_{\mathcal{S}} \{\hat\beta_{\hat w}(s, t) - \beta(s, t)\}^2\,ds\,dt = O_p(d_n^2)$. This is the convergence rate of the regression estima-
tor under the true model, without the need to establish that the weight assigned to it approaches 1.
In other words, there is no price to pay for model averaging and the lack of knowledge regarding
whether the candidate model set contains the true model. We call this the oracle property of model
averaging.

4. Data example
Traffic flow prediction is important in transportation studies because traffic flow is a crucial
macroscopic traffic characteristic. Here, we aim to predict a future or unobserved short-term
traffic flow trajectory using the current or partially observed traffic flow trajectory (Chiou, 2012).
We use historical records of daily traffic flow rates, i.e., vehicle counts per five-minute interval,
collected in 2014 by a dual-loop vehicle detector located southbound near Shea-Shan Tunnel on
Taiwan Freeway 5. The 354 daily trajectories used in the analysis exclude 11 trajectories with
massive missing values due to detector malfunction. We randomly sample 219 days as the training
data to build the functional regression model, and use the remaining 135 days as the test data to
evaluate the prediction performance.
We set the time domains S and T for X and Y to ([0, 10], [10, 20]) for Scenario I and
([6, 14], [14, 22]) for Scenario II, and use fraction-of-variance-explained and the Akaike and
Bayesian information criteria to select the numbers of functional principal components of X and
Y , as shown in Table 1. The results selected using the Akaike and Bayesian information criteria
are quite close and are much larger than those selected using the fraction-of-variance criterion.
Table 1. Number of principal components selected for traffic flow prediction under two scenarios
of (S , T )
Scenario (S, T ) FVE(0·90) FVE(0·95) aic bic
I ([0, 10], [10, 20]) (2, 3) (3, 5) (15, 20) (14, 20)
II ([6, 14], [14, 22]) (2, 2) (4, 3) (18, 15) (16, 13)
aic, Akaike information criterion; bic, Bayesian information criterion; FVE, fraction-of-variance-explained.

[Figure 1 here: line charts with panels (a), (b), (c); vertical axis rmipe (approximately 24–31); horizontal axis MA(0), MA(1), MA(2), MA(4), MA(6), MA(8), MA(10), RW.]

Fig. 1. Line charts for the root mean integrated squared prediction errors, rmipe, of traffic flow rates for X in S = [0, 10] and Y in T = [10, 20] with Q = 2 (a), 4 (b), and 10 (c). The (Kn, Jn) for the reweighting method, RW, the selection method, MA(0), and the candidate sets (J, K) for the model averaging method, MA(d), are determined by the fraction-of-variance-explained (0·90) (dashed, ◦), fraction-of-variance-explained (0·95) (dashed), Akaike (solid), and Bayesian information criteria (solid, •).

We use these criteria to determine the range of component numbers for model averaging. We also
investigate how averaging different numbers of models impacts the prediction results.
To perform prediction using model averaging, we choose the candidate sets J and K by setting
J = {j ∈ N : |L̂ − j| ≤ d} and K = {k ∈ N : |P̂ − k| ≤ d}, where L̂ and P̂ are obtained using the
Akaike or Bayesian information criteria or fraction-of-variance-explained. We call the resulting
procedure d-divergence model averaging. Here, d is a small positive integer; when d = 0,
it corresponds to the selection criterion without model averaging. We then obtain the optimal
weights through Q-fold crossvalidation, where Q is chosen to be 2, 4 and 10. In addition to the model averaging approach, we further included a reweighting method, as suggested by a referee. The reweighting method is defined as $\hat Y_0(t, p) = \hat\mu_Y(t) + \sum_{k=1}^{K_n}\sum_{j=1}^{J_n} p_{k,j}\,\hat b_{k,j}\,\hat\xi_{X,0,j}\,\hat\varphi_{Y,k}(t)$, where $K_n$ and $J_n$ are determined by one of the above three criteria and $p = (p_{1,1}, \ldots, p_{K_n,J_n})$ is selected by minimizing the Q-fold crossvalidation criterion without any constraint on $p$.
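
As a small illustration (not from the paper), the d-divergence candidate sets can be built directly from the selected numbers of components in Python; with fraction-of-variance-explained (0·95) giving (L̂, P̂) = (3, 5) and d = 1, this yields the grid used in Table 2.

def candidate_sets(L_hat, P_hat, d):
    # d-divergence candidate sets around the numbers of components chosen by a criterion.
    J = list(range(max(1, L_hat - d), L_hat + d + 1))
    K = list(range(max(1, P_hat - d), P_hat + d + 1))
    return J, K

# Example: candidate_sets(3, 5, 1) returns J = [2, 3, 4] and K = [4, 5, 6].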
Figure 1 illustrates the root mean integrated squared prediction errors in Scenario I,

$$\mathrm{rmipe}(w) = \bigg[\, n_0^{-1}\, |\mathcal{T}|^{-1} \sum_{i=1}^{n_0} \int_{\mathcal{T}} \big\{Y_{0,i}(t) - \hat Y_{0,i}(t, w)\big\}^2\,dt \bigg]^{1/2}, \qquad (18)$$

where |T | is the interval length of T and n0 is the number of test data. The results suggest that
the reweighting method does not perform well and that a small number of components (3,5) for
(X , Y ) selected by the fraction-of-variance-explained (0·95) criterion almost reaches the minimal
root mean integrated squared prediction errors. In addition, the performance depends little on
Q. The 1-divergence model averaging coupled with fraction-of-variance-explained (0·90) also
performs well, and the prediction error decreases only slightly as d increases. The results using the Akaike and Bayesian information criteria with larger numbers of components have larger prediction errors.
[Figure 2 here: line charts with panels (a), (b), (c); vertical axis rmipe (approximately 22–29); horizontal axis MA(0), MA(1), MA(2), MA(4), MA(6), MA(8), MA(10), RW.]

Fig. 2. Line charts for the root mean integrated squared prediction error of traffic flow rates for X in S = [6, 14] and Y in T = [14, 22] with Q = 2 (a), 4 (b), and 10 (c). The abbreviations are the same as in Fig. 1.

Table 2. Model averaging weights (%) obtained by Q-fold (Q = 2) crossvalidation with columns and rows corresponding to J with d = 1 and K with d = 1
FVE(0·90): (L̂, P̂) = (2, 3) FVE(0·95): (L̂, P̂) = (3, 5)
(Jn , Kn ) 2 3 4 (Jn , Kn ) 4 5 6
1 0·0 0·0 0·0 2 0·0 0·0 0·0
2 0·0 0·0 0·0 3 0·0 36·9 23·7
3 0·0 55·7 44·3 4 10·8 3·8 24·8
aic: (L̂, P̂) = (15, 20) bic: (L̂, P̂) = (14, 20)
(Jn , Kn ) 19 20 21 (Jn , Kn ) 19 20 21
14 77·1 0·0 0·0 13 94·9 0·0 0·0
15 0·0 0·0 0·0 14 0·0 0·0 0·0
16 22·9 0·0 0·0 15 5·1 0·0 0·0
The abbreviations are the same as in Table 1.

These results also improve, however, when model averaging with d = 8 or bigger is used.
Figure 2 suggests that Scenario II requires more components, since the models selected by
the Akaike information criterion (18,15) and the Bayesian information criterion (16,13) yield
relatively small prediction errors, and the model averaging approaches only slightly reduce
them. While fraction-of-variance-explained (0·90) and fraction-of-variance-explained (0·95) both
underestimate the numbers of components, model averaging is able to compensate because the
prediction errors decrease as d increases to 6 or 8.
We display the weights for Q = 2 for each selection criterion with d = 1 and d = 2 in Tables 2
and 3. Most of the (Jn , Kn ) combinations shrink to zero and only a few play a role in prediction.
For example, Table 3 indicates that fraction-of-variance-explained (0·95) with d = 2 results in
the optimal weights 0·36, 0·30, 0·23, and 0·11, respectively, for the models with the component
numbers (3,5), (5,3), (5,7) and (3,6), and all other weights are zero.
The data example demonstrates that when there exists a right number of components chosen by
a criterion, further employing model averaging does not cause additional loss. In addition, when
a criterion does not select the right number of components, model averaging can guard against
prediction loss. Model averaging offers the optimal weights to improve prediction performance
and can protect against potential prediction loss caused by a single selection criterion.
Table 3. Model averaging weights (%) obtained by Q-fold (Q = 2) crossvalidation
with columns and rows corresponding to J with d = 2 and K with d = 2
FVE(0·90): (L̂, P̂) = (2, 3)
(Jn , Kn )   1     2     3     4     5
1            0·0   0·0   0·0   0·0   0·0
2            0·0   0·0   0·0   0·0   0·0
3            3·1   0·0   0·0   0·0   53·2
4            0·0   0·0   29·5  0·0   14·2

FVE(0·95): (L̂, P̂) = (3, 5)
(Jn , Kn )   3     4     5     6     7
1            0·0   0·0   0·0   0·0   0·0
2            0·0   0·0   0·0   0·0   0·0
3            0·0   0·0   36·0  11·2  0·0
4            0·0   0·0   0·0   0·0   0·0
5            29·8  0·0   0·0   0·0   23·0

aic: (L̂, P̂) = (15, 20)
(Jn , Kn )   18    19    20    21    22
13           76·5  0·0   0·0   0·0   0·0
14           0·0   0·0   0·0   0·0   0·0
15           0·0   0·0   0·0   0·0   0·0
16           0·0   0·0   0·0   0·0   0·0
17           0·0   23·5  0·0   0·0   0·0

bic: (L̂, P̂) = (14, 20)
(Jn , Kn )   18    19    20    21    22
12           56·2  0·0   0·0   6·9   0·0
13           16·3  0·0   0·0   0·0   0·0
14           0·0   0·0   0·0   0·0   0·0
15           0·0   0·0   0·0   0·0   0·0
16           0·0   20·6  0·0   0·0   0·0
The abbreviations are the same as in Table 1.

5. Simulation studies
5·1. Simulation designs
We assume that pairs of the smooth stochastic processes Xi and Yi are observed at time-points
Si,l and Ti,m , respectively, with measurement errors, for l = 1, . . . , nXi and m = 1, . . . , nYi ,
assuming that nXi ≥ 2 and nYi ≥ 2. We consider the following simulation designs.
(a) Consider the sparse designs. Set nT = 400 as the number of individual trajectories of
a dataset, among which the first n = 300 trajectories are used as training data and the
other n0 = 100 as test data. Set nXi = nYi (i = 1, . . . , nT ), which are sampled from the
discrete uniform distribution on [10, 12]. For each i, the associated Si,l (l = 1, . . . , nXi ) and
Ti,m (m = 1, . . . , nYi ) are randomly sampled from 26 equally spaced grid points on [0, 1].
Generate 100 replicates of the simulation datasets.
(b) Consider the observations of Xi through the basis expansion and the geometric Brownian motion, respectively. For the basis expansion, set the eigenfunctions $\varphi_{X,j}(s) = \sqrt{2}\cos(j\pi s)$ $(j = 1, \ldots, J_n)$ with $J_n = 8$, along with the eigenvalues $\{\lambda_{X,j}\}$, where either $\lambda_{X,j} = 9\{1 - 0{\cdot}1(j-1)\}$ or $\lambda_{X,j} = 9 j^{-1{\cdot}2}$.
For geometric Brownian motion dX (t) = μG dt + σG X (t)dW (t), we set μG = 3 and σG =
0·75 and let W (t) be a Wiener process. The eigenvalue-eigenfunction pairs {λX, j , ϕX, j (s)}
(j = 1, . . . , Jn ) are obtained by the functional principal component analysis using the
generated observations of geometric Brownian motion.
(c) Generate the random coefficients ξX,i, j (i = 1, . . . , nT ) such that ξX,i, j is a N (0, λX, j ) variate
for each j = 1, . . . , Jn .
(d) Set $\mu_X(s) = 0$ and obtain the observations of $X_i$ at $S_{i,l}$ by $X_i(S_{i,l}) = \sum_{j=1}^{J_n} \xi_{X,i,j}\,\varphi_{X,j}(S_{i,l})$ $(l = 1, \ldots, n_{X_i};\ i = 1, \ldots, n_T)$.
(e) Set the eigenfunctions $\varphi_{Y,k}(t) = \sqrt{2}\sin(k\pi t)$ $(k = 1, \ldots, K_n)$ with $K_n = 8$, and the $J_n \times K_n$ matrix $P_{J_n K_n} = \{\beta_{k,j}\}_{1 \leqslant j \leqslant J_n, 1 \leqslant k \leqslant K_n}$ with

$$\beta_{j,k} = \begin{cases} 0{\cdot}8^{\,j}, & 1 \leqslant j = k \leqslant 6, \\ 0{\cdot}5^{\max(j,k)}, & 1 \leqslant j \neq k \leqslant 6, \\ 0, & j > 6 \text{ or } k > 6. \end{cases}$$
(f) Set the diagonal matrix $\Sigma_\xi = \mathrm{diag}\{(k^{-1{\cdot}5})_{k=1,\ldots,K_n}\}$ and generate $\varepsilon_{\xi,i} = (\varepsilon_{\xi,i1}, \ldots, \varepsilon_{\xi,iK_n})^{\mathrm{T}}$ from the multivariate normal distribution with mean zero and covariance $\Sigma_\xi$.
(g) Let $\xi_{Y,i} = (\xi_{Y,i,1}, \ldots, \xi_{Y,i,K_n})^{\mathrm{T}}$. Form $\xi_{Y,i} = P^{\mathrm{T}}_{J_n,K_n}\,\xi_{X,i} + \varepsilon_{\xi,i}$ $(i = 1, \ldots, n_T)$.
(h) Set $\mu_Y(t) = 0$ and generate the observation of $Y_i$ at $T_{i,m}$ by $Y_i(T_{i,m}) = \sum_{k=1}^{K_n} \xi_{Y,i,k}\,\varphi_{Y,k}(T_{i,m})$ $(i = 1, \ldots, n_T)$.
(i) Obtain the observations $U_{il}$ and $V_{im}$ by adding measurement errors $\epsilon_{il}$ and $\varepsilon_{im}$ to $X_i(S_{i,l})$ and $Y_i(T_{i,m})$, where $\epsilon_{il}$ and $\varepsilon_{im}$ are independent $N(0, 0{\cdot}8)$ variates.
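
A minimal Python sketch (not from the paper) of one way to generate a dataset under the sparse basis-expansion design with λX,j = 9{1 − 0·1(j − 1)}; the seed, sampling without replacement, and reading N(0, 0·8) as mean zero and variance 0·8 are assumptions of the sketch.

import numpy as np

rng = np.random.default_rng(0)
nT, J, K = 400, 8, 8
lam_x = 9 * (1 - 0.1 * np.arange(J))                        # eigenvalues of X, step (b)
grid = np.linspace(0.0, 1.0, 26)                            # 26 equally spaced points, step (a)

B = np.zeros((J, K))                                        # coefficient matrix, step (e)
for j in range(1, 7):
    for k in range(1, 7):
        B[j - 1, k - 1] = 0.8 ** j if j == k else 0.5 ** max(j, k)

data = []
for i in range(nT):
    xi_x = rng.normal(0.0, np.sqrt(lam_x))                  # covariate scores, step (c)
    eps = rng.normal(0.0, np.arange(1, K + 1, dtype=float) ** -0.75)  # sd k^{-0.75}, step (f)
    xi_y = B.T @ xi_x + eps                                 # response scores, step (g)
    n_obs = rng.integers(10, 13)                            # 10-12 points per curve, step (a)
    s = rng.choice(grid, size=n_obs, replace=False)         # time-points for X
    t = rng.choice(grid, size=n_obs, replace=False)         # time-points for Y
    phi_x = np.sqrt(2) * np.cos(np.outer(s, np.arange(1, J + 1)) * np.pi)   # step (b)
    phi_y = np.sqrt(2) * np.sin(np.outer(t, np.arange(1, K + 1)) * np.pi)   # step (e)
    U = phi_x @ xi_x + rng.normal(0.0, np.sqrt(0.8), n_obs)  # observed covariate, steps (d), (i)
    V = phi_y @ xi_y + rng.normal(0.0, np.sqrt(0.8), n_obs)  # observed response, steps (h), (i)
    data.append((s, U, t, V))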

5·2. Algorithm
Given a simulation dataset {(Xi,l , Yi,m )} (l = 1, . . . , nXi ; m = 1, . . . , nYi ; i = 1, . . . , n), we
must first determine the sets J and K, and then perform functional principal component analysis
for the samples of the random functions (Xi , Yi ) and estimate the regression coefficients βk, j and
βJn , Kn (s, t) for any (Jn , Kn ) pair for Jn ∈ J and Kn ∈ K. The algorithm is as follows.

Step 1. Assign A = [{Yi (t), Xi (s)}]1in as the training set and the remaining n0 pairs as the
testing set, B . In the simulation, we set n = 300 and n0 = 100.

Step 2. Determine J and K; we set J = {j ∈ N : |L̂ − j| ≤ d} and K = {k ∈ N : |P̂ − k| ≤ d},


where L̂ and P̂ are obtained using a selection criterion such as the Akaike or Bayesian information
criterion or fraction-of-variance-explained, and d is a fixed integer. We set d = 1, 2, 4 in the
simulation.

Step 3. Obtain the optimal weights w, and perform the Q-fold crossvalidation procedure based
on A.
(i) Divide A into Q subsets, $\{\mathcal{Q}_q\}_{1 \leqslant q \leqslant Q}$, with a sample size of $M_q = n/Q$.
(ii) For each $q \in \{1, \ldots, Q\}$, perform the functional linear regression estimation on the leave-$M_q$-out set, $\mathcal{A}\backslash\mathcal{Q}_q$, for each $J_n \in \mathcal{J}$ and $K_n \in \mathcal{K}$.
(iii) Obtain the estimated $\hat\beta^{[-q]}_{J_n,K_n}(s, t)$ and the predicted $\hat Y^{[-q]}_{i,J_n,K_n}(t)$ for $\{X_i(s) \in \mathcal{Q}_q\}$, $J_n \in \mathcal{J}$ and $K_n \in \mathcal{K}$.
(iv) Arrange $\{\hat Y^{[-q]}_{i,J_n,K_n}(t) - Y_i(t)\}_{J_n \in \mathcal{J}, K_n \in \mathcal{K}}$ in a $1$ by $|\mathcal{J}| \times |\mathcal{K}|$ array corresponding to the weight $w$, and form the matrix $H$ as in (12).
(v) Obtain the optimal weight $\hat w_{J_n,K_n}$ by solving $\arg\min_w \mathrm{cv}_Q(w) = w^{\mathrm{T}} H w$ subject to $l^{\mathrm{T}} w = 1$ and $w \geqslant 0$, using quadratic programming.
(vi) Obtain $\hat\beta_w(s, t) = \sum_{J_n \in \mathcal{J}, K_n \in \mathcal{K}} w_{J_n,K_n}\,\hat\beta_{J_n,K_n}(s, t)$, where $w_{J_n,K_n}$ is an element of the optimal weight $\hat w$ derived above.

Step 4. Given each X0,i in B , obtain the prediction


  
$$\hat Y_{0,i}(t, w) = \sum_{J_n \in \mathcal{J}} \sum_{K_n \in \mathcal{K}} w_{J_n,K_n}\,\hat Y_{0,i,J_n,K_n}(t) = \hat\mu_Y(t) + \int_{\mathcal{S}} \hat\beta_w(s, t)\,\big\{\hat X_0(s) - \hat\mu_X(s)\big\}\,ds.$$
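
A compact Python skeleton (not from the paper) of Steps 1–3 is given below; fit_flr and predict_flr stand for any functional linear regression fitting and prediction routine and are placeholders, the response curves are assumed to be available on a common grid, and solve_weights refers to the quadratic-programming sketch given after (12).

import numpy as np

def model_average_weights(train, J_set, K_set, Q, fit_flr, predict_flr, grid_t, quad_w):
    # train: list of (x_obs, y_on_grid) pairs; y_on_grid are response values on grid_t.
    # Returns the list of (Jn, Kn) pairs and the estimated weight vector.
    n = len(train)
    folds = np.array_split(np.arange(n), Q)                  # Step 3(i)
    pairs = [(Jn, Kn) for Jn in J_set for Kn in K_set]
    resid = np.zeros((n, len(grid_t), len(pairs)))
    for idx in folds:                                        # Steps 3(ii)-(iv)
        keep = np.setdiff1d(np.arange(n), idx)
        for p, (Jn, Kn) in enumerate(pairs):
            model = fit_flr([train[i] for i in keep], Jn, Kn)
            for i in idx:
                x_obs, y_grid = train[i]
                resid[i, :, p] = y_grid - predict_flr(model, x_obs, grid_t)
    w_hat = solve_weights(resid, quad_w)                     # Step 3(v), quadratic programming
    return pairs, w_hat

Step 4 would then combine the per-model predictions for a new curve with the entries of w_hat, in the spirit of the display above.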

A large dimension of the weight vector w can cause numerical instability when searching
for the optimal weight vector w by minimizing cvQ (w) in (9) over w. For example, if we set
JL = KL = 1 and let J = {1, . . . , JU } and K = {1, . . . , KU }, where JU ≤ max{nXi } and KU ≤ max{nYi }, then the number of weight parameters is JU KU − 1, which can be large. In
practice, we can determine the sets J and K by making use of selection criteria such as in Step 2.
By setting J and K more carefully, we can reduce the number of elements in the weight vector
w, which helps to simplify the numerical implementation.

[Figure 3 here: two panels of boxplots of rmipe for the FVE-, aic- and bic-based selection, reweighting and model averaging methods, and a third panel showing the selected numbers of components.]

Fig. 3. Outcome for the basis expansion design with λX,j = 9{1 − 0·1(j − 1)}: boxplots of rmipe with Q = 1, 2, 4 and the selected number of components based on 100 simulation datasets. rmipe, root mean integrated squared prediction errors; FVE90, fraction-of-variance-explained (0·90); FVE95, fraction-of-variance-explained (0·95); aic, Akaike information criterion; bic, Bayesian information criterion; RW(FVE90,Qb), reweighting method with Jn and Kn determined by FVE90 and with p selected by CVQ=b; RW(FVE95,Qb), RW(aic,Qb) and RW(bic,Qb) have similar definitions; MA(FVE90±a,Qb), model averaging method with Jn and Kn determined by FVE90, with d = a, and with weights selected by CVQ=b; MA(FVE95±a,Qb), MA(aic±a,Qb) and MA(bic±a,Qb) have similar definitions; FVE90(X), number of components for Xi(s) determined by FVE90; FVE95(X), aic(X), bic(X), FVE90(Y), FVE95(Y), aic(Y) and bic(Y) have similar definitions.
5·3. Simulation results
We compare the performance of various settings of J and K in terms of the predicted Y0,i ,
using discrete approximations for the root mean integrated squared prediction errors defined
by (18).
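
A short Python sketch (not from the paper) of the discrete approximation to (18), assuming the test curves and their predictions are evaluated on a common grid:

import numpy as np

def rmipe(y_true, y_pred, grid_t):
    # y_true, y_pred: (n0, m) arrays of test response curves and predictions on grid_t.
    ise = np.trapz((y_true - y_pred) ** 2, grid_t, axis=1)   # integrated squared error per curve
    return np.sqrt(np.mean(ise) / (grid_t[-1] - grid_t[0]))  # divide by |T| and average, as in (18)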
In each design, we present the results with Q = 1, 2, 4 for the Q-fold crossvalidation method
to obtain the optimal weights. The results with Q = 8 perform similarly to those with Q = 4
in our numerical study and are omitted. The prediction performance is compared on the basis of
rmipe(w) under various settings of J and K with d = 1, 2, 4 coupled with a selection method and for Q = 2, 4.
In the simulation studies, the numbers of components selected by the Akaike and Bayesian
information criteria are much larger than those selected by the fraction-of-variance-explained
(0·90) and fraction-of-variance-explained (0·95) criteria, as shown in the bottom panels of Fig. 3,
which affect the ranges of the candidate sets J and K. In the case of the basis expansion design
with λX, j = 9{1 − 0·1( j − 1)} shown in Fig. 3, both the Akaike and Bayesian information criteria
reach the minimum root mean integrated squared prediction errors; so the model averaging
approaches provide little further improvement. However, fraction-of-variance-explained (0·90)
and fraction-of-variance-explained (0·95) result in larger prediction errors, and the corresponding
model averaging approaches indeed can further reduce the prediction errors. In addition, the
performance of model averaging is stable, whereas that of model selection can be very variable,
possibly due to the selection of unsuitable models in some datasets. Overall, the simulation study
illustrates the advantage of model averaging over a single model selection criterion in terms of
prediction. Last, as in the real data example, the reweighting method does not perform well in
these simulations. Additional simulation results are presented in the Supplementary Material.

6. Discussion
To select the model weights using cvQ (w) in (9), we need to determine Q. In our Monte Carlo
study, we compare Q = 2 and Q = 4 for n = 300. The performance of Q = 8 is similar to that of
Q = 4 in our simulation study. Although it is possible to set an additional criterion to choose the
optimal value of Q, this may increase the computational burden. Our numerical results indicate
that a small value of Q is usually sufficient to choose the model weights for moderate sample
sizes.
We consider functional linear regression in which both the covariate and the response are
random functions, where the covariate and the response functions can be measured with error
and can be observed densely or sparsely, regularly or irregularly. The proposed approach can also
be used when either the response or the predictor variable contains a random function. It helps
to reduce prediction errors and the possibility of producing poor results using a single model
selection criterion.
One aspect omitted here is the variability of the resulting averaged prediction. In fact, research
regarding variability of optimal model averaging methods has been missing from the literature,
due to the difficulties caused by the random weights and model misspecification. A good starting
point would be linear models, and much effort is needed to push this research forward.

Acknowledgement
This work was supported by the National Natural Science Foundation of China, the U.S.
National Science Foundation, the U.S. National Institute of Neurological Disorders and Stroke,
and the Academia Sinica and the Ministry of Science and Technology of Taiwan. The authors
wish to thank the editor, associate editor, and referees for helpful remarks.

Supplementary material
Supplementary material available at Biometrika online includes the detailed proofs of (11),
Theorems 1 and 2, and Corollary 1 as well as additional simulation results.

References
Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math. 22, 203–17.
Ando, T. & Li, K. (2014). A model-averaging approach for high-dimensional regression. J. Am. Statist. Assoc. 109,
254–65.
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Ann. Statist. 24, 2350–83.
Cardot, H., Ferraty, F. & Sarda, P. (2003). Spline estimators for the functional linear model. Statist. Sinica 13,
571–91.
Cheng, X. & Hansen, B. E. (2015). Forecasting with factor-augmented regression: A frequentist model averaging
approach. J. Economet. 186, 280–93.
Chiou, J.-M. (2012). Dynamical functional prediction and classification, with application to traffic flow prediction.
Ann. Appl. Statist. 6, 1588–614.
Chiou, J.-M., Müller, H.-G. & Wang, J.-L. (2003). Functional quasi-likelihood regression models with smooth
random effects. J. R. Statist. Soc. B 65, 405–23.
Claeskens, G. & Hjort, N. L. (2008). Model Selection and Model Averaging. Cambridge: Cambridge University
Press.
Cuevas, A. (2014). A partial overview of the theory of statistics with functional data. J. Statist. Plan. Infer. 147, 1–23.
Faraway, J. J. (1997). Regression analysis for a functional response. Technometrics 39, 254–61.
Ferraty, F., Van Keilegom, I. & Vieu, P. (2012). Regression when both response and predictor are functions. J. Mult.
Anal. 109, 10–28.
Ferraty, F. & Vieu, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice. New York: Springer.
Ferraty, F. & Vieu, P. (2009). Additive prediction and boosting for functional data. Comp. Statist. Data Anal. 53,
1400–13.
Gao, Y., Zhang, X., Wang, S. & Zou, G. (2016). Model averaging based on leave-subject-out cross-validation.
J. Economet. 192, 139–51.
Hansen, B. E. & Racine, J. (2012). Jackknife model averaging. J. Economet. 167, 38–46.
He, G., Müller, H.-G. & Wang, J.-L. (2000). Extending correlation and regression from multivariate to functional
data. In Asymptotics in Statistics and Probability. Utrecht: VSP, pp. 197–210.
Hoeting, J. A., Madigan, D., Raftery, A. E. & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statist.
Sci. 14, 382–417.
Horváth, L. & Kokoszka, P. (2012). Inference for Functional Data with Applications. Berlin: Springer.
Hsing, T. & Eubank, R. (2015). Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear
Operators. New York: Wiley.
James, G. (2002). Generalized linear models with functional predictors. J. R. Statist. Soc. B 64, 411–32.
Müller, H.-G. (2005). Functional modeling and classification of longitudinal data. Scand. J. Statist. 32, 223–40.
Müller, H.-G. & Stadtmüller, U. (2005). Generalized functional linear models. Ann. Statist. 33, 774–805.
Müller, H.-G. & Yao, F. (2008). Functional additive models. J. Am. Statist. Assoc. 103, 1534–44.
Ramsay, J. & Dalzell, C. (1991). Some tools for functional data analysis. J. R. Statist. Soc. B 53, 539–72.
Ramsay, J. O. & Silverman, B. W. (2005). Functional Data Analysis. New York: Springer, 2nd ed.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461–4.
Wang, J.-L., Chiou, J.-M. & Müller, H.-G. (2016). Functional data analysis. Ann. Rev. Statist. Appl. 3, 257–95.
Wolpert, D. (1992). Stacked generalization. Neural Networks 5, 241–60.
Yao, F., Müller, H.-G. & Wang, J.-L. (2005a). Functional data analysis for sparse longitudinal data. J. Am. Statist.
Assoc. 100, 577–90.
Yao, F., Müller, H.-G. & Wang, J.-L. (2005b). Functional linear regression analysis for longitudinal data. Ann.
Statist. 33, 2873–903.
Zhang, J.-T. (2013). Analysis of Variance for Functional Data. London: Taylor & Francis.
Zhang, X., Wan, A. T. K. & Zou, G. (2013). Model averaging by jackknife criterion in models with dependent data.
J. Economet. 174, 82–94.

[Received on 2 December 2016. Editorial decision on 2 April 2018]
