

Transferring Model Structure in Bayesian Transfer Learning for Gaussian Process Regression

Milan Papež and Anthony Quinn

Abstract—Bayesian transfer learning (BTL) is defined in this paper as the task of conditioning a target probability distribution on a transferred source distribution. The target globally models the interaction between the source and target, and conditions on a probabilistic data predictor made available by an independent local source modeller. Fully probabilistic design is adopted to solve this optimal decision-making problem in the target. By successfully transferring higher moments of the source, the target can reject unreliable source knowledge (i.e. it achieves robust transfer). This dual-modeller framework means that the source's local processing of raw data into a transferred predictive distribution—with compressive possibilities—is enriched by (the possible expertise of) the local source model. In addition, the introduction of the global target modeller allows correlation between the source and target tasks—if known to the target—to be accounted for. Important consequences emerge. Firstly, the new scheme attains the performance of fully modelled (i.e. conventional) multitask learning schemes in (those rare) cases where target model misspecification is avoided. Secondly, and more importantly, the new dual-modeller framework is robust to the model misspecification that undermines conventional multitask learning. We thoroughly explore these issues in the key context of interacting Gaussian process regression tasks. Experimental evidence from both synthetic and real data settings validates our technical findings: that the proposed BTL framework enjoys robustness in transfer while also being robust to model misspecification.

Index Terms—Bayesian transfer learning (BTL), multitask learning, local and global modelling, fully probabilistic design, incomplete modelling, Gaussian process regression.

I. INTRODUCTION

MULTITASK learning [1] with Gaussian process (GP) regression tasks is of major concern in the statistical signal processing and machine learning communities. Many contemporary instances of this framework have been reported, including methods based on convolved GPs [2]–[5], convolutional GPs [6], product GPs [7], [8], deep (nested) GPs [9]–[11], and deep neural networks [12]. All these methods adopt a single global model to describe the relationship among source and target tasks, involving a joint probability distribution of all latent and output-data processes, which is consistent with the foundational methodology of Bayesian learning [13]. The flexibility of multitask GP regression is evidenced by a wide range of applications, such as reinforcement learning [14], Bayesian optimization [15], Earth observation [16], face recognition [17], and clinical data analysis [18].

This paper is specifically interested in Bayesian transfer learning (BTL) [19]–[22]. Without loss of generality, we define BTL as the transfer of knowledge from a source task to a target task, a definition which is widely extensible to multitask settings. Under the Bayesian framework, source and target modellers separately adopt distinct probability distributions to model local unknown quantities of interest, i.e. our approach involves two stochastically independent modellers without requiring consistency to be established between them, whereas conventional multitask learning involves—as already stated—a single global modeller. These modellers each have exclusive access to their respective local data but not to each other's data. Therefore, the target cannot condition its model on the source data, the essential assumption of conventional global multitask learning. The target's model can be a local model, as in [23]–[28]. Alternatively, it may itself extend its focus beyond its local target task and adopt a global model involving both source and target tasks, as in the standard multitask setting. This is the setup we adopt in this paper for the first time. To emphasize: the target modeller globally models the interactions between its own task and the source task, but is informed only by its local target data. Meanwhile, the source modeller, adopting a local Bayesian model, is informed by its local (source) data, and then communicates its knowledge to the target only via a source probability distribution instantiated by its (source) sufficient statistics. Therefore, only the target takes care of interactions and only the source communicates source probabilistic knowledge. The key distinctions between global and local modelling of the source and target tasks are captured in Fig. 1, whose technical details will be explained fully in Section II. The target modeller has no prescriptive rule of probability calculus with which to condition on the source distribution in this BTL scenario (Fig. 1(c)). The reason for this is that the joint model of the source distribution and the target's quantities of interest is not available. The challenge for the target modeller then lies in deciding how to condition on the source probability distribution in this incompletely modelled case. Fully probabilistic design (FPD) [29], [30]—the optimal Bayesian decision-making strategy based on the minimum cross-entropy principle [31]—addresses this problem. Under this framework, the target modeller conditions on the transferred source knowledge by solving a constrained optimization problem.

The principal novelty of this paper, therefore, is that the target modeller is a global modeller (in sympathy with the aforementioned standard multitask learning) that conditions on a local source probability distribution.

M. Papež is with the Institute of Information Theory and Automation (UTIA), Czech Academy of Sciences, Prague 18208, Czech Republic (papez@utia.cas.cz).
A. Quinn is with UTIA (aquinn@utia.cas.cz), and also with the Department of Electronic and Electrical Engineering, Trinity College Dublin, the University of Dublin, Dublin D04 V1W8, Ireland (aquinn@tcd.ie).
The research has been supported by GAČR grant 18-15970S.

Fig. 1: (a) Singletask learning: an isolated local Bayesian modeller with complete knowledge of F(y, f). (b) Standard multitask learning: an interacting global Bayesian modeller with complete knowledge of F(yS, yT, fS, fT). (c) Bayesian transfer learning: an isolated local source Bayesian modeller with complete knowledge of FS(yS, fS) and an interacting global target Bayesian modeller with incomplete knowledge of F(yS⋆, yT, fS, fT, FS). The global target modeller does not have access to yS. It compensates for this missing information by utilizing the output-data predictive distribution, FS, that is made available by the local source modeller. The global target modeller improves its performance based on FS. The gray and white circles highlight known and unknown variables, respectively.

Recall, from Bayesian foundations, that this latter distribution is a function of both its local source data and local source model. What is implied by this BTL framework is that the global target modeller independently adopts a different model from the source. This leads to important advances beyond the currently available notions and techniques for BTL, all of which are explored in this paper:
• The introduction of the global target modeller facilitates learning about the correlation structure between the source and target tasks, extending our own previous BTL framework which involved only locally modelled tasks [23]–[25], [27], [28]. It also generalizes FPD-based approaches to pooling [32], [33], as well as conventional (global) Bayesian multitask learning [2]–[12]. We demonstrate this formally, and also via extensive experimental evidence.
• We stand to benefit from the fact that the local source modeller is independent of the target and is, therefore, potentially a more informed—i.e. expressive or expert—modeller of its local data. This can provide a major performance dividend over standard multitask learning solutions that adopt only a single global (typically centralized) model.
• The source probability distribution is represented by its sufficient statistics, thereby achieving an optimal encoding (compression) of source knowledge extracted from its (often high-dimensional) data and (possibly complex) model structure. This can avoid transferred message overheads that arise in standard multitask learning, which relies on the transfer of unprocessed source data for its conditioning task (i.e. learning).
• Our previous FPD-based BTL schemes were unable to transfer higher moments of the source distribution, a resource that is necessary in achieving robust transfer (i.e. the rejection of imprecise source knowledge) [23], [24], [27], [34]. By careful design of the FPD-based constrained optimization problem in this paper, we formally solve this problem, obviating the informal adaptations in [23], [24], and the computationally expensive augmentation strategies in [25], [28].

The paper is organized as follows: Section II defines the BTL problem, where the central aim is to transfer the source output-data predictor to the target modeller. The new concept of the global target modeller is introduced and the general solution based on the FPD framework is provided. Section III instantiates this general setting of BTL in the case of GP regression tasks. Section IV provides a thorough exploration of these advances via synthetic data experiments, focussing particularly on its ability to capture correlation structure, to achieve robust transfer and to benefit from the transfer of an expressive local model in the source. This will allow us to comment in an informed way on the performance limits of our approach. Section V validates our BTL scheme in a real-data context. Section VI provides further discussion of the key properties of our method, including its source knowledge compression and computational complexity aspects. Section VII concludes the paper with suggestions for future work.

II. BAYESIAN TRANSFER LEARNING FROM A LOCAL TO GLOBAL MODELLER

Consider a (non-parametric) regression task where the aim is to infer unknown (latent) function values, f ≡ (f(xi))_{i=1}^{n}, based on known input data (regressors), x ≡ (xi ∈ R^{nx})_{i=1}^{n}, and known output data, y ≡ (yi ∈ R)_{i=1}^{n}, where each yi is an indirect (e.g. noisy) observation of f(xi) ∈ R. A Bayesian modeller (Fig. 1(a)) addresses this problem by adopting a complete joint probability model, F(y, f | x, θ), and then computing the posterior distribution, F(f | y, x, θ), for some fixed and finite-dimensional parameter, θ ∈ R^{nθ}, given a realization of y. This constitutes standard Bayesian conditioning in the context of the complete joint model¹, F(y, f | x, θ).

¹ In the sequel, we will suppress the details of the conditioning where this is evident. Moreover, F and M denote known (fixed-form) and unknown (variational-form) distributions, respectively. We use "≡" to denote "is defined to be equal to".

This paper addresses the problem of learning in a pair of GP regression tasks, called the source and target learning tasks, respectively. The source task is local, being informed by local output data, yS, available only to the source. Its local processing of knowledge, FS(yS, fS), and data, yS, is to be exploited in order to improve learning at the target task. The standard approach is that a global multitask modeller (Fig. 1(b)) elicits a joint model, F(yS, yT, fS, fT), capturing interactions between the source and target processes, and conditions on source and

target output data, yS and yT, respectively, to which it must have access [2]–[12]. This yields

$$F(f_S, f_T \mid y_S, y_T). \quad (1)$$

This scheme requires the prior elicitation of the joint model of the two-task learning system. We refer to this as complete modelling.

In this work, our aim is to relax these restrictive assumptions of the conventional multitask learning framework, as follows (Fig. 1(c)):
(i) The target will not have direct access to the source data, yS; and
(ii) the target's joint model², F(yS, yT, fS, fT), is distinct from the source's local model, FS(yS, fS), and each is elicited independently.

² We adopt the following notation: undecorated distributions, F and M, are target models, with FS denoting source distributions.

Specifically, the goal of the present paper is to substitute the inaccessible source data, yS, with appropriately specified probabilistic source knowledge, FS(·). Technically, the inferential objective of the global target modeller is to compute F(fS, fT | FS, yT), conditioned now on the transferred source distribution, FS, but not on yS. This conditioning on FS is incompletely specified because the target does not have a joint model of the type, F(FS, . . .), i.e. one that involves the target's hierarchical specification of the source's transferred distribution. Instead, the target's learning task becomes one of optimally choosing its knowledge-conditioned model, M(. . . | FS, yT), in this incompletely modelled case (necessarily, then, M is unknown). The fact that the global target modeller (Fig. 1(c)) processes FS (a result of learning) is an important progression beyond conventional global multitask learning (Fig. 1(b)), and we reserve the term Bayesian transfer learning for the case in Fig. 1(c). Among its consequences are the following:
(i) the target learns only from the sufficient statistics of the source, without itself receiving the source data, yS (of course, it also processes its local target data, yT); and
(ii) the target can be enriched by the (possibly expert) local source model.

It remains to decide on the optimal form of the target's FS-conditioned joint model, M(· | FS, yT). Here, FS ≡ FS(yS⋆ | yS), i.e. the source transfers its source-data-conditioned predictor of an unrealized (i.e. unobserved) output, denoted by yS⋆, for which the known input is xS⋆. Hence, the target's unspecified joint model must be augmented, to yield

$$M(y_S^\star, f_S, f_T \mid F_S, y_T) = M(y_S^\star \mid f_S, f_T, F_S, y_T)\, M(f_S, f_T \mid F_S, y_T), \quad (2)$$

which follows from the chain rule. Fully probabilistic design (FPD) is the axiomatically justified Bayesian decision-making framework [29], [35] for optimally choosing the required conditional distribution. FPD is closely related to the minimum cross-entropy principle [31], as further explained in [30]. It solves the following constrained optimization problem:

$$M^{o}(y_S^\star, f_S, f_T \mid F_S, y_T) \equiv \operatorname*{argmin}_{M \in \boldsymbol{M}} \; D(M \,\|\, M_I), \quad (3)$$

where

$$D(M \,\|\, M_I) = E_{M}\!\left[ \log \frac{M}{M_I} \right]$$

is the Kullback-Leibler divergence (KLD) [36] from M to M_I, and E_M is the expected value under M. The FPD-optimal model (3) is the solution (i) that respects the knowledge constraints given by the set of possible models, M ∈ **M**, and (ii) that is closest to the ideal model, M_I. The latter must declare the target's preferences about M. These are as follows:

(i) The set of possible models, **M**, is delineated by restricting the functional form of (2),

$$M(y_S^\star \mid f_S, f_T, F_S, y_T) \equiv F(y_S^\star \mid f_S), \quad (4)$$

where we assume that yS⋆ is conditionally independent of (fT, FS, yT) given fS. (2) then has the form

$$M(y_S^\star, f_S, f_T \mid F_S, y_T) \equiv F(y_S^\star \mid f_S)\, M(f_S, f_T \mid F_S, y_T), \quad (5)$$

where the FS-conditioned factor remains unknown and unrestricted. Since F(yS⋆ | fS) is fixed, then M(fS, fT | FS, yT) is the only unknown quantity in (5), and the set of possible models is therefore

$$M \in \boldsymbol{M} \equiv \{\text{models (5) with } F(y_S^\star \mid f_S) \text{ fixed and } M(f_S, f_T \mid F_S, y_T) \text{ variational}\}. \quad (6)$$

(ii) Adopting the same chain rule expansion as in (2), the target's ideal model is specified as

$$M_I(y_S^\star, f_S, f_T \mid F_S, y_T) \equiv F_S(y_S^\star \mid y_S)\, F(f_S, f_T \mid y_T). \quad (7)$$

Comparing with (2), the following ideal declarations have been made by the target:

$$M_I(y_S^\star \mid f_S, f_T, F_S, y_T) \equiv F_S(y_S^\star \mid y_S), \quad (8)$$
$$M_I(f_S, f_T \mid F_S, y_T) \equiv F(f_S, f_T \mid y_T). \quad (9)$$

Specifically, in (8), the target assumes as its ideal that yS⋆ is conditionally independent of (fS, fT, yT) given FS, and—in order to transfer the source output-data predictor, FS—adopts the source's predictor of yS⋆, i.e. FS(yS⋆ | yS), as its ideal model for yS⋆. Similarly, in (9), the target ideally models (fS, fT) conditionally independently of FS given its own yT, and adopts its own joint model, F(fS, fT | ·), as its ideal, M_I(fS, fT | ·).

These specifications imply a unique design for the target's source-knowledge-constrained optimal distribution (3), as established in the following proposition.

Proposition 1. The target modeller constrains the unknown model (2) to belong to the set of possible models, M ∈ **M** (6), and adopts the ideal model M_I (7). Then the target's FPD-optimal model (3) is

$$M^{o}(y_S^\star, f_S, f_T \mid F_S, y_T) = F(y_S^\star \mid f_S)\, M^{o}(f_S, f_T \mid F_S, y_T), \quad (10)$$

where

$$M^{o}(f_S, f_T \mid F_S, y_T) \propto F(f_S, f_T \mid y_T)\, \exp\!\big\{ -D\big( F(y_S^\star \mid f_S) \,\big\|\, F_S(y_S^\star \mid y_S) \big) \big\}. \quad (11)$$

Proof. See Appendix A-B.
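To see concretely what the exponential factor in (11) does, it is helpful to anticipate the Gaussian instantiation of Section III (the calculation is carried out in Appendix A-C, equation (24)): when F(yS⋆ | fS) = N(fS, (σ⋆)²I) and FS(yS⋆ | yS) = N(m̄S, R̄S), the fS-dependent part of the penalty is quadratic, and

$$\exp\!\big\{-D\big(F(y_S^\star \mid f_S) \,\|\, F_S(y_S^\star \mid y_S)\big)\big\} \;\propto\; \exp\!\Big\{-\tfrac{1}{2}(\bar m_S - f_S)^{\mathsf T} \bar R_S^{-1} (\bar m_S - f_S)\Big\} \;\propto\; N(\bar m_S; f_S, \bar R_S),$$

i.e. the transferred predictor acts on the target's pre-posterior like a pseudo-observation m̄S of fS with covariance R̄S (this is exactly the role it plays in the notational assignments (26) of the appendix).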

Algorithm 1: The FPD-optimal global target GP regression modeller
Input: kSS, kST, kTT, xS, xT, m̄S, R̄S, yT, σT²
For all entries of x, compute

$$m^o_q(x) = k_q\,\big(K + \mathrm{blkdiag}(\bar R_S,\, \sigma_T^2 I_n)\big)^{-1} \begin{bmatrix} \bar m_S \\ y_T \end{bmatrix}, \quad (\mathrm{AL1a})$$

$$k^o_{qp}(x, x') = k_{qp}(x, x') - k_q\,\big(K + \mathrm{blkdiag}(\bar R_S,\, \sigma_T^2 I_n)\big)^{-1} k'_p, \quad (\mathrm{AL1b})$$

where

$$k_q \equiv \begin{bmatrix} k_{qS}(x, x_S) & k_{qT}(x, x_T) \end{bmatrix}, \qquad k'_p \equiv \begin{bmatrix} k_{Sp}(x_S, x') \\ k_{Tp}(x_T, x') \end{bmatrix}, \quad (\mathrm{AL2a})$$

$$K \equiv \begin{bmatrix} k_{SS}(x_S, x_S) & k_{ST}(x_S, x_T) \\ k_{TS}(x_T, x_S) & k_{TT}(x_T, x_T) \end{bmatrix}. \quad (\mathrm{AL2b})$$

Here, q, p ∈ (S, T), and blkdiag(U, V) is the block diagonal matrix obtained from U ∈ R^{u×u} and V ∈ R^{v×v}.
Output: m^o_q ≡ m^o_q(xq), K^o_qp ≡ k^o_qp(xq, xp)

The design (11) is the optimal Bayesian conditioning mechanism we have been looking for. It forms the update from the pre-posterior, F(fS, fT | yT), to the FPD-optimal, FS-conditioned, posterior, M^o(fS, fT | FS, yT). The exponential structure in (11) constitutes the updating term which incorporates FS. It has the Boltzmann structure that is typical of entropic designs [30], [37].

III. LOCAL AND GLOBAL GAUSSIAN PROCESS REGRESSION MODELLERS

We instantiate the framework in Section II to the case of source and target GP regression tasks [38], [39]. For this purpose, consider a stochastic process, f ∼ F, over a scalar random function, f ∈ R, such that any n-dimensional collection of evaluation points (i.e. regressors), x, induces a joint probability distribution over the function values, f. The GP, f(x) ∼ GP(m_θf(x), k_θf(x, x′)), is conditioned on a known mean function, m_θf: R^{nx} → R, assumed (without loss of generality) to be zero. The known covariance (kernel) function is k_θf: R^{nx×nx} → R. In this way, all joint distributions are specified to be multivariate Gaussian, F(f) ≡ N(m, K). Here, m ≡ m_θf(x) and K ≡ k_θf(x, x) are the known, n-dimensional mean vector, and n×n-dimensional covariance matrix, respectively. We suppress θf—the parameters of m_θf and k_θf—in the sequel for notational simplicity.

This GP, f, is indirectly (i.e. noisily) observed as y at the evaluation points, x, via an additive, uncorrelated, Gaussian noise process. Hence, the joint distribution is

$$F(y \mid f, \theta) = N(f, \sigma^2 I_n), \quad (12a)$$
$$F(f \mid \theta) = N(m, K), \quad (12b)$$

with the GP as prior (12b). Here, σ² is the conditional output-data variance, I_n is the n×n-dimensional identity matrix, and θ ≡ (θf, σ²) are all known parameters. In the rest of this section, we restore the explicit conditioning on θ, as it is relevant to the calculus.

We now adopt this stochastic structure in the isolated (i.e. unconditioned on transferred knowledge) local source task of BTL (Fig. 1(c)). Our purpose is to transfer an output-data predictor via construction of the GP posterior at unobserved test points, as follows.

Remark 1. The isolated local source modeller adopts FS(yS, fS | θS) (12), which is assumed to be strict-sense stationary. Then, the posterior distribution—evaluated at a single test point, x ∈ R^{nx} (now shown explicitly in the conditioning)—is

$$F_S(f_S \mid y_S, \theta_S, x) = N(\bar m_S, \bar k_S),$$

where

$$\bar m_S(x) = k_S (K_S + \sigma_S^2 I_n)^{-1} y_S, \quad (13a)$$
$$\bar k_S(x, x') = k_S(x, x') - k_S (K_S + \sigma_S^2 I_n)^{-1} k'_S, \quad (13b)$$

with k_S ≡ k_S(x, x_S), k′_S ≡ k_S(x_S, x′), and K_S ≡ k_S(x_S, x_S). Here, (x, x′) is a second-order test pair. The posterior predictor of unobserved outputs, yS⋆, for which the (known) test points are xS⋆, is

$$F_S(y_S^\star \mid y_S, \theta_S) = N(\bar m_S, \bar R_S), \quad (14)$$

where R̄S = K̄S + σS² I_{n⋆}.

Proof. See, for example, [40], Chapter 2.
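To make the transferred object concrete, the following is a minimal NumPy sketch of the source predictor (13)–(14). It assumes a user-supplied kernel callable k_S(A, B) that returns the cross-covariance matrix between the input sets A and B (a hypothetical helper, not prescribed by the paper):

```python
import numpy as np

def source_predictor(k_S, x_S, y_S, sigma_S2, x_star):
    """Local source GP output-data predictor (13)-(14): a minimal sketch.

    k_S(A, B) is assumed to return the kernel matrix between input sets A and B.
    Returns the sufficient statistics (m_S_bar, R_S_bar) of the predictor (14).
    """
    n, n_star = len(x_S), len(x_star)
    G = np.linalg.inv(k_S(x_S, x_S) + sigma_S2 * np.eye(n))  # (K_S + sigma_S^2 I_n)^{-1}
    k_star = k_S(x_star, x_S)                                # cross-covariance, test vs. training
    m_S_bar = k_star @ G @ y_S                               # posterior predictive mean, (13a)
    K_S_bar = k_S(x_star, x_star) - k_star @ G @ k_star.T    # posterior covariance, (13b)
    R_S_bar = K_S_bar + sigma_S2 * np.eye(n_star)            # output-data predictive covariance, (14)
    return m_S_bar, R_S_bar
```

In practice a Cholesky factorization would replace the explicit inverse; the pair (m̄S, R̄S) returned here is exactly the message transferred to the target in Fig. 1(c).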
Recall that only the source is a local modeller of the kind above. In contrast, the target modeller augments its model to assess the global source-target system (Fig. 1(c)). Therefore, the joint prior distribution, F(yS⋆, yT, fS, fT | θ), adopted by the global target modeller before processing yT and FS, is specified by

$$F(y_S^\star \mid f_S, \theta) \equiv N(f_S, (\sigma^\star)^2 I_{n^\star}), \quad (15a)$$
$$F(y_T \mid f_T, \theta) \equiv N(f_T, \sigma_T^2 I_n), \quad (15b)$$
$$F(f_S, f_T \mid \theta) \equiv N(0, K), \quad (15c)$$

where

$$K = \begin{bmatrix} K_{SS} & K_{ST} \\ K_{ST}^{\mathsf T} & K_{TT} \end{bmatrix} \quad (16)$$

is the symmetric covariance matrix describing the target's prior beliefs about the interactions between the source and target GP function values, (fS, fT). It is expressed in a block-matrix form, involving three matrix degrees-of-freedom (ᵀ denotes matrix transposition). Specific cases will be considered in Section IV. Also, (σ⋆)² is the target's conditional variance of the source output data at any test point, and σT² is the target's conditional variance of any target output datum.

It remains to compute the target's posterior joint inference of the GP function values, (fS, fT), under its joint model (15), i.e. after it has processed its local data, yT, and the transferred knowledge, FS, respectively.

Proposition 2. The global target modeller elicits the joint prior distribution (15) and processes the source output-data predictor (14). Then, the FPD-optimal posterior distribution (11)—evaluated at a test point x—is

$$M^o(f_S, f_T \mid F_S, y_T, \theta, x) = N(m^o, k^o), \quad (17)$$

where

$$m^o = \begin{bmatrix} m^o_S(x) \\ m^o_T(x) \end{bmatrix}, \qquad k^o = \begin{bmatrix} k^o_{SS}(x, x') & k^o_{ST}(x, x') \\ k^o_{TS}(x, x') & k^o_{TT}(x, x') \end{bmatrix},$$

with m^o_q and k^o_qp being specified in (AL1) for q, p ∈ (S, T).

Proof. See Appendix A-C.

The proposed global target GP regression modeller, supported by FPD-optimal BTL from the local source GP regression modeller, is summarized in Algorithm 1.

[Fig. 2 panels: (a) σT² = 1, aS = 1, aT = 0.8; (b) σT² = 1, aS = 1, aT = 1; (c) aT = 1, σS² = 1, σT² = 1; (d) aT = 1, σS² = 0.01, σT² = 1; each panel plots the MAE of SNT, TNT, ICM, FPDa and FPDb.]
Fig. 2: Synthetic data experiments. The MAE (21) vs. the conditional output-data variance of the source, σS², for the target weight (a) aT = 0.8, and (b) aT = 1. The MAE vs. the source weight, aS, for the conditional output-data variance of the source (c) σS² = 1 and (d) σS² = 0.01. The results are averaged over 5000 independent simulation runs, with the solid line and the shaded area being the median and the interquartile range, respectively.

Algorithm — Description
Source No Transfer (SNT) — Remark 1 (Fig. 1a)
Target No Transfer (TNT) — Remark 1 (Fig. 1a)
Intrinsic Coregionalization Model (ICM) — [5] (Fig. 1b)
Fully Probabilistic Design (FPDa) — Algorithm 1 (Fig. 1c)
Fully Probabilistic Design (FPDb) — [27]
TABLE I: Algorithms compared in Sections IV and V. FPDa is the newly proposed BTL algorithm.

IV. SYNTHETIC DATA EXPERIMENTS

In this section, we investigate the performance of our algorithm (Algorithm 1) against a number of alternative transfer learning algorithms, in particular focussing on distinctions between the analysis model underlying each algorithm and the synthesis model used to generate the synthetic data. We illustrate the following key properties of Algorithm 1:
• the ability to achieve robust knowledge transfer—i.e. rejection of imprecise source knowledge—due to successful transfer of all moments of FS (14);
• the ability to process known correlation between the source and target latent functions, (fS, fT), via the target's specification of the covariance structure, K (16);
• the experimental bounding of the performance of Algorithm 1, which holds under the ideal condition when the synthesis and analysis models are equal; and
• transfer of source model structure via FS (14), leading to improved positive transfer of Algorithm 1 under mismatch between the source and target models.

The first three issues are explored in Section IV-A, and the fourth in Section IV-B. Throughout, the synthesis model is the (rank-1) intrinsic coregionalization model (ICM) [5],

$$u \sim GP(0, k_{\theta_u}), \qquad y_S \sim N(a_S u(x), \sigma_S^2), \qquad y_T \sim N(a_T u(x), \sigma_T^2), \quad (18)$$

and so xS = xT = x. Here, fS(x) = aS u(x) and fT(x) = aT u(x). It follows that the covariance matrix (16) is

$$K = B \otimes k_{\theta_u}(x, x), \quad (19)$$

where

$$B = \begin{bmatrix} a_S \\ a_T \end{bmatrix} \begin{bmatrix} a_S \\ a_T \end{bmatrix}^{\mathsf T} \equiv \begin{bmatrix} b_{SS} & b_{ST} \\ b_{ST} & b_{TT} \end{bmatrix} \in \mathbb{R}^{2 \times 2} \quad (20)$$

is the coregionalization matrix [41], and ⊗ denotes the Kronecker product of the matrix arguments. We adopt the mean absolute error (MAE) as the performance measure:

$$\mathrm{MAE} = \frac{1}{n^*} \sum_{i=1}^{n^*} \big| f_T(x_i^*) - m^o_T(x_i^*) \big|, \quad (21)$$

where fT(x*_i) is the true target function value (Fig. 1), m^o_T(x*_i) is the lower sub-vector of the posterior predictive mean estimate (AL1a), |·| is the absolute value, and (x*_i)_{i=1}^{n*} are the n* ≥ 1 test points.
A. The bounding performance of our BTL algorithm (Fig. 2)

Settings: We choose k_θu in (19) as the squared exponential (covariance) kernel function with the parameters θu ≡ (σu², lu²) (the full list of kernel functions involved in Section IV is defined in Table II). The complete parameterization of (18) is therefore θ = (σS², σT², aS, aT, σu², lu²) = (1, 1, 0.8, 1, 2, 0.4). Furthermore, n = 64 scalar input data (i.e. nx = 1) are generated via x ∼ U(−3.5, 3.5), where U(a, b) is the uniform distribution on the open interval (a, b). The n* = 200 test points, (x*_i)_{i=1}^{n*}, are placed on a uniform grid in the interval (−5, 5).

Performance: We compare the performance of the algorithms in Table I as we adapt the source knowledge quality via σS² (18) and the correlation between the source and target latent functions via aS in (20), holding all other parameters in θ constant. In this experiment, we eliminate misspecification between the ICM synthesis model (18) and the analysis models of the various algorithms in Table I; i.e. they all have perfect knowledge of θ. Fig. 2(a) and (b) show the MAE for target weights, aT = 0.8 and 1.0, respectively, as a function of source variance, σS². The TNT algorithm (i.e. isolated target task) defines the baseline performance since it does not depend on the source knowledge, and so it is invariant to all source settings. If any of the BTL algorithms (FPDa and FPDb) yields an MAE below or above this level, it is said to deliver positive or negative transfer, respectively.

[Fig. 3 panels: (a) SNT, (b) TNT, (c) |ICM − TNT|, (d) |FPDa − TNT|; each panel is a matrix of MAE values over the synthesis-model kernel (rows) and the analysis-model kernel (columns), with the kernels of Table II (C, L, P, Co, SE, RQ, M).]
Fig. 3: Synthetic data experiments. (a) The MAE (21) of the SNT algorithm utilizing the analysis model which is equivalent to the source part of the ICM synthesis model. (b) The MAE of the TNT algorithm for all combinations of its analysis model kernel and the ICM synthesis model kernel (Table II). In (a) and (b), lower is better. (c) and (d): The absolute value of the differential MAE of the ICM algorithm and the FPDa algorithm, respectively, for all combinations of analysis model kernel and ICM synthesis model kernel (the absolute differential is with respect to the MAE of the TNT algorithm in (b), for ease of comparison of the positive transfer attained (always) by the respective algorithms). In (c) and (d), higher is better. The results are averaged over 1000 independent simulation runs. The asterisk ∗ highlights the cases where the FPDa algorithm achieves greater positive transfer than the ICM algorithm.

Covariance function — Expression
Constant (C) — k(x, x′) = σ²
Linear (L) — k(x, x′) = σ² x x′
Polynomial (P) — k(x, x′) = (σ² x x′ + γ)^d
Cosine (Co) — k(x, x′) = σ² cos(2π Σ_{i=1}^{nx} (x_i − x′_i)/l²)
Squared Exponential (SE) — k(x, x′) = σ² exp(−½ r(x, x′)²/l²)
Rational Quadratic (RQ) — k(x, x′) = σ² (1 + r(x, x′)²/(2 l² α))^{−α}
Matérn (ν = 3/2) (M) — k(x, x′) = σ² (1 + √3 r(x, x′)) exp(−√3 r(x, x′))
TABLE II: Kernel functions adopted in this paper; see [40]. Here, σ² is the signal variance, γ is the offset, d is the polynomial degree, l² is the length-scale, α is the fluctuation parameter, and r(x, x′) ≡ ||x − x′|| is the Euclidean distance between x and x′. Note that the Matérn kernel is assessed only in the case of the smoothness parameter ν = 3/2.
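For concreteness, a few of the Table II kernels can be written directly as vectorized NumPy functions; this is a minimal sketch for the scalar (1-D) inputs used in Section IV, with default hyperparameters set to the values quoted in Section IV-B:

```python
import numpy as np

def _r(x, x_prime):
    """Pairwise distance r(x, x') for scalar (1-D) inputs."""
    return np.abs(x[:, None] - x_prime[None, :])

def squared_exponential(x, x_prime, sigma2=1.0, l2=0.2):
    return sigma2 * np.exp(-0.5 * _r(x, x_prime) ** 2 / l2)

def rational_quadratic(x, x_prime, sigma2=1.0, l2=0.2, alpha=1.0):
    return sigma2 * (1.0 + _r(x, x_prime) ** 2 / (2.0 * l2 * alpha)) ** (-alpha)

def matern_32(x, x_prime, sigma2=1.0):
    r = _r(x, x_prime)
    return sigma2 * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)
```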

If any algorithm saturates at this level, it is said to achieve robust transfer. The source MAE of the SNT algorithm (also an isolated task) is computed in terms of the quantities fS and m^o_S, and, therefore, is unrelated to the target MAE (21). Indeed, we present this source MAE to track the influence of changing σS² and aS on the performance of the SNT algorithm. The latter provides the transferred source output-data predictor to the FPDa and FPDb algorithms. Consequently, the performance of SNT will impact FPDa and FPDb, as explained in the next paragraph.

The ICM algorithm delimits the optimal performance for all σS², since it adopts the ICM synthesis model (18) as its analysis model (i.e. misspecification is completely eliminated). Our FPDa algorithm (Algorithm 1) very closely follows the performance of the ICM algorithm (Fig. 2), despite (i) adopting incomplete modelling, and (ii) processing source statistics rather than the raw source output data, yS; i.e. FPDa is a robust BTL algorithm. This improves on our previous FPDb algorithm, which achieves positive transfer only for σS² < σT² (high-quality source knowledge) but negative transfer for σS² > σT² (low-quality source knowledge); i.e. it is non-robust.

Fig. 2 depicts the MAE for fixed target weight aT = 1, and varying source weight, aS (Fig. 2(c) and (d)). The FPDa algorithm again achieves close tracking of the ICM algorithm for all the parameter settings. We note that the critical point where the performance curves converge occurs when σS² = σT² = 1. For this setting, we explore the influence of aS in Fig. 2(c), holding aT = 1. The main feature here is the symmetry of all the performance curves: around aS = 0 for FPDa, ICM and SNT; around aS = 1 for FPDb; and (trivially) everywhere for TNT. In all cases, this is due to the fact that aS quadratically enters the correlation structure, B (20), of the global target modeller. When σS² ≠ σT² (e.g. the case in Fig. 2(d) where σS² ≪ σT²), the main point to note is that FPDb cannot benefit from the improved positive transfer available from this more precise source. FPDa does benefit from this source precision because it, uniquely, exploits the target's correlation structure, B.

In summary, Fig. 2 illustrates the fact that the ICM algorithm—with exactly matched synthesis and analysis models—delimits the best predictive performance that can be attained by the FPDa algorithm. That said, misspecification of the ICM analysis model—as will occur with probability one in real-data settings—will undermine its performance, and provide opportunity for FPDa to outperform it. This is so, since FPDa allows source and target to elicit their models independently (Fig. 1(c)). We demonstrate this crucial model robustness feature of our BTL algorithm in the next section.

B. Transfer of the source's (local) analysis model (Fig. 3)

In our BTL framework (i.e. FPDa), the source modeller independently chooses a different model from the global target modeller to drive its source learning task. The source distribution, FS (14), therefore encodes not only the information from

the local source data, yS, but also the source model structure. In this section, we demonstrate the features of this approach under conditions of mismatch (i.e. disagreement) between the source's and target's model of yS. Specifically, we will now design an experiment where the source modeller more closely captures the synthesis model than the global target modeller does; i.e. the source exhibits local expertise, as is often the case in practice.

Settings: The purpose of this study is to explore the impact of kernel structure mismatch between the global target's analysis model and the ICM synthesis model. In all these cases, we arrange for the source analysis model to match the synthesis model, achieving local source expertise (above). The structures of the synthesis and analysis models are varied only via the kernel function, k_θu(·, ·) (19), and its parameters, θu, for the seven cases in Table II. The specific choices of these parameters in our current study are as follows: (σ², γ, d, l², α) ≡ (1.0, 1.0, 3, 0.2, 1.0). Recall that—in the previous simulation study (Fig. 2)—the remaining parameters, (σS, σT, aS, aT), influence the transfer learning properties of the ICM and FPDa algorithms. We choose the settings, σS = σT = 1 and aS = aT = 1, where all the transfer algorithms benefit equally from the source data, and so the only differential benefit between the algorithms will be in respect of the quality of the analysis model adopted by the source in computing its transferred knowledge.

Performance: Fig. 3 shows the MAE for all possible combinations of the kernel functions (Table II) in the ICM synthesis model and in the TNT, ICM, and FPDa analysis models of the respective algorithms in Table I (i.e. these are 7 × 7 = 49 kernel combinations in toto). Since the SNT algorithm is arranged to capture perfectly the source part of the ICM synthesis model, it requires only a single column in Fig. 3(a). We do not present the FPDb algorithm, since—with the currently adopted experimental settings—it reduces to the FPDa algorithm (Fig. 2(c)). The TNT algorithm delineates the baseline MAE performance of no transfer (as it did in Section IV-A), allowing us to assess the extent of positive transfer for the various kernel settings in ICM and FPDa. Note that both algorithms stay in the positive transfer regime for all combinations of the synthesis and analysis models, thus benefiting from the source knowledge even in the misspecification cases. For this reason, we have illustrated the absolute value of the differential MAE in Fig. 3(c) and (d) (i.e. higher values are better in these two sub-figures). Furthermore, the red asterisks indicate those kernel combinations for which our FPDa algorithm outperforms the ICM algorithm. Fig. 3(c) and (d) reveal that the ICM algorithm slightly outperforms the FPDa algorithm on all diagonal entries, i.e. where there is no mismatch between the synthesis and analysis models. This is consistent with our finding in Section IV-A, namely that the ICM algorithm achieves the optimal performance in this (unrealistic) case of perfect analysis-synthesis model matching. Conversely, the FPDa algorithm demonstrates improved robustness to kernel structure misspecification when compared to the ICM algorithm, particularly when FPDa adopts a more complex analysis model—i.e. SE, RQ, M (Table II)—and the synthesis model is simple, i.e. C, L, P, Co. This can be seen from the upper-right yellow quadrant in Fig. 3(d). These more complex kernel functions for analysis are, indeed, a common choice for real data, where it is often hard to choose a kernel function based on a mere visual inspection only. The remaining analysis-synthesis combinations—especially those under the diagonal in Fig. 3(d)—yield FPDa predictive performances in the target that are almost indistinguishable from those of the ICM algorithm.

V. REAL-DATA EXPERIMENTS

The Swiss Jura geological dataset³ records the concentrations of seven heavy metals—measured in units of parts-per-million (ppm)—at 359 (spatial) locations in a 14.5 km² region. This dataset has been widely studied in the context of multitask learning (see, for example, [5]). In our study, the objective is to predict concentrations of the less easily detected metal, cadmium (Cd), based on knowledge of the more easily detected metal, nickel (Ni). In order to demonstrate our BTL approach (Algorithm 1) in this context, the source task learns the Ni concentrations, and transfers its probabilistic output-data predictor to the target learning task of Cd concentration prediction. Note that we perform transfer with only this Ni source task, whereas, in [5], zinc (Zn) concentrations were also processed at the source.

³ https://sites.google.com/site/goovaertspierre/

Alignment with the FPD framework: The SNT algorithm processes the Ni concentrations measured at all nS = 359 locations (i.e. source data, yS) and uses these to compute the Ni output-data predictor (14). The FPDa algorithm processes nT = 259 of the measured Cd concentrations (i.e. target data, yT), along with the transferred Bayesian predictive moments, m̄S and R̄S, of the source Ni output-data predictor (14), in order to improve the prediction of the Cd concentrations at the n* = 100 hold-out locations, x* ≡ (x*_i)_{i=1}^{n*}, with spatial coordinates x*_i ≡ (x*_{x,i}, x*_{y,i}). The aim is to compute m^{o*}_T ≡ m^o_T(x*) and K^{o*}_{TT} ≡ k^o_{TT}(x*, x*), where m^o_T and k^o_{TT} are the lower sub-vector and lower-right sub-matrix of (AL1a) and (AL1b), respectively. In the Ni source task (SNT), the MAE for the GP predictive mean (13a) was computed across a wide range of candidate kernel functions, weighting the inputs, xS, via automatic relevance determination (ARD) [40]. The optimal choice was the rational quadratic kernel (Table II). In the isolated Cd target task (also with ARD), the Matérn kernel proved to be optimal, and was adopted for all the target learning algorithms (Table I).

Parameter learning: The notion of a synthesis model (of the type in (18)) does not, of course, arise in this real-data context. Instead, we adopt maximum likelihood (ML) estimation [40] in order to learn the parameters of the analysis models underlying the various algorithms in Table I. For this purpose, we use the iterative L-BFGS-B algorithm [42]—widely available as a library tool—to compute a local optimum of the log-marginal likelihood surface. The maximum number of iterations is limited to 20,000. We initialize the parameters (Table II) of the rational quadratic kernel with (σu², lu², αu) ≡ (1.0, 1.0, 1.0) and the parameters of the Matérn kernel with (σu², lu²) ≡ (1.0, 1.0), where the value of the length-scale, lu², now applies to both input dimensions of x in accordance with the ARD mechanism.
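The following is a minimal sketch of this ML step for a single GP task, assuming a kernel builder kernel(x, x, params) in the style of Table II; the choice to optimize on the log scale, with the noise variance as the first parameter, is an illustrative convention and is not prescribed by the paper (only the use of L-BFGS-B and the 20,000-iteration cap are):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, x, y, kernel):
    """Negative GP log-marginal likelihood; parameters optimized on log scale."""
    sigma2_noise, *kernel_params = np.exp(log_params)
    K = kernel(x, x, kernel_params) + sigma2_noise * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(x) * np.log(2 * np.pi)

def fit_gp(x, y, kernel, init_log_params):
    """Local optimum of the log-marginal likelihood via L-BFGS-B."""
    result = minimize(neg_log_marginal_likelihood, init_log_params, args=(x, y, kernel),
                      method="L-BFGS-B", options={"maxiter": 20000})
    return np.exp(result.x)
```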

[Fig. 4: a 2 × 4 grid of spatial maps over (x_x, x_y) in km; columns: SNT, TNT, ICM, FPDa; top row: marginal predictive mean maps; bottom row: marginal predictive standard deviation maps.]
Fig. 4: Real-data experiments. Marginal concentration predictions of Ni (ppm) via the SNT algorithm, and of Cd via the TNT, ICM, and FPDa algorithms (Table I) in a region of the Swiss Jura. Top: the marginal predictive mean map at super-sampled spatial coordinates, x* ≡ (x*_x, x*_y); bottom: the associated marginal predictive standard deviation maps. The red crosses are the locations of observed Ni and Cd concentrations used for training (359 for Ni in the source task (first column), and 259 for Cd in the target task (remaining columns)).

Algorithm — MAE
TNT — 5.9273e−01 ± 0.0001e−01
ICM — 5.2808e−01 ± 0.0003e−01
FPDa — 4.9966e−01 ± 0.0005e−01
FPDb — 1.9510e+01 ± 0.0003e+01
TABLE III: Real-data experiments. The MAE (21) of predicted cadmium concentrations for the TNT, ICM, FPDa, and FPDb algorithms in the Jura dataset. The results are averaged over 10 runs with different initialization of the estimated unknown parameters, presenting the mean and standard deviation.

The coefficients, aS and aT, parameterizing the coregionalization matrix, B (20), are initialized as (aS, aT) = (0.1, 0.1). For this Jura dataset, we found that the adopted ML procedure was insensitive to perturbations of these initial parameters.

Performance: To assess quantitatively the prediction performance of the algorithms in Table I, we compute the MAE (21) across the n* = 100 hold-out locations where we wish to predict cadmium. We find that the FPDa algorithm delivers positive transfer (i.e. it has a lower MAE than the baseline TNT algorithm), and, importantly, it outperforms the ICM algorithm. This improves on our former FPDb algorithm, which suffers negative transfer because—as explained in Section IV—it is not equipped to adopt and learn the target's correlation structure via B (20). Our FPDa performance is close to that reported in [5], despite the fact that we transfer predictive source knowledge only from Ni concentrations, whereas Zn concentrations are also processed in [5], as already noted above. Returning to our own FPDa algorithm and the alternatives in Table I, we summarize our findings in Fig. 4. In the top row, we illustrate the spatial maps of the marginal predictive mean concentration of Ni (SNT) and Cd (TNT, ICM, FPDa), equipping these with uncertainty represented by the corresponding marginal predictive standard deviation maps (bottom row). These have been computed on a uniform 50×50 spatial grid of the input, x* ≡ (x*_x, x*_y). In this sense, the current study can be characterized as an image reconstruction task, driven by non-uniformly sampled and noisy measurements (i.e. pixel values), with registration between the source and target images. The Matérn kernel provides the required prior regularization that induces spatial smoothness in the (Cd) reconstruction [43], [44]. Note that the FPDb algorithm is not illustrated in Fig. 4, since it performs poorly (Table III).

Visual inspection of the TNT, ICM and FPDa columns of Fig. 4 reveals that our algorithm (FPDa) localizes the Cd deposits more sharply than either the ICM algorithm, or (unsurprisingly) the TNT algorithm, which does not benefit from any Ni source learning. In particular, note the horizontal blurring of the Cd predictive mean map for ICM (third column, top), an artefact that is avoided in FPDa (fourth column, top). This localization of the FPDa prediction supports the exploration of Cd deposits in a focussed area around x* = (3.8, 4.6) km. In contrast, the ICM predictive mean map does not sufficiently resolve the x_x coordinate to support an exploration decision in this area.

VI. DISCUSSION

We will now assess our FPD-optimal BTL framework (i.e. Algorithm 1 (FPDa)) in the context of the conventional Bayesian multitask learning framework (via the ICM algorithm). We will focus on a number of specific themes that emerge from the evidence in Sections IV and V.

Transfer of moments: In the previously considered Gaussian settings [23], [24], [27] of our FPD-optimal BTL framework, the second-order moments of the output-data predictor, FS,

were not successfully transferred, leading to negative transfer. This is also true of FPDb in this paper (Fig. 2(a) and (b) and Table III). In all those papers, the order of arguments in the KLD was the reverse of the one proposed in this paper (11), a reversal which is essential for the robust transfer property of our FPDa algorithm. This reversal has ensured the successful transfer of the second-order moment, R̄S, of (14), even in the current Gaussian setting, yielding the block-diagonal covariance structure (AL1). Equipping the FPDa algorithm with R̄S delivers robust knowledge transfer, as demonstrated in the following remark.

Remark 2. Consider (AL1) for q = p = T (i.e. the target after transfer) in the limit where the source knowledge becomes uniform (and, therefore, non-informative [45]). In this case,

$$m^o_T(x) \to \bar m(x), \qquad k^o_{TT}(x, x') \to \bar k(x, x'),$$

being the moments of the isolated TNT learning task. This rejection of non-informative source knowledge by the target in our FPDa algorithm is what we mean by robust transfer.

The role of R̄S in effecting robust transfer—as well as in strengthening the positive transfer above threshold which we observed in Fig. 2(a) and (b)—will be studied more technically in a forthcoming paper.
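A brief sketch of why this limit holds (an informal argument only, anticipating the fuller treatment promised above): let the source knowledge become uniform by taking R̄S = ρ I with ρ → ∞ in (AL1). Standard block inversion of the augmented matrix then gives

$$\big(K + \mathrm{blkdiag}(\bar R_S, \sigma_T^2 I_n)\big)^{-1} = \begin{bmatrix} K_{SS} + \rho I & K_{ST} \\ K_{TS} & K_{TT} + \sigma_T^2 I \end{bmatrix}^{-1} \;\longrightarrow\; \begin{bmatrix} 0 & 0 \\ 0 & (K_{TT} + \sigma_T^2 I)^{-1} \end{bmatrix},$$

so that, for q = T,

$$m^o_T(x) \to k_{TT}(x, x_T)\,(K_{TT} + \sigma_T^2 I)^{-1} y_T = \bar m(x), \qquad k^o_{TT}(x, x') \to k_{TT}(x, x') - k_{TT}(x, x_T)\,(K_{TT} + \sigma_T^2 I)^{-1} k_{TT}(x_T, x') = \bar k(x, x'),$$

which are exactly the isolated (TNT) moments of Remark 1 applied to the target task.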
Weighting of the transferred knowledge: Recall that the matrix, K (16)—which is instantiated in the ICM case in (19) and (20)—expresses the target's interaction with the source. Inspecting (AL1) and (AL2), we also see that K controls the weighting mechanism in the FPDa algorithm, tuning the influence of the transferred knowledge, FS, on the optimal target conditional, M^o(fT | FS, yT, K, ·) (11). This is seen in the fact that its target moments, m^o_T and K^o_TT, are functions of K. The study of this behaviour can be formalized by adopting similar limit arguments as in Remark 2. Furthermore, the optimization of K—which is a hyperparameter of M^o(fT | FS, yT, K, ·) (see (17))—via maximum likelihood estimation (Section V) ensures that this weighting is data-driven (i.e. adaptive). This is also an advance beyond our previous proposals for FPD-optimal BTL [23], [24], [27], in which no transfer weighting mechanism was induced. This progress has been achieved only because the target modeller is an extended modeller of the interacting source-target tasks, unlike the situation in our previous work where the target models only its local task. Technical details relating to this inducing of a data-driven transfer weighting will be developed further in future work.

Robustness to analysis-synthesis model mismatch: In simulation contexts—such as in Section IV—the data are simulated from the chosen synthesis model, while each assessed algorithm is consistent with an analysis model. This provides an opportunity to assess the robustness of each algorithm to model misspecification, as studied in Section IV-B. We have already noted that the ICM algorithm achieved a performance optimum across all the studied algorithms in Section IV since its (ICM) analysis model equals the (ICM) synthesis model used to generate all data in that study. Nevertheless, the FPDa algorithm performs almost as well (Fig. 2), despite the fact that it does not rely on complete instantiation of the synthesis model. Indeed, as explained in Section II, FPD-optimal BTL does not require completion of the interaction model between the source and target at all, and has therefore proved to be more robust to ignorance of the synthesis model than the ICM algorithm has. In real-data applications—such as in Section V—there is, of course, no synthesis model. The benefit of our incompletely modelled approach in terms of robustness has been demonstrated: FPDa can provide a closer fit to the data than ICM, as presented in Fig. 4 and Table III. The SNT task involves a second independent local modeller (Fig. 1(c)) which constructs its source output-data predictor, FS (14), using the same source data that the ICM algorithm processes. The source's local modelling expertise (Section IV-B) has enriched the knowledge transfer, and provided a supplementary learning resource which is not available to ICM.

Computational load of our BTL algorithm: The principal computational cost of all the algorithms in Table I resides in inverting the linear systems involved in the standard Gaussian (conditional) data update (13). This cost scales cubically with the number of data in the source and target tasks, respectively, i.e. O(n_q³), q ∈ (S, T) [46]. Owing to the sequential nature of the knowledge processing in FPDb (i.e. source processing of raw data yS, then target processing of the transferred source Bayesian predictor, FS, along with the target raw data, yT), the net computational load of FPDb is O(n_S³) + O(n_T³). In contrast, our new BTL algorithm (FPDa) shares the property with ICM of bidirectional knowledge flow (Fig. 1(c)). This leads to the matrix augmentation form of the inverse systems in (AL1) and, correspondingly, a more computationally expensive—O((n_S + n_T)³)—algorithm. There is a rich literature on efficient inversions for GP regression, exploiting possible reductions (notably sparsity) in the covariance matrix, K (16). These same reductions have been applied in multitask GP learning [46], such as in ICM. Since our new FPDa BTL algorithm shares the same augmented matrix structure (AL1) as ICM, it, too, can benefit from these reductions.

Source knowledge compression: Conventional multitask learning—such as in Fig. 1(b)—requires the communication of all nS of the source's extended (training) data, (xS, yS), to the processor (either centrally or in some distributed manner), i.e. the transferred message size (quantified as the number of scalars) is nS(nxS + 1). Conversely, our FPDa algorithm (Fig. 1(c)) is a truly Bayesian transfer learning algorithm, whose communication overhead resides only in the transfer of the source's sufficient statistics, (m̄S, R̄S), of its output-data predictor at predictive inputs, xS⋆, i.e. FS(yS⋆ | yS, xS, xS⋆, θS) = N(m̄S, R̄S) (14). Therefore, the transferred knowledge is fully encoded by these source sufficient statistics without the requirement to transfer the raw extended source data, (yS, xS, xS⋆). The message size for our BTL algorithm is therefore nS⋆(1 + (nS⋆ + 1)/2). It follows that the condition which must hold for compression to be achieved in our BTL algorithm versus conventional raw data transfer is⁴ nS⋆ < 2 nxS − 1, i.e. the number of predictive points must be lower than two times the dimension of each input (minus one).

⁴ Here, we quote the implied inequality for the special case, nS⋆ ≡ nS.

This condition expresses the objective of avoiding dilution of source knowledge in the predictor [47]. Note also that the source sufficient statistics are functions of the source GP kernel structure, kS (Remark 1). This can be highly expressive—with many degrees of freedom—in cases of local source expertise. We have provided a preliminary study of the benefit to the target of a distinct source covariance structure (Section IV-B, and see also the third discussion theme above). Indeed, far more expressive covariance structures in kS—involving weighted combinations of canonical kernel functions (Table II)—can also be adopted by the source [48]. In all cases, our Bayesian knowledge transfer compresses this resource into dimension-invariant sufficient statistics for subsequent processing by the target task.

———

We now summarize key findings in respect of our BTL framework (and its incomplete modelling of interaction between the independent source and target modellers (Fig. 1(c))) versus conventional multitask learning (and its complete and unitary modelling of all source-target variables (Fig. 1(b))).
• Our FPD-optimal BTL algorithm (Fig. 1(c)) is truly a transfer learning algorithm—as opposed to a multitask learning algorithm—because it does not require complete modelling of source-target interactions. Among the consequences we discovered in this paper are (i) the possibility to avoid problems of model misspecification, and (ii) the opportunity to transfer potentially compressed source sufficient statistics of the data-predictive distribution, FS, instead of the raw data themselves.
• The source and target modellers (Fig. 1(c)) are independent, and, indeed, our BTL algorithm is truly a multiple model algorithm, in contrast to the single global modeller of multitask learning (Fig. 1(b)). We have shown that a direct consequence is the ability of our algorithm to transfer source knowledge enriched by an expressive local kernel structure.
• Our BTL algorithm delegates the processing of local data to the local (source) task before the resulting stochastic message, FS, is transferred for subsequent processing by the target task (along with its local (target) data). This opens up the possibility for distributed and parallel processing in a way that is not intrinsic to multitask learning.
• In our formulation of BTL (Fig. 1(c)), we have exploited the opportunity to transfer knowledge to a target task that is, itself, a global modeller. The paper has presented evidence to show that this extension of the target ensures that unreliable source knowledge is rejected (i.e. robust transfer is achieved (Remark 2)), and our positive transfer above threshold is as good as that of conventional multitask learning, such as ICM.

VII. CONCLUSION

A vulnerability of conventional multitask learning arises from the fact that there must exist a global modeller of all the tasks in the system. This imposes a requirement on the global modeller to capture task interactions accurately, and performance is undermined otherwise. Furthermore, it is intrinsic to that framework that the global modeller must process raw data from the remote sources. This potentially incurs a communication overhead and forces the target to process remote source data about which it may lack local expertise. Against this, the complete nature of multitask modelling ensures its optimal performance in the (admittedly unlikely) case where task interactions are accurately modelled.

The FPD-optimal Bayesian transfer learning (BTL) framework developed and tested in this paper has achieved important progress beyond the conventional state-of-the-art above. Its key advance is that it does not require elicitation of a model of dependence between the interacting tasks (a defining aspect of BTL, in our opinion, called incomplete modelling), and so our approach avoids the misspecification that inevitably arises in conventional multitask learning. Instead, it chooses the conditional target distribution—from among the uncountable cases consistent with the partial model and with the knowledge constraints—in a decision-theoretically optimal manner. A number of important practical benefits flow from this. Firstly, the local source processes its local data—exploiting its local modelling expertise—into a Bayesian (i.e. probabilistic) predictor. Secondly, only the sufficient statistics of this predictor need be transferred to the target. Thirdly, the target then sequentially has the opportunity to process this source knowledge along with its local target data. Fourthly, our framework allows this target to be a global modeller of both tasks, ensuring that optimal positive transfer is preserved in the case specified at the end of the previous paragraph. The success of our algorithm hinges on its ability to transfer all moments of the source predictor, an advance beyond our earlier variants of BTL, and achieved via careful specification of the KLD objective (11). The source covariance matrix, R̄S, has been vital in knowledge-driving the weighting attached to the transferred predictor by the target. This has obviated the need for hierarchical relaxation of the target model which was necessary in our previous work in order to learn this weight, and which incurred relatively expensive variational approximation [25], [28].

The FPD-optimal BTL design of the target's source-conditional learning (11) can be interpreted as a binary operator, closed within the class of probability distributions. It operates (non-commutatively) on the transferred source output-data predictor, FS, and on the target's posterior distribution, F(fS, fT | yT), yielding the optimal-knowledge-conditional target posterior inference, M^o(fS, fT | FS, yT). This closure of our BTL operator within the class of probability distributions recommends it in a wide range of continual learning tasks [43], [49]. For instance, in incompletely modelled networks, the requirement for all BTL knowledge objects to be distributions ensures that local computational resources onboard local nodes are exploited in processing local data into these local distributions. Furthermore, target nodes can recursively act as source nodes in a continual learning process, so that probabilistic decision-making is distributed across a network, subject to specification of the network architecture. Such applications of our work in Bayesian networks—and, indeed, in deep learning contexts [50]—can provide a rich opportunity for our

APPENDIX A
PROOFS

A. Preliminaries

Lemma 1. Let us consider the following factorized, jointly Gaussian model:
\[
F(y\,|\,f) \equiv \mathcal{N}(y;\, f,\, \Sigma), \qquad
F(f) \equiv \mathcal{N}(f;\, m,\, K).
\]
Then, the joint distribution is
\[
F(f, y) = \mathcal{N}\!\left(
\begin{bmatrix} f \\ y \end{bmatrix};
\begin{bmatrix} m \\ m \end{bmatrix},
\begin{bmatrix} K & K \\ K & K + \Sigma \end{bmatrix}
\right). \tag{22}
\]

Proof. The proof follows from standard analysis.

Lemma 2. Let us consider a slight generalization of the Gaussian distribution in (22):
\[
F(f, y) \equiv \mathcal{N}\!\left(
\begin{bmatrix} f \\ y \end{bmatrix};
\begin{bmatrix} \bar{f} \\ \bar{y} \end{bmatrix},
\begin{bmatrix} K & K \\ K & K + \Sigma \end{bmatrix}
\right).
\]
Then, the conditional and marginal densities are
\[
F(f\,|\,y) = \mathcal{N}(f;\, \bar{m},\, \bar{K}), \qquad
F(y) = \mathcal{N}(y;\, \bar{y},\, K + \Sigma),
\]
where
\[
\bar{m} = \bar{f} + K(K + \Sigma)^{-1}(y - \bar{y}), \qquad
\bar{K} = K - K(K + \Sigma)^{-1}K.
\]

Proof. The proof follows from standard analysis.
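The following is a small numerical sanity check of Lemmas 1 and 2 (an illustrative sketch only; the sampling model, dimensions and tolerances are arbitrary assumptions, not part of the original derivation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample from the factorized model F(f)F(y|f) of Lemma 1 and check that the joint
# covariance blocks of (22) are reproduced; then evaluate the conditional moments
# given by Lemma 2.
n = 2
m = np.array([0.3, -0.7])                      # prior mean
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)                    # a valid prior covariance
Sigma = 0.25 * np.eye(n)                       # observation-noise covariance

f = rng.multivariate_normal(m, K, size=500_000)
y = f + rng.multivariate_normal(np.zeros(n), Sigma, size=500_000)

# Lemma 1: Cov[f, y] = K and Cov[y] = K + Sigma (up to Monte Carlo error)
joint_cov = np.cov(f.T, y.T)
print(np.allclose(joint_cov[:n, n:], K, atol=5e-2),
      np.allclose(joint_cov[n:, n:], K + Sigma, atol=5e-2))

# Lemma 2: conditional moments of f given an observed y
y_obs = np.array([1.0, 2.0])
m_bar = m + K @ np.linalg.solve(K + Sigma, y_obs - m)
K_bar = K - K @ np.linalg.solve(K + Sigma, K)
print(m_bar, np.diag(K_bar))
```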
B. Proof of Proposition 1

By substituting the constrained unknown and ideal models, (5) and (7) respectively, into (3), we can write
\[
\begin{aligned}
D(M\,\|\,M^{I})
&= \int F(y_S^{\star}|f_S,\theta)\, M(f_S,f_T|F_S,y_T,\theta)\,
   \log\frac{F(y_S^{\star}|f_S,\theta)\, M(f_S,f_T|F_S,y_T,\theta)}
            {F_S(y_S^{\star}|y_S)\, F(f_S,f_T|y_T,\theta)}\,
   \mathrm{d}y_S^{\star}\, \mathrm{d}f_S\, \mathrm{d}f_T \\[4pt]
&= \int M(f_S,f_T|F_S,y_T,\theta)
   \left[\log\frac{M(f_S,f_T|F_S,y_T,\theta)}{F(f_S,f_T|y_T,\theta)}
   + \int F(y_S^{\star}|f_S,\theta)\,
     \log\frac{F(y_S^{\star}|f_S,\theta)}{F_S(y_S^{\star}|y_S)}\,
     \mathrm{d}y_S^{\star}\right] \mathrm{d}f_S\, \mathrm{d}f_T \\[4pt]
&= \int M(f_S,f_T|F_S,y_T,\theta)\,
   \log\frac{M(f_S,f_T|F_S,y_T,\theta)}
            {F(f_S,f_T|y_T,\theta)\exp\{-D(F\,\|\,F_S)\}}\,
   \mathrm{d}f_S\, \mathrm{d}f_T
   + \log c_{M^{o}} - \log c_{M^{o}} \\[4pt]
&= \int M(f_S,f_T|F_S,y_T,\theta)\,
   \log\frac{M(f_S,f_T|F_S,y_T,\theta)}{M^{o}(f_S,f_T|F_S,y_T,\theta)}\,
   \mathrm{d}f_S\, \mathrm{d}f_T - \log c_{M^{o}},
\end{aligned}
\]
where the (fS-conditioned) KLD from F(yS⋆ | fS, θ) to FS(yS⋆ | yS) is given by
\[
D(F\,\|\,F_S) = \int F(y_S^{\star}|f_S,\theta)\,
\log\frac{F(y_S^{\star}|f_S,\theta)}{F_S(y_S^{\star}|y_S)}\, \mathrm{d}y_S^{\star},
\]
and the normalizing constant of Mo(fS, fT | ·) is
\[
c_{M^{o}} = \int F(f_S,f_T|y_T,\theta)\exp\{-D(F\,\|\,F_S)\}\, \mathrm{d}f_S\, \mathrm{d}f_T.
\]
These rearrangements reveal the marginal distribution (11) of the FPD-optimal model (10).
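For Gaussian arguments, the weight exp{−D(F‖FS)} appearing above is available in closed form. The sketch below uses the standard closed-form multivariate-Gaussian KLD; the toy means and covariances are assumptions made for illustration and are not the specific parameterizations (14), (15a) adopted in the next proof:

```python
import numpy as np

def gauss_kld(m0, S0, m1, S1):
    """KL( N(m0, S0) || N(m1, S1) ) between multivariate Gaussians, in closed form."""
    d = len(m0)
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

m_S_bar = np.array([1.0])      # mean of the transferred source predictor (toy value)
S_F = 0.1 * np.eye(1)          # covariance of F(y_S* | f_S, theta) (toy value)
S_FS = 0.5 * np.eye(1)         # covariance of F_S(y_S* | y_S) (toy value)

# The weight attached to a candidate f_S decays as it moves away from the transferred
# predictive mean, which is the mechanism that down-weights unreliable source knowledge.
for f_S in [1.0, 2.0, 4.0]:
    w = np.exp(-gauss_kld(np.array([f_S]), S_F, m_S_bar, S_FS))
    print(f"f_S = {f_S}:  exp(-D(F||F_S)) = {w:.4f}")
```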
C. Proof of Proposition 2

To simplify the subsequent rearrangements, let us rewrite (11) as follows:
\[
M^{o}(f_S,f_T|F_S,y_T,\theta) \propto
F(y_T|f_T,\theta)\, F(f_S,f_T|\theta)\, \exp\{-D(F\,\|\,F_S)\}. \tag{23}
\]
After adopting (15a) and (14), the exponential term in (23) results in
\[
\exp\{-D(F\,\|\,F_S)\} \propto \mathcal{N}(\bar{m}_S;\, f_S,\, \bar{R}_S). \tag{24}
\]
Substituting (15b), (15c), and (24) into (23), we obtain
\[
M^{o}(f_S,f_T|F_S,y_T,\theta) \propto
\mathcal{N}(y_T;\, f_T,\, \sigma_T^2 I_{n_T})\,
\mathcal{N}(\bar{m}_S;\, f_S,\, \bar{R}_S)\,
\mathcal{N}\!\left(\begin{bmatrix} f_S \\ f_T \end{bmatrix};
\begin{bmatrix} 0 \\ 0 \end{bmatrix},
\begin{bmatrix} K_{SS} & K_{ST} \\ K_{TS} & K_{TT} \end{bmatrix}\right). \tag{25}
\]
To use Lemma 1 with (25), we adopt the following notational assignments:
\[
y \equiv \begin{bmatrix} \bar{m}_S \\ y_T \end{bmatrix},\quad
f \equiv \begin{bmatrix} f_S \\ f_T \end{bmatrix},\quad
m \equiv \begin{bmatrix} 0 \\ 0 \end{bmatrix},\quad
\Sigma \equiv \begin{bmatrix} \bar{R}_S & O \\ O & \sigma_T^2 I_{n_T} \end{bmatrix},\quad
K \equiv \begin{bmatrix} K_{SS} & K_{ST} \\ K_{TS} & K_{TT} \end{bmatrix}, \tag{26}
\]
which allows us to write
\[
M^{o}(f_S,f_T|F_S,y_T,\theta) \propto
\mathcal{N}\!\left(
\begin{bmatrix} f_S \\ f_T \\ \bar{m}_S \\ y_T \end{bmatrix};
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix},
\begin{bmatrix}
K_{SS} & K_{ST} & K_{SS} & K_{ST} \\
K_{TS} & K_{TT} & K_{TS} & K_{TT} \\
K_{SS} & K_{ST} & K_{SS} + \bar{R}_S & K_{ST} \\
K_{TS} & K_{TT} & K_{TS} & K_{TT} + \sigma_T^2 I_{n_T}
\end{bmatrix}\right). \tag{27}
\]
Consequently, applying Lemma 2 to (27)—with the same blocks as in (26)—we obtain
\[
M^{o}(f_S,f_T|F_S,y_T,\theta) = \mathcal{N}(m^{o}, K^{o}),
\]
where
\[
m^{o} \equiv \begin{bmatrix} m^{o}_S \\ m^{o}_T \end{bmatrix}
      \equiv \begin{bmatrix} m^{o}_S(x_S) \\ m^{o}_T(x_T) \end{bmatrix},\qquad
K^{o} \equiv \begin{bmatrix} K^{o}_{SS} & K^{o}_{ST} \\ K^{o}_{TS} & K^{o}_{TT} \end{bmatrix}
      \equiv \begin{bmatrix} k^{o}_{SS}(x_S,x_S) & k^{o}_{ST}(x_S,x_T) \\ k^{o}_{TS}(x_T,x_S) & k^{o}_{TT}(x_T,x_T) \end{bmatrix},
\]
which is subblock-wise computed, q, p ∈ (S, T), as
\[
m^{o}_q(x_q) = k_q \big(K + \mathrm{blkdiag}(\bar{R}_S, \sigma_T^2 I_{n_T})\big)^{-1}
\begin{bmatrix} \bar{m}_S \\ y_T \end{bmatrix},
\qquad
k^{o}_{qp}(x_q, x_p) = k_{qp}(x_q, x_p)
 - k_q \big(K + \mathrm{blkdiag}(\bar{R}_S, \sigma_T^2 I_{n_T})\big)^{-1} k_p,
\]
and
\[
k_q \equiv \begin{bmatrix} k_{qS}(x_q, x_S) & k_{qT}(x_q, x_T) \end{bmatrix},\qquad
k_p \equiv \begin{bmatrix} k_{Sp}(x_S, x_p) \\ k_{Tp}(x_T, x_p) \end{bmatrix}.
\]
Now, extending both xq and xp with x leads to
\[
M^{o}(f_S, f_T, f_S, f_T\,|\,F_S, y_T, \theta, x) = \mathcal{N}(m^{o}_x, K^{o}_x),
\]
where
\[
m^{o}_x \equiv \begin{bmatrix} m^{o}_S(x_S) \\ m^{o}_T(x_T) \\ m^{o}_S(x) \\ m^{o}_T(x) \end{bmatrix},\qquad
K^{o}_x \equiv \begin{bmatrix}
k^{o}_{SS}(x_S,x_S) & k^{o}_{ST}(x_S,x_T) & k^{o}_{SS}(x_S,x') & k^{o}_{ST}(x_S,x') \\
k^{o}_{TS}(x_T,x_S) & k^{o}_{TT}(x_T,x_T) & k^{o}_{TS}(x_T,x') & k^{o}_{TT}(x_T,x') \\
k^{o}_{SS}(x,x_S)   & k^{o}_{ST}(x,x_T)   & k^{o}_{SS}(x,x')   & k^{o}_{ST}(x,x') \\
k^{o}_{TS}(x,x_S)   & k^{o}_{TT}(x,x_T)   & k^{o}_{TS}(x,x')   & k^{o}_{TT}(x,x')
\end{bmatrix},
\]
which—after marginalizing w.r.t. (fS, fT)—yields the sought result (17).
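To make the computational content of this result explicit, the following is a minimal sketch of the FPD-optimal moments (a toy example: the RBF kernel, the intrinsic-coregionalization task matrix B, and all numerical values are assumptions made for illustration; notation follows the subblock formulae above):

```python
import numpy as np

def rbf(a, b, ell=0.5, var=1.0):
    """Squared-exponential kernel between 1-D input vectors (an assumed kernel choice)."""
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / ell) ** 2)

x_S = np.array([0.0, 0.5, 1.0])               # source inputs (toy)
x_T = np.array([0.2, 0.8])                    # target inputs (toy)
x_new = np.array([0.4])                       # test input
sigma_T2 = 0.05                               # target noise variance
m_S_bar = np.sin(2 * np.pi * x_S)             # transferred source predictive mean (toy)
R_S_bar = 0.01 * np.eye(len(x_S))             # transferred source predictive covariance (toy)
y_T = np.sin(2 * np.pi * x_T) + 0.1           # local target data (toy)

# ICM-style joint kernel over the two tasks (an assumed multi-output kernel K).
B = np.array([[1.0, 0.9], [0.9, 1.0]])
X = np.concatenate([x_S, x_T])
task = np.array([0] * len(x_S) + [1] * len(x_T))
K = B[task][:, task] * rbf(X, X)

# blkdiag(R_S_bar, sigma_T^2 I): the transferred predictor acts as a noisy observation of f_S.
Sigma = np.zeros_like(K)
Sigma[:len(x_S), :len(x_S)] = R_S_bar
Sigma[len(x_S):, len(x_S):] = sigma_T2 * np.eye(len(x_T))

obs = np.concatenate([m_S_bar, y_T])          # [m_S_bar; y_T]
alpha = np.linalg.solve(K + Sigma, obs)       # (K + blkdiag(...))^{-1} [m_S_bar; y_T]

# m_T^o(x_new) and k_TT^o(x_new, x_new) from the subblock formulae:
k_new = B[1, task] * rbf(x_new, X)            # cross-covariance row k_q for q = T at x_new
mean_new = (k_new @ alpha).item()
var_new = (B[1, 1] * rbf(x_new, x_new) - k_new @ np.linalg.solve(K + Sigma, k_new.T)).item()
print(mean_new, var_new)
```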
REFERENCES

[1] Y. Zhang and Q. Yang, “A survey on multi-task learning,” arXiv preprint arXiv:1707.08114, 2018.
[2] Y. W. Teh, M. Seeger, and M. I. Jordan, “Semiparametric latent factor models,” in 10th Conference on Artificial Intelligence and Statistics (AISTATS), 2005.
[3] E. V. Bonilla, K. M. Chai, and C. Williams, “Multi-task Gaussian process prediction,” in Advances in Neural Information Processing Systems, 2008, pp. 153–160.
[4] T. V. Nguyen and E. V. Bonilla, “Collaborative multi-output Gaussian processes,” in 38th Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2014, pp. 643–652.
[5] M. A. Álvarez and N. D. Lawrence, “Computationally efficient convolved multiple output Gaussian processes,” Journal of Machine Learning Research, vol. 12, pp. 1459–1500, 2011.
[6] M. Van der Wilk, C. E. Rasmussen, and J. Hensman, “Convolutional Gaussian processes,” in Advances in Neural Information Processing Systems, 2017, pp. 2849–2858.
[7] A. G. Wilson, D. A. Knowles, and Z. Ghahramani, “Gaussian process regression networks,” in 29th International Conference on Machine Learning (ICML). Omnipress, 2012, pp. 1139–1146.
[8] T. V. Nguyen and E. V. Bonilla, “Efficient variational inference for Gaussian process regression networks,” in 16th Conference on Artificial Intelligence and Statistics (AISTATS), 2013, pp. 472–480.
[9] A. Damianou and N. Lawrence, “Deep Gaussian processes,” in Artificial Intelligence and Statistics, 2013, pp. 207–215.
[10] M. Kandemir, “Asymmetric transfer learning with deep Gaussian processes,” in 32nd International Conference on Machine Learning (ICML), 2015, pp. 730–738.
[11] A. Boustati and R. S. Savage, “Multi-task learning in deep Gaussian processes with multi-kernel layers,” arXiv preprint arXiv:1905.12407, 2019.
[12] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing, “Deep kernel learning,” in Artificial Intelligence and Statistics, 2016, pp. 370–378.
[13] J. M. Bernardo and A. F. M. Smith, Bayesian Theory. Wiley, 1994.
[14] A. Lazaric and M. Ghavamzadeh, “Bayesian multi-task reinforcement learning,” in 27th International Conference on Machine Learning (ICML). Omnipress, 2010, pp. 599–606.
[15] K. Swersky, J. Snoek, and R. P. Adams, “Multi-task Bayesian optimization,” in Advances in Neural Information Processing Systems, 2013, pp. 2004–2012.
[16] G. Camps-Valls, J. Verrelst, J. Munoz-Mari, V. Laparra, F. Mateo-Jimenez, and J. Gomez-Dans, “A survey on Gaussian processes for Earth-observation data analysis: A comprehensive investigation,” IEEE Geoscience and Remote Sensing Magazine, vol. 4, no. 2, pp. 58–78, 2016.
[17] Y. Zhang and D.-Y. Yeung, “Multi-task warped Gaussian process for personalized age estimation,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010, pp. 2622–2629.
[18] M. Ghassemi, M. A. F. Pimentel, T. Naumann, T. Brennan, D. A. Clifton, P. Szolovits, and M. Feng, “A multivariate timeseries modeling approach to severity of illness assessment and forecasting in ICU with sparse, heterogeneous clinical data,” in AAAI 29th Conference on Artificial Intelligence, 2015.
[19] L. Torrey and J. Shavlik, “Transfer learning,” in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, 2010, pp. 242–264.
[20] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[21] K. Weiss, T. M. Khoshgoftaar, and D. D. Wang, “A survey of transfer learning,” Journal of Big Data, vol. 3, no. 1, p. 9, 2016.
[22] J. Lu, V. Behbood, P. Hao, H. Zuo, S. Xue, and G. Zhang, “Transfer learning using computational intelligence: A survey,” Knowledge-Based Systems, vol. 80, pp. 14–23, 2015.
[23] C. Foley and A. Quinn, “Fully probabilistic design for knowledge transfer in a pair of Kalman filters,” IEEE Signal Processing Letters, vol. 25, no. 4, pp. 487–490, 2018.
[24] M. Papež and A. Quinn, “Dynamic Bayesian knowledge transfer between a pair of Kalman filters,” in IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2018.
[25] ——, “Robust Bayesian transfer learning between Kalman filters,” in IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2019.
[26] L. Jirsa, L. Pavelkova, and A. Quinn, “Knowledge transfer in a pair of uniformly modelled Bayesian filters,” in 2019 16th International Conference on Informatics in Control, Automation and Robotics (ICINCO). IEEE, 2019.
[27] M. Papež and A. Quinn, “Bayesian transfer learning between Gaussian process regression tasks,” in IEEE 19th International Symposium on Signal Processing and Information Technology (ISSPIT). IEEE, 2019.
[28] M. Papež and A. Quinn, “Bayesian transfer learning between Student-t filters,” Signal Processing, vol. 175, p. 107624, October 2020.
[29] M. Kárný and T. Kroupa, “Axiomatisation of fully probabilistic design,” Information Sciences, vol. 186, no. 1, pp. 105–113, 2012.
[30] A. Quinn, M. Kárný, and T. V. Guy, “Fully probabilistic design of hierarchical Bayesian models,” Information Sciences, vol. 369, pp. 532–547, 2016.
[31] J. Shore and R. Johnson, “Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy,” IEEE Transactions on Information Theory, vol. 26, no. 1, pp. 26–37, 1980.
[32] J. Kracík and M. Kárný, “Merging of data knowledge in Bayesian estimation,” in Proceedings of the 2nd International Conference on Informatics in Control, Automation and Robotics. Barcelona: INSTICC, 2005, pp. 229–232.
[33] M. Kárný, J. Andrýsek, A. Bodini, T. Guy, J. Kracík, and F. Ruggeri, “How to exploit external model of data for parameter estimation?” International Journal of Adaptive Control and Signal Processing, vol. 20, no. 1, pp. 41–50, 2006.
[34] S. Azizi and A. Quinn, “Hierarchical fully probabilistic design for deliberator-based merging in multiple participant systems,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 48, no. 4, pp. 565–573, 2016.
[35] M. Kárný, “Towards fully probabilistic control design,” Automatica, vol. 32, no. 12, pp. 1719–1722, 1996.
[36] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.
[37] A. Quinn, M. Kárný, and T. V. Guy, “Optimal design of priors constrained by external predictors,” International Journal of Approximate Reasoning, vol. 84, pp. 150–158, 2017.
[38] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[39] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[40] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[41] M. A. Álvarez, L. Rosasco, and N. D. Lawrence, “Kernels for vector-valued functions: A review,” Foundations and Trends® in Machine Learning, vol. 4, no. 3, pp. 195–266, 2012.
[42] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, “Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization,” ACM Transactions on Mathematical Software (TOMS), vol. 23, no. 4, pp. 550–560, 1997.
[43] M. B. Ring, “Continual learning in reinforcement environments,” Ph.D. dissertation, University of Texas at Austin, 1994.
[44] J.-F. Ton, S. Flaxman, D. Sejdinovic, and S. Bhatt, “Spatial mapping with Gaussian processes and nonstationary Fourier features,” Spatial Statistics, vol. 28, pp. 59–78, 2018.
[45] H. Jeffreys, Theory of Probability. Clarendon Press, 1961.
[46] H. Liu, Y.-S. Ong, X. Shen, and J. Cai, “When Gaussian process meets big data: A review of scalable GPs,” IEEE Transactions on Neural Networks and Learning Systems, 2020.
[47] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019, vol. 48.
[48] G. Parra and F. Tobar, “Spectral mixture kernels for multi-output Gaussian processes,” in Advances in Neural Information Processing Systems, 2017, pp. 6681–6690.
[49] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner, “Variational continual learning,” in International Conference on Learning Representations, 2018.
[50] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.