
Empirical Bayes for Learning to Learn

Tom Heskes tom@mbfys.kun.nl


SNN, University of Nijmegen, Geert Grooteplein 21, Nijmegen, 6525 EZ, The Netherlands

Abstract

We present a new model for studying multitask learning, linking theoretical results to practical simulations. In our model all tasks are combined in a single feedforward neural network. Learning is implemented in a Bayesian fashion. In this Bayesian framework the hidden-to-output weights, being specific to each task, play the role of model parameters. The input-to-hidden weights, which are shared between all tasks, are treated as hyperparameters. Other hyperparameters describe error variance and correlations and priors for the model parameters. An important feature of our model is that the probability of these hyperparameters given the data can be computed explicitly and depends only on a set of sufficient statistics. None of these statistics scales with the number of tasks or patterns, which makes empirical Bayes for multitask learning a relatively straightforward optimization problem. Simulations on real-world data sets on single-copy newspaper and magazine sales illustrate properties of multitask learning. Most notably, we derive experimental curves for "learning to learn" that can be linked to theoretical results obtained elsewhere.

1. Introduction

Machine learning is the art of building machines that learn from data. Whereas the learning part is automatic, an expert has to take care of the building part. This often involves choosing the structure and learning algorithm, setting regularization parameters, and so on. In many cases the expert's capabilities are crucial for the success or failure of an application: he or she has to provide the right bias that makes the machine generalize well. Why not try to move the expert's responsibility to the next level and let the machine itself learn its own bias? This is the concept of "learning to learn" described in, among others, Baxter (1997) and Thrun and Pratt (1997).

Multitask learning constitutes an ideal setting for studying learning to learn. In multitask learning we are dealing with many related tasks. The hope is that the tasks can "learn from each other", for example by sharing parameters. A typical example, also studied in this article, is the use of a neural architecture with part of the weights shared and others specific to each task. By training the network on all tasks, the risk of overfitting the shared part is reduced and a common set of features can be obtained. This idea has been studied and tested on practical problems in e.g. Caruana (1997) and Pratt and Jennings (1996). Whether it works depends on whether the tasks are indeed sufficiently similar, which is often hard to tell in advance.

Baxter (1997) proposed hierarchical Bayesian inference as a model for studying multitask learning. Parameters that are shared between tasks are treated as hyperparameters at a higher level than the task-specific model parameters. Information-theoretic arguments show that multitask learning can be highly advantageous, especially when the number of hyperparameters is much larger than the number of model parameters per task. The necessary assumption to arrive at these theoretical results is that the tasks are indeed related, i.e., that they are drawn from the same (hyper)distribution.

In this article, we aim at a practical implementation of Baxter's framework. The essence is the ability to compute the probability of the hyperparameters given the data. In a full hierarchical Bayesian procedure, one should sample the hyperparameters from this distribution, and subsequently the model parameters from the distribution of model parameters given the data and the hyperparameters. The empirical Bayesian approach is the frequentist shortcut: rather than sampling the distribution of hyperparameters, we will only consider their most likely values.

We will study a specific model for solving many related regression tasks, which is similar to the one studied
in Heskes (1998), except for all kinds of technical details. These details are necessary for building a robust and accurate prediction system, but a deathblow for clear methodology and extensive testing. In this article we therefore abstract from the exact implementation and describe a simpler model, easier to analyze theoretically but still applicable to real-world problems.

In Section 2 we work out our model of multitask learning. The empirical Bayesian approach is sketched in Section 3, with the crucial probability of the hyperparameters given the data outlined in the Appendix. The model is illustrated on real-world data involving newspaper and magazine sales in Section 4. Ideas for further research are discussed in Section 5.

2. A Model of Multitask Learning

2.1 Data Likelihood

We are given a set of N combinations of input vectors x^μ and n outputs y_i^μ, each corresponding to a different task. Our model assumption is that the output y_i^μ is a linear function of the input x^μ with additive noise ξ_i^μ and η^μ:

$$y_i^\mu = A_i^T B x^\mu + \xi_i^\mu + \eta^\mu . \quad (1)$$

We will assume that there is a single output for each task and write y^μ for the vector of all n outputs y_i^μ. The input vectors x^μ are the same for all tasks. This assumption is introduced to simplify the mathematical exposition, but will be relaxed later on.

The noise part ξ_i^μ + η^μ consists of two Gaussian noise terms with average zero and standard deviation σ and φσ, respectively. The noise ξ_i^μ is specific to task i, the noise η^μ the same for all tasks. This distinction between an individual noise term and a common noise term becomes relevant when the individual predictions are translated to an aggregate level. For example, fluctuations in the error averaged over all tasks will be constant and proportional to φσ for nonzero φ, yet scale with 1/√n for φ = 0. Furthermore, substantial correlations might indicate that some important information is lacking. Both σ and φ are the same for all tasks and thus treated as hyperparameters.

The model part A_i^T B x^μ can be interpreted as a linear neural network. The n_hid × n_inp matrix B contains the weights connecting the inputs to the hidden units; the n_hid × n matrix A, with column vectors A_i, represents the weights from the hidden to the output units. Typically we have n_hid < n_inp ≪ n, i.e., a bottleneck of hidden units. The hidden units will tend to find a low-dimensional representation of the inputs that is useful for all tasks. The standard case, maximum likelihood for φ = 0, has been treated in great detail in Baldi and Hornik (1989). The feature matrix B is shared by all tasks and will thus play the role of a hyperparameter. The vectors A_i, on the other hand, are task-specific and thus treated as model parameters.

Let D stand for the outputs y_i^μ for all tasks i and patterns μ. Assuming independently and identically distributed (iid) observations we have the likelihood

$$P(D|A,B,\sigma,\varphi) = \prod_\mu \Psi\left(y^\mu : A^T B x^\mu,\ \Sigma\right) \quad (2)$$

where Ψ(y : m, Σ) means that y is normally distributed with mean m and covariance matrix Σ. Σ is the noise covariance matrix with components [Σ]_ii = (1 + φ²)σ² and [Σ]_ij = φ²σ² for i ≠ j.

The linearity of (1) induces a symmetry: the data likelihood (2) cannot distinguish between {QA, Q⁻¹B} and the original {A, B} for any invertible n_hid × n_hid matrix Q. We use this freedom to constrain B such that the covariance matrix of hidden unit activities is the identity matrix:

$$\frac{1}{N}\sum_\mu [Bx^\mu][Bx^\mu]^T = B\,\Sigma_{xx}\,B^T = I , \quad (3)$$

with Σ_xx = ⟨x x^T⟩_patterns the covariance matrix of the inputs. This constraint greatly simplifies the analysis that follows. Going through this analysis it can be seen that we do not need to require that all tasks receive the same input vectors x^μ, but just that they have the same number of input vectors and the same input covariance matrix (and thus the same covariance matrix of hidden unit activities).

With the constraint (3) the maximum likelihood solution A_ml takes the simple form

$$A_{\rm ml} = \mathop{\rm argmax}_A P(D|A,B,\sigma,\varphi) = B\,\Sigma_{xy} ,$$

with Σ_xy = ⟨x y^T⟩_patterns the input-output covariance matrix. Note that the data likelihood (2) can be rewritten as the exponent of a term quadratic in A − A_ml times a term independent of A.

2.2 Prior Information

The model parameters A_i determine the impact of the hidden units on output i. As a start we could consider the Gaussian prior

$$P(A_i|M, \Lambda_A) = \Psi(A_i : M, \Lambda_A) ,$$

with M a vector of length n_hid and Λ_A an n_hid × n_hid covariance matrix. This corresponds to a so-called exchangeability assumption (the same M for all tasks)
and introduces a tendency for similar model parameters across tasks. How similar is determined by the covariance matrix Λ_A.

The exchangeability assumption is rather strong. It can be generalized to include task-dependent averages, e.g., by making these (linearly) dependent on an n_prior × n matrix z with task characteristics:

$$P(A_i|M_i, \Lambda_A) = \Psi(A_i : M_i, \Lambda_A) \quad {\rm with} \quad M = \Omega z .$$

Ω is an n_hid × n_prior matrix with regression parameters. The exchangeability prior is a special case: n_prior = 1 and z a vector of all ones. In the case of newspaper sales, task characteristics are properties of the particular outlet that are assumed to have an effect on sales patterns. An example is (a useful representation of) the outlet's geographical location.

3. Empirical Bayes

3.1 Integrating out the Model Parameters

Let θ = {B, σ, φ, Ω, Λ_A} denote the set of all hyperparameters. In empirical Bayes the idea is to try and find the hyperparameters that maximize the probability P(θ|D), which are then fixed when computing statistics over the model parameters (see e.g. Robert, 1994). Using Bayes' formula we obtain

$$P(\theta|D) = \int dA\, P(A,\theta|D) = \int dA\, \frac{P(A,\theta,D)}{P(D)} = \frac{P(\theta)}{P(D)} \int dA\, P(D|A,B,\sigma,\varphi)\, P(A|\Omega,\Lambda_A) .$$

Taking a flat prior for P(θ), the goal is to minimize the loss defined as

$$L(\theta) = -\log \int dA\, P(D|A,B,\sigma,\varphi)\, P(A|\Omega,\Lambda_A) .$$

This integral over the model parameters is highly dimensional, but perfectly doable since all terms in the exponent are at most quadratic in A. The result is given in the Appendix.

3.2 Properties of the Loss Function

Here we highlight the main features of the loss function L(θ). First of all we note that L(θ) scales with Nn, i.e., with the total number of examples in the data set. For the conditions considered in this article (both N and n a few hundred) and except for symmetries, the probability P(θ|D) is sharply peaked around its optimal value. This validates the empirical Bayesian approach, which is asymptotically equivalent to a full hierarchical Bayesian approach that would also require sampling over hyperparameters (see e.g. Robert, 1994).

The loss L(θ) only depends on the data through a set of "sufficient statistics". An example of such a statistic is the n_inp × n_inp matrix R_xx = Σ_xy Σ_xy^T / n. The other statistics are given in the Appendix. This property of the loss function has important practical and theoretical consequences. In practice, the procedure for obtaining the most likely hyperparameters no longer scales with the number of tasks or patterns (after initial computation of the statistics). This speed-up enables extensive testing of all kinds of ideas and options. In the rest of this article we will exploit these practical possibilities and illustrate properties of multitask learning on real-world databases, leaving precise theoretical analyses for later.

3.3 Link with Multilevel Analysis

The empirical Bayesian procedure proposed here can be viewed as a special case of a hierarchical Bayesian approach to multilevel analysis, a statistical method for analyzing nested data sets (see e.g. Bryk and Raudenbush, 1992). Hierarchical because of the different stages of inference, multilevel because of the different levels of noise. Important differences with standard multilevel analysis are the inference of the feature matrix B and the incorporation of correlated errors.

4. Illustrations on Real-World Data

4.1 Preprocessing the Data Sets

For illustration we use three different data sets. Two of them (data sets I and II) involve newspaper sales, the other magazine sales (III). Each of the sets contains two to three years of weekly sales figures, i.e., N between 100 and 150, for n = 343 (data set I) to 1000 (data set III) outlets. Explanatory variables taken into account differ between data sets, with the number of inputs ranging from n_inp = 9 to 20, but always contain a few weeks of recent sales figures (specific to the outlet involved), two inputs coding season, and one constant input. The sales figures to be estimated are all rescaled per outlet to have mean zero and unit standard deviation. Outliers are detected and removed, the sales figures are corrected for sellouts, and missing input values are filled in by averaging over the inputs of nearby editions. The original input vectors are transformed such that the input covariance matrices built from the training inputs of each task are indeed the same. We neglect the minor differences in the number of training patterns for each task and take for N the
number of training patterns averaged over all tasks.

4.2 Simulation Paradigm and Error Measures

The simulation paradigm for multitask learning is somewhat more involved than the one for a single task, since now we can sample both over tasks and over patterns. As performance measure we take the mean-squared error, denoted

$$E(D_{\rm tested}|D_{\rm model}, D_{\rm hyper}) = \left\langle \left(y - A_{\rm mp}^T B_{\rm ml}\, x\right)^2 \right\rangle .$$

Here the average is over all combinations of inputs and outputs belonging to data set D_tested. The hyperparameters θ_ml = argmax_θ P(D_hyper|θ), including the feature matrix B_ml, are used to derive the most probable solutions A_mp = argmax_A P(A|D_model, θ_ml).

Let us consider the general layout for a set of simulations in which we would like to evaluate the effect of varying one parameter, for example the number of hidden units, on a particular data set. Before each run, which involves testing all alternatives, we randomly split up the total set of tasks in two equal parts, D and its complement D̄. The data sets are further randomly subdivided into an 80% training set, denoted D_tr, and a 20% test set, denoted D_te. Note that for practical purposes we are here considering in-sample test error rather than out-of-sample test error as in Heskes (1998). Following the above notation, we can now distinguish four different types of errors:

              intra                    inter
  training    E(D_tr | D_tr, D_tr)    E(D_tr | D_tr, D̄_tr)
  test        E(D_te | D_tr, D_tr)    E(D_te | D_tr, D̄_tr)

Training and test refer to patterns, intra and inter to tasks. From the perspective of "learning to learn", the inter-task test error is the most interesting one: it measures to what extent the hyperparameters obtained on one set of tasks generalize to another set. By interchanging the roles of D and D̄, we compute these four errors twice for each run and each alternative, and average over the two options. For example, we consider in fact the statistics of

$$E_{\rm test}^{\rm inter} = \frac{1}{2}\left[ E(D_{\rm te}|D_{\rm tr}, \bar D_{\rm tr}) + E(\bar D_{\rm te}|\bar D_{\rm tr}, D_{\rm tr}) \right] ,$$

and similarly for the other types of errors. This strategy reduces the fluctuations induced by the accidental subdivision of tasks. A similar strategy for the subdivision in training and test patterns would leave us too few training patterns.

4.3 Qualitative Findings

Being a nonlinear function of the hyperparameters, L(θ) might well have local minima. However, with proper parameterizations (e.g., square root of the prior covariance Λ_A) and initial conditions (random, small φ, large Λ_A), a standard conjugate gradient algorithm happens to be extremely robust and leads to reproducible results. In the first stages of learning the feature matrix B changes a lot, with the other parameters slowly tracking. The second stages of learning, usually after roughly 100 conjugate gradient iterations, show a joint fine-tuning of all parameters involved. Apparently the loss function is dominated by the dependency of the standard mean-squared error (the term L_11(B) in the Appendix) on the feature matrix B. As shown in Baldi and Hornik (1989), this term has no local minima, only saddle points.

The obtained feature matrices B make a lot of sense. One of the dominant features for all data sets is a weighted summation over recent sales figures. The other features are more specific to the characteristics of the different data sets, but always reasonable and reproducible. Features obtained for a smaller network reappear with only minor changes in a network with more hidden units, which is consistent with the principal component analysis of Baldi and Hornik (1989).

The noise parameters σ and φ are on the order of 0.9 and 0.3, respectively. This implies that somewhere between 10 and 20% of the variance can be explained. Because of the large number of tasks and the daily need for accurate predictions, performance improvements of much less than 1% can be highly significant, both statistically and financially. The error correlations are substantial, which is not unexpected since there are many factors involving newspaper and magazine sales that affect all outlets but cannot (yet) be taken into account. An example is news content such as sports results. These are not only difficult to model, but are often also available only after circulation decisions have been made. It should further be noted that all three data sets have a relatively low average number of copies sold per outlet. In general, the higher this number, the higher the signal-to-noise ratio.

The prior mean M and covariance matrix Λ_A are also sensible. For example, the mean for the model parameters connected to the feature representing recent sales figures is always clearly positive. The leading eigenvalues of the covariance matrix Λ_A are such that the prior information is worth about half a year of sales figures. With higher numbers of hidden units, more and more eigenvalues become zero, basically shunting off the corresponding hidden unit. Prior means depending on task characteristics have been considered as well, and sometimes yield slightly better results. For example, the mean connected to a feature represent-
[Figure 1 appears here: three panels of mean-squared error versus number of hidden units, with curves labeled train (mp), train (ml), test (mp), and test (ml).]

Figure 1. Mean-squared errors as a function of the number of hidden units. (a) Inter-task test error for most probable (solid) and maximum likelihood (dashed) solutions. (b) The same for the inter-task training error. (c) Difference between inter- and intra-task test (solid) and training (dashed) errors. The lines serve to guide the eye. Error bars give the standard deviation of the mean. Averages over 20 runs. See the text for further details.

ing seasonal patterns comes out significantly higher for outlets in touristic areas than for outlets in urban areas. More testing is needed, especially in situations with fewer training patterns per outlet, where more accurate priors really start to pay off.

4.4 Performance with Increasing Hidden Units

In our model we have both a bottleneck of hidden units, reducing the inputs to a smaller set of features, and, on top of that, hyperparameters specifying a prior for the model parameters. Do we really need both? To check this we applied the simulation paradigm sketched above to data set III, varying the number of features and computing error measures not only for the most probable solutions A_mp, but also for the maximum likelihood solutions A_ml. The results for these maximum likelihood solutions indicate the performance that can be obtained in a frequentist multitask learning approach without taking into account prior information, following e.g. Caruana (1997) and Pratt and Jennings (1996). On purpose we considered a data set with a relatively small number of inputs (n_inp = 9). This makes it feasible to go all the way up to n_hid = n_inp, which is equivalent to the case of no bottleneck. Loosely speaking, the maximum likelihood test error for n_hid = n_inp is the test error for single-task learning. The results are shown in Figure 1. In each run n = 500 tasks were available for fitting the hyperparameters and another 500 for evaluation, with N = 92 training patterns.

Looking at the inter-task test error on the lefthand side, it can be seen that both the feature matrix and the priors for the model parameters improve performance. The test error based on the maximum likelihood solutions neglecting the prior rapidly grows with increasing number of features, yielding by far the worst performance with all tasks treated separately. The optimum for the most probable solutions is obtained for n_hid = 4. Even though the priors on the model parameters control the risk of overfitting rather well, the test error for n_hid = 4 is significantly lower than the one for n_hid = n_inp. In short, the main aspects of the model, feature reduction and exchangeability of model parameters, work well on this set of regression tasks and validate the concept of "learning to learn".

The upper right graph shows that the inter-task train-
[Figure 2 appears here: two panels of mean-squared error versus 1/number of tasks, with curves labeled inter test, intra test, inter train, and intra train.]

Figure 2. Mean-squared errors as a function of the number of tasks used for fitting the hyperparameters. (a) Learning-to-learning curve: the inter-task test error as a function of the number of tasks. (b) Other training curves: intra-task test error, inter-task training error, and intra-task training error (from top to bottom). The lines are the best linear fits. Error bars give the standard deviation of the mean. Averages over 75 runs. See the text for further details.

ing error indeed decreases as a function of the number of features, for the maximum likelihood solutions a little faster than for the most probable solutions. The lower right graph gives the difference between the inter- and intra-task training and test errors for the most probable solutions. These differences, although mostly significant, are very small on the scale of the other two graphs. Roughly speaking, they measure the impact of a single task on the hyperparameters optimized on a set of n tasks. This impact scales with 1/n, yielding very small differences for the number of tasks n = 500 in this simulation. The intra-task training error is consistently lower than the inter-task training error, as could be expected. The same difference for the test error changes sign around the optimal number of features: with fewer hidden units the inter-task test error is higher than the intra-task test error, and vice versa with more hidden units. In other words, when overfitting it helps to optimize the hyperparameters on an independent set of tasks, excluding the task under consideration. We have observed similar behavior in other simulations, but do not yet fully understand it.

4.5 Learning-to-Learning Curves

In another set of simulations on data set I, we derived the learning curves of Figure 2. Within each run, tasks were split up randomly into 50% tasks for training and 50% for testing, as before. But now we varied the number of (training) tasks used for optimizing the hyperparameters from n = 17 to the maximal n = 171. We considered a network with n_inp = 20, n_hid = 3, and n_prior = 1 with N = 122 training patterns per task.

The graph on the lefthand side yields a "learning-to-learning" curve: the inter-task test error as a function of the number of tasks used to optimize the hyperparameters. On the right are the learning curves for the intra-task test, inter-task training, and intra-task training error. The simulations strongly suggest a linear relationship between these errors and the inverse of the number of tasks. Although the exact conditions and error measures differ, this relationship is well in line with the theoretical findings of Baxter (1997).

Here we will sketch a simpler alternative argument, to be worked out more precisely in a separate article. First we consider the difference between the intra-task training error and the inter-task training error. Call E_∞ the training error that would be obtained with an infinite number of tasks available for fitting the hyperparameters. With a finite number of tasks, overfitting will reduce the intra-task training error, yet increase the inter-task training error. Assuming that the training error is the dominant term in the loss function L(θ), loose application of Akaike's (1974) information
criterion yields

$$E_\infty - E_{\rm train}^{\rm intra} \approx E_{\rm train}^{\rm inter} - E_\infty \approx \frac{2\,|\theta|\,\sigma^2}{Nn} ,$$

with |θ| the (effective) dimension of the hyperparameters θ. Based on this expression, we would predict an absolute slope of 0.4, fairly close to the fitted -0.38 for the intra-task training error and 0.42 for the inter-task training error. Furthermore, the intra- and inter-task training error indeed seem to have the same intercept.

Next we compare the inter-task training and inter-task test error. They are measured based on the same set of hyperparameters, obtained on an independent set of tasks. Therefore we can again apply Akaike's information criterion, but now considering the effect of optimizing the model parameters on the training data. Neglecting the effect of the prior, we get

$$E_{\rm test}^{\rm inter} - E_{\rm train}^{\rm inter} \approx \frac{2\,|A|\,\sigma^2}{N} ,$$

with |A| the dimension of the model parameters. This crude estimate yields a difference of 0.36 to be compared with 0.37 for the experimental fit. Furthermore, it suggests that the slope for the inter-task test error is the same as for the inter-task training error, which is also experimentally verified. There is no significant difference between the experimental fits for the intra-task and inter-task test errors, but again, we do not know how to link these theoretically.

5. Discussion and Outlook

In this article we have presented a new model for multitask learning, analyzed within an empirical Bayesian framework. Model parameters specific to each task and hyperparameters shared between tasks are treated at different levels, as advocated in Baxter (1997). Compared with the multitask learning approaches of Caruana (1997) and Pratt and Jennings (1996), where input-to-hidden weights B and hidden-to-output weights A are treated at the same frequentist level, the Bayesian approach has several advantages. First of all, the Bayesian approach facilitates the incorporation of prior knowledge, the parameters of which, Ω and Λ_A, are learned in the same manner as the feature matrix B. This turns the maximum likelihood parameters A_ml into most probable parameters A_mp, significantly reducing the risk of overfitting. Secondly, the Bayesian approach naturally takes into account the variability of the model parameters A around their most probable values A_mp. In the model considered in this article, the model parameters can be integrated out explicitly, yielding a nonlinear optimization problem for all hyperparameters. This is in contrast with Heskes (1998), where first the feature matrix B was learned in a standard frequentist manner, after which the hyperparameters for the prior were obtained in an (empirical) Bayesian fashion using an EM algorithm.

The empirical Bayesian approach shifts the machine learning problem from learning model parameters to learning hyperparameters. From a technical point of view the basic learning problem stays the same, but now appears at a higher level. At this higher level, we can apply many of the algorithms originally designed for model parameters, such as e.g. pruning techniques. The inference of the hyperparameters is in fact handled in a frequentist rather than a full Bayesian manner. This seems fine for the large number of tasks in our simulations, but for smaller problems it would be better to take the full Bayesian approach and sample over the hyperparameters. Another alternative is to approximate their distribution using a standard Laplace approximation (see e.g. Robert, 1994). A full hierarchical Bayesian approach would bring our model even closer to the framework of Baxter (1997).

Even now, under quite different conditions, the experimentally obtained learning curves have the same flavor as the results of Baxter (1997), namely

$$E \approx E_0 + \frac{1}{N}\left( a + \frac{b}{n} \right) ,$$

with E_0 some base-line error, a related to the number of model parameters per task, b related to the number of hyperparameters, N the number of patterns per task, and n the number of tasks. Our first crude explanation of these and other experimental observations leaves plenty of room for better analysis. For example, the differences between intra-task test and inter-task test error are puzzling and not yet understood. Furthermore, it would be interesting to study the dependency of the optimal hyperparameters on the sufficient statistics. This might be done in a student-teacher paradigm, often used in statistical mechanics studies of learning single tasks.

To be useful as a robust and accurate multitask prediction system, we need to relax some of the simplifying assumptions and extend the model. Nonlinear hidden units are not really relevant in the noisy environment of newspaper and magazine sales. More important is an integration with (Bayesian) methodology for time series prediction. But even without these improvements, the current model allows for rapid testing of architectures and all kinds of other options, which would otherwise be infeasible.
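The core of the model is in fact compact enough to sketch numerically. The following Python fragment is an illustrative simulation only, with hypothetical, small dimensions (it is not the prediction system described above): it draws data from the generative model (1), enforces the whitening constraint (3), and checks that the maximum likelihood solution A_ml = B Σ_xy recovers the task-specific weights up to estimation noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, small dimensions for illustration only:
# n tasks, N patterns, n_inp inputs, n_hid hidden units.
n, N, n_inp, n_hid = 50, 200, 9, 3
sigma, phi = 0.9, 0.3          # noise hyperparameters of the order found in Section 4.3

# Inputs and their covariance matrix Sigma_xx.
X = rng.normal(size=(N, n_inp))
Sigma_xx = X.T @ X / N

# Feature matrix B, transformed so that the hidden-unit activities are
# whitened: B Sigma_xx B^T = I, i.e., constraint (3).
B_raw = rng.normal(size=(n_hid, n_inp))
L = np.linalg.cholesky(B_raw @ Sigma_xx @ B_raw.T)
B = np.linalg.solve(L, B_raw)
assert np.allclose(B @ Sigma_xx @ B.T, np.eye(n_hid), atol=1e-8)

# Task-specific hidden-to-output weights A_i (the model parameters).
A = rng.normal(size=(n_hid, n))

# Noise of Eq. (1): an individual term xi (std sigma) per task, plus a
# common term eta (std phi * sigma) shared by all tasks within a pattern.
xi = sigma * rng.normal(size=(N, n))
eta = phi * sigma * rng.normal(size=(N, 1))
Y = X @ B.T @ A + xi + eta

# Maximum likelihood solution under constraint (3): A_ml = B Sigma_xy.
Sigma_xy = X.T @ Y / N
A_ml = B @ Sigma_xy

# A_ml matches A up to estimation noise of order sigma / sqrt(N).
deviation = np.abs(A_ml - A).mean()
assert deviation < 0.2
```

Because the hidden activities are whitened, the ordinary least-squares normal equations collapse to the single matrix product B Σ_xy, which is what makes the bookkeeping of the Appendix tractable.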
Acknowledgments

This research was supported by the Technology Foundation STW, applied science division of NWO, and the technology programme of the Ministry of Economic Affairs.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716-723.

Baldi, P., & Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2, 53-58.

Baxter, J. (1997). A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28, 7-39.

Bryk, A., & Raudenbush, S. (1992). Hierarchical linear models. Newbury Park: Sage.

Caruana, R. (1997). Multitask learning. Machine Learning, 28, 41-75.

Heskes, T. (1998). Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach. Proceedings of the International Conference on Machine Learning (pp. 233-241). San Mateo: Morgan Kaufmann.

Pratt, L., & Jennings, B. (1996). A survey of transfer between connectionist networks. Connection Science, 8, 163-184.

Robert, C. (1994). The Bayesian choice: A decision-theoretic motivation. New York: Springer.

Thrun, S., & Pratt, L. (Eds.). (1997). Learning to learn. Dordrecht: Kluwer Academic.

Appendix

In this appendix we describe the loss function L(θ), the function that should be minimized with respect to the hyperparameters θ in the empirical Bayesian approach. Its derivation is really just a matter of bookkeeping, with perhaps the only difficulty the inversion of the noise covariance matrix Σ for nonzero φ. Here the decomposition

$$\Sigma = \sigma^2 (1 + n\varphi^2)\, U_2 + \sigma^2\, U_1 , \quad (4)$$

with U_2 = 1/n (the matrix with [U_2]_ij = 1/n for all i, j) and U_1 = I − U_2 two orthogonal projection matrices, may help.

For explainability, we decompose the result:

$$L(\theta) = L_0(\theta) + \frac{Nn}{2\sigma^2}\left[ L_1(\theta) - (1-\lambda)\, L_2(\theta) \right] ,$$

with λ = [1 + nφ²]⁻¹ a function of φ; λ = 1 for φ = 0. The first term L_0(θ) contains several normalization terms, all independent of the data except for the number of patterns N and the number of tasks n:

$$L_0(\theta) = \frac{N}{2}\left( n \log\sigma^2 - \log\lambda \right) + \frac{1}{2} \log\det(\lambda\tilde\Lambda_A + I) + \frac{1}{2}(n-1) \log\det(\tilde\Lambda_A + I) .$$

Here Λ̃_A = N Λ_A / σ² is the prior covariance matrix Λ_A relative to the uncertainty induced by the noise variance. Λ̃_A of order one means that prior information is given as much weight as the data specific to each task.

L_1(θ) is further subdivided in two terms:

$$L_1(\theta) = L_{11}(B) + L_{12}(B, \Omega, \tilde\Lambda_A) ,$$

and similarly for L_2(θ). Differences between L_1 and L_2 can be traced back to the decomposition (4): L_1 collects all terms related to the identity matrix in U_1, L_2 those from U_2. L_11 is the standard mean-squared error considered in Baldi and Hornik (1989):

$$L_{11}(B) = \left\langle y^2 \right\rangle_{\rm tasks,\,patterns} - {\rm Tr}\left[ B R_{xx} B^T \right] ,$$

with R_xx = Σ_xy Σ_xy^T / n a covariance matrix of covariances. The simple form derives from the constraint (3). The corresponding term for L_2 reads

$$L_{21}(B) = \left\langle \langle y \rangle^2_{\rm tasks} \right\rangle_{\rm patterns} - R_x^T B^T B R_x ,$$

where R_x = ⟨Σ_xy⟩_tasks is the input-output covariance averaged over all outputs. The remaining terms are mainly a function of the hyperparameters Ω and Λ̃_A, but also depend on B and λ. Ωz plays a role similar to B Σ_xy. With further definitions R_z = ⟨Ωz⟩_tasks, R_xz = Σ_xy z^T / n, and R_zz = z z^T / n, we can write

$$L_{12}(B, \Omega, \tilde\Lambda_A) = {\rm Tr}\left[ \left( B R_{xx} B^T - 2\, B R_{xz} \Omega^T + \Omega R_{zz} \Omega^T \right) \left( \tilde\Lambda_A + I \right)^{-1} \right]$$

and

$$L_{22}(B, \Omega, \tilde\Lambda_A, \lambda) = (B R_x - R_z)^T \left[ \lambda\tilde\Lambda_A + I \right]^{-1} \left[ \tilde\Lambda_A + I \right]^{-1} (B R_x - R_z) .$$
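To make the bookkeeping above concrete, the following Python fragment (an illustrative sketch with hypothetical, small dimensions) computes the sufficient statistics R_xx and R_x from simulated data and evaluates the data-dependent terms L_11 and L_21. Note that the statistics have a fixed size, independent of the number of tasks n and patterns N, and that under constraint (3) L_11 coincides with the empirical mean-squared error of the maximum likelihood fit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes; the statistics R_xx and R_x keep a fixed size
# (at most n_inp x n_inp) no matter how many tasks n or patterns N.
n, N, n_inp, n_hid = 40, 150, 6, 2
X = rng.normal(size=(N, n_inp))
Y = rng.normal(size=(N, n))

# A feature matrix satisfying constraint (3): B Sigma_xx B^T = I.
Sigma_xx = X.T @ X / N
B_raw = rng.normal(size=(n_hid, n_inp))
B = np.linalg.solve(np.linalg.cholesky(B_raw @ Sigma_xx @ B_raw.T), B_raw)

# Sufficient statistics.
Sigma_xy = X.T @ Y / N             # input-output covariances, one column per task
R_xx = Sigma_xy @ Sigma_xy.T / n   # the "covariance matrix of covariances"
R_x = Sigma_xy.mean(axis=1)        # input-output covariance averaged over outputs

# L11: the standard mean-squared-error term of Baldi and Hornik (1989).
L11 = np.mean(Y**2) - np.trace(B @ R_xx @ B.T)

# Check: under constraint (3), L11 equals the empirical mean-squared
# error of the maximum likelihood fit A_ml = B Sigma_xy.
A_ml = B @ Sigma_xy
mse = np.mean((Y - X @ B.T @ A_ml) ** 2)
assert np.isclose(L11, mse)

# L21: the corresponding term for the common (U2) noise mode; it is the
# residual of fitting the task-averaged output, hence non-negative.
y_bar = Y.mean(axis=1)
L21 = np.mean(y_bar**2) - R_x @ B.T @ B @ R_x
assert L21 >= 0.0
```

Once Σ_xy, R_xx, R_x, and the z-statistics are computed, every subsequent evaluation of L(θ) touches only n_inp-sized matrices, which is what makes the hyperparameter optimization independent of n and N.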
