[Figure 1: three panels plotting mean-squared error against the number of hidden units (0 to 10), with curves labelled train (ml), test (ml), train (mp), and test (mp).]
Figure 1. Mean-squared errors as a function of the number of hidden units. (a) Inter-task test error for most probable (solid) and maximum likelihood (dashed) solutions. (b) The same for the inter-task training error. (c) Difference between inter- and intra-task test (solid) and training (dashed) errors. The lines serve to guide the eye. Error bars give the standard deviation of the mean. Averages over 20 runs. See the text for further details.
ing seasonal patterns comes out significantly higher for outlets in touristic areas than for outlets in urban areas. More testing is needed, especially in situations with fewer training patterns per outlet, where more accurate priors really start to pay off.

4.4 Performance with Increasing Hidden Units

In our model we have both a bottleneck of hidden units, reducing the inputs to a smaller set of features, and, on top of that, hyperparameters specifying a prior for the model parameters. Do we really need both? To check this we applied the simulation paradigm sketched above to data set III, varying the number of features and computing error measures not only for the most probable solutions Amp, but also for the maximum likelihood solutions Aml. The results for these maximum likelihood solutions indicate the performance that can be obtained in a frequentist multitask learning approach without taking into account prior information, following e.g. Caruana (1997) and Pratt and Jennings (1996). We deliberately considered a data set with a relatively small number of inputs (ninp = 9). This makes it feasible to go all the way up to nhid = ninp, which is equivalent to the case of no bottleneck. Loosely speaking, the maximum likelihood test error for nhid = ninp is the test error for single-task learning. The results are shown in Figure 1. In each run n = 500 tasks were available for fitting the hyperparameters and another 500 for evaluation, with N = 92 training patterns.

Looking at the inter-task test error on the left-hand side, it can be seen that both the feature matrix and the priors for the model parameters improve performance. The test error based on the maximum likelihood solutions neglecting the prior rapidly grows with an increasing number of features, yielding by far the worst performance when all tasks are treated separately. The optimum for the most probable solutions is obtained for nhid = 4. Even though the priors on the model parameters control the risk of overfitting rather well, the test error for nhid = 4 is significantly lower than the one for nhid = ninp. In short, the main aspects of the model, feature reduction and exchangeability of model parameters, work well on this set of regression tasks and validate the concept of "learning to learn".
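The difference between maximum likelihood and most probable solutions can be illustrated on a toy single-task example. The following Python fragment is only a minimal sketch of the general idea, not the model of this paper: for a single linear regression task with a zero-mean isotropic Gaussian prior on the weights, the most probable (MAP) solution reduces to ridge regression, and with few training patterns it typically generalizes better than the maximum likelihood (least squares) solution. All quantities in the snippet are arbitrary illustrative choices.

import numpy as np

# Toy illustration (not the paper's model): maximum likelihood versus most probable
# solution for a single linear regression task with a Gaussian prior on the weights.
rng = np.random.default_rng(0)
n_features, n_train, n_test = 4, 20, 1000
A_true = rng.normal(size=n_features)
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
sigma = 1.0                                    # noise standard deviation
y_train = X_train @ A_true + sigma * rng.normal(size=n_train)
y_test = X_test @ A_true + sigma * rng.normal(size=n_test)

# Maximum likelihood solution: ordinary least squares.
A_ml, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Most probable solution: MAP estimate with prior variance sigma_A2 on each weight,
# which is ridge regression with penalty sigma**2 / sigma_A2.
sigma_A2 = 1.0
penalty = sigma**2 / sigma_A2
A_mp = np.linalg.solve(X_train.T @ X_train + penalty * np.eye(n_features),
                       X_train.T @ y_train)

for name, A in [("maximum likelihood", A_ml), ("most probable", A_mp)]:
    print(name, "test mean-squared error:", np.mean((y_test - X_test @ A) ** 2))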
[Figure 2: two panels plotting mean-squared error against 1/number of tasks; panel (a) shows the inter-task test error, panel (b) the intra-task test, inter-task training, and intra-task training errors.]
Figure 2. Mean-squared errors as a function of the number of tasks used for fitting the hyperparameters. (a) Learning-to-learning curve: the inter-task test error as a function of the number of tasks. (b) Other training curves: intra-task test error, inter-task training error, and intra-task training error (from top to bottom). The lines are the best linear fits. Error bars give the standard deviation of the mean. Averages over 75 runs. See the text for further details.
The upper right graph shows that the inter-task training error indeed decreases as a function of the number of features, for the maximum likelihood solutions a little faster than for the most probable solutions. The lower right graph gives the difference between the inter- and intra-task training and test errors for the most probable solutions. These differences, although mostly significant, are very small on the scale of the other two graphs. Roughly speaking, they measure the impact of a single task on the hyperparameters optimized on a set of n tasks. This impact scales with 1/n, yielding very small differences for the number of tasks n = 500 in this simulation. The intra-task training error is consistently lower than the inter-task training error, as could be expected. The same difference for the test error changes sign around the optimal number of features: with fewer hidden units the inter-task test error is higher than the intra-task test error, and vice versa with more hidden units. In other words, when overfitting it helps to optimize the hyperparameters on an independent set of tasks, excluding the task under consideration. We have observed similar behavior in other simulations, but do not yet fully understand it.

4.5 Learning-to-Learning Curves

In another set of simulations on data set I, we derived the learning curves of Figure 2. Within each run, tasks were split up randomly into 50% for training and 50% for testing, as before. But now we varied the number of (training) tasks used for optimizing the hyperparameters from n = 17 to the maximal n = 171. We considered a network with ninp = 20, nhid = 3, and nprior = 1, with N = 122 training patterns per task.

The graph on the left-hand side yields a "learning-to-learning" curve: the inter-task test error as a function of the number of tasks used to optimize the hyperparameters. On the right are the learning curves for the intra-task test, inter-task training, and intra-task training error. The simulations strongly suggest a linear relationship between these errors and the inverse of the number of tasks. Although the exact conditions and error measures differ, this relationship is well in line with the theoretical findings of Baxter (1997).

Here we will sketch a simpler alternative argument, to be worked out more precisely in a separate article. First we consider the difference between the intra-task training error and the inter-task training error. Call $E_\infty$ the training error that would be obtained with an infinite number of tasks available for fitting the hyperparameters. With a finite number of tasks, overfitting will reduce the intra-task training error, yet increase the inter-task training error. Assuming that the training error is the dominant term in the loss function $L(\Lambda)$, loose application of Akaike's (1974) information criterion yields

$$E_\infty - E^{\mathrm{intra}}_{\mathrm{train}} \approx E^{\mathrm{inter}}_{\mathrm{train}} - E_\infty \approx \frac{2|\Lambda|}{Nn} \,,$$

with $|\Lambda|$ the (effective) dimension of the hyperparameters $\Lambda$. Based on this expression, we would predict an absolute slope of 0.4, fairly close to the fitted -0.38 for the intra-task training error and 0.42 for the inter-task training error. Furthermore, the intra- and inter-task training error indeed seem to have the same intercept.

Next we compare the inter-task training and inter-task test error. They are measured based on the same set of hyperparameters, obtained on an independent set of tasks. Therefore we can again apply Akaike's information criterion, but now considering the effect of optimizing the model parameters on the training data. Neglecting the effect of the prior, we get

$$E^{\mathrm{inter}}_{\mathrm{test}} \approx E^{\mathrm{inter}}_{\mathrm{train}} + \frac{2|A|\sigma^2}{N} \,,$$

with $|A|$ the dimension of the model parameters. This crude estimate yields a difference of 0.36, to be compared with 0.37 for the experimental fit. Furthermore, it suggests that the slope for the inter-task test error is the same as for the inter-task training error, which is also experimentally verified. There is no significant difference between the experimental fits for the intra-task and inter-task test errors, but again, we do not know how to link these theoretically.
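To make the comparison between the measured and predicted slopes concrete, the following Python fragment is a minimal sketch, not the original analysis code, of how a learning-to-learning curve can be fitted against 1/n and compared with the Akaike-style prediction above. The grid of task numbers, the effective dimension of the hyperparameters, and the asymptotic error are made-up illustrative values; only N = 122 and the end points n = 17 and n = 171 are taken from the text.

import numpy as np

# Minimal sketch of fitting a learning-to-learning curve against 1/n.
N = 122                                          # training patterns per task (Section 4.5)
n_tasks = np.array([17, 34, 57, 86, 114, 171])   # hypothetical grid between the quoted n = 17 and n = 171
dim_hyper = 24                                   # hypothetical effective dimension |Lambda|, not stated in the text
E_inf = 0.86                                     # hypothetical asymptotic error E_infinity

# Synthetic inter-task training errors generated from the predicted linear relation plus noise.
rng = np.random.default_rng(1)
errors = E_inf + 2.0 * dim_hyper / (N * n_tasks) + 0.002 * rng.normal(size=n_tasks.size)

# Best linear fit of error versus 1/n, as in Figure 2.
slope, intercept = np.polyfit(1.0 / n_tasks, errors, 1)
print("fitted slope:", slope)
print("predicted slope 2|Lambda|/N:", 2.0 * dim_hyper / N)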
5. Discussion and Outlook

In this article we have presented a new model for multitask learning, analyzed within an empirical Bayesian framework. Model parameters specific to each task and hyperparameters shared between tasks are treated at a different level, as advocated in Baxter (1997). Compared with the multitask learning approaches of Caruana (1997) and Pratt and Jennings (1996), where input-to-hidden weights B and hidden-to-output weights A are treated at the same frequentist level, the Bayesian approach has several advantages. First of all, the Bayesian approach facilitates the incorporation of prior knowledge, the parameters of which, the prior mean and the prior covariance $\Sigma_A$, are learned in the same manner as the feature matrix B. This turns the maximum likelihood parameters Aml into most probable parameters Amp, significantly reducing the risk of overfitting. Secondly, the Bayesian approach naturally takes into account the variability of the model parameters A around their most probable values Amp. In the model considered in this article, the model parameters can be integrated out explicitly, yielding a nonlinear optimization problem for all hyperparameters. This is in contrast with Heskes (1998), where first the feature matrix B was learned in a standard frequentist manner, after which the hyperparameters for the prior were obtained in an (empirical) Bayesian fashion using an EM algorithm.

The empirical Bayesian approach shifts the machine learning problem from learning model parameters to learning hyperparameters. From a technical point of view the basic learning problem stays the same, but now appears at a higher level. At this higher level, we can apply many of the algorithms originally designed for model parameters, such as pruning techniques. The inference of the hyperparameters is in fact handled in a frequentist rather than a fully Bayesian manner. This seems fine for the large number of tasks in our simulations, but for smaller problems it would be better to take the full Bayesian approach and sample over the hyperparameters. Another alternative is to approximate their distribution using a standard Laplace approximation (see e.g. Robert, 1994). A full hierarchical Bayesian approach would bring our model even closer to the framework of Baxter (1997).

Even now, under quite different conditions, the experimentally obtained learning curves have the same flavor as the results of Baxter (1997), namely

$$E \approx E_0 + \frac{a}{N} + \frac{b}{n} \,,$$

with $E_0$ some base-line error, $a$ related to the number of model parameters per task, $b$ related to the number of hyperparameters, $N$ the number of patterns per task, and $n$ the number of tasks. Our first crude explanation of these and other experimental observations leaves plenty of room for better analysis. For example, the differences between intra-task test and inter-task test error are puzzling and not yet understood. Furthermore, it would be interesting to study the dependency of the optimal hyperparameters on the sufficient statistics. This might be done in a student-teacher paradigm, often used in statistical mechanics studies of learning single tasks.

To be useful as a robust and accurate multitask prediction system, we need to relax some of the simplifying assumptions and extend the model. Nonlinear hidden units are not really relevant in the noisy environment of newspaper and magazine sales. More important is an integration with (Bayesian) methodology for time series prediction. But even without these improvements, the current model allows for rapid testing of architectures and all kinds of other options, which would otherwise be infeasible.
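For concreteness, the Laplace approximation mentioned above would take the standard textbook form (see e.g. Robert, 1994); it is not a result specific to our model. Assuming for simplicity a flat hyperprior, so that the posterior over the hyperparameters is proportional to $\exp[-L(\Lambda)]$, the posterior is approximated by a Gaussian centred at the most probable hyperparameters, with covariance given by the inverse Hessian of the loss:

$$P(\Lambda \mid \mathrm{data}) \approx \mathcal{N}\!\left(\Lambda;\, \Lambda^{\mathrm{mp}},\, H^{-1}\right), \qquad H = \left.\frac{\partial^2 L(\Lambda)}{\partial \Lambda\, \partial \Lambda^T}\right|_{\Lambda = \Lambda^{\mathrm{mp}}} .$$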
Acknowledgments

This research was supported by the Technology Foundation STW, applied science division of NWO and the technology programme of the Ministry of Economic Affairs.
References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.

Baldi, P., & Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2, 53–58.

Baxter, J. (1997). A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28, 7–39.

Bryk, A., & Raudenbush, S. (1992). Hierarchical linear models. Newbury Park: Sage.

Caruana, R. (1997). Multitask learning. Machine Learning, 28, 41–75.

Heskes, T. (1998). Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach. Proceedings of the International Conference on Machine Learning (pp. 233–241). San Mateo: Morgan Kaufmann.

Pratt, L., & Jennings, B. (1996). A survey of transfer between connectionist networks. Connection Science, 8, 163–184.

Robert, C. (1994). The Bayesian choice: A decision-theoretic motivation. New York: Springer.

Thrun, S., & Pratt, L. (Eds.). (1997). Learning to learn. Dordrecht: Kluwer Academic.
Appendix

In this appendix we describe the loss function $L(\Lambda)$, the function that should be minimized with respect to the hyperparameters $\Lambda$ in the empirical Bayesian approach. Its derivation is really just a matter of bookkeeping, with perhaps the only difficulty being the inversion of the noise covariance matrix $\Sigma$ for nonzero $\tau$. Here the decomposition

$$\Sigma = \sigma^2 (1 + n\tau^2)\, U_2 + \sigma^2 U_1 \,, \qquad (4)$$

with $U_2 = \mathbf{1}/n$ ($\mathbf{1}_{ij} = 1\ \forall\, ij$) and $U_1 = I - U_2$ two orthogonal projection matrices, may help.

For explainability, we decompose the result:

$$L(\Lambda) = L_0(\Lambda) + \frac{Nn}{2\sigma^2}\left[ L_1(\Lambda) - (1 - \lambda) L_2(\Lambda) \right] \,,$$

with $\lambda = [1 + n\tau^2]^{-1}$ a function of $\tau$; $\lambda = 1$ for $\tau = 0$. The first term $L_0(\Lambda)$ contains several normalization terms, all independent of the data except for the number of patterns $N$ and the number of tasks $n$:

$$L_0(\Lambda) = \frac{N}{2}\left( n \log \sigma^2 - \log \lambda \right) + \frac{1}{2} \log\det\left(\lambda \tilde{\Sigma}_A + I\right) + \frac{1}{2}(n - 1) \log\det\left(\tilde{\Sigma}_A + I\right) \,.$$

Here $\tilde{\Sigma}_A = N \Sigma_A / \sigma^2$ is the prior covariance matrix $\Sigma_A$ relative to the uncertainty induced by the noise variance. $\tilde{\Sigma}_A$ of order one means that prior information is given as much weight as the data specific to each task. $L_1(\Lambda)$ is further subdivided in two terms:

$$L_1(\Lambda) = L_{11}(B) + L_{12}(B, M, \tilde{\Sigma}_A) \,,$$

and similarly for $L_2(\Lambda)$. Differences between $L_1$ and $L_2$ can be traced back to the decomposition (4): $L_1$ collects all terms related to the identity matrix in $U_1$, $L_2$ those from $U_2$. $L_{11}$ is the standard mean-squared error considered in Baldi and Hornik (1989):

$$L_{11}(B) = \left\langle y^2 \right\rangle_{\mathrm{tasks,\,patterns}} - \mathrm{Tr}\, B R_{xx} B^T \,,$$

with $R_{xx} = \Sigma_{xy} \Sigma_{xy}^T / n$ a covariance matrix of covariances. The simple form derives from the constraint (3). The corresponding term for $L_2$ reads

$$L_{21}(B) = \left\langle \langle y \rangle^2_{\mathrm{tasks}} \right\rangle_{\mathrm{patterns}} - R_x^T B^T B R_x \,,$$

where $R_x = \langle \Sigma_{xy} \rangle_{\mathrm{tasks}}$ is the input-output covariance averaged over all outputs. The remaining terms are mainly a function of the hyperparameters $M$ and $\tilde{\Sigma}_A$, but also depend on $B$ and $\lambda$. $z$ plays a role similar to $B \Sigma_{xy}$. With further definitions $R_z = \langle z \rangle_{\mathrm{tasks}}$, $R_{xz} = \Sigma_{xy} z^T / n$, and $R_{zz} = z z^T / n$, we can write

$$L_{12}(B, M, \tilde{\Sigma}_A) = \mathrm{Tr}\left[ B R_{xx} B^T - 2 B R_{xz} M^T + M R_{zz} M^T \right]\left[ \tilde{\Sigma}_A + I \right]^{-1}$$

and

$$L_{22}(B, M, \lambda, \tilde{\Sigma}_A) = (B R_x - M R_z)^T \left[ \tilde{\Sigma}_A + I \right]^{-1} \left[ \lambda \tilde{\Sigma}_A + I \right]^{-1} (B R_x - M R_z) \,.$$
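As a small numerical sanity check on the decomposition (4) and on the inversion it is meant to simplify, the following NumPy snippet (not part of the original derivation) verifies that $U_1$ and $U_2$ are orthogonal projections and that the inverse of $\Sigma$ is simply $\sigma^{-2}[U_1 + \lambda U_2]$ with $\lambda = [1 + n\tau^2]^{-1}$; the values of $n$, $\sigma^2$, and $\tau^2$ are arbitrary.

import numpy as np

# Sanity check of decomposition (4): Sigma = sigma^2 (1 + n tau^2) U2 + sigma^2 U1,
# with U2 = ones/n and U1 = I - U2 orthogonal projections, so that
# Sigma^{-1} = (U1 + lam * U2) / sigma^2 with lam = 1 / (1 + n tau^2).
n, sigma2, tau2 = 5, 0.7, 0.3
U2 = np.full((n, n), 1.0 / n)
U1 = np.eye(n) - U2

assert np.allclose(U1 @ U1, U1) and np.allclose(U2 @ U2, U2)   # idempotent projections
assert np.allclose(U1 @ U2, np.zeros((n, n)))                  # mutually orthogonal

Sigma = sigma2 * (1 + n * tau2) * U2 + sigma2 * U1
lam = 1.0 / (1 + n * tau2)
Sigma_inv = (U1 + lam * U2) / sigma2
print(np.allclose(Sigma @ Sigma_inv, np.eye(n)))               # prints True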