This document proposes minimizing the description length of neural network weights to improve generalization. It discusses 3 methods: 1) Limiting network complexity by reducing connections or weight sharing. 2) Quantizing weights so each value uses a fixed number of bits. 3) Adding Gaussian noise to weights and adapting noise levels during learning to minimize description length and error. The key idea is keeping weights simple by penalizing information content to reduce overfitting.
This document proposes minimizing the description length of neural network weights to improve generalization. It discusses 3 methods: 1) Limiting network complexity by reducing connections or weight sharing. 2) Quantizing weights so each value uses a fixed number of bits. 3) Adding Gaussian noise to weights and adapting noise levels during learning to minimize description length and error. The key idea is keeping weights simple by penalizing information content to reduce overfitting.
This document proposes minimizing the description length of neural network weights to improve generalization. It discusses 3 methods: 1) Limiting network complexity by reducing connections or weight sharing. 2) Quantizing weights so each value uses a fixed number of bits. 3) Adding Gaussian noise to weights and adapting noise levels during learning to minimize description length and error. The key idea is keeping weights simple by penalizing information content to reduce overfitting.
Georey E. Hinton and Drew van Camp Department of Computer Science University of Toronto 10 King's College Road Toronto M5S 1A4, Canada
Abstract Limit the number of connections in the network
(and hope that each weight does not have too much Supervised neural networks generalize well if information in it). there is much less information in the weights Divide the connections into subsets, and force the than there is in the output vectors of the train- weights within a subset to be identical. If this ing cases. So during learning, it is impor- \weight-sharing" is based on an analysis of the nat- tant to keep the weights simple by penaliz- ural symmetries of the task it can be very eective ing the amount of information they contain. (Lang, Waibel and Hinton (1990); LeCun 1989). The amount of information in a weight can Quantize all the weights in the network so that a be controlled by adding Gaussian noise and probability mass, p, can be assigned to each quan- the noise level can be adapted during learning tized value. The number of bits in a weight is then to optimize the trade-o between the expected , log p, provided we ignore the cost of dening the squared error of the network and the amount quantization. Unfortunately this method leads to of information in the weights. We describe a dicult search space because the cost of a weight a method of computing the derivatives of the does not have a smooth derivative. expected squared error and of the amount of information in the noisy weights in a net- work that contains a layer of non-linear hidden units. Provided the output units are linear, the 2 Applying the Minimum Description exact derivatives can be computed eciently Length Principle without time-consuming Monte Carlo simula- When tting models to data, it is always possible to t tions. The idea of minimizing the amount of information that is required to communicate the training data better by using a more complex model, the weights of a neural network leads to a but this may make the model worse at tting new data. number of interesting schemes for encoding the So we need some way of deciding when extra complex- weights. ity in the model is not worth the improvement in the data-t. The Minimum Description Length Principle (Rissanen, 1986) asserts that the best model of some data is the one that minimizes the combined cost of 1 Introduction describing the model and describing the mist between the model and the data. For supervised neural networks In many practical learning tasks there is little available with a predetermined architecture, the model cost is the training data so any reasonably complicated model will number of bits it takes to describe the weights, and the tend to overt the data and give poor generalization to data-mist cost is the number of bits it takes to describe new data. To avoid overtting we need to ensure that the discrepancy between the correct output and the out- there is less information in the weights than there is in put of the neural network on each training case. We can the output vectors of the training cases. Researchers think in terms of a sender who can see both the input have considered many possible ways of limiting the in- vector and the correct output and a receiver who can formation in the weights: only see the input vector. The sender rst ts a neural network, of pre-arranged architecture, to the complete set of training cases, then sends the weights to the re- ceiver. For each training case the sender also sends the discrepancy between the net's output and the correct output. By adding this discrepancy to the output of the net, the receiver can generate exactly the correct output. t square deviation of the mists from zero.1 Using this value of j and summing over all training cases the last term of equation 2 becomes a constant and the data mist cost is: " # 0 v Cdata-mist = kN + N log 1 X(dc , yc )2 (3) 2 N c j j Figure 1: This shows the probability mass asso- ciated with a quantized value, v, using a quan- where k is a constant that depends only on t. tization width t. If t is much narrower than the Gaussian distribution, the probability mass is well Independently of whether we use the optimal value or approximated by the product of the height and the a predetermined value for j , it is apparent that the width, so the log probability is a sum of two terms. description length is minimized by minimizing the usual The log t term is a constant, and if the distribu- squared error function, so the Gaussian assumptions we tion is a zero-mean Gaussian, the log of the height have made about coding can be viewed as the MDL is proportional to v2 . justication of this error function.
3 Coding the data mists 4 A simple method of coding the
weights To apply the MDL principle we need to decide on a coding scheme for the data mists and for the weights. We could code the weights in just the same way as we Clearly, if the data mists are real numbers, an innite code the data mists. We assume that the weights of amount of information is needed to convey them. So we the trained network are nely quantized and come from shall assume that they are very nely quantized, using a zero-mean Gaussian distribution. If the standard de- intervals of xed width t. We shall also assume that viation, w , of this distribution is xed in advance, the the data mists are encoded separately for each of the description length of the weights is simply proportional output units. to the sum of their squares. So, assuming we use a Gaus- sian with standard deviation j for encoding the output The coding theorem tells us that if a sender and a re- errors, we can minimize the total description length of ceiver have agreed on a probability distribution that as- the data mists and the weights by minimizing the sum signs a probability mass, p(y), to each possible quan- of two terms: tized data mist, y, then we can code the mist using , log2 p(y) bits. If we want to minimize the expected number of bits, the best probability distribution to use C= X 1 ( X dcj , yjc )2 + 1 X w2 (4) is the correct one, but any other agreed distribution can 2w2 ij ij also be used. For convenience, we shall assume that for j 2 j c 2 output unit j the data mists are encoded by assuming that they are drawn from a zero-mean Gaussian distri- where c is an index over training cases. bution with standard deviation, j . Provided that j is This is just the standard \weight-decay" method. The large compared with the quantization width t, the as- fact that weight-decay improves generalization (Hinton, sumed probability of a particular data mist between 1987) can therefore be viewed as a vindication of this the desired output, dcj on training case c and the actual crude MDL approach in which the standard deviations output yjc is then well approximated by the probability of the gaussians used for coding the data mists and the mass shown in gure 1. weights are both xed in advance.2 " # An elaboration of standard weight-decay is to assume 1 ,(dcj , yjc )2 that the distribution of weights in the trained network p(dcj , yjc ) = t p exp (1) 2 j 2j2 can be modelled more accurately by using a mixture of several Gaussians whose means, variances and mix- Using an optimal code, the description length of a data ing proportions are adapted as the network is trained mist, dcj , yjc , in units of log2(e) bits (called \nats") is: 1 If the optimal value of j is to be used, it must be com- municated before the data mists are sent, so it too must be p (dc , yc )2 coded. However, since it is only one number we are probably , log p(dcj , yjc ) = , log t +log 2 +log j + j 22 j safe in ignoring this aspect of the total description length. 2 It is clear from equation 4 that it is only the ratio of j (2) the variances of the two Gaussians that matters. Rather than guessing this ratio, it is usually better to estimate it by To minimize this description length summed over all N seeing which ratio gives optimal performance on a validation training cases, the optimal value of j is the root mean set. (Nowlan and Hinton, 1992). For some tasks this more weight. After learning, the sender has a Gaussian pos- elaborate way of coding the weights gives considerably terior distribution, Q, for the weight. We describe a better generalization. This is especially true when only method of communicating both the weights and the a small number of dierent weight values are required. data mists and show that using this method the num- However, this more elaborate scheme still suers from ber of bits required to communicate the posterior distri- a serious weakness: It assumes that all the weights are bution of a weight is equal to the asymmetric divergence quantized to the same tolerance, and that this tolerance (the Kullback-Liebler distance) from P to Q. is small compared with the standard deviations of the Z Gaussians used for modelling the weight distribution. Q(w) Thus it takes into account the probability density of a G(P; Q) = Q(w) log dw (5) P (w) weight (the height in gure 1) but it ignores the pre- cision (the width). This is a terrible waste of bits. A 5.2 The \bits back" argument network is clearly much more economical to describe if some of the weight values can be described very impre- To communicate a set of noisy weights, the sender rst cisely without signicantly aecting the predictions of collapses the posterior probability distribution for each the network. weight by using a source of random bits to pick a precise MacKay (1992) has considered the eects of small value for the weight (to within some very ne tolerance changes in the weights on the outputs of the network t). The probability of picking each possible value is de- after the network has been trained. The next section termined by the posterior probability distribution for describes a method of taking the precision of the weights the weight. The sender then communicates these pre- into account during training so that the precision of a cise weights by coding them using some prior Gaussian weight can be traded against both its probability den- distribution, P , so that the communication cost of a sity and the excess data mist caused by imprecision in precise weight, w, is: the weight. C (w) = , log t , log P (w) (6) 5 Noisy weights t must be small compared with the variance of P so A standard way of limiting the amount of information in C (w) is big. However, as we shall see, we are due for a a number is to add zero-mean Gaussian noise. At rst big refund at the end. sight, a noisy weight seems to be even more expensive Having sent the precise weights, the sender then com- to communicate than a precise one since it appears that municates the data-mists achieved using those weights. we need to send a variance as well as a mean, and that Having received the weights and the mists, the receiver we need to decide on a precision for both of these. As we can then produce the correct outputs. But he can also shall see, however, the MDL framework can be adapted do something else. Once he has the correct outputs he to allow very noisy weights to be communicated very can run whatever learning algorithm was used by the cheaply. sender and recover the exact same posterior probability When using backpropagation to train a feedforward distribution, Q, that the sender collapsed in order to get neural network, it is standard practice to start at some the precise weights.3 Now, since the receiver knows the particular point in weight space and to move this point sender's posterior distribution for each weight and he in the direction that reduces the error function. An al- knows the precise value that was communicated, he can ternative approach is to start with a multivariate Gaus- recover all the random bits that the sender used to col- sian distribution over weight vectors and to change both lapse that distribution to that value. So these random the mean and the variance of this cloud of weight vec- bits have been successfully communicated and we must tors so as to reduce some cost function. We shall restrict subtract them from the overall communication cost to ourselves to distributions in which the weights are inde- get the true cost of communicating the model and the pendent, so the distribution can be represented by one mists. The number of random bits required to collapse mean and one variance per weight. the posterior distribution for a weight, Q, to a particular The cost function is the expected description length of nely quantized value, w, is: the weights and of the data mists. It turns out that high-variance weights are cheaper to communicate but R(w) = , log t , log Q(w) (7) they cause extra variance in the data mists thus mak- ing these mists more expensive to communicate. So the true expected description length for a noisy weight is determined by taking an expectation, under 5.1 The expected description length of the the distribution Q : weights 3 If the sender used random initial weights these can be We assume that the sender and the receiver have an communicated at a net cost of 0 bits using the method that agreed Gaussian prior distribution, P , for a given is being explained. Z Q(w) hEj i = h(dj , yj )2 i = (dj , yj )2 + Vyj (12) G(P; Q) = hC (w) , R(w)i = Q(w) log dw (8) P (w) So, for each input vector, we can use the table and the For Gaussians with dierent means and variances, the equations above to compute the exact value of hEj i. We asymmetric divergence is can P also backpropagate the exact derivatives of E = j hEj i provided we rst build another table to allow 1 derivatives to be backpropagated through the hidden G(P; Q) = log p + 2 q2 , p2 + (p , q )2 (9)
units. As before, the table is indexed by xh and Vxh but q 2p for the backward pass each cell of the table contains the four partial derivatives that are needed to to convert the 5.3 The expected description length of the output derivatives of h into its input derivatives using data mists the equations: To compute the data-mist cost given in equation 3 we need the expected value of (dcj , yjc )2 . This squared error @E @E @yh = @ @E @Vyh + @V (13) is caused partly by the systematic errors of the network @xh yh @ xh yh @xh and partly by the noise in the weights. Unfortunately, for general feedforward networks with noisy weights, the @E @E @yh @E @Vyh = @ + @V (14) expected squared errors are not easy to compute. Linear @Vxh yh @V xh yh @Vxh approximations are possible if the level of noise in the weights is suciently small compared with the smooth- ness of the non-linearities, but this defeats one of the 6 Letting the data determine the prior main purposes of the idea which is to allow very noisy So far, we have assumed that the \prior" distribution weights. Fortunately, if there is only one hidden layer that is used for coding the weights is a single Gaussian. and if the output units are linear, it is possible to com- This coding-prior must be known to both the sender pute the expected squared error exactly. and the receiver before the weights are communicated. The weights are assumed to have independent Gaussian If we x its mean and variance in advance we could pick noise, so for any input vector we can compute the mean inappropriate values that make it very expensive to code xh , and variance, Vxh , of the Gaussian-distributed to- the actual weights. We therefore allow the mean and tal input, xh , received by hidden unit h. Using a table, variance of the coding-prior to be determined during we can then compute the mean, yh and variance, Vyh , the optimization process, so the coding-prior depends of the output of the hidden unit, even though this out- on the data. This is a funny kind of prior! We could try put is not Gaussian distributed. A lot of computation to make sense of it in Bayesian terms by assuming that is required to create this two-dimensional table since we start with a hyper-prior that species probability many dierent pairs of xh and Vxh must be used, and distributions for the mean and variance of the coding- for each pair we must use Monte Carlo sampling or nu- prior and then we use the hyper-prior and the data to merical integration to compute yh and Vyh . Once the nd the best coding-prior. This would automatically table is built, however, it is much more ecient than take into account the cost of communicating the coding- using Monte Carlo sampling at runtime. prior to a receiver who only knows the hyper-prior. In practice, we just ignore the cost of communicating the Since the noise in the outputs of the hidden units is two parameters of the coding-prior so we do not need independent, they independently contribute variance to to invent hyper-priors. each linear output unit. The noisy weights, whj , also contribute variance to the output units. Since the out- 6.1 A more
exible prior distribution for the put units are linear, their outputs, yj , are equal to the weights total inputs they receive, xj . On a particular training case, the output, yj , of output unit j is a random vari- If we use a single Gaussian prior for communicating the able with the following mean and variance: noisy weights, we get a relatively simple penalty term, the asymmetric divergence, for the posterior distribu- yj = X yh whj (10) tion of each noisy weight. Unfortunately, this coding h scheme is not
exible enough to capture certain kinds Xh i of common structure in the weights. Suppose, for ex- Vyj = 2whj Vyh + 2yh Vwhj + Vyh Vwhj (11) ample, that we want a few of the weights to have values h near 1 and the rest to have values very close to 0. If the posterior distribution for each weight has low vari- The mean and the variance of the activity of output ance (to avoid the extra squared error caused by noise unit j make independent contributions to the expected in the weights) we inevitiably pay a high code cost for squared error hEj i. If the desired output of j on a par- weights because no single Gaussian prior can provide a ticular training case is dj , hEj i is given by: good model of a spike around 0 and a spike around 1. If we know in advance that dierent subsets of the energy of a system depends on the energies of the various weights are likely to have dierent distributions, we can alternative congurations of the system. Indeed, one use dierent coding-priors for the dierent subsets. As way to derive equation 17 is to dene a coding scheme MacKay (1992) has demonstrated, it makes sense to use in which the code cost resembles a free energy and to dierent coding-priors for the input-to-hidden weights then use a lemma from statistical mechanics. and the hidden-to-output weights since the input and output values may have quite dierent scales. If we do 7 A coding scheme that uses a mixture not know in advance which weights should be similar, we can model the weight distribution by an adaptive of Gaussians mixture of Gaussians as proposed by Nowlan and Hin- ton (1992). During the optimization, the means, vari- Suppose that a sender and a receiver have already ances and mixing proportions in the mixture adapt to agreed on a particular mixture of Gaussians distribu- model the clusters in the weight values. Simultaneously, tion. The sender can now send a sample from the pos- the weights adapt to t the current mixture model so terior Gaussian distribution of a weight using the fol- weights get pulled towards the centers of nearby clus- lowing coding scheme: ters. Suppose, for eaxample, that there are two Gaus- sians in the mixture. If one gaussian has mean 1 and 1. Randomly pick one of the Gaussians in the mixture low variance and the other gaussian has mean 0 and low with probability ri given by variance it is very cheap to encode low-variance weights with values near 1 or 0. i e,Gi ri = P ,G (18) Nowlan and Hinton (1992) implicitly assumed that the j j e j posterior distribution for each weight has a xed and 2. Communicate the choice of Gaussian to the re- negligible variance so they focussed on maximizing the ceiver. If we use the mixing proportions as a prior probability density of the mean of the weight under the for communicating the choice, the expected code coding-prior mixture distribution. We now show how cost is their technique can be extended to take into account the variance of the posterior distributions for the weights, X r log 1 (19) i assuming that the posterior distributions are still con- i i strained to be single Gaussians. As before, we ignore the 3. Communicate the sample value to the receiver us- cost of communicating the mixture distribution that is ing the chosen Gaussian. If we take into account to be used for coding the weights. the random bits that we get back when the receiver The mixture prior has the form: reconstructs the posterior distribution from which the sample was chosen, the expected cost of com- P (w) = X iPi (w) (15) municating the sample is i X riGi (20) where i is the mixing proportion of Gaussian Pi . The i asymmetric divergence between the mixture prior and So the expected cost of communicating both the the single Gaussian posterior, Q, for a noisy weight is choice of Gaussian and the sample value given that Z choice is Q(w) G(P; Q) = Q(w) log P dw (16) 1 = X r (, log e,Gi ) i i Pi(w) X X riGi + ri log i i i The sum inside the log makes this hard to integrate ana- i i i lytically. This is unfortunate since the optimization pro- (21) cess requires that we repeatedly evaluate both G(P; Q) 4. After receiving samples from all the posterior and its derivatives with respect to the parameters of weight distributions and also receiving the errors on P and Q. Fortunately, there is a much more tractable the training cases with these sampled weights, the expression which is an upper bound on G and can there- receiver can run the learning algorithm and recon- fore be used in its place. This expression is in terms of struct the posterior distributions from which the the Gi(Pi ; Q) the asymmetric divergences between the weights are sampled. This allows the receiver to posterior distribution, Q, and each of the Gaussians, Pi , reconstruct all of the Gi and hence to reconstruct in the mixture prior. the random bits used to choose a Gaussian from the mixture. So the number of \bits back" that must G^ (P1 ; P2 : : : ; Q) = , log X i e,Gi (17) be subtracted from the expected cost in equation i 21 is The way in which G^ depends on the Gi in equation H= X r log 1 (22) i ri 17 is precisely analogous to the way in which the free i We now use a lemma from statistical mechanics to get a 9 Preliminary Results simple expression for the expected code cost minus the bits back. We have not yet performed a thorough comparison be- tween this algorithm and alternative methods and it 7.1 A lemma from statistical mechanics may well turn out that further renements are required to make it competitive. We have, however, tried the For a physical system at a temperature of 1, the algorithm on one very high dimensional task with very Helmholtz free energy, F , is dened as the expected scarce training data. The task is to predict the eec- energy minus the entropy tiveness of a class of peptide molecules. Each molecule is described by 128 parameters (the input vector) and X X 1 has an eectiveness that is a single scalar (the ouput F= ri Ei , ri log (23) value). All inputs and outputs were normalized to have i i ri zero mean and unit variance so that the weights could be expected to have similar scales. The training set con- where i is an index over the alternative possible states sisted of 105 cases and the test set was the remaining of the system, Ei is the energy of a state, and ri is the 420 cases. We deliberately chose a very small training probability of a state. F is a function of the probabil- set since these are the circumstances in which it should ity distribution over states and the probability distribu- be most helpful to control the amount of information in tion which minimizes F is the Boltzmann distribution in the weights. We tried a network with 4 hidden units. which probabilities are exponentially related to energies This network contains 521 adaptive weights (including the biases of the output and hidden units) so it overts e,Ei the 105 training cases very badly if we do not limit the ri = P ,E (24) information in the weights. je j We used an adaptive mixture of 5 Gaussians as our At the minimum given by the Boltzmann distribution, coding-prior for the weights. The Gaussians were ini- the free energy is equal to minus the log of the partition tialized with means uniformly spaced between ,0:24 function: and +0:24 and separated by 2 standard deviations from their neighbors. The initial means for the posterior dis- F = , log X e,Ei (25) tributions of each weight were chosen from a Gaussian i with mean 0 and standard deviation 0:15. The stan- dard deviations of the posteriors for the weights were If we equate each Gaussian in the mixture with an al- all initialized at 0:1. ternative,state G of a physical system, we can equate e,Ei We optimize all of the parameters simultaneously us- with i e . So our method of picking a Gaussian from i ing a conjugate gradient method. For the variances we the mixture uses a Boltzmann distribution because it optimize the log variance so that it cannot go negative makes ri proportional to e,Ei . The random bits that and cannot collapse to zero. To ensure that the mix- are successfully communicated when the receiver recon- ing proportions of the Gaussians in the coding-prior lie structs the probabilities ri correspond exactly to the en- between 0 and 1 and add to 1 we optimize the xi where tropy of the Boltzmann distribution, so the total code cost (including the bits back) is equivalent to F and is exi therefore equal to the expression given in equation 17. i = P x (26) je j 8 Implementation If we penalize the weights by the full cost of describing them, the optimization quickly makes all of the weights With an adaptive mixture of Gaussians coding-prior, equal and negative and uses the bias of the output unit the derivatives of the cost function are moderately com- to x the output at the mean of the desired values in plicated so it is easy to make an error in implementing the training set. It seems that it is initially very easy to them. This is worrying because gradient descent algo- reduce the combined cost function by using weights that rithms are quite robust against minor errors. Also, it contain almost no information, and it is very hard to es- is hard to know how large to make the tables that are cape from this poor solution. To avoid this trap, we mul- used for propagating Gaussian distributions through lo- tiply the cost of the weights by a coecent that starts at gistic functions or for backpropagating derivatives. To :05 and gradually increases to 1 according to the sched- demonstrate that the implementation was correct and ule :05; :1; :15; :2; :3; :4; :5; :6; :7; :8; :9; 1:0. At each value to decide the table sizes we used the following semantic of the coecient we do 100 conjugate gradient updates check. We change each parameter by a small step and of the weights and at the nal value of 1:0 we do not ter- check that the cost function changes by the product of minate the optimization until the cost function changes the gradient and step size. Using this method we found by less than 10,6 nats (a nat is log2(e) bits). Figure that a 300 300 table with linear interpolation gives 2 shows all the incoming and outgoing weights of the reasonably accurate derivatives. four hidden units after one run of the optimization. It Figure 3: The nal probability distribution that is used for coding the weights. This distribution is implemented by adapting the means, variances and mixing proportions of ve gaussians.
is clear that the weights form three fairly sharp clus-
ters. Figure 3 shows that the mixture of 5 Gaussians has adapted to implement the appropriate coding-prior for this weight distribution. The performance of the network can be measured by comparing the squared error it achieves on the test data with the error that would be achieved by simply guess- ing the mean of the correct answers for the test data: P Relative Error = Pc (dc , yc )2 (27) c (dc , d)2 We ran the optimization ve times using dierent ran- domly chosen values for the initial means of the noisy weights. For the network that achieved the lowest value of the overall cost function, the relative error was 0:286. This compares with a relative error of 0:967 for the same network when we used noise-free weights and did not penalize their information content. The best relative error obtained using simple weight-decay with four non- linear hidden units was :317. This required a carefully chosen penalty coecient for the squared weights that corresponds to j2 =w2 in equation 4. To set this weight- decay coecient appropriately it was necessary to try many dierent values on a portion of the training set and to use the remainder of the training set to decide which coecient gave the best generalization. Once the best coecient had been determined the whole of the training set was used with this coecient. A lower er- ror of 0:291 can be achieved using weight-decay if we Figure 2: The nal weights of the network. Each gradually increase the weight-decay coecient and pick large block represents one hidden unit. The small the value that gives optimal performance on the test black or white rectangles represent negative or data. But this is cheating. Linear regression gave a positive weights with the area of a rectangle rep- huge relative error of 35:6 (gross overtting) but this resenting the magnitude of the weight. The bot- fell to 0:291 when we penalized the sum of the squares tom 12 rows in each block represent the incoming of the regression coecients by an amount that was cho- weights of the hidden unit. The central weight at sen to optimize performance on the test data. This is the top of each block is the weight from the hidden almost identical to the performance with 4 hidden units unit to the linear output unit. The weight at the and optimal weight-decay probably because, with small top-right of a block is the bias of the hidden unit. weights, the hidden units operate in their central linear range, so the whole network is eectively linear. These preliminary results demonstrate that our new hidden units, the integration over the Gaussian distri- method allows us to t quite complicated non-linear bution can be performed exactly and the exact weight models even when the number of training cases is less derivatives can be computed eciently. than the number of dimensions in the input vector. It is not clear how much is lost by ignoring the o- The results also show that the new method is slightly diagonal terms in the covariance matrix. David Mackay better than simple weight-decay on at least one task. (personal communication) has shown that if standard Much more experimental work is required to decide backpropagation is used to nd a single, locally optimal whether the method is competitive with other statis- point in weight space and a Gaussian approximation tical techniques for handling non-linear tasks in which to the posterior weight distribution is then constructed the amount of training data is very small compared with around this point, the covariances between dierent the dimensionality of the input. It is also worth men- weights are signicant. However, this does not mean tioning that the solution with the lowest value of the that the covariances are signicant when the learning total description length was the solution in which all algorithm is explicitly manipulating the Gaussian dis- the weights except the output bias are equal and nega- tribution because in this case the learning will try to tive. This solution has a relative error of approximately force the noise in the weights to be independent. The 1:0 so it is a serious embarrassment for the Minimum pressure for independence comes from the fact that the Description Length Principle or for our method of de- cost function will overestimate the information in the scribing the weights. weights if they have correlated noise. We are currently performing simulations to see if this pressure does in- 10 Discussion deed suppress the covariances. When using the standard backpropagation algorithm, it There is a correct, but intractable, Bayesian method is essential that the output of a hidden unit is a smooth of determining the weights in a feedforward neural net- function of its input. This is why the hidden units use work. We start with a prior distribution over all possible a smooth sigmoid function instead of a linear thresh- points in weight space. We then construct the correct old function. With noisy weights, however, it is possi- posterior distribution at each point in weight space by ble to use a version of the backpropagation algorithm multiplying the prior by the probability of getting the described above in networks that have one layer of lin- outputs in the training set given those weights.4 Fi- ear threshold units. The noise in the weights ensures nally we normalize to get the full posterior distribution. that the probability of a threshold unit being active is Then we use this distribution of weight values to make a smooth function of its inputs. As a result, it is easier predictions for new input vectors. to optimize a whole Gaussian distribution over weight In practice, the closest we can get to the ideal Bayesian vectors than it is to optimize a single weight vector. method is to use a Monte Carlo method to sample from the posterior distribution. This could be done by con- sidering random moves in weight space and accepting a 11 Acknowledgements move with a probability that depends on how well the This research was funded by operating and strategic resulting network ts the desired outputs. Neal (1993) grants from NSERC. Georey Hinton is the Noranda shows how the gradient information provided by back- fellow of the Canadian Institute for Advanced Research. propagation can be used to get a much more ecient We thank David Mackay, Radford Neal, Chris Williams method of obtaining samples from the posterior distri- and Rich Zemel for helpful discussions. bution. The major advantage of Monte Carlo meth- ods is that they do not impose unrealistically simple assumptions about the shape of the posterior distribu- tion in weight space. 12 References If we are willing to make simplifying assumptions about Hinton, G. E. (1987) Learning translation invariant the posterior distribution, time-consuming Monte Carlo recognition in a massively parallel network. In Goos, G. simulations can be avoided. MacKay (1992) nds a sin- and Hartmanis, J., editors, PARLE: Parallel Architec- gle locally optimal point in weight space and constructs tures and Languages Europe, pages 1{13, Lecture Notes a full covariance Gaussian approximation to the pos- in Computer Science, Springer-Verlag, Berlin. terior distribution around that point. The alternative method proposed in this paper is to use a simpler Gaus- Lang, K., Waibel, A. and Hinton, G. E. (1990) A Time- sian approximation (with no o-diagonal terms in the Delay Neural Network Architecture for Isolated Word covariance matrix) but to take this distribution into ac- Recognition. Neural Networks, 3, 23-43. count during the learning. With one layer of non-linear Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. and Jackel, L. D. 4This assumes that the output of the neural net repre- (1989) Back-Propagation Applied to Handwritten Zip- sents the mean of a Gaussian distribution from which the nal output is randomly selected. So the nal output could code Recognition. Neural Computation, 1, 541-551. be exactly correct even though the output of the net is not. Mackay, D. J. C. (1992) A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448-472. Neal, R. M. (1993) Bayesian learning via stochastic dy- namics. In Giles, C. L., Hanson, S. J. and Cowan, J. D. (Eds), Advances in Neural Information Processing Systems 5, Morgan Kaufmann, San Mateo CA. Nowlan. S. J. and Hinton, G. E. (1992) Simplifying neural networks by soft weight sharing. Neural Compu- tation, 4, 173-193. Rissanen, J. (1986) Stochastic Complexity and Model- ing. Annals of Statistics, 14, 1080-1100.