Keeping Neural Networks Simple by Minimizing The Description Length of The Weights

Keeping Neural Networks Simple by Minimizing
the Description Length of the Weights

Georey E. Hinton and Drew van Camp
Department of Computer Science
University of Toronto
10 King's College Road
Toronto M5S 1A4, Canada
Abstract Limit the number of connections in the network

(and hope that each weight does not have too much
Supervised neural networks generalize well if information in it).
there is much less information in the weights Divide the connections into subsets, and force the
than there is in the output vectors of the train- weights within a subset to be identical. If this
ing cases. So during learning, it is impor- \weight-sharing" is based on an analysis of the nat-
tant to keep the weights simple by penaliz- ural symmetries of the task it can be very eective
ing the amount of information they contain. (Lang, Waibel and Hinton (1990); LeCun 1989).
The amount of information in a weight can Quantize all the weights in the network so that a
be controlled by adding Gaussian noise and probability mass, p, can be assigned to each quan-
the noise level can be adapted during learning tized value. The number of bits in a weight is then
to optimize the trade-o between the expected , log p, provided we ignore the cost of dening the
squared error of the network and the amount quantization. Unfortunately this method leads to
of information in the weights. We describe a dicult search space because the cost of a weight
a method of computing the derivatives of the does not have a smooth derivative.
expected squared error and of the amount of
information in the noisy weights in a net-
work that contains a layer of non-linear hidden
units. Provided the output units are linear, the
2 Applying the Minimum Description
exact derivatives can be computed eciently Length Principle
without time-consuming Monte Carlo simula- When tting models to data, it is always possible to t
tions. The idea of minimizing the amount of
information that is required to communicate the training data better by using a more complex model,
the weights of a neural network leads to a but this may make the model worse at tting new data.
number of interesting schemes for encoding the So we need some way of deciding when extra complex-
weights. ity in the model is not worth the improvement in the
data-t. The Minimum Description Length Principle
(Rissanen, 1986) asserts that the best model of some
data is the one that minimizes the combined cost of
1 Introduction describing the model and describing the mist between
the model and the data. For supervised neural networks
In many practical learning tasks there is little available with a predetermined architecture, the model cost is the
training data so any reasonably complicated model will number of bits it takes to describe the weights, and the
tend to overt the data and give poor generalization to data-mist cost is the number of bits it takes to describe
new data. To avoid overtting we need to ensure that the discrepancy between the correct output and the out-
there is less information in the weights than there is in put of the neural network on each training case. We can
the output vectors of the training cases. Researchers think in terms of a sender who can see both the input
have considered many possible ways of limiting the in- vector and the correct output and a receiver who can
formation in the weights: only see the input vector. The sender rst ts a neural
network, of pre-arranged architecture, to the complete
set of training cases, then sends the weights to the re-
ceiver. For each training case the sender also sends the
discrepancy between the net's output and the correct
output. By adding this discrepancy to the output of
the net, the receiver can generate exactly the correct
output.
t
square deviation of the mists from zero.1 Using this
value of j and summing over all training cases the last
term of equation 2 becomes a constant and the data
mist cost is:
" #
0 v
Cdata-mist = kN +
N
log 1 X(dc , yc )2 (3)
2 N c j j
Figure 1: This shows the probability mass asso-
ciated with a quantized value, v, using a quan- where k is a constant that depends only on t.
tization width t. If t is much narrower than the
Gaussian distribution, the probability mass is well Independently of whether we use the optimal value or
approximated by the product of the height and the a predetermined value for j , it is apparent that the
width, so the log probability is a sum of two terms. description length is minimized by minimizing the usual
The log t term is a constant, and if the distribu- squared error function, so the Gaussian assumptions we
tion is a zero-mean Gaussian, the log of the height have made about coding can be viewed as the MDL
is proportional to v2 . justication of this error function.
3 Coding the data mists 4 A simple method of coding the

weights
To apply the MDL principle we need to decide on a
coding scheme for the data mists and for the weights. We could code the weights in just the same way as we
Clearly, if the data mists are real numbers, an innite code the data mists. We assume that the weights of
amount of information is needed to convey them. So we the trained network are nely quantized and come from
shall assume that they are very nely quantized, using a zero-mean Gaussian distribution. If the standard de-
intervals of xed width t. We shall also assume that viation, w , of this distribution is xed in advance, the
the data mists are encoded separately for each of the description length of the weights is simply proportional
output units. to the sum of their squares. So, assuming we use a Gaus-
sian with standard deviation j for encoding the output
The coding theorem tells us that if a sender and a re- errors, we can minimize the total description length of
ceiver have agreed on a probability distribution that as- the data mists and the weights by minimizing the sum
signs a probability mass, p(y), to each possible quan- of two terms:
tized data mist, y, then we can code the mist using
, log2 p(y) bits. If we want to minimize the expected
number of bits, the best probability distribution to use C=
X 1 (
X
dcj , yjc )2 +
1 X w2 (4)
is the correct one, but any other agreed distribution can 2w2 ij ij
also be used. For convenience, we shall assume that for j 2 j c
2
output unit j the data mists are encoded by assuming
that they are drawn from a zero-mean Gaussian distri- where c is an index over training cases.
bution with standard deviation, j . Provided that j is This is just the standard \weight-decay" method. The
large compared with the quantization width t, the as- fact that weight-decay improves generalization (Hinton,
sumed probability of a particular data mist between 1987) can therefore be viewed as a vindication of this
the desired output, dcj on training case c and the actual crude MDL approach in which the standard deviations
output yjc is then well approximated by the probability of the gaussians used for coding the data mists and the
mass shown in gure 1. weights are both xed in advance.2
" # An elaboration of standard weight-decay is to assume
1 ,(dcj , yjc )2 that the distribution of weights in the trained network
p(dcj , yjc ) = t p exp (1)
2 j 2j2 can be modelled more accurately by using a mixture
of several Gaussians whose means, variances and mix-
Using an optimal code, the description length of a data ing proportions are adapted as the network is trained
mist, dcj , yjc , in units of log2(e) bits (called \nats") is: 1
If the optimal value of j is to be used, it must be com-
municated before the data mists are sent, so it too must be
p (dc , yc )2 coded. However, since it is only one number we are probably
, log p(dcj , yjc ) = , log t +log 2 +log j + j 22 j safe in ignoring this aspect of the total description length.
2
It is clear from equation 4 that it is only the ratio of
j
(2) the variances of the two Gaussians that matters. Rather
than guessing this ratio, it is usually better to estimate it by
To minimize this description length summed over all N seeing which ratio gives optimal performance on a validation
training cases, the optimal value of j is the root mean set.
(Nowlan and Hinton, 1992). For some tasks this more weight. After learning, the sender has a Gaussian pos-
elaborate way of coding the weights gives considerably terior distribution, Q, for the weight. We describe a
better generalization. This is especially true when only method of communicating both the weights and the
a small number of dierent weight values are required. data mists and show that using this method the num-
However, this more elaborate scheme still suers from ber of bits required to communicate the posterior distri-
a serious weakness: It assumes that all the weights are bution of a weight is equal to the asymmetric divergence
quantized to the same tolerance, and that this tolerance (the Kullback-Liebler distance) from P to Q.
is small compared with the standard deviations of the Z
Gaussians used for modelling the weight distribution. Q(w)
Thus it takes into account the probability density of a G(P; Q) = Q(w) log dw (5)
P (w)
weight (the height in gure 1) but it ignores the pre-
cision (the width). This is a terrible waste of bits. A 5.2 The \bits back" argument
network is clearly much more economical to describe if
some of the weight values can be described very impre- To communicate a set of noisy weights, the sender rst
cisely without signicantly aecting the predictions of collapses the posterior probability distribution for each
the network. weight by using a source of random bits to pick a precise
MacKay (1992) has considered the eects of small value for the weight (to within some very ne tolerance
changes in the weights on the outputs of the network t). The probability of picking each possible value is de-
after the network has been trained. The next section termined by the posterior probability distribution for
describes a method of taking the precision of the weights the weight. The sender then communicates these pre-
into account during training so that the precision of a cise weights by coding them using some prior Gaussian
weight can be traded against both its probability den- distribution, P , so that the communication cost of a
sity and the excess data mist caused by imprecision in precise weight, w, is:
the weight.
C (w) = , log t , log P (w) (6)
5 Noisy weights
t must be small compared with the variance of P so
A standard way of limiting the amount of information in C (w) is big. However, as we shall see, we are due for a
a number is to add zero-mean Gaussian noise. At rst big refund at the end.
sight, a noisy weight seems to be even more expensive Having sent the precise weights, the sender then com-
to communicate than a precise one since it appears that municates the data-mists achieved using those weights.
we need to send a variance as well as a mean, and that Having received the weights and the mists, the receiver
we need to decide on a precision for both of these. As we can then produce the correct outputs. But he can also
shall see, however, the MDL framework can be adapted do something else. Once he has the correct outputs he
to allow very noisy weights to be communicated very can run whatever learning algorithm was used by the
cheaply. sender and recover the exact same posterior probability
When using backpropagation to train a feedforward distribution, Q, that the sender collapsed in order to get
neural network, it is standard practice to start at some the precise weights.3 Now, since the receiver knows the
particular point in weight space and to move this point sender's posterior distribution for each weight and he
in the direction that reduces the error function. An al- knows the precise value that was communicated, he can
ternative approach is to start with a multivariate Gaus- recover all the random bits that the sender used to col-
sian distribution over weight vectors and to change both lapse that distribution to that value. So these random
the mean and the variance of this cloud of weight vec- bits have been successfully communicated and we must
tors so as to reduce some cost function. We shall restrict subtract them from the overall communication cost to
ourselves to distributions in which the weights are inde- get the true cost of communicating the model and the
pendent, so the distribution can be represented by one mists. The number of random bits required to collapse
mean and one variance per weight. the posterior distribution for a weight, Q, to a particular
The cost function is the expected description length of nely quantized value, w, is:
the weights and of the data mists. It turns out that
high-variance weights are cheaper to communicate but R(w) = , log t , log Q(w) (7)
they cause extra variance in the data mists thus mak-
ing these mists more expensive to communicate. So the true expected description length for a noisy
weight is determined by taking an expectation, under
5.1 The expected description length of the the distribution Q :
weights 3
If the sender used random initial weights these can be
We assume that the sender and the receiver have an communicated at a net cost of 0 bits using the method that
agreed Gaussian prior distribution, P , for a given is being explained.
Z
Q(w) hEj i = h(dj , yj )2 i = (dj , yj )2 + Vyj (12)
G(P; Q) = hC (w) , R(w)i = Q(w) log dw (8)
P (w)
So, for each input vector, we can use the table and the
For Gaussians with dierent means and variances, the equations above to compute the exact value of hEj i. We
asymmetric divergence is can
P also backpropagate the exact derivatives of E =
j hEj i provided we rst build another table to allow
1 derivatives to be backpropagated through the hidden
G(P; Q) = log p + 2 q2 , p2 + (p , q )2 (9)

units. As before, the table is indexed by xh and Vxh but
q 2p for the backward pass each cell of the table contains the
four partial derivatives that are needed to to convert the
5.3 The expected description length of the output derivatives of h into its input derivatives using
data mists the equations:
To compute the data-mist cost given in equation 3 we
need the expected value of (dcj , yjc )2 . This squared error @E @E @yh
= @ @E @Vyh
+ @V (13)
is caused partly by the systematic errors of the network @xh yh @ xh yh @xh
and partly by the noise in the weights. Unfortunately,
for general feedforward networks with noisy weights, the @E @E @yh @E @Vyh
= @ + @V (14)
expected squared errors are not easy to compute. Linear @Vxh yh @V xh yh @Vxh
approximations are possible if the level of noise in the
weights is suciently small compared with the smooth-
ness of the non-linearities, but this defeats one of the
6 Letting the data determine the prior
main purposes of the idea which is to allow very noisy So far, we have assumed that the \prior" distribution
weights. Fortunately, if there is only one hidden layer that is used for coding the weights is a single Gaussian.
and if the output units are linear, it is possible to com- This coding-prior must be known to both the sender
pute the expected squared error exactly. and the receiver before the weights are communicated.
The weights are assumed to have independent Gaussian If we x its mean and variance in advance we could pick
noise, so for any input vector we can compute the mean inappropriate values that make it very expensive to code
xh , and variance, Vxh , of the Gaussian-distributed to- the actual weights. We therefore allow the mean and
tal input, xh , received by hidden unit h. Using a table, variance of the coding-prior to be determined during
we can then compute the mean, yh and variance, Vyh , the optimization process, so the coding-prior depends
of the output of the hidden unit, even though this out- on the data. This is a funny kind of prior! We could try
put is not Gaussian distributed. A lot of computation to make sense of it in Bayesian terms by assuming that
is required to create this two-dimensional table since we start with a hyper-prior that species probability
many dierent pairs of xh and Vxh must be used, and distributions for the mean and variance of the coding-
for each pair we must use Monte Carlo sampling or nu- prior and then we use the hyper-prior and the data to
merical integration to compute yh and Vyh . Once the nd the best coding-prior. This would automatically
table is built, however, it is much more ecient than take into account the cost of communicating the coding-
using Monte Carlo sampling at runtime. prior to a receiver who only knows the hyper-prior. In
practice, we just ignore the cost of communicating the
Since the noise in the outputs of the hidden units is two parameters of the coding-prior so we do not need
independent, they independently contribute variance to to invent hyper-priors.
each linear output unit. The noisy weights, whj , also
contribute variance to the output units. Since the out- 6.1 A more exible prior distribution for the
put units are linear, their outputs, yj , are equal to the weights
total inputs they receive, xj . On a particular training
case, the output, yj , of output unit j is a random vari- If we use a single Gaussian prior for communicating the
able with the following mean and variance: noisy weights, we get a relatively simple penalty term,
the asymmetric divergence, for the posterior distribu-
yj =
X
yh whj (10) tion of each noisy weight. Unfortunately, this coding
h
scheme is not exible enough to capture certain kinds
Xh i of common structure in the weights. Suppose, for ex-
Vyj = 2whj Vyh + 2yh Vwhj + Vyh Vwhj (11) ample, that we want a few of the weights to have values
h near 1 and the rest to have values very close to 0. If
the posterior distribution for each weight has low vari-
The mean and the variance of the activity of output ance (to avoid the extra squared error caused by noise
unit j make independent contributions to the expected in the weights) we inevitiably pay a high code cost for
squared error hEj i. If the desired output of j on a par- weights because no single Gaussian prior can provide a
ticular training case is dj , hEj i is given by: good model of a spike around 0 and a spike around 1.
If we know in advance that dierent subsets of the energy of a system depends on the energies of the various
weights are likely to have dierent distributions, we can alternative congurations of the system. Indeed, one
use dierent coding-priors for the dierent subsets. As way to derive equation 17 is to dene a coding scheme
MacKay (1992) has demonstrated, it makes sense to use in which the code cost resembles a free energy and to
dierent coding-priors for the input-to-hidden weights then use a lemma from statistical mechanics.
and the hidden-to-output weights since the input and
output values may have quite dierent scales. If we do 7 A coding scheme that uses a mixture
not know in advance which weights should be similar,
we can model the weight distribution by an adaptive of Gaussians
mixture of Gaussians as proposed by Nowlan and Hin-
ton (1992). During the optimization, the means, vari- Suppose that a sender and a receiver have already
ances and mixing proportions in the mixture adapt to agreed on a particular mixture of Gaussians distribu-
model the clusters in the weight values. Simultaneously, tion. The sender can now send a sample from the pos-
the weights adapt to t the current mixture model so terior Gaussian distribution of a weight using the fol-
weights get pulled towards the centers of nearby clus- lowing coding scheme:
ters. Suppose, for eaxample, that there are two Gaus-
sians in the mixture. If one gaussian has mean 1 and 1. Randomly pick one of the Gaussians in the mixture
low variance and the other gaussian has mean 0 and low with probability ri given by
variance it is very cheap to encode low-variance weights
with values near 1 or 0. i e,Gi
ri = P
,G (18)
Nowlan and Hinton (1992) implicitly assumed that the j j e j
posterior distribution for each weight has a xed and 2. Communicate the choice of Gaussian to the re-
negligible variance so they focussed on maximizing the ceiver. If we use the mixing proportions as a prior
probability density of the mean of the weight under the for communicating the choice, the expected code
coding-prior mixture distribution. We now show how cost is
their technique can be extended to take into account the
variance of the posterior distributions for the weights, X
r log
1 (19)
i
assuming that the posterior distributions are still con- i i
strained to be single Gaussians. As before, we ignore the 3. Communicate the sample value to the receiver us-
cost of communicating the mixture distribution that is ing the chosen Gaussian. If we take into account
to be used for coding the weights. the random bits that we get back when the receiver
The mixture prior has the form: reconstructs the posterior distribution from which
the sample was chosen, the expected cost of com-
P (w) =
X
iPi (w) (15) municating the sample is
i X
riGi (20)
where i is the mixing proportion of Gaussian Pi . The i
asymmetric divergence between the mixture prior and So the expected cost of communicating both the
the single Gaussian posterior, Q, for a noisy weight is choice of Gaussian and the sample value given that
Z
choice is
Q(w)
G(P; Q) = Q(w) log P dw (16) 1 = X r (, log e,Gi )
i i Pi(w)
X X
riGi + ri log i i
i
The sum inside the log makes this hard to integrate ana- i i i
lytically. This is unfortunate since the optimization pro- (21)
cess requires that we repeatedly evaluate both G(P; Q) 4. After receiving samples from all the posterior
and its derivatives with respect to the parameters of weight distributions and also receiving the errors on
P and Q. Fortunately, there is a much more tractable the training cases with these sampled weights, the
expression which is an upper bound on G and can there- receiver can run the learning algorithm and recon-
fore be used in its place. This expression is in terms of struct the posterior distributions from which the
the Gi(Pi ; Q) the asymmetric divergences between the weights are sampled. This allows the receiver to
posterior distribution, Q, and each of the Gaussians, Pi , reconstruct all of the Gi and hence to reconstruct
in the mixture prior. the random bits used to choose a Gaussian from the
mixture. So the number of \bits back" that must
G^ (P1 ; P2 : : : ; Q) = , log
X
i e,Gi (17) be subtracted from the expected cost in equation
i
21 is
The way in which G^ depends on the Gi in equation H=
X
r log
1 (22)
i ri
17 is precisely analogous to the way in which the free i
We now use a lemma from statistical mechanics to get a 9 Preliminary Results
simple expression for the expected code cost minus the
bits back. We have not yet performed a thorough comparison be-
tween this algorithm and alternative methods and it
7.1 A lemma from statistical mechanics may well turn out that further renements are required
to make it competitive. We have, however, tried the
For a physical system at a temperature of 1, the algorithm on one very high dimensional task with very
Helmholtz free energy, F , is dened as the expected scarce training data. The task is to predict the eec-
energy minus the entropy tiveness of a class of peptide molecules. Each molecule
is described by 128 parameters (the input vector) and
X X 1 has an eectiveness that is a single scalar (the ouput
F= ri Ei , ri log (23) value). All inputs and outputs were normalized to have
i i ri zero mean and unit variance so that the weights could
be expected to have similar scales. The training set con-
where i is an index over the alternative possible states sisted of 105 cases and the test set was the remaining
of the system, Ei is the energy of a state, and ri is the 420 cases. We deliberately chose a very small training
probability of a state. F is a function of the probabil- set since these are the circumstances in which it should
ity distribution over states and the probability distribu- be most helpful to control the amount of information in
tion which minimizes F is the Boltzmann distribution in the weights. We tried a network with 4 hidden units.
which probabilities are exponentially related to energies This network contains 521 adaptive weights (including
the biases of the output and hidden units) so it overts
e,Ei the 105 training cases very badly if we do not limit the
ri = P
,E (24) information in the weights.
je j
We used an adaptive mixture of 5 Gaussians as our
At the minimum given by the Boltzmann distribution, coding-prior for the weights. The Gaussians were ini-
the free energy is equal to minus the log of the partition tialized with means uniformly spaced between ,0:24
function: and +0:24 and separated by 2 standard deviations from
their neighbors. The initial means for the posterior dis-
F = , log
X
e,Ei (25) tributions of each weight were chosen from a Gaussian
i
with mean 0 and standard deviation 0:15. The stan-
dard deviations of the posteriors for the weights were
If we equate each Gaussian in the mixture with an al- all initialized at 0:1.
ternative,state
G
of a physical system, we can equate e,Ei We optimize all of the parameters simultaneously us-
with i e . So our method of picking a Gaussian from
i
ing a conjugate gradient method. For the variances we
the mixture uses a Boltzmann distribution because it optimize the log variance so that it cannot go negative
makes ri proportional to e,Ei . The random bits that and cannot collapse to zero. To ensure that the mix-
are successfully communicated when the receiver recon- ing proportions of the Gaussians in the coding-prior lie
structs the probabilities ri correspond exactly to the en- between 0 and 1 and add to 1 we optimize the xi where
tropy of the Boltzmann distribution, so the total code
cost (including the bits back) is equivalent to F and is exi
therefore equal to the expression given in equation 17. i = P
x (26)
je j
8 Implementation If we penalize the weights by the full cost of describing
them, the optimization quickly makes all of the weights
With an adaptive mixture of Gaussians coding-prior, equal and negative and uses the bias of the output unit
the derivatives of the cost function are moderately com- to x the output at the mean of the desired values in
plicated so it is easy to make an error in implementing the training set. It seems that it is initially very easy to
them. This is worrying because gradient descent algo- reduce the combined cost function by using weights that
rithms are quite robust against minor errors. Also, it contain almost no information, and it is very hard to es-
is hard to know how large to make the tables that are cape from this poor solution. To avoid this trap, we mul-
used for propagating Gaussian distributions through lo- tiply the cost of the weights by a coecent that starts at
gistic functions or for backpropagating derivatives. To :05 and gradually increases to 1 according to the sched-
demonstrate that the implementation was correct and ule :05; :1; :15; :2; :3; :4; :5; :6; :7; :8; :9; 1:0. At each value
to decide the table sizes we used the following semantic of the coecient we do 100 conjugate gradient updates
check. We change each parameter by a small step and of the weights and at the nal value of 1:0 we do not ter-
check that the cost function changes by the product of minate the optimization until the cost function changes
the gradient and step size. Using this method we found by less than 10,6 nats (a nat is log2(e) bits). Figure
that a 300 300 table with linear interpolation gives 2 shows all the incoming and outgoing weights of the
reasonably accurate derivatives. four hidden units after one run of the optimization. It
Figure 3: The nal probability distribution that
is used for coding the weights. This distribution
is implemented by adapting the means, variances
and mixing proportions of ve gaussians.
is clear that the weights form three fairly sharp clus-

ters. Figure 3 shows that the mixture of 5 Gaussians
has adapted to implement the appropriate coding-prior
for this weight distribution.
The performance of the network can be measured by
comparing the squared error it achieves on the test data
with the error that would be achieved by simply guess-
ing the mean of the correct answers for the test data:
P
Relative Error = Pc
(dc , yc )2 (27)
c (dc , d)2
We ran the optimization ve times using dierent ran-
domly chosen values for the initial means of the noisy
weights. For the network that achieved the lowest value
of the overall cost function, the relative error was 0:286.
This compares with a relative error of 0:967 for the same
network when we used noise-free weights and did not
penalize their information content. The best relative
error obtained using simple weight-decay with four non-
linear hidden units was :317. This required a carefully
chosen penalty coecient for the squared weights that
corresponds to j2 =w2 in equation 4. To set this weight-
decay coecient appropriately it was necessary to try
many dierent values on a portion of the training set
and to use the remainder of the training set to decide
which coecient gave the best generalization. Once the
best coecient had been determined the whole of the
training set was used with this coecient. A lower er-
ror of 0:291 can be achieved using weight-decay if we
Figure 2: The nal weights of the network. Each gradually increase the weight-decay coecient and pick
large block represents one hidden unit. The small the value that gives optimal performance on the test
black or white rectangles represent negative or data. But this is cheating. Linear regression gave a
positive weights with the area of a rectangle rep- huge relative error of 35:6 (gross overtting) but this
resenting the magnitude of the weight. The bot- fell to 0:291 when we penalized the sum of the squares
tom 12 rows in each block represent the incoming of the regression coecients by an amount that was cho-
weights of the hidden unit. The central weight at sen to optimize performance on the test data. This is
the top of each block is the weight from the hidden almost identical to the performance with 4 hidden units
unit to the linear output unit. The weight at the and optimal weight-decay probably because, with small
top-right of a block is the bias of the hidden unit. weights, the hidden units operate in their central linear
range, so the whole network is eectively linear.
These preliminary results demonstrate that our new hidden units, the integration over the Gaussian distri-
method allows us to t quite complicated non-linear bution can be performed exactly and the exact weight
models even when the number of training cases is less derivatives can be computed eciently.
than the number of dimensions in the input vector. It is not clear how much is lost by ignoring the o-
The results also show that the new method is slightly diagonal terms in the covariance matrix. David Mackay
better than simple weight-decay on at least one task. (personal communication) has shown that if standard
Much more experimental work is required to decide backpropagation is used to nd a single, locally optimal
whether the method is competitive with other statis- point in weight space and a Gaussian approximation
tical techniques for handling non-linear tasks in which to the posterior weight distribution is then constructed
the amount of training data is very small compared with around this point, the covariances between dierent
the dimensionality of the input. It is also worth men- weights are signicant. However, this does not mean
tioning that the solution with the lowest value of the that the covariances are signicant when the learning
total description length was the solution in which all algorithm is explicitly manipulating the Gaussian dis-
the weights except the output bias are equal and nega- tribution because in this case the learning will try to
tive. This solution has a relative error of approximately force the noise in the weights to be independent. The
1:0 so it is a serious embarrassment for the Minimum pressure for independence comes from the fact that the
Description Length Principle or for our method of de- cost function will overestimate the information in the
scribing the weights. weights if they have correlated noise. We are currently
performing simulations to see if this pressure does in-
10 Discussion deed suppress the covariances.
When using the standard backpropagation algorithm, it
There is a correct, but intractable, Bayesian method is essential that the output of a hidden unit is a smooth
of determining the weights in a feedforward neural net- function of its input. This is why the hidden units use
work. We start with a prior distribution over all possible a smooth sigmoid function instead of a linear thresh-
points in weight space. We then construct the correct old function. With noisy weights, however, it is possi-
posterior distribution at each point in weight space by ble to use a version of the backpropagation algorithm
multiplying the prior by the probability of getting the described above in networks that have one layer of lin-
outputs in the training set given those weights.4 Fi- ear threshold units. The noise in the weights ensures
nally we normalize to get the full posterior distribution. that the probability of a threshold unit being active is
Then we use this distribution of weight values to make a smooth function of its inputs. As a result, it is easier
predictions for new input vectors. to optimize a whole Gaussian distribution over weight
In practice, the closest we can get to the ideal Bayesian vectors than it is to optimize a single weight vector.
method is to use a Monte Carlo method to sample from
the posterior distribution. This could be done by con-
sidering random moves in weight space and accepting a 11 Acknowledgements
move with a probability that depends on how well the This research was funded by operating and strategic
resulting network ts the desired outputs. Neal (1993) grants from NSERC. Georey Hinton is the Noranda
shows how the gradient information provided by back- fellow of the Canadian Institute for Advanced Research.
propagation can be used to get a much more ecient We thank David Mackay, Radford Neal, Chris Williams
method of obtaining samples from the posterior distri- and Rich Zemel for helpful discussions.
bution. The major advantage of Monte Carlo meth-
ods is that they do not impose unrealistically simple
assumptions about the shape of the posterior distribu-
tion in weight space. 12 References
If we are willing to make simplifying assumptions about Hinton, G. E. (1987) Learning translation invariant
the posterior distribution, time-consuming Monte Carlo recognition in a massively parallel network. In Goos, G.
simulations can be avoided. MacKay (1992) nds a sin- and Hartmanis, J., editors, PARLE: Parallel Architec-
gle locally optimal point in weight space and constructs tures and Languages Europe, pages 1{13, Lecture Notes
a full covariance Gaussian approximation to the pos- in Computer Science, Springer-Verlag, Berlin.
terior distribution around that point. The alternative
method proposed in this paper is to use a simpler Gaus- Lang, K., Waibel, A. and Hinton, G. E. (1990) A Time-
sian approximation (with no o-diagonal terms in the Delay Neural Network Architecture for Isolated Word
covariance matrix) but to take this distribution into ac- Recognition. Neural Networks, 3, 23-43.
count during the learning. With one layer of non-linear Le Cun, Y., Boser, B., Denker, J. S., Henderson,
D., Howard, R. E., Hubbard, W. and Jackel, L. D.
4This assumes that the output of the neural net repre- (1989) Back-Propagation Applied to Handwritten Zip-
sents the mean of a Gaussian distribution from which the
nal output is randomly selected. So the nal output could
code Recognition. Neural Computation, 1, 541-551.
be exactly correct even though the output of the net is not. Mackay, D. J. C. (1992) A practical Bayesian framework
for backpropagation networks. Neural Computation, 4,
448-472.
Neal, R. M. (1993) Bayesian learning via stochastic dy-
namics. In Giles, C. L., Hanson, S. J. and Cowan, J.
D. (Eds), Advances in Neural Information Processing
Systems 5, Morgan Kaufmann, San Mateo CA.
Nowlan. S. J. and Hinton, G. E. (1992) Simplifying
neural networks by soft weight sharing. Neural Compu-
tation, 4, 173-193.
Rissanen, J. (1986) Stochastic Complexity and Model-
ing. Annals of Statistics, 14, 1080-1100.

Keeping Neural Networks Simple by Minimizing The Description Length of The Weights

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Keeping Neural Networks Simple by Minimizing The Description Length of The Weights

Uploaded by

Copyright:

Available Formats

Keeping Neural Networks Simple by Minimizing

the Description Length of the Weights

Abstract Limit the number of connections in the network

3 Coding the data mists 4 A simple method of coding the

is clear that the weights form three fairly sharp clus-

You might also like

Keeping Neural Networks Simple by Minimizing The Description Length of The Weights

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Keeping Neural Networks Simple by Minimizing The Description Length of The Weights

Uploaded by

Copyright:

Available Formats

Keeping Neural Networks Simple by Minimizing

the Description Length of the Weights

Abstract  Limit the number of connections in the network

3 Coding the data mis ts 4 A simple method of coding the

is clear that the weights form three fairly sharp clus-

You might also like

Abstract Limit the number of connections in the network

3 Coding the data mists 4 A simple method of coding the