Learning representations by back-propagating errors

David E. Rumelhart*, Geoffrey E. Hinton† & Ronald J. Williams*

* Institute for Cognitive Science, C-015, University of California, San Diego, La Jolla, California 92093, USA
† Department of Computer Science, Carnegie-Mellon University, Pittsburgh, Pennsylvania 15213, USA

We describe a new learning procedure, back-propagation, for networks of neuron-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal 'hidden' units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure¹.

There have been many attempts to design self-organizing neural networks. The aim is to find a powerful synaptic modification rule that will allow an arbitrarily connected neural network to develop an internal structure that is appropriate for a particular task domain. The task is specified by giving the desired state vector of the output units for each state vector of the input units. If the input units are directly connected to the output units it is relatively easy to find learning rules that iteratively adjust the relative strengths of the connections so as to progressively reduce the difference between the actual and desired output vectors². Learning becomes more interesting but more difficult when we introduce hidden units whose actual or desired states are not specified by the task. (In perceptrons, there are 'feature analysers' between the input and output that are not true hidden units because their input connections are fixed by hand, so their states are completely determined by the input vector: they do not learn representations.) The learning procedure must decide under what circumstances the hidden units should be active in order to help achieve the desired input-output behaviour. This amounts to deciding what these units should represent. We demonstrate that a general purpose and relatively simple procedure is powerful enough to construct appropriate internal representations.

The simplest form of the learning procedure is for layered networks which have a layer of input units at the bottom, any number of intermediate layers, and a layer of output units at the top. Connections within a layer or from higher to lower layers are forbidden, but connections can skip intermediate layers. An input vector is presented to the network by setting the states of the input units. Then the states of the units in each layer are determined by applying equations (1) and (2) to the connections coming from lower layers. All units within a layer have their states set in parallel, but different layers have their states set sequentially, starting at the bottom and working upwards until the states of the output units are determined.

The total input, x_j, to unit j is a linear function of the outputs, y_i, of the units that are connected to j and of the weights, w_ji, on these connections:

$$x_j = \sum_i y_i w_{ji} \tag{1}$$

Units can be given biases by introducing an extra input to each unit which always has a value of 1. The weight on this extra input is called the bias and is equivalent to a threshold of the opposite sign. It can be treated just like the other weights. A unit has a real-valued output, y_j, which is a non-linear function of its total input:

$$y_j = \frac{1}{1 + e^{-x_j}} \tag{2}$$
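As an illustration of equations (1) and (2), the forward pass through one layer can be sketched as follows. This is not code from the paper: the use of NumPy, the array layout and the name `forward_layer` are our own choices for exposition.

```python
import numpy as np

def forward_layer(y_lower, W, b):
    """Forward pass through one layer (equations (1) and (2)).

    y_lower : states of the units in the lower layer, shape (n_lower,)
    W       : weights w_ji from lower unit i to upper unit j, shape (n_upper, n_lower)
    b       : biases, i.e. the weights on the extra always-on input, shape (n_upper,)
    """
    x = W @ y_lower + b            # equation (1): total input is a weighted sum plus bias
    y = 1.0 / (1.0 + np.exp(-x))   # equation (2): logistic non-linearity
    return x, y
```

Applying such a function layer by layer, starting from the input units and working upwards, sets the states of all units as described in the text.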
Fig. 1  A network that has learned to detect mirror symmetry in the input vector. The numbers on the arcs are weights and the numbers inside the nodes are biases. The learning required 1,425 sweeps through the set of 64 possible input vectors, with the weights being adjusted on the basis of the accumulated gradient after each sweep. The values of the parameters in equation (9) were ε = 0.1 and α = 0.9. The initial weights were random and were uniformly distributed between −0.3 and 0.3. The key property of this solution is that for a given hidden unit, weights that are symmetric about the middle of the input vector are equal in magnitude and opposite in sign. So if a symmetrical pattern is presented, both hidden units will receive a net input of 0 from the input units and, because the hidden units have a negative bias, both will be off. In this case the output unit, having a positive bias, will be on. Note that the weights on each side of the midpoint are in the ratio 1:2:4. This ensures that each of the eight patterns that can occur above the midpoint sends a unique activation sum to each hidden unit, so the only pattern below the midpoint that can exactly balance this sum is the symmetrical one. For all non-symmetrical patterns, both hidden units will receive non-zero activations from the input units. The two hidden units have identical patterns of weights but with opposite signs, so for every non-symmetric pattern one hidden unit will come on and suppress the output unit.

It is not necessary to use exactly the functions given in equations (1) and (2). Any input-output function which has a bounded derivative will do. However, the use of a linear function for combining the inputs to a unit before applying the non-linearity greatly simplifies the learning procedure.

The aim is to find a set of weights that ensure that for each input vector the output vector produced by the network is the same as (or sufficiently close to) the desired output vector. If there is a fixed, finite set of input-output cases, the total error in the performance of the network with a particular set of weights can be computed by comparing the actual and desired output vectors for every case. The total error, E, is defined as

$$E = \tfrac{1}{2} \sum_c \sum_j (y_{j,c} - d_{j,c})^2 \tag{3}$$

where c is an index over cases (input-output pairs), j is an index over output units, y is the actual state of an output unit and d is its desired state. To minimize E by gradient descent it is necessary to compute the partial derivative of E with respect to each weight in the network. This is simply the sum of the partial derivatives for each of the input-output cases. For a given case, the partial derivatives of the error with respect to each weight are computed in two passes. We have already described the forward pass in which the units in each layer have their states determined by the input they receive from units in lower layers using equations (1) and (2). The backward pass, which propagates derivatives from the top layer back to the bottom one, is more complicated.

Fig. 2  Two isomorphic family trees. The information can be expressed as a set of triples of the form ⟨person 1⟩⟨relationship⟩⟨person 2⟩, where the possible relationships are {father, mother, husband, wife, son, daughter, uncle, aunt, brother, sister, nephew, niece}. A layered net can be said to 'know' these triples if it can produce the third term of each triple when given the first two. The first two terms are encoded by activating two of the input units, and the network must then complete the proposition by activating the output unit that represents the third term.
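A minimal sketch of the error measure of equation (3), assuming (our convention, not the paper's) that the actual and desired output vectors for all cases are stacked as rows of two arrays:

```python
import numpy as np

def total_error(Y, D):
    """Equation (3): E = 1/2 * sum over cases c and output units j of (y_{j,c} - d_{j,c})^2.

    Y : actual output states, shape (n_cases, n_outputs)
    D : desired output states, shape (n_cases, n_outputs)
    """
    return 0.5 * np.sum((Y - D) ** 2)
```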
Fig. 3  Activity levels in a five-layer network after it has learned. The bottom layer has 24 input units on the left for representing ⟨person 1⟩ and 12 input units on the right for representing the relationship. The white squares inside these two groups show the activity levels of the units. There is one active unit in the first group, representing Colin, and one in the second group, representing the relationship 'has-aunt'. Each of the two input groups is totally connected to its own group of 6 units in the second layer. These groups learn to encode people and relationships as distributed patterns of activity. The second layer is totally connected to the central layer of 12 units, and these are connected to the penultimate layer of 6 units. The activity in the penultimate layer must activate the correct output units, each of which stands for a particular ⟨person 2⟩. In this case, there are two correct answers (marked by black dots) because Colin has two aunts. Both the input units and the output units are laid out spatially with the English people in one row and the isomorphic Italians immediately below.

The backward pass starts by computing ∂E/∂y for each of the output units. Differentiating equation (3) for a particular case, c, and suppressing the index c gives

$$\partial E/\partial y_j = y_j - d_j \tag{4}$$

We can then apply the chain rule to compute ∂E/∂x_j,

$$\partial E/\partial x_j = \partial E/\partial y_j \cdot \mathrm{d}y_j/\mathrm{d}x_j$$

Differentiating equation (2) to get the value of dy_j/dx_j and substituting gives

$$\partial E/\partial x_j = \partial E/\partial y_j \cdot y_j(1 - y_j) \tag{5}$$

This means that we know how a change in the total input x to an output unit will affect the error. But this total input is just a linear function of the states of the lower level units and it is also a linear function of the weights on the connections, so it is easy to compute how the error will be affected by changing these states and weights. For a weight w_ji, from i to j, the derivative is

$$\partial E/\partial w_{ji} = \partial E/\partial x_j \cdot \partial x_j/\partial w_{ji} = \partial E/\partial x_j \cdot y_i \tag{6}$$

and for the output of the i-th unit the contribution to ∂E/∂y_i resulting from the effect of i on j is simply

$$\partial E/\partial x_j \cdot \partial x_j/\partial y_i = \partial E/\partial x_j \cdot w_{ji} \tag{7}$$

so, taking into account all the connections emanating from unit i, we have

$$\partial E/\partial y_i = \sum_j \partial E/\partial x_j \cdot w_{ji}$$

We have now seen how to compute ∂E/∂y for any unit in the penultimate layer when given ∂E/∂y for all units in the last layer. We can therefore repeat this procedure to compute this term for successively earlier layers, computing ∂E/∂w for the weights as we go.
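The backward pass of equations (4)-(7) can be sketched for a single case in a network with one hidden layer. Again this is our own illustrative implementation, not the authors' code; the function name, the data layout and the companion `forward_layer` helper above are assumptions.

```python
import numpy as np

def backward_pass(y_in, y_hid, y_out, W_out, d):
    """Gradients for one input-output case in a net with one hidden layer.

    y_in  : input states            y_hid : hidden states from the forward pass
    y_out : output states           W_out : hidden-to-output weights, shape (n_out, n_hid)
    d     : desired output states
    """
    dE_dy_out = y_out - d                              # equation (4)
    dE_dx_out = dE_dy_out * y_out * (1.0 - y_out)      # equation (5)
    dE_dW_out = np.outer(dE_dx_out, y_hid)             # equation (6): dE/dw_ji = dE/dx_j * y_i
    dE_db_out = dE_dx_out                              # the bias input is always 1

    dE_dy_hid = W_out.T @ dE_dx_out                    # equation (7), summed over the upper units j
    dE_dx_hid = dE_dy_hid * y_hid * (1.0 - y_hid)      # equation (5) again, one layer down
    dE_dW_hid = np.outer(dE_dx_hid, y_in)
    dE_db_hid = dE_dx_hid
    return dE_dW_out, dE_db_out, dE_dW_hid, dE_db_hid
```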
Fig. 4  The weights from the 24 input units that represent people to the 6 units in the second layer that learn distributed representations of people. White rectangles, excitatory weights; black rectangles, inhibitory weights; area of the rectangle encodes the magnitude of the weight. The weights from the 12 English people are in the top row of each unit. Unit 1 is primarily concerned with the distinction between English and Italian and most of the other units ignore this distinction. This means that the representation of an English person is very similar to the representation of their Italian equivalent. The network is making use of the isomorphism between the two family trees to allow it to share structure and it will therefore tend to generalize sensibly from one tree to the other. Unit 2 encodes which generation a person belongs to, and unit 6 encodes which branch of the family they come from. The features captured by the hidden units are not at all explicit in the input and output encodings, since these use a separate unit for each person. Because the hidden features capture the underlying structure of the task domain, the network generalizes correctly to the four triples on which it was not trained. We trained the network for 1,500 sweeps, using ε = 0.005 and α = 0.5 for the first 20 sweeps and ε = 0.01 and α = 0.9 for the remaining sweeps. To make it easier to interpret the weights we introduced 'weight-decay' by decrementing every weight by 0.2% after each weight change. After prolonged learning, the decay was balanced by ∂E/∂w, so the final magnitude of each weight indicates its usefulness in reducing the error. To prevent the network needing large weights to drive the outputs to 1 or 0, the error was considered to be zero if output units that should be on had activities above 0.8 and output units that should be off had activities below 0.2.

Fig. 5  A synchronous iterative net that is run for three iterations and the equivalent layered net. Each time-step in the recurrent net corresponds to a layer in the layered net. The learning procedure for layered nets can be mapped into a learning procedure for iterative nets. Two complications arise in performing this mapping: first, in a layered net the output levels of the units in the intermediate layers during the forward pass are required for performing the backward pass (see equations (5) and (6)), so in an iterative net it is necessary to store the history of output states of each unit. Second, for a layered net to be equivalent to an iterative net, corresponding weights between different layers must have the same value. To preserve this property, we average ∂E/∂w for all the weights in each set of corresponding weights and then change each weight in the set by an amount proportional to this average gradient. With these two provisos, the learning procedure can be applied directly to iterative nets. These nets can then either learn to perform iterative searches or learn sequential structures⁴.

One way of using ∂E/∂w is to change the weights after every input-output case. This has the advantage that no separate memory is required for the derivatives. An alternative scheme, which we used in the research reported here, is to accumulate ∂E/∂w over all the input-output cases before changing the weights. The simplest version of gradient descent is to change each weight by an amount proportional to the accumulated ∂E/∂w:

$$\Delta w = -\varepsilon\, \partial E/\partial w \tag{8}$$

This method does not converge as rapidly as methods which make use of the second derivatives, but it is much simpler and can easily be implemented by local computations in parallel hardware. It can be significantly improved, without sacrificing the simplicity and locality, by using an acceleration method in which the current gradient is used to modify the velocity of the point in weight space instead of its position:

$$\Delta w(t) = -\varepsilon\, \partial E/\partial w(t) + \alpha\, \Delta w(t-1) \tag{9}$$

where t is incremented by 1 for each sweep through the whole set of input-output cases, and α is an exponential decay factor between 0 and 1 that determines the relative contribution of the current gradient and earlier gradients to the weight change.

To break symmetry we start with small random weights. Variants on the learning procedure have been discovered independently by David Parker (personal communication) and by Yann Le Cun³.
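The accumulated-gradient update of equation (9) can be sketched as a training loop built around the hypothetical `forward_layer` and `backward_pass` helpers above. The function name, the data format and the default parameter values are our own (the defaults happen to match the ε and α quoted for Fig. 1, but nothing here is prescribed by the paper).

```python
import numpy as np

def train(cases, W_hid, b_hid, W_out, b_out, eps=0.1, alpha=0.9, sweeps=1000):
    """Accumulate dE/dw over all cases in each sweep, then update with momentum (equation (9))."""
    params = [W_out, b_out, W_hid, b_hid]
    vel = [np.zeros_like(p) for p in params]          # Delta w from the previous sweep
    for t in range(sweeps):
        grads = [np.zeros_like(p) for p in params]
        for y_in, d in cases:                         # one sweep through all input-output cases
            _, y_hid = forward_layer(y_in, W_hid, b_hid)
            _, y_out = forward_layer(y_hid, W_out, b_out)
            for g, dg in zip(grads, backward_pass(y_in, y_hid, y_out, W_out, d)):
                g += dg                               # accumulate dE/dw over the sweep
        for p, v, g in zip(params, vel, grads):
            v *= alpha                                # equation (9): new velocity combines
            v -= eps * g                              # old velocity and current gradient
            p += v                                    # apply Delta w(t) to the weights
    return W_hid, b_hid, W_out, b_out
```

To break symmetry, the weight arrays passed in would be initialized to small random values, for example with np.random.uniform(-0.3, 0.3, shape), in the spirit of the initialization described for Fig. 1.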
One simple task that cannot be done by just connecting the input units to the output units is the detection of symmetry. To detect whether the binary activity levels of a one-dimensional array of input units are symmetrical about the centre point, it is essential to use an intermediate layer because the activity in an individual input unit, considered alone, provides no evidence about the symmetry or non-symmetry of the whole input vector, so simply adding up the evidence from the individual input units is insufficient. (A more formal proof that intermediate units are required is given in ref. 2.) The learning procedure discovered an elegant solution using just two intermediate units, as shown in Fig. 1.

Another interesting task is to store the information in the two family trees (Fig. 2). Figure 3 shows the network we used, and Fig. 4 shows the 'receptive fields' of some of the hidden units after the network was trained on 100 of the 104 possible triples.

So far, we have only dealt with layered, feed-forward networks. The equivalence between layered networks and recurrent networks that are run iteratively is shown in Fig. 5.

The most obvious drawback of the learning procedure is that the error-surface may contain local minima so that gradient descent is not guaranteed to find a global minimum. However, experience with many tasks shows that the network very rarely gets stuck in poor local minima that are significantly worse than the global minimum. We have only encountered this undesirable behaviour in networks that have just enough connections to perform the task. Adding a few more connections creates extra dimensions in weight-space and these dimensions provide paths around the barriers that create poor local minima in the lower dimensional subspaces.

The learning procedure, in its current form, is not a plausible model of learning in brains. However, applying the procedure to various tasks shows that interesting internal representations can be constructed by gradient descent in weight-space, and this suggests that it is worth looking for more biologically plausible ways of doing gradient descent in neural networks.

We thank the System Development Foundation and the Office of Naval Research for financial support.

1. Rosenblatt, F. Principles of Neurodynamics (Spartan, Washington, DC, 1961).
2. Minsky, M. L. & Papert, S. Perceptrons (MIT, Cambridge, 1969).
3. Le Cun, Y. Proc. Cognitiva 85, 599-604 (1985).
4. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations (eds Rumelhart, D. E. & McClelland, J. L.) 318-362 (MIT, Cambridge, 1986).
