Learning representations by back-propagating errors

David E. Rumelhart*, Geoffrey E. Hinton† & Ronald J. Williams*

* Institute for Cognitive Science, C-015, University of California, San Diego, La Jolla, California 92093, USA
† Department of Computer Science, Carnegie-Mellon University, Pittsburgh, Pennsylvania 15213, USA

We describe a new learning procedure, back-propagation, for networks of neuron-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal 'hidden' units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure¹.

There have been many attempts to design self-organizing neural networks. The aim is to find a powerful synaptic modification rule that will allow an arbitrarily connected neural network to develop an internal structure that is appropriate for a particular task domain. The task is specified by giving the desired state vector of the output units for each state vector of the input units. If the input units are directly connected to the output units it is relatively easy to find learning rules that iteratively adjust the relative strengths of the connections so as to progressively reduce the difference between the actual and desired output vectors¹. Learning becomes more interesting but more difficult when we introduce hidden units whose actual or desired states are not specified by the task. (In perceptrons, there are 'feature analysers' between the input and output that are not true hidden units because their input connections are fixed by hand, so their states are completely determined by the input vector: they do not learn representations.) The learning procedure must decide under what circumstances the hidden units should be active in order to help achieve the desired input-output behaviour. This amounts to deciding what these units should represent. We demonstrate that a general purpose and relatively simple procedure is powerful enough to construct appropriate internal representations.

The simplest form of the learning procedure is for layered networks which have a layer of input units at the bottom; any number of intermediate layers; and a layer of output units at the top. Connections within a layer or from higher to lower layers are forbidden, but connections can skip intermediate layers. An input vector is presented to the network by setting the states of the input units. Then the states of the units in each layer are determined by applying equations (1) and (2) to the connections coming from lower layers. All units within a layer have their states set in parallel, but different layers have their states set sequentially, starting at the bottom and working upwards until the states of the output units are determined.
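To make this layer-by-layer sweep concrete, here is a minimal Python/NumPy sketch (ours, not part of the original letter). The names logistic, forward_pass, weights and biases are our own; connections that skip intermediate layers are omitted for brevity, and the nonlinearity is the logistic function given in equation (2) below.

    import numpy as np

    def logistic(x):
        # Real-valued output of a unit as a function of its total input (equation (2) below).
        return 1.0 / (1.0 + np.exp(-x))

    def forward_pass(input_vector, weights, biases):
        """Set the states of the input units, then set each higher layer in turn.

        weights[l][i, j] is assumed to hold w_ji, the weight from unit i in layer l to
        unit j in layer l + 1; biases[l][j] is the weight on the extra 'always on'
        input of unit j (the bias described below).  Returns the list of layer states;
        the last entry is the output vector."""
        states = [np.asarray(input_vector, dtype=float)]
        for W, b in zip(weights, biases):
            total_input = states[-1] @ W + b        # equation (1) below: weighted sum of lower-layer outputs
            states.append(logistic(total_input))    # equation (2) below: non-linear output
        return states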
The total input, x_j, to unit j is a linear function of the outputs, y_i, of the units that are connected to j and of the weights, w_ji, on these connections:

    x_j = Σ_i y_i w_ji    (1)

Units can be given biases by introducing an extra input to each unit which always has a value of 1. The weight on this extra input is called the bias and is equivalent to a threshold of the opposite sign. It can be treated just like the other weights.

A unit has a real-valued output, y_j, which is a non-linear function of its total input:

    y_j = 1 / (1 + e^(-x_j))    (2)

Fig. 1  A network that has learned to detect mirror symmetry in the input vector. The numbers on the arcs are weights and the numbers inside the nodes are biases. The learning required 1,825 sweeps through the set of 64 possible input vectors, with the weights being adjusted on the basis of the accumulated gradient after each sweep. The values of the parameters in equation (9) were ε = 0.1 and α = 0.9. The initial weights were random and were uniformly distributed between -0.3 and 0.3. The key property of this solution is that for a given hidden unit, weights that are symmetric about the middle of the input vector are equal in magnitude and opposite in sign. So if a symmetrical pattern is presented, both hidden units will receive a net input of 0 from the input units and, because the hidden units have a negative bias, both will be off. In this case the output unit, having a positive bias, will be on. Note that the weights on each side of the midpoint are in the ratio 1:2:4. This ensures that each of the eight patterns that can occur above the midpoint sends a unique activation sum to each hidden unit, so the only pattern below the midpoint that can exactly balance this sum is the symmetrical one. For all non-symmetrical patterns, both hidden units will receive non-zero activations from the input units. The two hidden units have identical patterns of weights but with opposite signs, so for every non-symmetric pattern one hidden unit will come on and suppress the output unit.

It is not necessary to use exactly the functions given in equations (1) and (2). Any input-output function which has a bounded derivative will do. However, the use of a linear function for combining the inputs to a unit before applying the nonlinearity greatly simplifies the learning procedure.

The aim is to find a set of weights that ensure that for each input vector the output vector produced by the network is the same as (or sufficiently close to) the desired output vector. If there is a fixed, finite set of input-output cases, the total error in the performance of the network with a particular set of weights can be computed by comparing the actual and desired output vectors for every case. The total error, E, is defined as

    E = ½ Σ_c Σ_j (y_{j,c} - d_{j,c})²    (3)

where c is an index over cases (input-output pairs), j is an index over output units, y is the actual state of an output unit and d is its desired state. To minimize E by gradient descent it is necessary to compute the partial derivative of E with respect to each weight in the network. This is simply the sum of the partial derivatives for each of the input-output cases. For a given case, the partial derivatives of the error with respect to each weight are computed in two passes. We have already described the forward pass in which the units in each layer have their states determined by the input they receive from units in lower layers using equations (1) and (2). The backward pass, which propagates derivatives from the top layer back to the bottom one, is more complicated.

Fig. 2  Two isomorphic family trees.
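As an illustration (again ours, not the authors' code), the error measure of equation (3) and the 64-case input set for the mirror-symmetry task of Fig. 1 could be written as follows; the six-unit input size is inferred from the 64 possible vectors and the 1:2:4 weight ratios mentioned in the caption, and total_error and symmetry_cases are hypothetical names.

    import numpy as np

    def total_error(output_pairs):
        """Equation (3): E = 1/2 * sum over cases c and output units j of (y_jc - d_jc)^2.
        output_pairs is a list of (actual output vector, desired output vector) pairs,
        one per case; the actual outputs would come from a forward pass such as the
        sketch given earlier."""
        return 0.5 * sum(np.sum((np.asarray(y) - np.asarray(d)) ** 2)
                         for y, d in output_pairs)

    # The 64 possible input vectors for the mirror-symmetry task of Fig. 1: six binary
    # input units and one output unit that should be on only for patterns that are
    # symmetrical about the centre point.
    symmetry_cases = [([float(b) for b in format(n, "06b")],
                       [1.0 if format(n, "06b")[:3] == format(n, "06b")[3:][::-1] else 0.0])
                      for n in range(64)]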
The information can be expressed as a set of triples of the form (person 1)(relationship)(person 2), where the possible relationships are {father, mother, husband, wife, son, daughter, uncle, aunt, brother, sister, nephew, niece}. A layered net can be said to 'know' these triples if it can produce the third term of each triple when given the first two. The first two terms are encoded by activating two of the input units, and the network must then complete the proposition by activating the output unit that represents the third term.

Fig. 3  Activity levels in a five-layer network after it has learned. The bottom layer has 24 input units on the left for representing (person 1) and 12 input units on the right for representing the relationship. The white squares inside these two groups show the activity levels of the units. There is one active unit in the first group representing Colin and one in the second group representing the relationship 'has-aunt'. Each of the two input groups is totally connected to its own group of 6 units in the second layer. These groups learn to encode people and relationships as distributed patterns of activity. The second layer is totally connected to the central layer of 12 units, and these are connected to the penultimate layer of 6 units. The activity in the penultimate layer must activate the correct output units, each of which stands for a particular (person 2). In this case, there are two correct answers (marked by black dots) because Colin has two aunts. Both the input units and the output units are laid out spatially with the English people in one row and the isomorphic Italians immediately below.

The backward pass starts by computing ∂E/∂y for each of the output units. Differentiating equation (3) for a particular case, c, and suppressing the index c gives

    ∂E/∂y_j = y_j - d_j    (4)

We can then apply the chain rule to compute ∂E/∂x_j,

    ∂E/∂x_j = ∂E/∂y_j · dy_j/dx_j

Differentiating equation (2) to get the value of dy_j/dx_j and substituting gives

    ∂E/∂x_j = ∂E/∂y_j · y_j(1 - y_j)    (5)

This means that we know how a change in the total input x to an output unit will affect the error. But this total input is just a linear function of the states of the lower-level units and it is also a linear function of the weights on the connections, so it is easy to compute how the error will be affected by changing these states and weights. For a weight w_ji, from i to j, the derivative is

    ∂E/∂w_ji = ∂E/∂x_j · ∂x_j/∂w_ji = ∂E/∂x_j · y_i    (6)

and for the output of the ith unit the contribution to ∂E/∂y_i resulting from the effect of i on j is simply

    ∂E/∂x_j · ∂x_j/∂y_i = ∂E/∂x_j · w_ji

so, taking into account all the connections emanating from unit i, we have

    ∂E/∂y_i = Σ_j ∂E/∂x_j · w_ji    (7)

We have now seen how to compute ∂E/∂y for any unit in the penultimate layer when given ∂E/∂y for all units in the last layer. We can therefore repeat this procedure to compute this term for successively earlier layers, computing ∂E/∂w for the weights as we go.
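A sketch of this backward pass, equations (4)-(7), for a single case is given below under the same assumptions as the earlier forward-pass sketch (states is the list of layer outputs saved during the forward pass, and weights[l][i, j] holds w_ji); this is our illustration rather than the authors' implementation.

    import numpy as np

    def backward_pass(states, desired, weights):
        """Propagate derivatives from the top layer back to the bottom one for one case.

        Returns dE/dW and dE/db (bias gradients) for every layer of connections."""
        grads_W = [None] * len(weights)
        grads_b = [None] * len(weights)
        dE_dy = states[-1] - np.asarray(desired, dtype=float)   # equation (4), output units only
        for l in reversed(range(len(weights))):
            y = states[l + 1]
            dE_dx = dE_dy * y * (1.0 - y)                       # equation (5)
            grads_W[l] = np.outer(states[l], dE_dx)             # equation (6): dE/dw_ji = dE/dx_j * y_i
            grads_b[l] = dE_dx                                  # gradient for the 'always on' bias input
            dE_dy = weights[l] @ dE_dx                          # equation (7): sum over connections leaving unit i
        return grads_W, grads_b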
Fig. 4  The weights from the 24 input units that represent people to the 6 units in the second layer that learn distributed representations of people. White rectangles, excitatory weights; black rectangles, inhibitory weights; the area of the rectangle encodes the magnitude of the weight. The weights from the 12 English people are in the top row of each unit. Unit 1 is primarily concerned with the distinction between English and Italian and most of the other units ignore this distinction. This means that the representation of an English person is very similar to the representation of their Italian equivalent. The network is making use of the isomorphism between the two family trees to allow it to share structure and it will therefore tend to generalize sensibly from one tree to the other. Unit 2 encodes which generation a person belongs to, and unit 6 encodes which branch of the family they come from. The features captured by the hidden units are not at all explicit in the input and output encodings, since these use a separate unit for each person. Because the hidden features capture the underlying structure of the task domain, the network generalizes correctly to the four triples on which it was not trained. We trained the network for 1,500 sweeps, using ε = 0.005 and α = 0.5 for the first 20 sweeps and ε = 0.01 and α = 0.9 for the remaining sweeps. To make it easier to interpret the weights we introduced 'weight-decay' by decrementing every weight by 0.2% after each weight change. After prolonged learning, the decay was balanced by ∂E/∂w, so the final magnitude of each weight indicates its usefulness in reducing the error. To prevent the network needing large weights to drive the outputs to 1 or 0, the error was considered to be zero if output units that should be on had activities above 0.8 and output units that should be off had activities below 0.2.

Fig. 5  A synchronous iterative net that is run for three iterations and the equivalent layered net. Each time-step in the recurrent net corresponds to a layer in the layered net. The learning procedure for layered nets can be mapped into a learning procedure for iterative nets. Two complications arise in performing this mapping: first, in a layered net the output levels of the units in the intermediate layers during the forward pass are required for performing the backward pass (see equations (5) and (6)), so in an iterative net it is necessary to store the history of the output states of each unit. Second, for a layered net to be equivalent to an iterative net, corresponding weights between different layers must have the same value. To preserve this property, we average ∂E/∂w for all the weights in each set of corresponding weights and then change each weight in the set by an amount proportional to this average gradient. With these two provisos, the learning procedure can be applied directly to iterative nets. These nets can then either learn to perform iterative searches or learn sequential structures.

One way of using ∂E/∂w is to change the weights after every input-output case. This has the advantage that no separate memory is required for the derivatives. An alternative scheme, which we used in the research reported here, is to accumulate ∂E/∂w over all the input-output cases before changing the weights.
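The accumulation scheme can be sketched as follows, reusing the hypothetical forward_pass and backward_pass functions from the earlier sketches; the accumulated gradients are then used in the weight updates of equations (8) and (9) below.

    import numpy as np

    def accumulated_gradients(cases, weights, biases):
        """Sum dE/dW and dE/db over every input-output case in `cases`
        (pairs of input vector and desired output vector) before any weight is changed."""
        acc_W = [np.zeros_like(W) for W in weights]
        acc_b = [np.zeros(W.shape[1]) for W in weights]
        for input_vector, desired in cases:
            states = forward_pass(input_vector, weights, biases)        # earlier sketch
            grads_W, grads_b = backward_pass(states, desired, weights)  # earlier sketch
            acc_W = [a + g for a, g in zip(acc_W, grads_W)]
            acc_b = [a + g for a, g in zip(acc_b, grads_b)]
        return acc_W, acc_b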
The simplest version of gradient descent is to change each weight by an amount proportional to the accumulated ∂E/∂w:

    Δw = -ε ∂E/∂w    (8)

This method does not converge as rapidly as methods which make use of the second derivatives, but it is much simpler and can easily be implemented by local computations in parallel hardware. It can be significantly improved, without sacrificing the simplicity and locality, by using an acceleration method in which the current gradient is used to modify the velocity of the point in weight space instead of its position

    Δw(t) = -ε ∂E/∂w(t) + α Δw(t-1)    (9)

where t is incremented by 1 for each sweep through the whole set of input-output cases, and α is an exponential decay factor between 0 and 1 that determines the relative contribution of the current gradient and earlier gradients to the weight change.

To break symmetry we start with small random weights. Variants on the learning procedure have been discovered independently by David Parker (personal communication) and by Yann Le Cun³.

One simple task that cannot be done by just connecting the input units to the output units is the detection of symmetry. To detect whether the binary activity levels of a one-dimensional array of input units are symmetrical about the centre point, it is essential to use an intermediate layer because the activity in an individual input unit, considered alone, provides no evidence about the symmetry or non-symmetry of the whole input vector, so simply adding up the evidence from the individual input units is insufficient. (A more formal proof that intermediate units are required is given in ref. 2.) The learning procedure discovered an elegant solution using just two intermediate units, as shown in Fig. 1.

Another interesting task is to store the information in the two family trees (Fig. 2). Figure 3 shows the network we used, and Fig. 4 shows the 'receptive fields' of some of the hidden units after the network was trained on 100 of the 104 possible triples.

So far, we have only dealt with layered, feed-forward networks. The equivalence between layered networks and recurrent networks that are run iteratively is shown in Fig. 5.

The most obvious drawback of the learning procedure is that the error-surface may contain local minima, so that gradient descent is not guaranteed to find a global minimum. However, experience with many tasks shows that the network very rarely gets stuck in poor local minima that are significantly worse than the global minimum. We have only encountered this undesirable behaviour in networks that have just enough connections to perform the task. Adding a few more connections creates extra dimensions in weight-space and these dimensions provide paths around the barriers that create poor local minima in the lower-dimensional subspaces.

The learning procedure, in its current form, is not a plausible model of learning in brains. However, applying the procedure to various tasks shows that interesting internal representations can be constructed by gradient descent in weight-space, and this suggests that it is worth looking for more biologically plausible ways of doing gradient descent in neural networks.

We thank the System Development Foundation and the Office of Naval Research for financial support.

1. Rosenblatt, F. Principles of Neurodynamics (Spartan, Washington, DC, 1961).
2. Minsky, M. L. & Papert, S. Perceptrons (MIT Press, Cambridge, 1969).
3. Le Cun, Y. Proc. Cognitiva 85, 599-604 (1985).
4. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations (eds Rumelhart, D. E. & McClelland, J. L.) 318-362 (MIT Press, Cambridge, 1986).
