Learning Long-Term Dependencies with Gradient Descent is Difficult

Yoshua Bengio, Patrice Simard, and Paolo Frasconi, Student Member, IEEE

Abstract: Recurrent neural networks can be used to map input sequences to output sequences, such as for recognition, production, or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These results expose a trade-off between efficient learning by gradient descent and latching on information for long periods. Based on an understanding of this problem, alternatives to standard gradient descent are considered.

I. INTRODUCTION

WE ARE INTERESTED in training recurrent neural networks to map input sequences to output sequences, for applications in sequence recognition, production, or time-series prediction. All of the above applications require a system that will store and update context information; i.e., information computed from the past inputs and useful to produce desired outputs. Recurrent neural networks are well suited for those tasks because they have an internal state that can represent context information. The cycles in the graph of a recurrent network allow it to keep information about past inputs for an amount of time that is not fixed a priori, but rather depends on its weights and on the input data. In contrast, static networks (i.e., with no recurrent connection), even if they include delays (such as time delay neural networks [5]), have a finite impulse response and cannot store a bit of information for an indefinite time. A recurrent network whose inputs are not fixed but rather constitute an input sequence can be used to transform an input sequence into an output sequence while taking into account contextual information in a flexible way. We restrict our attention here to discrete-time systems.

Learning algorithms used for recurrent networks are usually based on computing the gradient of a cost function with respect to the weights of the network [22], [21]. For example, the back-propagation through time algorithm [22] is a generalization of back-propagation for static networks in which one stores the activations of the units while going forward in time. The backward phase is also backward in time and recursively uses these activations to compute the required gradients. Other algorithms, such as the forward propagation algorithms [14], [23], are much more computationally expensive (for a fully connected recurrent network) but are local in time; i.e., they can be applied in an on-line fashion, producing a partial gradient after each time step. Another algorithm was proposed [10], [18] for training constrained recurrent networks in which dynamic neurons, with a single feedback to themselves, have only incoming connections from the input layer. It is local in time like the forward propagation algorithms and it requires computation only proportional to the number of weights, like the back-propagation through time algorithm. Unfortunately, the networks it can deal with have limited storage capabilities for dealing with general sequences [7], thus limiting their representational power.
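To make the gradient computation referred to above concrete, the following minimal sketch illustrates back-propagation through time for a small fully connected recurrent network: the forward pass stores the activations at every time step, and the backward pass runs backward in time and reuses them. The network form, the cost (squared error at the last step only), and all names here are illustrative assumptions, not the exact systems used in the experiments below.

```python
import numpy as np

def bptt(W, U, inputs, target):
    """Sketch of back-propagation through time for a vanilla recurrent net
    a_t = tanh(W a_{t-1} + U u_t), with cost C = 0.5 * ||a_T - target||^2.
    The forward pass stores all activations; the backward pass runs backward
    in time and reuses them to accumulate the gradients."""
    T = len(inputs)
    n = W.shape[0]
    a = np.zeros((T + 1, n))
    for t in range(1, T + 1):                 # forward in time, storing a_t
        a[t] = np.tanh(W @ a[t - 1] + U @ inputs[t - 1])
    dW, dU = np.zeros_like(W), np.zeros_like(U)
    da = a[T] - target                        # dC/da_T
    for t in range(T, 0, -1):                 # backward in time
        dpre = da * (1.0 - a[t] ** 2)         # through the tanh at step t
        dW += np.outer(dpre, a[t - 1])
        dU += np.outer(dpre, inputs[t - 1])
        da = W.T @ dpre                       # dC/da_{t-1}
    return dW, dU

# Illustrative usage on random weights and inputs.
rng = np.random.default_rng(0)
W, U = 0.5 * rng.normal(size=(3, 3)), rng.normal(size=(3, 2))
dW, dU = bptt(W, U, inputs=rng.normal(size=(10, 2)), target=np.zeros(3))
```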
A task displays long-term dependencies if prediction of the desired output at time t depends on input presented at an earlier time τ ≪ t. Although recurrent networks can in many instances outperform static networks [4], they appear more difficult to train optimally. Earlier experiments indicated that their parameters settle in sub-optimal solutions that take into account short-term dependencies but not long-term dependencies [5]. Similar results were obtained by Mozer [19]. It was found that back-propagation was not sufficiently powerful to discover contingencies spanning long temporal intervals. In this paper, we present experimental and theoretical results in order to further the understanding of this problem.

For comparison and evaluation purposes, we now list three basic requirements for a parametric dynamical system that can learn to store relevant state information. We require the following:

1) That the system be able to store information for an arbitrary duration.
2) That the system be resistant to noise (i.e., fluctuations of the inputs that are random or irrelevant to predicting a correct output).
3) That the system parameters be trainable (in reasonable time).

Throughout this paper, the long-term storage of definite bits of information into the state variables of the dynamic system is referred to as information latching. A formalization of this concept, based on hyperbolic attractors, is given in Section IV.

The paper is divided into five sections. In Section II we present a minimal task that can be solved only if the system satisfies the above conditions. We then present a recurrent network candidate solution and negative experimental results indicating that gradient descent is not appropriate even for such a simple problem. The theoretical results of Section IV show that either such a system is stable and resistant to noise or, alternatively, it is efficiently trainable by gradient descent, but not both. The analysis shows that when trying to satisfy conditions 1) and 2) above, the magnitude of the derivative of the state of a dynamical system at time t with respect to the state at time 0 decreases exponentially as t increases. We show how this makes the back-propagation algorithm (and gradient descent in general) inefficient for learning long-term dependencies in the input/output sequence, hence failing condition 3) for sufficiently long sequences. Finally, in Section V, based on the analysis of the previous sections, new algorithms and approaches are proposed and compared to variants of back-propagation and simulated annealing. These algorithms are evaluated on simple tasks in which the span of the input/output dependencies can be controlled.

II. MINIMAL TASK ILLUSTRATING THE PROBLEM

The following minimal task is designed as a test that must necessarily be passed in order to satisfy the three conditions enumerated above. A parametric system is trained to classify two different sets of sequences of length T. For each sequence u_1, ..., u_T, the class C(u_1, ..., u_T) ∈ {0, 1} depends only on the first L values of the external input:

C(u_1, ..., u_T) = C(u_1, ..., u_L).

We suppose L fixed and allow sequences of arbitrary length T ≫ L. The system should provide an answer at the end of each sequence. Thus, the problem can be solved only if the system is able to store information about the first L inputs for an arbitrary duration (condition 1), and to do so despite the irrelevant inputs that follow them (condition 2). The third required condition is learnability.
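To make the structure of this task concrete, the sketch below generates such sequences. The particular labeling rule used (the sign of the sum of the first L inputs) and the Gaussian inputs are hypothetical choices for illustration only; the task as defined above requires merely that the class depend on u_1, ..., u_L alone.

```python
import numpy as np

def make_minimal_task(n_seq, T, L, rng=np.random.default_rng(0)):
    """Generate sequences u_1..u_T whose class depends only on u_1..u_L.
    The labeling rule here (sign of the sum of the first L inputs) is a
    hypothetical choice for illustration."""
    U = rng.normal(size=(n_seq, T))
    labels = (U[:, :L].sum(axis=1) > 0).astype(int)   # C(u_1..u_T) = C(u_1..u_L)
    return U, labels

# The inputs for t > L carry no information about the class: they act as noise
# that the system must ignore while retaining the information latched earlier.
U, y = make_minimal_task(n_seq=100, T=50, L=3)
```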
There are two different computational aspects involved in this task. First, it is necessary to process u_1, ..., u_L in order to extract some information about the class; i.e., to perform classification. Second, it is necessary to store such information into a subset of the state variables (let us call them latching state variables) of the dynamic system, for an arbitrary duration. For this task, the computation of the class does not require accessing the latching state variables. Hence the latching state variables do not need to affect the evolution of the other state variables. Therefore, a simple solution to this task may use a latching subsystem, fed by a subsystem that computes information about the class.

We are interested in assessing learning capabilities on this latching problem independently of a particular set of training sequences; i.e., in a way that is independent of the specific problem of classifying u_1, ..., u_L. Therefore we will focus here only on the latching subsystem. In order to train any module feeding the latching subsystem, the learning algorithm should be able to transmit error information (such as gradients) to such a module. An important question is thus whether the learning algorithm can propagate error information to a module that feeds the latching subsystem and detects the events leading to latching. Hence, instead of feeding a recurrent network with the input sequences defined as above, we use only the latching subsystem as a test system and we reformulate our minimal task as follows.

The test system has one input h_t and one output x_t (at each discrete time step t). The initial inputs h_t, for t ≤ L, are values that can be tuned by the learning algorithm (e.g., by gradient descent), whereas h_t is Gaussian noise for L < t ≤ T. The connection weights of the test system are also trainable parameters. Optimization is based on the cost function

C = (1/2) Σ_p (x_T^p − d^p)²

where p is an index over the training sequences and d^p is a target of +0.8 for sequences of class 1 and −0.8 for sequences of class 0.

In this formulation, the h_t (t = 1, ..., L) represent the result of the computation that extracts the class information. Learning h_t directly is an easier task than computing it as a parametric function f(u, θ) of the original input sequence and learning the parameters θ. In fact, the error derivatives ∂C/∂h_t (as used by back-propagation through time) are the same as if h_t were obtained as a parametric function of the inputs. Thus, if the h_t cannot be directly trained as parameters in the test system (because of vanishing gradients), they clearly cannot be trained as a parametric function of the input sequence in a system that uses a trainable module to feed a latching subsystem. The ability to learn the free input values h_1, ..., h_L is a measure of the effectiveness of the gradient of error information that would be propagated further back if the test system were connected to the output of another module.

III. SIMPLE RECURRENT NETWORK CANDIDATE SOLUTION

We performed experiments on this minimal task with a single recurrent neuron, as shown in Fig. 1(a). Two types of trajectories are considered for this test system, for the two classes (k = 0, k = 1):

x_t^k = f(a_t^k) = tanh(a_t^k)
a_t^k = w x_{t-1}^k + h_t^k,   t = 1, ..., T,   a_0^k = 0.   (1)
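The following is a minimal sketch, for a single training sequence, of how back-propagation through time yields ∂C/∂w and ∂C/∂h_t for the test system of (1); the function name and the particular sequence used in the usage lines are illustrative, not the exact experimental setup.

```python
import numpy as np

def bptt_test_system(w, h, d):
    """Back-propagation through time for the single recurrent neuron of (1):
    x_t = tanh(w * x_{t-1} + h_t), with cost C = 0.5 * (x_T - d)^2 for one
    sequence. Returns dC/dw and dC/dh_t for every t (a sketch)."""
    T = len(h)
    x = np.zeros(T + 1)                      # x[0] = 0 is the initial state
    for t in range(1, T + 1):                # forward pass, storing activations
        x[t] = np.tanh(w * x[t - 1] + h[t - 1])
    dC_dw, dC_dh = 0.0, np.zeros(T)
    dC_dx = x[T] - d                         # dC/dx_T
    for t in range(T, 0, -1):                # backward pass, backward in time
        dC_da = dC_dx * (1.0 - x[t] ** 2)    # through tanh: f'(a_t) = 1 - x_t^2
        dC_dw += dC_da * x[t - 1]
        dC_dh[t - 1] = dC_da
        dC_dx = dC_da * w                    # dC/dx_{t-1}
    return dC_dw, dC_dh

# Illustrative class-1 sequence with L = 3 tunable inputs followed by noise.
h = np.concatenate([np.full(3, 1.0), 0.2 * np.random.default_rng(0).normal(size=17)])
dC_dw, dC_dh = bptt_test_system(w=1.25, h=h, d=0.8)
```

For the minimal task, only h_1, ..., h_L and the recurrent weight w would be updated with these derivatives; the experiments below show that ∂C/∂h_t for t ≤ L becomes too small to be useful as T grows.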
If w > 1/f'(0) = 1, then the autonomous dynamics of this neuron has two attractors x̄ > 0 and −x̄ that depend on the value of the weight w [7], [8] (they can be easily obtained as the non-zero intersections of the curve x = tanh(a) with the line x = a/w). Assuming that the initial state at t = 0 is x_0 = −x̄, it can be shown [8] that there exists a value h* > 0 of the input such that 1) x_t maintains its sign if |h_t| < h* for all t, and 2) there exists a finite number of steps L_1 such that x_{L_1} > x̄ if h_t > h* for all t ≤ L_1. A symmetric case occurs for x_0 = x̄. h* increases with w, and for a fixed w the transient length L_1 decreases with |h_t|. Thus the recurrent neuron of Fig. 1(a) can robustly latch one bit of information, represented by the sign of its activation.

Fig. 1. (a) Latching recurrent neuron. (b) Sample input sequences.
Fig. 2. (a) Density of training convergence for the minimal problem as a function of the noise variance s and the initial self-loop weight w_0 (white = high density), with L = 3 and T = 20. (b) Frequency of training convergence with respect to the sequence length T (with noise variance s = 0.2 and initial weight w_0 = 1.25).

Storing is accomplished by keeping a large input (i.e., larger than h* in absolute value) for a long enough time. Small noisy inputs (i.e., smaller than h* in absolute value) cannot change the sign of the activation of the neuron, even if applied for an arbitrarily long time. This robustness essentially depends on the nonlinearity.

The recurrent weight w is also trainable. The solution for T ≫ L requires w > 1 to produce the two stable attractors x̄ and −x̄. Larger w corresponds to a larger critical value h* and, consequently, more robustness against noise. The trainable input values must bring the state of the neuron towards x̄ or −x̄ in order to robustly latch a bit of information against the input noise. For example, this can be accomplished by adapting, for t = 1, ..., L, h_t^1 > H and h_t^0 < −H, where H > h* controls the transient duration towards one of the two attractors.

In Fig. 1(b) we show two sample sequences that feed the recurrent neuron. As stated in Section II, the h_t are trainable for t ≤ L and are samples from a Gaussian distribution with mean 0 and variance s for t > L. The values of h_t for t ≤ L were initialized to small uniform random values before starting training.

A set of simulations was carried out to evaluate the effectiveness of back-propagation (through time) on this simple task. In a first experiment we investigated the effect of the noise variance s and of different initial values w_0 for the self-loop weight (see also [3]). A density plot of convergence is shown in Fig. 2(a), averaged over 18 runs for each of the selected pairs (s, w_0). It can be seen that convergence becomes very unlikely for large noise variance or small initial values of w. L = 3 and T = 20 were chosen in these experiments. In Fig. 2(b), we show instead the effect of varying T, keeping fixed s = 0.2 and w_0 = 1.25. In this case the task consists in learning only the input parameters h_t. As explained in Section II, if the learning algorithm is unable to properly tune the inputs h_t, then it will not be able to learn what should trigger latching in a more complicated situation. Solving this task is a minimal requirement for being able to transmit error information backward, towards a module feeding the latch unit. When T becomes large it is extremely difficult to attain convergence. These experimental results show that even in the very simple situation where we want to robustly latch one bit of information about the input, gradient descent on the output error fails for long-term input/output dependencies, for most initial parameter values.
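A minimal numerical sketch of this latching behavior follows: it locates the positive attractor x̄ by iterating the autonomous map, then checks that small inputs preserve the sign of the state while a few large inputs flip it. The weight, noise level, and pulse values are illustrative choices, not the exact settings of the experiments above.

```python
import numpy as np

def positive_attractor(w, iters=200):
    """x̄ for the autonomous neuron x_t = tanh(w * x_{t-1}), i.e. a non-zero
    intersection of x = tanh(a) with x = a/w (requires w > 1)."""
    x = 1.0
    for _ in range(iters):
        x = np.tanh(w * x)
    return x

def run(w, x0, inputs):
    """Final state of x_t = tanh(w * x_{t-1} + h_t) after feeding the inputs."""
    x = x0
    for h in inputs:
        x = np.tanh(w * x + h)
    return x

w = 1.5                                   # illustrative; must be > 1 for two attractors
xbar = positive_attractor(w)              # roughly 0.86 for w = 1.5
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.05, size=200)   # small inputs, typically below h*
print(run(w, -xbar, noise))               # stays near -x̄: the sign is preserved
pulses = np.concatenate([np.full(3, 2.0), noise])   # a few inputs well above h*
print(run(w, -xbar, pulses))              # ends near +x̄: the latched bit was flipped
```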
IV. LEARNING TO LATCH WITH DYNAMICAL SYSTEMS

In this section, we attempt to understand better why learning even simple long-term dependencies with gradient descent appears to be so difficult. We discuss the general case of a real-time recognizer based on a parametric dynamical system. We find that the conditions under which a recurrent network can robustly store information (in a way defined below, i.e., with hyperbolic attractors) yield a problem of vanishing gradients that can make learning very difficult.

Consider a non-autonomous discrete-time system with additive inputs

a_t = M(a_{t-1}) + u_t   (2)

and the corresponding autonomous dynamics

a_t = M(a_{t-1})   (3)

where M is a nonlinear map, and a_t and u_t are n-vectors representing respectively the system state and the external input at time t.

To simplify the analysis presented in this section, we consider only systems with additive inputs. However, a dynamical system with non-additive inputs, e.g., a_t = N(a_{t-1}, u_{t-1}), can be transformed into one with additive inputs by introducing additional state variables and corresponding inputs. Suppose a_t ∈ R^n and u_t ∈ R^m. The new system is defined by the additive-inputs dynamics a'_t = N'(a'_{t-1}) + u'_t, where a'_t = (a_t, y_t) is an (n + m)-vector state, and the first n elements of u'_t = (0, u_t) ∈ R^{n+m} are 0. The new map N' can be defined in terms of the old map N as follows: N'(a'_{t-1}) = (N(a_{t-1}, y_{t-1}), 0), with zeroes for the last m elements of N'(·). Hence we have y_t = u_t. Note that a system with additive inputs with a map of the form of N'(·) can be transformed back into an equivalent system with non-additive inputs. Hence, without loss of generality, we can use the model in (2).

In the next subsection, we show that only two conditions can arise when using hyperbolic attractors to latch bits of information. Either the system is very sensitive to noise, or the derivatives of the cost at time t with respect to the system activations a_0 converge exponentially to 0 as t increases. This situation is the essential reason for the difficulty in using gradient descent to train a dynamical system to capture long-term dependencies in the input/output sequences.

A. Analysis

In order to latch a bit of state information one wants to restrict the values of the system activity a_t to a subset S of its domain. In this way, it will be possible to later interpret a_t in at least two ways: inside S and outside S. To make sure that a_t remains in such a region, the system dynamics can be chosen such that this region is the basin of attraction of an attractor (or of an attractor in a sub-manifold or subspace of a_t's domain). To "erase" that bit of information, the inputs may push the system activity a_t out of this basin of attraction and possibly into another one. In this section, we show that if the attractor is hyperbolic (or can be transformed into one, e.g., a stable periodic attractor), then the derivatives ∂a_t/∂a_0 quickly vanish as t increases. Unfortunately, when these gradients vanish, the short-term dependencies dominate in the weight gradient.

Definition 1: A set of points E is said to be invariant under a map M if E = M(E).

Definition 2: A hyperbolic attractor is a set of points X invariant under the differentiable map M, such that ∀a ∈ X, all eigenvalues of M'(a) are less than 1 in absolute value.

An attractor X may contain a single point (fixed point attractor), a finite number of points (periodic attractor), or an infinite number of points (chaotic attractor). Note that a stable and attracting fixed point is hyperbolic for the map M, whereas a stable and attracting periodic attractor of period l for the map M is hyperbolic for the map M^l. For a recurrent net, the kind of attractor depends on the weight matrix.
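As a small numerical companion to Definition 2, the sketch below checks whether a fixed point of the autonomous network a_t = W tanh(a_{t-1}) is a hyperbolic attractor by testing the eigenvalues of its Jacobian; the particular map and the example weight matrix are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def is_hyperbolic_fixed_point(W, a):
    """Test Definition 2 at a fixed point a of the autonomous net
    a_t = W tanh(a_{t-1}): the point is a hyperbolic (attracting) fixed point
    if every eigenvalue of the Jacobian M'(a) = W diag(1 - tanh(a)^2) has
    absolute value below 1."""
    assert np.allclose(W @ np.tanh(a), a, atol=1e-6), "a is not a fixed point"
    J = W @ np.diag(1.0 - np.tanh(a) ** 2)
    return bool(np.all(np.abs(np.linalg.eigvals(J)) < 1.0))

W = np.array([[0.6, -0.3],
              [0.2,  0.5]])      # eigenvalues of W have modulus 0.6
print(is_hyperbolic_fixed_point(W, np.zeros(2)))   # True: the origin is attracting
```

For this W the spectral radius is below 1, so the origin is the system's single fixed point attractor, in line with the discussion that follows.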
In particular, for a network defined by a_t = W tanh(a_{t-1}) + u_t, if W is symmetric and its minimum eigenvalue is greater than −1, then the attractors are all fixed points [17]. On the other hand, if |W| < 1, or if the system is linear and stable, the system has a single fixed point attractor at the origin.

Definition 3: The basin of attraction of an attractor X is the set β(X) of points a converging to X under the map M; i.e., β(X) = {a : ∀ε, ∃l, ∃x ∈ X such that ||M^l(a) − x|| < ε}.

Definition 4: We call Γ(X), the reduced attracting set of a hyperbolic attractor X, the set of points y in β(X) such that, for all l ≥ 1, all the eigenvalues of (M^l)'(y) are less than 1 in absolute value.

Definition 5: A system is robustly latched at time t_0 to X, one of the hyperbolic attractors of the autonomous dynamics M, if a_{t_0} is in the reduced attracting set Γ(X).

Let us now see why it is more robust to store a bit of information by keeping a_t in Γ(X), the reduced attracting set of X.

Theorem 1: Assume x is a point of R^n such that there exists an open sphere U(x) centered on x for which |M'(z)| > 1 for all z ∈ U(x). Then there exists y ∈ U(x) such that ||M(x) − M(y)|| > ||x − y||.
Proof: See the Appendix.

Fig. 3. Basin of attraction β(X) and reduced attracting set Γ(X) of an attractor X.

This theorem implies that for a hyperbolic attractor X, if a_0 is in β(X) but not in Γ(X), then the size of a ball of uncertainty around a_0 will grow exponentially as t increases, as illustrated in Fig. 3(a). Therefore, small perturbations in the input could push the trajectory towards another (possibly wrong) basin of attraction. This means that the system will not be resistant to input noise. What we call input noise here may simply be components of the inputs that are not relevant to predict the correct future output. In contrast, the following results show that if a_0 is in Γ(X), then a_t is guaranteed to remain within a certain distance of X when the input noise is bounded.

Definition 6: A map M is contracting on a set D if ∃α ∈ [0, 1) such that ||M(x) − M(y)|| ≤ α||x − y|| ∀x, y ∈ D.

Theorem 2: Let M be a differentiable mapping on a convex set D. If ∀x ∈ D, |M'(x)| < 1, then M is contracting on D.
Proof: See [20].

A crucial element in this analysis is to identify the conditions in which one can robustly latch information with an attractor.

Theorem 3: Suppose the system is robustly latched to X, starting in state a_0, and that the inputs u_t are such that for all t > 0, ||u_t|| < b_t, where b_t = (1 − λ_t)d. Let ã_t be the autonomous trajectory obtained by starting at a_0 (i.e., with inputs u_t = 0), and suppose ∀y ∈ D_t, |M'(y)| ≤ λ_t < 1, where D_t is a ball of radius d around ã_t. Then a_t remains inside a ball of radius d around ã_t, and this ball intersects X when t → ∞.
Proof: See the Appendix.

The above results justify the term "robust" in our definition of a robustly latched system: as long as a_t remains in the reduced attracting set Γ(X) of a hyperbolic attractor X, a bound on the inputs can be found that guarantees a_t to remain within a certain distance of some point in X, as illustrated in Fig. 3(b). The smaller |M'(y)| is in the region around ã_t, the looser the bound b_t on the inputs, meaning that the system is more robust to input noise. On the other hand, outside Γ(X) but inside β(X), M is not contracting but expanding; i.e., the size of a ball of uncertainty grows exponentially with time.

We now show the consequence of robust latching: vanishing gradients.

Theorem 4: If the input u_t is such that the system remains robustly latched on attractor X after time 0, then ∂a_t/∂a_0 → 0 as t → ∞.
Proof: See the Appendix.

The results in this section thus show that when storing one or more bits of information in a way that is resistant to noise, the gradient with respect to past events rapidly becomes very small in comparison to the gradient with respect to recent events. In the next section we discuss how this makes gradient descent on parameter space (e.g., the weights of a network) inefficient.
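The following sketch illustrates Theorem 4 numerically for the map M(a) = W tanh(a): it accumulates the product of Jacobians that equals ∂a_t/∂a_0 along a noisy trajectory and reports its norm, which shrinks geometrically as long as the state stays where the map is contracting. The map, the weight matrix, and the noise level are illustrative assumptions.

```python
import numpy as np

def grad_norms_wrt_a0(W, a0, inputs):
    """Norms of da_t/da_0 = M'(a_{t-1}) ... M'(a_0) for a_t = W tanh(a_{t-1}) + u_t,
    where M'(a) = W diag(1 - tanh(a)^2). Illustrates Theorem 4: the norm decays
    while the trajectory stays in a region where the map is contracting."""
    a, P, norms = a0.astype(float), np.eye(len(a0)), []
    for u in inputs:
        J = W @ np.diag(1.0 - np.tanh(a) ** 2)   # M'(a_{t-1})
        P = J @ P                                 # chain rule across one more step
        a = W @ np.tanh(a) + u
        norms.append(np.linalg.norm(P, 2))
    return norms

W = np.array([[0.6, -0.3],
              [0.2,  0.5]])                       # spectral norm < 1: contracting everywhere
rng = np.random.default_rng(0)
norms = grad_norms_wrt_a0(W, np.zeros(2), 0.1 * rng.normal(size=(50, 2)))
print(norms[0], norms[24], norms[49])             # drops by many orders of magnitude
```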
B. Effect on the Weight Gradient

Let us consider the effect of vanishing gradients on the derivatives of a cost C_t at time t with respect to the parameters of a dynamical system, say a recurrent neural network with weights W:

∂C_t/∂W = Σ_{τ ≤ t} (∂C_t/∂a_τ)(∂a_τ/∂W) = Σ_{τ ≤ t} (∂C_t/∂a_t)(∂a_t/∂a_τ)(∂a_τ/∂W)   (4)

where ∂a_τ/∂W denotes the partial derivative of a_τ with respect to W, holding a_{τ-1} fixed.

Suppose we are in the condition in which the network has robustly latched. Hence, for a term with τ ≪ t, |∂a_t/∂a_τ| → 0. This term tends to become very small in comparison to terms for which τ is close to t. This means that even though there might exist a change in W that would allow a_τ to jump to another (better) basin of attraction, the gradient of the cost with respect to W does not reflect that possibility. This is because the effect of a small change in W would be felt mostly on the near past (τ close to t).

Let us see an example of how this result hampers training a system that requires robust latching of information. Consider for example a system made of two subsystems A and B, with the output of A being fed to the input of B. Suppose that any good solution to the learning problem requires B to store information about events detected by A at time 0, with the output of B at a later, distant time T used to compute an error, as in our minimal problem defined in Section II. If B has not been trained enough to be able to store information for a long time, then gradients of the error at T with respect to the output of A at time 0 are very small, since B does not latch and the outputs of A at time 0 have very little influence on the error at time T. On the other hand, as soon as B is trained enough to reliably store information for a long time, the right gradients can propagate; but because they quickly vanish to very small values, training A is very difficult (depending on the size of T and the amount of noise between 0 and T).

V. ALTERNATIVE APPROACHES

The above section helped us understand better why training a recurrent network to learn long-range input/output dependencies is a hard problem. Gradient-based methods appear inadequate for this kind of problem. We need to consider alternative systems and optimization methods that give acceptable results even when the criterion function is not smooth and has long plateaus. In this section we consider several alternative optimization algorithms for this purpose and compare them to two variants of back-propagation.

One way to help in the training of recurrent networks is to set their connectivity and initial weights (and even constraints on the weights) using prior knowledge. For example, this is accomplished in [8] and [11] using prior rules and sequentiality constraints. In fact, the results in this paper strongly suggest that when such prior knowledge is given, it should be used, since the learning problem itself is so difficult. However, there are many instances where the long-term input/output dependencies are unknown and have to be learned from examples.

A. Simulated Annealing

Global search methods such as simulated annealing can be applied to such problems, but they are generally very slow. We implemented the simulated annealing algorithm presented in [6] for optimizing functions of continuous variables. This is a "batch learning" algorithm (updating parameters after all examples of the training set have been seen). It performs a cycle of random moves, each along one coordinate (parameter) direction. Each point is accepted or rejected according to the Metropolis criterion [13]. New points are selected according to a uniform distribution inside a hyperrectangle around the last point.
The dimensions of the hyperrectangle are updated in order to maintain the average percentage of accepted moves at about one-half of the total number of moves. After a certain number of cycles, the temperature is reduced by a constant multiplicative factor (0.85 in the experiments). Training stops when some acceptable value of the cost function is attained, when learning gets "stuck" (i.e., when the cost evaluated at the most recently visited points does not change by more than a small threshold), or if a maximum number of function evaluations is performed. A "function evaluation" corresponds to performing a single pass through the network for one input sequence.

B. Multi-Grid Random Search

This simple algorithm is similar to the simulated annealing algorithm. Like simulated annealing, it tries random points. However, if the main problem with the learning tasks was plateaus (rather than local minima), an algorithm that accepts only points that reduce the error could be more efficient. This algorithm has that property. It performs a (uniform) random search in a hyperrectangle around the current (best) point. When a better point is found, it reduces the size of the hyperrectangle (by a factor of 0.9 in the experiments) and re-centers it around the new point. The stopping criterion is the same as for simulated annealing (a minimal sketch of this procedure is given at the end of this section).

C. Time-Weighted Pseudo-Newton Optimization

The pseudo-Newton algorithm [2] for neural networks has the advantage of rescaling the learning rate of each weight dynamically to match the curvature of the energy function with respect to that weight. This is of interest because adjusting the learning rate could potentially circumvent the problem of vanishing gradient. The pseudo-Newton algorithm computes
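Below is a minimal sketch of the multi-grid random search described in Section V-B above, under assumed settings: the evaluation budget, the stopping tolerance, and the toy quadratic cost in the usage lines are illustrative stand-ins for the experimental configuration.

```python
import numpy as np

def multigrid_random_search(cost, x0, widths, shrink=0.9, max_evals=10000,
                            target=None, rng=np.random.default_rng(0)):
    """Sketch of the multi-grid random search of Section V-B: sample uniformly
    in a hyperrectangle around the current best point, accept only improvements,
    and shrink the hyperrectangle (re-centering it) whenever a better point is
    found."""
    best_x, best_c = np.asarray(x0, dtype=float), cost(x0)
    widths = np.asarray(widths, dtype=float)
    for _ in range(max_evals):
        if target is not None and best_c <= target:
            break                                   # acceptable cost reached
        x = best_x + rng.uniform(-widths, widths)   # uniform draw in the hyperrectangle
        c = cost(x)
        if c < best_c:                              # accept only points that reduce the error
            best_x, best_c = x, c
            widths = shrink * widths                # refine the search around the new point
    return best_x, best_c

# Example on a toy quadratic cost (an illustrative stand-in for the network cost).
x, c = multigrid_random_search(lambda p: float(np.sum(p ** 2)),
                               x0=np.ones(4), widths=np.full(4, 1.0), target=1e-4)
```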
