IFAC PapersOnLine 53-2 (2020) 1243–1248
On the vanishing and exploding gradient problem in Gated Recurrent Units

Alexander Rehmer, Andreas Kroll

Department of Measurement and Control, Institute for System Analytics and Control,
Faculty of Mechanical Engineering, University of Kassel, Germany
(e-mail: {alexander.rehmer, andreas.kroll}@mrt.uni-kassel.de)

Abstract: Recurrent Neural Networks are applied in areas such as speech recognition, natural language and video processing, and the identification of nonlinear state space models. Conventional Recurrent Neural Networks, e.g. the Elman Network, are hard to train. A more recently developed class of recurrent neural networks, the so-called Gated Units, outperform their counterparts on virtually every task. This paper aims to provide additional insights into the differences between RNNs and Gated Units in order to explain the superior performance of gated recurrent units. It is argued that Gated Units are easier to optimize not because they solve the vanishing gradient problem, but because they circumvent the emergence of large local gradients.

Copyright © 2020 The Authors. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0). Peer review under responsibility of International Federation of Automatic Control. doi: 10.1016/j.ifacol.2020.12.1342

Keywords: Nonlinear system identification, Recurrent Neural Networks, Gated Recurrent Units.
1. INTRODUCTION

Gated Units, such as the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), were originally developed to overcome the vanishing gradient problem, which occurs in the Elman Recurrent Neural Network (RNN) (Pascanu et al., 2012). They have since outperformed RNNs on a number of tasks, such as natural language, speech and video processing (Jordan et al., 2019), and recently also on a nonlinear system identification task (Rehmer and Kroll, 2019). However, it will be shown that the gradient still vanishes in Gated Units, so other, unaccounted-for mechanisms have to be responsible for their success. Pascanu et al. (2012) show that small changes in the parameters θ of the RNN can lead to drastic changes in the dynamic behavior of the system when crossing certain critical bifurcation points. This in turn results in a huge change in the evolution of the hidden state x̂_k, which leads to a locally large, or exploding, gradient of the loss function. In this paper the GRU will be examined and compared to the RNN with the purpose of providing an alternative explanation of why GRUs outperform RNNs.

First, it will be shown that the gradient of the GRU is in fact smaller than that of the RNN, at least for the parameterizations considered in this paper, although GRUs were originally designed to solve the vanishing gradient problem. Secondly, it will be shown that GRUs are not only capable of representing highly nonlinear dynamics, but are also able to represent approximately linear dynamics via a number of different parameterizations. Since a linear model is always a good first guess, the easy accessibility of different parameterizations that produce linear dynamics makes the GRU less sensitive to its initial choice of parameters and thus simplifies the optimization problem. In the end the GRU and the RNN will be compared on a simple academic example and on a real nonlinear identification task.

2. RECURRENT NEURAL NETWORKS

In this section the Simple Recurrent Neural Network (RNN), also known as the Elman Network, and the Gated Recurrent Unit (GRU) will be introduced.

2.1 Simple Recurrent Neural Network

Fig. 1. Representation of the Elman Network: Layers of neurons are represented as rectangles, connections between layers represent fully connected layers.

The RNN as depicted in figure 1 is a straightforward realization of a nonlinear state space model (Nelles, 2001). It consists of one hidden recurrent layer with nonlinear activation function f_h, which aims to approximate the state equation, as well as one hidden feedforward layer f_g and one linear output layer, which together aim to approximate the output equation. For simplicity of notation, the linear output layer is omitted in the following equations and figures:

x̂_{k+1} = f_h(W_x x̂_k + W_u u_k + b_h),
ŷ_k = f_g(W_y x̂_k + b_g),                                              (1)

with x̂_k ∈ R^{n×1}, ŷ_k ∈ R^{m×1}, u_k ∈ R^{l×1}, W_x ∈ R^{n×n}, W_u ∈ R^{n×l}, b_h ∈ R^{n×1}, W_y ∈ R^{m×n}, b_g ∈ R^{m×1} and f_h : R^{n×1} → R^{n×1}, f_g : R^{m×1} → R^{m×1}. Usually tanh(·) is employed as nonlinear activation function.
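As a minimal illustration of the state and output recursion (1), the following NumPy sketch simulates an Elman network with tanh activations; the dimensions and randomly drawn weights are hypothetical placeholders, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, l = 3, 1, 1          # hypothetical state, output and input dimensions

# randomly chosen parameters, for illustration only
W_x, W_u, b_h = rng.normal(size=(n, n)), rng.normal(size=(n, l)), np.zeros((n, 1))
W_y, b_g = rng.normal(size=(m, n)), np.zeros((m, 1))

def rnn_step(x_hat, u_k):
    """One step of the Elman network (1): state update and output."""
    x_next = np.tanh(W_x @ x_hat + W_u @ u_k + b_h)   # f_h = tanh
    y_hat = np.tanh(W_y @ x_hat + b_g)                # f_g = tanh
    return x_next, y_hat

x_hat = np.zeros((n, 1))
for k in range(100):
    u_k = rng.normal(size=(l, 1))
    x_hat, y_hat = rnn_step(x_hat, u_k)
```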
When training an RNN, the recurrent model is unfolded over the whole training sequence of length N, and the gradient of the loss function L with respect to the model parameters θ is calculated. As a consequence of the feedback, the gradient of the error

e_k = ŷ_k − y_k^target                                                  (2)

at time step k with respect to the model parameters θ = {W_x, W_u, W_y, b_h, b_g} depends on the previous state x̂_{k−1}, which depends again on the model parameters:

∂e_k/∂θ = (∂e_k/∂ŷ_k)(∂ŷ_k/∂x̂_k)(∂x̂_k/∂θ) + (∂x̂_k/∂x̂_{k−1})(∂x̂_{k−1}/∂θ).    (3)

For example, the gradient of the hidden state x̂_k with respect to W_x is

∂x̂_k/∂W_x = Σ_{τ=1}^{N} x̂_{k−τ} f_h′^(k−τ+1)(·) Π_β f_h′^(k−β)(·) W_x^(k−β),
β = τ−2, τ−3, … ∀β ≥ 0.                                                 (4)

High indices in brackets indicate the particular time step. The product term in (4), which also appears when computing the gradient with respect to the other parameters, decreases exponentially with τ if |f_h′ · ρ(W_x)| < 1, where ρ(W_x) is the spectral radius of W_x. Essentially, backpropagating an error one time step involves a multiplication of the state with a derivative that is possibly smaller than one and a matrix whose spectral radius is possibly smaller than one. Hence, the gradient vanishes after a certain number of time steps. In the Machine Learning community it is argued that the vanishing gradient prevents learning of so-called long-term dependencies in acceptable time (Hochreiter and Schmidhuber, 1997; Goodfellow et al., 2016), i.e. when huge time lags exist between input u_k and output ŷ_k. Gated recurrent units, like the LSTM and the GRU, were developed to solve this problem and have since then outperformed classical RNNs on virtually any task. However, it can be shown that the gradient also vanishes in gated recurrent units. Additionally, the vanishing of the gradient over time is a desirable property. In most systems, the influence of a previous state x_{k−τ} on the current state x̂_k decreases over time. Unless one wants to design a marginally stable or unstable system, e.g. when performing tasks like unbounded counting, or when dealing with large dead times, the vanishing gradient has no negative effect on the optimization procedure.
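To make the decay of the product term in (4) concrete, the scalar sketch below accumulates the factors f_h′(·) · w_x along a simulated trajectory; the parameter values are arbitrary choices satisfying |f_h′ · w_x| < 1 and serve only as an illustration.

```python
import numpy as np

w_x, b_x = 0.8, 0.0          # arbitrary scalar parameters with |w_x| < 1
x_hat = 0.5
product = 1.0                # running product of f_h'(.) * w_x factors

for step in range(1, 21):
    z = w_x * x_hat + b_x
    product *= (1.0 - np.tanh(z) ** 2) * w_x   # tanh'(z) = 1 - tanh(z)^2
    x_hat = np.tanh(z)
    if step % 5 == 0:
        print(f"after backpropagating {step} steps: |product| = {abs(product):.2e}")
```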
2.2 The Gated Recurrent Unit

The Gated Recurrent Unit (GRU) (Cho et al., 2014) is, besides the LSTM, the most often applied architecture of gated recurrent units. The general concept of gated recurrent units is to manipulate the state x̂_k through the addition or multiplication of the activations of so-called gates, see figure 2. Gates are almost exclusively one-layered neural networks with nonlinear sigmoid activation functions. The state equation of the GRU is

x̂_{k+1} = f_z ⊙ x̂_k + (1 − f_z) ⊙ f_c(x̃_k)                             (5)

with x̃_k = f_r ⊙ x̂_k. The operator ⊙ denotes the Hadamard product. The activations of the so-called reset gate f_r, update gate f_z and output gate f_c are given by

f_r = σ(W_r · [x̂_k, u_k] + b_r),
f_z = σ(W_z · [x̂_k, u_k] + b_z),                                        (6)
f_c = tanh(W_c · [x̃_k, u_k] + b_c),

where W_r, W_z, W_c ∈ R^{n×(n+l)}, b_r, b_z, b_c ∈ R^{n×1} and f_r, f_z, f_c : R^{n×1} → R^{n×1}. σ(·) denotes the logistic function. In order to map the states estimated by the GRU to the output, the GRU has to be equipped either with an output layer, as the RNN, or with an output gate, as the LSTM.

Fig. 2. The Gated Recurrent Unit (GRU). Gates are depicted as rectangles with their respective activation functions.
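A minimal NumPy sketch of one GRU state update according to (5) and (6); the dimensions and weights are hypothetical placeholders, and the additional output mapping mentioned above is omitted.

```python
import numpy as np

def sigma(a):
    """Logistic function used by the gates."""
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
n, l = 3, 1                                   # hypothetical dimensions
W_r, W_z, W_c = (rng.normal(size=(n, n + l)) for _ in range(3))
b_r, b_z, b_c = (np.zeros((n, 1)) for _ in range(3))

def gru_step(x_hat, u_k):
    """One step of the GRU state equation (5) with gates (6)."""
    v = np.vstack([x_hat, u_k])               # [x_hat_k, u_k]
    f_r = sigma(W_r @ v + b_r)                # reset gate
    f_z = sigma(W_z @ v + b_z)                # update gate
    x_tilde = f_r * x_hat                     # Hadamard product
    f_c = np.tanh(W_c @ np.vstack([x_tilde, u_k]) + b_c)   # output gate
    return f_z * x_hat + (1.0 - f_z) * f_c    # state update (5)

x_hat = np.zeros((n, 1))
x_hat = gru_step(x_hat, np.array([[0.5]]))
```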
3. GRADIENT OF THE STATE EQUATIONS

In this section, the gradients of the state equations of the RNN (1) and the GRU (5) w.r.t. their parameters will be compared to each other. In the cases examined the gradient of the GRU is, somewhat surprisingly, at most as large as that of the RNN, but usually smaller.

In order to allow for an easily interpretable visualization, the analysis will be restricted to one-dimensional and autonomous systems. Also, the GRU will be simplified by eliminating the reset gate f_r from (6), such that x̃_k = x̂_k. Taking the gradient of the RNN's state equation in (1) w.r.t. w_x yields

∂x̂_{k+1}/∂w_x = (x̂_k + w_x ∂x̂_k/∂w_x) · tanh′(w_x x̂_k + b_x).          (7)

The gradient of the GRU's state equation (5) with respect to w_z is

∂x̂_{k+1}/∂w_z = (1 − tanh(x̂_k; θ_c)) σ′(x̂_k; θ_z) (x̂_k + w_z ∂x̂_k/∂w_z)
              + (1 − σ(x̂_k; θ_z)) tanh′(x̂_k; θ_c) ∂x̂_k/∂w_z,            (8)

and with respect to w_c

∂x̂_{k+1}/∂w_c = σ′(x̂_k; θ_z) (x̂_k − tanh(x̂_k; θ_c)) ∂x̂_k/∂w_c
              + σ(x̂_k; θ_z) ∂x̂_k/∂w_c
              + (1 − σ(x̂_k; θ_z)) tanh′(x̂_k; θ_c) (x̂_k + w_c ∂x̂_k/∂w_c). (9)

For convenience of notation, θ_z and θ_c denote the parameters of f_z and f_c respectively, i.e. θ_z = [w_z, b_z] and θ_c = [w_c, b_c].
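The recursions above can be checked numerically without writing them out over many time steps. The sketch below simulates the one-dimensional autonomous RNN and the reset-free GRU and estimates the total derivative of the final state w.r.t. a single weight by central finite differences; this is a formula-agnostic check, not the paper's procedure, and the parameter values are arbitrary.

```python
import numpy as np

def simulate_rnn(w_x, b_x=0.0, x0=0.5, steps=20):
    """1-D autonomous RNN: x_{k+1} = tanh(w_x * x_k + b_x)."""
    x = x0
    for _ in range(steps):
        x = np.tanh(w_x * x + b_x)
    return x

def simulate_gru(w_z, b_z, w_c, b_c, x0=0.5, steps=20):
    """1-D autonomous GRU without reset gate: x_{k+1} = f_z*x_k + (1-f_z)*f_c."""
    x = x0
    for _ in range(steps):
        f_z = 1.0 / (1.0 + np.exp(-(w_z * x + b_z)))
        f_c = np.tanh(w_c * x + b_c)
        x = f_z * x + (1.0 - f_z) * f_c
    return x

def num_grad(fun, w, eps=1e-6):
    """Central finite difference as a formula-agnostic gradient estimate."""
    return (fun(w + eps) - fun(w - eps)) / (2.0 * eps)

# arbitrary illustrative parameterization
g_rnn = num_grad(lambda w: simulate_rnn(w_x=w), 1.0)
g_gru = num_grad(lambda w: simulate_gru(w_z=1.0, b_z=0.0, w_c=w, b_c=0.0), 1.0)
print(f"|d x_N / d w_x| (RNN) = {abs(g_rnn):.3f}, |d x_N / d w_c| (GRU) = {abs(g_gru):.3f}")
```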
It is cumbersome to write down the gradients in (8) and (9) for an arbitrary number of time steps, as done for the [...]

[...] and the GRU hence becomes an RNN. For all other configurations, the gradient is in fact smaller.

In contrast to the point made by the vanishing gradient argument, namely that if the gradient vanishes after some time steps optimization might take prohibitively long, we argue that a smaller gradient of the state equation w.r.t. its parameters is beneficial when training recurrent neural networks. If the gradient is large, as is the case with the RNN, a small change in parameters will lead to a huge change in the state trajectory. Since at every time step the previous state is again fed to the recurrent network, these [...]

Fig. 4. Graphical decomposition of the GRU's state equation: (a) f_z · x̂_k, (b) (1 − f_z) f_c, (c) f_z · x̂_k + (1 − f_z) · f_c; the panels also show x̂_k, f_z, f_c and (1 − f_z).
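The preceding argument can be illustrated with the one-step sensitivities of the two state equations: for the same weight and bias, the GRU's explicit sensitivity w.r.t. w_c equals the RNN's sensitivity w.r.t. w_x damped by the factor (1 − f_z). The sketch below evaluates both on a grid of states; the parameter values are arbitrary and only serve as an illustration.

```python
import numpy as np

# One-step sensitivities of the state equations w.r.t. a single weight,
# evaluated on a grid of states; parameters chosen arbitrarily.
x = np.linspace(-1.0, 1.0, 201)
w, b = 1.0, 0.0

dtanh = lambda a: 1.0 - np.tanh(a) ** 2
sig = lambda a: 1.0 / (1.0 + np.exp(-a))

grad_rnn = np.abs(x * dtanh(w * x + b))                           # |d x_{k+1} / d w_x| for the RNN
grad_gru = np.abs((1.0 - sig(w * x + b)) * x * dtanh(w * x + b))  # same term damped by (1 - f_z) in the GRU

print(f"max one-step sensitivity: RNN {grad_rnn.max():.3f}, GRU {grad_gru.max():.3f}")
```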
Fig. 5. Parameterizations for which the GRU converges towards a tanh, leaky binary step or binary step activation function: (a) b_z → −∞, (b) w_z, w_c → 0, (c) w_z, b_c → 0, w_c → ∞, b_z → −∞.

Fig. 7. Parameterizations for which the GRU converges to linear activation functions: (a) b_z → +∞, (b) w_z, w_c → 0.

[...]ing different limits in the parameter space is especially important when identifying technical systems. Even for highly nonlinear processes the one-step prediction surface becomes more and more linear with increasing sampling rate (Nelles, 2001). Therefore even when estimating a model for a nonlinear process, it is important that the [...]

Fig. 8. RNN's loss function and magnitude of the gradient on the linear identification task: (a) RNN loss function, (b) RNN gradient.

Fig. 9. GRU's loss function and magnitude of the gradient on the linear identification task: (a) GRU loss function for w_z = 1, b_z = 0, (b) GRU gradient for w_z = 1, b_z = 0, (c) GRU loss function for w_c = 1, b_c = 0, (d) GRU gradient for w_c = 1, b_c = 0.

[...] available in this case study) by multiplying the output of the output gate f_c with 1 − f_z and thereby diminishing its influence. The effect is a smooth loss function without large gradients.

6. CASE STUDY: ELECTRO-MECHANICAL THROTTLE

To test whether the properties of the GRU also prove beneficial in real-life applications, it was compared to an RNN on a real nonlinear dynamical system.

6.1 The test system

Fig. 10. Technology schematic of the electro-mechanical throttle.

[...] of the system (Gringard and Kroll, 2016): a lower and an upper hard stop, state-dependent friction, and a nonlinear return spring.

6.2 Excitation Signals

One multisine signal and two Amplitude Modulated Pseudo-Random Binary Sequences (APRBS 1 and APRBS 2) have been used to excite the system. The multisine signal has a length of ≈ 10 s or 10^3 instances, and the APRBS signals have a length of ≈ 25 s or 2500 instances each. For the multisine signal, an upper frequency of f_u = 7.5 Hz has been used. For the APRBS signals, the holding time is T_H = 0.1 s. See (Gringard and Kroll, 2016) for more information on the test signal design.
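For orientation, a rough NumPy sketch of how excitation signals with the stated properties could be generated; the 100 Hz sampling rate is inferred from the stated signal lengths, the amplitude range and the random-phase multisine design are assumptions, and the actual test signal design follows Gringard and Kroll (2016).

```python
import numpy as np

rng = np.random.default_rng(2)
fs = 100.0                                   # sampling rate implied by 10 s ~ 10^3 instances

# Multisine: ~10 s, random-phase sum of harmonics up to f_u = 7.5 Hz
t = np.arange(0, 10.0, 1.0 / fs)
freqs = np.arange(0.1, 7.5, 0.1)
multisine = sum(np.sin(2 * np.pi * f * t + rng.uniform(0, 2 * np.pi)) for f in freqs)
multisine /= np.max(np.abs(multisine))       # normalize to an assumed unit amplitude

# APRBS: ~25 s, holding time T_H = 0.1 s, random amplitude per hold interval
n_holds = int(25.0 / 0.1)
aprbs = np.repeat(rng.uniform(-1.0, 1.0, size=n_holds), int(0.1 * fs))
```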
6.3 Data preprocessing

APRBS 1 and its response signal were scaled to the interval [−1, 1]; all other signals were scaled accordingly. The data was then divided into training, validation and test datasets in the following way:

• Training dataset: Consists of two batches. The first batch comprises 80 % of all instances of the multisine signal and the corresponding system response. The second batch consists of 70 % of APRBS 1 and its corresponding response signal.
• Validation dataset: Consists of two batches. The first batch comprises the remaining 20 % of the multisine signal and the corresponding system response. The second batch consists of the remaining 30 % of APRBS 1 and the response signal.
• Test dataset: One batch. APRBS 2 and its corresponding system response.

This division was chosen because the multisine's response signal almost exclusively covers the medium operating range, while the APRBS' response signals also cover the lower and upper hard stops.
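A sketch of this preprocessing, assuming the raw signals are available as 1-D NumPy arrays; the array names (u_aprbs1, y_aprbs1, u_multisine, y_multisine) are hypothetical, the random arrays only stand in for the measurements, and the scaling derived from APRBS 1 is reused for all other signals as described above.

```python
import numpy as np

def make_scaler(x):
    """Affine map sending [x.min(), x.max()] to [-1, 1]."""
    lo, hi = x.min(), x.max()
    return lambda s: 2.0 * (s - lo) / (hi - lo) - 1.0

# hypothetical signal arrays; in practice these come from the measurements
u_aprbs1, y_aprbs1 = np.random.randn(2500), np.random.randn(2500)
u_multisine, y_multisine = np.random.randn(1000), np.random.randn(1000)

scale_u, scale_y = make_scaler(u_aprbs1), make_scaler(y_aprbs1)
u_aprbs1, y_aprbs1 = scale_u(u_aprbs1), scale_y(y_aprbs1)
u_multisine, y_multisine = scale_u(u_multisine), scale_y(y_multisine)

n_ms, n_ap = len(u_multisine), len(u_aprbs1)
train = [(u_multisine[: int(0.8 * n_ms)], y_multisine[: int(0.8 * n_ms)]),
         (u_aprbs1[: int(0.7 * n_ap)], y_aprbs1[: int(0.7 * n_ap)])]
val = [(u_multisine[int(0.8 * n_ms):], y_multisine[int(0.8 * n_ms):]),
       (u_aprbs1[int(0.7 * n_ap):], y_aprbs1[int(0.7 * n_ap):])]
# the test dataset would consist of APRBS 2 and its response, scaled with the same maps
```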
Table 1. Model structures.

GRU
  1st Layer (Gated Unit):
    dim(x):      3    4    5    6    8    10
  2nd Layer:
    f_g(·):      tanh
    #(Neurons):  4    5    6    7    8    10
  dim(θ):        66   103  149  201  321  481

RNN
  1st Layer:
    f_h(·):      tanh
    dim(x):      3    4    5    6    8    10
  2nd Layer:
    f_g(·):      tanh
    #(Neurons):  4    5    6    7
  dim(θ):        36   55   78   105  151  205

6.5 Model Training

Each of the models in Table 1 was initialized randomly 20 times and trained for 800 epochs. Between each batch, the models' initial states were set to zero. Parameters were estimated based on the training dataset using the ADAM optimizer (Kingma and Ba, 2015) with its default parameter configuration (α = 0.01, β_1 = 0.9, β_2 = 0.999).

6.6 Results

At the end of the optimization procedure, the model with the parameter configuration which performed best on the validation dataset was selected and evaluated on the test dataset. The performance of the models is measured in terms of their best fit rate (BFR):

BFR = 100% · max(1 − ‖y_k − ŷ_k‖_2 / ‖y_k − ȳ‖_2, 0).                   (16)
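The best fit rate (16) translates directly into a short function; y and y_hat are assumed to be 1-D arrays containing the measured and simulated output over the test sequence.

```python
import numpy as np

def best_fit_rate(y, y_hat):
    """Best fit rate (16) in percent: 100 * max(1 - ||y - y_hat|| / ||y - mean(y)||, 0)."""
    return 100.0 * max(1.0 - np.linalg.norm(y - y_hat) / np.linalg.norm(y - np.mean(y)), 0.0)

# example with dummy data
y = np.sin(np.linspace(0, 10, 200))
print(f"BFR of a slightly disturbed signal: {best_fit_rate(y, y + 0.05):.1f} %")
```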
Figure 11 shows the BFR of the RNN and the GRU on the test dataset. As expected, nonlinear optimization of the GRU consistently yields high performing models, while the performance of the RNNs fluctuates strongly. It should be noted that there are cases where the RNN's performance matches or even exceeds the performance of the GRU (e.g. for dim(x) = 10). This proves that the RNN is in general able to represent the test system; it just seems very unlikely to arrive at such a parameterization during the optimization process, arguably because of the issues discussed in Section 3.

Fig. 11. Boxplot of the BFR of RNN and GRU on the test dataset. Each model was initialized 20 times and trained for 800 epochs.

7. CONCLUSIONS & OUTLOOK

It was shown that the gradient of the GRU's state equation w.r.t. its parameters is at most as large as, but usually smaller than, that of the RNN, provided the L1 norm of all weights is smaller than or equal to one. This finding contradicts the argument that a vanishing gradient is responsible for the RNN's poor performance on various tasks. The first point made in this paper is that the smaller gradient produced by the GRU helps gradient based optimization, since small changes in the parameter space correspond to small changes in the evolution of the state, which in turn produces a smooth loss function without large gradients. The second argument made is that the GRU's state equation converges to different functions when certain parameters become larger. This corresponds to producing large planes in the loss function along which the solution improves steadily, rather than narrow valleys, as they are produced by the RNN. The analyses provided in this paper have yet to be generalized to state space networks with arbitrary dimensions and the whole parameter space.

REFERENCES

Cho, K. et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Gringard, M. and Kroll, A. (2016). On the systematic analysis of the impact of the parametrization of standard test signals. In IEEE Symposium Series of Computational Intelligence 2016. IEEE, Athens, Greece.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Jordan, I.D., Sokol, P.A., and Park, I.M. (2019). Gated recurrent units viewed through the lens of continuous time dynamical systems. arXiv preprint arXiv:1906.01005.
Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In 3rd International Conference for Learning Representations (ICLR 2015).
Nelles, O. (2001). Nonlinear System Identification: From Classical Approaches to Neural Networks and Fuzzy Models. Springer, Berlin Heidelberg, Germany.
Pascanu, R., Mikolov, T., and Bengio, Y. (2012). Understanding the exploding gradient problem. CoRR, abs/1211.5063.
Rehmer, A. and Kroll, A. (2019). On using gated recurrent units for nonlinear system identification. In Preprints of the 18th European Control Conference (ECC), 2504–2509. IFAC, Naples, Italy.