


International Journal of Neural Systems, Vol. 17, No. 4 (2007) 253–263
© World Scientific Publishing Company

RECURRENT NEURAL NETWORKS ARE UNIVERSAL APPROXIMATORS

ANTON MAXIMILIAN SCHÄFER
University of Osnabrück, Neuroinformatics Group
Albrechtstraße 28, 49069 Osnabrück, Germany
Schaefer.Anton.Ext@siemens.com

HANS-GEORG ZIMMERMANN
Siemens AG, Corporate Technology, Information & Communications,
Learning Systems, Otto-Hahn-Ring 6, 81739 Munich, Germany
Hans Georg.Zimmermann@siemens.com

Recurrent Neural Networks (RNN) have been developed for a better understanding and analysis of open dynamical systems. Still, the question often arises whether RNN are able to map every open dynamical system, which would be desirable for a broad spectrum of applications. In this article we give a proof of the universal approximation ability of RNN in state space model form and even extend it to Error Correction and Normalized Recurrent Neural Networks.

Keywords: Dynamical systems; system identification; recurrent neural networks; universal approximation.

1. Introduction

Recurrent Neural Networks (RNN) allow the identification of dynamical systems in the form of high-dimensional, nonlinear state space models. In comparison to Feedforward Neural Networks they offer the particular advantage that they are able to explicitly model memory and to identify inter-temporal dependencies.1 Recent developments of RNN are summarized in the books of Haykin,2,3 Kolen and Kremer,4 Medsker and Jain,5 and Soofi and Cao.6

In previous papers, e.g. (Ref. 7), we discussed the modeling of open dynamical systems based on time-delay RNN, which can be represented in a state space model form. We solved the system identification task by finite unfolding in time, i.e., we transferred the temporal problem into a spatial architecture,7 which can be handled by error backpropagation through time (BPTT),8 a shared-weights extension of the standard backpropagation algorithm.9 Further, we enforced the learning of the autonomous dynamics in an open system by overshooting.7 Consequently our RNN not only learn from data but also integrate prior knowledge and first principles into the modeling in the form of architectural concepts. We also extended standard RNN by error correction, i.e., Error Correction Neural Networks (ECNN),10 and proceeded to large Neural Networks by introducing Normalized Recurrent Neural Networks (NRNN).1 The latter provide the possibility to expand into higher dimensions and to solve open dynamical systems in an integrated approach.1

However, the question arises whether the outlined Recurrent Neural Networks are all able to identify and to approximate any open dynamical system, i.e., whether they possess a universal approximation ability.

In 1989 Hornik, Stinchcombe, and White11 showed that any Borel-measurable function on a compact domain can be approximated by a three-layered Feedforward Neural Network, i.e., a Feedforward Network with one hidden layer, with an arbitrary accuracy. In the same year Cybenko12 and Funahashi13 found similar results, each with different methods. Whereas the proof of Hornik, Stinchcombe, and White11 is based on the Stone-Weierstrass theorem, Cybenko12 makes in principle use of the Hahn-Banach and Riesz theorems. Funahashi13 mainly applies the Irie-Miyake and the Kolmogorov-Arnold-Sprecher theorems.


Some work has also been done on the approximation abilities of different kinds of Recurrent Networks. Hammer14 for example considers the capability of Recurrent Neural Networks to approximate functions from a list of real vectors to a real vector space. Further contributions to the topic have been given by Chen and Chen,15 Back and Chen,16 Kilian and Siegelmann,17 Maass, Natschläger and Markram18 as well as Sontag.19

In this article we focus on open dynamical systems and prove that those can be approximated by RNN in state space model form with an arbitrary accuracy. We further show that this result holds for ECNN and even for NRNN.

We start with a short introduction to open dynamical systems and to RNN in state space model form as well as ECNN and NRNN (Sec. 2). We then recall the basic results of the universal approximation theorem of Hornik, Stinchcombe, and White11 (Sec. 3). Subsequently we show that these results can be extended to RNN in state space model form and consequently give a proof of their universal approximation ability (Sec. 4). We further extend this proof to ECNN (Sec. 5). Finally we show that RNN are a subset of NRNN and derive that NRNN are also universal approximators (Sec. 6). We conclude with a short summary and an outlook on further research (Sec. 7).

2. Open Dynamical Systems and Recurrent Neural Networks

[Fig. 1. Open dynamical system with input x, hidden state s and output y.]

Figure 1 illustrates an open dynamical system in discrete time, which can be described as a set of equations consisting of a state transition and an output equation(a):

    s_{t+1} = g(s_t, x_t)    (state transition)
    y_t = h(s_t)             (output equation)                     (1)

    (a) Alternative descriptions of open dynamical systems are given in Corollary 2.

The state transition is a mapping from the present internal hidden state of the system s_t and the external inputs x_t to the new state s_{t+1}. The output equation computes the observable output y_t.2,7

The system can be viewed as a partially observable autoregressive dynamic state transition s_t → s_{t+1} that is also driven by external forces x_t. Without the external inputs the system is called an autonomous system.2,20 However, most real-world systems are driven by a superposition of an autonomous development and external influences.

If we assume that the state transition does not depend on s_t, i.e., y_t = h(s_t) = h(g(x_{t−1})), we are back in the framework of Feedforward Neural Networks.7 However, the inclusion of the internal hidden dynamics makes the modeling task much harder, because it allows varying inter-temporal dependencies. Theoretically, in the recurrent framework an event s_{t+1} is explained by a superposition of external inputs x_t, x_{t−1}, . . . from all the previous time steps.2

2.1. RNN unfolded in time

In previous papers, e.g. (Refs. 1 and 7), we proposed to map open dynamical systems (Eq. (1)) to a Recurrent Neural Network (RNN) in state space model form

    s_{t+1} = f(A s_t + B x_t − θ)
    y_t = C s_t                                                    (2)

where A, B, and C are weight matrices of appropriate dimensions and θ is a bias, which handles offsets in the input variables x_t.2,7 f is the so-called activation function of the network, which is typically sigmoidal (Def. 3), e.g., the hyperbolic tangent.

A major advantage of RNN written in the form of a state space model (Eq. (2)) is the explicit correspondence between equations and architecture. It is easy to see that the set of equations (2) can be directly transferred into a spatial neural network architecture using so-called finite unfolding in time and shared weight matrices A, B, and C.2,8 Figure 2 depicts the resulting model.7

[Fig. 2. Recurrent Neural Network unfolded in time.]
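To make the state space form (Eq. (2)) and its finite unfolding in time concrete, the following minimal NumPy sketch (not taken from the paper; the dimensions, the random weights and the choice of the hyperbolic tangent as the sigmoidal activation f are illustrative assumptions) applies the shared weight matrices A, B, and C at every unfolded time step:

```python
import numpy as np

def rnn_unfold(A, B, C, theta, x_seq, s0, f=np.tanh):
    """Eq. (2): s_{t+1} = f(A s_t + B x_t - theta), y_t = C s_t.

    The same (shared) weight matrices A, B, C are applied at every
    unfolded time step, which is exactly the finite unfolding in time."""
    s, outputs = s0, []
    for x in x_seq:
        outputs.append(C @ s)              # output equation y_t = C s_t
        s = f(A @ s + B @ x - theta)       # state transition
    return np.array(outputs)

# illustrative dimensions: I inputs, J hidden states, N outputs, T time steps
I, J, N, T = 3, 5, 2, 10
rng = np.random.default_rng(0)
A = rng.normal(size=(J, J)); B = rng.normal(size=(J, I)); C = rng.normal(size=(N, J))
theta, s0 = rng.normal(size=J), np.zeros(J)
y = rnn_unfold(A, B, C, theta, rng.normal(size=(T, I)), s0)
```

Because the loop reuses the same matrices at each step, training such an unfolded network with BPTT amounts to standard backpropagation on the spatial architecture with shared weights.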
2.2. Error correction neural networks

In general, RNN (Eq. (2)) only allow us to identify the inter-temporal relationships of the underlying system dynamics when we have a complete description of all external influences.2 Unfortunately, our knowledge about the external forces is typically incomplete and our observations might be noisy. Under such conditions, learning with finite data sets leads to the construction of incorrect causalities due to learning by heart (overfitting). The generalization properties of such a model are questionable.21

In those cases, we can refer to the actual model error (y_t − y_t^d), which can be interpreted as an indicator that our model is misleading. Handling this error information as an additional input, we extend the open dynamical system (Eq. (1)), obtaining:

    s_{t+1} = g(s_t, x_t, y_t − y_t^d)
    y_t = h(s_t)                                                   (3)

The state transition s_{t+1} is now a mapping from the previous state s_t, external influences x_t, and a comparison between the model output y_t and the observed data y_t^d. If the model error (y_t − y_t^d) is zero, we have a perfect description of the dynamics. However, due to unknown external influences or noise, our knowledge about the dynamics is often incomplete. Under such conditions, the model error (y_t − y_t^d) quantifies the model's misfit and serves as an indicator of short-term effects or external shocks.10

Using weight matrices A, B, C, and D of appropriate dimensions corresponding to s_t, x_t, and (y_t − y_t^d) and a bias θ, a neural network approach to Eq. (3), which we call Error Correction Neural Network (ECNN), can be written as

    s_{t+1} = f(A s_t + B x_t + D tanh(y_t − y_t^d) − θ)
    y_t = C s_t.                                                   (4)

Here, matrix D adjusts a possible difference in the dimensions of the error correction term (y_t − y_t^d) and the internal state s_t. The additional hyperbolic tangent serves as a remedy against numerical instabilities caused by the possibility to code the autoregressive structure either in matrix A or in matrices C and D.10

Analogue to the standard RNN (Fig. 2) we have a corresponding architecture of an ECNN (Eq. (4)) unfolded in time (Fig. 3).

[Fig. 3. Error Correction Neural Network using unfolding in time. Note that −I is the fixed negative of an appropriately sized identity matrix, while z_t are output clusters with target values of zero in order to optimize the error correction mechanism.]

The ECNN (Eq. (4)) is best understood by analyzing the dependencies of s_t, x_t, z_t = C s_t − y_t^d, and s_{t+1}. There are two different inputs: the externals x_t, directly influencing the state transition s_{t+1}, and the targets y_t^d. From the latter only the difference between y_t and y_t^d has an impact on s_{t+1}.10 At step t + 1, we have no compensation of the internal expectation y_{t+1}, and thus the system offers a forecast y_{t+1} = C s_{t+1}.
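As a rough illustration of Eq. (4) — a sketch under assumed names and with tanh as activation, not the authors' implementation — the error correction term enters the state transition as an additional input; once observed targets y_t^d are no longer available, the term is dropped and the network runs as a pure forecast:

```python
import numpy as np

def ecnn_unfold(A, B, C, D, theta, x_seq, y_d_seq, s0, f=np.tanh):
    """Eq. (4): s_{t+1} = f(A s_t + B x_t + D tanh(y_t - y_t^d) - theta),
    y_t = C s_t.  y_d_seq holds the observed targets y_t^d; beyond the last
    available target the correction is zero, i.e. the model forecasts."""
    s, outputs = s0, []
    for t, x in enumerate(x_seq):
        y = C @ s                                      # model output y_t = C s_t
        corr = np.tanh(y - y_d_seq[t]) if t < len(y_d_seq) else np.zeros_like(y)
        s = f(A @ s + B @ x + D @ corr - theta)        # error-corrected transition
        outputs.append(y)
    return np.array(outputs)
```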

2.3. Normalized recurrent neural networks

As we will show in Sec. 4, standard RNN (Eq. (2)) are able to approximate any open dynamical system. Still, the difficulty with this kind of RNN (as well as with ECNN) is the training with BPTT,8 because a sequence of different connectors has to be balanced. The gradient computation is not regular, i.e., we do not have the same learning behavior for the weight matrices in the different time steps. In our experiments we found that this problem becomes more pronounced for the training of large RNN.
Even the training itself is unstable due to the concatenated matrices A, B, and C. As the training changes weights in all of these matrices, different effects or tendencies, even opposing ones, can influence them and may superpose. This implies that no clear learning direction or change of weights results from a certain backpropagated error.1

Normalized Recurrent Neural Networks (NRNN) (Eq. (5)) avoid the stability and learning problems resulting from the concatenation of the three matrices A, B, and C because they incorporate, besides fixed identity matrices I of appropriate dimensions and the bias θ, only a transition matrix A:

    s_t = f(A s_{t−1} + [0 0 I]^T x_t − θ)
    y_t = [I 0 0] s_t                                              (5)

Hence, by using NRNN the modeling is solely focused on the transition matrix A. The matrices between input and hidden as well as hidden and output layer are fixed and therefore not changed during the training process. Consequently matrix A does not only code the autonomous and the externally driven parts of the dynamics, but also the (pre-)processing of the external inputs x_t and the computation of the network outputs y_t. This implies that all free parameters, as they are combined in one matrix, are now treated the same way by BPTT. Figure 4 illustrates the corresponding architecture.

[Fig. 4. Normalized Recurrent Neural Network.]

At first view it seems that in the network architecture (Fig. 4) the external input x_t is directly connected to the corresponding output y_t. This is not the case, because we enlarge the dimension of the internal state s_t such that the input x_t has no direct influence on the output y_t. Assuming that we have a number of I external inputs, Q computational hidden neurons and N outputs, the dimension of the internal state would be J := dim(s) = I + Q + N. With the matrix [I 0 0] we connect only the first N neurons of the internal state s_t to the output layer y_t. As this connector is not trained, it can be seen as a fixed identity matrix of appropriate size. Consequently, the NRNN is forced to generate its N outputs at the first N components of the state vector s_t. The last state neurons are used for the processing of the external inputs x_t. The connector [0 0 I]^T between the externals x_t and the internal state s_t is an appropriately sized fixed identity matrix. More precisely, the connector is designed such that the input x_t is connected to the last I state neurons. To additionally support the internal processing and to increase the network's computational power, we add a number of Q hidden neurons between the first N and the last I state neurons. This composition ensures that the input and output processing of the network is separated, but it implies that NRNN can only be designed as large Neural Networks.1 Note that, in contrast to RNN (Sec. 2.1) and ECNN (Sec. 2.2), the output of the NRNN can be bounded by the activation function, e.g., by using the hyperbolic tangent we have y_t ∈ (−1, 1). Still, this is not a real constraint as we can simply scale the data appropriately before applying it to the network.

Our experiments indicate that, in comparison to RNN, NRNN often achieve better results22 and generally show a more stable training process, especially when the dimension of the internal state is very large.
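The separation of input, hidden and output processing in Eq. (5) can be made explicit in a short sketch (illustrative NumPy code, assuming tanh as activation and a random, small-scale initialization): only the transition matrix A and the bias carry free parameters, while the input and output connectors are frozen identity blocks.

```python
import numpy as np

I_dim, Q, N = 3, 6, 2                 # inputs, hidden neurons, outputs
J = N + Q + I_dim                     # dim(s) = I + Q + N

rng = np.random.default_rng(1)
A = 0.1 * rng.normal(size=(J, J))     # the only trainable matrix
theta = np.zeros(J)

B_fixed = np.zeros((J, I_dim)); B_fixed[-I_dim:, :] = np.eye(I_dim)  # [0 0 I]^T
C_fixed = np.zeros((N, J));     C_fixed[:, :N] = np.eye(N)           # [I 0 0]

def nrnn_step(s_prev, x):
    """One step of Eq. (5): the input is written into the last I state
    neurons, the output is read from the first N state neurons."""
    s = np.tanh(A @ s_prev + B_fixed @ x - theta)
    return s, C_fixed @ s             # output lies in (-1, 1) due to tanh
```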
3. Universal Approximation by Feedforward Neural Networks

Our proof for RNN in state space model form (Sec. 4) is based on the work of Hornik, Stinchcombe and White.11 In the following we therefore recall their definitions and main results.

Definition 1. Let A^I with I ∈ N be the set of all affine mappings A(x) = w · x − θ from R^I to R with w, x ∈ R^I and θ ∈ R. '·' denotes the scalar product.

Transferred to Neural Networks, x corresponds to the input, w to the network weights and θ to the bias.
Definition 2. For any (Borel-)measurable function f(·): R → R and I ∈ N be Σ^I(f) the class of functions

    NN: R^I → R,  NN(x) = Σ_{j=1}^{J} v_j f(A_j(x)),
    x ∈ R^I, v_j ∈ R, A_j ∈ A^I, J ∈ N.

Here NN stands for a three-layered Feedforward Neural Network, i.e., a Feedforward Network with one hidden layer, with I input-neurons, J hidden-neurons and one output-neuron. v_j denotes the weights between hidden- and output-neurons, and f is an arbitrary activation function (Sec. 2.1).

Remark 1. The function class Σ^I(f) can also be written in matrix form

    NN(x) = v f(Wx − θ)

where x ∈ R^I, v, θ ∈ R^J, and W ∈ R^{J×I}. In this context the computation of the function f(·): R^J → R^J be defined component-wise, i.e.,

    f(Wx − θ) := (f(W_1 · x − θ_1), . . . , f(W_j · x − θ_j), . . . , f(W_J · x − θ_J))^T,

where W_j (j = 1, . . . , J) denotes the jth row of the matrix W.

Definition 3. A function f is called a sigmoid function, if f is monotonically increasing and bounded, i.e., f(a) ∈ [α, β], whereas lim_{a→−∞} f(a) = α and lim_{a→∞} f(a) = β, with α, β ∈ R and α < β.

Definition 4. Let C^I and M^I be the sets of all continuous and respectively all Borel-measurable functions from R^I to R. Further denote B^I the Borel-σ-algebra of R^I and (R^I, B^I) the I-dimensional Borel-measurable space.

M^I contains all functions relevant for applications. C^I is a subset of it. Consequently, for every Borel-measurable function f the class Σ^I(f) belongs to the set M^I and for every continuous f to its subset C^I.

Definition 5. A subset S of a metric space (X, ρ) is ρ-dense in a subset U, if there exists, for any ε > 0 and any u ∈ U, an s ∈ S such that ρ(s, u) < ε.

This means that every element of S can approximate any element of U with an arbitrary accuracy. In the following we replace U and X by C^I and M^I respectively and S by Σ^I(f) with an arbitrary but fixed f. The metric ρ is chosen accordingly.

Definition 6. A subset S of C^I is uniformly dense on a compact domain in C^I, if, for any compact subset K ⊂ R^I, S is ρ_K-dense in C^I, where for f, g ∈ C^I, ρ_K(f, g) ≡ sup_{x∈K} |f(x) − g(x)|.

Definition 7. Given a probability measure µ on (R^I, B^I), the metric ρ_µ: M^I × M^I → R^+ be defined as follows:

    ρ_µ(f, g) = inf{ε > 0 : µ{x : |f(x) − g(x)| > ε} < ε}.

Theorem 1 (Universal Approximation Theorem for Feedforward Neural Networks). For any sigmoid activation function f, any dimension I and any probability measure µ on (R^I, B^I), Σ^I(f) is uniformly dense on a compact domain in C^I and ρ_µ-dense in M^I.

This theorem states that a three-layered Feedforward Neural Network, i.e., a Feedforward Neural Network with one hidden layer, is able to approximate any continuous function uniformly on a compact domain and any measurable function in the ρ_µ-metric with an arbitrary accuracy. The proposition is independent of the applied sigmoid activation function f (Def. 3), the dimension of the input space I, and the underlying probability measure µ. Consequently three-layered Feedforward Neural Networks are universal approximators.
Theorem 1 is only valid for Feedforward Neural Networks with I input-neurons and a single output-neuron. Accordingly, only functions from R^I to R can be approximated. However, with a simple extension it can be shown that the theorem holds for Networks with multiple outputs (Cor. 1).

For this, the set of all continuous functions from R^I to R^N, I, N ∈ N, be denoted by C^{I,N} and the one of (Borel-)measurable functions from R^I to R^N by M^{I,N} respectively. The function class Σ^I(f) gets extended to Σ^{I,N}(f) by (re-)defining the weights v_j
(j = 1, . . . , J) in Definition 2 as N × 1 vectors. In matrix form the class Σ^{I,N}(f) is then given by

    NN(x) = V f(Wx − θ)

with x ∈ R^I, θ ∈ R^J, W ∈ R^{J×I}, and V ∈ R^{N×J}. The computation of the function f(·): R^J → R^J be once more defined component-wise (Rem. 1).

In the following, the function g(·): R^I → R^N has the elements g_k, k = 1, . . . , N.

Corollary 1. Theorem 1 holds for the approximation of functions in C^{I,N} and M^{I,N} by the extended function class Σ^{I,N}(f). Thereby the metric ρ_µ is replaced by ρ_µ^N := Σ_{k=1}^{N} ρ_µ(f_k, g_k).

Consequently three-layered multi-output Feedforward Networks are universal approximators for vector-valued functions.

4. Universal Approximation by RNN

The universal approximation theorem for Feedforward Neural Networks (Theo. 1) proves that any (Borel-)measurable function can be approximated by a three-layered Feedforward Neural Network. We now show that RNN in state space model form (Eq. (2)) are also universal approximators and able to approximate every open dynamical system (Eq. (1)) with an arbitrary accuracy.

Definition 8. For any (Borel-)measurable function f(·): R^J → R^J and I, N, T ∈ N be RNN^{I,N}(f) the class of functions

    s_{t+1} = f(A s_t + B x_t − θ)
    y_t = C s_t.

Thereby be x_t ∈ R^I, s_t ∈ R^J, and y_t ∈ R^N, with t = 1, . . . , T. Further be the matrices A ∈ R^{J×J}, B ∈ R^{J×I}, and C ∈ R^{N×J} and the bias θ ∈ R^J. In the following, analogue to Remark 1, the calculation of the function f be defined component-wise, i.e.,

    s_{t+1,j} = f(A_j s_t + B_j x_t − θ_j),

where A_j and B_j (j = 1, . . . , J) denote the jth row of the matrices A and B respectively.

It is obvious that the class RNN^{I,N}(f) is equivalent to the RNN in state space model form (Eq. (2)). Analogue to its description in Sec. 2.1 as well as Definition 2, I stands for the number of input-neurons, J for the number of hidden-neurons and N for the number of output-neurons. x_t denotes the external inputs, s_t the inner states and y_t the outputs of the Neural Network. The matrices A, B, and C correspond to the weight-matrices between hidden- and hidden-, input- and hidden-, and hidden- and output-neurons, respectively. f is an arbitrary activation function.

Theorem 2 (Universal Approximation Theorem for RNN). Let g(·): R^J × R^I → R^J be measurable and h(·): R^J → R^N be continuous, the external inputs x_t ∈ R^I, the inner states s_t ∈ R^J, and the outputs y_t ∈ R^N (t = 1, . . . , T). Then, any open dynamical system of the form

    s_{t+1} = g(s_t, x_t)
    y_t = h(s_t)                                                   (6)

can be approximated by an element of the function class RNN^{I,N}(f) (Def. 8) with an arbitrary accuracy, where f is a continuous sigmoidal activation function (Def. 3).

Proof. The proof is given in two steps. Thereby the equations of the open dynamical system (6) are traced back to the representation by a three-layered Feedforward Neural Network.

In the first step, we conclude that the state space equation of the open dynamical system, s_{t+1} = g(s_t, x_t), can be approximated by a Neural Network of the form s̄_{t+1} = f(A s̄_t + B x_t − θ) for all t = 1, . . . , T.

Let now ε > 0 and K ⊂ R^J × R^I be a compact set which contains (s_t, x_t) and (s̄_t, x_t) for all t = 1, . . . , T. From the universal approximation theorem for Feedforward Networks (Theo. 1) and the subsequent corollary (Cor. 1) we know that for any measurable function g(s_t, x_t): R^J × R^I → R^J and for an arbitrary δ > 0 a function

    NN(s_t, x_t) = V f(W s_t + B x_t − θ̄),

with weight matrices V ∈ R^{J×J̄}, W ∈ R^{J̄×J}, B ∈ R^{J̄×I}, a bias θ̄ ∈ R^{J̄}, and a component-wise applied continuous sigmoid activation function f (Rem. 1), exists such that ∀ t = 1, . . . , T

    sup_{(s_t,x_t) ∈ K} ∥g(s_t, x_t) − NN(s_t, x_t)∥_∞ < δ.        (7)

As f is continuous and T finite, there exists a δ > 0 such that, according to the ε-δ-criterion, we get out of Eq. (7) that for the dynamics

    s̄_{t+1} = V f(W s̄_t + B x_t − θ̄)
the following condition holds

    ∥s_t − s̄_t∥_∞ < ε   ∀ t = 1, . . . , T.                        (8)

Further let

    s̲_{t+1} := f(W s̄_t + B x_t − θ̄),

which gives us that

    s̄_t = V s̲_t.                                                   (9)

With the help of a variable transformation from s̄ to s̲ and the replacement A := WV (∈ R^{J̄×J̄}), we get the desired function on the state s̲:

    s̲_{t+1} = f(A s̲_t + B x_t − θ̄).                                (10)

Remark 2. The transformation from s̄ to s̲ might involve an enlargement of the internal state space dimension.

In the second step we show that the output equation y_t = h(s_t) can be approximated by a Neural Network of the form ȳ_t = C s̄_t. Thereby we have to cope with the additional challenge to approach the nonlinear function h(s_t) of the open dynamical system by a linear equation C s̄_t.

Let ε̃ > 0. As h is continuous per definition, there exists an ε > 0 such that (according to the ε-δ-criterion) out of ∥s_t − s̄_t∥_∞ < ε (Eq. (8)) follows that ∥h(s_t) − h(s̄_t)∥_∞ < ε̃. Consequently it is sufficient to show that ŷ_t = h(s̄_t) can be approximated by a function of the form ȳ_t = C s̄_t with an arbitrary accuracy. The proposition then follows out of the triangle inequality.

Once more we use the universal approximation theorem for Feedforward Networks (Theo. 1) and the subsequent corollary (Cor. 1), which gives us that the equation

    ŷ_t = h(s̄_t)

can be approximated by a Feedforward Neural Network of the form

    ȳ_t = M f(G s̄_t − θ̂),

where M ∈ R^{N×Ĵ} and G ∈ R^{Ĵ×J̄} be suitable weight matrices, f a component-wise applied sigmoid activation function (Rem. 1), and θ̂ ∈ R^{Ĵ} a bias. According to Eqs. (9) and (10) we know that s̄_t = V s̲_t and s̲_{t+1} = f(A s̲_t + B x_t − θ̄). By insertion we get

    ȳ_t = M f(G s̄_t − θ̂)
        = M f(GV s̲_t − θ̂)
        = M f(GV f(A s̲_{t−1} + B x_{t−1} − θ̄) − θ̂).               (11)

Using again Theorem 1, Equation (11) can be approximated by

    ỹ_t = D f(E s̲_{t−1} + F x_{t−1} − θ̃),                          (12)

with suitable weight matrices D ∈ R^{N×J̿}, E ∈ R^{J̿×J̄}, and F ∈ R^{J̿×I}, a bias θ̃ ∈ R^{J̿}, and a component-wise applied sigmoid activation function f (Rem. 1).

If we further set

    r_{t+1} := f(E s̲_t + F x_t − θ̃)   (∈ R^{J̿})

and enlarge the system equations (10) and (12) by this additional component, we achieve the following form (in block notation, with rows separated by semicolons)

    [s̲_{t+1}; r_{t+1}] = f([A 0; E 0] [s̲_t; r_t] + [B; F] x_t − [θ̄; θ̃])
    ỹ_t = [0 D] [s̲_t; r_t].

Their equivalence to the original Eqs. (10) and (12) is easy to see by a component-wise computation.

Finally, out of

    J̃ := J̄ + J̿,   s̃_t := [s̲_t; r_t] ∈ R^{J̃},
    Ã := [A 0; E 0] ∈ R^{J̃×J̃},   B̃ := [B; F] ∈ R^{J̃×I},
    C̃ := [0 D] ∈ R^{N×J̃}   and   θ := [θ̄; θ̃] ∈ R^{J̃},

follows

    s̃_{t+1} = f(Ã s̃_t + B̃ x_t − θ)
    ỹ_t = C̃ s̃_t.                                                   (13)

Equation (13) is apparently an element of the function class RNN^{I,N}(f). Thus, the theorem is proven.
can be approximated by a Feedforward Neural Net-
work of the form
ȳt = Mf (Gs̄t − θ̂) Definition 9. With the same assumptions as in
N ×Jˆ ˆ
where M ∈ R and G ∈ R J×J
be suitable weight definition 8 we define analogously the classes of func-
matrices, f a component-wise applied sigmoid acti- tions RNN I,N I,N
2 (f ) and RNN 3 (f ) as
ˆ
vation function (Rem. 1), and θ̂ ∈ RJ a bias. Accord-
st = f (Ast−1 + Bxt − θ)
ing to Eqs. (9) and (10) we know that s̄t = V st and
st+1 = f (Ast + Bxt − θ̄). By insertion we get yt = Cst

ȳt = Mf (Gs̄t − θ̂) and respectively


= Mf (GV st − θ̂) st+1 = f (Ast + Bxt − θ)
= Mf (GV f (Ast−1 + Bxt−1 − θ̄) − θ̂). (11) yt+1 = Cst .
Corollary 2. Under the conditions of Theorem 2 the universal approximation ability equally holds for the classes RNN_2^{I,N}(f) and RNN_3^{I,N}(f), approximating open dynamical systems of the form

    s_t = g(s_{t−1}, x_t)                 s_{t+1} = g(s_t, x_t)
    y_t = h(s_t)              and         y_{t+1} = h(s_t),

respectively.

Proof. In each case analogue to the proof of Theorem 2.

5. Universal Approximation by ECNN

As an extension of the universal approximation theorem for RNN (Sec. 4) we show that the result holds for ECNN approximating open dynamical systems with error correction (Eq. (3)).

Definition 10. For any (Borel-)measurable function f(·): R^J → R^J and I, N, T ∈ N be ECNN^{I,N}(f) the class of functions

    s_{t+1} = f(A s_t + B x_t + D(y_t − y_t^d) − θ)
    y_t = C s_t.

Thereby be x_t ∈ R^I, s_t ∈ R^J, and y_t, y_t^d ∈ R^N, with t = 1, . . . , T. Further be the matrices A ∈ R^{J×J}, B ∈ R^{J×I}, C ∈ R^{N×J}, and D ∈ R^{J×N} and the bias θ ∈ R^J. Again the calculation of the function f be defined component-wise (Def. 8).

The class ECNN^{I,N}(f) only differs from the ECNN in state space model form (Eq. (4)) in the missing hyperbolic tangent, i.e., the nonlinear transformation of the error-correction term (y_{t−1} − y_{t−1}^d), which has been added to the ECNN for numerical reasons (Sec. 2.2). It has been left out in Definition 10 because it is not required for the theoretical analysis of system approximation; it has simply proved to produce better and especially more stable results in our practical applications.10 Analogue to its description in Sec. 2.2 as well as Definition 8, I stands for the number of input-neurons, J for the number of hidden-neurons and N for the number of output-neurons. x_t denotes the external inputs, s_t the inner states, and y_t the outputs of the Neural Network. The matrices A, B, C, and D correspond to the weight-matrices between hidden- and hidden-, input- and hidden-, hidden- and output-, and error correction- and hidden-neurons respectively. f is an arbitrary activation function.

Theorem 3 (Universal Approximation Theorem for ECNN). Let g(·): R^J × R^I × R^N → R^J be measurable and h(·): R^J → R^N be continuous, the external inputs x_t ∈ R^I, the inner states s_t ∈ R^J, and the expected as well as approximated outputs y_t^d, y_t ∈ R^N (t = 1, . . . , T). Then, any open dynamical system with error correction of the form

    s_{t+1} = g(s_t, x_t, y_t − y_t^d)
    y_t = h(s_t)

can be approximated by an element of the function class ECNN^{I,N}(f) (Def. 10) with an arbitrary accuracy, where f is a continuous sigmoidal activation function (Def. 3).

Proof. Analogue to Theorem 2. The error correction term simply serves as an additional input and therefore does not substantially change the derivations.

6. Universal Approximation by NRNN

We finally extend the universal approximation ability of standard RNN to their normalized version. As we pointed out in Sec. 2.3, NRNN play an important role for the construction of large Neural Networks.1 Besides that, due to the stability increase, i.e., the reduction to one single transition matrix, they often achieve better results in terms of forecast precision than standard RNN.22

In the following we show that RNN are a subset of NRNN by proving that every RNN can be represented by an NRNN. This already implies that the universal approximation theorem for RNN (Theo. 2) also holds for NRNN.

Definition 11. For any (Borel-)measurable function f(·): R^J → R^J and I, N, T ∈ N be NRNN^{I,N}(f) the class of functions

    s_t = f(A s_{t−1} + [0 0 I_I]^T x_t − θ)
    y_t = [I_N 0 0] s_t.                                           (14)

Thereby be x_t ∈ R^I, s_t ∈ R^J, and y_t ∈ R^N, with t = 1, . . . , T. Further be the matrix A ∈ R^{J×J} and the bias θ ∈ R^J. For the fixed identity matrices we have [0 0 I_I]^T ∈ R^{J×I} and [I_N 0 0] ∈ R^{N×J}, where the index i of the identity matrix I_i denotes its dimension. As before the calculation of the function f be defined component-wise (Def. 8).
The class NRNN^{I,N}(f) is equivalent to the NRNN (Eq. (5)). Analogue to its description in Sec. 2.3 as well as Definition 8, I stands for the number of input-neurons, J for the number of hidden-neurons and N for the number of output-neurons. x_t denotes the external inputs, s_t the inner states and y_t the outputs of the Neural Network. Matrix A corresponds to the weight-matrix between hidden- and hidden-neurons. f is an arbitrary activation function.

Lemma 1. The class RNN_3^{I,N}(f) (Def. 9) is a subset of the class NRNN^{I,N}(f) in the sense that every element of RNN_3^{I,N}(f) can be represented by an element of NRNN^{I,N}(f) with an arbitrary accuracy, given the assumption that the input x and the output y of the Networks can be scaled such that for an arbitrary ε > 0 we have

    (i)  ∥x_t − f(x_t)∥_∞ < ε   ∀ t = 1, . . . , T
    (ii) ∥y_t − f(y_t)∥_∞ < ε   ∀ t = 1, . . . , T.

Note that the assumption in Lemma 1 is not a major restriction, as for most sigmoidal activation functions like the hyperbolic tangent the condition is implicitly fulfilled for small values of x and y, because those functions are linear around the origin. Besides, in NRNN the output is anyway bounded by the activation function (Sec. 2.3). This generally also applies to the input, as in most applications one does not distinguish between input and output variables but simply considers observable data.1

Proof. Let I and N be arbitrary but fixed. We prove the lemma by showing that, by reconstructing the equations of the class NRNN^{I,N}(f), we can represent any element of RNN_3^{I,N}(f) with arbitrary matrices Ã, B̃, and C̃.

We start by defining the inner state vector s_t ∈ R^J in NRNN^{I,N}(f) as follows

    s_t := [s^y; s^h; s^x]_t                                       (15)

where s^x ∈ R^I represents the network's input-, s^h ∈ R^Q the hidden- and s^y ∈ R^N the output-part (respectively neurons) of s_t and I + Q + N = J. We further set the single transition matrix of NRNN^{I,N}(f)

    A := [0 C̃ 0; 0 Ã B̃; 0 0 0]   and the bias   θ := [0; θ̃; 0]

with Ã ∈ R^{Q×Q}, B̃ ∈ R^{Q×I}, C̃ ∈ R^{N×Q} and θ̃ ∈ R^Q. In doing so, Eq. (14) can be transformed into

    [s^y; s^h; s^x]_t = f([0 C̃ 0; 0 Ã B̃; 0 0 0] [s^y; s^h; s^x]_{t−1} + [0; 0; I_I] x_t − [0; θ̃; 0])
    y_t = [I_N 0 0] [s^y; s^h; s^x]_t.                             (16)

Note that, due to the construction of the output equation, we have y_t = s_t^y.

Writing Eq. (16) component-wise we derive

    s_t^y = f(C̃ s_{t−1}^h)
    s_t^h = f(Ã s_{t−1}^h + B̃ s_{t−1}^x − θ̃)                        (17)
    s_t^x = f(x_t).

According to our assumptions (i) and (ii) and y_t = f(C̃ s_{t−1}^h), we have for ε > 0

    ∥f(y_t) − f(C̃ s_{t−1}^h)∥_∞ < ε                                (18)

and

    ∥s_t^x − x_t∥_∞ < ε.                                           (19)

Out of Eq. (18) in conjunction with assumption (i) and f being a continuous sigmoid function then follows

    ∥y_t − C̃ s_{t−1}^h∥_∞ < ε.                                     (20)

Finally, for ε → 0, Eqs. (19) and (20) imply that Eqs. (17) are an element of RNN_3^{I,N}(f). Consequently, as the matrices Ã, B̃, and C̃ are arbitrary, we can represent any element of RNN_3^{I,N}(f) by one of NRNN^{I,N}(f).
RJ in NRNN I,N (f ) as follows
rem for NRNN). Let g(·) : RJ × RI → RJ be
 y
s measurable and h(·) : RJ → RN be continuous, the
st := sh 
 (15) external inputs xt ∈ RI , the inner states st ∈ RJ ,
sx t and the outputs yt ∈ RN (t = 1, . . . , T ). Then, any
open dynamical system of the form
where sx ∈ RI represents the network’s input-, sh ∈
RQ the hidden-, sy ∈ RN the output-part (respec- st+1 = g(st , xt )
tively neurons) of st and I + Q + N = J. We further yt+1 = h(st )
set the single transition matrix of NRNN I,N (f )
    can be approximated by an element of the function
0 C̃ 0 0 class NRNN I,N (f ) (Def. 11) with an arbitrary accu-
A := 0 Ã B̃  and the bias θ := θ̃  racy, where f is a continuous sigmoidal activation
0 0 0 0 function (Def. 3).
Proof. Follows directly out of Lemma 1 and Corollary 2.

7. Conclusion

In this article we gave a proof of the universal approximation ability of RNN in state space model form. After a short introduction into open dynamical systems, RNN in state space model form as well as ECNN and NRNN, we recalled the universal approximation theorem for Feedforward Neural Networks. Based on this result we proved that RNN in state space model form are able to approximate any open dynamical system with an arbitrary accuracy. We further extended this proof to ECNN and NRNN by showing that it also holds for these more elaborate Neural Networks.

In practice all three Recurrent Neural Networks are of high value as they are applied according to the respective problem conditions. This means that, as outlined, in case of incomplete knowledge of the environment we switch from RNN to ECNN, whereas for high dimensions we refer to NRNN.

The proof can be seen as a basis for future work on Recurrent Neural Networks as well as a justification for their use in many real-world applications. It also underlines the good results we achieved by applying RNN, ECNN and NRNN to various time-series problems.22,23

Further research is done on a constant enhancement of Recurrent Neural Networks for a more efficient use in different practical questions and problems. In this context it is important to note that for the application of the presented Neural Networks to real-world problems an adaptation of the model to the respective task is advantageous as it improves its quality.

References

1. H. G. Zimmermann, R. Grothmann, A. M. Schaefer and Ch. Tietz, Identification and forecasting of large dynamical systems by dynamical consistent neural networks, in S. Haykin, J. Principe, T. Sejnowski and J. McWhirter (eds.), New Directions in Statistical Signal Processing: From Systems to Brain, MIT Press (2006), pp. 203–242.
2. S. Haykin, Neural Networks: A Comprehensive Foundation (Macmillan, New York, 1994).
3. S. Haykin, J. Principe, T. Sejnowski and J. McWhirter, New Directions in Statistical Signal Processing: From Systems to Brain (MIT Press, Cambridge, MA, 2006).
4. J. F. Kolen and S. C. Kremer, A Field Guide to Dynamical Recurrent Networks, IEEE Press (2001).
5. L. R. Medsker and L. C. Jain, Recurrent neural networks: Design and application, Vol. 1, Comp. Intelligence (CRC Press International, 1999).
6. A. Soofi and L. Cao, Modeling and Forecasting Financial Data, Techniques of Nonlinear Dynamics, Kluwer Academic Publishers (2002).
7. H. G. Zimmermann and R. Neuneier, Neural network architectures for the modeling of dynamical systems, in J. F. Kolen and S. C. Kremer (eds.), A Field Guide to Dynamical Recurrent Networks, IEEE Press (2001), pp. 311–350.
8. D. E. Rumelhart, G. E. Hinton and R. J. Williams, Learning internal representations by error propagation, in D. E. Rumelhart and J. L. McClelland et al. (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, MIT Press, Cambridge, MA (1986), pp. 318–362.
9. P. J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, PhD thesis, Harvard University (1974).
10. H. G. Zimmermann, R. Neuneier and R. Grothmann, Modeling of dynamical systems by error correction neural networks, in A. Soofi and L. Cao (eds.), Modeling and Forecasting Financial Data, Techniques of Nonlinear Dynamics, Kluwer Academic Publishers (2002), pp. 237–263.
11. K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 359–366.
12. G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems, Springer, New York (1989), pp. 303–314.
13. K. I. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural Networks 2 (1989) 183–192.
14. B. Hammer, On the approximation capability of recurrent neural networks, in International Symposium on Neural Computation (1998).
15. T. Chen and H. Chen, Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and applications to dynamical systems, IEEE Transactions on Neural Networks 6(4) (1995) 911–917.
16. A. D. Back and T. Chen, Universal approximation of multiple nonlinear operators by neural networks, Neural Computation 14(11) (2002) 2561–2566.
17. J. Kilian and H. T. Siegelmann, The dynamic universality of sigmoidal neural networks, Information and Computation 128(1) (1996) 48–56.
18. W. Maass, T. Natschläger and H. Markram, Real-time computing without stable states: A new framework for neural computation based on perturbations, Neural Computation 14(11) (2002) 2531–2560.
19. E. Sontag, Neural nets as systems model and controllers, in Proceedings of the Seventh Yale Workshop on Adaptive and Learning Systems, Yale University (1992), pp. 73–79.
20. D. P. Mandic and J. A. Chambers, Recurrent neural networks for prediction: Learning algorithms, architectures and stability, in S. Haykin (ed.), Adaptive and Learning Systems for Signal Processing, Communications and Control, John Wiley & Sons, Chichester (2001).
21. R. Neuneier and H. G. Zimmermann, How to train neural networks, in G. B. Orr and K. R. Mueller (eds.), Neural Networks: Tricks of the Trade, Springer Verlag, Berlin (1998), pp. 373–423.
22. A. M. Schaefer, S. Udluft and H. G. Zimmermann, Learning long term dependencies with recurrent neural networks, in Proceedings of the International Conference on Artificial Neural Networks (ICANN-06), Athens (2006).
23. H. G. Zimmermann, L. Bertolini, R. Grothmann, A. M. Schaefer and Ch. Tietz, A technical trading indicator based on dynamical consistent neural networks, in Proceedings of the International Conference on Artificial Neural Networks (ICANN-06), Athens (2006).