
CHAPTER 2

Dynamic Neural Networks: Structures and Training Methods

2.1 ARTIFICIAL NEURAL NETWORK STRUCTURES

2.1.1 Generative Approach to Artificial Neural Network Design

2.1.1.1 The Structure of the Generative Approach

The generative approach is widely used in applied and computational mathematics. This approach, extended by the ideas of ANN modeling, is very promising as a flexible tool for the formation of dynamical system models.

The generative approach is interpreted further as follows. We treat the class of models which contains the desired (generated) dynamical system model as a collection of tools producing dynamical system models that satisfy some specified requirements. There are two main requirements for this set of tools. Firstly, it must generate a potentially rich class of models (i.e., it must provide extensive choice possibilities), and secondly, it should have as simple an "arrangement" as possible, so that the implementation of this class of models is not an "unbearable" problem. These two requirements, generally speaking, are mutually exclusive. How and by what tools to ensure an acceptable balance between them is discussed later in this section.

To generate any model, we need to have at our disposal:
• a basis, i.e., a set of elements from which models are formed;
• the rules used to form models by appropriately combining the elements of the basis:
  • rules for the structuring of models;
  • rules for the parametric adjustment of generated models.

One of the generative approach variants¹ is that the desired dependence y(x) is represented as a linear combination of the basis functions ϕi(x), i = 1, ..., n, i.e.,

  y(x) = ϕ0(x) + Σ_{i=1}^{n} λi ϕi(x),  λi ∈ R.  (2.1)

The set of functions {ϕi(x)}, i = 1, ..., n, we will call the functional basis (FB). The expression of the form (2.1) is a decomposition (expansion) of the function y(x) with respect to the functional basis {ϕi(x)}_{i=1}^{n}.

We will further consider the generation of the FB expansion by varying the adjustable parameters (the coefficients λi in the expansion (2.1)) as

¹ Examples of other variants are generative grammars from the theory of formal grammars and languages [1–3] and the syntactic approach to the description of patterns in the theory of pattern recognition [4–7].

Neural Network Modeling and Identification of Dynamical Systems
https://doi.org/10.1016/B978-0-12-815254-6.00012-5
Copyright © 2019 Elsevier Inc. All rights reserved.

tools to produce solutions (each particular combination of the λi provides some solution). The rule for combining FB elements in the case of (2.1) is a weighted summation of these items.

This technique is widely used in traditional mathematics. In the general form, the functional expansion can be represented as

  y(x) = ϕ0(x) + Σ_{i=1}^{n} λi ϕi(x),  λi ∈ R.  (2.2)

Here the basis is a set of functions {ϕi(x)}_{i=0}^{n}, and the rule for combining the elements of the basis is a weighted summation. The required expansion is a linear combination of the functions ϕi(x), i = 1, ..., n, as elements of the FB.

Here we present some examples of functional expansions often used in mathematical modeling.

Example 2.1. We have the Taylor series expansion, i.e.,

  F(x) = a0 + a1(x − x0) + a2(x − x0)^2 + ··· + an(x − x0)^n + ··· .  (2.3)

The basis of this expansion is {(x − x0)^i}_{i=0}^{∞}, and the rule for combining FB elements is a weighted summation.

Example 2.2. We have the Fourier series expansion, i.e.,

  F(x) = Σ_{i=0}^{∞} (ai cos(ix) + bi sin(ix)).  (2.4)

The basis of this expansion is {cos(ix), sin(ix)}_{i=0}^{∞}, and the rule for combining FB elements is a weighted summation.

Example 2.3. We have the Galerkin expansion, i.e.,

  y(x) = u0(x) + Σ_{i=1}^{n} ci ui(x).  (2.5)

The basis of this expansion is {ui(x)}_{i=0}^{n}, and the rule for combining FB elements is a weighted summation.

In all these examples, the generated solutions are represented by linear combinations of basis elements, parametrized by the corresponding weights associated with each FB element.

FIGURE 2.1 Functional dependence on one variable as (A) a linear and (B) a nonlinear combination of the FB elements fi(x), i = 1, ..., n. From [109], used with permission from Moscow Aviation Institute.

2.1.1.2 Network Representation of Functional Expansions

We can give a network interpretation for functional expansions, which allows us to identify similarities and differences between their variants. Such a description also provides a simple transition to ANN models and allows us to establish interrelations between traditional-type models and ANN models.

The structural representation of the functional dependence on one variable as a linear and a nonlinear combination of the elements of the basis fi(x), i = 1, ..., n, is shown in Fig. 2.1A and Fig. 2.1B, respectively.
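The parametric adjustment rule behind expansions of the form (2.1)–(2.5) can be illustrated with a short sketch. The functional basis {ϕi(x)} and the target dependence below are illustrative choices, not taken from the text; the weights λi are tuned by gradient descent on the squared prediction error.

```python
import math

# Hypothetical functional basis {phi_i(x)}: illustrative choices.
phi = [lambda x: x, lambda x: x * x, lambda x: math.sin(x)]

def expansion(lam, x):
    # y(x) = sum_i lam_i * phi_i(x); the term phi_0(x) is taken as 0 here
    return sum(l * f(x) for l, f in zip(lam, phi))

# Samples of the dependence to reproduce: y(x) = 2x - x^2 + 0.5 sin(x)
xs = [i * 0.1 for i in range(-20, 21)]
ys = [2 * x - x * x + 0.5 * math.sin(x) for x in xs]

# Parametric adjustment: tune the weights lam_i by gradient descent
# on the squared error between the expansion and the samples.
lam = [0.0, 0.0, 0.0]
for _ in range(2000):
    for x, y in zip(xs, ys):
        err = expansion(lam, x) - y
        for i, f in enumerate(phi):
            lam[i] -= 0.01 * err * f(x)

print([round(l, 3) for l in lam])  # converges toward [2.0, -1.0, 0.5]
```

Since the target lies exactly in the span of the basis, the adjusted weights recover the generating coefficients; with a basis that does not contain the target, the same rule yields the best attainable approximation.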

FIGURE 2.2 Scalar-valued functional dependence on several variables as (A) a linear and (B) a nonlinear combination of the elements of the basis fi(x1, ..., xn), i = 1, ..., N. From [109], used with permission from Moscow Aviation Institute.

FIGURE 2.3 Vector-valued functional dependence on several variables as a linear combination of the elements of the basis fi(x1, ..., xn), i = 1, ..., N. From [109], used with permission from Moscow Aviation Institute.

Similarly, for a scalar-valued functional dependence on several variables as a linear and a nonlinear combination of elements of the basis fi(x1, ..., xn), i = 1, ..., N, the structural representation is given, respectively, in Fig. 2.2A and Fig. 2.2B.

Vector-valued functional dependence on several variables as a linear combination of the elements of the basis fi(x1, ..., xn), i = 1, ..., N, in the network representation is shown in Fig. 2.3. The nonlinear combination is represented in a similar way; namely, we use nonlinear combining rules ϕi(f1(x), ..., fm(x)), i = 1, ..., m, x = (x1, ..., xm), instead of the linear one Σ_{i=1}^{N} (·).

We have written the traditional functional expansions mentioned above in general form as

  y(x) = F(x1, x2, ..., xn) = Σ_{i=0}^{m} λi ϕi(x1, x2, ..., xn).  (2.6)

Here, the function F(x1, x2, ..., xn) is a (linear) combination of the elements of the basis ϕi(x1, x2, ..., xn).

An expansion of the form (2.6) has the following features:
• the resulting decomposition is one-level;
• the functions ϕi: R^n → R as elements of the basis have limited flexibility (with variability of such types as displacement, compression/stretching) or are fixed.

Such limited flexibility of the traditional functional basis, together with the one-level nature of the expansion, sharply reduces the possibility of obtaining some "right" model.²

² At the intuitive level, a "right model" is a model with generalizing properties that are adequate to the application problem that we solve; see also Section 1.3 of Chapter 1.

2.1.1.3 Multilevel Adjustable Functional Expansions

As noted in the previous section, the possibility of obtaining a "right" model is limited by the single-level structure and the inflexible basis of traditional expansions. For this reason, it is quite natural to build a model that overcomes these shortcomings. It must have the required level of flexibility (and the needed variability in the gen-

FIGURE 2.4 Multilevel adjustable functional expansion. From [109], used with permission from Moscow Aviation Institute.

erated variants of the required model), obtained by forming it as a multilevel network structure and by appropriate parametrization of the elements of this structure.

Fig. 2.4 shows how we can construct a multilevel adjustable functional expansion. We see that in this case the adjustment of the expansion is carried out not only by varying the coefficients of the linear combination, as in expansions of the type (2.6). Now the elements of the functional basis are also parametrized. Therefore, in the process of solving the problem, the basis is adjusted to obtain a dynamical system model which is acceptable in the sense of the criterion (1.30).

As we can see from Fig. 2.4, the transition from a single-level decomposition to a multilevel one consists in the fact that each element ϕj(v, wϕ), j = 1, ..., M, is decomposed using some functional basis {ψk(x, wψ)}, k = 1, ..., K. Similarly, we can construct the expansion of the elements ψk(x, wψ) for another FB, and so on, the required number of times. This approach gives us a network structure with the required number of levels, as well as the required parametrization of the FB elements.

2.1.1.4 Functional and Neural Networks

Thus, we can interpret the model as an expansion on the functional basis (2.6), where each element ϕi(x1, x2, ..., xn) transforms the n-dimensional input x = (x1, x2, ..., xn) into the scalar output y.

We can distinguish the following types of elements of the functional basis:
• the FB element as an integrated (one-stage) mapping ϕi: R^n → R that directly transforms some n-dimensional input x = (x1, x2, ..., xn) to the scalar output y;
• the FB element as a compositional (two-stage) mapping of the n-dimensional input x = (x1, x2, ..., xn) to the scalar output y.

In the two-stage (compositional) version, the mapping R^n → R is performed in the first stage, "compressing" the vector input x = (x1, x2, ..., xn) to the intermediate scalar output v, which in the second stage is additionally processed by the output mapping R → R to obtain the output y (Fig. 2.5).

Depending on which of these FB elements are used in the formation of network models (NMs), the following basic variants of these models are obtained:

FIGURE 2.5 An element of an FB that transforms the n-dimensional input x = (x1, x2, ..., xn) into a scalar output y. (A) A one-stage mapping R^n → R. (B) A two-stage (compositional) mapping R^n → R → R. From [109], used with permission from Moscow Aviation Institute.

• The one-stage mapping R^n → R is an element of functional networks.
• The two-stage mapping R^n → R → R is an element of neural networks.

The element of compositional type, i.e., the two-stage mapping of the n-dimensional input to the scalar output, is a neuron; it is specific to functional expansions of the neural network type and is a "brand feature" of such expansions, in other words, of ANN models of all kinds.

2.1.2 Layered Structure of Neural Network Models

2.1.2.1 Principles of Layered Structural Organization for Neural Network Models

We assume that the ANN models in the general case have a layered structure. This assumption means that we divide the entire set of neurons constituting the ANN model into disjoint subsets, which we will call layers. For these layers we introduce the notation L(0), L(1), ..., L(p), ..., L(NL).

The layered organization of the ANN model determines the activation logic of its neurons. This logic will be different for different structural variants of the network. The following specificity takes place in the operation of the layered ANN model³: the neurons of which the ANN model consists operate layer by layer, i.e., until all the neurons of the pth layer have worked, the neurons of the (p + 1)th layer do not come into operation. We will consider below the general variant that defines the rules for activating neurons in ANN models.

In the simplest variant of the structural organization of layered networks, all the layers L(p), numbered from 0 to NL, are activated in the order of their numbers. This variant means that until all the neurons in the layer with the number p have worked, the neurons from the (p + 1)th layer are waiting. In turn, the pth layer can start operating only if all the neurons of the (p − 1)th layer have already worked.

Visually, we can represent such a structure as a "stack of layers," ordered by their numbers. In the simplest version, this "stack" looks as shown in Fig. 2.6A. Here the layer L(0) is the input layer, the elements of which are components of the ANN input vector.

Any layer L(p), 1 ≤ p < NL, is connected with two adjacent layers: it gets its inputs from the previous layer L(p−1), and it transfers its outputs to the subsequent layer L(p+1). The exception is the layer L(NL), the last one in the ANN (the output layer), which does not have a layer following it. The outputs of the layer L(NL) are the outputs of the network as a whole. The layers L(p) with numbers 0 < p < NL are called hidden.

Since the ANN shown in Fig. 2.6A is a feedforward network, all links between its layers go strictly sequentially from the layer L(0) to the layer L(NL), without "hopping" (bypassing) over adjacent layers and without backward (feedback) links. A more complicated ANN structure version, with bypass connections, is shown in Fig. 2.6B.

³ For the case when the layers operate in the order of their numbers and there are no feedbacks between the layers. In this case, the layers will operate sequentially and only once.
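The multilevel adjustable expansion of Section 2.1.1.3, in which the basis elements are themselves parametrized (and the two-stage neuron elements of Section 2.1.1.4 are of exactly this kind), can be sketched as follows. The tanh basis element and all numeric values are illustrative assumptions, not material from the text.

```python
import math

def psi(x, w, b):
    # a parametrized basis element psi(x, w_psi) of the second level;
    # tanh is an illustrative choice
    return math.tanh(w * x + b)

def multilevel_expansion(x, lam, w, b):
    # y(x) = sum_j lam_j * psi(x; w_j, b_j): adjusting (w_j, b_j)
    # reshapes the basis itself, unlike the fixed bases in (2.1)-(2.5)
    return sum(l * psi(x, wj, bj) for l, wj, bj in zip(lam, w, b))

# both the outer coefficients lam and the inner parameters (w, b)
# are adjustable in such a model
lam, w, b = [0.5, -1.0], [1.0, 2.0], [0.0, -0.5]
y = multilevel_expansion(0.3, lam, w, b)
```

Training such a model means adjusting the outer coefficients and the inner basis parameters jointly, which is precisely what distinguishes it from the single-level expansions with a fixed basis.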

FIGURE 2.6 Variants of the structural organization for layered neural networks with sequential numbering of layers (feedforward networks). (A) Without bypass connections. (B) With bypass connections (q > p + 1). From [109], used with permission from Moscow Aviation Institute.

We also assume for networks of this type that any pair of neurons between which there is a connection belongs to different layers. In other words, neurons within any of the processing layers L(p), p = 1, ..., NL, have no connections with each other. Variants in which such relationships, called lateral ones, are present in neural networks require separate consideration.

We can complicate the structure of the connections of the layered network in comparison with the scheme shown in Fig. 2.6.

The first of the possible variants of such complication is the insertion of feedback into the ANN structure. This feedback transfers the received output of the network (i.e., the output of the layer L(NL)) "back" to the input of the ANN. More precisely, we move the network output to the input of its first processing layer L(1), as shown in Fig. 2.7A.

In Fig. 2.7B another way of introducing feedback into a layered network is shown, in which the feedback goes from the output layer L(NL) to an arbitrary layer L(p), 1 < p < NL. This variant can also be treated as a composition (serial connection) of a feedforward network (layers L(1), ..., L(p−1)) and a feedback network of the type shown in Fig. 2.7A (layers L(p), ..., L(NL)).

FIGURE 2.7 Variants of the structural organization for layered neural networks with sequential numbering of layers. (A) A network with a feedback from the output layer L(NL) to the first processing layer L(1). (B) A network with feedback from the output layer L(NL) to an arbitrary layer L(p), 1 < p < NL. (C) A network with feedback from the layer L(q), 1 < q < NL, to the layer L(p), 1 < p < NL. (D) An example of a network with feedback from the layer L(q), 1 < q < NL, to the layer L(p), 1 < p < NL, and a bypass connection from the layer L(p−1) to the layer L(q+1). From [109], used with permission from Moscow Aviation Institute.

The most general way of introducing feedback into a "stack of layers"–type structure is shown in Fig. 2.7C. Here the feedback comes from some hidden layer L(q), 1 < q < NL, to the layer L(p), 1 ≤ p < NL, q > p. Similar to the case shown in Fig. 2.7A, this variant can be treated as a serial connection of a feedforward neural network (layers L(1), ..., L(p−1)), a network with feedback (layers L(p), ..., L(q)), and another feedforward network (layers L(q+1), ..., L(NL)). The operation of such a network can, for example, be interpreted as follows. The recurrent subnet (the layers L(p), ..., L(q)) is the main part of the ANN as a whole. One feedforward subnet (layers L(1), ..., L(p−1)) preprocesses the data entering the main subnet, while the second subnet (layers L(q+1), ..., L(NL)) performs some postprocessing of the data produced by the main recurrent subnet.

Fig. 2.7D shows an example of a generalization of the structure shown in Fig. 2.7C, for the case in which, in addition to strictly consecutive connections between the layers of the network, there are also bypass connections.

In all the ANN variants shown in Fig. 2.6, the strict sequence of layers is preserved unchanged. The layers are activated one after the other in the order specified by the forward and backward connections in the considered ANN. For a feedforward network, this means that any neuron from the layer L(p) receives its inputs only from neurons of the layer L(p−1) and passes its outputs to the layer L(p+1), i.e.,

  L(p−1) → L(p) → L(p+1),  p ∈ {0, 1, ..., NL}.  (2.7)

Two or more layers cannot be executed ("fired") at the same time (simultaneously), even if there is such a technical capability (e.g., the network is executed on some parallel computing system), due to the sequential operation logic of the ANN layers noted above.

The use of feedback introduces cyclicity into the order of operation of the layers. We can implement this cyclicity for all layers, beginning with L(1) and up to L(NL), or for some of them, for some range of numbers p1 ≤ p ≤ p2. The implementation depends on which layers of the ANN we cover by feedback. However, in any case, some strict sequence of operation of the layers is preserved. If one of the ANN layers has started its work, then, until this work is completed, no other layer will be launched for processing.

The rejection of this kind of strict firing sequence for the ANN layers leads to the appearance of parallelism in the network at the level of its layers. In the most general case, we allow any neuron from the layer L(p) and any neuron from the layer L(q) to establish a connection of any type. Namely, we allow forward and backward connections (in these cases p ≠ q) or lateral connections (in this case p = q). Here, for the time being, it is still assumed that a layered organization like the "stack of layers" is used.

The variants of the ANN structural organization shown in Fig. 2.7 use the same "stack of layers" scheme for ordering the layers of the network. Here, at each time interval, the neurons of only one layer work. The remaining layers either have already worked or are waiting for their turn. This approach applies to both feedforward networks and recurrent networks.

The following variant allows us to abandon the "stack of layers" scheme and to replace it with more complex structures. As an example illustrating structures of this kind, we show in Fig. 2.8 two variants of the structures of an ANN with parallelism in them at the layer level.⁴

Consider the schemes shown in Fig. 2.7 and Fig. 2.8. Obviously, to activate a neuron from some pth layer, it must first get the values of all its inputs; it "waits" until that moment. For parallelizing the work of neurons, we must meet the same conditions. Namely, all neurons that have a complete set of inputs at a given mo-

⁴ If we abandon the "stack of layers" scheme, some layers in the ANN can work in parallel, i.e., simultaneously with each other, if there is such a technical possibility.

FIGURE 2.8 An example of a structural organization for a layered neural network with layer-level parallelism. (A) Feedforward ANN. (B) ANN with feedbacks.

ment of time can operate independently from each other, in an arbitrary order or in parallel, if there is such a technical capability.

Suppose we have an ANN organized according to the "stack of layers" scheme. The logic of neuron activation (i.e., the sequence and conditions of neuron operation) in this ANN ensures the absence of conflicts between the neurons. If we introduce parallelism at the layer level in the ANN, we need to add some additional synchronization rules to provide such conflict-free network operation.

Namely, a neuron can work as soon as it is ready to operate, and it will be ready as soon as it receives the values of all its inputs. Once the neuron is ready for functioning, we should start it immediately, as soon as it becomes possible. This is significant because the outputs of this neuron are required to ensure the operational readiness of the neurons that follow.

For a particular ANN, it is possible to specify (to generate) a set of cause-and-effect relations (chains) that provide the ability to monitor the operational conditions for different neurons in order to prevent conflicts between them.

For layered feedforward networks with the structures shown in Fig. 2.7, the cause-and-effect chains will have a strictly linear structure, without branches and cycles. In structures with parallelism at the layer level, as in the networks shown in Fig. 2.8, both forward "jumps" and feedbacks can be present. Such structures bring nonlinearity to the cause-and-effect chains; in particular, they produce tree structures and cycles.

The cause-and-effect chain should show which neurons transmit signals to some analyzed neuron. In other words, it is required to show which neuron predecessors should work so that the given neuron receives a complete set of input values. As noted above, this is a necessary condition for the readiness of the given neuron to operate. This condition is the causal part of the chain. Also, the chain indicates which neurons will get the output of this "current neuron." This indication is the "effect" part of the cause-and-effect chain.

In all the considered variants of the ANN structural organization, only forward and backward links were contained, i.e., connections between pairs of neurons in which the neurons from the pair belong to different layers.

The third kind of connection that is possible between neurons in the ANN is lateral connections, in which the two neurons, between which the connection is established, belong to

the same layer. One example of an ANN with lateral connections is the Recurrent MultiLayer Perceptron (RMLP) network [8–10].
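The cause-and-effect chains discussed above can be made concrete with a small sketch: from a map of neuron predecessors (a hypothetical toy graph, not one from the text), we derive a conflict-free firing order in which a neuron fires only once all of its predecessors have fired.

```python
# Hypothetical toy graph: neuron -> list of its predecessors
# (the "cause" part of each neuron's cause-and-effect chain).
preds = {"n1": [], "n2": [], "n3": ["n1", "n2"], "n4": ["n1"],
         "n5": ["n3", "n4"]}

def firing_order(preds):
    """Fire each neuron once, only after all its predecessors have fired."""
    fired, order = set(), []
    while len(fired) < len(preds):
        # all neurons with a complete set of inputs are ready now; they
        # could run in parallel -- here we simply fire them in name order
        ready = [n for n in sorted(preds) if n not in fired
                 and all(p in fired for p in preds[n])]
        if not ready:
            raise ValueError("cyclic dependency: no neuron can become ready")
        for n in ready:
            fired.add(n)
            order.append(n)
    return order

print(firing_order(preds))  # ['n1', 'n2', 'n3', 'n4', 'n5']
```

Each `ready` wave corresponds to a set of neurons that could operate simultaneously; a cycle in the graph (a feedback without a delay element) makes some neurons never ready, which is why recurrent structures need delay elements to break such cycles.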

2.1.2.2 Examples of Layered Structural Organization for Neural Network Models

Examples of structural organization options for static-type ANN models (i.e., without TDL elements and/or feedbacks) are shown in Fig. 2.9. The ADALINE network [11] is a single-layer (i.e., without hidden layers) linear ANN model. Its structure is shown in Fig. 2.9A. A more general variant of feedforward neural network (FFNN) is the MLP (MultiLayer Perceptron) [10,11], which is a nonlinear network with one or more hidden layers (Fig. 2.9B).

FIGURE 2.9 Examples of a structural organization for feedforward neural networks. (A) ADALINE (Adaptive Linear Network). (B) MLP (MultiLayer Perceptron). Din are source (input) data; Dout are output data (results); L(0) is the input layer; L(1) is the output layer.

Dynamic networks can be divided into two classes [12–19]:
• feedforward networks, in which the input signals are fed through delay lines (TDL elements);
• recurrent networks, in which feedbacks exist, and there may also be TDL elements at the inputs of the network.

Examples of the structural organization of ANN models of the dynamic type of the first class (i.e., with TDL elements at the network inputs, but without feedbacks) are shown in Fig. 2.10.

Typical variants of ANN models of this type are the Time Delay Neural Network (TDNN) [10,20–27], whose structure is shown in Fig. 2.10A (the Focused Time Delay Neural Network [FTDNN] is organized similarly in structural terms), as well as the Distributed Time Delay Neural Network (DTDNN) [28] (see Fig. 2.10B).

FIGURE 2.10 Examples of a structural organization for feedforward dynamic neural networks. (A) TDNN (Time Delay Neural Network). (B) DTDNN (Distributed Time Delay Neural Network). Din are source (input) data; Dout are output data (results); L(0) is the input layer; L(1) is a hidden layer; L(2) is the output layer; TDL1^(n) and TDL2^(m) are tapped delay lines (TDLs) of order n and m, respectively.

Examples of the structural organization of dynamic ANN models of the second kind, that is, of recurrent neural networks (RNN), are shown in Figs. 2.11–2.13.

Classical examples of recurrent networks, from which, to a large extent, this area of research began to develop, are the Jordan network [14,15] (Fig. 2.11A), the Elman network [10,29–32] (Fig. 2.11B), the Hopfield network [10,11] (Fig. 2.12A), and the Hamming network [11,28] (Fig. 2.12B).

FIGURE 2.11 Examples of a structural organization for recurrent dynamic neural networks. (A) Jordan network. (B) Elman network. Din are source (input) data; Dout are output data (results); L(0) is the input layer; L(1) is a hidden layer; L(2) is the output layer; TDL^(1) is a tapped delay line (TDL) of order 1.

FIGURE 2.12 Examples of a structural organization for recurrent dynamic neural networks. (A) Hopfield network. (B) Hamming network. Din are source (input) data; Dout are output data (results); L(0) is the input layer; L(1) is a hidden layer; L(2) is the output layer; TDL^(1) is a tapped delay line (TDL) of order 1.
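The TDL elements that appear throughout these dynamic network structures can be sketched as a fixed-length buffer of the most recent input samples. The implementation below is an illustrative sketch, not code from the text.

```python
from collections import deque

class TDL:
    """A tapped delay line of a given order: keeps the most recent samples
    and exposes them as the delayed taps fed to the network's input layer."""

    def __init__(self, order, fill=0.0):
        # each D (delay/memory) element initially holds the fill value
        self.taps = deque([fill] * order, maxlen=order)

    def push(self, x):
        # shift in the newest sample; the oldest one falls off the end
        self.taps.appendleft(x)
        return list(self.taps)  # [x(k), x(k-1), ..., x(k-n+1)]

tdl = TDL(order=3)
tdl.push(1.0)
tdl.push(2.0)
print(tdl.push(3.0))  # [3.0, 2.0, 1.0]
```

A TDL of order 1, as in the Jordan and Elman networks above, degenerates to a single delay element holding the previous output or hidden state.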

In Fig. 2.13A the ANN model Nonlinear AutoRegression with eXternal inputs (NARX) [33–41] is shown, which is widely used in modeling and control tasks for dynamical systems. The same structural organization is shared by a variant of this network, expanded by the composition of the parameters considered. This is the ANN model Nonlinear AutoRegression with Moving Average and eXternal inputs (NARMAX) [42,43].

In Fig. 2.13B we can see an example of an ANN model with the Layered Digital Dynamic Network (LDDN) structure [11,28]. Networks with a structure of this type can have practically any topology of forward and backward connections; that is, in a certain sense, this structural organization of the neural network is the most general.

The set of Figs. 2.14–2.17 allows us to specify the structural organization of the layers of the ANN model: the input layer (Fig. 2.14) and the working (hidden and output) layers (Fig. 2.15). In Fig. 2.16 the structure of the TDL element is presented, and in Fig. 2.17 the structure of the neuron, as the main element of the working layers of the ANN model, is shown.

One of the most popular static neural network architectures is the Layered Feedforward

FIGURE 2.13 Examples of a structural organization for recurrent dynamic neural networks. (A) NARX (Nonlinear AutoRegression with eXternal inputs). (B) LDDN (Layered Digital Dynamic Network). Din are source (input) data; Dout are output data (results); L(0) is the input layer; L(1) is a hidden layer; L(2) is the output layer for the NARX network and a hidden layer for the LDDN; L(3) is the output layer for the LDDN; TDL1^(m), TDL2^(m), TDL1^(n1), TDL1^(n2) are tapped delay lines of order m, m, n1, and n2, respectively.

FIGURE 2.14 ANN input layer as a data structure. (A) One-dimensional array. (B) Two-dimensional array. si^(0), sij^(0) are numeric or character variables.

FIGURE 2.15 General structure of the operational (hidden and output) ANN layers: si^(p) is the ith neuron of the pth ANN layer; W(L(p)) is the matrix of synaptic weights for the connections entering the neurons of the layer L(p).

Neural Network (LFNN). We introduce the following notation: L ∈ N is the total number of layers; S^l ∈ N is the number of neurons within the lth layer; S^0 ∈ N is the number of network inputs; and a_i^0 ∈ R is the value of the ith input. For each ith neuron of the lth layer we denote the following: n_i^l is the weighted sum of the neuron inputs; ϕ_i^l: R → R is the activation function; and a_i^l ∈ R is the output of the activation function (the neuron state). The outputs a_i^L of the activation functions of the Lth layer neurons are called the network outputs. Also, W ∈ R^{n_w} is the total vector of network parameters, which consists of the biases b_i^l ∈ R and the connection weights w_{i,j}^l ∈ R. Thus, the layered feedforward neural network is a parametric function family, mapping the network inputs a^0 and parameters W to the outputs a^L according

to the following equations:

  n_i^l = b_i^l + Σ_{j=1}^{S^{l−1}} w_{i,j}^l a_j^{l−1},
  a_i^l = ϕ_i^l(n_i^l),
  l = 1, ..., L, i = 1, ..., S^l.  (2.8)

The Lth layer is called the output layer, while all the rest are called the hidden layers, since they are not directly connected to the network outputs. Common examples of activation functions for the hidden layer neurons are the hyperbolic tangent function,

  ϕ_i^l(n_i^l) = th(n_i^l) = (e^{n_i^l} − e^{−n_i^l}) / (e^{n_i^l} + e^{−n_i^l}),
  l = 1, ..., L − 1, i = 1, ..., S^l,  (2.9)

and the logistic sigmoid function,

  ϕ_i^l(n_i^l) = logsig(n_i^l) = 1 / (1 + e^{−n_i^l}),
  l = 1, ..., L − 1, i = 1, ..., S^l.  (2.10)

The hyperbolic tangent is more suitable for function approximation problems, since it has a symmetric range [−1, 1]. On the other hand, the logistic sigmoid is frequently used for classification problems, due to its range [0, 1]. Identity functions are frequently used as activation functions for the output layer neurons, i.e.,

  ϕ_i^L(n_i^L) = n_i^L,  i = 1, ..., S^L.  (2.11)

FIGURE 2.16 Tapped delay lines (TDLs) as ANN structural elements. (A) TDL of order n. (B) TDL of order 1. D is a delay (memory) element.

FIGURE 2.17 General structure of a neuron within the operational (hidden and output) ANN layers. ϕ(x, w) is the input mapping R^n → R^1 (aggregating mapping), parametrized with the synaptic weights w; ψ(v) is the output mapping R^1 → R^1 (activation function); x = (x1, ..., xn) are the neuron inputs; w = (w1, ..., wn) are the synaptic weights; v is the output of the aggregating mapping; y is the output of the neuron.

2.1.2.3 Input–Output and State Space ANN-Based Models for Deterministic Nonlinear Controlled Discrete Time Dynamical Systems

Nonlinear AutoRegressive network with eXogeneous inputs [44] (NARX). One popular class of models for deterministic nonlinear controlled discrete time dynamical systems is the class of input–output nonlinear autoregressive neural network–based models, i.e.,

  ŷ(t_k) = F(ŷ(t_{k−1}), ..., ŷ(t_{k−l_y}), u(t_{k−1}), ..., u(t_{k−l_u}), W),  k ≥ max(l_u, l_y),  (2.12)

where F(·, W) is a static neural network, and l_u and l_y are the numbers of past controls and past outputs used for prediction. (See Fig. 2.18.)
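The layer equations (2.8), with tanh hidden activations (2.9) and identity output activations (2.11), can be sketched directly. The 2–2–1 architecture and the weight values below are illustrative, not trained parameters.

```python
import math

def forward(a0, weights, biases):
    """Forward pass of an LFNN: weights[l] is the matrix W^(l+1),
    biases[l] the bias vector b^(l+1)."""
    a = a0
    L = len(weights)
    for l in range(L):
        # eq. (2.8): n_i^l = b_i^l + sum_j w_ij^l * a_j^(l-1)
        n = [b + sum(w * aj for w, aj in zip(row, a))
             for row, b in zip(weights[l], biases[l])]
        if l < L - 1:
            a = [math.tanh(v) for v in n]   # hidden layers, eq. (2.9)
        else:
            a = n                           # identity output, eq. (2.11)
    return a

weights = [[[0.5, -0.5], [1.0, 1.0]],      # layer 1: 2 neurons x 2 inputs
           [[1.0, -1.0]]]                  # layer 2: 1 neuron x 2 inputs
biases = [[0.0, 0.1], [0.2]]
y = forward([1.0, 2.0], weights, biases)
```

The total parameter vector W of the text collects all the entries of `weights` and `biases`; training adjusts them jointly.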

FIGURE 2.18 Nonlinear AutoRegressive network with eXogeneous inputs.

The input–output modeling approach has serious drawbacks: first, the minimum time window size required to achieve the desired accuracy is not known beforehand; second, in order to learn the long-term dependencies one might need an arbitrarily large time window; third, if a dynamical system is nonstationary, the optimal time window size might change over time.
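One prediction step of the NARX model (2.12) can be sketched as follows. The linear map standing in for the static network F, and all numeric values, are illustrative assumptions; in practice F is a trained feedforward network.

```python
def narx_step(y_past, u_past, W):
    # regressor of eq. (2.12): [y(k-1),...,y(k-ly), u(k-1),...,u(k-lu)]
    regressor = list(y_past) + list(u_past)
    # a linear stand-in for the static network F(regressor, W)
    return sum(w * v for w, v in zip(W, regressor))

W = [0.6, 0.2, 0.1, 0.05]        # ly = 2 past outputs, lu = 2 past controls
y_past = [0.0, 0.0]              # [y(k-1), y(k-2)]
u_past = [1.0, 1.0]              # [u(k-1), u(k-2)]

y_next = narx_step(y_past, u_past, W)
y_past = [y_next] + y_past[:-1]  # shift the output delay line for step k+1
```

The explicit time window (`y_past`, `u_past`) is exactly the source of the drawbacks listed above: its size must be fixed in advance, and dependencies longer than the window cannot be captured.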
Recurrent neural network. An alternative class of models for deterministic nonlinear controlled discrete time dynamical systems is the class of state-space neural network–based models, usually referred to as recurrent neural networks, i.e.,

  z(t_{k+1}) = F(z(t_k), u(t_k), W),
  ŷ(t_k) = G(z(t_k), W),  (2.13)

where z(t_k) ∈ R^{n_z} are the state variables (also called the context units), ŷ(t_k) ∈ R^{n_y} are the predicted outputs, W ∈ R^{n_w} is the model parameter vector, and F(·, W) and G(·, W) are static neural networks. (See Fig. 2.19.) One particular case of a state-space recurrent neural network (2.13) is the Elman network [30]. In general, the optimal number of state variables n_z is unknown. Usually, one simply selects n_z large enough to be able to represent the unknown dynamical system with the required accuracy.

FIGURE 2.19 Recurrent neural network in state space.

2.1.3 Neurons as Elements From Which the ANN Is Formed

The set L of all elements (neurons) included in the ANN is divided into subsets (layers), i.e.,

  L(0), L(1), ..., L(p), ..., L(NL),  (2.14)

or, in a more concise notation,

  L(p), p = 0, 1, ..., NL;
  L(p), L(q), L(r), p, q, r ∈ {0, 1, ..., NL},  (2.15)

FIGURE 2.20 Neuron as a module converting n-dimensional input vector into m-dimensional output vector. From [109], used with permission from Moscow Aviation Institute.

where NL is the number of layers into which the set of ANN elements is divided; p, q, r are the indices used to number the arbitrary ("current") ANN layer.

In the list (2.14) L(0) is the input (zero) layer, the purpose of which is to "distribute" the input data to the neuron elements, which perform the primary data processing. Layers L(1), . . . , L(NL) ensure the processing of the inputs of the ANN into the outputs.

Suppose that in the ANN there are NL layers L(p), p = 0, 1, . . . , NL. The layer L(p) has N_L^(p) of the neuron elements S_j^(p), i.e.,

    L(p) = {S_j^(p)},  j = 1, . . . , N_L^(p).    (2.16)

The element S_j^(p) has N_j^(p) inputs x_{i,j}^(r,p) and M_j^(p) outputs x_{j,k}^(p,q). The connections of the element S_j^(p) with other elements of the network can be represented as a set of tuples showing where the outputs of the element S_j^(p) are transferred.

Thus, a single neuron as a module of the ANN (Fig. 2.20) is a mapping of the n-dimensional input vector x^(in) = (x_1^(in), . . . , x_n^(in)) into the m-dimensional output vector x^(out) = (x_1^(out), . . . , x_m^(out)), i.e., x^(out) = Φ(x^(in)). The mapping Φ is formed as a composition of the following primitive mappings (Fig. 2.21):

1) set of input mappings f_i(x_i^(in)):

    f_i : R → R;  u_i = f_i(x_i^(in)),  i = 1, . . . , n;    (2.17)

2) aggregating mapping ("input star") ϕ(u_1, . . . , u_n):

    ϕ : R^n → R;  v = ϕ(u_1, . . . , u_n);    (2.18)

3) converter (activation function) ψ(v):

    ψ : R → R;  y = ψ(v);    (2.19)

4) output mapping ("output star") E^(m):

    E^(m) : R → R^m;  E^(m)(y) = {x_j^(out)},  j = 1, . . . , m,
    x_j^(out) = y,  ∀j ∈ {1, . . . , m}.    (2.20)

FIGURE 2.21 The primitive mappings of which the neuron consists. From [109], used with permission from Moscow Aviation Institute.

The relations (2.20) are interpreted as follows: mapping E^(m)(y) generates as a result an m-element ordered set {x_j^(out)}, each element of which takes the value x_j^(out) = y.

The Φ map is formed as a composition of the mappings {f_i}, ψ, ϕ, and E^(m) (Fig. 2.22), i.e.,

    x^(out) = Φ(x^(in)) = E^(m)(ψ(ϕ(f_1(x_1^(in)), . . . , f_n(x_n^(in))))).    (2.21)

FIGURE 2.22 Structure of the neuron. I – input vector; II – input mappings; III – aggregating mapping; IV – converter; V – the output mapping; VI – output vector. From [109], used with permission from Moscow Aviation Institute.

The interaction of primitive mappings forming a neuron is shown in Fig. 2.23.

FIGURE 2.23 The sequence of transformations (primitive mappings) realized by the neuron. I – input vector; II – input mappings; III – aggregating mapping; IV – converter (activation function); V – the output mapping; VI – output vector. From [109], used with permission from Moscow Aviation Institute.

2.1.4 Structural Organization of a Neuron

A separate neuron element S_j^(p) of the neural network structure (i.e., the jth neuron from the pth layer) is an ordered pair of the form

    S_j^(p) = ⟨Φ_j^(p), R_j^(p)⟩,    (2.22)

where Φ_j^(p) is the transformation of the input vector of dimension N_j^(p) into the output vector of dimension M_j^(p); R_j^(p) is the connection of the output of the element S_j^(p) with other neurons of the considered ANN (with neurons from other layers, they are direct and inverse relations; with neurons from the same layer, they are lateral connections).

The transformation Φ_j^(p)(x_{i,j}^(r,p)) is the composition of the primitives from which the neuron consists, i.e.,

    Φ_j^(p)(x_{i,j}^(r,p)) = E^(m)(ψ(ϕ(f_{i,j}^(r,p)(x_{i,j}^(r,p))))).    (2.23)

The connections R_j^(p) of the neuron S_j^(p) are the set of ordered pairs showing where the outputs

FIGURE 2.24 The numeration of the inputs/outputs of neurons and the notation of signals (x_{i,j}^(r,p) and x_{j,k}^(p,q)) transmitted via interneuron links; it is the basic level of the description of the ANN. S_i^(r), S_j^(p), and S_k^(q) are neurons of the ANN (ith in the rth layer, jth in the pth layer, and kth in the qth layer, respectively); N_i^(r), N_j^(p), N_k^(q) are the numbers of inputs and M_i^(r), M_j^(p), M_k^(q) are the numbers of outputs in the neurons S_i^(r), S_j^(p), and S_k^(q), respectively; x_{i,j}^(r,p) is the signal transferred from the output of the ith neuron from the rth layer to the input of the jth neuron from the pth layer; x_{j,k}^(p,q) is the signal transferred from the output of the jth neuron from the pth layer to the input of the kth neuron from the qth layer; g, h, l, m, n, s are the numbers of the neuron inputs/outputs; NL is the number of layers in the ANN; N^(r), N^(p), N^(q) are the numbers of neurons in the layers with numbers r, p, q, respectively. From [109], used with permission from Moscow Aviation Institute.

of a given neuron go, i.e.,

    R_j^(p) = {⟨q, k⟩},  q ∈ {1, . . . , NL},  k ∈ {1, . . . , N_L^(q)}.    (2.24)

Inputs/outputs of neurons are described as follows.

In the variant with the maximum detail of the description (extended level of the ANN description), which provides the possibility of representing any ANN structure, we use the notation of the form x_{(i,l),(j,m)}^(r,p). It identifies the signal transmitted from the neuron S_i^(r) (ith neuron from the rth layer) to S_j^(p) (jth neuron from the pth layer); the outputs of the ith neuron in the rth layer and the inputs of the jth neuron of the pth layer are renumbered, and according to this numbering, l is the serial number of the output of the element S_i^(r), and m is the serial number of the input of the element S_j^(p). Such a detailed representation is required in cases where the order of the input/output quantities is important, i.e., the set of these quantities is interpreted as a vector. For example, this kind of representation is used in the compressive mapping of the RBF neuron, which realizes the calculation of the distance between two vectors.

In the variant when a complete specification of the neuron's connections is not required (this is the case when the result of "compression" (or "aggregation") ϕ : R^n → R does not depend on the order of the input components), we can use a simpler notation for the input/output signals of the neuron, which has the form x_{i,j}^(r,p). In this case, it is simply indicated that the connection goes from the ith neuron of the rth layer to the jth neuron of the pth layer, without specifying the serial numbers of the input/output components.

The system of numbering of the neuron inputs/outputs in the ANN, as well as the interneuron connections, is illustrated in Fig. 2.24

FIGURE 2.25 The numbering of the inputs/outputs of neurons and the designations of signals (x_{(i,h),(j,l)}^(r,p) and x_{(j,m),(k,n)}^(p,q)) transmitted through interneuronal connections; it is the extended level of the description of the ANN. S_i^(r), S_j^(p), and S_k^(q) are the neurons of the ANN (ith in the rth layer, jth in the pth layer, and kth in the qth layer, respectively); N_i^(r), N_j^(p), N_k^(q) are the numbers of inputs and M_i^(r), M_j^(p), M_k^(q) are the numbers of outputs in the neurons S_i^(r), S_j^(p), and S_k^(q), respectively; x_{(i,h),(j,l)}^(r,p) is the signal transferred from the hth output of the ith neuron from the rth layer to the lth input of the jth neuron from the pth layer; x_{(j,m),(k,n)}^(p,q) is the signal transferred from the mth output of the jth neuron from the pth layer to the nth input of the kth neuron from the qth layer; g, h, l, m, n, s are the numbers of the neuron inputs/outputs; NL is the number of layers in the ANN; N^(r), N^(p), N^(q) are the numbers of neurons in the layers with numbers r, p, q, respectively.

for the baseline level of the ANN description and in Fig. 2.25 for the advanced level.

2.2 ARTIFICIAL NEURAL NETWORK TRAINING METHODS

After an appropriate neural network structure has been selected, one needs to determine the values of its parameters in order to achieve the desired input–output behavior. The process of parameter modification is usually called learning or training when referring to neural networks. Thus, the ANN learning algorithm is a sequence of actions which modifies the parameters so that the network would be able to solve some specific task.

There are several major approaches to the neural network learning problem:
• unsupervised learning;
• supervised learning;
• reinforcement learning.

The features of these approaches are as follows.

In the case of unsupervised learning, only the inputs are given, and there are no prescribed output values. Unsupervised learning aims at discovering inherent patterns in the data set. This approach is usually applied to clustering and dimensionality reduction problems.

In the case of supervised learning, the desired network behavior is explicitly defined by a training data set. Each training example associates some input with a specific desired output. The goal of the learning is to find such values of the neural network parameters that the actual network outputs would be as close as possible to the desired ones. This approach is usually

applied to classification, regression, and system identification problems.

If a training data set is not known beforehand, but rather presented sequentially, one example at a time, and a neural network is expected to operate and learn simultaneously, then it is said to perform incremental learning. Additionally, if the environment is assumed to be nonstationary, i.e., the desired response to some input may vary over time, then the training data set becomes inconsistent and a neural network needs to perform adaptation. In this case, we face a stability–plasticity dilemma: if the network lacks plasticity, then it cannot rapidly adapt to changes; on the other hand, if it lacks stability, then it forgets the previously learned data.

Another variation of supervised learning is active learning, which assumes that the neural network itself is responsible for the data set acquisition. That is, the network selects a new input and queries an external system (for example, some sensor) for the desired outputs that correspond to this input. Hence, a neural network is expected to "explore" the environment by interacting with it and to "exploit" the obtained data by minimizing some objective. In this paradigm, finding a balance between exploration and exploitation becomes an important issue. Reinforcement learning takes the idea of active learning one step further by assuming that the external system cannot provide the network with examples of desired behavior – instead, it can only score the previous behavior of the network. This approach is usually applied to intelligent control and decision making problems.

In this book, we cover only the supervised learning approach and focus on the modeling and identification problem for dynamical systems. Section 2.3.1 treats the training methods for static neural networks with applications to function approximation problems. These methods constitute the basis for dynamic neural network training algorithms, discussed in Section 2.3.3. For a discussion of unsupervised methods, see [10]. Reinforcement learning methods are presented in the books [45–48].

We need to mention that the actual goal of the neural network supervised learning is not to achieve a perfect match of predictions with the training data, but to perform highly accurate predictions on the independent data during the network operation, i.e., the network should be able to generalize. In order to evaluate the generalization ability of a network, we split all the available experimental data into a training set and a test set. The model learns only on the training set, and then it is evaluated on the independent test set. Sometimes yet another subset is reserved – the so-called validation set, which is used to select the model hyperparameters (such as the number of layers or neurons).

2.2.1 Overview of the Neural Network Training Framework

Suppose that the network parameters are represented by a finite-dimensional vector W ∈ Rnw. The supervised learning approach implies a minimization of an error function (also called objective function, loss function, or cost function), which represents the deviation of actual network outputs from their desired values. We define a total error function Ē : Rnw → R to be the sum of individual errors for each of the training examples, i.e.,

    Ē(W) = Σ_{p=1}^{P} E(p)(W).    (2.25)

The error function (2.25) is to be minimized with respect to the neural network parameters W. Thus, we have an unconstrained nonlinear optimization problem:

    minimize_W Ē(W).    (2.26)

In order for the minimization problem to make sense, we require the error function to be bounded from below.
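As a toy illustration of (2.25)–(2.26) — with made-up data and a single linear neuron standing in for the network — the total error is just the sum of per-example errors, and training means driving it down over W:

```python
import numpy as np

# Toy data set generated by y = 2x + 1, and a toy "network" W = (w, b).
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([1.0, 3.0, 5.0, 7.0])

def E_p(W, p):
    # Individual error for the p-th training example.
    y_hat = W[0] * X[p] + W[1]
    return 0.5 * (Y[p] - y_hat) ** 2

def E_total(W):
    # Total error function (2.25): a sum over all P examples.
    return sum(E_p(W, p) for p in range(len(X)))

# The minimization problem (2.26) is solved here exactly by W = (2, 1):
W_star = np.array([2.0, 1.0])
```

Since each E_p is nonnegative, the total error is bounded from below by zero, so the minimization problem is well posed.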

Minimization is carried out by means of various iterative numerical methods. The optimization methods can be divided into global and local ones, according to the type of minimum they seek. Global optimization methods seek an approximate global minimum, whereas local methods seek a precise local minimum. Most of the global optimization methods have a stochastic nature (e.g., simulated annealing, evolutionary algorithms, particle swarm optimization), and the convergence is achieved almost surely and only in the limit. In this book we focus on the local deterministic gradient-based optimization methods, which guarantee a rapid convergence to a local solution under some reasonable assumptions. In order to apply these methods, we also require the error function to be sufficiently smooth (which is usually the case with neural networks, provided all the activation functions are smooth). For more detailed information on local optimization methods, we refer to [49–52]. Metaheuristic global optimization methods are covered in [53,54].

Note that the local optimization methods require an initial guess W(0) for parameter values. There are various approaches to the initialization of network parameters. For example, the parameters may be sampled from a Gaussian distribution, i.e.,

    Wi ∼ N(0, 1),  i = 1, . . . , nw.    (2.27)

The following alternative initialization method for layered feedforward neural networks (2.8), called Xavier initialization, was suggested in [55]:

    b_i^l = 0,
    w_{i,j}^l ∼ U(−√(6/(S^(l−1) + S^l)), √(6/(S^(l−1) + S^l))).    (2.28)

Optimization methods may also be classified by the order of error function derivatives used to guide the search process. Thus, zero-order methods use only the error function values; first-order methods rely on the first derivatives (gradient ∇Ē); second-order methods also utilize the second derivatives (Hessian ∇²Ē).

The basic descent method has the form

    W(k+1) = W(k) + α(k) p(k),  Ē(W(k+1)) < Ē(W(k)),    (2.29)

where p(k) is a search direction and α(k) represents a step length, also called the learning rate. Note that we require each step to decrease the error function. In order to guarantee the error function decrease for arbitrarily small step lengths, we need the search direction to be a descent direction, that is, to satisfy p(k)ᵀ ∇Ē(W(k)) < 0.

The simplest example of a first-order descent method is the gradient descent (GD) method, which utilizes the negative gradient search direction, i.e.,

    p(k) = −∇Ē(W(k)).    (2.30)

The step lengths may be assigned beforehand, α(k) ≡ α for all k, but if the step α is too large, the error function might actually increase, and then the iterations would diverge. For example, in the case of a convex quadratic error function of the form

    Ē(W) = ½ Wᵀ A W + bᵀ W + c,    (2.31)

where A is a symmetric positive definite matrix with a maximum eigenvalue of λmax, the step length must satisfy

    α < 2/λmax

in order to guarantee the convergence of gradient descent iterations. On the other hand, a small step α would result in a slow convergence. In order to circumvent this problem, we can perform a step length adaptation: we take a "trial" step, evaluate the error function and check whether

it has decreased or not. If it has decreased, then we accept this trial step, and we increase the step length. Otherwise, we reject the trial step and decrease the step length. An alternative approach is to perform a line search for an optimal step length which provides the maximum possible reduction of the error function along the search direction, i.e.,

    α(k) = argmin_{α>0} Ē(W(k) + α p(k)).    (2.32)

The GD method combined with this exact line search is called the steepest gradient descent. Note that the global minimum of this univariate function is hard to find; in fact, even a search for an accurate estimate of a local minimum would require many iterations. Fortunately, we do not need to find an exact minimum along the specified direction – the convergence of an overall minimization procedure may be obtained if we guarantee a sufficient decrease of the error function at each iteration. If the search direction is a descent direction and if the step lengths satisfy the Wolfe conditions

    Ē(W(k) + α(k) p(k)) ≤ Ē(W(k)) + c1 α(k) ∇Ē(W(k))ᵀ p(k),
    ∇Ē(W(k) + α(k) p(k))ᵀ p(k) ≥ c2 ∇Ē(W(k))ᵀ p(k),    (2.33)

for 0 < c1 < c2 < 1, then the iterations converge to a stationary point, lim_{k→∞} ∇Ē(W(k)) = 0, from an arbitrary initial guess (i.e., we have a global convergence to a stationary point). Note that there always exist intervals of step lengths which satisfy the Wolfe conditions. This justifies the use of inexact line search methods, which require fewer iterations to find an appropriate step length which provides a sufficient reduction of the error function. Unfortunately, the GD method has a linear convergence rate, which is very slow.

Another important first-order method is the nonlinear conjugate gradient (CG) method. In fact, it is a family of methods which utilize search directions of the following general form:

    p(0) = −∇Ē(W(0)),
    p(k) = −∇Ē(W(k)) + β(k) p(k−1).    (2.34)

Depending on the choice of the scalar β(k), we obtain several variations of the method. The most popular expressions for β(k) are the following:

• the Fletcher–Reeves method:

    β(k) = (∇Ē(W(k))ᵀ ∇Ē(W(k))) / (∇Ē(W(k−1))ᵀ ∇Ē(W(k−1)));    (2.35)

• the Polak–Ribière method:

    β(k) = (∇Ē(W(k))ᵀ (∇Ē(W(k)) − ∇Ē(W(k−1)))) / (∇Ē(W(k−1))ᵀ ∇Ē(W(k−1)));    (2.36)

• the Hestenes–Stiefel method:

    β(k) = (∇Ē(W(k))ᵀ (∇Ē(W(k)) − ∇Ē(W(k−1)))) / ((∇Ē(W(k)) − ∇Ē(W(k−1)))ᵀ p(k−1)).    (2.37)

Irrespective of the particular β(k) selected, the first search direction p(0) is simply the negative gradient direction. If we assume that the error function is convex and quadratic (2.31), then the method generates a sequence of conjugate search directions (i.e., p(i)ᵀ A p(j) = 0 for i ≠ j). If we also assume that the line searches are exact, then the method converges within nw iterations. In the general case of a nonlinear error function, the convergence rate is linear; however, a twice differentiable error function with nonsingular Hessian is approximately quadratic in the neighborhood of the solution, which results in fast convergence. Note also that the search directions lose conjugacy, hence we need to perform

so-called "restarts," i.e., to assign β(k) ← 0. For example, we might reset β(k) if the consecutive directions are nonorthogonal, |p(k)ᵀ p(k−1)| / ‖p(k)‖² > ε. In the case of the Polak–Ribière method, we should also reset β(k) if it becomes negative.

The basic second-order method is Newton's method:

    p(k) = −(∇²Ē(W(k)))⁻¹ ∇Ē(W(k)).    (2.38)

If the Hessian ∇²Ē(W(k)) is positive definite, the resulting search direction p(k) is a descent direction. If the error function is convex and quadratic, Newton's method with a unit step length α(k) = 1 finds the solution in a single step. For a smooth nonlinear error function with positive definite Hessian at the solution, the convergence is quadratic, provided the initial guess lies sufficiently close to the solution. If the Hessian turns out to have negative or zero eigenvalues, we need to modify it in order to obtain a positive definite approximation B – for example, we might add a scaled identity matrix, so we have

    B(k) = ∇²Ē(W(k)) + μ(k) I.    (2.39)

The resulting damped method may be viewed as a hybrid of the ordinary Newton method (for μ(k) = 0) and a gradient descent (for μ(k) → ∞).

Note that the Hessian computation is very computationally expensive; hence various approximations have been proposed. If we assume that each individual error is a quadratic form,

    E(p)(W) = ½ e(p)(W)ᵀ e(p)(W),    (2.40)

then the gradient and Hessian may be expressed in terms of the error Jacobian as follows:

    ∇E(p)(W) = (∂e(p)(W)/∂W)ᵀ e(p)(W),
    ∇²E(p)(W) = (∂e(p)(W)/∂W)ᵀ (∂e(p)(W)/∂W) + Σ_{i=1}^{ne} (∂²e_i^(p)(W)/∂W²) e_i^(p)(W).    (2.41)

Then, the Gauss–Newton approximation to the Hessian is obtained by discarding the second-order terms, i.e.,

    ∇²E(p)(W) ≈ B(p) = (∂e(p)(W)/∂W)ᵀ (∂e(p)(W)/∂W).    (2.42)

The resulting matrix B can turn out to be degenerate, so we might modify it by adding a scaled identity matrix as mentioned above in (2.39). Then we have

    B(p) = (∂e(p)(W)/∂W)ᵀ (∂e(p)(W)/∂W) + μ(k) I.    (2.43)

This technique leads us to the Levenberg–Marquardt method.

A family of quasi-Newton methods estimates the inverse Hessian by accumulating the changes of gradients. These methods construct an inverse Hessian approximation H ≈ (∇²Ē(W))⁻¹ so as to satisfy the secant equation:

    H(k+1) y(k) = s(k),
    s(k) = W(k+1) − W(k),    (2.44)
    y(k) = ∇Ē(W(k+1)) − ∇Ē(W(k)).

However, for nw > 1 this system of equations is underdetermined and there exists an infinite number of solutions. Thus, additional constraints are imposed, giving rise to various quasi-Newton methods. Most of them require that the inverse Hessian approximation H(k+1)

be symmetric and positive definite, and also minimize the distance to the previous estimate H(k) with respect to some norm: H(k+1) = argmin_H ‖H − H(k)‖. One of the most popular variations of quasi-Newton methods is the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm:

    H(k+1) = (I − ρ(k) s(k) y(k)ᵀ) H(k) (I − ρ(k) y(k) s(k)ᵀ) + ρ(k) s(k) s(k)ᵀ,
    ρ(k) = 1 / (y(k)ᵀ s(k)).    (2.45)

The initial guess for the inverse Hessian may be selected in different ways: for example, if H(0) = I, then the first search direction corresponds to that of GD. Note that if H(0) is positive definite and a step length is selected by a line search so as to satisfy the Wolfe conditions (2.33), then all the resulting H(k) will also be positive definite. The BFGS algorithm has a superlinear rate of convergence. We should also mention that the inverse Hessian approximation contains nw² elements, and when the number of parameters nw is large, it might not fit in the memory. In order to circumvent this issue, a limited memory BFGS (L-BFGS) [56] version of the algorithm has been proposed, which stores only the m ≪ nw most recent vector pairs {s(j), y(j)}, j = k − m, . . . , k − 1, and uses them to compute the search direction without explicit evaluation of H(k).

Another strategy, alternative to line search methods, is a family of trust region methods [57]. These methods repeatedly construct some local model M̄(k) of the error function, which is assumed to be valid in a neighborhood of the current point W(k), and minimize it within this neighborhood. If we utilize a second-order Taylor series approximation of the error function as the model,

    Ē(W(k) + p) ≈ M̄(k)(p) = Ē(W(k)) + pᵀ ∇Ē(W(k)) + ½ pᵀ ∇²Ē(W(k)) p,    (2.46)

and a ball of radius Δ(k) as a trust region, we obtain the following constrained quadratic optimization subproblems:

    minimize_p M̄(k)(p)  subject to ‖p‖ ≤ Δ(k).    (2.47)

The trust region radius is adapted based on the ratio of the actual to the predicted reduction of the error function, i.e.,

    ρ(k) = (Ē(W(k)) − Ē(W(k) + p(k))) / (M̄(k)(0) − M̄(k)(p(k))).    (2.48)

If this ratio is close to 1, the trust region radius is increased by some factor; if the ratio is close to 0, the radius is decreased. Also, if ρ(k) is negative or very small, the step is rejected. We could also use ellipsoidal trust regions of the form ‖Dp‖ ≤ Δ(k), where D is a nonsingular diagonal matrix.

There exist various approaches to solving the constrained quadratic optimization subproblem (2.47). One of these techniques [58] relies on the linear conjugate gradient method in order to solve the system of linear equations

    ∇²Ē(W(k)) p = −∇Ē(W(k))

with respect to p. This results in the following iterations:

    p(k,0) = 0,
    r(k,0) = ∇Ē(W(k)),
    d(k,0) = −∇Ē(W(k)),
    α(k,s) = (r(k,s)ᵀ r(k,s)) / (d(k,s)ᵀ ∇²Ē(W(k)) d(k,s)),
    p(k,s+1) = p(k,s) + α(k,s) d(k,s),
    r(k,s+1) = r(k,s) + α(k,s) ∇²Ē(W(k)) d(k,s),
    β(k,s+1) = (r(k,s+1)ᵀ r(k,s+1)) / (r(k,s)ᵀ r(k,s)),
    d(k,s+1) = −r(k,s+1) + β(k,s+1) d(k,s).

The iterations are terminated prematurely either if they cross the trust region boundary, ‖p(k,s+1)‖ ≥ Δ(k), or if a nonpositive curvature direction is discovered, d(k,s)ᵀ ∇²Ē(W(k)) d(k,s) ≤ 0. In these cases, a solution corresponds to the intersection of the current search direction with the trust region boundary. It is important to note that this method does not require one to compute the entire Hessian matrix; instead, we need only the Hessian vector products of the form ∇²Ē(W(k)) d(k,s), which may be computed more efficiently by reverse-mode automatic differentiation methods described below. Such Hessian-free methods have been successfully applied to neural network training [59,60].

Another approach to solving (2.47) [61,62] replaces the subproblem with an equivalent problem of finding both the vector p ∈ Rnw and the scalar μ ≥ 0 such that

    (∇²Ē(W(k)) + μI) p = −∇Ē(W(k)),
    μ(Δ − ‖p‖) = 0,    (2.49)

where ∇²Ē(W(k)) + μI is positive semidefinite. There are two possibilities. If μ = 0, then we have p = −(∇²Ē(W(k)))⁻¹ ∇Ē(W(k)) and ‖p‖ ≤ Δ. If μ > 0, then we define p(μ) = −(∇²Ē(W(k)) + μI)⁻¹ ∇Ē(W(k)) and solve the one-dimensional equation ‖p(μ)‖ = Δ with respect to μ.

Note that since the error function (2.25) is a summation of errors for each individual training example, its gradient as well as its Hessian may also be represented as summations of the gradients and Hessians of these errors, i.e.,

    ∇Ē(W) = Σ_{p=1}^{P} ∇E(p)(W),    (2.50)
    ∇²Ē(W) = Σ_{p=1}^{P} ∇²E(p)(W).    (2.51)

When the neural network has a large number of parameters nw and the data set contains a large number of training examples P, computation of the total error function value Ē as well as its derivatives can be time consuming. Thus, even for a simple GD method, each update of the weights takes a lot of time. In this situation we might apply a stochastic gradient descent (SGD) method, which randomly shuffles the training examples, iterates over them, and updates the parameters using the gradients of the individual errors E(p):

    W(k,p) = W(k,p−1) − α(k) ∇E(p)(W(k,p−1)),
    W(k+1,0) = W(k,P).    (2.52)

In contrast, the usual gradient descent is called the batch method. We need to mention that although the (k,p)th step decreases the error for the pth training example, it may increase the error for the other examples. On the one hand, this allows the method to escape some local minima, but on the other hand, it becomes difficult to converge to a final solution. In order to circumvent this issue, we might gradually decrease the step lengths α(k). Also, in order to achieve a "smoother" convergence, we could perform the weight updates based on random subsets of training examples, which is called a "minibatch" strategy. The stochastic or minibatch approach may also be applied to other optimization methods; see [63].
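A minimal sketch of the stochastic update (2.52) on a linear least-squares problem (synthetic, noise-free data; a single linear neuron stands in for the network). Each epoch shuffles the examples and applies one individual-error gradient step per example, with a gradually decreasing step length:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linear data set: P examples, individual errors of the form (2.40).
P = 200
X = rng.uniform(-1.0, 1.0, size=(P, 2))
W_true = np.array([1.5, -0.5])
Y = X @ W_true                               # noise-free targets

def grad_Ep(W, p):
    # Gradient of the individual error E_p(W) = 0.5 * (x_p @ W - y_p)^2.
    return (X[p] @ W - Y[p]) * X[p]

W = np.zeros(2)
for k in range(50):                          # epochs
    alpha = 0.5 / (1.0 + 0.1 * k)            # gradually decreasing step length
    for p in rng.permutation(P):             # random shuffle of the examples
        W = W - alpha * grad_Ep(W, p)        # stochastic update (2.52)
```

Because the data are consistent (zero noise), every individual gradient vanishes at W_true, so the stochastic iterations settle down; with noisy data, the decreasing step schedule would be what suppresses the residual "jitter" around the solution.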

We should also mention that in the case of the batch or minibatch update strategy, the computation of the total error function value, as well as its derivatives, can be efficiently parallelized. In order to do that, we need to divide the data set into multiple subsets, compute partial sums of the error function and its derivatives over the training examples of each subset in parallel, and then sum the results. This is not possible in the case of stochastic updates. In the case of an SGD method, we can parallelize the gradient computations by neurons of each layer.

Finally, we note that any iterative method requires a stopping criterion used to terminate the procedure. One simple option is a test based on the first-order necessary conditions for a local minimum, i.e.,

    ‖∇Ē(W(k))‖ < ε_g.    (2.53)

We can also terminate the iterations if it seems that no progress is made, i.e.,

    Ē(W(k)) − Ē(W(k+1)) < ε_E,
    ‖W(k) − W(k+1)‖ < ε_w.    (2.54)

In order to prevent an infinite loop in the case of algorithm divergence, we might stop when a certain maximum number of iterations has been performed, i.e.,

    k ≤ k̄.    (2.55)

2.2.2 Static Neural Network Training

In this subsection, we consider the function approximation problem. The problem is stated as follows. Suppose that we wish to approximate an unknown mapping f : X → Y, where X ⊂ Rnx and Y ⊂ Rny. Assume we are given an experimental data set of the form

    {x(p), ỹ(p)},  p = 1, . . . , P,    (2.56)

where x(p) ∈ X represent the input vectors and ỹ(p) ∈ Y represent the observed output vectors. Note that in general the observed outputs ỹ(p) do not match the true outputs y(p) = f(x(p)). We assume that the observations are corrupted by an additive Gaussian noise, i.e.,

    ỹ(p) = y(p) + η(p),    (2.57)

where η(p) represent the sample points of a zero-mean random vector η ∼ N(0, Σ) with diagonal covariance matrix Σ = diag(σ₁², . . . , σ_ny²).

The approximation is to be performed using a layered feedforward neural network of the form (2.8). Under the abovementioned assumptions on the observation noise, it is reasonable to utilize a least-squares error function. Thus, we have a total error function Ē of the form (2.25) with the individual errors

    E(p)(W) = ½ (ỹ(p) − ŷ(p))ᵀ Ω (ỹ(p) − ŷ(p)),    (2.58)

where ŷ(p) represent the neural network outputs given the corresponding inputs x(p) and weights W. The diagonal matrix Ω of fixed "error weights" has the form Ω = diag(ω₁, . . . , ω_ny), where the ω_i are usually taken to be inversely proportional to the noise variances.

We need to minimize the total approximation error Ē with respect to the neural network parameters W. If the activation functions of all the neurons are smooth, then the error function is also

smooth. Hence, the minimization can be carried out using any of the optimization methods described in Section 2.2.1. However, in order to apply those methods, we need an efficient algorithm to compute the gradient and Hessian of the error function with respect to the parameters. As mentioned above, the total error gradient ∇Ē and Hessian ∇²Ē may be expressed in terms of the individual error gradients ∇E^(p) and Hessians ∇²E^(p). Thus, all that remains is to compute the derivatives of E^(p). For notational convenience, in the remainder of this section we omit the training example index p.

There exist several approaches to the computation of error function derivatives:

• numeric differentiation;
• symbolic differentiation;
• automatic (or algorithmic) differentiation.

The numeric differentiation approach relies on the derivative definition and approximates it via finite differences. This method is very simple to implement, but it suffers from truncation and roundoff errors. It is especially inaccurate for higher-order derivatives. Also, it requires many function evaluations: for example, in order to estimate the error function gradient with respect to n_w parameters using the simplest forward difference scheme, we require error function values at n_w + 1 points.

Symbolic differentiation transforms a symbolic expression for the original function (usually represented in the form of a computational graph) into symbolic expressions for its derivatives by applying the chain rule. The resulting expressions may be evaluated at any point accurately to working precision. However, these expressions usually end up having many identical subexpressions, which leads to duplicate computations (especially when we need the derivatives with respect to multiple parameters). In order to avoid this, we need to simplify the expressions for the derivatives, which presents a nontrivial problem.

The automatic differentiation technique [64] computes function derivatives at a point by applying the chain rule to the corresponding numerical values instead of symbolic expressions. This method produces accurate derivative values, just like symbolic differentiation, and also allows for a certain performance optimization. Note that automatic differentiation relies on the original computational graph for the function to be differentiated. Thus, if the original graph makes use of some common intermediate values, they will be efficiently reused by the differentiation procedure. Automatic differentiation is especially useful for neural network training, since it scales well to multiple parameters as well as to higher-order derivatives. In this book, we adopt the automatic differentiation approach.

Automatic differentiation encompasses two different modes of computation: forward and reverse. Forward mode computes sensitivities of all variables with respect to the input variables: it starts with the intermediate variables that explicitly depend on the input variables (the most deeply nested subexpressions) and proceeds "forward" by applying the chain rule, until the output variables are processed. Reverse mode computes sensitivities of the output variables with respect to all variables: it starts with the intermediate variables on which the output variables explicitly depend (the outermost subexpressions) and proceeds "in reverse" by applying the chain rule, until the input variables are processed. Each mode has its own advantages and disadvantages. The forward mode allows one to compute function values as well as derivatives of multiple orders in a single pass. On the other hand, in order to compute the rth-order derivative using the reverse mode, one needs the derivatives of all the lower orders s = 0, …, r − 1 beforehand. The computational complexity of first-order derivative computation in the forward mode is proportional to the number of inputs, while in the reverse mode it is proportional to the number of outputs. In our case, there is only one output (the scalar error) and multiple inputs; therefore the reverse mode is significantly faster than the forward mode. As shown in [65], under realistic assumptions the error function gradient can be computed in reverse mode at a cost of five function evaluations or less. Also note that in the ANN field the forward and reverse computation modes are usually referred to as forward propagation and backward propagation (or backpropagation).

In the rest of this subsection we present automatic differentiation algorithms for the computation of the gradient, Jacobian, and Hessian of the squared error function (2.58) in the case of a layered feedforward neural network (2.8). All these algorithms rely on the fact that the derivatives of the activation functions are known. For example, the derivatives of the hyperbolic tangent activation functions (2.9) are

φ^l_i′(n^l_i) = 1 − (φ^l_i(n^l_i))²,
φ^l_i″(n^l_i) = −2 φ^l_i(n^l_i) φ^l_i′(n^l_i),   l = 1, …, L − 1, i = 1, …, S^l,   (2.59)

while the derivatives of a logistic function (2.10) equal

φ^l_i′(n^l_i) = φ^l_i(n^l_i) (1 − φ^l_i(n^l_i)),
φ^l_i″(n^l_i) = φ^l_i′(n^l_i) (1 − 2 φ^l_i(n^l_i)),   l = 1, …, L − 1, i = 1, …, S^l.   (2.60)

Derivatives of the identity activation functions (2.11) are simply

φ^L_i′(n^L_i) = 1,   φ^L_i″(n^L_i) = 0,   i = 1, …, S^L.   (2.61)

Backpropagation algorithm for error function gradient. First, we perform a forward pass to compute the weighted sums n^l_i and activations a^l_i for all neurons i = 1, …, S^l of each layer l = 1, …, L, according to equations (2.8).

We define the error function sensitivities with respect to the weighted sums n^l_i to be as follows:

δ^l_i ≜ ∂E/∂n^l_i.   (2.62)

Sensitivities for the output layer neurons are obtained directly, i.e.,

δ^L_i = −ω_i (ỹ_i − a^L_i) φ^L_i′(n^L_i),   (2.63)

while sensitivities for the hidden layer neurons are computed during a backward pass:

δ^l_i = φ^l_i′(n^l_i) Σ_{j=1}^{S^{l+1}} δ^{l+1}_j w^{l+1}_{j,i},   l = L − 1, …, 1.   (2.64)

Finally, the error function derivatives with respect to the parameters are expressed in terms of sensitivities, i.e.,

∂E/∂b^l_i = δ^l_i,
∂E/∂w^l_{i,j} = δ^l_i a^{l−1}_j.   (2.65)

In a similar manner, we can compute the derivatives with respect to the network inputs, i.e.,

∂E/∂a^0_i = Σ_{j=1}^{S^1} δ^1_j w^1_{j,i}.   (2.66)

Forward propagation for network outputs Jacobian. We define the pairwise sensitivities of weighted sums to be as follows:

ν^{l,m}_{i,j} ≜ ∂n^l_i/∂n^m_j.   (2.67)

Pairwise sensitivities for neurons of the same layer are obtained directly, i.e.,

ν^{l,l}_{i,i} = 1,
ν^{l,l}_{i,j} = 0,   i ≠ j.   (2.68)
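As a concrete illustration, the backward pass of Eqs. (2.62)–(2.65) can be sketched in a few lines of code. The following is a minimal sketch for a hypothetical two-layer network (one tanh hidden layer, identity output); the sizes, weights, and data are illustrative assumptions, not taken from the text, and the result is checked against the finite-difference scheme discussed above.

```python
import numpy as np

# Hypothetical toy network: one tanh hidden layer (S^1 = 3), identity output
# layer (S^2 = 1), following the notation of Eqs. (2.62)-(2.65).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)   # hidden weights/biases
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)   # output weights/biases
x, y_t, omega = np.array([0.5, -1.2]), np.array([0.7]), np.array([2.0])

def forward(W1, b1, W2, b2):
    n1 = W1 @ x + b1; a1 = np.tanh(n1)        # weighted sums and activations
    n2 = W2 @ a1 + b2; a2 = n2                # identity output activation
    return n1, a1, n2, a2

def error(W1, b1, W2, b2):                    # E = 0.5*(y-a)^T Omega (y-a)
    a2 = forward(W1, b1, W2, b2)[3]
    return 0.5 * (y_t - a2) @ (omega * (y_t - a2))

# Backward pass: output sensitivities (2.63), hidden sensitivities (2.64),
# parameter derivatives (2.65).
n1, a1, n2, a2 = forward(W1, b1, W2, b2)
delta2 = -omega * (y_t - a2) * 1.0            # phi' = 1 for identity output
delta1 = (1 - np.tanh(n1) ** 2) * (W2.T @ delta2)
grad_b1, grad_W1 = delta1, np.outer(delta1, x)

# Check against central finite differences (the "numeric differentiation"
# approach discussed above).
eps, fd = 1e-6, np.zeros_like(b1)
for i in range(b1.size):
    e = np.zeros_like(b1); e[i] = eps
    fd[i] = (error(W1, b1 + e, W2, b2) - error(W1, b1 - e, W2, b2)) / (2 * eps)
print(np.allclose(grad_b1, fd, atol=1e-6))    # True: backprop matches
```

A full implementation would, of course, loop over all layers and also accumulate the output-layer derivatives; the point here is only that the sensitivities reproduce the finite-difference values at a fraction of the cost.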
Since the activations of neurons of some layer m do not affect the neurons of preceding layers l < m, the corresponding pairwise sensitivities are identically zero, i.e.,

ν^{l,m}_{i,j} = 0,   m > l.   (2.69)

The remaining pairwise sensitivities are computed during the forward pass, along with the weighted sums n^l_i and activations a^l_i, i.e.,

ν^{l,m}_{i,j} = Σ_{k=1}^{S^{l−1}} w^l_{i,k} φ^{l−1}_k′(n^{l−1}_k) ν^{l−1,m}_{k,j},   l = 2, …, L.   (2.70)

Finally, the derivatives of the neural network outputs with respect to the parameters are expressed in terms of pairwise sensitivities, i.e.,

∂a^L_i/∂b^m_j = φ^L_i′(n^L_i) ν^{L,m}_{i,j},
∂a^L_i/∂w^m_{j,k} = φ^L_i′(n^L_i) ν^{L,m}_{i,j} a^{m−1}_k.   (2.71)

If we additionally define the sensitivities of weighted sums with respect to the network inputs,

ν^{l,0}_{i,j} ≜ ∂n^l_i/∂a^0_j,   (2.72)

then we obtain the derivatives of network outputs with respect to network inputs. First, we compute the additional sensitivities during the forward pass, i.e.,

ν^{1,0}_{i,j} = w^1_{i,j},
ν^{l,0}_{i,j} = Σ_{k=1}^{S^{l−1}} w^l_{i,k} φ^{l−1}_k′(n^{l−1}_k) ν^{l−1,0}_{k,j},   l = 2, …, L.   (2.73)

Then, the derivatives of network outputs with respect to network inputs are expressed in terms of the additional sensitivities, i.e.,

∂a^L_i/∂a^0_j = φ^L_i′(n^L_i) ν^{L,0}_{i,j}.   (2.74)

Backpropagation algorithm for error gradient and Hessian [66]. First, we perform a forward pass to compute the weighted sums n^l_i and activations a^l_i according to Eqs. (2.8), and also to compute the pairwise sensitivities ν^{l,m}_{i,j} according to (2.68)–(2.70).

We define the error function second-order sensitivities with respect to the weighted sums to be as follows:

δ^{l,m}_{i,j} ≜ ∂²E/(∂n^l_i ∂n^m_j).   (2.75)

Next, during a backward pass we compute the error function sensitivities δ^l_i as well as the second-order sensitivities δ^{l,m}_{i,j}. According to Schwarz's theorem on the equality of mixed partials, due to continuity of the second partial derivatives of the error function with respect to the weighted sums, we have δ^{l,m}_{i,j} = δ^{m,l}_{j,i}. Hence, we need to compute the second-order sensitivities only for the case m ≤ l.

Second-order sensitivities for the output layer neurons are obtained directly, i.e.,

δ^{L,m}_{i,j} = ω_i [(φ^L_i′(n^L_i))² − (ỹ_i − a^L_i) φ^L_i″(n^L_i)] ν^{L,m}_{i,j},   (2.76)

while second-order sensitivities for the hidden layer neurons are computed during a backward pass, i.e.,

δ^{l,m}_{i,j} = φ^l_i′(n^l_i) Σ_{k=1}^{S^{l+1}} w^{l+1}_{k,i} δ^{l+1,m}_{k,j} + φ^l_i″(n^l_i) ν^{l,m}_{i,j} Σ_{k=1}^{S^{l+1}} w^{l+1}_{k,i} δ^{l+1}_k,   l = L − 1, …, 1.   (2.77)
Due to continuity of the second partial derivatives of the error function with respect to the network parameters, the Hessian matrix is symmetric. Therefore, we need to compute only the lower-triangular part of the Hessian matrix. The error function second derivatives with respect to the parameters are expressed in terms of second-order sensitivities. We have

∂²E/(∂b^l_i ∂b^m_k) = δ^{l,m}_{i,k},
∂²E/(∂b^l_i ∂w^m_{k,r}) = δ^{l,m}_{i,k} a^{m−1}_r,
∂²E/(∂w^l_{i,j} ∂b^m_k) = δ^{l,m}_{i,k} a^{l−1}_j + δ^l_i φ^{l−1}_j′(n^{l−1}_j) ν^{l−1,m}_{j,k},   l > 1,
∂²E/(∂w^1_{i,j} ∂b^1_k) = δ^{1,1}_{i,k} a^0_j,
∂²E/(∂w^l_{i,j} ∂w^m_{k,r}) = δ^{l,m}_{i,k} a^{l−1}_j a^{m−1}_r + δ^l_i φ^{l−1}_j′(n^{l−1}_j) ν^{l−1,m}_{j,k} a^{m−1}_r,   l > 1,
∂²E/(∂w^1_{i,j} ∂w^1_{k,r}) = δ^{1,1}_{i,k} a^0_j a^0_r.   (2.78)

If we additionally define the second-order sensitivities of the error function with respect to the network inputs,

δ^{l,0}_{i,j} ≜ ∂²E/(∂n^l_i ∂a^0_j),   (2.79)

then we obtain the error function second derivatives with respect to the network inputs. First, we compute the additional second-order sensitivities during the backward pass, i.e.,

δ^{L,0}_{i,j} = ω_i [(φ^L_i′(n^L_i))² − (ỹ_i − a^L_i) φ^L_i″(n^L_i)] ν^{L,0}_{i,j},
δ^{l,0}_{i,j} = φ^l_i′(n^l_i) Σ_{k=1}^{S^{l+1}} w^{l+1}_{k,i} δ^{l+1,0}_{k,j} + φ^l_i″(n^l_i) ν^{l,0}_{i,j} Σ_{k=1}^{S^{l+1}} w^{l+1}_{k,i} δ^{l+1}_k,   l = L − 1, …, 1.   (2.80)

Then, the second derivatives of the error function with respect to the network inputs are expressed in terms of the additional second-order sensitivities, i.e.,

∂²E/(∂a^0_i ∂a^0_j) = Σ_{k=1}^{S^1} w^1_{k,i} δ^{1,0}_{k,j},
∂²E/(∂b^l_i ∂a^0_k) = δ^{l,0}_{i,k},
∂²E/(∂w^l_{i,j} ∂a^0_k) = δ^{l,0}_{i,k} a^{l−1}_j + δ^l_i φ^{l−1}_j′(n^{l−1}_j) ν^{l−1,0}_{j,k},   l > 1,
∂²E/(∂w^1_{i,j} ∂a^0_j) = δ^{1,0}_{i,j} a^0_j + δ^1_i,
∂²E/(∂w^1_{i,j} ∂a^0_k) = δ^{1,0}_{i,k} a^0_j,   j ≠ k.   (2.81)

2.2.3 Dynamic Neural Network Training

Traditional dynamic neural networks, such as the NARX and Elman networks, represent controlled discrete time dynamical systems. Thus, it is natural to utilize them as models for discrete time dynamical systems. However, they can also be used as models for continuous time dynamical systems under the assumption of a uniform time step Δt. In this book we focus on the latter problem. That is, we wish to train the dynamic neural network so that it can perform
accurate closed-loop multistep-ahead prediction of the dynamical system behavior. In this subsection, we discuss the most general state space form (2.13) of dynamic neural networks.

Assume we are given an experimental data set of the form

{{u^(p)(t_k), ỹ^(p)(t_k)}_{k=0}^{K^(p)}}_{p=1}^P,   (2.82)

where P is the total number of trajectories, K^(p) is the number of time steps for the corresponding trajectory, t_k = kΔt are the discrete time instants, u^(p)(t_k) are the control inputs, and ỹ^(p)(t_k) are the observed outputs. We will also denote the total duration of the pth trajectory by t̄^(p) = K^(p)Δt.

Note that in general the observed outputs ỹ^(p)(t_k) do not match the true outputs y^(p)(t_k). We assume that the observations are corrupted by an additive white Gaussian noise, i.e.,

ỹ^(p)(t) = y^(p)(t) + η^(p)(t).   (2.83)

That is, η^(p)(t) represents a stationary Gaussian process with zero mean and a covariance function K_η(t_1, t_2) = Σ_η δ(t_2 − t_1), where

Σ_η = diag(σ²_1, …, σ²_{n_y}).

The individual errors E^(p) for each trajectory have the following form:

E^(p)(W) = Σ_{k=1}^{K^(p)} e(ỹ^(p)(t_k), z^(p)(t_k), W),   (2.84)

where z^(p)(t_k) are the model states and e : R^{n_y} × R^{n_z} × R^{n_w} → R represents the model prediction error at time instant t_k. Under the abovementioned assumptions on the observation noise, it is reasonable to utilize the instantaneous error function e of the following form:

e(ỹ, z, W) = (1/2) (ỹ − G(z, W))^T Ω (ỹ − G(z, W)),   (2.85)

where Ω = diag(ω_1, …, ω_{n_y}) is the diagonal matrix of error weights, usually taken inversely proportional to the corresponding variances of the measurement noise.

We need to minimize the total prediction error Ē with respect to the neural network parameters W. Again, the minimization can be carried out using any of the optimization methods described in Section 2.2.1, provided we can compute the gradient and Hessian of the error function with respect to the parameters. Just like in the case of static neural networks, the total error gradient ∇Ē and Hessian ∇²Ē may be expressed in terms of the individual error gradients ∇E^(p) and Hessians ∇²E^(p). Thus, we describe the algorithms for computation of the derivatives of E^(p) and omit the trajectory index p.

Again, we have two different computation modes, forward-in-time and reverse-in-time, each with its own advantages and disadvantages. The forward-in-time approach theoretically allows one to work with infinite duration trajectories, i.e., to perform online adaptation as the new data arrive. In practice, however, each iteration is more computationally expensive as compared to the reverse-in-time approach. The reverse-in-time approach is only applicable when the whole training set is available beforehand, but it works significantly faster.

Backpropagation through time algorithm (BPTT) [67–69] for error function gradient. First, we perform a forward pass to compute the predicted states z(t_k) for all time steps t_k, k = 1, …, K, according to equations (2.13). We also compute the error E(W) according to (2.84) and (2.85).

We define the error function sensitivities with respect to the model states at time step t_k to be as
follows:

λ(t_k) = ∂E(W)/∂z(t_k).   (2.86)

The error function sensitivities are computed during a backward-in-time pass, i.e.,

λ(t_{K+1}) = 0,
λ(t_k) = ∂e(ỹ(t_k), z(t_k), W)/∂z + (∂F(z(t_k), u(t_k), W)/∂z)^T λ(t_{k+1}),   k = K, …, 1.   (2.87)

Finally, the error function derivatives with respect to the parameters are expressed in terms of sensitivities, i.e.,

∂E(W)/∂W = Σ_{k=1}^K [∂e(ỹ(t_k), z(t_k), W)/∂W + (∂F(z(t_{k−1}), u(t_{k−1}), W)/∂W)^T λ(t_k)].   (2.88)

First-order derivatives of the instantaneous error function (2.85) have the form

∂e(ỹ, z, W)/∂W = −(∂G(z, W)/∂W)^T Ω (ỹ − G(z, W)),
∂e(ỹ, z, W)/∂z = −(∂G(z, W)/∂z)^T Ω (ỹ − G(z, W)).   (2.89)

Since the mappings F and G are represented by layered feedforward neural networks, their derivatives can be computed as described in Section 2.2.2.

Real-Time Recurrent Learning algorithm (RTRL) [68–70] for network outputs Jacobian. The model state sensitivities with respect to the network parameters are computed during the forward-in-time pass, along with the states themselves. We have

∂z(t_0)/∂W = 0,
∂z(t_k)/∂W = ∂F(z(t_{k−1}), u(t_{k−1}), W)/∂W + (∂F(z(t_{k−1}), u(t_{k−1}), W)/∂z) (∂z(t_{k−1})/∂W),   k = 1, …, K.   (2.90)

The gradient of the individual trajectory error function (2.84) equals

∂E(W)/∂W = Σ_{k=1}^K [∂e(ỹ(t_k), z(t_k), W)/∂W + (∂z(t_k)/∂W)^T ∂e(ỹ(t_k), z(t_k), W)/∂z].   (2.91)

A Gauss–Newton Hessian approximation may be obtained as follows:

∂²E(W)/∂W² ≈ Σ_{k=1}^K [∂²e(ỹ(t_k), z(t_k), W)/∂W² + (∂²e(ỹ(t_k), z(t_k), W)/∂W∂z) (∂z(t_k)/∂W) + (∂z(t_k)/∂W)^T (∂²e(ỹ(t_k), z(t_k), W)/∂z∂W) + (∂z(t_k)/∂W)^T (∂²e(ỹ(t_k), z(t_k), W)/∂z²) (∂z(t_k)/∂W)].   (2.92)

The corresponding approximations to the second-order derivatives of the instantaneous error function have the form

∂²e(ỹ, z, W)/∂W² ≈ (∂G(z, W)/∂W)^T Ω (∂G(z, W)/∂W),
∂²e(ỹ, z, W)/∂W∂z ≈ (∂G(z, W)/∂W)^T Ω (∂G(z, W)/∂z),
∂²e(ỹ, z, W)/∂z² ≈ (∂G(z, W)/∂z)^T Ω (∂G(z, W)/∂z).   (2.93)
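To illustrate that the two computation modes agree, the following sketch applies both BPTT, Eqs. (2.87)–(2.88), and RTRL, Eqs. (2.90)–(2.91), to a hypothetical scalar model z(t_k) = tanh(a·z(t_{k−1}) + b·u(t_{k−1})) with G(z) = z; the model, horizon, and data are illustrative assumptions, not from the text.

```python
import numpy as np

# BPTT (2.87)-(2.88) vs. RTRL (2.90)-(2.91) on a hypothetical scalar model
# z(t_k) = tanh(a*z(t_{k-1}) + b*u(t_{k-1})), G(z) = z, e = 0.5*w*(y - z)^2.
rng = np.random.default_rng(2)
a, b, w = 0.8, 0.5, 1.0
u = rng.normal(size=10); y = rng.normal(size=10)

# Forward pass: predicted states z(t_1..t_K), with z(t_0) = 0.
z = np.zeros(11)
for k in range(1, 11):
    z[k] = np.tanh(a * z[k - 1] + b * u[k - 1])

# BPTT: backward-in-time sensitivities lambda(t_k), Eq. (2.87).
lam = np.zeros(12)                            # lam[K+1] = 0
for k in range(10, 0, -1):
    dFdz = (1 - z[k + 1] ** 2) * a if k < 10 else 0.0  # dF/dz at (z_k, u_k)
    lam[k] = -w * (y[k - 1] - z[k]) + dFdz * lam[k + 1]
g_bptt = np.zeros(2)
for k in range(1, 11):
    dFdth = (1 - z[k] ** 2) * np.array([z[k - 1], u[k - 1]])  # dF/d(a, b)
    g_bptt += dFdth * lam[k]                                   # Eq. (2.88)

# RTRL: forward-in-time state sensitivities dz(t_k)/d(a, b), Eq. (2.90).
s, g_rtrl = np.zeros(2), np.zeros(2)
for k in range(1, 11):
    dphi = 1 - z[k] ** 2
    s = dphi * np.array([z[k - 1], u[k - 1]]) + dphi * a * s
    g_rtrl += s * (-w * (y[k - 1] - z[k]))                     # Eq. (2.91)

print(np.allclose(g_bptt, g_rtrl))   # True: both modes give the same gradient
```

Both loops cost O(K) here, but for n_w parameters the RTRL recursion carries an n_z × n_w sensitivity matrix forward, while BPTT carries only an n_z-vector backward, which is the cost asymmetry discussed above.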
Backpropagation through time algorithm for error gradient and Hessian. Second-order sensitivities of the error function are computed during a backward-in-time pass as follows:

∂λ(t_{K+1})/∂W = 0,
∂λ(t_k)/∂W = ∂²e(ỹ(t_k), z(t_k), W)/∂z∂W + (∂²e(ỹ(t_k), z(t_k), W)/∂z²) (∂z(t_k)/∂W) + Σ_{i=1}^{n_z} λ_i(t_{k+1}) [∂²F_i(z(t_k), u(t_k), W)/∂z∂W + (∂²F_i(z(t_k), u(t_k), W)/∂z²) (∂z(t_k)/∂W)] + (∂F(z(t_k), u(t_k), W)/∂z)^T (∂λ(t_{k+1})/∂W),   k = K, …, 1.   (2.94)

The Hessian of the individual trajectory error function (2.84) equals

∂²E(W)/∂W² = Σ_{k=1}^K [∂²e(ỹ(t_k), z(t_k), W)/∂W² + (∂²e(ỹ(t_k), z(t_k), W)/∂W∂z) (∂z(t_k)/∂W) + Σ_{i=1}^{n_z} λ_i(t_k) [∂²F_i(z(t_{k−1}), u(t_{k−1}), W)/∂W² + (∂²F_i(z(t_{k−1}), u(t_{k−1}), W)/∂W∂z) (∂z(t_{k−1})/∂W)] + (∂F(z(t_{k−1}), u(t_{k−1}), W)/∂W)^T (∂λ(t_k)/∂W)].   (2.95)

Second-order derivatives of the instantaneous error function (2.85) have the form

∂²e(ỹ, z, W)/∂W² = (∂G(z, W)/∂W)^T Ω (∂G(z, W)/∂W) − Σ_{i=1}^{n_y} ω_i (ỹ_i − G_i(z, W)) (∂²G_i(z, W)/∂W²),
∂²e(ỹ, z, W)/∂W∂z = (∂G(z, W)/∂W)^T Ω (∂G(z, W)/∂z) − Σ_{i=1}^{n_y} ω_i (ỹ_i − G_i(z, W)) (∂²G_i(z, W)/∂W∂z),
∂²e(ỹ, z, W)/∂z² = (∂G(z, W)/∂z)^T Ω (∂G(z, W)/∂z) − Σ_{i=1}^{n_y} ω_i (ỹ_i − G_i(z, W)) (∂²G_i(z, W)/∂z²).   (2.96)

In the rest of this subsection, we discuss various difficulties associated with the recurrent neural network training problem. First, notice that a recurrent neural network which performs a K-step-ahead prediction may be "unfolded" in time to produce an equivalent layered feedforward neural network, comprised of K copies of the same subnetwork, one per time step. Each of these identical subnetworks shares a common set of parameters.

Given a large prediction horizon, the resulting feedforward network becomes very deep. Thus, it is natural that all the difficulties associated with deep neural network training are also inherent to recurrent neural network training. In fact, these problems become even more severe. They include the following:

1. Vanishing and exploding gradients [71–74]. Note that the sensitivity of a recurrent neural network (2.13) state at time step t_k with respect to its state at time step t_l (l ≤ k) has the following form:

∂z(t_k)/∂z(t_l) = Π_{r=l}^{k−1} ∂F(z(t_r), u(t_r), W)/∂z.   (2.97)
If the largest (in absolute value) eigenvalues of ∂F(z(t_r), u(t_r), W)/∂z are less than 1 for all time steps t_r, r = l, …, k − 1, then the norm of the sensitivity ∂z(t_k)/∂z(t_l) will decay exponentially with k − l. Hence, the terms of the error gradient which correspond to recent time steps will dominate the sum. This is the reason why gradient-based optimization methods learn short-term dependencies much faster than long-term ones. On the other hand, a gradient explosion (exponential growth of its norm) corresponds to a situation when the eigenvalues exceed 1 at all time steps. The gradient explosion effect might lead to divergence of the optimization method, unless care is taken.
In particular, if the mapping F is represented by a layered feedforward neural network (2.8), then the Jacobian ∂F(z(t_r), u(t_r), W)/∂z corresponds to the derivatives of network outputs with respect to its inputs, i.e.,

∂a^L/∂a^0 = diag(φ^L′(n^L)) ω^L ⋯ diag(φ^1′(n^1)) ω^1.   (2.98)

Assume that the derivatives of all the activation functions φ^l are bounded by some constant η^l. Denote by λ^l_max the eigenvalue with the largest magnitude of the weight matrix ω^l of the lth layer. If the inequality Π_{l=1}^L λ^l_max η^l < 1 holds, then the largest (in magnitude) eigenvalue of the Jacobian matrix ∂a^L/∂a^0 is less than one. Derivatives of the hyperbolic tangent activation function, as well as of the identity activation function, are bounded by 1.
One of the possibilities to speed up the training is to use second-order optimization methods [59,74]. Another option would be to utilize the Long Short-Term Memory (LSTM) models [72,75–80], specially designed to overcome the vanishing gradient effect by using special memory cells instead of context neurons. LSTM networks have been successfully applied in speech recognition, machine translation, and anomaly detection. However, little attention has been paid to applications of LSTM to dynamical system modeling problems [81].

2. Bifurcations of recurrent neural network dynamics [82–84]. Since the recurrent neural network is a dynamical system itself, its phase portrait might undergo qualitative changes during the training. If these changes affect the actual predicted trajectories, this might lead to significant changes of the error in response to small changes of the parameters (i.e., the gradient norm becomes very large), provided the duration of these trajectories is large enough.
In order to guarantee a complete absence of bifurcations during the network training, we would need a very good initial guess for its parameters, so that the model would already possess the desired asymptotic behavior. Since this assumption is very unrealistic, it seems more reasonable to modify the optimization methods in order to enforce their stability.

3. Spurious valleys in the error surface [85–87]. These valleys are called spurious due to the fact that they do not depend on the desired values of the outputs ỹ(t_k). The location of these valleys is determined only by the initial conditions z(t_0) and the controls u(t_k). Reasons for the occurrence of such valleys have been investigated in some special cases. For example, if the initial state z(t_0) of (2.13) is a global repeller within some area of the parameter space, then an infinitesimal control u(t_k) causes the model states z(t_k) to tend to infinity, which in turn leads to an unbounded error growth. Now assume that this area of the parameter space contains a line along which the connection weights between the controls u(t_k) and the neurons of F are identically zero, that is, the recurrent neural network (2.13) does not depend on the controls. Parameters along this
line result in a stationary behavior of the model states, z(t_k) ≡ z(t_0), and the corresponding prediction error remains relatively low. Such a line represents a spurious valley in the error surface.
It is worth mentioning that this problem can be alleviated by the use of a large number of trajectories for the training. Since these trajectories have different initial conditions and different controls, the corresponding spurious valleys are also located in different areas of the parameter space. Hence, these valleys are smoothed out on the surface of the total error function (2.25). In addition, we might apply regularization methods so as to modify the error function, which results in valleys "tilted" in some direction.

2.3 DYNAMIC NEURAL NETWORK ADAPTATION METHODS

2.3.1 Extended Kalman Filter

Another class of learning algorithms for dynamic networks can be built based on the concept of an extended Kalman filter.

The standard Kalman filter algorithm is designed to work with linear systems. Namely, the following model of the dynamical system in the state space is considered:

z(t_{k+1}) = F^−(t_{k+1}) z(t_k) + ζ(t_k),
ỹ(t_k) = H(t_k) z(t_k) + η(t_k).

Here ζ(t_k) and η(t_k) are Gaussian noises with zero mean and covariance matrices Q(t_k) and R(t_k), respectively.

The algorithm is initialized as follows. For k = 0, set

ẑ(t_0) = E[z(t_0)],
P(t_0) = E[(z(t_0) − E[z(t_0)]) (z(t_0) − E[z(t_0)])^T].

Then, for k = 1, 2, …, the following values are calculated:

• state estimation

ẑ^−(t_k) = F^−(t_k) ẑ(t_{k−1});

• estimation of the error covariance

P^−(t_k) = F^−(t_k) P(t_{k−1}) F^−(t_k)^T + Q(t_{k−1});

• the gain matrix

G(t_k) = P^−(t_k) H(t_k)^T [H(t_k) P^−(t_k) H(t_k)^T + R(t_k)]^{−1};

• correction of the state estimation

ẑ(t_k) = ẑ^−(t_k) + G(t_k) (ỹ(t_k) − H(t_k) ẑ^−(t_k));

• correction of the error covariance estimation

P(t_k) = (I − G(t_k) H(t_k)) P^−(t_k).

However, the dynamic ANN model is a nonlinear system, so the standard Kalman filter algorithm is not suitable for it. If we use a linearization of the original nonlinear system, we obtain an extended Kalman filter (EKF) suitable for nonlinear systems.

To obtain the EKF algorithm, the model in the state space is written in the following form:

z(t_{k+1}) = f(t_k, z(t_k)) + ζ(t_k),
ỹ(t_k) = h(t_k, z(t_k)) + η(t_k).

Here ζ(t_k) and η(t_k) are Gaussian noises with zero mean and covariance matrices Q(t_k) and R(t_k), respectively. In this case

F^−(t_{k+1}) = ∂f(t_k, z)/∂z |_{z=z(t_k)},
H(t_k) = ∂h(t_k, z)/∂z |_{z=z(t_k)}.
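The five steps of the standard (linear) filter listed above can be collected into a single update function. The following is a minimal sketch of one Kalman filter step; the two-state system matrices and noise levels are illustrative assumptions, not from the text.

```python
import numpy as np

# One step of the linear Kalman filter bullets above, in the same order:
# prediction, covariance prediction, gain, state and covariance correction.
F = np.array([[1.0, 0.1], [0.0, 1.0]])        # F^-(t_k), illustrative
H = np.array([[1.0, 0.0]])                    # H(t_k): observe first state
Q, R = 0.01 * np.eye(2), np.array([[0.1]])    # process / observation noise

def kf_step(z_hat, P, y):
    z_pred = F @ z_hat                                       # state estimation
    P_pred = F @ P @ F.T + Q                                 # error covariance
    G = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)   # gain matrix
    z_new = z_pred + G @ (y - H @ z_pred)                    # state correction
    P_new = (np.eye(2) - G @ H) @ P_pred                     # covariance corr.
    return z_new, P_new

z_hat, P = kf_step(np.zeros(2), np.eye(2), np.array([0.5]))
print(z_hat.shape, bool(np.all(np.diag(P) > 0)))             # (2,) True
```

Iterating `kf_step` over k = 1, 2, … reproduces the recursion; replacing `F @ z_hat` and `H @ z_pred` with the nonlinear maps f and h (with F, H taken as their Jacobians) yields the EKF variant described next.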
The EKF algorithm is initialized as follows. For k = 0, set

ẑ(t_0) = E[z(t_0)],
P(t_0) = E[(z(t_0) − E[z(t_0)]) (z(t_0) − E[z(t_0)])^T].

Then, for k = 1, 2, …, the following values are calculated:

• state estimation

ẑ^−(t_k) = f(t_k, ẑ(t_{k−1}));

• estimation of the error covariance

P^−(t_k) = F^−(t_k) P(t_{k−1}) F^−(t_k)^T + Q(t_{k−1});

• the gain matrix

G(t_k) = P^−(t_k) H(t_k)^T [H(t_k) P^−(t_k) H(t_k)^T + R(t_k)]^{−1};

• correction of the state estimation

ẑ(t_k) = ẑ^−(t_k) + G(t_k) (ỹ(t_k) − h(t_k, ẑ^−(t_k)));

• correction of the error covariance estimation

P(t_k) = (I − G(t_k) H(t_k)) P^−(t_k).

We will assume that for an ideal ANN model the observed process is stationary, that is, w(t_{k+1}) = w(t_k), but its states (the weights w(t_k)) are "corrupted" by the noise ζ(t_k).

The Kalman filter (KF) in its standard version is applicable only to systems whose observations are linear in the estimated parameters, while the neural network observation equation is nonlinear, i.e.,

w(t_{k+1}) = w(t_k) + ζ(t_k),
ŷ(t_k) = f(u(t_k), w(t_k)) + η(t_k),

where u(t_k) are the control actions, ζ is the process noise, and η is the observation noise; these noises are Gaussian random sequences with zero mean and covariance matrices Q and R.

In order to use the Kalman filter, it is required to linearize the observation equation. It is possible to use statistical linearization, i.e., linearization with respect to the mathematical expectation. This gives

w(t_{k+1}) = w(t_k) + ζ(t_k),
ŷ(t_k) = H(t_k) w(t_k) + η(t_k),

where the observation matrix has the form

H(t_k) = ∂ŷ/∂w^T |_{w=w(t_k), z=z(t_k)} = −∂e(t_k)/∂w(t_k)^T = −J(t_k).

Here e(t_k) is the observation error vector at the kth estimation step, i.e.,

e(t_k) = ỹ(t_k) − ŷ(t_k) = ỹ(t_k) − f(z(t_k), w(t_k)).

The extended Kalman filter equations for estimating w(t_{k+1}) at the next step have the form

S(t_k) = H(t_k) P(t_k) H(t_k)^T + R(t_k),
K(t_k) = P(t_k) H(t_k)^T S(t_k)^{−1},
P(t_{k+1}) = (P(t_k) − K(t_k) H(t_k) P(t_k)) e^β + Q(t_k),
w(t_{k+1}) = w(t_k) + K(t_k) e(t_k).

Here β is the forgetting factor, which affects the significance of the previous steps; K(t_k) is the Kalman gain; S(t_k) is the covariance matrix of the prediction errors e(t_k); and P(t_k) is the covariance matrix of the weight estimation errors (ŵ(t_k) − w(t_k)).

There are alternative variants of the EKF algorithm, which may prove to be more effective in solving the problems under consideration, in particular

P^−(t_k) = P(t_k) + Q(t_k),
S(t_k) = H(t_k) P^−(t_k) H(t_k)^T + R(t_k),
K(t_k) = P^−(t_k) H(t_k)^T S(t_k)^{−1},
P(t_{k+1}) = (I − K(t_k) H(t_k)) P^−(t_k) (I − K(t_k) H(t_k))^T + K(t_k) R(t_k) K(t_k)^T,
w(t_{k+1}) = w(t_k) + K(t_k) e(t_k).
The variant of the EKF of this type is more stable in computational terms and is robust to rounding errors, which positively affects the computational stability of the learning process of the ANN model as a whole.

As can be seen from the relationships determining the EKF, the key point is again the calculation of the Jacobian J(t_k) of the network errors with respect to the adjusted parameters.

When training a neural network, it is impossible to use only the current measurement in the EKF due to the unacceptably low accuracy of the search (the effect of the noises ζ and η); it is necessary to form a vector estimate on the observation interval, and then the update of the matrix P(t_k) is more correct.

As a vector of observations, we can take a sequence of values on a certain sliding interval, i.e.,

ŷ(t_k) = [ŷ(t_{i−l}), ŷ(t_{i−l+1}), …, ŷ(t_i)]^T,

where l is the length of the sliding interval, the index i refers to the time point (sampling step), and the index k indicates the estimate number. The error of the ANN model will also be a vector value, i.e.,

e(t_k) = [e(t_{i−l}), e(t_{i−l+1}), …, e(t_i)]^T.

2.3.2 ANN Models With Interneurons

From the point of view of ensuring the adaptability of ANN models, the idea of an intermediate neuron (interneuron) and the subnetwork of such neurons (intersubnet) is very fruitful.

2.3.2.1 The Concept of an Interneuron and an ANN Model With Such Neurons

An effective approach to the implementation of adaptive ANN models, based on the concepts of an interneuron and a pretuned network, was proposed by A.I. Samarin [88]. As noted in this paper, one of the main properties of ANN models, which makes them an attractive tool for solving various applied problems, is that the network can change, adapting to the problem being solved. This kind of adjustment can be carried out in the following directions:

• the neural network can be trained, i.e., it can change the values of its tuning parameters (as a rule, the synaptic weights of the neural network connections);
• the neural network can change its structural organization by adding or removing neurons and rebuilding the interneural connections;
• the neural network can be dynamically tuned to the solution of the current task by replacing some of its constituent parts (subnets) with previously prepared fragments, or by changing the values of the network settings and its structural organization on the basis of previously prepared relationships linking the task to the required changes in the ANN model.

The first of these options leads to the traditional learning of ANN models, the second to the class of growing networks, and the third to networks with pretuning.

The most important limitation of the first of these approaches (ANN training) is that the network, before being trained, is potentially suitable for a wide class of problems, but after the completion of the learning process it can solve only a specific task; in the case of another task, the network has to be retrained, during which the skill of solving the previous task is lost.

The second approach (growing networks) allows one to cope with this problem only partially. Namely, if new training examples appear that do not fit into the ANN model obtained according to the first of the approaches, then this model is built up with new elements, with the addition of appropriate links, after which the network is trained additionally, without affecting the previously constructed part of it.
The third of the approaches (networks with pretuning) is the most powerful and, accordingly, the most complicated one. Following this approach, it is necessary either to organize the process of dynamic (i.e., directly during the operation of the ANN model) replacement of the components of the model with their alternative versions prepared in advance, corresponding to the changed task, or to organize the ANN model in the form of an integrated system in which there are special structural elements, called interneurons and intersubnets, whose function is to act on the operational elements of the network in such a way that their current characteristics meet the specifics of the particular task being solved at the given moment.

FIGURE 2.26 Intermediate elements in a network (compositional) model.

2.3.2.2 An Intersubnet as an Adaptation Tool for ANN Models

We can formulate the concept of a network model (NM), which generalizes the notion of the ANN model. An NM is a set of interrelated elements (NM elements) organized as network associations built according to certain rules from a very small number of primitives. One possible example of an NM element is a single artificial neuron.

If this approach is combined with the principle of minimalism, then the most promising way, as noted earlier, is the formation of a very limited set of basic NM elements. Then the variety of specific types of NM elements required to produce ANN models is formed as particular cases of the basic elements.

Processing elements of the NM come in two versions: working elements and intermediate elements. The most important difference between them is that the working elements convert the input data to the desired output of the ANN model, while the intermediate elements act on the working elements, for example, by adjusting the values of the parameters of the working elements, which in turn changes the character of the transformation realized by such an element.

Thus, intermediate elements are introduced into the NM as a tool of contextual impact on the parameters and characteristics of the NM working elements. Using intermediate elements is the most effective way to make a network (composite) model adaptive. The functional role of the intermediate elements is illustrated in Fig. 2.26, in which these elements are combined into an intermediate subnet (intersubnet). It is seen that the intermediate subnet receives the same inputs as the working subnet of the NM in question, which implements the basic algorithm for processing the input data. In addition, an intersubnet can also receive some additional information, referred to herein as the NM context. According to the received initial data (inputs of the ANN model + context), the intersubnet introduces adjustments to the working subnet in such a way that the working subnet corresponds to the changed task. The training of the intersubnet is carried out in advance, at the stage of formation of the ANN model, so that a change of the task to be solved does not require additional training (and, especially, retraining) of the working subnet; only its reconfiguration is performed.
that is, in the desired result. In other words, the formed, which requires a short time.
set of interacting work items implements the
algorithm for solving the required application 2.3.2.3 Presetting of ANN Models and Its
task. In contrast, intermediate elements of the Possible Variants
NM do not participate directly in the above al- We will distinguish two options for preset-
gorithm; their role is to act on the work items, for ting: a strong presetting and a weak presetting.
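The contextual-impact scheme just outlined can be made concrete with a small sketch (illustrative only, not from the source; all sizes, names, and the gain-based coupling are assumptions): an intersubnet maps the model inputs plus a context vector to multiplicative gains that reconfigure the hidden neurons of a pretrained working subnet, so switching the context changes the realized transformation without retraining the working weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Working subnet: a one-hidden-layer network, assumed already trained.
W1, b1 = rng.normal(size=(8, 3)), np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)

# Intersubnet: receives the same inputs plus the NM context and produces
# per-neuron gains that act on the working elements (hypothetical coupling).
Wc = 0.1 * rng.normal(size=(8, 5))  # 3 model inputs + 2 context variables

def intersubnet(x, context):
    z = np.concatenate([x, context])
    return 1.0 + np.tanh(Wc @ z)        # gains near 1.0: mild reconfiguration

def working_subnet(x, gains):
    h = np.tanh(W1 @ x + b1) * gains    # gains modulate each hidden neuron
    return W2 @ h + b2

x = np.array([0.2, -0.1, 0.5])          # inputs of the ANN model
y_a = working_subnet(x, intersubnet(x, np.array([1.0, 0.0])))  # context A
y_b = working_subnet(x, intersubnet(x, np.array([0.0, 1.0])))  # context B
```

Only the short intersubnet pass changes between tasks; the working subnet keeps its pretrained weights, which mirrors the reconfiguration-instead-of-retraining idea described above.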
2.3 DYNAMIC NEURAL NETWORK ADAPTATION METHODS 71

Strong presetting is oriented toward the adaptation of the ANN model over a wide range of conditions. A characteristic architectural feature of the ANN model in this case is the presence, among the processing elements of the NM, of insert elements alongside the working elements; these insert elements affect the parameters of the NM working elements. This approach allows implementing both parametric and structural adaptation of the ANN model.

FIGURE 2.27 Structural options for presetting the ANN model. (A) A sequential variant. (B) A parallel version.

Weak presetting does not use insert elements. With it, those fragments of the ANN model that change as the conditions change are distinguished, and these fragments are adjusted according to a two-stage scheme. For example, let the problem of modeling the motion of an aircraft be solved. As the basis of the required model, a system of differential equations is used that describes the motion of an aircraft. This system, according to the scheme presented in Section 5.2, is transformed into an ANN model. This is a general model, which should be refined in relation to a particular aircraft by specifying the specific values of its geometric, mass, inertial, and aerodynamic characteristics. The most difficult problem is the specification of the aerodynamic characteristics of the simulated aircraft, due to incomplete and inaccurate knowledge of the corresponding quantities. In this situation, it is advisable to represent these characteristics as a two-component structure: the first component is based on a priori knowledge (for example, on data obtained by experiments in a wind tunnel), and the second contains refining data obtained directly in flight. The presetting of the ANN model in this case relies on the fact that during the transition from the simulation of one particular aircraft to another, the part of the description of the aerodynamic characteristics in the ANN model that is based on a priori knowledge is replaced. The refining part of this description is an instrument of adaptation of the ANN model, which is already implemented in the process of functioning of the modeled object.

In both variants, sequential and parallel, the a priori model is trained in advance, in off-line mode, using the available knowledge about the modeled object. The refinement model is adjusted directly in the process of the object's operation, on the basis of data received online.

In the sequential version (Fig. 2.27A), the output of the a priori model f̂(x) corresponding to a particular value of the input vector x is the input for the refinement model realizing the transformation f(f̂(x)).

In the parallel version (Fig. 2.27B), the a priori and refinement models act independently of each other, calculating the estimate f̂(x) corresponding to a particular value of the input vector x and the initial knowledge of the modeled object, as well as the correction Δf(x) for the same value of the input vector x, taking into account the data that became available in the process of object functioning. The required value of f(x) is the sum of these components, i.e., f(x) = f̂(x) + Δf(x).

It should be emphasized that the neural network implementation of the a priori and refining models is, as a rule, different from the point of view of the adopted architectural solutions, although in a particular case it may be the same; for example, both models can be constructed in
the form of multiperceptrons with sigmoid activation functions. This allows us to meet most effectively the requirements, which are, generally speaking, different for the a priori and refining models. In particular, the main requirement for the a priori model is the ability to represent complex nonlinear dependencies with the required accuracy, while the time spent on learning such a model is uncritical, since this training is carried out in an autonomous (off-line) mode. At the same time, the refining model in its work must fit into the very rigid framework of the real (or even advanced) time scale. For this reason, in particular, in the vast majority of cases ANN architectures that require full retraining even with minor changes in the training data will be unacceptable. In such a situation, an incremental approach to the training of ANN models is more appropriate, allowing us not to retrain the entire network, but only to correct those elements that are directly related to the changed training data.

2.3.3 Incremental Formation of ANN Models

One of the tools for adapting ANN models is incremental formation, which exists in two variants: parametric and structural-parametric.

With the parametric version of incremental formation, the structural organization of the ANN model is set immediately and fixed, after which the model is incrementally adjusted (basic or additional learning) in several stages, for example, to extend the domain of operation modes of the dynamical system in which the model operates with the required accuracy.

For example, if we take a full spatial model of the aircraft motion, taking into account both its trajectory and angular motion, then in accordance with the incremental approach, first an off-line training of this model is carried out for a relatively small subdomain of the values of state and control variables; then, in the online mode, an incremental learning process of the ANN model is performed, during which at each step the subregion in which the model is operational is extended, in order to eventually expand the given subdomain to the full domain of the variables.

In the structural-parametric version of the incremental model formation procedure, at first a "truncated" ANN model is constructed. This preliminary model has only a part of the state variables as its inputs, and it is trained on a dataset that covers only a subset of the domain of definition. This initial model is then gradually expanded by introducing new variables into it, followed by further training.

For example, the initial model is the model of the longitudinal angular motion of the aircraft, which is then expanded by adding trajectory longitudinal motion, after which lateral motion components are added to it; that is, the model is brought to the desired full model of the spatial motion in a few steps.

The structural-parametric variant of the incremental formation of ANN models allows us to start with a simple model, sequentially complicating it, for example, according to the scheme

material point
⇓
rigid body
⇓
elastic body
⇓
a set of coupled rigid and/or elastic bodies

This makes it possible to build up the model step-by-step in a structural sense.
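The division of labor described above, an a priori model trained off-line and then frozen, plus a fast refining model adjusted from data arriving during operation, can be sketched in code; the parallel variant f(x) = f̂(x) + Δf(x) is used here, and the models and data are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def f_hat(x):
    # A priori model: stands in for a network pretrained off-line, then frozen.
    return np.sin(x)

def features(x):
    # The refining model is linear in its parameters, so its on-line adjustment
    # is a cheap least-squares update rather than a full network retraining.
    return np.stack([np.ones_like(x), x, x**2], axis=-1)

w = np.zeros(3)                          # corrector parameters (start at zero)

def f(x):
    return f_hat(x) + features(x) @ w    # parallel scheme: f = f_hat + delta_f

# On-line data: the real object drifted away from the a priori description.
x_obs = rng.uniform(-1.0, 1.0, 200)
y_obs = np.sin(x_obs) + 0.3 * x_obs**2   # true system = a priori part + drift

# Fit only the corrector to the residuals; f_hat is never touched.
w, *_ = np.linalg.lstsq(features(x_obs), y_obs - f_hat(x_obs), rcond=None)

err = np.max(np.abs(f(x_obs) - y_obs))
```

The corrector absorbs the drift exactly here because the drift lies in the feature span; in general it only has to represent the (small) discrepancy with the a priori model, which is what keeps its on-line adjustment within a rigid time budget.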
2.4 TRAINING SET ACQUISITION PROBLEM FOR DYNAMIC NEURAL NETWORKS

2.4.1 Specifics of the Process of Forming Data Sets Required for Training Dynamic Neural Networks

Getting a training set that has the required level of informativeness is a critical step in solving the problem of forming an ANN model. If some features of the dynamics (behavior) are not reflected in the training set, they, accordingly, will not be reproduced by the model. In one of the fundamental guidelines for the identification of systems, this provision is formulated as the Basic Identification Rule: "If it is not in the data, it cannot be identified" (see [89], page 85).

The training data set required for the formation of the dynamical system ANN model should be informative (representative). For the time being we will assume that the training set is informative if the data contained in it are sufficient to produce an ANN model that, with the required level of accuracy, reproduces the behavior of the dynamical system over the entire range of possible values of the quantities, and their derivatives, that characterize this behavior. To ensure the fulfillment of this condition, when forming a training set it is required to obtain data not only about changes in the quantities but also about the rate of their change; i.e., we can assume that the training set has the required informativeness if the ANN model obtained with its use reproduces the behavior of the system over the whole range of changes not only in the values of the quantities characterizing the behavior of the dynamical system but also in their derivatives (and also over all admissible combinations of both the quantities and the values of their derivatives).

Such an intuitive understanding of the informativeness of the training set will be refined further.

2.4.2 Direct Approach to the Process of Forming Data Sets Required for Training Dynamic Neural Networks

2.4.2.1 General Characteristics of the Direct Approach to the Forming of Training Data Sets

We will clarify the concept of the informative content of the training set, and we will also estimate the volume required to provide the necessary level of informativeness. First, we will perform these actions in the framework of a direct approach to solving the problem of the formation of a training set; in the next section, the concept will be extended to an indirect approach.

Consider a controllable dynamical system of the form

ẋ = F(x, u, t),   (2.99)

where x = (x1, x2, ..., xn) are the state variables, u = (u1, u2, ..., um) are the control variables, and t ∈ T = [t0, tf] is time.

The variables x1, x2, ..., xn and u1, u2, ..., um, taken at a particular moment in time tk ∈ T, characterize, respectively, the state of the dynamical system and the control actions on it at a given time. Each of these variables takes values from the corresponding domain, i.e.,

x1(tk) ∈ X1 ⊂ R, ..., xn(tk) ∈ Xn ⊂ R;
u1(tk) ∈ U1 ⊂ R, ..., um(tk) ∈ Um ⊂ R.   (2.100)

In addition, there are, as a rule, restrictions on the values of the combinations of these variables, i.e.,

x = ⟨x1, ..., xn⟩ ∈ RX ⊂ X1 × ··· × Xn,
u = ⟨u1, ..., um⟩ ∈ RU ⊂ U1 × ··· × Um,   (2.101)

as well as on blends of these combinations,

⟨x, u⟩ ∈ RXU ⊂ RX × RU.   (2.102)

The example included in the training set should show the response of the DS to some
combination of ⟨x, u⟩. By a reaction of this kind we will understand the state x(tk+1) to which the dynamical system (2.99) passes from the state x(tk) under the value u(tk) of the control action, written

x(tk), u(tk) −−F(x,u,t)−−→ x(tk+1).   (2.103)

Accordingly, some example p from the training set P will include two parts, namely, the input (the pair x(tk), u(tk)) and the output (the reaction x(tk+1)) of the dynamical system.

2.4.2.2 Informativity of the Training Set

The training set should (ideally) show the dynamical system responses to any combinations of ⟨x, u⟩ satisfying the condition (2.102). Then, according to the Basic Identification Rule (see page 73), the training set will be informative, that is, it will allow us to reproduce in the model all the specific behavior of the simulated DS.5

5 It should be noted that the availability of an informative training set provides a potential opportunity to obtain a model that will be adequate to the simulated dynamical system. However, this potential opportunity must still be taken advantage of, which is a separate nontrivial problem, whose successful solution depends on the chosen class of models and learning algorithms.

Let us clarify this situation. We introduce the notation

pi = {x^(i)(tk), u^(i)(tk), x^(i)(tk+1)},   (2.104)

where pi ∈ P is the ith example from the training set P. In this example

x^(i)(tk) = (x1^(i)(tk), ..., xn^(i)(tk)),
u^(i)(tk) = (u1^(i)(tk), ..., um^(i)(tk)).   (2.105)

The response x^(i)(tk+1) of the considered dynamical system to the example pi is

x^(i)(tk+1) = (x1^(i)(tk+1), ..., xn^(i)(tk+1)).   (2.106)

In a similar way, we introduce one more example pj ∈ P:

pj = {x^(j)(tk), u^(j)(tk), x^(j)(tk+1)}.   (2.107)

The source data of the examples pi and pj are considered as not coincident, i.e.,

x^(i)(tk) ≠ x^(j)(tk),  u^(i)(tk) ≠ u^(j)(tk).

In the general case, the dynamical system responses to the original data from these examples do not coincide either, i.e.,

x^(i)(tk+1) ≠ x^(j)(tk+1).

We introduce the concept of ε-proximity for a pair of examples pi and pj. Namely, we will consider the examples pi and pj ε-close if the following condition is satisfied:

‖x^(i)(tk+1) − x^(j)(tk+1)‖ ⩽ ε,   (2.108)

where ε > 0 is a predefined real number.

We select from the set of examples P = {pi}, i = 1, ..., Np, a subset consisting of those examples ps for which the ε-proximity relation to the example pi is satisfied, i.e.,

‖x^(i)(tk+1) − x^(s)(tk+1)‖ ⩽ ε,  ∀s ∈ Is ⊂ I.   (2.109)

Here Is is the set of indices (numbers) of those examples for which ε-proximity is satisfied with respect to the example pi, while Is ⊂ I = {1, ..., Np}.

We call the example pi an ε-representative6 if for the whole collection of examples ps, ∀s ∈ Is, that is, for any example ps, s ∈ Is, the condition of ε-proximity is satisfied. Accordingly, we can now replace the collection of examples {ps}, s ∈ Is, by the single ε-representative pi, and the error introduced by such a replacement will not exceed ε.

6 This means that the example pi is included in the set of examples {ps}, s ∈ Is.

The input parts of the collections of examples
2.4 TRAINING SET ACQUISITION PROBLEM FOR DYNAMIC NEURAL NETWORKS 75
(s)
{ps }, s ∈ Is , allocate the subdomain RXU , s ∈ In Eq. (2.111), ϕ(·) is a nonlinear vector func-
Is , in the domain RXU defined by the relation tion of the vector arguments x, u and the scalar
(2.102); in this case argument t. It is assumed to be given and be-
longs to some class of functions that admits the
Np
) (s)
existence of a solution of Eq. (2.111) for given
RXU = RXU . (2.110) x(t0 ) and u(t) in the considered part of the space
s=1 of states for the plant.
The behavior of the plant, determined by its
Now we can state the task of forming a train- dynamic properties, can be influenced by set-
ing set as a collection of ε-representatives that ting a correction value for the control variable
covers the domain RXU (2.102) of all possible u(x, u∗ ). The operation of forming the required
values of pairs x, u. value u(x, u∗ ) for some time ti+1 from the val-
The relation (2.110) is the ε-covering condi- ues of the state vector x and the command con-
tion for the training set P of the domain RXU . trol vector u∗ at the time instant ti
A set P carrying out an ε-covering of the domain
RXU will be called ε-informative or, for brevity, u(ti+1 ) = (x(ti ), u∗ (ti )) (2.112)
simply informative.
If the training set P has ε-informativity, this we will perform in the device, which we call the
means that for any pair x, u ∈ RXU there is correcting controller (CC). We assume that the
at least one example pi ∈ P which is an ε- character of the transformation (·) in (2.112) is
representative for a given pair. determined by the composition and values of
With respect to the ε-covering (2.110) of the the components of a certain parameter vector
domain RXU , the following two problems can be w = (w1 w2 . . . wNw ). The set (2.111), (2.112) from
formulated: the plant and CC is referred to as a controlled
system.
1. Given the number of examples Np in the
The behavior of the system (2.111), (2.112)
training set P , find their distribution in the
with the initial conditions x0 = x(t0 ) under the
domain RXU which minimizes the error ε. control u(t) is a multistep process if we assume
2. A permissible error value ε is given; obtain a that the values of this process x(tk ) are observed
minimal collection of a number of Np exam- at time instants tk , i.e.,
ples which ensures that ε is obtained.
{x(tk )} , tk = t0 + kt ,
2.4.2.3 Example of Direct Formation of
t f − t0 (2.113)
Training Set k = 0, 1, . . . , Nt , t = .
Nt
Suppose that the controlled object under con-
sideration (plant) is a dynamical system de- In the problem (2.111), (2.112), as a teaching
scribed by a vector differential equation of the example, generally speaking, we could use a
form [91,92] pair
ẋ = ϕ(x, u, t). (2.111) (e)
(x0 , u(e) (t)), {x (e) (tk ) , k = 0, 1, . . . , Nt } ,
Here, x = (x1 x2 . . . xn ) ∈ Rn is the vector of state
(e)
variables of the op-amp; u = (u1 u2 . . . um ) ∈ Rm where (x0 , u(e) (t)) is the initial state of the sys-
is a vector of control variables of the op-amp; tem (2.111) and the formed control law, respec-
Rn , Rm are Euclidean spaces of dimension n and tively, and {x (e) (tk ) , k = 0, 1, . . . , Nt } is the mul-
m, respectively; t ∈ [t0 , tf ] is the time. tistep process (2.113), which should be carried
out given the initial state x0^(e) under the influence of some control u^(e)(t) on the time interval [t0, tf]. Comparing the process {x^(e)(tk)} with the process {x(tk)} obtained for the same initial conditions x0^(e) and control u^(e)(t) (that is, for some fixed value of the parameters w), it would be possible in some way to determine the distance between the required and actually implemented processes, and then try to minimize it by varying the values of the parameters w. This kind of "straightforward" approach, however, leads to a sharp increase in the amount of computation at the stage of training of the ANN and, in particular, at the stage of the formation of the corresponding training set.

There is, however, the possibility of drastically reducing these volumes of calculations if we take advantage of the fact that the state into which the system (2.111), (2.112) passes during the time Δt = ti+1 − ti depends only on its state x(ti) at the time ti and on the value u(ti) of the control action at the same instant of time. This circumstance gives grounds to replace the multistep process {x^(e)(tk)}, k = 0, 1, ..., Nt, by a set of Nt one-step processes, each of which consists, in (2.111), (2.112), of one step in time of length Δt from some initial point x(tk).

In order to obtain a set of initial points ⟨x(t0), u(t0)⟩ which completely characterizes the behavior of the system (2.111), (2.112) over the whole range of admissible values RXU ⊆ X × U, x ∈ X, u ∈ U, we construct the corresponding grid.

Let the state variables xi, i = 1, ..., n, in Eq. (2.111) take values from the ranges defined for each of them, i.e.,

xi^min ⩽ xi ⩽ xi^max,  i = 1, ..., n.   (2.114)

Similar inequalities hold for the control variables uj, j = 1, ..., m, in (2.111), i.e.,

uj^min ⩽ uj ⩽ uj^max,  j = 1, ..., m.   (2.115)

We define on these ranges a grid {Δ^(i), Δ^(j)} as follows:

Δ^(i): xi^(si) = xi^min + si Δxi,  i = 1, ..., n;  si = 0, 1, ..., Ni,
Δ^(j): uj^(pj) = uj^min + pj Δuj,  j = 1, ..., m;  pj = 0, 1, ..., Mj.   (2.116)

In the expressions (2.116), we have

Δxi = (xi^max − xi^min)/Ni,  i = 1, ..., n,
Δuj = (uj^max − uj^min)/Mj,  j = 1, ..., m.

Here Ni is the number of segments into which the range of values of the state variable xi, i = 1, ..., n, is divided, and Mj is the number of segments into which the range of values of the control variable uj, j = 1, ..., m, is divided.

The nodes of this grid are tuples of length (n + m) of the form ⟨xi^(si), uj^(pj)⟩, where the components xi^(si), i = 1, ..., n, are taken from the corresponding Δ^(i), and the components uj^(pj), j = 1, ..., m, from Δ^(j) in (2.116). If the domain RXU is a proper subset of the Cartesian product X × U, then this fact can be taken into account by excluding the "extra" tuples from the grid (2.116).

In [90] an example of the solution of the ANN modeling problem was considered in which the training set was formed according to the method presented above. The source model of motion in this example is a system of equations of the following form:

m(V̇z − qVx) = Z,
Iy q̇ = M,   (2.117)

where Z is the aerodynamic normal force, M is the aerodynamic pitching moment, q is the pitch angular velocity, m is the aircraft mass, Iy is the pitch moment of inertia, and Vx and Vz are the longitudinal and normal velocities, respectively.
Here, the force Z and the moment M depend on the angle of attack α. However, in the case of a rectilinear horizontal flight the angle of attack equals the pitch angle θ. The pitch angle, in turn, is related to the velocity Vz and the airspeed V by the following kinematic dependence:

Vz = V sin θ.

Thus, the system of equations (2.117) is closed. The pitching moment M in (2.117) is a function of the all-moving stabilizer deflection angle, i.e., M = M(δe).

Thus, the system of equations (2.117) describes transient processes in angular velocity and pitch angle which arise immediately after a violation of the balancing corresponding to a steady horizontal flight.

So, in the particular case under consideration, the composition of the state and control variables is as follows:

x = [Vz q]^T,  u = [δe].   (2.118)

In terms of the problem (2.117), when the mathematical model of the controlled object is approximated, the inequalities (2.114) take the form

Vz^min ⩽ Vz ⩽ Vz^max,
q^min ⩽ q ⩽ q^max,   (2.119)

the inequality (2.115) is written as

δe^min ⩽ δe ⩽ δe^max,   (2.120)

and the grid (2.116) is rewritten in the following form:

Δ^(Vz): Vz^(sVz) = Vz^min + sVz ΔVz,  sVz = 0, 1, ..., NVz,
Δ^(q): q^(sq) = q^min + sq Δq,  sq = 0, 1, ..., Nq,   (2.121)
Δ^(δe): δe^(pδe) = δe^min + pδe Δδe,  pδe = 0, 1, ..., Mδe.

As noted above, each of the grid nodes (2.116) is used as the initial value x0 = x(t0), u0 = u(t0) for the system of equations (2.111); with these initial values, one step of integration is performed with the step size Δt. These initial values x(t0), u(t0) constitute the input vector of the training example, and the resulting value x(t0 + Δt) is the target vector, that is, the sample vector showing the learning algorithm of the ANN model what the output value of the network should be under the given starting conditions x(t0), u(t0).

The formation of a training set for solving the problem of neural network approximation of the dynamical system (2.111) (in particular, in its particular version (2.117)) is a nontrivial task. As the computational experiment [90] has shown, the convergence of the learning process is very sensitive to the grid steps Δxi, Δuj and the time step Δt.

We explain this situation by the example of the system (2.117), when

x1 = Vz,  x2 = q,  u1 = δe.

We represent, as shown in Fig. 2.28, the part of the grid {Δ^(Vz), Δ^(q)} whose nodes are used as initial values (the input part of the training example) to obtain the target part of the training example. In Fig. 2.28 a grid node is shown as a circle, and the cross is the state of the system (2.117) obtained by integrating its equations with a time step Δt from the initial conditions (Vz^(i), q^(j)), for a fixed position of the stabilizer δe^(k).

In a series of computational experiments it was established that for Δt = const, the conditions of convergence of the learning process of the neural controller are as follows:

|Vz(t0 + Δt) − Vz(t0)| < ΔVz,
|q(t0 + Δt) − q(t0)| < Δq,   (2.122)
FIGURE 2.28 Fragment of the grid {Δ^(Vz), Δ^(q)} for δe = const. ◦ – starting grid node; × – mesh target point; ΔVz, Δq are the grid spacings for the state variables Vz and q, respectively; ΔVz′, Δq′ is the shift of the target point relative to the grid node that spawned it (From [90], used with permission from Moscow Aviation Institute).

where ΔVz, Δq are the grid spacings (2.121) for the corresponding state variables for the given fixed value of δe.

The grid {Δ^(Vz), Δ^(q)} constructed for some fixed point δe^(p) from Δ^(δe) can be graphically depicted as shown in Fig. 2.29. Here, for each of the grid nodes (shown as circles), the corresponding target points are also represented (crosses). The set ("bundle") of such images, one for each value δe^(p) ∈ Δ^(δe), gives important information about the structure of the training set for the system (2.117), allowing us, in some cases, to significantly reduce the volume of this set.

Now, after the grid (2.116) (or (2.121), for the case of longitudinal short-period motion) is formed, we can build the corresponding training set, after which the problem of supervised learning of the network can be solved. This task was carried out in [90]. The results obtained in that paper show that the direct method of forming training sets can be successfully used for problems of small dimension (determined by the dimensions of the state and control vectors, and also by the magnitude of the ranges of admissible values of the components of these vectors).

2.4.2.4 Assessment of the Volume of the Training Set With a Direct Approach to Its Formation

Let us estimate the volume of the training set obtained with a direct approach to its formation. Let us first consider the simplest version of the direct one-step method of forming a training set, i.e., the one in which the reaction of the DS (2.106) at the time instant tk+1 depends on the values of the state and control variables (2.105) only at the time instant tk.

Let us consider this question for a specific example related to the problem which is solved in Section 6.2 (formation of the ANN model of the longitudinal short-period motion of a maneuverable aircraft). The initial model of motion in the form of a system of ODEs is written as follows:

α̇ = q − (q̄S/(mV)) CL(α, q, δe) + (g/V) cos θ,
q̇ = (q̄S c̄/Iy) Cm(α, q, δe),   (2.123)
T² δ̈e = −2Tζ δ̇e − δe + δe^act,

where α is the angle of attack, deg; θ is the pitch angle, deg; q is the pitch angular velocity, deg/sec; δe is the all-moving stabilizer deflection angle, deg; CL is the lift coefficient; Cm is the pitching moment coefficient; m is the mass of the aircraft, kg; V is the airspeed, m/sec; q̄ = ρV²/2 is the dynamic pressure, kg·m⁻¹·sec⁻²; ρ is the air density, kg/m³; g is the acceleration due to gravity, m/sec²; S is the wing area, m²; c̄ is the mean aerodynamic chord of the wing, m; Iy is the moment of inertia of the aircraft relative to the lateral axis, kg·m²; the dimensionless coefficients CL and Cm are nonlinear functions of their arguments; T and ζ are the time constant and the relative damping coefficient of the actuator; and δe^act is the command signal to the actuator of the all-moving controllable stabilizer (limited to ±25 deg). In the model (2.123), the variables α, q, δe, and δ̇e are the states of the controlled object, and the variable δe^act is the control.
FIGURE 2.29 Graphic grid representation {Δ^(Vz), Δ^(q)} for δe = const, combined with the target points; this grid sheet is built for δe = −8 deg (From [90], used with permission from Moscow Aviation Institute).

Let us carry out a discretization of the considered dynamical system as described in the previous section. In order to reduce the dimension of the problem, we will consider only the variables α, q, and δe^act, which directly characterize the behavior of the considered dynamical system, and treat the variables δe and δ̇e as "hidden" variables.

If the dependencies for δe and δ̇e are "hidden," then for the remaining variables α, q, and δe^act we set the numbers of samples Nα, Nq, Mδeact for these variables. Assuming that all combinations of the values of these variables are admissible, the quantity N = Nα · Nq · Mδeact, i.e., the number of examples in the training set for different values of the numbers of samples Nα, Nq, Mδeact (for simplicity, we assume that Nα = Nq = Mδeact = N), is

N = 20: 20 × 20 × 20 = 8000,
N = 25: 25 × 25 × 25 = 15625,   (2.124)
N = 30: 30 × 30 × 30 = 27000.

If not only the variables α, q, and δe^act but also δe and δ̇e are required in the dynamical system model to be formed, then these estimates of the volume of the training set take the form

N = 20: 20 × 20 × 20 × 20 × 20 = 3200000,
N = 25: 25 × 25 × 25 × 25 × 25 = 9765625,   (2.125)
N = 30: 30 × 30 × 30 × 30 × 30 = 24300000.

As we can see from these estimates, from the point of view of the volume of the training set, only the variants related to dynamical systems with state and control vectors of small size and with a moderate number of samples with respect to these variables are acceptable (the first and second variants in (2.124)). Even a slight increase in the values of these parameters leads, as can be seen from (2.125), to unacceptable values of the volume of the training set.

In real-world applied problems, where the possibilities of ANN modeling are particularly in demand, the result is even more impressive. In particular, in the full model of the angular motion of the aircraft (the corresponding ANN model for this case is considered in Section 6.3 of Chapter 6), we have 14 state variables and 3 control variables; hence the volume of the training set for it, with the direct approach to its formation and with 20 samples per variable, will be N = 20¹⁷ ≈ 1.3 · 10²², which, of course, is completely unacceptable.

Thus, the direct approach to the formation of training sets for modeling dynamical systems has a very small "niche" in which its application is possible: simple problems of low dimensionality. An alternative, indirect approach is better suited for complex high-dimensional problems. This approach is based on the application of a set of specially designed control signals to the dynamical system of interest and is discussed in more detail in the next section. The indirect approach has its advantages and disadvantages. It is the only viable option in situations where the training data acquisition is required to be performed in

2.4.3 Indirect Approach to the Acquisition of Training Data Sets for Dynamic Neural Networks

2.4.3.1 General Characteristics of the Indirect Approach to the Acquisition of Training Data Sets

As noted in the previous section, the indirect approach is based on the application of a set of specially designed control signals to the dynamical system, instead of direct sampling of the domain RXU of feasible values of the state and control variables.

With this approach, the actual motion (x(t), u(t)) of the dynamical system is composed of a program motion (test maneuver) (x*(t), u*(t)), generated by the control signal u*(t), and the motion (x̃(t), ũ(t)), generated by the additional perturbing action ũ(t), i.e.,

x(t) = x*(t) + x̃(t),  u(t) = u*(t) + ũ(t).   (2.126)

Examples of test maneuvers include:
• a straight-line horizontal flight with a constant speed;
• a flight with a monotonically increasing angle of attack;
• a U-turn in the horizontal plane;
• an ascending/descending spiral.

Possible variants of the test perturbing actions ũ(t) are considered below.

The type of the test maneuver (x*(t), u*(t)) in (2.126) determines the obtained ranges of the values of the state and control variables; the perturbing action ũ(t) determines the variety of examples within these ranges.

What is the ideal form of a training set and how can it be obtained in practice using an indi-
real or even in advanced time. However, in cases
rect approach? We consider this issue in several
when there are no rigid time restrictions regard-
stages, starting with the simplest version of the
ing the acquisition and processing of training dynamical system and proceeding to more com-
data, the most appropriate approach is a mixed plex versions.
one, which is a combination of direct and indi- We first consider a simpler case of an uncon-
rect approaches. trolled dynamical system (Fig. 2.30).
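The volume estimates of the form (2.125) can be reproduced with a few lines of code. The sketch below is illustrative (not from the original text; the function name is ours) and computes the direct-approach training set volume as (samples per variable) raised to the number of sampled variables:

```python
def direct_training_set_volume(samples_per_variable: int, n_variables: int) -> int:
    """Training set size in the direct approach, where every combination
    of sampled variable values yields one training example."""
    return samples_per_variable ** n_variables

# Five sampled variables (alpha, q, delta_e_act, delta_e, d/dt delta_e),
# as in the estimates (2.125):
for n in (20, 25, 30):
    print(n, direct_training_set_volume(n, 5))

# Full angular-motion model: 14 state + 3 control variables,
# 20 samples per variable -- about 1.3e22 examples:
print(direct_training_set_volume(20, 17))
```

Even at the modest resolution of 20 samples per variable, each additional variable multiplies the volume by a factor of 20, which is what rules out the direct approach for the 17-variable model.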
2.4 TRAINING SET ACQUISITION PROBLEM FOR DYNAMIC NEURAL NETWORKS 81

FIGURE 2.30 Uncontrolled dynamical system. (A) Without external influences. (B) With external influences.

Suppose that there is some dynamical system, that is, a system whose state varies with time. This dynamical system is uncontrolled; its behavior is affected only by the initial conditions and, possibly, by some external influences (the impact of the environment in which, and in interaction with which, the dynamical system realizes its behavior). An example of such a dynamical system is an artillery shell, whose flight path is determined by the initial conditions of firing. The impact of the medium in this case is determined by the gravitational field in which the projectile moves and also by the atmosphere.

The state of the dynamical system in question at a particular moment in time t ∈ T = [t0, tf] is characterized by a set of values x = (x1, ..., xn). The composition of this set of quantities, as noted above, is determined by the questions that are asked about the considered dynamical system.

At the initial instant of time t0 ∈ T, the state of the dynamical system takes on the value x^0 = x(t0) = (x1^0, ..., xn^0), where x^0 = x(t0) ∈ X.

Since the variables {xi}, i = 1, ..., n, describe a dynamical system, they, by the definition of a dynamical system, vary with time; that is, the dynamical system is characterized by a set of variables {xi(t)}, i = 1, ..., n, t ∈ T. This set will be called the behavior (phase trajectory, or trajectory in the state space) of the dynamical system.

The behavior of an (uncontrolled) dynamical system is determined, as already noted, by its initial state {xi(t0)}, i = 1, ..., n, and by "the nature of the dynamical system," i.e., by the way in which the variables xi are related to each other in the evolution law (the law of functioning) of the dynamical system F(x, t). This evolution law determines the state of the dynamical system at time (t + Δt) if its states at previous time instants are known.

2.4.3.2 Formation of a Set of Test Maneuvers

It was noted above that the program motion (reference trajectory) selected as part of the test maneuver determines the range of values of the state variables in which the training data will be obtained. It is required to choose a set of reference trajectories that covers the whole range of variation of the values of the state variables of the dynamical system. The required number of trajectories in such a collection is determined from the condition of ε-proximity of the phase trajectories of the dynamical system, i.e.,

‖xi(t) − xj(t)‖ ≤ ε,   xi(t), xj(t) ∈ X,   t ∈ T.    (2.127)

We define a family of reference trajectories of the dynamical system,

{x*_i(t)}, i = 1, ..., NR,   x*_i(t) ∈ X,   t ∈ T.    (2.128)

We assume that the reference trajectory x*_i(t), i = 1, ..., NR, is an ε-representative of the family Xi ⊂ X of phase trajectories of the dynamical system in the domain Xi ⊂ X if for each of the phase trajectories x(t) ∈ Xi the following condition is satisfied:
set will be called the behavior (phase trajectory tion is satisfied:
82 2. DYNAMIC NEURAL NETWORKS: STRUCTURES AND TRAINING METHODS

xi∗ (t) − x(t)  ε, xi∗ (t) ∈ Xi , x(t) ∈ Xi , t ∈ T .


(2.129)
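The ε-proximity conditions (2.127)–(2.129) are easy to check numerically on sampled trajectories. The sketch below is illustrative code with hypothetical names (not from the original text), assuming trajectories are stored as lists of state tuples sampled at the same time instants:

```python
def max_norm_distance(traj_a, traj_b):
    """Largest Euclidean distance between corresponding states of two
    trajectories sampled at the same time instants."""
    return max(
        sum((a - b) ** 2 for a, b in zip(xa, xb)) ** 0.5
        for xa, xb in zip(traj_a, traj_b)
    )

def is_epsilon_representative(reference, family, eps):
    """Condition (2.129): every phase trajectory of the family stays within
    eps of the reference trajectory at all sampled time instants."""
    return all(max_norm_distance(reference, x) <= eps for x in family)

# A 1D reference trajectory and two nearby phase trajectories:
ref = [(0.0,), (1.0,), (2.0,)]
family = [
    [(0.1,), (1.1,), (2.1,)],
    [(0.0,), (0.9,), (1.8,)],
]
print(is_epsilon_representative(ref, family, eps=0.25))  # True
```

The number NR of reference trajectories needed for the covering condition (2.130) can then be found, for example, greedily: a new reference trajectory is added whenever some phase trajectory lies farther than ε from all references selected so far.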

The family of reference trajectories {x*_i(t)}, i = 1, ..., NR, of the dynamical system must be such that

⋃_{i=1}^{NR} Xi = X1 ∪ X2 ∪ ... ∪ XNR = X,    (2.130)

where X is the family (collection) of all phase trajectories (trajectories in the state space) potentially realized by the dynamical system in question. This condition means that the family of reference trajectories {x*_i(t)}, i = 1, ..., NR, should together represent all potentially possible variants of the behavior of the dynamical system in question. It can be treated as a completeness condition for the ε-covering of the domain of possible variants of the behavior of the dynamical system by the reference trajectories.

An optimal ε-covering problem for the domain X of possible variants of the dynamical system behavior can be stated, consisting in minimizing the number NR of reference trajectories in the set {x*_i(t)}, i = 1, ..., NR, i.e.,

{x*_i(t)}_{i=1}^{N*_R} = min_{NR} {x*_i(t)}_{i=1}^{NR},    (2.131)

which makes it possible to minimize the volume of the training set while preserving its informativeness.

A desirable condition (but one difficult to realize) is also

⋂_{i=1}^{NR} Xi = X1 ∩ X2 ∩ ... ∩ XNR = ∅.    (2.132)

2.4.3.3 Formation of Test Excitation Signal

As already noted, the type of test maneuver in (2.126) determines the resulting ranges of the state and control variables, while the kind of perturbing action provides a variety of examples within these ranges. In the following sections, we consider the questions of forming (for a given test maneuver) test excitation influences in such a way as to obtain an informative set of training data for a dynamical system.

FIGURE 2.31 Typical test excitation signals used in the study of the dynamics of controllable systems. (A) Stepwise excitation. (B) Impulse excitation. From [109], used with permission from Moscow Aviation Institute.

TYPICAL TEST EXCITATION SIGNALS FOR THE IDENTIFICATION OF SYSTEMS

Elimination of uncertainties in the ANN model by refining (restoring) a number of elements included in it (for example, the functions describing the aerodynamic characteristics of the aircraft) is a typical system identification problem [44,93–99]. When solving identification problems for controllable dynamical systems, a number of typical test disturbances are used. Among them, the most common are the following [89,100–103]:

• stepwise excitation;
• impulse excitation;
• doublet (signal of type 1–1);
• triplet (signal of type 2–1–1);
• quadruplet (signal of type 3–2–1–1);
• random signal;
• polyharmonic signal.

Stepwise excitation (Fig. 2.31A) is a function u(t) that changes at a certain moment in time ti from u = 0 to u = u*, i.e.,

u(t) = { 0, t < ti;   u*, t ≥ ti.    (2.133)

Let u* = 1. Then (2.133) is the unit step function σ(t). With its use, another kind of test action can be defined – a rectangular pulse (Fig. 2.31B):

u(t) = A(σ(t) − σ(t − Tr)),    (2.134)

where A is the pulse amplitude and Tr = tf − ti is the pulse duration.

On the basis of the rectangular pulse signal (2.134), perturbing actions of oscillatory character are defined, consisting of a series of rectangular oscillations with a definite relationship between their periods. Among the most commonly used actions of this kind are the doublet (Fig. 2.32A), the triplet (Fig. 2.32B), and the quadruplet (Fig. 2.32C).

FIGURE 2.32 Typical test excitation signals used in the study of the dynamics of controllable systems. (A) Doublet (signal type 1–1). (B) Triplet (signal type 2–1–1). (C) Quadruplet (signal type 3–2–1–1). From [109], used with permission from Moscow Aviation Institute.

The doublet (also denoted as a signal of type 1–1) is one complete rectangular wave with a period T = 2Tr equal to twice the duration of the rectangular pulse.

A triplet (signal of type 2–1–1) is a combination of a rectangular pulse of duration T = 2Tr and a complete rectangular oscillation with a period T = 2Tr.

A quadruplet (signal of type 3–2–1–1) is formed from a triplet by adding at its origin a rectangular pulse of width T = Tr. In addition, we can also use triplet and quadruplet variants in which each of the constituent parts of the signal is a full-period oscillation (see Fig. 2.33). We will designate them as signals of type 2–1–1 and 3–2–1–1, respectively.

FIGURE 2.33 Modified versions of the test excitation signals used in the study of the dynamics of controllable systems. (A) Triplet (signal type 2–1–1). (B) Quadruplet (signal type 3–2–1–1). From [109], used with permission from Moscow Aviation Institute.

Another typical excitation signal is shown in Fig. 2.34A. Its values are kept constant on each time interval [ti, ti+1), i = 0, 1, ..., n − 1, and at the time instants ti they can change randomly. A signal of this type will be considered in more detail below, using as an example the problem of ANN simulation of the longitudinal angular motion of an aircraft.
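In discrete time these signals are easy to construct from the unit step. The sketch below is illustrative code (not from the original text): it implements σ(t), the rectangular pulse (2.134), and the doublet as one full rectangular wave of period 2Tr; the triplet and quadruplet can be assembled analogously by concatenating further pulses, under a chosen sign convention.

```python
def sigma(t):
    """Unit step function: 0 for t < 0, 1 for t >= 0; cf. (2.133) with u* = 1."""
    return 1.0 if t >= 0.0 else 0.0

def pulse(t, A, Tr):
    """Rectangular pulse of amplitude A and duration Tr, cf. (2.134):
    u(t) = A * (sigma(t) - sigma(t - Tr))."""
    return A * (sigma(t) - sigma(t - Tr))

def doublet(t, A, Tr):
    """Doublet (signal of type 1-1): one full rectangular wave of period
    2*Tr, built from two pulses of opposite sign."""
    return pulse(t, A, Tr) - pulse(t - Tr, A, Tr)

# Sample a doublet on a time grid:
ts = [i * 0.1 for i in range(-5, 25)]
u = [doublet(t, A=1.0, Tr=1.0) for t in ts]
```

Sampling `doublet` on a grid like `ts` above yields +A on [0, Tr), −A on [Tr, 2Tr), and 0 elsewhere.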
FIGURE 2.34 Test excitations as functions of time used in studying the dynamics of controlled systems. (A) A random signal. (B) A polyharmonic signal. Here, ϕact is the command signal for the elevator actuator (full-movable horizontal tail) of the aircraft from the example (2.123) on page 78. From [109], used with permission from Moscow Aviation Institute.

POLYHARMONIC EXCITATION SIGNALS FOR THE IDENTIFICATION OF SYSTEMS

To solve identification problems for dynamical systems, including aircraft, frequency-domain methods are successfully applied. The available results show [104–107] that for a given frequency range it is possible to effectively estimate the parameters of dynamical system models in real time.

The determination of the composition of the experiments for modeling the dynamical system in the frequency domain is an important part of the identification problem solving process. The experiments should be carried out with the aid of excitation signals, applied to the input of the dynamical system, that cover a certain predetermined frequency range.

In the case where dynamical system parameter estimation is performed in real time, it is desirable that the stimulating actions on the dynamical system be small. If this condition is met, the response of the dynamical system (in particular, an aircraft) to the exciting inputs will be comparable in intensity with its reaction, for example, to atmospheric turbulence. The test excitation will then be hardly distinguishable from the natural disturbances and will not cause unnecessary worries to the crew of the aircraft.

Modern aircraft, as one of the most important types of simulated dynamical systems, have a significant number of controls (control surfaces, etc.). When obtaining the data required for frequency analysis and dynamical system identification, it
is highly desirable to be able to apply a test excitation signal to all these controls at the same time, in order to reduce the total time required for data collection.

Schröder's work [108] showed the promise of using for these purposes a polyharmonic excitation signal, which is a set of sinusoids shifted in phase relative to each other. Such a signal makes it possible to obtain an excitation with a rich frequency content and a small value of the peak factor (amplitude coefficient). Such a signal is referred to as a Schröder sweep.

The peak factor is the ratio of the maximum amplitude of the input signal to the energy of the input signal. Inputs with small peak factor values are effective in that they provide a good frequency content of the dynamical system response without large amplitudes of the output signal (reaction) of the dynamical system in the time domain.

The paper [107] proposes a development of the approach to the formation of a Schröder sweep signal which makes it possible to obtain such a signal for the case of several controls used simultaneously, with optimization of the peak factor values for them. This development is oriented toward work in real time.

The excitation signals generated in [107] are mutually orthogonal in both the time and frequency domains; they are interpreted as perturbations that are additional to the values of the corresponding control inputs required for the realization of the given behavior of the dynamical system.

To generate test excitation signals, only a priori information is needed, in the form of approximate estimates of the frequency band inherent in the dynamical system in question, as well as of the relative effectiveness of the controls, for correctly scaling the amplitudes of the input signals.

GENERATION OF A SET OF POLYHARMONIC EXCITATION SIGNALS

The mathematical model of the input perturbation signal uj affecting the jth control is the harmonic polynomial

uj = ∑_{k∈Ij} Ak sin(2πkt/T + ϕk),   Ij ⊂ K, K = {1, 2, ..., M},    (2.135)

i.e., a finite linear combination of the fundamental harmonic A1 sin(ωt + ϕ1) and the higher-order harmonics A2 sin(2ωt + ϕ2), A3 sin(3ωt + ϕ3), and so on.

The input action for each of the m controls (for example, the control surfaces of the aircraft) is formed as a sum of harmonic signals (sinusoids), each of which has its own phase shift ϕk. The input signal uj corresponding to the jth control has the following form:

uj = ∑_{k∈Ij} Ak cos(2πkt/T + ϕk),   j = 1, ..., m,   Ij ⊂ K, K = {1, 2, ..., M},    (2.136)

where M is the total number of harmonically related frequencies, T is the time interval during which the test excitation signal acts on the dynamical system, and Ak is the amplitude of the kth sinusoidal component. The expression (2.136) is written in discrete time for N samples as

uj = {uj(0), uj(1), ..., uj(i), ..., uj(N − 1)},

where uj(i) = uj(t(i)).

Each of the m inputs (disturbing actions) is formed from sinusoids with frequencies

ωk = 2πk/T,   k ∈ Ij, Ij ⊂ K, K = {1, 2, ..., M},

where ωM = 2πM/T is the upper boundary of the frequency band of the exciting input signals. The interval [ω1, ωM] specifies the frequency range in which the dynamics of the aircraft under study is expected to lie.

If the phase angles ϕk in (2.136) are chosen randomly in the interval (−π, π], then, in general, the individual harmonic components (oscillations), being summed, can give at t(i) a value of the amplitude of the total signal uj(i) at which the conditions of proximity of the disturbed motion to the reference one are violated.

In (2.136), ϕk is the phase shift that must be selected for each of the harmonic components in such a way as to provide a small value of the peak factor (amplitude factor) PF(uj), which is defined by the relation

PF(uj) = (uj^max − uj^min) / (2 √(ujᵀ uj / N)),    (2.137)

or

PF(uj) = (uj^max − uj^min) / (2 rms(uj)) = ‖uj‖∞ / ‖uj‖2,    (2.138)

where the last equality holds only in the case when uj oscillates symmetrically with respect to zero. In the relations (2.137) and (2.138),

uj^min = min_i [uj(i)],   uj^max = max_i [uj(i)].

Since for an individual sinusoidal component in (2.135) the value of the peak factor equals PF = √2, the peak factor related to such a component, RPF(uj) (relative peak factor, relative amplitude factor), is defined as

RPF(uj) = (uj^max − uj^min) / (2√2 rms(uj)) = PF(uj) / √2.    (2.139)

Minimizing the index (2.139) by selecting appropriate phase shift values ϕk for all k makes it possible to prevent the situation mentioned above, in which the disturbed motion deviates from the reference one by an inadmissible amount.

GENERATION PROCEDURE FOR POLYHARMONIC EXCITATION SIGNALS

The procedure for forming a polyharmonic input for a given set of controls consists of the following steps.

1. Set the value of the time interval T during which a disturbing action will be applied to the input of the control object. The value of T determines the smallest value of the frequency resolution Δf = 1/T, as well as the minimum frequency limit fmin ≥ 2/T.

2. Set the frequency range [fmin, fmax] from which the frequencies of the disturbing actions for the dynamical system under consideration will be selected. It corresponds to the frequency range of the expected reactions of this system to the applied actions. These actions cover the interval [fmin, fmax] uniformly, with step Δf. The total number of frequencies used is

M = ⌊(fmax − fmin)/Δf⌋ + 1,

where ⌊·⌋ is the integer part of a real number.

3. Divide the set of indices K = {1, 2, ..., M} into subsets Ij ⊂ K, approximately equal in number of elements, each of which determines the set of frequencies for the corresponding jth control. This separation should be performed in such a way that the frequencies for different controls alternate. For example, for two controls the set K = {1, 2, ..., 12} is divided according to this rule into the subsets I1 = {1, 3, ..., 11} and I2 = {2, 4, ..., 12}, and for three controls into the subsets I1 = {1, 4, 7, 10}, I2 = {2, 5, 8, 11}, and I3 = {3, 6, 9, 12}. This approach ensures small peak factor values for the individual input signals and also allows uniform coverage of the frequency range [fmin, fmax] for each of these signals. If necessary, this kind of uniformity can be avoided, for example, if certain frequencies are to be emphasized or, conversely, some frequency components are to be eliminated (in particular, for fear of causing an undesired reaction of the control object). In the paper [106] it was established empirically that if the sets of indices Ij are formed in such a way that they contain numbers greater than 1 that are multiples of 2 or 3 (for example, k = 2, 4, 6 or k = 5, 10, 15, 20), then the phase shifts for them can be optimized in such a way that the relative peak factor for the corresponding input action will be very close to 1, and in some cases even less than 1. The distribution of indices over the subsets Ij must satisfy the conditions

⋃_j Ij = K, K = {1, 2, ..., M},   ⋂_j Ij = ∅.

Each index k ∈ K must be used exactly once. Compliance with this condition ensures mutual orthogonality of the input actions both in the time domain and in the frequency domain.

4. Generate, according to (2.136), the input action uj for each of the controls used, calculating the initial phase angle values ϕk according to the Schröder method, under the assumption of a uniform power spectrum.

5. Find the phase angle values ϕk for each of the input actions uj that minimize its relative peak factor.

6. For each of the input actions uj, perform a one-dimensional search to find a constant time offset such that the corresponding input signal starts at a zero value of its amplitude. This operation is equivalent to shifting the graph of the input signal along the time axis so that the point of intersection of this graph with the abscissa (time) axis coincides with the origin. The phase shift corresponding to such a displacement is added to the values of ϕk of all sinusoidal components (harmonics) of the considered input action uj. It should be noted that, to obtain a constant time shift of all components of uj, their phase shifts will differ in magnitude, since each component has its own frequency, different from the frequencies of the other components. Since all components of the signal uj are harmonics of the same fundamental frequency with the period of oscillation T, if the phase angles ϕk of all components are changed so that the initial value of the input signal is zero, then its value at the final moment of time will also be zero. In this case, the energy spectrum, orthogonality, and relative peak factor of the input signals remain unchanged.

7. Go back to step 5 and repeat the appropriate actions until either the relative peak factor reaches the prescribed value or the limit on the number of iterations of the process is reached. For example, the target value of the relative peak factor can be set to 1.01, and the maximum number of iterations to 50.

There are a number of methods that make it possible to optimize the frequency spectrum of the input (test) signals when solving the problem of estimating the parameters of a dynamical system. However, all these methods require a significant amount of computation, as well as a certain level of knowledge about the dynamical system being investigated, usually tied to a certain nominal state of the system. With respect to the situation considered in this chapter, such methods are of no use, because the task is to identify the dynamics of the system in real time for widely varying modes of its functioning. In addition, reconfiguring the control system in the event of failures and damage to the dynamical system requires solving the identification problem under significant and unpredictable changes in the dynamics of the system. Under such conditions, the laborious computation of an input action optimized with respect to the frequency spectrum does not make sense, and in some cases is impossible, since it does not fit into real time. Instead, the frequency spectrum of all generated input actions is selected in such a way that it
is uniform in a given frequency range, in order to exert a sufficient excitatory action on the dynamical system.

Step 6 of the process described above provides an input perturbation signal added to the main control action selected, for example, for balancing the airplane or for performing a predetermined maneuver.

REFERENCES

[1] Ollongren A. Definition of programming languages by interpreting automata. London, New York, San Francisco: Academic Press; 1974.
[2] Brookshear JG. Theory of computation: Formal languages, automata, and complexity. Redwood City, California: The Benjamin/Cummings Publishing Co.; 1989.
[3] Chiswell I. A course in formal languages, automata and groups. London: Springer-Verlag; 2009.
[4] Fu KS. Syntactic pattern recognition. London, New York: Academic Press; 1974.
[5] Fu KS. Syntactic pattern recognition and applications. Englewood Cliffs, New Jersey: Prentice Hall, Inc.; 1982.
[6] Fu KS, editor. Syntactic methods in pattern recognition, applications. Berlin, Heidelberg, New York: Springer-Verlag; 1977.
[7] Gonzalez RC, Thomason MG. Syntactic pattern recognition: An introduction. London: Addison-Wesley Publishing Company Inc.; 1978.
[8] Tutschku K. Recurrent multilayer perceptrons for identification and control: The road to applications. University of Würzburg, Institute of Computer Science, Research Report Series, Report No. 118; June 1995.
[9] Heister F, Müller R. An approach for the identification of nonlinear, dynamic processes with Kalman-filter-trained recurrent neural structures. University of Würzburg, Institute of Computer Science, Research Report Series, Report No. 193; April 1999.
[10] Haykin S. Neural networks: A comprehensive foundation. 2nd ed. Upper Saddle River, NJ, USA: Prentice Hall; 1998.
[11] Hagan MT, Demuth HB, Beale MH, De Jesús O. Neural network design. 2nd ed. PSW Publishing Co.; 2014.
[12] Graves A. Supervised sequence labelling with recurrent neural networks. Berlin, Heidelberg: Springer; 2012.
[13] Hammer B. Learning with recurrent neural networks. Berlin, Heidelberg: Springer; 2000.
[14] Kolen JF, Kremer SC. A field guide to dynamical recurrent networks. New York: IEEE Press; 2001.
[15] Mandic DP, Chambers JA. Recurrent neural networks for prediction: Learning algorithms, architectures and stability. New York, NY: John Wiley & Sons, Inc.; 2001.
[16] Medsker LR, Jain LC. Recurrent neural networks: Design and applications. New York, NY: CRC Press; 2001.
[17] Michel A, Liu D. Qualitative analysis and synthesis of recurrent neural networks. London, New York: CRC Press; 2002.
[18] Yi Z, Tan KK. Convergence analysis of recurrent neural networks. Berlin: Springer; 2004.
[19] Gupta MM, Jin L, Homma N. Static and dynamic neural networks: From fundamentals to advanced theory. Hoboken, New Jersey: John Wiley & Sons; 2003.
[20] Lin DT, Dayhoff JE, Ligomenides PA. Trajectory production with the adaptive time-delay neural network. Neural Netw 1995;8(3):447–61.
[21] Guh RS, Shiue YR. Fast and accurate recognition of control chart patterns using a time delay neural network. J Chin Inst Ind Eng 2010;27(1):61–79.
[22] Yazdizadeh A, Khorasani K, Patel RV. Identification of a two-link flexible manipulator using adaptive time delay neural networks. IEEE Trans Syst Man Cybern, Part B, Cybern 2010;30(1):165–72.
[23] Juang JG, Chang HH, Chang WB. Intelligent automatic landing system using time delay neural network controller. Appl Artif Intell 2003;17(7):563–81.
[24] Sun Y, Babovic V, Chan ES. Multi-step-ahead model error prediction using time-delay neural networks combined with chaos theory. J Hydrol 2010;395:109–16.
[25] Zhang J, Wang Z, Ding D, Liu X. H∞ state estimation for discrete-time delayed neural networks with randomly occurring quantizations and missing measurements. Neurocomputing 2015;148:388–96.
[26] Yazdizadeh A, Khorasani K. Adaptive time delay neural network structures for nonlinear system identification. Neurocomputing 2002;77:207–40.
[27] Ren XM, Rad AB. Identification of nonlinear systems with unknown time delay based on time-delay neural networks. IEEE Trans Neural Netw 2007;18(5):1536–41.
[28] Beale MH, Hagan MT, Demuth HB. Neural network toolbox: User's guide. Natick, MA: The MathWorks, Inc.; 2017.
[29] Čerňanský M, Beňušková L. Simple recurrent network trained by RTRL and extended Kalman filter algorithms. Neural Netw World 2003;13(3):223–34.
[30] Elman JL. Finding structure in time. Cogn Sci 1990;14(2):179–211.
[31] Elman JL. Distributed representations, simple recurrent networks, and grammatical structure. Mach Learn 1991;7:195–225.
[32] Elman JL. Learning and development in neural networks: the importance of starting small. Cognition 1993;48(1):71–99.
[33] Chen S, Wang SS, Harris C. NARX-based nonlinear system identification using orthogonal least squares basis hunting. IEEE Trans Control Syst Technol 2008;16(1):78–84.
[34] Sahoo HK, Dash PK, Rath NP. NARX model based nonlinear dynamic system identification using low complexity neural networks and robust H∞ filter. Appl Soft Comput 2013;13(7):3324–34.
[35] Hidayat MIP, Berata W. Neural networks with radial basis function and NARX structure for material lifetime assessment application. Adv Mater Res 2011;277:143–50.
[36] Wong CX, Worden K. Generalised NARX shunting neural network modelling of friction. Mech Syst Signal Process 2007;21:553–72.
[37] Potenza R, Dunne JF, Vulli S, Richardson D, King P. Multicylinder engine pressure reconstruction using NARX neural networks and crank kinematics. Int J Eng Res 2017;8:499–518.
[38] Patel A, Dunne JF. NARX neural network modelling of hydraulic suspension dampers for steady-state and variable temperature operation. Veh Syst Dyn: Int J Veh Mech Mobility 2003;40(5):285–328.
[39] Gaya MS, Wahab NA, Sam YM, Samsudin SI, Jamaludin IW. Comparison of NARX neural network and classical modelling approaches. Appl Mech Mater 2014;554:360–5.
[40] Siegelmann HT, Horne BG, Giles CL. Computational capabilities of recurrent NARX neural networks. IEEE Trans Syst Man Cybern, Part B, Cybern 1997;27(2):208–15.
[41] Kao CY, Loh CH. NARX neural networks for nonlinear analysis of structures in frequency domain. J Chin Inst Eng 2008;31(5):791–804.
[42] Billings SA. Nonlinear system identification: NARMAX methods in the time, frequency and spatio-temporal domains. New York, NY: John Wiley & Sons; 2013.
[43] Pearson PK. Discrete-time dynamic models. New York–Oxford: Oxford University Press; 1999.
[44] Nelles O. Nonlinear system identification: From classical approaches to neural networks and fuzzy models. Berlin: Springer; 2001.
[45] Sutton RS, Barto AG. Reinforcement learning: An introduction. Cambridge, Massachusetts: The MIT Press; 1998.
[46] Busoniu L, Babuška R, De Schutter B, Ernst D. Reinforcement learning and dynamic programming using function approximators. London: CRC Press; 2010.
[47] Kamalapurkar R, Walters P, Rosenfeld J, Dixon W. Reinforcement learning for optimal feedback control: A Lyapunov-based approach. Berlin: Springer; 2018.
[48] Lewis FL, Liu D. Reinforcement learning and approximate dynamic programming for feedback control. Hoboken, New Jersey: John Wiley & Sons; 2013.
[49] Gill PE, Murray W, Wright MH. Practical optimization. London, New York: Academic Press; 1981.
[50] Nocedal J, Wright S. Numerical optimization. 2nd ed. Springer; 2006.
[51] Fletcher R. Practical methods of optimization. 2nd ed. New York, NY, USA: Wiley-Interscience. ISBN 0-471-91547-5, 1987.
[52] Dennis J, Schnabel R. Numerical methods for unconstrained optimization and nonlinear equations. Society for Industrial and Applied Mathematics; 1996.
[53] Gendreau M, Potvin J. Handbook of metaheuristics. International series in operations research & management science. US: Springer. ISBN 9781441916655, 2010.
[54] Du K, Swamy M. Search and optimization by metaheuristics: Techniques and algorithms inspired by nature. Springer International Publishing. ISBN 9783319411927, 2016.
[55] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Teh YW, Titterington M, editors. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Proceedings of machine learning research, vol. 9. Chia Laguna Resort, Sardinia, Italy: PMLR; 2010. p. 249–56. http://proceedings.mlr.press/v9/glorot10a.html.
[56] Nocedal J. Updating quasi-Newton matrices with limited storage. Math Comput 1980;35:773–82.
[57] Conn AR, Gould NIM, Toint PL. Trust-region methods. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. ISBN 0-89871-460-5, 2000.
[58] Steihaug T. The conjugate gradient method and trust regions in large scale optimization. SIAM J Numer Anal 1983;20(3):626–37.
[59] Martens J, Sutskever I. Learning recurrent neural networks with Hessian-free optimization. In: Proceedings of the 28th International Conference on International Conference on Machine Learning. USA: Omnipress. ISBN 978-1-4503-0619-5, 2011. p. 1033–40. http://dl.acm.org/citation.cfm?id=3104482.3104612.
[60] Martens J, Sutskever I. Training deep and recurrent networks with Hessian-free optimization. In: Neural networks: Tricks of the trade. Springer; 2012. p. 479–535.
[61] Moré JJ. The Levenberg–Marquardt algorithm: Implementation and theory. In: Watson G, editor. Numerical analysis. Lecture notes in mathematics, vol. 630. Springer Berlin Heidelberg. ISBN 978-3-540-08538-6, 1978. p. 105–16.
[62] Moré JJ, Sorensen DC. Computing a trust region step. SIAM J Sci Stat Comput 1983;4(3):553–72. https://doi.org/10.1137/0904038.
[63] Bottou L, Curtis F, Nocedal J. Optimization methods for large-scale machine learning. SIAM Rev 2018;60(2):223–311. https://doi.org/10.1137/16M1080173.
[64] Griewank A, Walther A. Evaluating derivatives: Principles and techniques of algorithmic differentiation. 2nd ed. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. ISBN 0898716594, 2008.
[65] Griewank A. On automatic differentiation. In: Mathematical programming: Recent developments and applications. Kluwer Academic Publishers; 1989. p. 83–108.
[66] Bishop C. Exact calculation of the Hessian matrix for the multilayer perceptron. Neural Comput 1992;4(4):494–501. https://doi.org/10.1162/neco.1992.4.4.494.
[67] Werbos PJ. Backpropagation through time: What it does and how to do it. Proc IEEE 1990;78(10):1550–60.
[68] Chauvin Y, Rumelhart DE, editors. Backpropagation: Theory, architectures, and applications. Hillsdale, NJ, USA: L. Erlbaum Associates Inc. ISBN 0-8058-1259-8, 1995.
[69] Jesus OD, Hagan MT. Backpropagation algorithms for a broad class of dynamic networks. IEEE Trans Neural Netw 2007;18(1):14–27.
[70] Williams RJ, Zipser D. A learning algorithm for continually running fully recurrent neural networks. Neural Comput 1989;1(2):270–80.
[71] Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. Trans Neural Netw 1994;5(2):157–66. https://doi.org/10.1109/72.279181.
[72] Hochreiter S, Bengio Y, Frasconi P, Schmidhuber J. Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. In: Kolen J, Kremer S, editors. A field guide to dynamical recurrent networks. IEEE Press; 2001. p. 15.
[73] Kremer SC. A field guide to dynamical recurrent networks. 1st ed. Wiley-IEEE Press. ISBN 0780353692, 2001.
[74] Pascanu R, Mikolov T, Bengio Y. On the difficulty of training recurrent neural networks. In: Proceedings of the 30th International Conference on International Conference on Machine Learning, vol. 28. JMLR.org; 2013. p. III-1310–III-1318.
[75] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9:1735–80.
[76] Gers FA, Schmidhuber J, Cummins F. Learning to forget: Continual prediction with LSTM. Neural Comput 1999;12:2451–71.
[77] Gers FA, Schmidhuber J. Recurrent nets that time and count. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural computing: new challenges and perspectives for the New Millennium, vol. 3; 2000. p. 189–94.
[78] Gers FA, Schraudolph NN, Schmidhuber J. Learning precise timing with LSTM recurrent networks. J Mach Learn Res 2003;3:115–43. https://doi.org/10.1162/153244303768966139.
[79] Graves A, Schmidhuber J. Framewise phoneme classification with bidirectional LSTM networks. In: Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005, vol. 4; 2005. p. 2047–52.
[80] Greff K, Srivastava RK, Koutník J, Steunebrink BR, Schmidhuber J. LSTM: A search space odyssey. CoRR 2015;abs/1503.04069. http://arxiv.org/abs/1503.04069.
[81] Wang Y. A new concept using LSTM neural networks for dynamic system identification. In: 2017 American Control Conference (ACC), vol. 2017; 2017. p. 5324–9.
[82] Doya K. Bifurcations in the learning of recurrent neural networks. In: Proceedings of 1992 IEEE International Symposium on Circuits and Systems, vol. 6; 1992. p. 2777–80.
[83] Pasemann F. Dynamics of a single model neuron. Int J Bifurc Chaos Appl Sci Eng 1993;03(02):271–8. http://www.worldscientific.com/doi/abs/10.1142/S0218127493000210.
[84] Haschke R, Steil JJ. Input space bifurcation manifolds of recurrent neural networks. Neurocomputing 2005;64(Supplement C):25–38. https://doi.org/10.1016/j.neucom.2004.11.030.
[85] Jesus OD, Horn JM, Hagan MT. Analysis of recurrent network training and suggestions for improvements. In: Neural Networks, 2001. Proceedings. IJCNN '01. International Joint Conference on, vol. 4; 2001. p. 2632–7.
[86] Horn J, Jesus OD, Hagan MT. Spurious valleys in the error surface of recurrent networks: Analysis and avoidance. IEEE Trans Neural Netw 2009;20(4):686–700.
[87] Phan MC, Hagan MT. Error surface of recurrent neural networks. IEEE Trans Neural Netw Learn Syst 2013;24(11):1709–21. https://doi.org/10.1109/TNNLS.2013.2258470.
[88] Samarin AI. Neural networks with pre-tuning. In: VII All-Russian Conference on Neuroinformatics. Lectures on neuroinformatics. Moscow: MEPhI; 2005. p. 10–20 (in Russian).
[89] Jategaonkar RV. Flight vehicle system identification: A time domain methodology. Reston, VA: AIAA; 2006.
[90] Morozov NI, Tiumentsev YV, Yakovenko AV. An adjustment of dynamic properties of a controllable object using artificial neural networks. Aerosp MAI J 2002;(1):73–94 (in Russian).
[91] Krasovsky AA. Automatic flight control systems and their analytical design. Moscow: Nauka; 1973 (in Russian).
[92] Krasovsky AA, editor. Handbook of automatic control theory. Moscow: Nauka; 1987 (in Russian).
[93] Graupe D. System identification: A frequency domain approach. New York, NY: R.E. Krieger Publishing Co.; 1976.
[94] Ljung L. System identification: Theory for the user. 2nd ed. Upper Saddle River, NJ: Prentice Hall; 1999.
[95] Sage AP, Melsa JL. System identification. New York and London: Academic Press; 1971.
[96] Tsypkin YZ. Information theory of identification. Moscow: Nauka; 1995 (in Russian).
[97] Isermann R, Münchhof M. Identification of dynamic systems: An introduction with applications. Berlin: Springer; 2011.
[98] Juang JN, Phan MQ. Identification and control of mechanical systems. Cambridge, MA: Cambridge University Press; 1994.
[99] Pintelon R, Schoukens J. System identification: A frequency domain approach. New York, NY: IEEE Press; 2001.
[100] Berestov LM, Poplavsky BK, Miroshnichenko LY. Frequency domain aircraft identification. Moscow: Mashinostroyeniye; 1985 (in Russian).
[101] Vasilchenko KK, Kochetkov YA, Leonov VA, Poplavsky BK. Structural identification of mathematical model of aircraft motion. Moscow: Mashinostroyeniye; 1993 (in Russian).
[102] Klein V, Morelli EA. Aircraft system identification: Theory and practice. Reston, VA: AIAA; 2006.
[103] Tischler M, Remple RK. Aircraft and rotorcraft system identification: Engineering methods with flight-test examples. Reston, VA: AIAA; 2006.
[104] Morelli EA. In-flight system identification. AIAA-98-4261, 10.
[105] Morelli EA, Klein V. Real-time parameter estimation in the frequency domain. J Guid Control Dyn 2000;23(5):812–8.
[106] Morelli EA. Multiple input design for real-time parameter estimation in the frequency domain. In: 13th IFAC Conf. on System Identification, Aug. 27–29, 2003, Rotterdam, The Netherlands. Paper REG-360, 7.
[107] Smith MS, Moes TR, Morelli EA. Flight investigation of prescribed simultaneous independent surface excitations for real-time parameter identification. AIAA-2003-5702, 23.
[108] Schroeder MR. Synthesis of low-peak-factor signals and binary sequences with low autocorrelation. IEEE Trans Inf Theory 1970;16(1):85–9.
[109] Brusov VS, Tiumentsev YuV. Neural network modeling of aircraft motion. Moscow: MAI; 2016 (in Russian).