2 Dynamic Neural Networks: Structures and Training Methods
y(x) = ϕ0(x) + Σ_{i=1}^{n} λi ϕi(x),  λi ∈ R. (2.2)
FIGURE 2.4 Multilevel adjustable functional expansion. From [109], used with permission from Moscow Aviation Institute.
erated variants of the required model), by forming it as a multilevel network structure and by appropriate parametrization of the elements of this structure.

Fig. 2.4 shows how we can construct a multilevel adjustable functional expansion. We see that in this case, the adjustment of the expansion is carried out not only by varying the coefficients of the linear combination, as in expansions of the type (2.6). Now the elements of the functional basis are also parametrized. Therefore, in the process of solving the problem, the basis is adjusted to obtain a dynamical system model which is acceptable in the sense of the criterion (1.30).

As we can see from Fig. 2.4, the transition from a single-level decomposition to a multilevel one consists in the fact that each element ϕj(v, wϕ), j = 1, . . . , M, is decomposed using some functional basis {ψk(x, wψ)}, k = 1, . . . , K. Similarly, we can construct the expansion of the elements ψk(x, wψ) for another FB, and so on, the required number of times. This approach gives us the network structure with the required number of levels, as well as the required parametrization of the FB elements.

2.1.1.4 Functional and Neural Networks

Thus, we can interpret the model as an expansion on the functional basis (2.6), where each element ϕi(x1, x2, . . . , xn) transforms the n-dimensional input x = (x1, x2, . . . , xn) into the scalar output y.

We can distinguish the following types of elements of the functional basis:

• the FB element as an integrated (one-stage) mapping ϕi: Rn → R that directly transforms some n-dimensional input x = (x1, x2, . . . , xn) to the scalar output y;
• the FB element as a compositional (two-stage) mapping of the n-dimensional input x = (x1, x2, . . . , xn) to the scalar output y.

In the two-stage (compositional) version, the mapping Rn → R is performed in the first stage, "compressing" the vector input x = (x1, x2, . . . , xn) to the intermediate scalar output v, which in the second stage is additionally processed by the output mapping R → R to obtain the output y (Fig. 2.5).

Depending on which of these FB elements are used in the formation of network models (NMs), the following basic variants of these models are obtained:
2.1 ARTIFICIAL NEURAL NETWORK STRUCTURES
The most general way of introducing feedback into a "stack of layers"–type structure is shown in Fig. 2.7C. Here the feedback comes from some hidden layer L(q), 1 < q < NL, to the layer L(p), 1 ≤ p < NL, q > p. Similar to the case shown in Fig. 2.7A, this variant can be treated as a serial connection of a feedforward neural network (layers L(1), . . . , L(p−1)), the network with feedback (layers L(p), . . . , L(q)), and another feedforward network (layers L(q+1), . . . , L(NL)). The operation of such a network can, for example, be interpreted as follows. The recurrent subnet (the layers L(p), . . . , L(q)) is the main part of the ANN as a whole. One feedforward subnet (layers L(1), . . . , L(p−1)) preprocesses the data entering the main subnet, while the second subnet (layers L(q+1), . . . , L(NL)) performs some postprocessing of the data produced by the main recurrent subnet.

Fig. 2.7D shows an example of a generalization of the structure shown in Fig. 2.7C, for the case in which, in addition to strictly consecutive connections between the layers of the network, there are also bypass connections.

In all the ANN variants shown in Fig. 2.6, the strict sequence of layers is preserved unchanged. The layers are activated one after the other in the order specified by forward and backward connections in the considered ANN. For a feedforward network, this means that any neuron from the layer L(p) receives its inputs only from neurons from the layer L(p−1) and passes its outputs to the layer L(p+1), i.e.,

L(p−1) → L(p) → L(p+1),  p ∈ {0, 1, . . . , NL}. (2.7)

At the same time (simultaneously) two or more layers cannot be executed ("fired"), even if there is such a technical capability (the network is executed on some parallel computing system), due to the sequential operation logic of the ANN layers noted above.

The use of feedback introduces cyclicity into the order of operation for the layers. We can implement this cyclicity for all layers, beginning with L(1) and up to L(NL), or for some of them, for some range of numbers p1 ≤ p ≤ p2. The implementation depends on which layers of the ANN we cover by feedback. However, in any case, some strict sequence of operation of the layers is preserved. If one of the ANN layers has started its work, then, until this work is completed, no other layer will be launched for processing.

The rejection of this kind of strict firing sequence for the ANN layers leads to the appearance of parallelism in the network at the level of its layers. In the most general case, we allow for any neuron from the layer L(p) and any neuron from the layer L(q) to establish a connection of any type. Namely, we allow forward and backward (for these cases p ≠ q) or lateral (in this case p = q) connections. Here, for the time being, it is still considered that a layered organization like "stack of layers" is used.

Variants of the ANN structural organization shown in Fig. 2.7 use the same "stack of layers" scheme for ordering the layers of the network. Here, at each time interval, the neurons of only one layer work. The remaining layers either have already worked or are waiting for their turn. This approach applies to both feedforward networks and recurrent networks.

The following variant allows us to refuse the "stack of layers" scheme and to replace it with more complex structures. As an example illustrating structures of this kind, we show in Fig. 2.8 two variants of the structures of an ANN with parallelism in them at the layer level.⁴

⁴ If we refuse the "stack of layers" scheme, some layers in the ANN can work in parallel, i.e., simultaneously with each other, if there is such a technical possibility.

FIGURE 2.8 An example of a structural organization for a layered neural network with layer-level parallelism. (A) Feedforward ANN. (B) ANN with feedbacks.

Consider the schemes shown in Fig. 2.7 and Fig. 2.8. Obviously, to activate a neuron from some pth layer, it must first get the values of all its inputs it "waits for" until that moment. To parallelize the work of neurons, we must meet the same conditions. Namely, all neurons that have a complete set of inputs at a given moment of time can operate independently from each other, in an arbitrary order or in parallel, if there is such a technical capability.

Suppose we have an ANN organized according to the "stack of layers" scheme. The logic of neuron activation (i.e., the sequence and conditions of neuron operation) in this ANN ensures the absence of conflicts between the neurons. If we introduce parallelism at the layer level in the ANN, we need to add some additional synchronization rules to provide such conflict-free network operation.

Namely, a neuron can work as soon as it is ready to operate, and it will be ready as soon as it receives the values for all its inputs. Once the neuron is ready for functioning, we should start it immediately, as soon as it becomes possible. This is significant because the outputs of this neuron are required to ensure the operational readiness of other neurons that follow.

For the particular ANN, it is possible to specify (to generate) a set of cause-and-effect relations (chains) that provide the ability to monitor the operational conditions for different neurons to prevent conflicts between them.

For layered feedforward networks with the structures shown in Fig. 2.7, the cause-and-effect chains will have a strictly linear structure, without branches and cycles. In structures with parallelism at the layer level for networks, as shown in Fig. 2.8, both forward "jumps" and feedbacks can be present. Such structures bring nonlinearity to the cause-and-effect chains; in particular, they provide tree structures and cycles.

The cause-and-effect chain should show which neurons transmit signals to some analyzed neuron. In other words, it is required to show which neural predecessors should work so that a given neuron receives a complete set of input values. As noted above, this is a necessary condition for the operational readiness of a given neuron. This condition is the causal part of the chain. Also, the chain indicates which neurons will get the output of this "current neuron." This indication will be the "effect" part of the cause-and-effect chain.

In all the considered variants of the ANN structural organization, only forward and backward links were contained, i.e., connections between pairs of neurons in which the neurons from this pair belong to different layers.

The third kind of connections that are possible between neurons in the ANN is lateral connections, in which the two neurons, between which the connection is established, belong to
FIGURE 2.11 Examples of a structural organization for feedforward dynamic neural networks. (A) Jordan network. (B) Elman network. Din are source (input) data; Dout are output data (results); L(0) is the input layer; L(1) is the hidden layer; L(2) is the output layer; TDL(1) is a tapped delay line (TDL) of order 1.
FIGURE 2.12 Examples of a structural organization for feedforward dynamic neural networks. (A) Hopfield network. (B) Hamming network. Din are source (input) data; Dout are output data (results); L(0) is the input layer; L(1) is the hidden layer; L(2) is the output layer; TDL(1) is a tapped delay line (TDL) of order 1.
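A minimal numerical sketch of the order-1 recurrence that the TDL(1) element introduces in networks like the Elman network of Fig. 2.11B: the previous hidden state is fed back through a delay of one step. All sizes and weight values below are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 2, 4, 1

# Illustrative weight matrices of an Elman-style network.
Wx = rng.normal(size=(n_hid, n_in))   # input -> hidden
Wh = rng.normal(size=(n_hid, n_hid))  # hidden(t-1) -> hidden, via the TDL of order 1
Wy = rng.normal(size=(n_out, n_hid))  # hidden -> output

def elman_step(x, h_prev):
    """One time step: the TDL(1) supplies the previous hidden state h_prev."""
    h = np.tanh(Wx @ x + Wh @ h_prev)
    return h, Wy @ h

h = np.zeros(n_hid)
outputs = []
for t in range(5):
    x_t = np.array([np.sin(t), np.cos(t)])
    h, y_t = elman_step(x_t, h)
    outputs.append(y_t)
```

Replacing the feedback source (hidden state vs. output) turns this same loop into a Jordan-style network.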
In Fig. 2.13A the ANN model Nonlinear AutoRegression with eXternal inputs (NARX) [33–41] is shown, which is widely used in modeling and control tasks for dynamical systems. A variant of this network, extended in the composition of the parameters considered, has the same structural organization. This is the ANN model Nonlinear AutoRegression with Moving Average and eXternal inputs (NARMAX) [42,43].

In Fig. 2.13B we can see an example of an ANN model with the Layered Digital Dynamic Network (LDDN) structure [11,28]. Networks with a structure of this type can have practically any topology of forward and backward connections; that is, in a certain sense, this structural organization of the neural network is the most common.

The set of Figs. 2.14–2.17 allows us to specify the structural organization of the layers of the ANN model: the input layer (Fig. 2.14) and the working (hidden and output) layers (Fig. 2.15). In Fig. 2.16 the structure of the TDL element is presented, and in Fig. 2.17 the structure of the neuron, as the main element of the working layers of the ANN model, is shown.

One of the most popular static neural network architectures is a Layered Feedforward
FIGURE 2.13 Examples of a structural organization for feedforward dynamic neural networks. (A) NARX (Nonlinear AutoRegression with eXternal inputs). (B) LDDN (Layered Digital Dynamic Network). Din are source (input) data; Dout are output data (results); L(0) is the input layer; L(1) is the hidden layer; L(2) is the output layer for the NARX network and a hidden layer for the LDDN; L(3) is the output layer for the LDDN; TDL_1^{(m)}, TDL_2^{(m)}, TDL_1^{(n1)}, TDL_1^{(n2)} are tapped delay lines of order m, m, n1, and n2, respectively.
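The NARX model of Fig. 2.13A can be sketched as a one-step-ahead predictor driven by two tapped delay lines, ŷ(t) = F(u(t−1), . . . , u(t−m), ŷ(t−1), . . . , ŷ(t−n)). In the sketch below the trained network mapping F is replaced by an assumed linear stand-in, and the TDL orders and weights are illustrative assumptions.

```python
from collections import deque

import numpy as np

m, n = 3, 2                      # assumed TDL orders for inputs and outputs
rng = np.random.default_rng(1)
a = rng.normal(size=m)           # illustrative weights on delayed inputs
b = rng.normal(size=n)           # illustrative weights on delayed outputs

u_tdl = deque([0.0] * m, maxlen=m)   # tapped delay line for the external input u
y_tdl = deque([0.0] * n, maxlen=n)   # tapped delay line for the model output

predictions = []
for t in range(10):
    u_t = np.sin(0.3 * t)
    # Stand-in for the network mapping F(u(t-1..t-m), y(t-1..t-n)).
    y_hat = float(a @ np.array(u_tdl) + b @ np.array(y_tdl))
    predictions.append(y_hat)
    u_tdl.appendleft(u_t)        # shift both delay lines by one time step
    y_tdl.appendleft(y_hat)
```

Feeding the model's own delayed output back in this way is what distinguishes the NARX scheme from a purely feedforward predictor.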
FIGURE 2.16 Tapped delay lines (TDLs) as ANN structural elements. (A) TDL of order n. (B) TDL of order 1. D is delay
(memory) element.
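The TDL of order n in Fig. 2.16A can be sketched as a shift register of n delay (memory) elements D; the class below is an assumed minimal implementation, not code from the text.

```python
class TappedDelayLine:
    """Order-n TDL: stores the n most recent past values of its input signal."""

    def __init__(self, n, initial=0.0):
        self.taps = [initial] * n    # contents of the n delay elements D

    def step(self, x):
        """Advance one time step; return the delayed values x(t-1), ..., x(t-n)."""
        out = list(self.taps)
        # Each delay element D passes its stored value to the next one.
        self.taps = [x] + self.taps[:-1]
        return out

tdl = TappedDelayLine(3)
h0 = tdl.step(1.0)   # nothing stored yet
h1 = tdl.step(2.0)
h2 = tdl.step(3.0)
```

With n = 1 this reduces to the single delay element TDL(1) used in the Jordan and Elman networks of Fig. 2.11.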
ϕ_i^l(n_i^l) = logsig(n_i^l) = 1 / (1 + e^{−n_i^l}),  l = 1, . . . , L − 1,  i = 1, . . . , S^l. (2.10)
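The logistic sigmoid activation (2.10) in a vectorized form:

```python
import numpy as np

def logsig(n):
    """logsig(n) = 1 / (1 + exp(-n)), applied elementwise as in (2.10)."""
    return 1.0 / (1.0 + np.exp(-n))

vals = logsig(np.array([-1.0, 0.0, 1.0]))
# logsig(0) = 0.5; the output is bounded in (0, 1); logsig(-n) = 1 - logsig(n).
```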
FIGURE 2.22 Structure of the neuron. I – input vector; II – input mappings; III – aggregating mapping; IV – converter;
V – the output mapping; VI – output vector. From [109], used with permission from Moscow Aviation Institute.
FIGURE 2.23 The sequence of transformations (primitive mappings) realized by the neuron. I – input vector; II – input mappings; III – aggregating mapping; IV – converter (activation function); V – the output mapping; VI – output vector. From [109], used with permission from Moscow Aviation Institute.

element ordered set {x_j^{(out)}}, each element of which takes the value x_j^{(out)} = y.

The map Φ is formed as a composition of the mappings {f_i}, ψ, ϕ, and E^{(m)} (Fig. 2.22), i.e.,

x^{(out)} = Φ(x^{(in)}) = E^{(m)}(ψ(ϕ(f_1^{(in)}(x_1^{(in)}), . . . , f_n^{(in)}(x_n^{(in)})))). (2.21)

where Φ_j^{(p)} is the transformation of the input vector of dimension N_j^{(p)} into the output vector of dimension M_j^{(p)}; R_j^{(p)} is the connection of the output of the element S_j^{(p)} with other neurons of the considered ANN (with neurons from other layers, they are direct and inverse relations; with neurons from the same layer, they are lateral connections).

The transformation Φ_j^{(p)}(x_{i,j}^{(r,p)}) is the composition of the primitives from which the neuron consists, i.e.,

Φ_j^{(p)}(x_{i,j}^{(r,p)}) = E^{(m)}(ψ(ϕ(f_{i,j}^{(r,p)}(x_{i,j}^{(r,p)})))). (2.23)

The connections R_j^{(p)} of the neuron S_j^{(p)} are the set of ordered pairs showing where the outputs
FIGURE 2.24 The numeration of the inputs/outputs of neurons and the notation of signals (x_{i,j}^{(r,p)} and x_{j,k}^{(p,q)}) transmitted via interneuron links; it is the basic level of the description of the ANN. S_i^{(r)}, S_j^{(p)}, and S_k^{(q)} are neurons of the ANN (ith in the rth layer, jth in the pth layer, and kth in the qth layer, respectively); N_i^{(r)}, N_j^{(p)}, N_k^{(q)} are the numbers of inputs and M_i^{(r)}, M_j^{(p)}, M_k^{(q)} are the numbers of outputs of the neurons S_i^{(r)}, S_j^{(p)}, and S_k^{(q)}, respectively; x_{i,j}^{(r,p)} is the signal transferred from the output of the ith neuron of the rth layer to the input of the jth neuron of the pth layer; x_{j,k}^{(p,q)} is the signal transferred from the output of the jth neuron of the pth layer to the input of the kth neuron of the qth layer; g, h, l, m, n, s are the numbers of the neuron inputs/outputs; NL is the number of layers in the ANN; N^{(r)}, N^{(p)}, N^{(q)} are the numbers of neurons in the layers with numbers r, p, q, respectively. From [109], used with permission from Moscow Aviation Institute.
FIGURE 2.25 The numbering of the inputs/outputs of neurons and the designations of signals (x_{(i,h),(j,l)}^{(r,p)} and x_{(j,m),(k,n)}^{(p,q)}) transmitted through interneuronal connections; it is the extended level of the description of the ANN. S_i^{(r)}, S_j^{(p)}, and S_k^{(q)} are the neurons of the ANN (ith in the rth layer, jth in the pth layer, and kth in the qth layer, respectively); N_i^{(r)}, N_j^{(p)}, N_k^{(q)} are the numbers of inputs and M_i^{(r)}, M_j^{(p)}, M_k^{(q)} are the numbers of outputs of the neurons S_i^{(r)}, S_j^{(p)}, and S_k^{(q)}, respectively; x_{(i,h),(j,l)}^{(r,p)} is the signal transferred from the hth output of the ith neuron of the rth layer to the lth input of the jth neuron of the pth layer; x_{(j,m),(k,n)}^{(p,q)} is the signal transferred from the mth output of the jth neuron of the pth layer to the nth input of the kth neuron of the qth layer; g, h, l, m, n, s are the numbers of the neuron inputs/outputs; NL is the number of layers in the ANN; N^{(r)}, N^{(p)}, N^{(q)} are the numbers of neurons in the layers with numbers r, p, q, respectively.
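As an illustration of the neuron's composition of primitive mappings in (2.21), here is a minimal sketch with assumed concrete choices for each stage: scaling input mappings f_i, a summing aggregating mapping ϕ, a logsig converter ψ, and an output mapping E^{(m)} that replicates the scalar result m times. These concrete choices are assumptions for the example, not prescribed by the text.

```python
import math

def neuron(x_in, w, m=3):
    """x_out = E(m)(psi(phi(f1(x1), ..., fn(xn)))), cf. (2.21)."""
    fx = [wi * xi for wi, xi in zip(w, x_in)]   # input mappings f_i (assumed: scaling)
    v = sum(fx)                                 # aggregating mapping phi (assumed: sum)
    y = 1.0 / (1.0 + math.exp(-v))              # converter psi (assumed: logsig)
    return [y] * m                              # output mapping E(m): replicate y m times

x_out = neuron([1.0, 2.0], [0.5, -0.25])        # v = 0.5 - 0.5 = 0, so y = 0.5
```

Every element of the ordered output set then carries the same value y, which the connections R distribute to the successor neurons.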
applied to classification, regression, and system identification problems.

If a training data set is not known beforehand, but rather presented sequentially one example at a time, and a neural network is expected to operate and learn simultaneously, then it is said to perform incremental learning. Additionally, if the environment is assumed to be nonstationary, i.e., the desired response to some input may vary over time, then the training data set becomes inconsistent and a neural network needs to perform adaptation. In this case, we face a stability–plasticity dilemma: if the network lacks plasticity, then it cannot rapidly adapt to changes; on the other hand, if it lacks stability, then it forgets the previously learned data.

Another variation of supervised learning is active learning, which assumes that the neural network itself is responsible for the data set acquisition. That is, the network selects a new input and queries an external system (for example, some sensor) for the desired outputs that correspond to this input. Hence, a neural network is expected to "explore" the environment by interacting with it and to "exploit" the obtained data by minimizing some objective. In this paradigm, finding a balance between exploration and exploitation becomes an important issue. Reinforcement learning takes the idea of active learning one step further by assuming that the external system cannot provide the network with examples of desired behavior – instead, it can only score the previous behavior of the network. This approach is usually applied to intelligent control and decision making problems.

In this book, we cover only the supervised learning approach and focus on the modeling and identification problem for dynamical systems. Section 2.3.1 treats the training methods for static neural networks with applications to function approximation problems. These methods constitute the basis for dynamic neural network training algorithms, discussed in Section 2.3.3. For a discussion of unsupervised methods, see [10]. Reinforcement learning methods are presented in the books [45–48].

We need to mention that the actual goal of neural network supervised learning is not to achieve a perfect match of predictions with the training data, but to perform highly accurate predictions on independent data during network operation, i.e., the network should be able to generalize. In order to evaluate the generalization ability of a network, we split all the available experimental data into a training set and a test set. The model learns only on the training set, and then it is evaluated on the independent test set. Sometimes yet another subset is reserved – the so-called validation set, which is used to select the model hyperparameters (such as the number of layers or neurons).

2.2.1 Overview of the Neural Network Training Framework

Suppose that the network parameters are represented by a finite-dimensional vector W ∈ R^{nw}. The supervised learning approach implies a minimization of an error function (also called objective function, loss function, or cost function), which represents the deviation of actual network outputs from their desired values. We define a total error function Ē: R^{nw} → R to be a sum of individual errors for each of the training examples, i.e.,

Ē(W) = Σ_{p=1}^{P} E^{(p)}(W). (2.25)

The error function (2.25) is to be minimized with respect to the neural network parameters W. Thus, we have an unconstrained nonlinear optimization problem:

minimize_W Ē(W). (2.26)

In order for the minimization problem to make sense, we require the error function to be bounded from below.
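The setup (2.25)–(2.26) can be sketched in code for a toy one-parameter model; the data set and the linear model below are illustrative assumptions, not from the text.

```python
import numpy as np

# Toy training set for an assumed scalar linear model y = w * x.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.0, 2.0, 4.0, 6.0])   # targets generated with w = 2

def individual_error(w, p):
    """E^(p)(W): squared deviation on the p-th training example."""
    return 0.5 * (ys[p] - w * xs[p]) ** 2

def total_error(w):
    """E-bar(W): sum of the individual errors over all examples, as in (2.25)."""
    return sum(individual_error(w, p) for p in range(len(xs)))

# The total error is bounded from below by zero, so problem (2.26) makes sense;
# here it attains its minimum at w = 2.
```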
2.2 ARTIFICIAL NEURAL NETWORK TRAINING METHODS
Minimization is carried out by means of various iterative numerical methods. The optimization methods can be divided into global and local ones, according to the type of minimum they seek. Global optimization methods seek an approximate global minimum, whereas local methods seek a precise local minimum. Most of the global optimization methods have a stochastic nature (e.g., simulated annealing, evolutionary algorithms, particle swarm optimization), and the convergence is achieved almost surely and only in the limit. In this book we focus on the local deterministic gradient-based optimization methods, which guarantee a rapid convergence to a local solution under some reasonable assumptions. In order to apply these methods, we also require the error function to be sufficiently smooth (which is usually the case with neural networks, provided all the activation functions are smooth). For more detailed information on local optimization methods, we refer to [49–52]. Metaheuristic global optimization methods are covered in [53,54].

Note that the local optimization methods require an initial guess W^{(0)} for parameter values. There are various approaches to the initialization of network parameters. For example, the parameters may be sampled from a Gaussian distribution, i.e.,

Wi ∼ N(0, 1),  i = 1, . . . , nw. (2.27)

The following alternative initialization method for layered feedforward neural networks (2.8), called Xavier initialization, was suggested in [55]:

b_i^l = 0,
w_{i,j}^l ∼ U(−√(6 / (S^{l−1} + S^l)), √(6 / (S^{l−1} + S^l))). (2.28)

Optimization methods may also be classified by the order of error function derivatives used to guide the search process. Thus, zero-order methods use only the error function values; first-order methods rely on the first derivatives (gradient ∇Ē); second-order methods also utilize the second derivatives (Hessian ∇²Ē).

The basic descent method has the form

W^{(k+1)} = W^{(k)} + α^{(k)} p^{(k)},  Ē(W^{(k+1)}) < Ē(W^{(k)}), (2.29)

where p^{(k)} is a search direction and α^{(k)} represents a step length, also called the learning rate. Note that we require each step to decrease the error function. In order to guarantee the error function decrease for arbitrarily small step lengths, we need the search direction to be a descent direction, that is, to satisfy p^{(k)ᵀ} ∇Ē(W^{(k)}) < 0.

The simplest example of a first-order descent method is the gradient descent (GD) method, which utilizes the negative gradient search direction, i.e.,

p^{(k)} = −∇Ē(W^{(k)}). (2.30)

The step lengths may be assigned beforehand, α^{(k)} ≡ α for all k, but if the step α is too large, the error function might actually increase, and then the iterations would diverge. For example, in the case of a convex quadratic error function of the form

Ē(W) = ½ Wᵀ A W + bᵀ W + c, (2.31)

where A is a symmetric positive definite matrix with a maximum eigenvalue of λmax, the step length must satisfy

α < 2 / λmax

in order to guarantee the convergence of gradient descent iterations. On the other hand, a small step α would result in a slow convergence. In order to circumvent this problem, we can perform a step length adaptation: we take a "trial" step, evaluate the error function and check whether
it has decreased or not. If it has decreased, then we accept this trial step, and we increase the step length. Otherwise, we reject the trial step, and decrease the step length. An alternative approach is to perform a line search for an optimal step length which provides the maximum possible reduction of the error function along the search direction, i.e.,

α^{(k)} = argmin_{α>0} Ē(W^{(k)} + αp^{(k)}). (2.32)

The GD method combined with this exact line search is called the steepest gradient descent. Note that the global minimum of this univariate function is hard to find; in fact, even a search for an accurate estimate of a local minimum would require many iterations. Fortunately, we do not need to find an exact minimum along the specified direction – the convergence of the overall minimization procedure may be obtained if we guarantee a sufficient decrease of the error function at each iteration. If the search direction is a descent direction and if the step lengths satisfy the Wolfe conditions

Ē(W^{(k)} + α^{(k)}p^{(k)}) ≤ Ē(W^{(k)}) + c1 α^{(k)} ∇Ē(W^{(k)})ᵀ p^{(k)},
∇Ē(W^{(k)} + α^{(k)}p^{(k)})ᵀ p^{(k)} ≥ c2 ∇Ē(W^{(k)})ᵀ p^{(k)}, (2.33)

for 0 < c1 < c2 < 1, then the iterations converge to a stationary point, lim_{k→∞} ∇Ē(W^{(k)}) = 0, from an arbitrary initial guess (i.e., we have a global convergence to a stationary point). Note that there always exist intervals of step lengths which satisfy the Wolfe conditions. This justifies the use of inexact line search methods, which require fewer iterations to find an appropriate step length providing a sufficient reduction of the error function. Unfortunately, the GD method has a linear convergence rate, which is very slow.

Another important first-order method is the nonlinear conjugate gradient (CG) method. In fact, it is a family of methods which utilize search directions of the following general form:

p^{(0)} = −∇Ē(W^{(0)}),
p^{(k)} = −∇Ē(W^{(k)}) + β^{(k)} p^{(k−1)}. (2.34)

Depending on the choice of the scalar β^{(k)}, we obtain several variations of the method. The most popular expressions for β^{(k)} are the following:

• the Fletcher–Reeves method:

β^{(k)} = (∇Ē(W^{(k)})ᵀ ∇Ē(W^{(k)})) / (∇Ē(W^{(k−1)})ᵀ ∇Ē(W^{(k−1)})); (2.35)

• the Polak–Ribière method:

β^{(k)} = (∇Ē(W^{(k)})ᵀ (∇Ē(W^{(k)}) − ∇Ē(W^{(k−1)}))) / (∇Ē(W^{(k−1)})ᵀ ∇Ē(W^{(k−1)})); (2.36)

• the Hestenes–Stiefel method:

β^{(k)} = (∇Ē(W^{(k)})ᵀ (∇Ē(W^{(k)}) − ∇Ē(W^{(k−1)}))) / ((∇Ē(W^{(k)}) − ∇Ē(W^{(k−1)}))ᵀ p^{(k−1)}). (2.37)

Irrespective of the particular β^{(k)} selected, the first search direction p^{(0)} is simply the negative gradient direction. If we assume that the error function is convex and quadratic (2.31), then the method generates a sequence of conjugate search directions (i.e., p^{(i)ᵀ} A p^{(j)} = 0 for i ≠ j). If we also assume that the line searches are exact, then the method converges within nw iterations. In the general case of a nonlinear error function, the convergence rate is linear; however, a twice differentiable error function with nonsingular Hessian is approximately quadratic in the neighborhood of the solution, which results in fast convergence. Note also that the search directions lose conjugacy, hence we need to perform
so-called "restarts," i.e., to assign β^{(k)} ← 0. For example, we might reset β^{(k)} if the consecutive directions are nonorthogonal, |p^{(k)ᵀ} p^{(k−1)}| / ‖p^{(k)}‖² > ε. In the case of the Polak–Ribière method, we should also reset β^{(k)} if it becomes negative.

The basic second-order method is Newton's method:

p^{(k)} = −(∇²Ē(W^{(k)}))⁻¹ ∇Ē(W^{(k)}). (2.38)

If the Hessian ∇²Ē(W^{(k)}) is positive definite, the resulting search direction p^{(k)} is a descent direction. If the error function is convex and quadratic, Newton's method with a unit step length α^{(k)} = 1 finds the solution in a single step. For a smooth nonlinear error function with positive definite Hessian at the solution, the convergence is quadratic, provided the initial guess lies sufficiently close to the solution. If a Hessian turns out to have negative or zero eigenvalues, we need to modify it in order to obtain a positive definite approximation B – for example, we might add a scaled identity matrix, so we have

B^{(k)} = ∇²Ē(W^{(k)}) + μ^{(k)} I. (2.39)

The resulting damped method may be viewed as a hybrid of the ordinary Newton method (for μ^{(k)} = 0) and a gradient descent (for μ^{(k)} → ∞).

Note that the Hessian computation is very expensive; hence various approximations have been proposed. If we assume that each individual error is a quadratic form,

E^{(p)}(W) = ½ e^{(p)}(W)ᵀ e^{(p)}(W), (2.40)

then the gradient and Hessian may be expressed in terms of the error Jacobian as follows:

∇E^{(p)}(W) = (∂e^{(p)}(W)/∂W)ᵀ e^{(p)}(W),
∇²E^{(p)}(W) = (∂e^{(p)}(W)/∂W)ᵀ (∂e^{(p)}(W)/∂W) + Σ_{i=1}^{n_e} (∂²e_i^{(p)}(W)/∂W²) e_i^{(p)}(W). (2.41)

Then, the Gauss–Newton approximation to the Hessian is obtained by discarding the second-order terms, i.e.,

∇²E^{(p)}(W) ≈ B^{(p)} = (∂e^{(p)}(W)/∂W)ᵀ (∂e^{(p)}(W)/∂W). (2.42)

The resulting matrix B can turn out to be degenerate, so we might modify it by adding a scaled identity matrix as mentioned above in (2.39). Then we have

B^{(p)} = (∂e^{(p)}(W)/∂W)ᵀ (∂e^{(p)}(W)/∂W) + μ^{(k)} I. (2.43)

This technique leads us to the Levenberg–Marquardt method.

A family of quasi-Newton methods estimates the inverse Hessian by accumulating the changes of gradients. These methods construct an inverse Hessian approximation H ≈ (∇²Ē(W))⁻¹ so as to satisfy the secant equation:

H^{(k+1)} y^{(k)} = s^{(k)},
s^{(k)} = W^{(k+1)} − W^{(k)}, (2.44)
y^{(k)} = ∇Ē(W^{(k+1)}) − ∇Ē(W^{(k)}).

However, for nw > 1 this system of equations is underdetermined and there exists an infinite number of solutions. Thus, additional constraints are imposed, giving rise to various quasi-Newton methods. Most of them require that the inverse Hessian approximation H^{(k+1)}
be symmetric and positive definite, and also minimize the distance to the previous estimate H^{(k)} with respect to some norm: H^{(k+1)} = argmin_H ‖H − H^{(k)}‖. One of the most popular variations of quasi-Newton methods is the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm:

H^{(k+1)} = (I − ρ^{(k)} s^{(k)} y^{(k)ᵀ}) H^{(k)} (I − ρ^{(k)} y^{(k)} s^{(k)ᵀ}) + ρ^{(k)} s^{(k)} s^{(k)ᵀ},  ρ^{(k)} = 1 / (y^{(k)ᵀ} s^{(k)}). (2.45)

Trust region methods use the second-order Taylor series approximation of the error function as the model

Ē(W^{(k)} + p) ≈ M̄^{(k)}(p) = Ē(W^{(k)}) + pᵀ ∇Ē(W^{(k)}) + ½ pᵀ ∇²Ē(W^{(k)}) p. (2.46)

The resulting trust region subproblem may be solved approximately by conjugate gradient iterations of the form

p^{(k,s+1)} = p^{(k,s)} + α^{(k,s)} d^{(k,s)},
r^{(k,s+1)} = r^{(k,s)} + α^{(k,s)} ∇²Ē(W^{(k)}) d^{(k,s)},
β^{(k,s+1)} = (r^{(k,s+1)ᵀ} r^{(k,s+1)}) / (r^{(k,s)ᵀ} r^{(k,s)}),
d^{(k,s+1)} = r^{(k,s+1)} + β^{(k,s+1)} d^{(k,s)}.

The iterations are terminated prematurely either if they cross the trust region boundary, ‖p^{(k,s+1)}‖ ≥ Δ^{(k)}, or if a nonpositive curvature direction is discovered, d^{(k,s)ᵀ} ∇²Ē(W^{(k)}) d^{(k,s)} ≤ 0. In these cases, a solution corresponds to the intersection of the current search direction with the trust region boundary. It is important to note that this method does not require one to compute the entire Hessian matrix; instead, we need only the Hessian vector products of the form ∇²Ē(W^{(k)}) d^{(k,s)}, which may be computed more efficiently by the reverse-mode automatic differentiation methods described below. Such Hessian-free methods have been successfully applied to neural network training [59,60].

Another approach to solving (2.47) [61,62] replaces the subproblem with an equivalent problem of finding both the vector p ∈ R^{nw} and the scalar μ ≥ 0 such that

(∇²Ē(W^{(k)}) + μI) p = −∇Ē(W^{(k)}),
μ(Δ − ‖p‖) = 0, (2.49)

where ∇²Ē(W^{(k)}) + μI is positive semidefinite. There are two possibilities. If μ = 0, then we have p = −(∇²Ē(W^{(k)}))⁻¹ ∇Ē(W^{(k)}) and ‖p‖ ≤ Δ. If μ > 0, then we define p(μ) = −(∇²Ē(W^{(k)}) + μI)⁻¹ ∇Ē(W^{(k)}) and solve the one-dimensional equation ‖p(μ)‖ = Δ with respect to μ.

The gradient and the Hessian of the total error function (2.25) are the sums of the gradients and Hessians of the individual errors, i.e.,

∇Ē(W) = Σ_{p=1}^{P} ∇E^{(p)}(W), (2.50)

∇²Ē(W) = Σ_{p=1}^{P} ∇²E^{(p)}(W). (2.51)

In the case the neural network has a large number of parameters nw and the data set contains a large number of training examples P, computation of the total error function value Ē as well as its derivatives can be time consuming. Thus, even for a simple GD method, each update of the weights takes a lot of time. Then, we might apply a stochastic gradient descent (SGD) method, which randomly shuffles training examples, iterates over them, and updates the parameters using the gradients of individual errors E^{(p)}:

W^{(k,p)} = W^{(k,p−1)} − α^{(k)} ∇E^{(p)}(W^{(k,p−1)}),
W^{(k+1,0)} = W^{(k,P)}. (2.52)

In contrast, the usual gradient descent is called the batch method. We need to mention that although the (k,p)th step decreases the error for the pth training example, it may increase the error for the other examples. On the one hand this allows the method to escape some local minima, but on the other hand it becomes difficult to converge to a final solution. In order to circumvent this issue, we might gradually decrease the step lengths α^{(k)}. Also, in order to achieve a "smoother" convergence we could perform the weight updates based on random subsets of training examples, which is called a "minibatch" strategy. The stochastic or minibatch approach may also be applied to other optimization methods; see [63].
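The SGD update (2.52) with minibatching can be sketched as follows; the least-squares toy model, data set, step length, and batch size are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
P = 64
xs = rng.normal(size=P)
ys = 2.0 * xs                        # targets from an assumed true weight w* = 2

def minibatch_grad(w, idx):
    """Gradient of E^(p)(w) = 0.5*(y_p - w*x_p)^2, averaged over a minibatch."""
    x, y = xs[idx], ys[idx]
    return np.mean((w * x - y) * x)

w, alpha = 0.0, 0.1
for k in range(50):                  # epochs
    order = rng.permutation(P)       # randomly shuffle the training examples
    for batch in np.split(order, P // 8):   # minibatches of 8 examples
        w -= alpha * minibatch_grad(w, batch)   # update (2.52), minibatch form
```

Setting the batch size to 1 recovers pure stochastic updates; setting it to P recovers the batch method.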
We should also mention that in the case of the batch or minibatch update strategy, the computation of the total error function values, as well as of its derivatives, can be efficiently parallelized. In order to do that, we need to divide the data set into multiple subsets, compute partial sums of the error function and its derivatives over the training examples of each subset in parallel, and then sum the results. This is not possible in the case of stochastic updates. In the case of an SGD method, we can parallelize the gradient computations by neurons of each layer.

Finally, we note that any iterative method requires a stopping criterion used to terminate the procedure. One simple option is a test based on the first-order necessary conditions for a local minimum, i.e.,

‖∇E(W^(k))‖ < ε_g.    (2.53)

We can also terminate the iterations if it seems that no progress is made, i.e.,

E(W^(k)) − E(W^(k+1)) < ε_E,
‖W^(k) − W^(k+1)‖ < ε_w.    (2.54)

In order to prevent an infinite loop in the case of algorithm divergence, we might stop when a certain maximum number of iterations has been performed, i.e.,

k ≤ k̄.    (2.55)

2.2.2 Static Neural Network Training

In this subsection, we consider the function approximation problem. The problem is stated as follows. Suppose that we wish to approximate an unknown mapping f: X → Y, where X ⊂ R^{n_x} and Y ⊂ R^{n_y}. Assume we are given an experimental data set of the form

{(x^(p), ỹ^(p))}, p = 1, …, P,    (2.56)

where x^(p) ∈ X represent the input vectors and ỹ^(p) ∈ Y represent the observed output vectors. Note that in general the observed outputs ỹ^(p) do not match the true outputs y^(p) = f(x^(p)). We assume that the observations are corrupted by additive Gaussian noise, i.e.,

ỹ^(p) = y^(p) + η^(p),    (2.57)

where η^(p) represent the sample points of a zero-mean random vector η ∼ N(0, Σ) with the diagonal covariance matrix

Σ = diag(σ_1², …, σ_{n_y}²).

The approximation is to be performed using a layered feedforward neural network of the form (2.8). Under the abovementioned assumptions on the observation noise, it is reasonable to utilize a least-squares error function. Thus, we have a total error function Ē of the form (2.25) with the individual errors

E^(p)(W) = ½ (ỹ^(p) − ŷ^(p))^T Ω (ỹ^(p) − ŷ^(p)),    (2.58)

where ŷ^(p) represent the neural network outputs given the corresponding inputs x^(p) and weights W, and the diagonal matrix Ω of fixed "error weights" has the form

Ω = diag(ω_1, …, ω_{n_y}),

where the ω_i are usually taken to be inversely proportional to the noise variances.

We need to minimize the total approximation error Ē with respect to the neural network parameters W. If the activation functions of all the neurons are smooth, then the error function is also smooth.
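A minimal sketch of the weighted individual error (2.58): Ω is diagonal, so the quadratic form reduces to a weighted sum of squared per-output residuals. The noise variances below are assumed values used only to illustrate the choice ω_i ∝ 1/σ_i².

```python
def individual_error(y_obs, y_net, omega):
    """E^(p)(W) = 0.5 * sum_i omega_i * (y~_i - y^_i)^2, i.e., eq. (2.58)."""
    return 0.5 * sum(w * (yo - yn) ** 2
                     for w, yo, yn in zip(omega, y_obs, y_net))

# Error weights inversely proportional to assumed per-output noise variances:
sigma2 = [0.01, 0.04]
omega = [1.0 / s for s in sigma2]

E = individual_error([1.0, 2.0], [1.1, 1.8], omega)
print(round(E, 6))  # 1.0
```

With these weights, a residual of one noise standard deviation contributes the same amount to the error regardless of which output it occurs on.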
2.2 ARTIFICIAL NEURAL NETWORK TRAINING METHODS 59
Hence, the minimization can be carried out using any of the optimization methods described in Section 2.2.1. However, in order to apply those methods, we need an efficient algorithm to compute the gradient and the Hessian of the error function with respect to the parameters. As mentioned above, the total error gradient ∇Ē and Hessian ∇²Ē may be expressed in terms of the individual error gradients ∇E^(p) and Hessians ∇²E^(p). Thus, all that remains is to compute the derivatives of E^(p). For notational convenience, in the remainder of this section we omit the training example index p.

There exist several approaches to the computation of error function derivatives:

• numeric differentiation;
• symbolic differentiation;
• automatic (or algorithmic) differentiation.

The numeric differentiation approach relies on the definition of the derivative and approximates it via finite differences. This method is very simple to implement, but it suffers from truncation and roundoff errors. It is especially inaccurate for higher-order derivatives. Also, it requires many function evaluations: for example, in order to estimate the error function gradient with respect to n_w parameters using the simplest forward difference scheme, we require error function values at n_w + 1 points.

Symbolic differentiation transforms a symbolic expression for the original function (usually represented in the form of a computational graph) into symbolic expressions for its derivatives by applying the chain rule. The resulting expressions may be evaluated at any point accurately to working precision. However, these expressions usually end up having many identical subexpressions, which leads to duplicate computations (especially when we need the derivatives with respect to multiple parameters). In order to avoid this, we need to simplify the expressions for the derivatives, which presents a nontrivial problem.

The automatic differentiation technique [64] computes function derivatives at a point by applying the chain rule to the corresponding numerical values instead of symbolic expressions. This method produces accurate derivative values, just like symbolic differentiation, and also allows for a certain performance optimization. Note that automatic differentiation relies on the original computational graph of the function to be differentiated. Thus, if the original graph makes use of some common intermediate values, they will be efficiently reused by the differentiation procedure. Automatic differentiation is especially useful for neural network training, since it scales well to multiple parameters as well as to higher-order derivatives. In this book, we adopt the automatic differentiation approach.

Automatic differentiation encompasses two different modes of computation: forward and reverse. Forward mode computes the sensitivities of all variables with respect to the input variables: it starts with the intermediate variables that explicitly depend on the input variables (the most deeply nested subexpressions) and proceeds "forward" by applying the chain rule, until the output variables are processed. Reverse mode computes the sensitivities of the output variables with respect to all variables: it starts with the intermediate variables on which the output variables explicitly depend (the outermost subexpressions) and proceeds "in reverse" by applying the chain rule, until the input variables are processed. Each mode has its own advantages and disadvantages. The forward mode allows us to compute function values as well as derivatives of multiple orders in a single pass. On the other hand, in order to compute the rth-order derivative using the reverse mode, one needs the derivatives of all the lower orders s = 0, …, r − 1 beforehand. The computational complexity of first-order derivative computation in the forward mode is proportional to the number of inputs, while in the reverse mode it is proportional to the number of outputs.
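The n_w + 1 evaluation cost of the forward-difference scheme mentioned above can be checked directly; the function f below is an arbitrary hypothetical stand-in for the error function.

```python
def fd_gradient(f, w, h=1e-6):
    """Forward-difference gradient estimate: n_w + 1 evaluations of f."""
    f0 = f(w)                      # 1 evaluation at the base point
    grad = []
    for i in range(len(w)):        # n_w more evaluations, one per parameter
        w_shift = list(w)
        w_shift[i] += h
        grad.append((f(w_shift) - f0) / h)
    return grad

calls = 0
def f(w):
    global calls
    calls += 1
    return w[0] ** 2 + 3.0 * w[1]  # stand-in "error function"

g = fd_gradient(f, [2.0, 5.0])
print(calls)                       # 3 evaluations for n_w = 2 parameters
print([round(x, 3) for x in g])    # close to the exact gradient [4.0, 3.0]
```

The step h trades truncation error (too large) against roundoff error (too small), which is the accuracy limitation the text refers to.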
In our case, there is only one output (the scalar error) and multiple inputs; therefore the reverse mode is significantly faster than the forward mode. As shown in [65], under realistic assumptions the error function gradient can be computed in reverse mode at a cost of five function evaluations or less. Also note that in the ANN field the forward and reverse computation modes are usually referred to as forward propagation and backward propagation (or backpropagation).

In the rest of this subsection we present automatic differentiation algorithms for the computation of the gradient, Jacobian, and Hessian of the squared error function (2.58) in the case of a layered feedforward neural network (2.8). All these algorithms rely on the fact that the derivatives of the activation functions are known. For example, the derivatives of the hyperbolic tangent activation functions (2.9) are

ϕ'_i^l(n_i^l) = 1 − (ϕ_i^l(n_i^l))²,
ϕ''_i^l(n_i^l) = −2 ϕ_i^l(n_i^l) ϕ'_i^l(n_i^l),
l = 1, …, L − 1, i = 1, …, S^l,    (2.59)

while the derivatives of a logistic function (2.10) equal

ϕ'_i^l(n_i^l) = ϕ_i^l(n_i^l) (1 − ϕ_i^l(n_i^l)),
ϕ''_i^l(n_i^l) = ϕ'_i^l(n_i^l) (1 − 2 ϕ_i^l(n_i^l)),
l = 1, …, L − 1, i = 1, …, S^l.    (2.60)

Backpropagation algorithm for error function gradient. First, we perform a forward pass to compute the weighted sums n_i^l and activations a_i^l for all neurons i = 1, …, S^l of each layer l = 1, …, L, according to Eqs. (2.8).

We define the error function sensitivities with respect to the weighted sums n_i^l as follows:

δ_i^l ≜ ∂E/∂n_i^l.    (2.62)

Sensitivities for the output layer neurons are obtained directly, i.e.,

δ_i^L = −ω_i (ỹ_i − a_i^L) ϕ'_i^L(n_i^L),    (2.63)

while sensitivities for the hidden layer neurons are computed during a backward pass:

δ_i^l = ϕ'_i^l(n_i^l) Σ_{j=1}^{S^{l+1}} δ_j^{l+1} w_{j,i}^{l+1},  l = L − 1, …, 1.    (2.64)

Finally, the error function derivatives with respect to the parameters are expressed in terms of the sensitivities, i.e.,

∂E/∂b_i^l = δ_i^l,
∂E/∂w_{i,j}^l = δ_i^l a_j^{l−1}.    (2.65)

In a similar manner, we can compute the derivatives of the network outputs with respect to the parameters and inputs. For this purpose we introduce the pairwise sensitivities ν_{i,j}^{l,m} ≜ ∂n_i^l/∂n_j^m of the weighted sums. Pairwise sensitivities for neurons of the same layer are obtained directly, i.e.,

ν_{i,i}^{l,l} = 1,
ν_{i,j}^{l,l} = 0, i ≠ j,    (2.68)
while pairwise sensitivities involving a later layer vanish, i.e.,

ν_{i,j}^{l,m} = 0, m > l.    (2.69)

The remaining pairwise sensitivities are computed during the forward pass, along with the weighted sums n_i^l and activations a_i^l, i.e.,

ν_{i,j}^{l,m} = Σ_{k=1}^{S^{l−1}} w_{i,k}^l ϕ'_k^{l−1}(n_k^{l−1}) ν_{k,j}^{l−1,m},  l = 2, …, L.    (2.70)

Finally, the derivatives of the neural network outputs with respect to the parameters are expressed in terms of the pairwise sensitivities, i.e.,

∂a_i^L/∂b_j^m = ϕ'_i^L(n_i^L) ν_{i,j}^{L,m},
∂a_i^L/∂w_{j,k}^m = ϕ'_i^L(n_i^L) ν_{i,j}^{L,m} a_k^{m−1}.    (2.71)

If we additionally define the sensitivities of the weighted sums with respect to the network inputs,

ν_{i,j}^{l,0} ≜ ∂n_i^l/∂a_j^0,    (2.72)

then we obtain the derivatives of the network outputs with respect to the network inputs. First, we compute the additional sensitivities during the forward pass, i.e.,

ν_{i,j}^{1,0} = w_{i,j}^1,
ν_{i,j}^{l,0} = Σ_{k=1}^{S^{l−1}} w_{i,k}^l ϕ'_k^{l−1}(n_k^{l−1}) ν_{k,j}^{l−1,0},  l = 2, …, L.    (2.73)

Then, the derivatives of the network outputs with respect to the network inputs are expressed in terms of these sensitivities as ∂a_i^L/∂a_j^0 = ϕ'_i^L(n_i^L) ν_{i,j}^{L,0}.

Backpropagation algorithm for error gradient and Hessian [66]. First, we perform a forward pass to compute the weighted sums n_i^l and activations a_i^l according to Eqs. (2.8), and also to compute the pairwise sensitivities ν_{i,j}^{l,m} according to (2.68)–(2.70).

We define the error function second-order sensitivities with respect to the weighted sums as follows:

δ_{i,j}^{l,m} ≜ ∂²E/(∂n_i^l ∂n_j^m).    (2.75)

Next, during a backward pass we compute the error function sensitivities δ_i^l as well as the second-order sensitivities δ_{i,j}^{l,m}. According to Schwarz's theorem on the equality of mixed partials, due to the continuity of the second partial derivatives of the error function with respect to the weighted sums, we have δ_{i,j}^{l,m} = δ_{j,i}^{m,l}. Hence, we need to compute the second-order sensitivities only for the case m ≤ l.

Second-order sensitivities for the output layer neurons are obtained directly, i.e.,

δ_{i,j}^{L,m} = ω_i [ (ϕ'_i^L(n_i^L))² − (ỹ_i − a_i^L) ϕ''_i^L(n_i^L) ] ν_{i,j}^{L,m},    (2.76)

while second-order sensitivities for the hidden layer neurons are computed during a backward pass, i.e.,

δ_{i,j}^{l,m} = ϕ'_i^l(n_i^l) Σ_{k=1}^{S^{l+1}} w_{k,i}^{l+1} δ_{k,j}^{l+1,m} + ϕ''_i^l(n_i^l) ν_{i,j}^{l,m} Σ_{k=1}^{S^{l+1}} w_{k,i}^{l+1} δ_k^{l+1},
l = L − 1, …, 1.    (2.77)
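The gradient backpropagation pass (2.62)–(2.65) can be sketched for a tiny 2-2-1 network with tanh hidden units and an identity output; all weights and data below are hypothetical, and the analytic derivative is verified against a finite difference.

```python
import math

# Tiny 2-2-1 network: tanh hidden layer, identity output (hypothetical weights).
W = [[[0.5, -0.3], [0.8, 0.2]],   # W[0][i][j] = w^1_{i,j}
     [[1.0, -0.7]]]               # W[1][i][j] = w^2_{i,j}
b = [[0.1, -0.2], [0.05]]
omega, x, y_obs = [1.0], [0.3, -0.6], [0.4]

def forward(W):
    n1 = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W[0], b[0])]
    a1 = [math.tanh(v) for v in n1]
    a2 = [sum(w * aj for w, aj in zip(row, a1)) + bi for row, bi in zip(W[1], b[1])]
    return a1, a2

def error(W):
    _, a2 = forward(W)
    return 0.5 * sum(om * (yo - ao) ** 2 for om, yo, ao in zip(omega, y_obs, a2))

a1, a2 = forward(W)
# Output-layer sensitivities (2.63); phi' = 1 for the identity output:
d2 = [-om * (yo - ao) for om, yo, ao in zip(omega, y_obs, a2)]
# Hidden-layer sensitivities (2.64); tanh' = 1 - tanh^2:
d1 = [(1 - a1[i] ** 2) * sum(d2[j] * W[1][j][i] for j in range(len(d2)))
      for i in range(len(a1))]
# Parameter derivative (2.65): dE/dw^1_{0,0} = delta^1_0 * a^0_0
grad_w1_00 = d1[0] * x[0]

# Finite-difference check of the same derivative:
E0, h = error(W), 1e-6
W[0][0][0] += h
fd = (error(W) - E0) / h
print(abs(fd - grad_w1_00) < 1e-4)  # True
```

The whole gradient follows by applying (2.65) to every bias and weight, at a cost independent of the number of parameters per pass.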
Due to the continuity of the second partial derivatives of the error function with respect to the network parameters, the Hessian matrix is symmetric. Therefore, we need to compute only the lower-triangular part of the Hessian matrix.

For the derivatives with respect to the network inputs, the additional second-order sensitivities are computed during the backward pass; for the output layer we have

δ_{i,j}^{L,0} = ω_i [ (ϕ'_i^L(n_i^L))² − (ỹ_i − a_i^L) ϕ''_i^L(n_i^L) ] ν_{i,j}^{L,0},

while the hidden layer sensitivities δ_{i,j}^{l,0} follow the same backward recursion as in (2.77).
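The forward recursion (2.73) for the input sensitivities, together with the resulting input-output Jacobian, can be sketched for the same kind of tiny 2-2-1 network (hypothetical weights; identity output activation, so ϕ'^L = 1 and the Jacobian equals ν^{L,0}).

```python
import math

# Tiny 2-2-1 network (hypothetical weights, identity output activation).
W1, W2 = [[0.5, -0.3], [0.8, 0.2]], [[1.0, -0.7]]
b1, b2 = [0.1, -0.2], [0.05]
x = [0.3, -0.6]

n1 = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W1, b1)]
a1 = [math.tanh(v) for v in n1]

# First line of (2.73): nu^{1,0}_{i,j} = w^1_{i,j}
nu1 = [row[:] for row in W1]
# Forward recursion (2.73): nu^{2,0}_{i,j} = sum_k w^2_{i,k} phi'(n^1_k) nu^{1,0}_{k,j}
nu2 = [[sum(W2[i][k] * (1 - a1[k] ** 2) * nu1[k][j] for k in range(2))
        for j in range(2)] for i in range(1)]
jac = nu2  # identity output => d a^L_i / d a^0_j = nu^{L,0}_{i,j}

# Finite-difference check of d a^2_0 / d a^0_1:
def out(x):
    h1 = [math.tanh(sum(w * xj for w, xj in zip(row, x)) + bi)
          for row, bi in zip(W1, b1)]
    return sum(w * aj for w, aj in zip(W2[0], h1)) + b2[0]

h = 1e-6
fd = (out([x[0], x[1] + h]) - out(x)) / h
print(abs(fd - jac[0][1]) < 1e-4)  # True
```

Unlike the backward pass for the gradient, this recursion runs forward through the layers, which matches its per-input (rather than per-output) cost.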
If the largest (in absolute value) eigenvalues of ∂F(z(t_r), u(t_r), W)/∂z are less than 1 for all time steps t_r, r = l, …, k − 1, then the norm of the sensitivity ∂z(t_k)/∂z(t_l) will decay exponentially with k − l. Hence, the terms of the error gradient which correspond to recent time steps will dominate the sum. This is the reason why gradient-based optimization methods learn short-term dependencies much faster than the long-term ones. On the other hand, a gradient explosion (exponential growth of its norm) corresponds to a situation when the eigenvalues exceed 1 at all time steps. The gradient explosion effect might lead to divergence of the optimization method, unless care is taken.

In particular, if the mapping F is represented by a layered feedforward neural network (2.8), then the Jacobian ∂F(z(t_r), u(t_r), W)/∂z corresponds to the derivatives of the network outputs with respect to its inputs, i.e.,

∂a^L/∂a^0 = diag(ϕ'^L(n^L)) ω^L ··· diag(ϕ'^1(n^1)) ω^1.    (2.98)

Assume that the derivatives of all the activation functions ϕ^l are bounded by some constant η^l. Denote by λ_max^l the eigenvalue with the largest magnitude of the weight matrix ω^l of the lth layer. If the inequality ∏_{l=1}^{L} λ_max^l η^l < 1 holds, then the largest (in magnitude) eigenvalue of the Jacobian matrix ∂a^L/∂a^0 is less than one. Derivatives of the hyperbolic tangent activation function, as well as of the identity activation function, are bounded by 1.

One of the possibilities to speed up the training is to use second-order optimization methods [59,74]. Another option would be to utilize the Long Short-Term Memory (LSTM) models [72,75–80], specially designed to overcome the vanishing gradient effect by using special memory cells instead of context neurons. LSTM networks have been successfully applied in speech recognition, machine translation, and anomaly detection. However, little attention has been paid to applications of LSTM to dynamical system modeling problems [81].

2. Bifurcations of recurrent neural network dynamics [82–84]. Since the recurrent neural network is a dynamical system itself, its phase portrait might undergo qualitative changes during the training. If these changes affect the actual predicted trajectories, this might lead to significant changes of the error in response to small changes of the parameters (i.e., the gradient norm becomes very large), provided the duration of these trajectories is large enough.

In order to guarantee a complete absence of bifurcations during the network training, we would need a very good initial guess for its parameters, so that the model would already possess the desired asymptotic behavior. Since this assumption is very unrealistic, it seems more reasonable to modify the optimization methods in order to enforce their stability.

3. Spurious valleys in the error surface [85–87]. These valleys are called spurious because they do not depend on the desired output values ỹ(t_k). The location of these valleys is determined only by the initial conditions z(t_0) and the controls u(t_k). The reasons for the occurrence of such valleys have been investigated in some special cases. For example, if the initial state z(t_0) of (2.13) is a global repeller within some area of the parameter space, then an infinitesimal control u(t_k) causes the model states z(t_k) to tend to infinity, which in turn leads to an unbounded error growth. Now assume that this area of the parameter space contains a line along which the connection weights between the controls u(t_k) and the neurons of F are identically zero, that is, the recurrent neural network (2.13) does not depend on the controls.
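The exponential decay or growth of the sensitivity norm discussed above can be illustrated with a scalar toy model, in which every per-step Jacobian is the same constant.

```python
# Toy scalar illustration: the sensitivity dz(t_k)/dz(t_l) is a product of
# per-step Jacobians, here taken constant for simplicity.
def sensitivity(step_jacobian, steps):
    s = 1.0
    for _ in range(steps):
        s *= step_jacobian
    return s

print(round(sensitivity(0.9, 50), 4))  # 0.0052: long-term gradient terms vanish
print(round(sensitivity(1.1, 50), 1))  # 117.4: the gradient explodes
```

Even a modest deviation of the per-step Jacobian from 1 in either direction dominates after a few dozen time steps, which is why recent time steps dominate the error gradient.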
2.3 DYNAMIC NEURAL NETWORK ADAPTATION METHODS 67
The EKF algorithm is initialized as follows. For k = 0, set

ẑ(t_0) = E[z(t_0)],
P(t_0) = E[(z(t_0) − E[z(t_0)])(z(t_0) − E[z(t_0)])^T].

In order to use the Kalman filter, it is required to linearize the observation equation. It is possible to use statistical linearization, i.e., linearization with respect to the mathematical expectation.

The variant of the EKF of this type is more stable in computational terms and is robust to rounding errors, which positively affects the computational stability of the learning process of the ANN model as a whole.

As can be seen from the relationships determining the EKF, the key point is again the calculation of the Jacobian J(t_k) of the network errors with respect to the adjusted parameters.

When training a neural network, it is impossible to use only the current measurement in the EKF because of the unacceptably low accuracy of the search (the effect of the noises ζ and η); it is necessary to form a vector estimate on an observation interval, after which the update of the matrix P(t_k) is more correct.

As the vector of observations, we can take a sequence of values on a certain sliding interval, i.e.,

ŷ(t_k) = [ŷ(t_{i−l}), ŷ(t_{i−l+1}), …, ŷ(t_i)]^T,

where l is the length of the sliding interval, the index i refers to the time point (sampling step), and the index k indicates the estimate number. The error of the ANN model is then also a vector value, i.e.,

e(t_k) = [e(t_{i−l}), e(t_{i−l+1}), …, e(t_i)]^T.

2.3.2 ANN Models With Interneurons

From the point of view of ensuring the adaptability of ANN models, the idea of an intermediate neuron (interneuron) and of a subnetwork of such neurons (intersubnet) is very fruitful.

2.3.2.1 The Concept of an Interneuron and an ANN Model With Such Neurons

An effective approach to the implementation of adaptive ANN models, based on the concepts of an interneuron and a pretuned network, was proposed by A.I. Samarin [88]. As noted in this paper, one of the main properties of ANN models, which makes them an attractive tool for solving various applied problems, is that the network can change, adapting to the problem being solved. This kind of adjustment can be carried out in the following directions:

• the neural network can be trained, i.e., it can change the values of its tuning parameters (as a rule, the synaptic weights of the neural network connections);
• the neural network can change its structural organization by adding or removing neurons and rebuilding the interneural connections;
• the neural network can be dynamically tuned to the solution of the current task by replacing some of its constituent parts (subnets) with previously prepared fragments, or by changing the values of the network settings and its structural organization on the basis of previously prepared relationships linking the task to the required changes in the ANN model.

The first of these options leads to the traditional learning of ANN models, the second to the class of growing networks, and the third to networks with pretuning.

The most important limitation of the first of these approaches (ANN training) is that the network, before training starts, is potentially suitable for a wide class of problems, but after the completion of the learning process it can solve only a specific task; for another task, it is necessary to retrain the network, during which the skill of solving the previous task is lost.

The second approach (growing networks) copes with this problem only partially. Namely, if new training examples appear that do not fit into the ANN model obtained according to the first approach, then this model is built up with new elements, with the addition of appropriate links, after which the network receives additional training that does not affect its previously constructed part.
the form of multiperceptrons with sigmoid activation functions. This allows us to most effectively meet the requirements, which are, generally speaking, different for the a priori and refining models. In particular, the main requirement for the a priori model is the ability to represent complex nonlinear dependencies with the required accuracy, while the time spent on learning such a model is uncritical, since this training is carried out in an autonomous (off-line) mode. At the same time, the refining model must fit its work into the very rigid framework of the real (or even advanced) time scale. For this reason, in particular, in the vast majority of cases ANN architectures that require full retraining even after minor changes in the training data will be unacceptable. In such a situation, an incremental approach to training the ANN models is more appropriate, allowing us not to retrain the entire network, but only to correct those elements that are directly related to the changed training data.

2.3.3 Incremental Formation of ANN Models

One of the tools for adapting ANN models is incremental formation, which exists in two variants: parametric and structural-parametric.

In the parametric version of incremental formation, the structural organization of the ANN model is set immediately and fixed, after which it is incrementally adjusted (basic or additional learning) in several stages, for example, to extend the domain of operation modes of the dynamical system in which the model operates with the required accuracy.

For example, if we take a full spatial model of the aircraft motion, taking into account both its trajectory and angular motion, then in accordance with the incremental approach, first an off-line training of this model is carried out for a relatively small subdomain of the values of the state and control variables, and then, in the online mode, an incremental learning process of the ANN model is performed, during which the subregion is extended step by step. From this point on the model is operational, the goal being to eventually expand the given subdomain to the full domain of the variables.

In the structural-parametric version of the incremental model formation procedure, at first a "truncated" ANN model is constructed. This preliminary model has only a part of the state variables as its inputs, and it is trained on a dataset that covers only a subset of the domain of definition. This initial model is then gradually expanded by introducing new variables into it, followed by further training.

For example, the initial model is the model of the longitudinal angular motion of the aircraft, which is then expanded by adding the trajectory longitudinal motion, after which the lateral motion components are added to it; that is, the model is brought to the desired full model of the spatial motion in a few steps.

The structural-parametric variant of the incremental formation of ANN models allows us to start with a simple model, sequentially complicating it, for example, according to the scheme

material point
⇓
rigid body
⇓
elastic body
⇓
a set of coupled rigid and/or elastic bodies

This makes it possible to build up the model step-by-step in a structural sense.
2.4 TRAINING SET ACQUISITION PROBLEM FOR DYNAMIC NEURAL NETWORKS 73
combination of ⟨x, u⟩. By a reaction of this kind, we will understand the state x(t_{k+1}) to which the dynamical system (2.99) passes from the state x(t_k) under the value u(t_k) of the control action, written as

x(t_k), u(t_k) −−F(x,u,t)−→ x(t_{k+1}).    (2.103)

Accordingly, each example p from the training set P includes two parts, namely, the input (the pair ⟨x(t_k), u(t_k)⟩) and the output (the reaction x(t_{k+1})) of the dynamical system.

2.4.2.2 Informativity of the Training Set

The training set should (ideally) show the dynamical system responses to any combination of ⟨x, u⟩ satisfying the condition (2.102). Then, according to the Basic Identification Rule (see page 73), the training set will be informative, that is, it will allow us to reproduce in the model all the specific behavior of the simulated DS.⁵

Let us clarify this situation. We introduce the notation

p_i = {x^(i)(t_k), u^(i)(t_k), x^(i)(t_{k+1})},    (2.104)

where p_i ∈ P is the ith example from the training set P. In this example,

x^(i)(t_k) = (x_1^(i)(t_k), …, x_n^(i)(t_k)),
u^(i)(t_k) = (u_1^(i)(t_k), …, u_m^(i)(t_k)).    (2.105)

The response x^(i)(t_{k+1}) of the considered dynamical system to the example p_i is

x^(i)(t_{k+1}) = (x_1^(i)(t_{k+1}), …, x_n^(i)(t_{k+1})).    (2.106)

In a similar way, we introduce one more example p_j ∈ P:

p_j = {x^(j)(t_k), u^(j)(t_k), x^(j)(t_{k+1})}.    (2.107)

The source data of the examples p_i and p_j are considered as not coincident, i.e.,

x^(i)(t_k) ≠ x^(j)(t_k), u^(i)(t_k) ≠ u^(j)(t_k).

In the general case, the dynamical system responses to the source data from these examples do not coincide either, i.e.,

x^(i)(t_{k+1}) ≠ x^(j)(t_{k+1}).

We introduce the concept of ε-proximity for a pair of examples p_i and p_j. Namely, we will consider the examples p_i and p_j ε-close if the following condition is satisfied:

‖x^(i)(t_{k+1}) − x^(j)(t_{k+1})‖ ≤ ε,    (2.108)

where ε > 0 is a predefined real number.

We select from the set of examples P = {p_i}, i = 1, …, N_p, a subset consisting of those examples p_s for which the ε-proximity relation to the example p_i is satisfied, i.e.,

‖x^(i)(t_{k+1}) − x^(s)(t_{k+1})‖ ≤ ε, ∀s ∈ I_s ⊂ I.    (2.109)

Here I_s is the set of indices (numbers) of those examples for which ε-proximity with respect to the example p_i is satisfied, while I_s ⊂ I = {1, …, N_p}.

We call an example p_i an ε-representative⁶ if the condition of ε-proximity is satisfied for the whole collection of examples p_s, ∀s ∈ I_s, that is, for any example p_s, s ∈ I_s. Accordingly, we can now replace the collection of examples {p_s}, s ∈ I_s, by the single ε-representative p_i, and the error introduced by such a replacement will not exceed ε.

⁵ It should be noted that the availability of an informative training set provides a potential opportunity to obtain a model that will be adequate to the simulated dynamical system. However, this potential opportunity must still be taken advantage of, which is a separate nontrivial problem whose successful solution depends on the chosen class of models and learning algorithms.

⁶ This means that the example p_i is included in the set of examples {p_s}, s ∈ I_s.
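A greedy compression of a training set into ε-representatives based on the ε-proximity condition (2.108) might look as follows; the max-norm and the sample responses are assumptions made only for this illustration.

```python
# Greedy sketch: an example absorbs all remaining examples whose responses
# x(t_{k+1}) lie within eps of its own response, condition (2.108).
def eps_norm(a, b):
    return max(abs(ai - bi) for ai, bi in zip(a, b))  # assumed max-norm

def select_representatives(responses, eps):
    reps, remaining = [], list(range(len(responses)))
    while remaining:
        i = remaining.pop(0)          # next uncovered example becomes p_i
        reps.append(i)
        remaining = [s for s in remaining
                     if eps_norm(responses[i], responses[s]) > eps]
    return reps

responses = [[0.00], [0.05], [0.30], [0.32], [1.00]]  # hypothetical x(t_{k+1})
print(select_representatives(responses, eps=0.1))  # [0, 2, 4]
```

Each discarded example is within ε of its representative's response, so the error introduced by the replacement does not exceed ε, as stated above.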
The input parts of the collections of examples {p_s}, s ∈ I_s, allocate the subdomains R_XU^(s), s ∈ I_s, in the domain R_XU defined by the relation (2.102); in this case

R_XU = ⋃_{s=1}^{N_p} R_XU^(s).    (2.110)

Now we can state the task of forming a training set as a collection of ε-representatives that covers the domain R_XU (2.102) of all possible values of the pairs ⟨x, u⟩.

The relation (2.110) is the ε-covering condition for the training set P of the domain R_XU. A set P carrying out an ε-covering of the domain R_XU will be called ε-informative or, for brevity, simply informative.

If the training set P is ε-informative, this means that for any pair ⟨x, u⟩ ∈ R_XU there is at least one example p_i ∈ P which is an ε-representative for the given pair.

With respect to the ε-covering (2.110) of the domain R_XU, the following two problems can be formulated:

1. Given the number of examples N_p in the training set P, find their distribution in the domain R_XU which minimizes the error ε.
2. Given a permissible error value ε, obtain a minimal collection of N_p examples which ensures that ε is attained.

2.4.2.3 Example of Direct Formation of Training Set

Suppose that the controlled object under consideration (the plant) is a dynamical system described by a vector differential equation of the form [91,92]

ẋ = ϕ(x, u, t).    (2.111)

Here, x = (x_1 x_2 … x_n) ∈ R^n is the vector of state variables of the plant; u = (u_1 u_2 … u_m) ∈ R^m is the vector of control variables of the plant; R^n, R^m are Euclidean spaces of dimension n and m, respectively; t ∈ [t_0, t_f] is the time.

In Eq. (2.111), ϕ(·) is a nonlinear vector function of the vector arguments x, u and the scalar argument t. It is assumed to be given and to belong to some class of functions that admits the existence of a solution of Eq. (2.111) for given x(t_0) and u(t) in the considered part of the state space of the plant.

The behavior of the plant, determined by its dynamic properties, can be influenced by setting a correction value for the control variable u(x, u*). The operation of forming the required value u(x, u*) at time t_{i+1} from the values of the state vector x and the command control vector u* at the time instant t_i,

u(t_{i+1}) = Ψ(x(t_i), u*(t_i)),    (2.112)

is performed in a device which we call the correcting controller (CC). We assume that the character of the transformation Ψ(·) in (2.112) is determined by the composition and values of the components of a certain parameter vector w = (w_1 w_2 … w_{N_w}). The pair (2.111), (2.112) consisting of the plant and the CC is referred to as a controlled system.

The behavior of the system (2.111), (2.112) with the initial conditions x_0 = x(t_0) under the control u(t) is a multistep process if we assume that the values of this process x(t_k) are observed at the time instants t_k, i.e.,

{x(t_k)}, t_k = t_0 + kΔt, k = 0, 1, …, N_t, Δt = (t_f − t_0)/N_t.    (2.113)

In the problem (2.111), (2.112), as a teaching example, generally speaking, we could use the pair

⟨(x_0^(e), u^(e)(t)), {x^(e)(t_k), k = 0, 1, …, N_t}⟩,

where x_0^(e) and u^(e)(t) are the initial state of the system (2.111) and the formed control law, respectively, and {x^(e)(t_k), k = 0, 1, …, N_t} is the multistep process (2.113).
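A minimal sketch of producing the multistep process (2.113) by explicit Euler integration of (2.111); the scalar right-hand side phi below is a hypothetical stand-in used only for illustration.

```python
# Explicit Euler integration of x' = phi(x, u, t) on [t0, tf] with Nt steps,
# recording the observed values {x(t_k)} of (2.113).
def phi(x, u, t):
    return [-0.5 * x[0] + u[0]]   # assumed scalar plant dynamics

def simulate(x0, u_of_t, t0, tf, Nt):
    dt = (tf - t0) / Nt
    xs, x, t = [list(x0)], list(x0), t0
    for k in range(Nt):
        f = phi(x, u_of_t(t), t)
        x = [xi + dt * fi for xi, fi in zip(x, f)]
        t = t0 + (k + 1) * dt
        xs.append(x)              # observed values {x(t_k)}
    return xs

traj = simulate([1.0], lambda t: [0.0], 0.0, 1.0, 10)
print(len(traj))               # Nt + 1 = 11 samples
print(round(traj[-1][0], 3))   # 0.599: the unforced state decays
```

The trajectory {x(t_k)} produced this way is exactly the multistep process that the "straightforward" training approach discussed below would have to match as a whole.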
This process should be carried out from the initial state x_0^(e) under the influence of the control u^(e)(t) on the time interval [t_0, t_f]. Comparing the process {x^(e)(t_k)} with the process {x(t_k)} obtained for the same initial conditions x_0^(e) and control u^(e)(t), but for some fixed value of the parameters w, it would be possible in some way to determine the distance between the required and the actually implemented processes, and then to try to minimize it by varying the values of the parameters w. This kind of "straightforward" approach, however, leads to a sharp increase in the amount of computation at the stage of training of the ANN and, in particular, at the stage of the formation of the corresponding training set.

There is, however, a possibility of drastically reducing these volumes of calculations if we take advantage of the fact that the state into which the system (2.111), (2.112) goes in the time Δt = t_{i+1} − t_i depends only on its state x(t_i) at the time t_i and on the value u(t_i) of the control action at the same instant of time. This circumstance gives grounds to replace the multistep process {x^(e)(t_k)}, k = 0, 1, …, N_t, by a set of N_t one-step processes, each of which consists in performing, in (2.111), (2.112), one time step of length Δt from some initial point x(t_k).

In order to obtain a set of initial points ⟨x(t_0), u(t_0)⟩ which completely characterizes the behavior of the system (2.111), (2.112) on the whole range of admissible values R_XU ⊆ X × U, x ∈ X, u ∈ U, we construct the corresponding grid.

Let the state variables x_i, i = 1, …, n, in Eq. (2.111) take values from the ranges defined for each of them, i.e.,

x_i^min ≤ x_i ≤ x_i^max, i = 1, …, n,    (2.114)

and similarly for the control variables u_j, j = 1, …, m.

We define on these ranges a grid {Θ^(i), Θ^(j)} as follows:

Θ^(i): x_i^(s_i) = x_i^min + s_i Δx_i, i = 1, …, n; s_i = 0, 1, …, N_i,
Θ^(j): u_j^(p_j) = u_j^min + p_j Δu_j, j = 1, …, m; p_j = 0, 1, …, M_j.    (2.116)

In the expressions (2.116), we have

Δx_i = (x_i^max − x_i^min)/N_i, i = 1, …, n,
Δu_j = (u_j^max − u_j^min)/M_j, j = 1, …, m.

Here N_i is the number of segments into which the range of values of the state variable x_i is divided, i = 1, …, n, and M_j is the number of segments into which the range of values of the control variable u_j is divided, j = 1, …, m.

The nodes of this grid are tuples of length (n + m) of the form ⟨x_i^(s_i), u_j^(p_j)⟩, where the components x_i^(s_i), i = 1, …, n, are taken from the corresponding Θ^(i), and the components u_j^(p_j), j = 1, …, m, from the Θ^(j) in (2.116). If the domain R_XU is a proper subset of the Cartesian product X × U, then this fact can be taken into account by excluding the "extra" tuples from the grid (2.116).

In [90] an example of the solution of the ANN modeling problem was considered in which the training set was formed according to the method presented above. The source model of motion in this example is a system of equations of the following form:

m(V̇_z − qV_x) = Z,
I_y q̇ = M.    (2.117)

Here, the force Z and the moment M depend on the angle of attack α. However, in the case of a rectilinear horizontal flight the angle of attack equals the pitch angle θ. The pitch angle, in turn, is related to the velocity V_z and the airspeed V by the following kinematic dependence:

V_z = V sin θ.

Thus, the system of equations (2.117) is closed. The pitching moment M in (2.117) is a function of the all-moving stabilizer deflection angle, i.e., M = M(δ_e).

Thus, the system of equations (2.117) describes transient processes in the angular velocity and the pitch angle which arise immediately after a violation of the balancing corresponding to a steady horizontal flight.

So, in the particular case under consideration, the composition of the state and control variables is as follows:

x = [V_z q]^T, u = [δ_e].    (2.118)

In terms of the problem (2.117), when the mathematical model of the controlled object is approximated, the inequality (2.114) takes the form

V_z^min ≤ V_z ≤ V_z^max,
q^min ≤ q ≤ q^max.    (2.119)

As noted above, each of the grid nodes (2.116) is used as the initial value x_0 = x(t_0), u_0 = u(t_0) for the system of equations (2.111); with these initial values, one integration step of length Δt is performed. The initial values x(t_0), u(t_0) constitute the input vector of the learning example, and the resulting value x(t_0 + Δt) is the target vector, that is, the sample showing the learning algorithm of the ANN model what the output value of the network should be under the given starting conditions x(t_0), u(t_0).

The formation of a learning set for solving the neural network approximation problem for the dynamical system (2.111) (in particular, in its special case (2.117)) is a nontrivial task. As the computational experiment [90] has shown, the convergence of the learning process is very sensitive to the grid steps Δx_i, Δu_j and to the time step Δt.

We explain this situation by the example of the system (2.117), when x_1 = V_z, x_2 = q, u_1 = δ_e.

We represent, as shown in Fig. 2.28, the part of the grid {Θ^(V_z), Θ^(q)} whose nodes are used as initial values (the input part of the training example) to obtain the target part of the training examples.
the inequality (2.115) will be written as ing example. In Fig. 2.28, the grid node is shown
in a circle, and the cross is the state of the sys-
δemin δe δemax , (2.120) tem (2.117), obtained by integrating its equa-
tions with a time step t with the initial condi-
and the grid (2.116) is rewritten in the following (i)
form: tions (Vz , q (j ) ), for a fixed position of the stabi-
(k)
lizer δe .
(sV )
(Vz ) : Vz z = Vzmin + sVz Vz , In a series of computational experiments it
sVz = 0, 1, . . . , NVz , was established that for t = const, the condi-
tions of convergence of the learning process of
(q) : q (sq ) = q min + sq q , the neural controller will be as follows:
(2.121)
sq = 0, 1, . . . , Nq ,
(p)
(δe ) : δe = δemin + pδe δe , Vz (t0 + t) − Vz (t0 ) < Vz ,
(2.122)
pδe = 0, 1, . . . , Mδe . q(t0 + t) − q(t0 ) < q ,
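The one-step scheme just described is easy to sketch. Note that the dynamics below are NOT the actual force and moment models of (2.117) — Z and M depend on aircraft data not given here — so simple linear stand-ins are used purely to illustrate how grid nodes become input–target pairs; all ranges and coefficients are illustrative assumptions.

```python
import numpy as np

# Hypothetical stand-in for the right-hand side of (2.117); the real Z and
# M(delta_e) require aircraft data not given in the text.
def f(x, u):
    Vz, q = x
    delta_e = u
    Vz_dot = -0.5 * Vz + 2.0 * q        # stand-in for Z-related terms / m
    q_dot = -0.8 * q - 1.5 * delta_e    # stand-in for M(delta_e) / Iy
    return np.array([Vz_dot, q_dot])

def build_training_set(Vz_rng, q_rng, de_rng, N_Vz, N_q, M_de, dt):
    """Direct approach: every grid node (Vz, q, delta_e) is an initial point;
    one integration step of length dt yields the target x(t0 + dt)."""
    Vz_grid = np.linspace(*Vz_rng, N_Vz + 1)
    q_grid = np.linspace(*q_rng, N_q + 1)
    de_grid = np.linspace(*de_rng, M_de + 1)
    inputs, targets = [], []
    for Vz in Vz_grid:
        for q in q_grid:
            for de in de_grid:
                x0 = np.array([Vz, q])
                x1 = x0 + dt * f(x0, de)   # one explicit Euler step
                inputs.append([Vz, q, de])
                targets.append(x1)
    return np.array(inputs), np.array(targets)

# Illustrative ranges and grid counts (11 nodes per variable).
X, Y = build_training_set((-30, 30), (-0.5, 0.5), (-0.3, 0.3), 10, 10, 10, 0.02)
print(X.shape, Y.shape)   # (1331, 3) (1331, 2): (N_Vz+1)(N_q+1)(M_de+1) examples
```

The convergence conditions (2.122) can then be checked per node by comparing |x(t0 + Δt) − x(t0)| componentwise against the grid steps.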
78 2. DYNAMIC NEURAL NETWORKS: STRUCTURES AND TRAINING METHODS
FIGURE 2.29 Graphic representation of the grid {Δ^(V_z), Δ^(q)} for δ_e = const, combined with the target points; this grid sheet is built with δ_e = −8 deg. From [90], used with permission from Moscow Aviation Institute.
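The size of a training set built by such direct grid sampling is the product of the per-variable sample counts, which grows explosively with the dimension of the state and control vectors. A quick check (sample counts here are illustrative):

```python
# Size of a direct-sampling training set: one example per grid node, so the
# count is the product of the number of samples over all state and control
# variables.
def direct_set_size(n_state, n_control, samples_per_var):
    return samples_per_var ** (n_state + n_control)

# Pitch-motion example (2.118): 2 state variables, 1 control variable.
print(direct_set_size(2, 1, 20))            # 8000 examples - manageable

# Full angular-motion model: 14 state and 3 control variables.
print(f"{direct_set_size(14, 3, 20):.2e}")  # ~1.31e+22 - utterly impractical
```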
…tems with state vectors and controls of small size, and with a moderate number of samples with respect to these variables, are acceptable (the first and second variants in (2.124)). Even a slight increase in the values of these parameters leads, as can be seen from (2.125), to unacceptable sizes of the training set.

In real-world applied problems, where the possibilities of ANN modeling are particularly in demand, the result is even more impressive. In particular, in the full model of the angular motion of an aircraft (the corresponding ANN model for this case is considered in Section 6.3 of Chapter 6) we have 14 state variables and 3 control variables; hence, taking 20 samples for each variable, the volume of the training set under the direct approach to its formation will be N = 20^17 ≈ 1.3 · 10^22, which, of course, is completely unacceptable.

Thus, the direct approach to the formation of training sets for modeling dynamical systems has a very small "niche" in which its application is possible: simple problems of low dimensionality. The alternative, indirect approach is better suited for complex high-dimensional problems. This approach is based on the application of a set of specially designed control signals to the dynamical system of interest; it is discussed in more detail in the next section.

The indirect approach has its advantages and disadvantages. It is the only viable option in situations where training data acquisition must be performed in real time or even ahead of real time. However, when there are no rigid time restrictions on the acquisition and processing of training data, the most appropriate approach is a mixed one, combining the direct and indirect approaches.

2.4.3 Indirect Approach to the Acquisition of Training Data Sets for Dynamic Neural Networks

2.4.3.1 General Characteristics of the Indirect Approach to the Acquisition of Training Data Sets

As noted in the previous section, the indirect approach is based on the application of a set of specially designed control signals to the dynamical system, instead of direct sampling of the domain R_X,U of feasible values of the state and control variables.

With this approach, the actual motion (x(t), u(t)) of the dynamical system is composed of a program motion (test maneuver) (x*(t), u*(t)), generated by the control signal u*(t), and the motion (x̃(t), ũ(t)), generated by the additional perturbing action ũ(t), i.e.,

x(t) = x*(t) + x̃(t),  u(t) = u*(t) + ũ(t). (2.126)

Examples of test maneuvers include:

• a straight-line horizontal flight with a constant speed;
• a flight with a monotonically increasing angle of attack;
• a U-turn in the horizontal plane;
• an ascending/descending spiral.

Possible variants of the test perturbing actions ũ(t) are considered below.

The type of test maneuver (x*(t), u*(t)) in (2.126) determines the resulting ranges of the values of the state and control variables; ũ(t) provides the variety of examples within these ranges.

What is the ideal form of a training set, and how can it be obtained in practice using the indirect approach? We consider this issue in several stages, starting with the simplest version of a dynamical system and proceeding to more complex versions.

We first consider the simpler case of an uncontrolled dynamical system (Fig. 2.30).
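The decomposition (2.126) can be sketched directly: the realized control is a programmed (test-maneuver) signal u*(t) plus a small perturbing action ũ(t). The signal shapes and all numeric values below are illustrative assumptions, not data from the text.

```python
import numpy as np

# Decomposition (2.126): u(t) = u*(t) + u~(t).
t = np.linspace(0.0, 20.0, 2001)

# Programmed control u*(t): a constant (trim-like) deflection, in degrees.
u_star = np.full_like(t, -2.0)

# Perturbing action u~(t): a small doublet-like pulse pair on [5, 7) s.
u_tilde = np.where((t >= 5) & (t < 6), 1.0,
           np.where((t >= 6) & (t < 7), -1.0, 0.0))

u = u_star + u_tilde     # the control actually applied to the system
print(u[0], u.max(), u.min())   # -2.0 -1.0 -3.0
```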
2.4 TRAINING SET ACQUISITION PROBLEM FOR DYNAMIC NEURAL NETWORKS 81
⋃_{i=1}^{N_R} X_i = X_1 ∪ X_2 ∪ … ∪ X_{N_R} = X, (2.130)

where X is the family (collection) of all phase trajectories (trajectories in the state space) potentially realized by the dynamical system in question. This condition means that the family of reference trajectories {x_i*(t)}, i = 1, …, N_R, should together represent all potentially possible variants of the behavior of the dynamical system in question. This condition can be treated as a condition of completeness of the ε-covering, by the reference trajectories, of the domain of possible variants of the behavior of the dynamical system.

An optimal ε-covering problem for the domain X of possible variants of the dynamical system behavior can be stated, consisting in minimizing the number N_R of reference trajectories in the set {x_i*(t)}, i = 1, …, N_R, i.e.,

{x_i*(t)}_{i=1}^{N_R*} = min_{N_R} {x_i*(t)}_{i=1}^{N_R}, (2.131)

which allows us to minimize the volume of the training set while preserving its informativeness.

A desirable condition (but one difficult to realize) is also

⋂_{i=1}^{N_R} X_i = X_1 ∩ X_2 ∩ … ∩ X_{N_R} = ∅. (2.132)

2.4.3.3 Formation of Test Excitation Signal

As already noted, the type of test maneuver in (2.126) determines the resulting ranges of the values of the state and control variables, while the kind of perturbing action provides a variety of examples within these ranges. In the following sections we consider the questions of forming (for a given test maneuver) test excitation signals in such a way as to obtain an informative set of training data for a dynamical system.

FIGURE 2.31 Typical test excitation signals used in the study of the dynamics of controllable systems. (A) Stepwise excitation. (B) Impulse excitation. From [109], used with permission from Moscow Aviation Institute.

TYPICAL TEST EXCITATION SIGNALS FOR THE IDENTIFICATION OF SYSTEMS

Elimination of uncertainties in the ANN model by refining (restoring) a number of its elements (for example, the functions describing the aerodynamic characteristics of an aircraft) is a typical problem of system identification [44,93–99]. When solving identification problems for controllable dynamical systems, a number of typical test disturbances are used. Among them, the most common are the following [89,100–103]:

• stepwise excitation;
• impulse excitation;
• doublet (signal of type 1–1);
• triplet (signal of type 2–1–1);
• quadruplet (signal of type 3–2–1–1);
• random signal;
• polyharmonic signal.

Stepwise excitation (Fig. 2.31A) is a function u(t) that changes at a certain moment of time t_i from u = 0 to u = u*, i.e.,

u(t) = { 0, t < t_i;  u*, t ≥ t_i. (2.133)

Let u* = 1. Then (2.133) is the unit step function σ(t). With its use, we can define another kind of test action, a rectangular pulse (Fig. 2.31B):

u(t) = A(σ(t) − σ(t − T_r)), (2.134)

where A is the pulse amplitude and T_r = t_f − t_i is the pulse duration.

On the basis of the rectangular pulse signal (2.134), perturbing actions of oscillatory character are determined, consisting of a series of rectangular oscillations with a definite relationship between their periods. Among the most commonly used actions of this kind are the doublet (Fig. 2.32A), the triplet (Fig. 2.32B), and the quadruplet (Fig. 2.32C).

FIGURE 2.32 Typical test excitation signals used in the study of the dynamics of controllable systems. (A) Doublet (signal of type 1–1). (B) Triplet (signal of type 2–1–1). (C) Quadruplet (signal of type 3–2–1–1). From [109], used with permission from Moscow Aviation Institute.

FIGURE 2.33 Modified versions of the test excitation signals used in the study of the dynamics of controllable systems. (A) Triplet (signal of type 2–1–1). (B) Quadruplet (signal of type 3–2–1–1). From [109], used with permission from Moscow Aviation Institute.

The doublet (also denoted as a signal of type 1–1) is one complete rectangular wave with a period T = 2T_r, equal to twice the duration of the rectangular pulse.

A triplet (signal of type 2–1–1) is a combination of a rectangular pulse of duration T = 2T_r and a complete rectangular oscillation with a period T = 2T_r.

A quadruplet (signal of type 3–2–1–1) is formed from a triplet by adding at its origin a rectangular pulse of width T = 3T_r. In addition, we can also use triplet and quadruplet variants in which each of the constituent parts of the signal is a full-period oscillation (see Fig. 2.33). We designate them as signals of type 2–1–1 and 3–2–1–1, respectively.

Another typical excitation signal is shown in Fig. 2.34A. Its values are kept constant on each time interval [t_i, t_{i+1}), i = 0, 1, …, n − 1, and at the time instants t_i they can change randomly. A signal of this type will be considered in more detail below, using as an example the problem of ANN simulation of the longitudinal angular motion of an aircraft.
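The step, pulse, and multi-pulse signals above can be generated compositionally from the unit step. The alternating-sign construction of the triplet and quadruplet used below is an assumption consistent with their 2–1–1 and 3–2–1–1 segment-width patterns; the text itself only describes the widths.

```python
import numpy as np

def sigma(t):
    """Unit step function sigma(t)."""
    return (t >= 0).astype(float)

def pulse(t, A, Tr, t0=0.0):
    """Rectangular pulse (2.134): u(t) = A(sigma(t - t0) - sigma(t - t0 - Tr))."""
    return A * (sigma(t - t0) - sigma(t - t0 - Tr))

def doublet(t, A, Tr, t0=0.0):
    """Type 1-1: one full rectangular wave of period 2*Tr."""
    return pulse(t, A, Tr, t0) - pulse(t, A, Tr, t0 + Tr)

def triplet(t, A, Tr, t0=0.0):
    """Type 2-1-1: a 2*Tr-wide pulse followed by a full oscillation of period 2*Tr."""
    return pulse(t, A, 2 * Tr, t0) + doublet(t, -A, Tr, t0 + 2 * Tr)

def quadruplet(t, A, Tr, t0=0.0):
    """Type 3-2-1-1: a 3*Tr-wide pulse prepended to a sign-flipped triplet."""
    return pulse(t, A, 3 * Tr, t0) + triplet(t, -A, Tr, t0 + 3 * Tr)

t = np.linspace(0.0, 8.0, 8001)
q = quadruplet(t, 1.0, 1.0)
# Segments: +1 on [0, 3), -1 on [3, 5), +1 on [5, 6), -1 on [6, 7), then 0.
```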
FIGURE 2.34 Test excitations as functions of time used in studying the dynamics of controlled systems. (A) A random signal. (B) A polyharmonic signal. Here ϕ_act is the command signal for the elevator actuator (all-moving horizontal tail) of the aircraft from the example (2.123) on page 78. From [109], used with permission from Moscow Aviation Institute.
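A polyharmonic signal of the kind shown in Fig. 2.34B is a sum of harmonics of one fundamental frequency. The sketch below uses a Schroeder phase schedule for a uniform power spectrum, and measures the relative peak factor as the peak-to-peak amplitude over 2√2 times the rms value (which equals 1 for a single sinusoid). The phase formula, the peak-factor definition, and the values of K and T are assumptions drawn from the frequency-domain identification literature, not from this text.

```python
import numpy as np

def schroeder_multisine(K, T, n=2048):
    """Sum of K unit-amplitude harmonics of fundamental 1/T with Schroeder
    phases phi_k = -pi*k*(k-1)/K (assumed schedule for a flat spectrum)."""
    t = np.linspace(0.0, T, n, endpoint=False)
    u = np.zeros(n)
    for k in range(1, K + 1):
        phi_k = -np.pi * k * (k - 1) / K
        u += np.cos(2 * np.pi * k * t / T + phi_k)
    return t, u

def relative_peak_factor(u):
    """Peak-to-peak amplitude over 2*sqrt(2)*rms; equals 1 for one sinusoid."""
    rms = np.sqrt(np.mean(u ** 2))
    return (u.max() - u.min()) / (2.0 * np.sqrt(2.0) * rms)

t, u = schroeder_multisine(K=12, T=10.0)
# Typically close to 1, far below the in-phase (all phases zero) worst case.
print(round(relative_peak_factor(u), 2))
```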
POLYHARMONIC EXCITATION SIGNALS FOR THE IDENTIFICATION OF SYSTEMS

To solve identification problems for dynamical systems, including aircraft, frequency-domain methods are successfully applied. The available results show [104–107] that for a given frequency range it is possible to effectively estimate the parameters of dynamical system models in real time.

The determination of the composition of the experiments for modeling the dynamical system in the frequency domain is an important part of the identification problem solving process. The experiments should be carried out with the aid of excitation signals, applied to the input of the dynamical system, that cover a certain predetermined frequency range.

In the case where dynamical system parameter estimation is performed in real time, it is desirable that the stimulating effects on the dynamical system be small. If this condition is met, the response of the dynamical system (in particular, an aircraft) to the exciting inputs will be comparable in intensity with its reaction, for example, to atmospheric turbulence. The test excitations will then be hardly distinguishable from natural disturbances and will not cause unnecessary worries to the crew of the aircraft.

Modern aircraft, as one of the most important types of simulated dynamical systems, have a significant number of controls (rudders, etc.). When obtaining the data required for frequency analysis and dynamical system identification, it
…sized, or, if necessary, some frequency components should be eliminated (in particular, for fear of causing an undesired reaction of the controlled object). In [106] it was established empirically that if the index sets I_j are formed in such a way that they contain numbers greater than 1, multiples of 2 or 3 (for example, k = 2, 4, 6 or k = 5, 10, 15, 20), then the phase shifts for them can be optimized in such a way that the relative peak factor for the corresponding input action will be very close to 1, and in some cases even less than 1. For the distribution of the indices over the subsets I_j, the following conditions must be satisfied:

⋃_j I_j = K,  K = {1, 2, …, M},  ⋂_j I_j = ∅.

Each index k ∈ K must be used exactly once. Compliance with this condition ensures mutual orthogonality of the input actions both in the time domain and in the frequency domain.

4. Generate, according to (2.136), the input action u_j for each of the controls used, and then calculate the initial phase angle values ϕ_k according to the Schroeder method, assuming a uniform power spectrum.

5. Find the phase angle values ϕ_k for each of the input actions u_j that minimize their relative peak factor.

6. For each of the input actions u_j, perform a one-dimensional search for a constant time-offset value such that the corresponding input signal starts at a zero value of its amplitude. This operation is equivalent to shifting the graph of the input signal along the time axis so that the point of intersection of this graph with the abscissa (time) axis coincides with the origin. The phase shift corresponding to this displacement is added to the values of ϕ_k of all sinusoidal components (harmonics) of the considered input action u_j. It should be noted that, to obtain a constant time shift of all components of u_j, their phase shifts will differ in magnitude, since each component has its own frequency, different from the frequencies of the other components. Since all components of the signal u_j are harmonics of the same fundamental frequency with period of oscillation T, if the phase angles ϕ_k of all components are changed so that the initial value of the input signal is zero, then its value at the final moment of time will also be zero. In this case, the energy spectrum, orthogonality, and relative peak factor of the input signals remain unchanged.

7. Go back to step 5 and repeat the appropriate actions until either the relative peak factor reaches the prescribed value or the limit on the number of iterations of the process is reached. For example, the target value of the relative peak factor can be set to 1.01, and the maximum number of iterations to 50.

There are a number of methods that allow one to optimize the frequency spectrum of the input (test) signals when solving the problem of estimating the parameters of a dynamical system. However, all these methods require a significant amount of computation, as well as a certain level of knowledge about the dynamical system being investigated, usually tied to a certain nominal state of the system. With respect to the situation considered in this chapter, such methods are useless, because the task is to identify the dynamics of the system in real time for widely varying modes of its operation. In addition, the task of reconfiguring the control system in the event of failures and damage requires solving the identification problem under significant and unpredictable changes in the dynamics of the system. Under such conditions, the laborious calculation of an input action optimized for its frequency spectrum does not make sense, and in some cases is impossible, since it does not fit into real time. Instead, the frequency spectrum of all generated input actions is selected in such a way that it
is uniform in a given frequency range, in order to exert a sufficient excitatory effect on the dynamical system.

Step 6 of the process described above provides an input perturbation signal added to the main control action, selected, for example, for balancing the airplane or for performing a predetermined maneuver.

REFERENCES

[1] Ollongren A. Definition of programming languages by interpreting automata. London, New York, San Francisco: Academic Press; 1974.
[2] Brookshear JG. Theory of computation: Formal languages, automata, and complexity. Redwood City, California: The Benjamin/Cummings Publishing Co.; 1989.
[3] Chiswell I. A course in formal languages, automata and groups. London: Springer-Verlag; 2009.
[4] Fu KS. Syntactic pattern recognition. London, New York: Academic Press; 1974.
[5] Fu KS. Syntactic pattern recognition and applications. Englewood Cliffs, New Jersey: Prentice Hall, Inc.; 1982.
[6] Fu KS, editor. Syntactic methods in pattern recognition, applications. Berlin, Heidelberg, New York: Springer-Verlag; 1977.
[7] Gonzalez RC, Thomason MG. Syntactic pattern recognition: An introduction. London: Addison-Wesley Publishing Company Inc.; 1978.
[8] Tutschku K. Recurrent multilayer perceptrons for identification and control: The road to applications. University of Würzburg, Institute of Computer Science, Research Report Series, Report No. 118; June 1995.
[9] Heister F, Müller R. An approach for the identification of nonlinear, dynamic processes with Kalman-filter-trained recurrent neural structures. University of Würzburg, Institute of Computer Science, Research Report Series, Report No. 193; April 1999.
[10] Haykin S. Neural networks: A comprehensive foundation. 2nd ed. Upper Saddle River, NJ, USA: Prentice Hall; 1998.
[11] Hagan MT, Demuth HB, Beale MH, De Jesús O. Neural network design. 2nd ed. PWS Publishing Co.; 2014.
[12] Graves A. Supervised sequence labelling with recurrent neural networks. Berlin, Heidelberg: Springer; 2012.
[13] Hammer B. Learning with recurrent neural networks. Berlin, Heidelberg: Springer; 2000.
[14] Kolen JF, Kremer SC. A field guide to dynamical recurrent networks. New York: IEEE Press; 2001.
[15] Mandic DP, Chambers JA. Recurrent neural networks for prediction: Learning algorithms, architectures and stability. New York, NY: John Wiley & Sons, Inc.; 2001.
[16] Medsker LR, Jain LC. Recurrent neural networks: Design and applications. New York, NY: CRC Press; 2001.
[17] Michel A, Liu D. Qualitative analysis and synthesis of recurrent neural networks. London, New York: CRC Press; 2002.
[18] Yi Z, Tan KK. Convergence analysis of recurrent neural networks. Berlin: Springer; 2004.
[19] Gupta MM, Jin L, Homma N. Static and dynamic neural networks: From fundamentals to advanced theory. Hoboken, New Jersey: John Wiley & Sons; 2003.
[20] Lin DT, Dayhoff JE, Ligomenides PA. Trajectory production with the adaptive time-delay neural network. Neural Netw 1995;8(3):447–61.
[21] Guh RS, Shiue YR. Fast and accurate recognition of control chart patterns using a time delay neural network. J Chin Inst Ind Eng 2010;27(1):61–79.
[22] Yazdizadeh A, Khorasani K, Patel RV. Identification of a two-link flexible manipulator using adaptive time delay neural networks. IEEE Trans Syst Man Cybern, Part B, Cybern 2010;30(1):165–72.
[23] Juang JG, Chang HH, Chang WB. Intelligent automatic landing system using time delay neural network controller. Appl Artif Intell 2003;17(7):563–81.
[24] Sun Y, Babovic V, Chan ES. Multi-step-ahead model error prediction using time-delay neural networks combined with chaos theory. J Hydrol 2010;395:109–16.
[25] Zhang J, Wang Z, Ding D, Liu X. H∞ state estimation for discrete-time delayed neural networks with randomly occurring quantizations and missing measurements. Neurocomputing 2015;148:388–96.
[26] Yazdizadeh A, Khorasani K. Adaptive time delay neural network structures for nonlinear system identification. Neurocomputing 2002;77:207–40.
[27] Ren XM, Rad AB. Identification of nonlinear systems with unknown time delay based on time-delay neural networks. IEEE Trans Neural Netw 2007;18(5):1536–41.
[28] Beale MH, Hagan MT, Demuth HB. Neural network toolbox: User's guide. Natick, MA: The MathWorks, Inc.; 2017.
[29] Čerňanský M, Beňušková L. Simple recurrent network trained by RTRL and extended Kalman filter algorithms. Neural Netw World 2003;13(3):223–34.
[30] Elman JL. Finding structure in time. Cogn Sci 1990;14(2):179–211.
[31] Elman JL. Distributed representations, simple recurrent networks, and grammatical structure. Mach Learn 1991;7:195–225.
[32] Elman JL. Learning and development in neural networks: the importance of starting small. Cognition 1993;48(1):71–99.
[33] Chen S, Wang SS, Harris C. NARX-based nonlinear system identification using orthogonal least squares basis hunting. IEEE Trans Control Syst Technol 2008;16(1):78–84.
[34] Sahoo HK, Dash PK, Rath NP. NARX model based nonlinear dynamic system identification using low complexity neural networks and robust H∞ filter. Appl Soft Comput 2013;13(7):3324–34.
[35] Hidayat MIP, Berata W. Neural networks with radial basis function and NARX structure for material lifetime assessment application. Adv Mater Res 2011;277:143–50.
[36] Wong CX, Worden K. Generalised NARX shunting neural network modelling of friction. Mech Syst Signal Process 2007;21:553–72.
[37] Potenza R, Dunne JF, Vulli S, Richardson D, King P. Multicylinder engine pressure reconstruction using NARX neural networks and crank kinematics. Int J Eng Res 2017;8:499–518.
[38] Patel A, Dunne JF. NARX neural network modelling of hydraulic suspension dampers for steady-state and variable temperature operation. Veh Syst Dyn: Int J Veh Mech Mobility 2003;40(5):285–328.
[39] Gaya MS, Wahab NA, Sam YM, Samsudin SI, Jamaludin IW. Comparison of NARX neural network and classical modelling approaches. Appl Mech Mater 2014;554:360–5.
[40] Siegelmann HT, Horne BG, Giles CL. Computational capabilities of recurrent NARX neural networks. IEEE Trans Syst Man Cybern, Part B, Cybern 1997;27(2):208–15.
[41] Kao CY, Loh CH. NARX neural networks for nonlinear analysis of structures in frequency domain. J Chin Inst Eng 2008;31(5):791–804.
[42] Billings SA. Nonlinear system identification: NARMAX methods in the time, frequency and spatio-temporal domains. New York, NY: John Wiley & Sons; 2013.
[43] Pearson PK. Discrete-time dynamic models. New York–Oxford: Oxford University Press; 1999.
[44] Nelles O. Nonlinear system identification: From classical approaches to neural networks and fuzzy models. Berlin: Springer; 2001.
[45] Sutton RS, Barto AG. Reinforcement learning: An introduction. Cambridge, Massachusetts: The MIT Press; 1998.
[46] Busoniu L, Babuška R, De Schutter B, Ernst D. Reinforcement learning and dynamic programming using function approximators. London: CRC Press; 2010.
[47] Kamalapurkar R, Walters P, Rosenfeld J, Dixon W. Reinforcement learning for optimal feedback control: A Lyapunov-based approach. Berlin: Springer; 2018.
[48] Lewis FL, Liu D. Reinforcement learning and approximate dynamic programming for feedback control. Hoboken, New Jersey: John Wiley & Sons; 2013.
[49] Gill PE, Murray W, Wright MH. Practical optimization. London, New York: Academic Press; 1981.
[50] Nocedal J, Wright S. Numerical optimization. 2nd ed. Springer; 2006.
[51] Fletcher R. Practical methods of optimization. 2nd ed. New York, NY, USA: Wiley-Interscience. ISBN 0-471-91547-5, 1987.
[52] Dennis J, Schnabel R. Numerical methods for unconstrained optimization and nonlinear equations. Society for Industrial and Applied Mathematics; 1996.
[53] Gendreau M, Potvin J. Handbook of metaheuristics. International series in operations research & management science. US: Springer. ISBN 9781441916655, 2010.
[54] Du K, Swamy M. Search and optimization by metaheuristics: Techniques and algorithms inspired by nature. Springer International Publishing. ISBN 9783319411927, 2016.
[55] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Teh YW, Titterington M, editors. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Proceedings of machine learning research, vol. 9. Chia Laguna Resort, Sardinia, Italy: PMLR; 2010. p. 249–56. http://proceedings.mlr.press/v9/glorot10a.html.
[56] Nocedal J. Updating quasi-Newton matrices with limited storage. Math Comput 1980;35:773–82.
[57] Conn AR, Gould NIM, Toint PL. Trust-region methods. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. ISBN 0-89871-460-5, 2000.
[58] Steihaug T. The conjugate gradient method and trust regions in large scale optimization. SIAM J Numer Anal 1983;20(3):626–37.
[59] Martens J, Sutskever I. Learning recurrent neural networks with Hessian-free optimization. In: Proceedings of the 28th International Conference on Machine Learning. USA: Omnipress. ISBN 978-1-4503-0619-5, 2011. p. 1033–40. http://dl.acm.org/citation.cfm?id=3104482.3104612.
[60] Martens J, Sutskever I. Training deep and recurrent networks with Hessian-free optimization. In: Neural networks: Tricks of the trade. Springer; 2012. p. 479–535.
[61] Moré JJ. The Levenberg–Marquardt algorithm: Implementation and theory. In: Watson G, editor. Numerical analysis. Lecture notes in mathematics, vol. 630. Springer Berlin Heidelberg. ISBN 978-3-540-08538-6, 1978. p. 105–16.
[62] Moré JJ, Sorensen DC. Computing a trust region step. SIAM J Sci Stat Comput 1983;4(3):553–72. https://doi.org/10.1137/0904038.
[63] Bottou L, Curtis F, Nocedal J. Optimization methods for large-scale machine learning. SIAM Rev 2018;60(2):223–311. https://doi.org/10.1137/16M1080173.
[64] Griewank A, Walther A. Evaluating derivatives: Principles and techniques of algorithmic differentiation. 2nd ed. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. ISBN 0898716594, 2008.
[65] Griewank A. On automatic differentiation. In: Mathematical programming: Recent developments and applications. Kluwer Academic Publishers; 1989. p. 83–108.
[66] Bishop C. Exact calculation of the Hessian matrix for the multilayer perceptron. Neural Comput 1992;4(4):494–501. https://doi.org/10.1162/neco.1992.4.4.494.
[67] Werbos PJ. Backpropagation through time: What it does and how to do it. Proc IEEE 1990;78(10):1550–60.
[68] Chauvin Y, Rumelhart DE, editors. Backpropagation: Theory, architectures, and applications. Hillsdale, NJ, USA: L. Erlbaum Associates Inc.. ISBN 0-8058-1259-8, 1995.
[69] Jesus OD, Hagan MT. Backpropagation algorithms for a broad class of dynamic networks. IEEE Trans Neural Netw 2007;18(1):14–27.
[70] Williams RJ, Zipser D. A learning algorithm for continually running fully recurrent neural networks. Neural Comput 1989;1(2):270–80.
[71] Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. Trans Neural Netw 1994;5(2):157–66. https://doi.org/10.1109/72.279181.
[72] Hochreiter S, Bengio Y, Frasconi P, Schmidhuber J. Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. In: Kolen J, Kremer S, editors. A field guide to dynamical recurrent networks. IEEE Press; 2001. p. 15.
[73] Kremer SC. A field guide to dynamical recurrent networks. 1st ed. Wiley-IEEE Press. ISBN 0780353692, 2001.
[74] Pascanu R, Mikolov T, Bengio Y. On the difficulty of training recurrent neural networks. In: Proceedings of the 30th International Conference on Machine Learning, vol. 28. JMLR.org; 2013. p. III-1310–III-1318.
[75] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9:1735–80.
[76] Gers FA, Schmidhuber J, Cummins F. Learning to forget: Continual prediction with LSTM. Neural Comput 1999;12:2451–71.
[77] Gers FA, Schmidhuber J. Recurrent nets that time and count. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), vol. 3; 2000. p. 189–94.
[78] Gers FA, Schraudolph NN, Schmidhuber J. Learning precise timing with LSTM recurrent networks. J Mach Learn Res 2003;3:115–43. https://doi.org/10.1162/153244303768966139.
[79] Graves A, Schmidhuber J. Framewise phoneme classification with bidirectional LSTM networks. In: Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, vol. 4; 2005. p. 2047–52.
[80] Greff K, Srivastava RK, Koutník J, Steunebrink BR, Schmidhuber J. LSTM: A search space odyssey. CoRR 2015;abs/1503.04069. http://arxiv.org/abs/1503.04069.
[81] Wang Y. A new concept using LSTM neural networks for dynamic system identification. In: 2017 American Control Conference (ACC); 2017. p. 5324–9.
[82] Doya K. Bifurcations in the learning of recurrent neural networks. In: Proceedings of 1992 IEEE International Symposium on Circuits and Systems, vol. 6; 1992. p. 2777–80.
[83] Pasemann F. Dynamics of a single model neuron. Int J Bifurc Chaos Appl Sci Eng 1993;03(02):271–8. http://www.worldscientific.com/doi/abs/10.1142/S0218127493000210.
[84] Haschke R, Steil JJ. Input space bifurcation manifolds of recurrent neural networks. Neurocomputing 2005;64:25–38. https://doi.org/10.1016/j.neucom.2004.11.030.
[85] Jesus OD, Horn JM, Hagan MT. Analysis of recurrent network training and suggestions for improvements. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN '01), vol. 4; 2001. p. 2632–7.
[86] Horn J, Jesus OD, Hagan MT. Spurious valleys in the error surface of recurrent networks: Analysis and avoidance. IEEE Trans Neural Netw 2009;20(4):686–700.
[87] Phan MC, Hagan MT. Error surface of recurrent neural networks. IEEE Trans Neural Netw Learn Syst 2013;24(11):1709–21. https://doi.org/10.1109/TNNLS.2013.2258470.
[88] Samarin AI. Neural networks with pre-tuning. In: VII All-Russian Conference on Neuroinformatics. Lectures on neuroinformatics. Moscow: MEPhI; 2005. p. 10–20 (in Russian).
[89] Jategaonkar RV. Flight vehicle system identification: A time domain methodology. Reston, VA: AIAA; 2006.
[90] Morozov NI, Tiumentsev YV, Yakovenko AV. An adjustment of dynamic properties of a controllable object using artificial neural networks. Aerosp MAI J 2002;(1):73–94 (in Russian).
[91] Krasovsky AA. Automatic flight control systems and their analytical design. Moscow: Nauka; 1973 (in Russian).
[92] Krasovsky AA, editor. Handbook of automatic control theory. Moscow: Nauka; 1987 (in Russian).
[93] Graupe D. System identification: A frequency domain approach. New York, NY: R.E. Krieger Publishing Co.; 1976.
[94] Ljung L. System identification: Theory for the user. 2nd ed. Upper Saddle River, NJ: Prentice Hall; 1999.
[95] Sage AP, Melsa JL. System identification. New York and London: Academic Press; 1971.
[96] Tsypkin YZ. Information theory of identification. Moscow: Nauka; 1995 (in Russian).
[97] Isermann R, Münchhof M. Identification of dynamic systems: An introduction with applications. Berlin: Springer; 2011.
[98] Juang JN, Phan MQ. Identification and control of mechanical systems. Cambridge: Cambridge University Press; 1994.
[99] Pintelon R, Schoukens J. System identification: A frequency domain approach. New York, NY: IEEE Press; 2001.
[100] Berestov LM, Poplavsky BK, Miroshnichenko LY. Frequency domain aircraft identification. Moscow: Mashinostroyeniye; 1985 (in Russian).
[101] Vasilchenko KK, Kochetkov YA, Leonov VA, Poplavsky BK. Structural identification of mathematical model of aircraft motion. Moscow: Mashinostroyeniye; 1993 (in Russian).
[102] Klein V, Morelli EA. Aircraft system identification: Theory and practice. Reston, VA: AIAA; 2006.
[103] Tischler M, Remple RK. Aircraft and rotorcraft system identification: Engineering methods with flight-test examples. Reston, VA: AIAA; 2006.
[104] Morelli EA. In-flight system identification. AIAA–98–4261. 10 p.
[105] Morelli EA, Klein V. Real-time parameter estimation in the frequency domain. J Guid Control Dyn 2000;23(5):812–8.
[106] Morelli EA. Multiple input design for real-time parameter estimation in the frequency domain. In: 13th IFAC Conf. on System Identification, Aug. 27–29, 2003, Rotterdam, The Netherlands. Paper REG-360. 7 p.
[107] Smith MS, Moes TR, Morelli EA. Flight investigation of prescribed simultaneous independent surface excitations for real-time parameter identification. AIAA–2003–5702. 23 p.
[108] Schroeder MR. Synthesis of low-peak-factor signals and binary sequences with low autocorrelation. IEEE Trans Inf Theory 1970;16(1):85–9.
[109] Brusov VS, Tiumentsev YuV. Neural network modeling of aircraft motion. Moscow: MAI; 2016 (in Russian).