Nonlinear Identification
and Control
A Neural Network Approach
With 88 Figures
Springer
G.P. Liu, BEng, MEng, PhD
School of Mechanical, Materials, Manufacturing Engineering and Management,
University of Nottingham, University Park, Nottingham, NG7 2RD, UK
http://www.springer.co.uk
© Springer-Verlag London 2001
Originally published by Springer-Verlag London Berlin Heidelberg 2001
Softcover reprint of the hardcover 1st edition 2001
The use of registered names, trademarks etc. in this publication does not imply, even in the absence of a
specific statement, that such names are exempt from the relevant laws and regulations and therefore
free for general use.
The publisher makes no representation, express or implied, with regard to the accuracy of the
information contained in this book and cannot accept any legal responsibility or liability for any errors
or omissions that may be made.
Typesetting: Electronic text files prepared by author
Series Editors
Dr D.C. McFarlane
Department of Engineering
University of Cambridge
Cambridge CB2 1QJ
United Kingdom
Professor B. Wittenmark
Department of Automatic Control
Lund Institute of Technology
PO Box 118
S-221 00 Lund
Sweden
Professor H. Kimura
Department of Mathematical Engineering and Information Physics
Faculty of Engineering
The University of Tokyo
7-3-1 Hongo
Bunkyo Ku
Tokyo 113
Japan
Dr M.K. Masten
Texas Instruments
2309 Northcrest
Plano
TX 75075
United States of America
The series Advances in Industrial Control aims to report and encourage technology
transfer in control engineering. The rapid development of control technology has an
impact on all areas of the control discipline. New theory, new controllers, actuators,
sensors, new industrial processes, computer methods, new applications, new
philosophies ..., new challenges. Much of this development work resides in
industrial reports, feasibility study papers and the reports of advanced collaborative
projects. The series offers an opportunity for researchers to present an extended
exposition of such new work in all aspects of industrial control for wider and rapid
dissemination.
The time for nonlinear control to enter routine application seems to be
approaching. Nonlinear control has had a long gestation period but much of the past
has been concerned with methods that involve formal nonlinear functional model
representations. It seems more likely that the breakthrough will come through the use
of other more flexible and amenable nonlinear system modelling tools. This
Advances in Industrial Control monograph by Guoping Liu gives an excellent
introduction to the type of new nonlinear system modelling methods currently being
developed and used. Neural networks appear prominent in these new modelling
directions. The monograph presents a systematic development of this exciting
subject. It opens with a useful tutorial introductory chapter on the various tools to
be used. In subsequent chapters Doctor Liu leads the reader through identification,
and then onto nonlinear control using nonlinear system neural network
representations. Each chapter culminates with some examples and the final chapter
is a worked-out case-study for combustion processes.
We feel the structured presentation of modern nonlinear identification methods
and their use in control schemes will be of interest to postgraduate students,
industrial engineers and academics alike. We welcome this addition to the Advances
in Industrial Control monograph series.
It is well known that linear models have been widely used in system identi-
fication for two major reasons. First, the effects that different and combined
input signals have on the output are easily determined. Second, linear systems
are homogeneous. However, control systems encountered in practice possess
the property of linearity only over a certain range of operation; all physical
systems are nonlinear to some degree. In many cases, linear models are not
suitable to represent these systems and nonlinear models have to be considered.
Since there are nonlinear effects in practical systems, e.g., harmonic genera-
tion, intermodulation, desensitisation, gain expansion and chaos, neither of
the above principles for linear models is valid for nonlinear systems. There-
fore, nonlinear system identification is much more difficult than linear system
identification.
Any attempt to restrict attention strictly to linear control can only lead to
severe complications in system design. To operate linearly over a wide range
of variation of signal amplitude and frequency would require components of an
extremely high quality; such a system would probably be impractical from the
viewpoints of cost, space, and weight. In addition, the restriction of linearity
severely limits the system characteristics that can be realised.
Recently, neural networks have become an attractive tool that can be used
to construct a model of complex nonlinear processes. This is because neu-
ral networks have an inherent ability to learn and approximate a nonlinear
function arbitrarily well. This therefore provides a possible way of modelling
complex nonlinear processes effectively. A large number of identification and
control structures have been proposed on the basis of neural networks in recent
years.
The purpose of this monograph is to give a broad view of nonlinear
identification and control using neural networks. Basically, the monograph
consists of three parts. The first part gives an introduction to fundamental
principles of neural networks. Then several methods for nonlinear identification
using neural networks are presented. In the third part, various techniques for
nonlinear control using neural networks are studied. A number of simulated
and industrial examples are used throughout the monograph to demonstrate
the operation of the techniques of nonlinear identification and control using
neural networks. It should be emphasised here that methods for nonlinear
control systems have not progressed as rapidly as have techniques for linear
control systems. Comparatively speaking, at the present time they are still
in the development stage. We believe that the fundamental theory, various
design methods and techniques, and many application examples of nonlinear
identification and control using neural networks that are presented in this
monograph will enable one to analyse and synthesise nonlinear control systems
quantitatively. The monograph, which is mostly based on the author's recent
research work, is organised as follows.
Chapter 1 gives an overview of what neural networks are, followed by a
description of the model of a neuron (the basic element of a neural network)
and commonly used architectures of neural networks. Various types of neural
networks are presented, e.g., radial basis function networks, polynomial basis
function networks, fuzzy neural networks and wavelet networks. The function
approximation properties of neural networks are discussed. A few widely used
learning algorithms are introduced, such as the sequential learning algorithm,
the error back-propagation learning algorithm and the least-mean-squares al-
gorithm. Many applications of neural networks to classification, filtering, mod-
elling, prediction, control and hardware implementation are mentioned.
Chapter 2 presents a sequential identification scheme for nonlinear dynam-
ical systems. A novel neural network architecture, referred to as a variable neu-
ral network, is studied and shown to be useful in approximating the unknown
nonlinearities of dynamical systems. In the variable neural network, the num-
ber of basis functions can be either increased or decreased with time according
to specified design strategies so that the network will not overfit or underfit
the data set. The identification model varies gradually to span the appropri-
ate state-space and is of sufficient complexity to provide an approximation to
the dynamical system. The sequential identification scheme, different from the
conventional methods of optimising a cost function, attempts to ensure stabil-
ity of the overall system while the neural network learns the system dynamics.
The stability and convergence of the overall identification scheme are guaran-
teed by the developed parameter adjustment laws. An example illustrates the
modelling of an unknown nonlinear dynamical system using variable network
identification techniques.
Chapter 3 considers a recursive identification scheme using neural networks
for nonlinear control systems. This comprises a structure selection procedure
and a recursive weight learning algorithm. The orthogonal least squares algo-
rithm is introduced for off-line structure selection and the growing network
technique is used for on-line structure selection. An on-line recursive weight
learning algorithm is developed to adjust the weights so that the identified
model can adapt to variations of the characteristics and operating points in
nonlinear systems. The convergence of both the weights and estimation errors
is established. The recursive identification scheme using neural networks is
demonstrated by three examples. The first is identification of unknown sys-
tems represented by a nonlinear input-output dynamical model. The second
is identification of unknown systems represented by a nonlinear state-space
dynamical model. The third is the identification of the Santa Fe time series.
Guoping Liu
School of Mechanical, Materials, Manufacturing
Engineering and Management
University of Nottingham
Nottingham NG7 2RD
United Kingdom
May 2001
Symbols and Abbreviations
The symbols and abbreviations listed here are used unless otherwise stated.
C field of complex numbers
diag{·} diagonal matrix
dim(·) dimension of a vector
exp(·) exponential function
GA genetic algorithm
GAs genetic algorithms
GRBF Gaussian radial basis function
ḡ complex conjugate of g
||f||_n n-norm of the function f
⟨·,·⟩ inner product
λ(·) eigenvalue of a matrix
λmax(·) maximum eigenvalue of a matrix
λmin(·) minimum eigenvalue of a matrix
MIMO multi-input multi-output
MIMS multi-input multi-state
MoI method of inequalities
MLP multilayer perceptron
max{·} maximum
min{·} minimum
|·| modulus
NARMA nonlinear auto-regressive moving average
NARMAX NARMA model with exogenous inputs
NN neural network
NNs neural networks
N integer numbers
N+ non-negative integer numbers
ω angular frequency
∂/∂x partial derivative with respect to x
φ(·) basis function
r reference input
RBF radial basis function
R field of real numbers (−∞, ∞)
R+ field of non-negative real numbers [0, ∞)
sign(·) sign function
1.1 Introduction
The field of neural networks has its roots in neurobiology. The structure and
functionality of neural networks has been motivated by the architecture of
the human brain. Following the complex neural architecture, a neural network
consists of layers of simple processing units coupled by weighted interconnec-
tions. With the development of computer technology, significant progress in
neural network research has been made. A number of neural networks have
been proposed in recent years.
The multilayer perceptron (MLP) (Rumelhart et al., 1986) is a network
that is built upon the McCulloch and Pitts model of neurons (McCulloch
and Pitts, 1943) and the perceptron (Rosenblatt, 1958). The perceptron maps
the input, generally binary, onto a binary-valued output. The MLP extends this
mapping to real-valued outputs for binary or real-valued inputs. The decision
regions that could be formed by this network extend beyond the linearly
separable regions that are formed by the perceptron. The nonlinearity inherent in
the network enables it to perform better than the traditional linear methods
(Lapedes and Farber, 1987). It has been observed that this input output net-
work mapping can be viewed as a hypersurface constructed in the input space
(Lapedes and Farber, 1988). A surface interpolation method, called the radial
basis functions, has been cast into a network whose architecture is similar to
that of MLP (Broomhead and Lowe, 1988). Other surface interpolation meth-
ods, for example, the multivariate adaptive regression splines (Friedman, 1991)
and B-splines (Lane et al., 1989), have also found their way into new forms
of networks. Another view presented in Lippmann (1987), and Lapedes and
Farber (1988) is that the network provides an approximation to an underlying
function. This has resulted in applying polynomial approximation methods
to neural networks, such as the Sigma-Pi units (Rumelhart et al., 1986), the
Volterra polynomial network (Rayner and Lynch, 1989) and the orthogonal
network (Qian et al., 1990). The application of wavelet transforms to neural
networks (Pati and Krishnaprasad, 1990) has also derived its inspiration from
function approximation.
While these networks may have little relationship to biological neural net-
works, it has become common in the neural network area to refer to them as
neural networks. These networks share one important characteristic: they can
all be described as parameterised combinations of basis functions.
The model of a neuron consists of a linear combiner followed by an
activation function:

v_k = Σ_j w_kj u_j    (1.1)

y_k = φ(v_k)    (1.2)

where u_j is the input signal, w_kj the weight of the neuron, v_k the linear
combiner output, φ(·) the activation function and y_k the output signal of the neuron.
The activation function defines the output of a neuron in terms of the
activity level at its input. There are many types of activation functions. Here
three basic types of activation functions are introduced: threshold function,
piecewise-linear function and sigmoid function.
When the threshold function is used as an activation function, it is
described by

φ(v) = 1 if v ≥ 0; 0 if v < 0    (1.3)

The piecewise-linear function is described by

φ(v) = 1 if v ≥ 1/2; v + 1/2 if −1/2 < v < 1/2; 0 if v ≤ −1/2    (1.4)

where the amplification factor inside the linear region of operation is assumed
to be unity. This activation function may be viewed as an approximation to a
nonlinear amplifier. There are two special forms of the piecewise-linear func-
tion: (a) it is a linear combiner if the linear region of operation is maintained
without running into saturation, and (b) it reduces to a threshold function if
the amplification factor of the linear region is made infinitely large.
The sigmoid function is a widely used form of activation function in neural
networks. It is defined as a strictly increasing function that exhibits smoothness
and asymptotic properties. An example of the sigmoid is the logistic function,
described by
φ(v) = 1/(1 + e^(−av))    (1.5)
where a is the slope parameter of the sigmoid function. By varying the pa-
rameter a, sigmoid functions of different slopes can be obtained. In the limit,
as the slope parameter approaches infinity, the sigmoid function becomes sim-
ply a threshold function. Note also that the sigmoid function is differentiable,
whereas the threshold function is not.
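The three basic activation functions above can be sketched in a few lines of Python (an illustrative sketch; the function and parameter names are ours, not the book's):

```python
import math

def threshold(v):
    # Equation (1.3): hard limiter with unit amplitude
    return 1.0 if v >= 0 else 0.0

def piecewise_linear(v):
    # Equation (1.4): saturating linear function, unity gain for |v| < 1/2
    if v >= 0.5:
        return 1.0
    if v <= -0.5:
        return 0.0
    return v + 0.5

def sigmoid(v, a=1.0):
    # Equation (1.5): logistic function with slope parameter a
    return 1.0 / (1.0 + math.exp(-a * v))
```

As the text notes, for large slope parameters the sigmoid approaches the threshold function, while remaining differentiable everywhere.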
The multilayer network has an input layer, one or several hidden layers and
an output layer. Each layer consists of neurons with each neuron in a layer
connected to neurons in the layer below. This network has a feedforward ar-
chitecture which is shown in Figure 1.3. The number of input neurons defines
the dimensionality of the input space being mapped by the network and the
number of output neurons the dimensionality of the output space into which
the input is mapped.
In a feedforward neural network, the overall mapping is achieved via in-
termediate mappings from one layer to another. These intermediate mappings
depend on two factors. The first is the connection mapping that transforms
the output of the lower-layer neurons to an input to the neuron of interest and
the second is the activation function of the neuron itself.
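The two factors above, the connection mapping and the activation function, can be sketched for a small feedforward network (an illustrative sketch with hypothetical weights; not the book's notation):

```python
import math

def layer(x, weights, biases, act):
    # One connection mapping (weighted sum plus bias) followed by the
    # activation function of each neuron in the layer
    return [act(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def mlp(x, w1, b1, w2, b2):
    # Overall mapping achieved via intermediate layer-to-layer mappings:
    # input -> sigmoidal hidden layer -> linear output layer
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    hidden = layer(x, w1, b1, sigmoid)
    return layer(hidden, w2, b2, lambda v: v)
```

For example, `mlp([1.0, 2.0], [[0.5, -0.3], [0.2, 0.8]], [0.0, 0.1], [[1.0, 1.0]], [0.0])` maps a two-dimensional input through two hidden neurons to a single output.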
A recurrent neural network has at least one feedback loop that distinguishes
itself from a feedforward neural network. The recurrent network may consist
of a single-layer or multilayer of neurons and each neuron may feed its output
signal back to the inputs of all the other neurons. A class of recurrent networks
with hidden neurons is illustrated in the architectural graph of Figure 1.4. In
the structure, the feedback connections originate from the hidden neurons as
well as the output neurons. The presence of feedback loops in the recurrent
networks has a profound impact on the learning capability of the network, and
on its performance. Moreover, the feedback loops use particular branches com-
posed of unit-delay elements, which result in a nonlinear dynamical behaviour
by virtue of the nonlinear nature of the neurons.
Radial basis functions (RBF) have been introduced as a technique for multi-
variable interpolation (Powell, 1987). Broomhead and Lowe demonstrated that
these functions can be cast into an architecture similar to that of the multilayer
network, and hence named the RBF network (Broomhead and Lowe, 1988).
In the RBF network, which is a single-hidden-layer network, the input-to-
hidden-layer connection transforms the input into a distance from a
point in the input space, unlike in the MLP, where the input is transformed into a
distance from a hyperplane in the input space. However, it has been seen from
multilayer networks that the hidden neurons can be viewed as constructing
basis functions which are then combined to form the overall mapping. For
the RBF network, the basis function constructed at the k-th hidden neuron is
given by

φ_k(u) = g(||u − d_k||_2)    (1.6)

where ||·||_2 is a distance measure, u the input vector, d_k the unit centre in
the input space and g(·) a nonlinear function. The basis functions are radially
symmetric about the centre d_k in the input space, hence they are named
radial basis functions. Some examples of nonlinear functions used as a radial
basis function g(·) are the following:
(a) the local RBFs, for example the Gaussian function

g(v) = exp(−v²/σ²)    (1.7)

where σ is a width parameter.
The radial basis function network with Gaussian hidden neurons is named the
Gaussian radial basis function (GRBF) network, also referred to as a network
of localised receptive fields by Moody and Darken, who were inspired by the
biological neurons in the visual cortex (Moody and Darken, 1989). The GRBF
network is related to a variety of different methods (Niranjan and Fallside,
1990), particularly, Parzen window density estimation which is the same as
kernel density estimation with a Gaussian kernel, potential functions method
for pattern classification, and maximum likelihood Gaussian classifiers, which
all can be described by a GRBF network formalism.
Following (1.6) and (1.7), the GRBF network can be described in a more
general form. Instead of using the simple Euclidean distance between an input
and a unit centre as in the usual formalism, a weighted distance scheme is used
as follows:
||u − d_k||²_{C_k} = (u − d_k)^T C_k^(−1) (u − d_k)    (1.15)

φ_k(u; d_k, C_k) = exp(−(u − d_k)^T C_k^(−1) (u − d_k))    (1.16)

where d and C represent the centres and the weighting matrices. Using the
same C_k for all the basis functions is equivalent to linearly transforming the
input by the matrix C_k^(−1/2) and then using the Euclidean distance
(u − d_k)^T (u − d_k). In general, a different C_k is used.
The Gaussian RBF network mapping is given by

f(u; p) = Σ_{k=1}^{n} w_k φ_k(u; d, C)    (1.17)
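The Gaussian RBF mapping can be sketched as follows (an illustrative sketch assuming the simple Euclidean distance with a per-unit width σ_k, rather than the full weighting matrices C_k):

```python
import math

def grbf(u, centres, widths, weights):
    # f(u) = sum_k w_k * exp(-||u - d_k||^2 / sigma_k^2), cf. (1.17)
    # with Gaussian basis functions centred on d_k (assumed width form)
    phis = [math.exp(-sum((ui - di) ** 2 for ui, di in zip(u, d)) / s ** 2)
            for d, s in zip(centres, widths)]
    return sum(w * p for w, p in zip(weights, phis))
```

At a unit centre the corresponding basis function equals one, so the output there is dominated by that unit's weight.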
A polynomial basis function network expands the mapping f(u) as

f(u; p) = w_0 + Σ_{i=1}^{n} w_i u_i + Σ_{i1=1}^{n} Σ_{i2=i1}^{n} w_{i1 i2} u_{i1} u_{i2} + ...
          + Σ_{i1=1}^{n} Σ_{i2=i1}^{n} ... Σ_{ik=i(k−1)}^{n} w_{i1 i2 ... ik} u_{i1} u_{i2} ... u_{ik} + O(u^{k+1})    (1.18)

        = Σ_{j=1}^{N} w_j φ_j(u)    (1.19)
where p = {w_j} is the set of the concatenated weights and {φ_j} the set of
basis functions formed from the polynomial input terms, N is the number of the
polynomial basis functions, k is the order of the polynomial expansion, and
O(u^{k+1}) denotes the approximation error caused by the high-order (≥ k+1)
terms of the input vector. The basis functions are essentially polynomials of
zero, first and higher orders of the input vector u ∈ R^n.
This method can be considered as expanding
the input to a higher dimensional space. An important difference between
polynomial networks and other networks like RBF is that the polynomial basis
functions themselves are not parameterised and hence adaptation of the basis
functions during learning is not needed.
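The expansion of the input into fixed polynomial basis functions, as in equations (1.18) and (1.19), can be sketched for a second-order expansion (an illustrative sketch; the index ordering i2 ≥ i1 follows the expansion above):

```python
def poly_basis(u, order=2):
    # Polynomial basis functions of zero, first and second order,
    # with indices i2 >= i1 as in equation (1.18)
    phis = [1.0]                            # zero-order term
    n = len(u)
    phis += list(u)                         # first-order terms u_i
    if order >= 2:
        for i1 in range(n):
            for i2 in range(i1, n):
                phis.append(u[i1] * u[i2])  # second-order terms u_i1 * u_i2
    return phis

def poly_net(u, weights):
    # f(u) = sum_j w_j * phi_j(u), equation (1.19)
    return sum(w * p for w, p in zip(weights, poly_basis(u)))
```

Because the basis functions are not parameterised, only the weights w_j need to be estimated, which keeps learning linear in the parameters.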
Fuzzy neural networks have their origin in fuzzy sets and fuzzy inference
systems, which were developed by Zadeh (1973). A survey of fuzzy sets in ap-
proximate reasoning is given in Dubois and Prade (1991). The fuzzy reasoning
is usually an "if-then" rule (or fuzzy conditional statement), for example,
If pressure is HIGH, then volume is SMALL
where pressure and volume are linguistic variables, and HIGH and SMALL
linguistic values. The linguistic values are characterised by appropriate mem-
bership functions. The "if" part of the rules is referred to as the antecedent and
the "then" part is known as the consequent.
Another type of fuzzy if-then rule has fuzzy sets involved only in the an-
tecedent part. For example, the dependency of the air resistance (force) on the
speed of a moving object may be described as
If speed is HIGH, then force = k × (speed)²
To construct a fuzzy reasoning mechanism, the firing strength of the i-th rule
may be defined as the T-norm (usually multiplication or minimum operator)
of the membership values on the antecedent part
φ_i(u) = μ_Ai(u_1) × μ_Bi(u_2)    (1.20)

or

φ_i(u) = min{μ_Ai(u_1), μ_Bi(u_2)}    (1.21)

where μ_Ai(·) and μ_Bi(·) are usually chosen to be bell-shaped functions with
maximum equal to 1 (Jang and Sun, 1993) and minimum equal to 0, such as
the Gaussian-type function

μ_Ai(u) = exp(−((u − c_i)/σ_i)²)    (1.22)

where c_i is the centre and σ_i the width of the membership function.
The overall output is then formed as the weighted average of the rule
consequents e_i(u), normalised by the total firing strength:

f(u) = Σ_{i=1}^{m} e_i(u) φ_i(u) / Σ_{j=1}^{m} φ_j(u)    (1.23)
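A minimal fuzzy reasoning mechanism of this kind can be sketched as follows (an illustrative sketch assuming Gaussian-type memberships, a product T-norm for the firing strength, and constant rule consequents):

```python
import math

def bell(u, centre, width):
    # Assumed Gaussian-type membership: maximum 1, minimum 0
    return math.exp(-((u - centre) / width) ** 2)

def fuzzy_out(u, rules):
    # rules: list of (membership functions per input, consequent value e_i)
    # Firing strength via the product T-norm, as in the text
    strengths = [math.prod(m(ui) for m, ui in zip(mems, u))
                 for mems, _ in rules]
    total = sum(strengths)
    # Normalised weighted average of the consequents
    return sum(s * e for s, (_, e) in zip(strengths, rules)) / total
```

Midway between two equally strong rules, the output is the average of their consequents.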
A family of wavelets can be generated from a mother wavelet ψ(·) by dilation
and translation:

ψ_{s,t}(u) = (1/√s) ψ((u − t)/s)    (1.24)

The wavelet transform of a function g(u) is then given by

(W g)(s, t) = ∫_{−∞}^{∞} g(u) ψ((u − t)/s) du    (1.25)

This transform can decompose g(u) into its components at different scales in
frequency and space (location) by varying the scaling/dilation factor s and
the translation factor t, respectively.
The function g(u) can be reconstructed by performing the inverse operation,
that is

g(u) = (1/c_ψ) ∫₀^∞ ∫_{−∞}^{∞} (W g)(s, t) ψ((u − t)/s) dt ds/s²    (1.26)

where c_ψ is a constant that depends on the mother wavelet ψ(·).
In practice, the orthonormal wavelet functions are widely used. For example,
the following Haar wavelet is one such wavelet:

ψ(u) = 1 if 0 ≤ u < 1/2; −1 if 1/2 ≤ u < 1; 0 otherwise    (1.31)
Also, the orthonormal wavelet functions include the Gaussian derivative
wavelet. A wavelet network with m wavelet neurons can then be described by

g(u) = g_0 + Σ_{i=1}^{m} w_i ψ(S_i(u − t_i))    (1.34)
where S_i = diag(s_i1, ..., s_id), d is the dimension of the input, and g_0 is
introduced to deal with nonzero mean functions on finite domains. The original
formulation of the wavelet network was based on the tensor product of one-
dimensional wavelets but recently the radial wavelet function was applied.
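The scalar form of the wavelet network in equation (1.34) can be sketched using the Haar mother wavelet of equation (1.31) (an illustrative sketch; the one-dimensional scales and translations are our example values):

```python
def haar(u):
    # Haar mother wavelet, equation (1.31)
    if 0 <= u < 0.5:
        return 1.0
    if 0.5 <= u < 1.0:
        return -1.0
    return 0.0

def wavelet_net(u, g0, weights, scales, translations):
    # Scalar form of equation (1.34): g(u) = g0 + sum_i w_i psi(s_i (u - t_i))
    return g0 + sum(w * haar(s * (u - t))
                    for w, s, t in zip(weights, scales, translations))
```

Each term contributes only where its dilated and translated wavelet is nonzero, which gives the network its localised, multi-scale character.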
To obtain the orientation selective nature of dilations and to improve flexi-
bility, a rotation transform can be incorporated by
g(u) = g_0 + Σ_{i=1}^{m} w_i ψ(R_i(u − t_i)/s_i)    (1.35)

where R_i is a rotation matrix.
Wavelet theory and networks have been widely employed in applications in
diverse areas, such as geophysics (Kumar and Foufoula-Georgiou, 1993) and
system identification (Sjoberg et al., 1995; Liu et al., 1999, 2000).
There are many other types of neural networks. Forms of neural networks
based on orthogonal polynomial expansions can be used, such as Hermite
polynomials, Legendre polynomials and Bernstein polynomials. Apart from
the polynomial expansion, orthogonal basis functions such as the Fourier se-
ries may also be employed. The surface interpolation method of splines has
been adopted in the development of spline networks (Friedman, 1991). Kernel
functions, which are commonly used in kernel density estimation procedures,
may also be introduced as forms of neural networks.
The mathematical formalism of the networks allows recent developments
in neural networks to deviate from the biological plausibility that served as an
impetus in the first place. This is not a cause for concern because the ultimate
aim of such developments is to build machines rather than to understand and
model biologically intelligent systems. What should be avoided is to refer to
them simply as neural networks. However, to avoid confusion in the terminol-
ogy we will continue to refer to these as neural networks with the emphasis
placed on the fact that they are no more than a special class of nonlinear
model.
The functional description of neural networks has a common form of ex-
pression. Essentially, neural networks are parametric and can be described as
a linear combination of basis functions. So, the neural network is generally
denoted by
m
where w is the parameter vector containing the coefficients Wk and the set
of parameters that define the basis function 'Pk(U), m is the number of basis
functions used in the overall mapping of the network. For each parameter
vector w E P, the network mapping f E F w , where P is the parameter set and
Fw the set of functions that can be described by the chosen neural network.
Neural networks learn from the examples presented to them, which are in
the form of input output pairs. To simplify the presentation, a single variable
function is taken into account. Let the input to the network be denoted by u
and the output by y. The neural network maps an input pattern to an output
pattern, described by
f : u → y    (1.37)
An assumption made about these examples is that they are consistent with an
underlying mapping, say j*. Then the relationship between the input and the
output can be stated as
y = f*(u) + v    (1.38)

The examples available for learning form the data set

D = {(u_k, y_k); k = 1, 2, ..., n}    (1.39)

which contains the information that is available about the unknown mapping
f*.
Let the set F_w = {f(u; w) : for all w ∈ P} describe all functions that
can be mapped by the neural network. The task of learning is to approximate
f*(u) by choosing a suitable f(u; w). This requires a measure of approximation
accuracy to be defined, a simple example of which is the approximation error.
The basic approximation problem treated in this book can be stated as follows:
for a given f(u), find the function amongst the set F_w = {f(u; w) :
for all w ∈ P} that has the least distance to f(u). This is equivalent to finding
the f(u; w) that has the least approximation error, i.e., the w that minimises
||f(u) − f(u; w)||.
It is not sufficient that the function f(u; w) found is merely the closest
approximation to f(u). To guarantee the approximation to be sufficiently good,
the least approximation error must be below a threshold. If the set F w , which
contains all the functions that can be mapped by the network, is sufficiently
large, then there is a reasonable chance of satisfying the above requirement.
Generally, w appears nonlinearly in f (u; w). It is clear that the above problem
is a nonlinear optimisation problem, which can be solved by any of the standard
procedures or algorithms such as those in Luenberger (1984).
The function e(D; f_w) can be viewed as an error surface defined over the
space of the parameter w, called the parameter space. This surface will have
either one or several minima, depending on how the parameter w appears
through f(u; w). If f(u; w) is linear in w, e(D; f_w) is convex and has only one
minimum that is the global minimum of the error surface. On the other hand,
if f(u; w) is nonlinear in w the error surface may have several local minima
due to the non-convexity of e(D; f_w). One must bear in mind the effects caused
by the presence of local minima in choosing an optimisation procedure or
algorithm.
Neural networks with at least a single hidden layer have been shown to
have the capacity to approximate any arbitrary function in C(R^m) (the space
of continuous functions) if there is a sufficient number of basis functions (or hidden
nodes) (Cybenko, 1989). This property of neural networks is referred to as the
universal approximation property.
This approximation ability of neural networks can also be understood from
the geometric view in the function space. If the neural network consists of N
hidden neurons, then the function to be mapped is represented by a linear
combination of the N basis functions φ_k(u). For the case where these N basis
functions are linearly independent, the set of functions the network can map
spans an N-dimensional subspace of the infinite-dimensional Hilbert space H.
By increasing the number of linearly independent basis functions to infinity
the subspace spanned by the neural network mapping is extended to the entire
Hilbert space H. For the Gaussian RBF network, the linear independence of
the basis functions with different centres holds (Poggio and Girosi, 1990a,b),
which can also be extended for other types of neural networks to show that
these basis functions are linearly independent.
If, having learned to map the examples in the data set D, a neural network
predicts input-output observations that are not in D but are consistent with
the underlying function f(u), then this neural network is said to generalise well.
The generalisation ability of a network depends critically on its functional form
f(u; w) and the data set D.
In order that a network has the capacity to generalise, its functional form
f(u; w) must be able to provide a sufficiently good approximation to the un-
known underlying function f(u). This implies that the capacity of the network
and hence the number of parameters should be large. The universal approxi-
mation property of neural networks seems to suggest that the functional rep-
resentation is not important as long as a sufficiently large network is chosen.
After the functional form is chosen, the network parameters must be esti-
mated from the data set D. If the number of examples contained in this data
set is less than the number of parameters, infinitely many solutions for the pa-
rameters that will fit the data exist. The network will generalise poorly if the
learning algorithm cannot give consistent estimates and may find an estimate
that is not necessarily the closest to the unknown f(u). The generalisation problem
of neural networks can also be understood from a statistical point of view. If
there are an infinite number of functions that can fit the data set D exactly,
the probability that the estimate found will be closer to f(u) will be very low.
With an increasing number of examples this probability is increased and in
turn the generalisation of the network is improved. Thus, the network size
that gives good generalisation depends on the number of examples that are
used to estimate the parameters. It has been shown that an upper bound on
the number of parameters of the network can be derived on the basis of the
size of the data set D (Baum and Haussler, 1989).
It has been observed that choosing large networks is bound to lead to
poor generalisation (Chauvin, 1989), which is referred to as overfitting. Imposing
smoothness constraints is a powerful way of reducing the dimensionality of
the functional representation problem. Good generalisation can be achieved by
choosing large networks with added penalty terms to provide smoother basis
functions (Hinton, 1987; Hanson and Pratt, 1989).
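The idea of adding a penalty term to the least-squares cost can be sketched as follows (an illustrative sketch; the quadratic "weight decay" penalty and the coefficient `lam` are our assumed choices, one common form of the penalties cited above):

```python
def penalised_cost(errors, weights, lam=0.01):
    # Least-squares cost plus an assumed quadratic penalty on the weights,
    # which discourages large weights and so favours smoother mappings
    return sum(e ** 2 for e in errors) + lam * sum(w ** 2 for w in weights)
```

With `lam = 0` this reduces to the plain least-squares cost; increasing `lam` trades fit accuracy for smoothness.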
The weights can be estimated by minimising the least squares cost function
using gradient descent. At the j-th iteration the parameter vector is adjusted
by

w(j) = w(j−1) − η ∇_w J_e(D; f_w)    (1.44)

where η is the step size, and the gradient of the cost function is

∇_w J_e(D; f_w) = ∂J_e(D; f_w)/∂w    (1.45)

                = −2 Σ_{k=1}^{n} e_k ∇_w f(u_k; w(j−1))    (1.46)

where

e_k = y_k − f(u_k; w)    (1.47)

∇_w f(u_k; w) = ∂f(u_k; w)/∂w |_{w=w(j−1)}    (1.48)
The parameter is adapted in the direction of decreasing J_e(D; f_w), where the
direction is averaged over all the samples. The iteration is repeated until the
squared error falls below a required threshold.
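The batch gradient descent iteration can be sketched on a toy model (an illustrative sketch; the model f(u; w) = w[0]·u + w[1]·u², with two fixed basis functions, is our choice for the example, not the book's):

```python
def fit(data, w, eta=0.1, steps=5000):
    # Batch gradient descent minimising J_e = sum_k e_k^2 for the toy
    # model f(u; w) = w[0]*u + w[1]*u**2 (an assumed illustrative model)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for u, y in data:
            e = y - (w[0] * u + w[1] * u * u)   # prediction error e_k
            grad[0] += -2.0 * e * u             # -2 sum_k e_k df/dw[0]
            grad[1] += -2.0 * e * u * u         # -2 sum_k e_k df/dw[1]
        w = [wi - eta * g for wi, g in zip(w, grad)]  # descent step
    return w
```

The gradient is averaged over the whole data set before each update, in contrast to the sample-by-sample LMS update discussed later in the chapter.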
This algorithm could be efficiently implemented in feedforward networks
by back propagating the errors (Chan and Fallside, 1987). Further, it could be
implemented within the highly parallel architecture of neural networks.
The error back propagation learning algorithm has a characteristic feature
of slow rate of convergence. Such behaviour is caused by the shape of the error
surface in the parameter space in which sharp valleys and long plateaux exist.
A scheme for adapting the step size or learning rate is proposed, based on the
angle of the previous gradient direction and the current gradient direction in
the parameter space (Chan and Fallside, 1987).
When the learning problem is viewed as one of minimising a cost function,
the slow rate of the gradient descent procedure in the error back propagation
method becomes fairly obvious. The nonlinear optimisation procedure consid-
ers only the gradient of the current iteration. Methods that are faster but need
more computation have been developed, for example, the method of line search
along the gradient direction, the conjugate gradient descent method which
utilises information about previous descent directions, and the quasi-Newton
descent direction method which utilises the Hessian of the cost function along
with the gradient (Luenberger, 1984). These methods have also been applied
to neural network learning.
Consider the sequence of observations

v(n) = \{(u_n, y_n)\}, \quad f(u_n) = y_n \qquad (1.49)

which is received sequentially, so that at time n the observation \{(u_n, y_n); f(u_n) = y_n\} is received. The neural network or nonlinear model mapping is given by
f(u; w). Let the set of parameter values be w(n-l) before the n-th observation
is received, which is known as the a priori estimate. On learning v(n), let
the parameter values be modified to w(n), known as the a posteriori estimate.
The operation of the recursive learning algorithm is to provide a functional
relationship between the posterior estimate w^{(n)}, the prior estimate w^{(n-1)},
and the n-th observation. In general, it can be described mathematically by

w^{(n)} = \Phi(w^{(n-1)}, v(n)) \qquad (1.50)

with the prediction error for the n-th observation given by

e_n = y_n - f(u_n; w^{(n-1)}) \qquad (1.51)
In sequential learning, the prediction error can be calculated for each ob-
servation as it arrives and hence is a dynamic performance index that can be
used in evaluating different models and algorithms.
A commonly used algorithm for neural networks is the least mean square
(LMS) algorithm (Widrow and Hoff, 1960). It is a special case of the stochastic
approximation algorithm (Robbins and Munro, 1951). For the n-th observation
v(n), the parameter vector is adapted by
w^{(n)} = w^{(n-1)} + \eta \, e_n \nabla_w f(u_n; w)\big|_{w=w^{(n-1)}} \qquad (1.52)
where en is the prediction error and T} the learning rate or the adaptation step
size. The above LMS learning is a recursive version of the stochastic gradient
descent procedure, with the gradient being estimated on the basis of the cur-
rent sample rather than the ensemble of examples as in the block estimation
procedure. It is shown that such a procedure minimises the least squares cost
function defined in equation (1.43) (the block estimation cost function) and,
further, that the LMS algorithm converges slowly to the underlying set of optimal
parameters.
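A minimal sketch of the LMS update (1.52), using a one-parameter scalar model and an illustrative target function; all names are hypothetical:

```python
import numpy as np

def lms_step(w, u, y, f, grad_f, eta=0.1):
    """One LMS update (eq. 1.52): w(n) = w(n-1) + eta * e_n * grad_w f(u_n; w)."""
    e = y - f(u, w)                # prediction error for the current sample only
    return w + eta * e * grad_f(u, w)

# Sequential identification of f(u) = 1.5*u with the model f(u; w) = w*u.
f = lambda u, w: w * u
grad_f = lambda u, w: u            # d f / d w
rng = np.random.default_rng(0)
w = 0.0
for _ in range(500):
    u = rng.uniform(-1.0, 1.0)
    w = lms_step(w, u, 1.5 * u, f, grad_f, eta=0.1)
```

Each step uses only the current sample's gradient estimate, in contrast to the block (batch) procedure, so the parameter drifts towards the optimum over many observations.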
1.6 Applications of Neural Networks

1.6.1 Classification
With the growth of information technology and the availability of cheap com-
puter systems, the rapid expansion of medical knowledge makes the develop-
ment of Computer-Aided Diagnostic (CAD) systems increasingly attractive.
Such systems assist clinicians in improving clinical decision-making, and neural
networks have made notable contributions to them. For example,
RBF networks have been applied to classify various categories of low back
disorders (Bounds et al., 1990), taking in many elements of information
and classifying the different cases of low back disorder. Besides RBF networks,
classification studies have been made with MLP networks, fuzzy logic,
k-nearest neighbours and closest class mean classifiers, and the results have been
compared with clinicians' diagnoses.
Classification and feature extraction of speech signals is the single most
applied and reported application of neural networks (see, for example, Renals,
1989; Bengio, 1992). Primarily, neural networks are used to classify spoken
vowels based on speech spectrograms.
1.6.2 Filtering
Neural networks have drawn considerable attention from the signal processing
community (Casdagli, 1989; Chen et al., 1990; LeCun et al., 1990). Remarkable
claims have been made concerning the superior performance of neural networks
over traditional methods in signal processing. One of the major areas of signal
processing application is filtering.
The filtering property of neural networks employing Gaussian radial basis
functions has been discussed and reported by researchers (Tattersall et al.,
1991) and applied in filtering chaotic data (Holzfuss and Kadtke, 1993). Gaussian RBFs are a particularly good choice for this purpose because of the local
property of this network, which enables wild oscillations to be damped out.
Actually, the RBF method for multivariate approximation schemes is devel-
oped by imposing a smoothness constraint on the approximation function. This
smoothness constraint can be synthesised in the frequency domain by the use of
the generalised Fourier transform. Analysis and application of the generalised
inverse Fourier transform lead to a smooth approximating scheme. Moreover, it
has been shown that the neural network approach is a very promising method
for smoothing scattered data (Barnhill, 1983).
Neural networks as filters have been used in digital communications such
as channel equalisation and overcoming cochannel interference. Significant ro-
bustness and good filtering properties of neural networks for systems with high
signal to noise ratios have been reported (Holzfuss and Kadtke, 1993).
1.6.3 Prediction

Since neural networks are used for nonlinear prediction of chaotic time series
(Casdagli, 1989), there has been a growing interest in using neural networks
for various prediction tasks (Leung and Haykin, 1991). Many prediction tasks
include various nonlinear time series, such as annual sunspots, Canadian lynx
data, ice ages, measles; chaotic data include Ikeda map, Lorenz equations,
Mackey-Glass delay differential equation, Henon map, logistic map, Duffing
oscillators, radar backscatter, fluid turbulence flow, electrochemical systems
(electrodissolution of copper in phosphoric acid) and many others. Neural
networks have become popular for prediction of a variety of different time
series, for example, chaotic time series (Platt, 1991), speech waveforms (Fall-
side, 1989) and economic data (Weigend et al., 1991).

1.6.4 Control
Neural networks have also received widespread attention and have been ap-
plied to the control of dynamical systems. They are employed to adaptively
compensate for plant nonlinearities (Sanner and Slotine, 1992; Feng, 1994; Liu,
2001). Under mild assumptions about the degree of smoothness exhibited by
the nonlinear functions, it has been shown that the nonlinear optimal neural
control is globally stable with tracking errors converging to a neighbourhood
of zero. A variant of neural networks (with Gaussian RBFs) is used to opti-
mise and control a repetitively pulsed, small-angle negative ion source which is
designed to produce a high-current, low-emittance beam of negative hydrogen
ions for injection into various accelerators used in nuclear physics (Mead et al.,
1992). Neural networks have shown, amongst other things, the versatility of
nonlinear adaptive basis functions, simple and rapid training algorithms, and
a variety of optional capabilities that could be incorporated, such as Kalman
noise filtering. Neural networks have been used to design more powerful
feedback-feedforward controllers for robotic applications (Parisini and Zoppoli,
1993). Apart from showing the desirable properties the neural network could
achieve, it is highlighted how much computational load is involved, particu-
larly how the computation increases rapidly with respect to the dimension of
the problem.
There are many other interesting applications of neural networks in
the control of dynamical and industrial systems. Space, however, precludes
mention of these, but details can be found in the following: biomedical control (Nie
and Linkens, 1993), chemical and industrial processes (Roscheisen et al., 1992;
Liu and Daley, 1999a,c, 2001), servomechanisms (Lee and Tan, 1993).
1.7 Mathematical Preliminaries
In a normed linear space, the distance between the functions f(x) and f^*(x)
is given the shorthand description \| f - f^* \|, and is the norm of the difference
between the two functions, which is a suitable distance function. Since the
difference f - f^* is the error function, this measure is the approximation error.
The commonly used norms are the 1-, 2- and ∞-norms. The L_1-norm has
the property that the magnitude of error in the case of discrete data makes
no difference to the final approximation (Powell, 1981). The L_∞-norm, also
known as the Chebyshev norm, is much used in approximation theory. The
norm can also be expressed as

\| f \|_\infty = \sup_x | f(x) |

which gives the maximum value of f(x). The ∞-norm of the difference would
then give the maximum difference between the two functions at any point x,
which is also the maximum error of approximation.
The L 2 -norm or the Euclidean norm occurs naturally in theoretical studies
of Hilbert spaces (Powell, 1981). The practical reasons for considering the L 2 -
norm are even stronger. From a statistical point of view, if the errors in the
data have a normal distribution, the most appropriate choice of data fitting is
the L 2 -norm. Further, highly efficient algorithms can be developed to find the
best approximation. The L_2-norm is given by

\| f \|_2 := \left( \int_{x \in R^n} | f(x) |^2 \, dx \right)^{1/2} \qquad (1.56)
The L 2 -norm defines the L2-space of functions, the square integrable real func-
tions. Since an inner product can be defined in this space, it is also the Hilbert
space of square integrable real functions, denoted by H (Linz, 1979). All continuous functions in C(R^n), and therefore f and f^*, belong to this Hilbert
space.
Typically, for a function to be admitted into H, its L_2-norm must be finite.
There exist continuous functions in C(R^n) with infinite L_2-norm. However,
over the input space D ⊂ R^n the norms of these functions are finite.
Since the input space is always bounded, all continuous functions can be admitted
into H (see also Linz, 1979).
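The three norms can be illustrated with simple discrete (Riemann-sum) approximations; the grid and the test function below are illustrative:

```python
import numpy as np

def norms(f, xs):
    """Discrete approximations of the L1-, L2- and infinity-norms of f on grid xs."""
    dx = xs[1] - xs[0]
    v = np.abs(f(xs))
    return {
        'L1': float(np.sum(v) * dx),                  # integral of |f|
        'L2': float(np.sqrt(np.sum(v ** 2) * dx)),    # (integral of |f|^2)^(1/2)
        'Linf': float(np.max(v)),                     # maximum of |f|
    }

xs = np.linspace(0.0, 1.0, 100001)
n = norms(lambda x: x, xs)   # for f(x) = x on [0,1]: L1 = 1/2, L2 = 1/sqrt(3), Linf = 1
```

The ∞-norm of a difference f − f^* computed this way is exactly the maximum pointwise approximation error on the grid.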
N and R denote the set of integers and real numbers, respectively. L_2(R)
denotes the vector space of measurable, square-integrable one-dimensional
functions f(x). For f, g ∈ L_2(R), the inner product and norm for the space
L_2(R) are written as

\langle f, g \rangle = \int_{-\infty}^{+\infty} f(x) \overline{g(x)} \, dx, \qquad \| f \| = \langle f, f \rangle^{1/2}

where \overline{g(\cdot)} is the conjugate of the function g(\cdot). L_2(R^n) is the vector space
of measurable, square-integrable n-dimensional functions f(x_1, x_2, ..., x_n). For
f, g ∈ L_2(R^n), the inner product of f(x_1, x_2, ..., x_n) with g(x_1, x_2, ..., x_n) is
written as

\langle f, g \rangle = \int_{R^n} f(x_1, ..., x_n) \overline{g(x_1, ..., x_n)} \, dx_1 \cdots dx_n
1.8 Summary
2. Sequential Nonlinear Identification

2.1 Introduction
The identification of nonlinear systems using neural networks has become a
widely studied research area in recent years. System identification mainly con-
sists of two steps: the first is to choose an appropriate identification model and
the second is to adjust the parameters of the model according to some adap-
tive laws so that the response of the model to an input signal can approximate
the response of the real system to the same input. Since neural networks have
good approximation capabilities and inherent adaptivity features, they pro-
vide a powerful tool for identification of systems with unknown nonlinearities
(Antsaklis, 1990; Miller et al., 1990).
The application of neural network architectures to nonlinear system iden-
tification has been demonstrated by several studies in discrete time (see, for
example, Chen et al., 1990; Narendra and Parthasarathy, 1990; Billings and
Chen, 1992; Qin et al., 1992; Willis et al., 1992; Kuschewski et al., 1993; Liu
and Kadirkamanathan, 1995) and in continuous time (Polycarpou and Ioan-
nou, 1991; Sanner and Slotine, 1992; Sadegh 1993). For the most part, much
of the studies in discrete-time systems are based on first replacing unknown
functions in the difference equation by static neural networks and then de-
riving update laws using optimisation methods (e.g., gradient descent/ascent
methods) for a cost function (quadratic in general), which has led to var-
ious back-propagation-type algorithms (Williams and Zipser, 1989; Werbos,
1990; Narendra and Parthasarathy, 1991). Though such schemes perform well
in many cases, in general, some problems arise, such as the stability of the
overall identification scheme and convergence of the output error. Alternative
approaches based on the model reference adaptive control scheme (Narendra
and Annaswamy, 1989; Slotine and Li, 1991) have been developed (Polycar-
pou and Ioannou, 1991; Sanner and Slotine, 1992; Sadegh, 1993), where the
stability of the overall scheme is taken into consideration.
Most of the neural network based identification schemes view the problem
as deriving model parameter adaptive laws, having chosen a structure for the
neural network. However, choosing structure details such as the number of
basis functions (hidden units in a single hidden layer) in the model must be
done a priori. This can often lead to an over-determined or under-determined
network structure which in turn leads to an identification model that is not
optimal. In discrete-time formulation, some approaches have been developed
in determining the number of hidden units (or basis functions) using decision
theory (Baum and Haussler, 1989) and model comparison methods such as
minimum description length (Smyth, 1991) and Bayesian methods (MacKay,
1992). The problem with these methods is that they require all observations
to be available together and hence are not suitable for on-line or sequential
identification tasks.
Yet another line of approach, developed for discrete-time systems, is to begin
with a larger network and prune it, as in Mozer and Smolensky (1989), or to begin
with a smaller network and grow it, as in Fahlman and Lebiere (1990) and Platt
(1991), until the optimal network complexity is found. Amongst these dynamic
structure models, the resource allocating network (RAN) developed by Platt
(1991) is an on-line or sequential identification algorithm. The RAN is essen-
tially a growing Gaussian radial basis function (GRBF) network whose growth
criteria and parameter adaptation laws have been studied (Kadirkamanathan,
1991) and applied to time-series analysis (Kadirkamanathan and Niranjan,
1993) and pattern classification (Kadirkamanathan and Niranjan, 1992). The
RAN and its extensions addressed the identification of only autoregressive sys-
tems with no external inputs and hence stability was not an issue. Recently,
the growing GRBF neural network has been applied to sequential identifi-
cation and adaptive control of dynamical continuous nonlinear systems with
external inputs (Liu et al., 1995; Fabri and Kadirkamanathan, 1996). Though
the growing neural network is much better than the fixed neural network in
reducing the number of basis functions, it is still possible that this network
will induce an overfitting problem. There are two main reasons for this: first,
it is difficult to know how many basis functions are really needed for the
problem and, second, the nonlinearity of the function to be modelled differs
as its variables change their value ranges. Normally, the number of
basis functions in the growing neural network may increase to the number that the
system needs to deal with the most complicated
nonlinearity (the worst case) of the nonlinear function. Thus, it may lead to a
network which has the same size as a fixed neural network.
To overcome the above limitations, a new network structure, referred to
as the variable neural network, was proposed by Liu et al. (1996b). The basic
principle of the variable neural network is that the number of basis functions
in the network can be either increased or decreased over time according to a
design strategy in an attempt to avoid overfitting or underfitting. In order to
model unknown nonlinearities, the variable neural network starts with a small
number of initial hidden units and then adds or removes units located in a
variable grid. This grid consists of a number of subgrids composed of different
sized hypercubes which depend on the novelty of the observation.
This chapter introduces variable neural networks and considers a sequential
identification scheme for continuous nonlinear dynamical systems using neu-
ral networks. The nonlinearities of the dynamical systems are assumed to be
unknown. The identification model is a Gaussian radial basis function neural
network that grows gradually to span the appropriate state-space and is of
sufficient complexity.

2.2 Variable Neural Networks
Two main neural network structures which are widely used in on-line iden-
tification and control are the fixed neural network and the growing neural
network. The fixed neural network usually needs a large number of basis func-
tions in most cases even for a simple problem. Though the growing network
is much better than the fixed network in reducing the number of basis func-
tions for many modelling problems, it is still possible that this network will
lead to an overfitting problem for some cases and this is explained in Section
2.1. To overcome the above limitations of fixed and growing neural networks,
a new network structure, called the variable neural network, is considered in
this section.
Due to some desirable features such as local adjustment of the weights and
mathematical tractability, radial basis functions were introduced to the neural
network literature by Broomhead and Lowe (1988) and have gained significance
in the field. Their importance has also greatly benefited from the work of
Moody and Darken (1989) and Poggio and Girosi (1990a,b), who explore the
relationship between regularisation theory and radial basis function networks.
One of the commonly used radial basis function networks is the Gaussian radial
basis function (GRBF) neural network, also called the localised receptive field
network, which is described by
f(x; p) = \sum_{k=1}^{n} w_k \varphi_k(x; c_k, d_k) \qquad (2.1)

where w_k is the weight, p = \{w_k, c_k, d_k\} is the parameter set and \varphi_k(x; c_k, d_k)
is the Gaussian radial basis function

\varphi_k(x; c_k, d_k) = \exp\left( - \frac{\| x - c_k \|^2}{d_k^2} \right) \qquad (2.2)

where c_k is the centre and d_k is the width of the basis function. The good
approximation properties of the Gaussian radial basis functions in interpolation
have been well studied by Powell and his group (Powell, 1987). Thus, the
discussion on variable neural networks is based on the GRBF networks.
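A minimal sketch of the GRBF network output of equations (2.1) and (2.2), assuming the isotropic form with centre c_k and width d_k; the function name and the tiny two-unit example are illustrative:

```python
import numpy as np

def grbf_output(x, w, centres, widths):
    """Evaluate a GRBF network: f(x) = sum_k w_k * exp(-||x - c_k||^2 / d_k^2)."""
    x = np.asarray(x, dtype=float)
    phi = np.array([np.exp(-np.sum((x - c) ** 2) / d ** 2)
                    for c, d in zip(centres, widths)])
    return float(w @ phi)

# Two Gaussian units on a one-dimensional input.
centres = [np.array([0.0]), np.array([1.0])]
widths = [0.5, 0.5]
w = np.array([1.0, -1.0])
y = grbf_output([0.0], w, centres, widths)   # 1*exp(0) + (-1)*exp(-4)
```

The locality is visible directly: at x = 0 the first unit contributes fully while the second, centred one width-scaled distance away, contributes only exp(-4).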
In GRBF networks, one very important parameter is the location of the centres
of the Gaussian radial basis functions over the compact set X, which is the
approximation region. Usually, an n-dimensional grid is used to locate all the centres
at the gridnodes (Sanner and Slotine, 1992). Thus, the distance between the
gridnodes affects the size of the network and also the approximation accuracy.
In other words, a large distance leads to a small network and a coarser
approximation, while a small distance results in a large network and
a finer approximation. However, even if the required accuracy is given, it is
very difficult to know how small the distance should be since the underlying
function is unknown. Also, the nonlinearity of the system is not uniformly
complex over the set X. So, here a variable grid is introduced for location of
the centres of all GRBFs in the network.
The variable grid consists of a number of different subgrids. Each subgrid is
composed of equally sized n-dimensional hypercubes. The number of subgrids
can increase or decrease with time according to a design strategy. All the
subgrids are ordered: the initial grid is the 1st-order subgrid, then comes the
2nd-order subgrid and so on. In each subgrid, there
are a different number of nodes, which are denoted by their positions. Let Mi
denote the set of nodes in the i-th order subgrid. Thus, the set of all nodes in
the grid with m subgrids is
M = \bigcup_{i=1}^{m} M_i \qquad (2.3)
To increase the density of the gridnodes, the edge lengths of the hypercubes
of the i-th order subgrid will always be less than those of the (i - l)-th order
subgrid. Hence the higher order subgrids have more nodes than the lower order
ones. On the other hand, to reduce the density of the gridnodes, subgrids are
removed from the grid until the required density is reached.
Let all elements of the set M represent the possible centres of the network.
So, the more the subgrids, the more the possible centres. Since the higher
order subgrids probably have some nodes which are the same as the lower
order subgrids, the set of the new possible centres provided by the i-th order
subgrid is defined as

P_i = M_i - \bigcup_{j=0}^{i-1} P_j \qquad (2.4)

where P_0 is an empty set. This shows that the possible centre set P_i corresponding
to the i-th subgrid does not include those which are given by the
lower order subgrids, i.e.

P_i \cap P_j = \emptyset, \quad \text{for } i \neq j \qquad (2.5)
For example, in the two-dimensional case, let the edge length of rectangles on
the i-th subgrid be half of the (i - l)-th subgrid. The variable grid with three
subgrids is shown in Figure 2.1.
Fig. 2.1. The variable grid with three subgrids (1st, 2nd and 3rd subgrids)
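The node sets M_i and new-centre sets P_i of equations (2.3)-(2.5) can be sketched for the two-dimensional example above; the unit square, the halved edge lengths and the use of exact rational arithmetic are illustrative choices:

```python
from fractions import Fraction

def subgrid_nodes(order, delta1=Fraction(1), extent=Fraction(1)):
    """Nodes M_i of the i-th order subgrid on [0, extent]^2; edge length halves per order."""
    delta = delta1 / 2 ** (order - 1)
    steps = int(extent / delta)
    return {(delta * i, delta * j)
            for i in range(steps + 1) for j in range(steps + 1)}

def new_centres(order, **kw):
    """P_i: nodes of the i-th subgrid not already provided by lower-order subgrids."""
    lower = (set().union(*(subgrid_nodes(k, **kw) for k in range(1, order)))
             if order > 1 else set())
    return subgrid_nodes(order, **kw) - lower

M1 = subgrid_nodes(1)   # 1st-order subgrid: 2x2 corner nodes of the unit square
P2 = new_centres(2)     # nodes newly added by the 2nd-order subgrid (edge length 1/2)
```

Exact fractions avoid the floating-point equality pitfalls that would otherwise make the set difference in P_i unreliable; the 2nd-order subgrid has 9 nodes, of which 4 coincide with the 1st-order nodes, leaving 5 new centres.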
The variable neural network has the property that the number of basis func-
tions in the network can be either increased or decreased over time according
to a design strategy. For the problem of nonlinear modelling with neural net-
works, the variable network is initialised with a small number of basis function
units. As observations are received, the network grows by adding new basis
functions or is pruned by removing old ones. The adding and removing oper-
ations of a variable neural network are illustrated by Figure 2.2.
To add new basis functions to the network the following two conditions
must be satisfied: (a) The modelling error must be greater than the required
accuracy. (b) The period between the two adding operations must be greater
than the minimum response time of the adding operation.
To remove some old basis functions from the network, the following two
conditions must be satisfied: (a) The modelling error must be less than the
required accuracy. (b) The period between the two removing operations must
be greater than the minimum response time of the removing operation.
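The two pairs of add/remove conditions above can be sketched as a small decision gate; the class and parameter names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class GrowPruneGate:
    """Decide when the variable network may add or remove basis functions.

    Both operations require (a) an error condition relative to the required
    accuracy and (b) a minimum time since the last operation of the same kind.
    """
    required_accuracy: float
    min_response_time: float
    last_add: float = float('-inf')
    last_remove: float = float('-inf')

    def may_add(self, error, t):
        ok = (error > self.required_accuracy
              and t - self.last_add > self.min_response_time)
        if ok:
            self.last_add = t
        return ok

    def may_remove(self, error, t):
        ok = (error < self.required_accuracy
              and t - self.last_remove > self.min_response_time)
        if ok:
            self.last_remove = t
        return ok

gate = GrowPruneGate(required_accuracy=0.1, min_response_time=1.0)
a1 = gate.may_add(error=0.5, t=0.0)   # large error: adding is allowed
a2 = gate.may_add(error=0.5, t=0.5)   # too soon after the last add: refused
```

The minimum response time prevents the network structure from oscillating, since the effect of one adding (or removing) operation must settle before the next is considered.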
It is known that if the grid consists of same-sized n-dimensional hypercubes
with edge length vector \rho = [\rho_1, \rho_2, ..., \rho_n], then the accuracy of approximating
a function is in direct proportion to the norm of the edge length vector of the
hypercubes:

\varepsilon_K \propto \| \rho \| \qquad (2.6)

Fig. 2.2. The adding and removing operations of a variable neural network
Therefore, based on the variable grid, the structure of a variable neural net-
work may be stated as follows. The network selects the centres from
the node set M of the variable grid. When the network needs some new basis
functions, a new higher order subgrid (say, the (m+1)-th subgrid) is appended
to the grid. The network chooses the new centres from the possible centre set
P_{m+1} provided by the newly created subgrid. Similarly, if the network needs to
be reduced, the highest order subgrid (say, the m-th subgrid) is deleted from the
grid. Meanwhile, the network removes the centres associated with the deleted
subgrid. Since the novelty of the observation is tested, it is ideally suited to
on-line control problems. The objective behind the development is to gradually
approach the appropriate network complexity that is sufficient to provide an
approximation to the system nonlinearities and consistent with the observa-
tions being received. By allocating GRBF units on a variable grid, only the
relevant state-space traversed by the dynamical system is spanned, resulting
in considerable savings on the size of the network. How to locate the centres
and determine the widths of the GRBFs is discussed in the next section.
It is known that the Gaussian radial basis function has a localisation property
such that the influence area of the k-th basis function is governed by the centre
c_k and width d_k. In other words, once the centre c_k and the width d_k are fixed,
the influence area of the Gaussian radial basis function \varphi(x; c_k, d_k) is limited
in state-space to the neighbourhood of c_k.
On the basis of the possible centre set M produced by the variable grid,
there is a large number of basis function candidates, denoted by the set B.
During system operation, the state vector x will gradually scan a subset of the
state-space set X. Since the basis functions in the GRBF network have a
localised receptive field, if the neighbourhood of a basis function \varphi(x; c_k, d_k) \in B
is located 'far away' from the current state x, its influence on the approximation
is very small and can be ignored by the network. On the other hand, if
the neighbourhood of a basis function \varphi(x; c_k, d_k) \in B is near to or covers the
current state x, it will play a very important role in the approximation. Thus
it should be kept if it is already in the network, or added to the network if it
is not.
Given any point x, the nearest node x^i = [x_1^i, x_2^i, ..., x_n^i]^T to it in the
i-th subgrid can be calculated by

x_j^i = \text{round}(x_j / \delta_{ij}) \, \delta_{ij} \qquad (2.7)

for j = 1, 2, ..., n, where round(·) is an operator for rounding the number (·) to
the nearest integer, for example, round(2.51) = 3, and \delta_{ij} is the edge length
of the hypercube corresponding to the j-th element of the vector x in the i-th
subgrid. Without loss of generality, let \delta_i = \delta_{i1} = \delta_{i2} = ... = \delta_{in}.
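Equation (2.7) amounts to componentwise rounding onto the subgrid; a sketch (note that Python's built-in round uses banker's rounding at exact halves, which may differ from the book's rounding convention):

```python
def nearest_node(x, delta):
    """Nearest node x^i to point x in a subgrid of edge length delta (eq. 2.7):
    each component is x_j^i = round(x_j / delta) * delta."""
    return [round(xj / delta) * delta for xj in x]

node = nearest_node([0.26, 0.74], 0.5)   # grid nodes lie at multiples of 0.5
```

Here 0.26/0.5 = 0.52 and 0.74/0.5 = 1.48 both round to 1, so both components snap to 0.5.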
Define m hyperspheres corresponding to the m subgrids, respectively,

H_i(x^i, \sigma_i) = \{ x : \| x - x^i \| \le \sigma_i \} \qquad (2.8)
for i = 1, 2, ..., m, where \sigma_i is the radius of the i-th hypersphere. In order to
get a suitably sized variable network, choose the centres of the basis functions
from the nodes contained in the different hyperspheres H_i(x^i, \sigma_i), which are
centred at the nearest nodes x^i to x in the different subgrids with radius
\sigma_i, for i = 1, 2, ..., m. For the sake of simplicity, it is assumed that the basis
function candidates whose centres are in the set P_i have the same width d_i
and d_i < d_{i-1}. Thus, for the higher order subgrids, use the smaller radius, i.e.

\sigma_i < \sigma_{i-1} \qquad (2.9)

Usually, choose

\sigma_i = \mu \sigma_{i-1} \qquad (2.10)

where \mu is a constant less than 1. Thus, the chosen centres from the set
P_i are given by the set

C_i = P_i \cap H_i(x^i, \sigma_i) \qquad (2.11)

In order that the basis function candidates in the set P_i whose activation at
the nearest grid node x^i in the i-th subgrid is less than an activation threshold
lie outside the set H_i(x^i, \sigma_i), it can be deduced from (2.2) and (2.8) that \sigma_i
must be chosen to be

\sigma_i = d_i \sqrt{ \ln(1/\varepsilon_a) } \qquad (2.12)

where \varepsilon_a denotes the activation threshold.
For example, in the two-dimensional case, the radii are chosen to be the same as
the edge lengths of the squares in the subgrids, that is, \sigma_i = \delta_i. The chosen
centres in the variable grid with four subgrids are shown in Figure 2.3.
Now, consider how to choose the width d_k of the k-th basis function. The
angle between the two GRBFs \varphi(x; c_i, d_i) and \varphi(x; c_j, d_j) is defined as

\theta_{ij} = \cos^{-1} \frac{\langle \varphi(x; c_i, d_i), \varphi(x; c_j, d_j) \rangle}{\| \varphi(x; c_i, d_i) \| \, \| \varphi(x; c_j, d_j) \|}

where \langle \cdot, \cdot \rangle is the inner product in the space of square-integrable functions,
which is defined as

\langle f, g \rangle = \int_{R^n} f(x) g(x) \, dx \qquad (2.16)

It then follows that

\cos(\theta_{ij}) = \left( \frac{2\sqrt{\xi}}{\xi + 1} \right)^{n/2} \varphi(c_j; c_i, d_i)^{\xi/(\xi+1)} \qquad (2.17)

where \xi = d_i^2 / d_j^2. The above shows that \cos(\theta_{ij}) depends on three factors: the
dimension n, the width ratio \xi and the output of a basis function at the centre
of the other basis function, \varphi(c_j; c_i, d_i).
Fig. 2.3. Location of centres in the variable grid with four subgrids (the number i,
for i = 1, 2, 3, 4, denotes the centres chosen from the i-th subgrid)
If the centres of the two basis functions are chosen from the same subgrid, i.e.
\xi = 1, it is clear from (2.17) that

\cos(\theta_{ij}) = \varphi(c_j; c_i, d_i)^{1/2} \qquad (2.18)

On the other hand, if the centres of the two basis functions are from different
subgrids, it is possible that their centres are very close. The worst case
will be when \varphi(c_j; c_i, d_i) is near to 1. In this case, the angle between the two
basis functions can be written as

\cos(\theta_{ij}) \le \left( \frac{2\sqrt{\xi}}{\xi + 1} \right)^{n/2} \qquad (2.19)
Given the centre c_k, in order to assign a new basis function \varphi(x; c_k, d_k) that is
nearly orthogonal to all existing basis functions, the angle between the GRBFs
should be as large as possible. The width d_k should therefore be reduced.
However, reducing d_k increases the curvature of \varphi(x; c_k, d_k), which in turn gives
a less smooth function and can lead to overfitting problems. Thus, to make
a trade-off between the orthogonality and the smoothness, it can be deduced
from (2.18) and (2.19) that the width d_k, which ensures the angles between
GRBF units are not less than the required minimum angle \theta_{min}, should satisfy
(2.20)
or
(2.21)
and
(2.22)
For example, assume that \xi_0 satisfies (2.20). If the width of the basis functions
whose centres are located in the set C_i, which corresponds to the i-th subgrid
with \delta_i = \xi_0 \delta_{i-1}, is chosen to be d_i = \xi_0 d_{i-1}, and the width d_1 of the basis
functions associated with the initial grid satisfies
(2.23)
then the smallest angle between all basis functions is not less than the required
minimum angle \theta_{min}.
For the sake of simplicity, we first discuss the modelling of single-input single-
state (SISS) continuous nonlinear dynamical systems. The multi-input multi-
state (MIMS) case will be detailed in Section 2.6. Consider the class of
continuous-time dynamical systems with an input-state representation given
by
\dot{x} = f(x, u), \quad x(0) = x_0 \qquad (2.24)
where f(x, u) is an unknown nonlinear function that must be estimated, u ∈
R^1 is the input, and x ∈ R^1 is the state. Assume that the nonlinear system is
stable.
By subtracting and adding \alpha x, where \alpha is some positive constant, the system
(2.24) becomes

\dot{x} = -\alpha x + g(x, u) \qquad (2.25)

where

g(x, u) = f(x, u) + \alpha x \qquad (2.26)
is still a nonlinear function. Since neural networks provide an input-output
mapping, we construct a model based on equation (2.25) by replacing the
nonlinear part g(x, u) with a neural network. Consider the model (Landau, 1979)

\dot{\hat{x}} = -\alpha \hat{x} + \hat{g}(x, u; p) \qquad (2.27)

where \hat{g} is the output of the neural network, \hat{x} denotes the state of the
identification model, and p denotes the adjustable parameters of the network.
The nonlinear function g(x, u) is approximated by the GRBF network,
which is expressed by

\hat{g}(x, u; p) = \sum_{k=1}^{K} w_k \varphi_k(x, u; c_k, d_k) \qquad (2.28)

To keep the network inputs within a bounded region, the variables x and u are
transformed by the one-to-one mappings

\bar{x} = \frac{b_x x}{|x| + a_x} \qquad (2.29)

\bar{u} = \frac{b_u u}{|u| + a_u} \qquad (2.30)
where a_x, b_x, a_u, b_u are positive constants, which can be chosen by the designer
(e.g., a_x, b_x, a_u, b_u all equal to 1). It is clear from (2.29) and (2.30) that \bar{x} ∈ [-b_x, b_x]
and \bar{u} ∈ [-b_u, b_u] for x, u ∈ (-∞, +∞). On the other hand, if x and u are
already bounded, we need only to set \bar{x} = x and \bar{u} = u. Thus

\bar{x} = \begin{cases} \dfrac{b_x x}{|x| + a_x} & \text{if } x \notin [-b_x, b_x] \\ x & \text{if } x \in [-b_x, b_x] \end{cases} \qquad (2.31)

\bar{u} = \begin{cases} \dfrac{b_u u}{|u| + a_u} & \text{if } u \notin [-b_u, b_u] \\ u & \text{if } u \in [-b_u, b_u] \end{cases} \qquad (2.32)
The above one-to-one mapping is illustrated in Figure 2.4, which shows that
in two-dimensional space the entire plane can be transformed into a rectangular
region.
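A sketch of the piecewise mappings (2.31) and (2.32), with the illustrative choice a = b = 1:

```python
def squash(v, a=1.0, b=1.0):
    """One-to-one mapping of eqs (2.31)-(2.32): values already inside [-b, b]
    pass through unchanged; values outside are compressed into (-b, b)."""
    if -b <= v <= b:
        return v
    return b * v / (abs(v) + a)

x_inside = squash(0.3)    # already in [-1, 1]: unchanged
x_far = squash(1e6)       # very large inputs approach +b from below
```

Whatever the magnitude of the raw state or input, the transformed value stays in a bounded interval, which is what lets the GRBF centres be placed on a finite grid.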
Replacing x and u by \bar{x} and \bar{u} in equation (2.28), the nonlinear function \hat{g}
of the system model described by the GRBF network can be written as

\hat{g}(x, u; p) = \sum_{k=1}^{K} w_k \varphi_k(\bar{x}, \bar{u}; m_k, \tau_k) \qquad (2.33)

where

\varphi_k(\bar{x}, \bar{u}; m_k, \tau_k) = \exp\left( - \frac{\| [\bar{x}, \bar{u}]^T - m_k \|^2}{\tau_k^2} \right) \qquad (2.34)

in which m_k is the centre and \tau_k is the width of the k-th basis function.
The problem then becomes that of estimating the function g based on the
variables \bar{x} and \bar{u}, which lie in bounded sets. A schematic diagram of
the identification framework is shown in Figure 2.5. Since \bar{x} ∈ [-b_x, b_x] and
\bar{u} ∈ [-b_u, b_u] are bounded, the modelling error bound \varepsilon_K is finite.
From Equation 2.27 the identification model can also be described by

\dot{\hat{x}} = -\alpha \hat{x} + \sum_{k=1}^{K} w_k \varphi_k(\bar{x}, \bar{u}; m_k, \tau_k) \qquad (2.39)

where w_k denotes the estimate of the optimal weight w_k^*; the state error and
the parameter errors are defined as

e_x = x - \hat{x}, \qquad \xi_k = w_k^* - w_k \qquad (2.40)

Hence, subtracting (2.35) from (2.38) gives the following dynamical expression
of the state error:

\dot{e}_x = -\alpha e_x + \sum_{k=1}^{K} \xi_k \varphi_k(\bar{x}, \bar{u}; m_k, \tau_k) + \varepsilon(t) \qquad (2.41)

where \varepsilon(t) denotes the modelling error. Consider the Lyapunov function
V(e_x, z) = \frac{1}{2} \left( e_x^2 + \frac{1}{a} \sum_{k=1}^{K} \xi_k^2 \right) \qquad (2.42)

where z = [\xi_1, ..., \xi_K]^T and a is a positive constant which will appear in the
sequential adaptation laws, also referred to as the learning or adaptation step
size. Using (2.42), the time derivative of the Lyapunov function V is given by
\dot{V}(e_x, z) = -\alpha e_x^2 + \sum_{k=1}^{K} e_x \xi_k \varphi_k(\bar{x}, \bar{u}; m_k, \tau_k) + \frac{1}{a} \sum_{k=1}^{K} \xi_k \dot{\xi}_k + e_x \varepsilon(t)

= -\alpha e_x^2 + \frac{1}{a} \sum_{k=1}^{K} \left( a e_x \xi_k \varphi_k(\bar{x}, \bar{u}; m_k, \tau_k) + \xi_k \dot{\xi}_k \right) + e_x \varepsilon(t) \qquad (2.43)

If the weights are adapted by the law

\dot{w}_k = a e_x \varphi_k(\bar{x}, \bar{u}; m_k, \tau_k), \quad k = 1, ..., K \qquad (2.44)

then \dot{\xi}_k = -\dot{w}_k and the bracketed sum in (2.43) vanishes, so that, with
|\varepsilon(t)| \le \varepsilon_K,

\dot{V}(e_x, z) = -\alpha e_x^2 + e_x \varepsilon(t) \le -\alpha e_x^2 + |e_x| \varepsilon_K = -\alpha |e_x| \left( |e_x| - \varepsilon_K / \alpha \right) \qquad (2.45)
It can be seen from the modified weight adjustment laws above that if
lex I ~ eo ~ EK la, the first derivative of the Lyapunov function with respect to
time t is always negative semidefinite. Although in the case where eo :::; lex I :::;
EK I a, the weights may increase with time because it is possible that V > 0,
it is clear from the estimation law (2.46) that the weights are still limited
by the bound VKM. If lexl > e max (the maximum tolerable accuracy) and
2.5 Sequential Nonlinear Identification 41
Ilwll = VKM, this means that more GRBF units are needed to approximate
the nonlinear function g. Therefore, the overall identification scheme is still
stable in the presence of modelling error. The Lyapunov function V depends
also on the parameter error and the negative semi-definiteness then implies
convergence of the algorithm.

2.5 Sequential Nonlinear Identification

The control of real-time systems with unknown structure and parameter information can be based on carrying out on-line or sequential identification
using nonparametric techniques such as neural networks. The sequential iden-
tification problem for continuous dynamic systems may be stated as follows:
given the required modelling error, the prior identification model structure
and the on-line or sequential continuous observation, how are these combined
to obtain the model parameter adaptive laws or the required neural network
approximation?
Here, a sequential identification scheme is considered for continuous-time
nonlinear dynamical systems with unknown nonlinearities using growing Gaus-
sian radial basis function networks. The growing GRBF network, which is
actually a type of variable neural network, starts with no hidden units and
grows by allocating units on a regular grid, based on the novelty of obser-
vation. Since the novelty of the observation is tested, it is ideally suited to
on-line identification problems. The parameters of the growing neural network
based identification model are adjusted by adaptation laws developed using
the Lyapunov synthesis approach.
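In miniature, the scheme just described can be sketched as follows. This is a toy sketch, not the book's algorithm in full: it assumes a scalar plant ẋ = f(x, u), a fixed (non-growing) grid of Gaussian units, Euler integration, and a gradient-type adaptation law ẇ_k = α e_x φ_k of the kind produced by Lyapunov synthesis; the plant, grid size and constants are all illustrative.

```python
import numpy as np

def gaussian_rbf(z, centres, widths):
    # phi_k(z) = exp(-||z - m_k||^2 / r_k^2), evaluated for all units at once
    return np.exp(-np.sum((z - centres) ** 2, axis=-1) / widths ** 2)

def sequential_identify(f, T=10.0, dt=1e-3, a=1.0, alpha=5.0):
    # Fixed 5x5 grid of centres over the (x, u) square [-1, 1]^2
    grid = np.linspace(-1.0, 1.0, 5)
    centres = np.array([(cx, cu) for cx in grid for cu in grid])
    widths = np.full(len(centres), 0.5)
    w = np.zeros(len(centres))              # network weights w_k

    x, x_hat, errors = 0.0, 0.0, []
    for step in range(int(T / dt)):
        u = np.cos(step * dt)
        phi = gaussian_rbf(np.array([x, u]), centres, widths)
        x_hat += dt * (-a * x_hat + w @ phi)   # model: x_hat' = -a x_hat + sum_k w_k phi_k
        x += dt * f(x, u)                      # plant, same Euler scheme
        e = x - x_hat                          # state error e_x
        w += dt * alpha * e * phi              # adaptation law: w_k' = alpha e_x phi_k
        errors.append(abs(e))
    return np.array(errors)

# Toy plant with a mild nonlinearity; the state error shrinks as units adapt
err = sequential_identify(lambda x, u: -x + 0.5 * np.sin(x) + u)
```

Under these assumptions the state error settles near zero once the weights have learned the nonlinearity along the visited trajectory; the growing grid described below would be needed whenever the required accuracy is not reached.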
The identification problem for the dynamical system of Equation 2.24 can
be viewed as the estimation of the nonlinear function g(x,u;p) as shown in
Section 2.4. If the modelling error is greater than the required one, according
to approximation theory more basis functions should be added to the network
model to get a better approximation. In this case, denote the prior identification structure of the function at time t as ĝ^(t)(x, u; p) and the structure
immediately after the addition of a basis function as ĝ^(t+)(x, u; p). Based on
the structure of the function g(x, u; p) in Equation 2.28, the identification
structure now becomes
    \hat{g}^{(t^+)}(x, u; p) = \hat{g}^{(t)}(x, u; p) + w_{K+1} \varphi_{K+1}(\bar{x}, \bar{u}; m_{K+1}, r_{K+1})   (2.47)
where w_{K+1} is the weight of the new (K+1)th Gaussian radial basis function
φ_{K+1}. The sequential identification scheme using a neural network for the
nonlinear function g(x, u; p) is shown in Figure 2.6.
It is also known that the kth Gaussian radial basis function has a localisation property: its influence area is governed by the centre m_k and width r_k.
In other words, once the centre m_k and the width r_k are fixed, the influence
area of the kth Gaussian radial basis function φ_k is limited in state-space to
the neighbourhood of m_k.
Fig. 2.6. The sequential identification scheme using a neural network
Let us first consider how to limit the number of the centres and hence the
size of the network. As shown in Figure 2.4, the observation pairs (x, u) are
in a rectangular set. An hx x hu grid, where hx and hu are odd integers, can
be produced by scaling the x and u axes by 2b x /(h x - 1) and 2b u /(h u - 1),
respectively, as shown in Figure 2.7. If the centres of the basis functions of
the network model are located on some of the crosspoints of the grid it is
clear that those centres will be equally distributed. For any point (x, u) in the
rectangular set, the nearest crosspoint (x_m, u_m) can be calculated by

    x_m = \mathrm{round}( x / \delta_x )\, \delta_x   (2.48)

    u_m = \mathrm{round}( u / \delta_u )\, \delta_u   (2.49)

where round(·) is an operator for rounding the number (·) to the nearest integer
and
    \delta_x = \frac{2 b_x}{h_x - 1}   (2.50)

    \delta_u = \frac{2 b_u}{h_u - 1}   (2.51)
The main influence area D of the radial basis function with the centre (x_m, u_m)
is also shown in Figure 2.7.
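The snap-to-grid computation can be sketched directly from the spacings (2.50) and (2.51); the function name and the test point below are illustrative.

```python
def nearest_crosspoint(x, u, b_x, b_u, h_x, h_u):
    """Snap an observation (x, u) to the nearest crosspoint of an
    h_x-by-h_u grid over the rectangle [-b_x, b_x] x [-b_u, b_u]."""
    delta_x = 2.0 * b_x / (h_x - 1)   # grid resolution along x, eq. (2.50)
    delta_u = 2.0 * b_u / (h_u - 1)   # grid resolution along u, eq. (2.51)
    x_m = round(x / delta_x) * delta_x
    u_m = round(u / delta_u) * delta_u
    return x_m, u_m

print(nearest_crosspoint(0.32, -0.18, b_x=1.0, b_u=1.0, h_x=5, h_u=5))  # (0.5, 0.0)
```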
2.5 Sequential Nonlinear Identification 43
Now, consider how the width r_k of the kth basis function is chosen. The angle
between two GRBFs φ_i and φ_j with the same width r_i = r_j = r_0 is given
by (Kadirkamanathan, 1991; Kadirkamanathan and Niranjan, 1993)
(2.52)
To add a new basis function φ_k that is nearly orthogonal to all existing basis functions, the angle between the GRBFs should be as large as possible.
This means that the width r_k should be reduced. But the curvature of the
basis function φ_k will be increased by reducing r_k, which in turn leads to a less
smooth function. Thus, to make a compromise between the orthogonality and
the smoothness, a good choice for the width r_k, which ensures the angles between GRBF units are approximately equal to some required angle θ_min, is
(Kadirkamanathan, 1991)
(2.53)
where

(2.54)

with θ_min being the required minimum angle between Gaussian radial basis
functions, and

    m_k^{\dagger} = \arg \min_{i=1,...,K,\; m_i \neq m_k} \{ \| m_k - m_i \| \}   (2.55)
is the nearest (in the Euclidean space) centre to the kth centre. The above
assignments are the same as those for the resource allocating network (RAN)
(Platt, 1991) for which the equations are arrived at from the consideration of
observation novelty heuristics.
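Since (2.53) and (2.54) are not reproduced here, the width assignment can only be sketched under an assumption: as in the RAN, take the width proportional to the distance to the nearest existing centre from (2.55). The proportionality factor kappa below is illustrative.

```python
import math

def nearest_centre_distance(m_k, centres):
    # Nearest (Euclidean) centre to m_k, as in equation (2.55)
    return min(math.dist(m_k, m_i) for m_i in centres if m_i != m_k)

def choose_width(m_k, centres, kappa=0.7):
    # Assumed RAN-style rule: width proportional to the nearest-centre
    # distance, trading orthogonality against smoothness
    return kappa * nearest_centre_distance(m_k, centres)

centres = [(0.0, 0.0), (0.5, 0.0), (0.0, 1.0)]
print(choose_width((0.5, 0.0), centres))  # 0.35
```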
The growing network, which is a special case of the variable network without
the removal operation, is initialised with no basis function units. As observations
are received the network grows by adding new units. The decision to add a new
unit depends on the observation novelty for which the following two conditions
must be satisfied:
(i)
    \min_{k=1,...,K} | x - m_{k1} | > \frac{\delta_x}{2}   (2.56)

or

    \min_{k=1,...,K} | u - m_{k2} | > \frac{\delta_u}{2}   (2.57)

(ii)
    | e_x | > e_{max}

where δ_x and δ_u represent the scale of resolution in the input-state grid, and
e_max is chosen to represent the desired maximum tolerable accuracy of the
state estimation. Criterion (i) says that the current observation must be far
from existing centres. Criterion (ii) means that the state error in the network
must be significant.
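The two criteria combine into a single allocation predicate, sketched below; encoding criterion (ii) as |e_x| > e_max follows the description in the text, and the names and test values are illustrative.

```python
def is_novel(x, u, e_x, centres, delta_x, delta_u, e_max):
    """Decide whether to allocate a new GRBF unit: the observation must be
    far from all existing centres (criterion (i)) AND the state error must
    exceed the tolerable accuracy e_max (criterion (ii))."""
    if not centres:
        return abs(e_x) > e_max
    far_in_x = min(abs(x - m1) for m1, _ in centres) > delta_x / 2
    far_in_u = min(abs(u - m2) for _, m2 in centres) > delta_u / 2
    return (far_in_x or far_in_u) and abs(e_x) > e_max

print(is_novel(0.4, 0.1, e_x=0.02, centres=[(0.0, 0.0)],
               delta_x=0.5, delta_u=0.5, e_max=0.005))  # True
```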
When a new unit is added to the network at time t_1, the parameters associated with the GRBF units are adapted as follows:

    m_{K+1} = [\, \mathrm{round}( x(t_1)/\delta_x )\, \delta_x, \;\; \mathrm{round}( u(t_1)/\delta_u )\, \delta_u \,]^T   (2.59)

(2.60)

(2.61)

for k = 1, ..., K+1 and w_{K+1}(t_1^+) = 0. If no new GRBF unit is added, only
the weights are adapted by the law (2.62), for k = 1, ..., K.
It is known from approximation theory (Powell, 1981) that the approximation accuracy of a function by a set of basis functions, such as in neural networks, is proportional to the parameters δ_x and δ_u of the grid. In other words,
the smaller the parameters δ_x and δ_u, the more accurate the neural model. If
the tolerable accuracy of the state error is not reached, i.e., |e_x| > e_max, then
the thresholds δ_x and δ_u in criterion (i) should gradually be reduced by
halving their values (i.e., δ_x/2 and δ_u/2) at each time step until the minimum
allowed values are reached. In this way, the state error will be reduced and the
existing centres of the basis functions of the network model are all still on the
crosspoints of the new grid as shown in Figure 2.8.
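The refinement step can be sketched as a small helper; the floor delta_min standing in for the "minimum allowed values" is illustrative.

```python
def refine_thresholds(delta_x, delta_u, e_x, e_max, delta_min=1e-3):
    """Halve the novelty thresholds while the state error exceeds the
    tolerable accuracy; halving keeps every existing centre on a
    crosspoint of the refined grid."""
    if abs(e_x) > e_max:
        delta_x = max(delta_x / 2, delta_min)
        delta_u = max(delta_u / 2, delta_min)
    return delta_x, delta_u

print(refine_thresholds(0.05, 0.05, e_x=0.02, e_max=0.005))  # (0.025, 0.025)
```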
As the number of GRBFs and grid crosspoints increases, the approximation
of a function by a GRBF network becomes increasingly accurate, i.e.,
(2.63)
(2.64)
It has also been shown in section 2.4 that the overall identification scheme
is stable and that the model parameters converge to within some bound of
the optimal values. Therefore, the algorithm developed in this section still
guarantees the stability and convergence of the overall identification.

2.6 Sequential Identification of Multivariable Systems

Consider the multivariable nonlinear continuous-time system

    \dot{x} = f(x, u), \quad x(0) = x_0   (2.65)

where u ∈ R^{r×1} is the input vector, x ∈ R^{n×1} is the state vector and f(·) ∈
R^{n×1} is a nonlinear function vector. Following the same line of analysis as
for the single-input single-state case, the identification model for the system
(2.65) can be expressed by

    \dot{x} = A x + g(x, u)   (2.66)

where
g(x, u) = f(x, u) - Ax (2.67)
and A ∈ R^{n×n} is a Hurwitz or stability matrix (i.e., all its eigenvalues are
in the open left-half complex plane). Modelling the nonlinear function vector
g(x, u) ∈ R^{n×1} using GRBF neural networks gives the following identification
model:
    \dot{\hat{x}} = A \hat{x} + \hat{g}(x, u; p), \quad \hat{x}(0) = \hat{x}_0   (2.68)
where x̂ denotes the state vector of the network model and ĝ is the output
vector of the GRBF neural network. Define the following one-to-one mappings
for the inputs and states:
    \bar{x}_i = \frac{ b_{xi} x_i }{ |x_i| + a_{xi} }, \quad i = 1, 2, ..., n   (2.69)

    \bar{u}_i = \frac{ b_{ui} u_i }{ |u_i| + a_{ui} }, \quad i = 1, 2, ..., r   (2.70)
where a_{xi}, b_{xi}, a_{ui}, b_{ui} are positive constants. These mappings ensure that the
elements of the vectors x̄ and ū are all in bounded sets. The estimate of the
function g then is written as
(2.71)
where
(2.72)
(2.73)
where W_K^* is the optimal weight matrix and ε(t) = [ε_1(t), ε_2(t), ..., ε_n(t)]^T is
the modelling error vector which is assumed to be bounded by
(2.76)
(2.77)
(2.78)
(2.79)
where tr(·) denotes the trace of the matrix (.). The first derivative of the
Lyapunov function V with respect to time t is
    \dot{V}(e_x, \Gamma_K) = e_x^T A e_x + e_x^T \Gamma_K \Phi_K(\bar{x}, \bar{u}) + \frac{1}{\alpha}\, \mathrm{tr}( \dot{\Gamma}_K^T \Gamma_K ) + e_x^T \varepsilon(t)   (2.80)
Since
(2.83)
    \dot{V}(e_x, \Gamma_K) = e_x^T A e_x + e_x^T \varepsilon(t)
                      \le -| \lambda_{max}(A) |\, e_x^T e_x + e_x^T \varepsilon(t)
                      \le -| \lambda_{max}(A) | \sum_{i=1}^{n} | e_{xi} | \left( | e_{xi} | - \frac{\varepsilon_K}{| \lambda_{max}(A) |} \right)   (2.84)

If

    \min_{i=1,...,n} | e_{xi} | < \frac{\varepsilon_K}{| \lambda_{max}(A) |}   (2.85)

then it is possible that V̇ > 0, which implies that the weights w_{ik} may drift to
infinity over time. Following the analysis for the single-input single-state case,
this drift can be avoided by modifying the adaptation law as
where δ_{xi} and δ_{uj} represent the scale of resolution in the input-state grid and
e_max is chosen to represent the desired maximum tolerable accuracy of the
state estimation. When a new unit is added to the network at time t_1 the
parameters associated with the GRBF units are adapted as follows:

    m_k^{\dagger} = \arg \min_{i=1,...,K+1,\; m_i \neq m_k} \{ \| m_k - m_i \| \}, \quad k = 1, ..., K+1   (2.92)
(2.93)
2.7 An Example
(2.95)
where the input u is assumed to be cos(t) and the initial state x(0) = 0.
The parameter values used in this example are as follows: e_0 = 0.001, e_max =
0.005, δ_x = δ_u = 0.05, a = 0.5, M = 1.5, α = 1, κ = 3.0, x̂_0 = 0.
The simulation was begun with no GRBF units in the network model and
the number of units increased with time, according to the growth criteria.
The final results after operation over a period of 10 seconds gave a GRBF
network with 16 hidden units for approximating the dynamical system. The
performance of the sequential identification scheme using the GRBF network
is shown in Figures 2.9-2.12 for a typical run of the algorithm; similar plots
were obtained under different operational conditions.
Fig. 2.9. Real state x and estimated state x̂ over time
The actual and estimated states and the state error of the dynamical system
against time t are shown in Figures 2.9 and 2.10, respectively. It can be seen
from Figure 2.9 that for much of the operation, the state error is constrained
within the maximum tolerable bound e max = 0.005. The network parameters
also converged to a set of values although they were oscillating around these
values. A plot of the actual state x and the estimated state x̂ against the input
u is shown in Figure 2.11, which indicates the presence of strong nonlinearity in the dynamical system. Figure 2.12 shows the relationship between the
estimated state and its first derivative as they gradually approach the true set
of values.
Fig. 2.10. State error of the dynamical system against time t (sec)
Fig. 2.11. Actual state x (-) and estimated state x̂ (- -) against the input u
Fig. 2.12. Actual derivative of state ẋ against state x (-) and estimated derivative
of state x̂ against state x̂ (- -)
2.8 Summary
A variable neural network structure has been proposed, where the number of
basis functions in the network can be either increased or decreased over time
according to some design strategy to avoid either overfitting or underfitting.
In order to model unknown nonlinearities of nonlinear systems, the variable
neural network starts with a small number of initial hidden units, then adds
or removes units on a variable grid consisting of a variable number of subgrids
with different sized hypercubes, based on the novelty of observation.
A sequential identification scheme for continuous nonlinear dynamical sys-
tems with unknown nonlinearities using neural networks has been developed.
The main feature of this scheme is the combination of the growing Gaussian
radial basis function network with that of Lyapunov synthesis techniques in
developing the adaptive or estimation laws that guarantee the stability of the
system. The idea of growing the network, similar to the resource allocating net-
work (RAN), overcomes the problem of having to choose the neural network
structure a priori, a difficult task which often results in an overdetermined
network. The network begins with no radial basis function units and with
increasing time, the model grows gradually to approach the appropriate com-
plexity of the network that is sufficient to provide the required approximation
accuracy. The stability of the overall identification scheme and convergence of
the model parameters are guaranteed by parameter adjustment laws developed
using the Lyapunov synthesis approach. To ensure that the modelling error is
3. Recursive Nonlinear Identification

3.1 Introduction
The system identification procedure mainly consists of model structure selec-
tion and parameter estimation. The former is concerned with selecting which
class of mathematical operator is to be used as a model. The latter is con-
cerned with an estimation algorithm and usually requires input output data
from the process, a class of models to be identified and a suitable identifica-
tion criterion. A number of techniques have been developed in recent years
for model selection and parameter estimation of nonlinear systems. Forward
and backward regression algorithms were analysed in Leontaritis and Billings
(1987). Stepwise regression was used in Billings and Voon (1986) and a class of
orthogonal estimators were discussed in Korenberg et al. (1988). Algorithms
with the objective of saving memory and allowing fast computation have been
proposed in Chen and Wigger (1995). Methods to determine the a priori structural identifiability of a model have also been studied (Ljung and Glad, 1994).
A survey of existing techniques of nonlinear system identification prior to the
1980s is given in Billings (1980), a survey of the structure detection of input
output nonlinear systems is given in Haber and Unbehauen (1990) and a sur-
vey of nonlinear black-box modelling in system identification can be found in
Sjoberg et al. (1995).
An area of rapid growth in recent years has been neural networks. This
approach makes few restrictions on the type of input output mapping that can
be learnt. The majority of nonlinear identification techniques using neural networks are off-line, which means the structure and parameters of the model are
fixed after off-line identification based on a set of input output data. However,
if there is a change in the system operation or the real system input space is
different from the one which was used for off-line identification, this will lead
to changes in the parameters of the neural network based model, causing a de-
terioration in the performance of the identification. Therefore, in order to have
good identification performance, both the structure and the parameters of the
model need to be modified in response to variations of the plant characteristics
and operating point. Recently, new algorithms have been developed which op-
erates on a window of data and which can be used on-line to adaptively track
the variations of both model structure (Fung et at., 1996; Luo and Billings,
1995) or topology (Luo et al., 1996; Luo and Billings, 1998) and update the
estimated parameters or weights on-line.
where f(·) is some nonlinear function, and n and m are the corresponding
maximum delays.
It is well known that neural networks provide good nonlinear function
approximation techniques. A nonlinear identification structure using neural networks is shown in Figure 3.1. Here it is assumed that the nonlinear function f(·)
in the NARMA model is approximated by a single-layer neural network, which
consists of a linear combination of basis functions:

    \hat{f}(x_t) = \sum_{k=1}^{N} w_k \varphi_k(x_t)   (3.4)

where x_t = [y_{t-1}, y_{t-2}, ..., y_{t-n}, u_{t-1}, u_{t-2}, ..., u_{t-m}], φ_k(x_t) is the basis function and w_k the weight.
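The single-layer structure is just a linear combination of basis functions, as the sketch below shows; the particular regressors and weights are illustrative.

```python
def predict(x_t, basis_functions, weights):
    # Single-layer network output: y_hat = sum_k w_k * phi_k(x_t)
    return sum(w * phi(x_t) for phi, w in zip(basis_functions, weights))

# A tiny VPBF-style expansion in the regressors x_t = [y_{t-1}, u_{t-1}]
basis = [lambda x: 1.0,          # constant term
         lambda x: x[0],         # y_{t-1}
         lambda x: x[1],         # u_{t-1}
         lambda x: x[0] * x[1]]  # cross term y_{t-1} u_{t-1}
weights = [0.1, 0.5, -0.2, 0.3]
print(predict([1.0, 2.0], basis, weights))
```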
Fig. 3.1. Nonlinear identification structure by neural networks

3.2 Nonlinear Modelling by VPBF Networks
(3.5)
    [ \varphi_1, \varphi_2, \varphi_3, ..., \varphi_{n+1}, \varphi_{n+2}, ..., \varphi_{n+m+1}, \varphi_{n+m+2}, \varphi_{n+m+3}, ..., \varphi_N ](x_t)
    = [ 1, y_{t-1}, y_{t-2}, ..., y_{t-n}, u_{t-1}, ..., u_{t-m}, y_{t-1}^2, y_{t-1} y_{t-2}, ..., u_{t-m}^l ]   (3.6)

and the number of polynomial basis functions is given by

    N = \frac{(n+m+l)!}{l!\,(n+m)!}   (3.7)
Using the VPBF network, the nonlinear function f(·) can be obtained by

(3.8)

Increasing the order l, the number N of basis functions becomes larger and
larger. Thus, the problem is how to estimate the function f(x_t) using a properly
sized neural network so that the approximation accuracy is within the required
bound. The structure selection and the weight learning of the neural network
are discussed in the following sections.

3.3 Structure Selection of Neural Networks

There are many ways to select the basis functions. Here, off-line structure
selection using the orthogonal least squares algorithm (Billings et al., 1988) and
on-line structure selection using growing network techniques are introduced for
the basis function selection of Volterra polynomial networks.
It is assumed that a set of input output data (Yt, Ut, t = 1,2, ... , M) of the
system is given. Based on (3.5), the input output relation may compactly be
written in the following vector form:
(3.9)
where the output vector Y ∈ R^{M×1}, the weight vector W ∈ R^{N×1}, the
approximation error vector Ω(x^l) ∈ R^{M×1} and the basis function matrix
Φ(x) ∈ R^{M×N} are, respectively,
    \Phi(x) = \begin{bmatrix} \varphi_1(x_1) & \varphi_2(x_1) & \cdots & \varphi_N(x_1) \\ \varphi_1(x_2) & \varphi_2(x_2) & \cdots & \varphi_N(x_2) \\ \vdots & \vdots & & \vdots \\ \varphi_1(x_M) & \varphi_2(x_M) & \cdots & \varphi_N(x_M) \end{bmatrix}   (3.13)
    \hat{W} = \arg \min_{W} \| Y - \Phi(x) W \|_2^2   (3.14)
transformation of the set of basis vectors {φ_i} into a set of orthogonal basis
vectors, and thus makes it possible to calculate the individual contribution to
the desired output from each basis vector. An orthogonal decomposition of the
matrix Φ(x) gives
    \Phi(x) = P Q   (3.15)

where the columns of P are orthogonal and Q is an upper triangular matrix
with unit diagonal elements.
    Y = P V + \Omega(x^l)   (3.19)

    W = Q^{-1} V   (3.20)

where V = [v_1, v_2, ..., v_N]^T ∈ R^{N×1}. It can be seen that the optimal estimate
V̂ = [v̂_1, v̂_2, ..., v̂_N]^T of the vector V is

    \hat{v}_i = \frac{ p_i^T Y }{ p_i^T p_i }, \quad i = 1, 2, ..., N   (3.21)

(3.22)
The classical Gram-Schmidt and modified Gram-Schmidt methods can be used
to derive the above and thus to solve for the least squares estimate of W.
The output variance can be expressed as

    \frac{Y^T Y}{M} = \frac{1}{M} \sum_{i=1}^{N} \hat{v}_i^2\, p_i^T p_i + \frac{\Omega^T \Omega}{M}   (3.23)

Note that \sum_{i=1}^{N} \hat{v}_i^2 p_i^T p_i / M is the part of the desired output variance which can
be explained by the basis functions and Ω^T Ω / M is the unexplained variance
of y(t). So, \hat{v}_i^2 p_i^T p_i is the increment to the explained desired output variance
introduced by p_i and the error reduction ratio due to p_i may be defined by

    r_i = \frac{ \hat{v}_i^2\, p_i^T p_i }{ Y^T Y }   (3.24)
This ratio offers a simple and effective means of seeking a subset of significant
basis functions. The implementation based on the classical Gram-Schmidt
method is given as follows (Billings et al., 1988, 1989):
(a) At the first step, for i = 1, 2, ..., N, calculate

    p_1^{(i)} = \varphi_i   (3.25)

    v_1^{(i)} = \frac{ (p_1^{(i)})^T Y }{ (p_1^{(i)})^T p_1^{(i)} }   (3.26)

    r_1^{(i)} = \frac{ (v_1^{(i)})^2 (p_1^{(i)})^T p_1^{(i)} }{ Y^T Y }   (3.27)

Find

    s_1 = \arg \max \{ r_1^{(i)}, \; i = 1, 2, ..., N \}   (3.28)

and select

    p_1 = p_1^{(s_1)} = \varphi_{s_1}   (3.29)
(b) At the k-th step, where k ≥ 2, for i = 1, 2, ..., N, i ≠ s_1, ..., i ≠ s_{k-1},
compute

    \alpha_{jk}^{(i)} = \frac{ p_j^T \varphi_i }{ p_j^T p_j }, \quad j = 1, 2, ..., k-1   (3.30)

    p_k^{(i)} = \varphi_i - \sum_{j=1}^{k-1} \alpha_{jk}^{(i)} p_j   (3.31)

    v_k^{(i)} = \frac{ (p_k^{(i)})^T Y }{ (p_k^{(i)})^T p_k^{(i)} }   (3.32)

    r_k^{(i)} = \frac{ (v_k^{(i)})^2 (p_k^{(i)})^T p_k^{(i)} }{ Y^T Y }   (3.33)

Find

    s_k = \arg \max \{ r_k^{(i)}, \; i = 1, 2, ..., N, \; i \neq s_1, ..., i \neq s_{k-1} \}   (3.34)

and select

    p_k = p_k^{(s_k)}   (3.35)
(c) The procedure is terminated at the L-th step when

    1 - \sum_{j=1}^{L} r_j < e_0   (3.36)

where 0 < e_0 < 1 is a chosen tolerance. This gives a subset model containing L significant terms.
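Steps (a) and (b) can be sketched in a few lines; the variable names, the synthetic data and the fixed number of selected terms are illustrative, and the stopping test of step (c) is omitted for brevity.

```python
import numpy as np

def ols_select(Phi, y, n_select):
    """Forward selection by orthogonal least squares (classical
    Gram-Schmidt): at each step pick the candidate with the largest
    error reduction ratio r_i = v_i^2 p_i^T p_i / (y^T y)."""
    M, N = Phi.shape
    selected, P = [], []
    yty = float(y @ y)
    for _ in range(n_select):
        best, best_ratio, best_p = None, -1.0, None
        for i in range(N):
            if i in selected:
                continue
            p = Phi[:, i].copy()
            for pj in P:               # orthogonalise against chosen p_j
                p -= (pj @ Phi[:, i]) / (pj @ pj) * pj
            ptp = p @ p
            if ptp < 1e-12:            # numerically dependent candidate
                continue
            v = (p @ y) / ptp
            ratio = v * v * ptp / yty  # error reduction ratio
            if ratio > best_ratio:
                best, best_ratio, best_p = i, ratio, p
        selected.append(best)
        P.append(best_p)
    return selected

# y depends strongly on column 2 and weakly on column 0
rng = np.random.default_rng(0)
Phi = rng.standard_normal((100, 4))
y = 5.0 * Phi[:, 2] + 0.5 * Phi[:, 0]
print(ols_select(Phi, y, 2))  # [2, 0]
```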
It is clear from (3.21) and (3.24) that r_i ≥ 0. Changing the order of the
VPBFs will lead to a change in the error reduction ratio r_i. For N VPBFs,
there are N! sorting possibilities. Let r_i^{(j)} denote the error reduction ratio r_i
corresponding to the j-th sorting of the VPBFs. The classical Gram-Schmidt
method introduced above can be used to find the a-th sorting of the basis
functions φ_1(x_t), φ_2(x_t), ..., φ_N(x_t), which is the best sorting, such that
    \sum_{i=1}^{k} r_i^{(a)} \ge \sum_{i=1}^{k} r_i^{(j)}, \quad j \neq a, \; j = 1, 2, ..., N!, \; k = 1, 2, ..., N   (3.37)
In this way, the priority of all candidates is determined. Thus, the best sorting of VPBFs is denoted by φ_1^o(x_t), φ_2^o(x_t), ..., φ_N^o(x_t) and the corresponding
weight vector is W^o.
For nonlinear systems, the system operation can change with time, or the real
system input space may differ from the one which was used for off-line identification. In order to produce good identification performance, both the structure
and the weights of the neural network model may need to be modified in re-
sponse to variations in the plant characteristics. Here, the modification of the
neural network structure will be taken into account. The adaptation of the
weights will be discussed in the next section.
According to approximation theory, adding more independent basis func-
tions to the network will improve approximation. In off-line structure selection,
the VPBFs are reordered in terms of their priority. Here it is assumed that at
time t - 1 the basis functions of the VPBF network consist of the first L best
candidates ip~ (Xt), ip~ (Xt), ... , ipL (Xt). To improve the approximation accuracy,
the growing network technique (Liu et al., 1995, 1996c) is applied. This means
that one more VPBF, which is chosen from and is of the highest priority in
the remaining basis function candidates ipL+l (Xt), ipL+2(Xt), ... , ip'N(Xt) , needs
to be added to the network. In this case, denote the structure of the VPBF
network at time t−1 as f̂^(t−1)(x_t) and the structure immediately after the
addition of a basis function at time t as f̂^(t)(x_t). Based on the growing network
technique and the structure of the function f(x_t) in (3.4), the structure of the
VPBF network now becomes

    \hat{f}^{(t)}(x_t) = \hat{f}^{(t-1)}(x_t) + w_{L+1} \varphi_{L+1}^{o}(x_t)   (3.38)

where w_{L+1} is the weight corresponding to the new (L+1)th Volterra polynomial basis function φ_{L+1}^o(x_t).
The growing VPBF network is initialised with a small set of Volterra poly-
nomial basis function units. As observations are received, the network grows
by adding new units. This is called the addition operation. The decision to
add a new unit depends on two conditions. The first is that the following must
be satisfied:
(3.39)

3.4 Recursive Learning of Neural Networks

In the previous section, the structure selection of the VPBF network model
was considered to reach a good approximation accuracy. This section takes
into account the parameter adaptation laws which ensure that the estima-
tion error converges to the desired range when the plant characteristics and
the system operating point change. Here, it is assumed that the basis functions φ_1(x_t), φ_2(x_t), ..., φ_L(x_t) are given. The estimated function f̂_t(x_t) in the
NARMA model can also be expressed by
(3.40)
where the weight vector Wt - I and the basis function vector Pt-I are
    y_t = \varphi_{t-1}^T W^* + \varepsilon_t   (3.43)
The minimal upper bound of the modelling error ε_t is given by a constant
h, which represents the accuracy of the model and is defined as
(3.44)
The estimation problem is then to find a vector W belonging to the set defined
by
(3.45)
    W_t' = W_{t-1} + a_t \beta_t P_t \varphi_{t-1} e_t   (3.47)

    P_t = P_{t-1} - \beta_t a_t P_{t-1} \varphi_{t-1} \varphi_{t-1}^T P_{t-1}   (3.48)

(3.50)

    \beta_t = \begin{cases} 1, & |e_t| > \delta \\ 0, & |e_t| \le \delta \end{cases}   (3.51)
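One step of this update can be sketched as follows. The gain a_t = 1/(1 + φᵀPφ) stands in for the unreproduced equation (3.50) and is an assumption, as is the synthetic regression used to exercise the step.

```python
import numpy as np

def recursive_update(W, P, phi, y, delta):
    """One step of the dead-zone recursive update (3.47)-(3.48), (3.51):
    the weights move only when the prediction error exceeds delta."""
    e = y - phi @ W                          # prediction error e_t
    beta = 1.0 if abs(e) > delta else 0.0    # switching factor (3.51)
    a = 1.0 / (1.0 + phi @ P @ phi)          # assumed normalising gain a_t
    P_new = P - beta * a * np.outer(P @ phi, phi @ P)   # (3.48)
    W_new = W + a * beta * P_new @ phi * e              # (3.47)
    return W_new, P_new

rng = np.random.default_rng(1)
W_true = np.array([1.0, -2.0, 0.5])
W, P = np.zeros(3), 100.0 * np.eye(3)
for _ in range(1000):
    phi = rng.standard_normal(3)
    y = phi @ W_true + 0.001 * rng.standard_normal()
    W, P = recursive_update(W, P, phi, y, delta=0.01)
print(np.round(W, 2))
```

Once |e_t| falls inside the dead zone the estimate freezes, which is what keeps the weights bounded when only the error bound, rather than the error itself, is known.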
(3.54)

    s^+ = 1 + \frac{ 2\,( a_t e_t \varphi_{t-1}^T P_t W_{t-1} + \varepsilon_t ) }{ \| a_t e_t P_t \varphi_{t-1} \|_2^2 }   (3.55)
Next, the properties of the above learning algorithm are analysed using
Lyapunov techniques. To ensure the convergence of the algorithm, consider
the Lyapunov function:
(3.56)
where W̃_t = W^* − W_t. The above implies that b_t is used to reduce the effect of a_t in the weight vector
W'_t if ‖W'_t‖_2^2 > M. The Lyapunov function V_t in (3.56) is now extended to
(3.60)
where
(3.61 )
which uses β_t = β'_t. Following the matrix inversion lemma (Goodwin and
Mayne, 1987), the inverse of the matrix P_t is obtained by
(3.63)
(3.64)
3.4 Recursive Learning of Neural Networks 63
From (3.63) and (3.64), the first term on the right side of (3.62) is expressed
as
    \tilde{W}_{t-1}^T P_{t-1}^{-1} \tilde{W}_{t-1} + a_t \beta_t ( \tilde{W}_{t-1}^T \varphi_{t-1} )^2
    = V_{t-1} + a_t \beta_t ( e_t^2 + \varepsilon_t^2 - 2 e_t \varepsilon_t )   (3.65)
(3.66)
(3.67)
(3.68)
Since it is assumed that the approximation error ε_t satisfies |ε_t| ≤ h ≤ δ,
then from the above, for |e_t| ≥ δ, it is not difficult to show that

    |e_t|^3 - 2 \delta^2 |e_t| + \delta^3 \ge |e_t|^2 ( |e_t| - \delta )

Hence
(3.71)
It is known from (3.48) that λ_max(P_t) ≤ λ_max(P_{t−1}) ≤ ... ≤ λ_max(P_0). As a
result,
(3.72)
(3.73)
    \lim_{t \to \infty} e_t = \delta   (3.74)
Also, it can be seen from (3.72) that the weights converge as time t approaches
infinity. In addition, Equation 3.73 implies that the weights will never drift to
infinity over time. Thus, if M is chosen to satisfy
The analysis of the algorithm for Case 1 shows that if δ < δ_L, W'_t may be
greater than the bound M. In addition, in the case where (3.75) is not satisfied,
it cannot simply be assumed that η_t = 0, since W'_t may also be greater than the
bound M. So, W_t = W'_t − η_t β_t a_t P_t φ_{t−1} e_t will be used for weight adjustment.
This leads to
    \| W_{t-1} + a_t + b_t \|_2^2 = \| W_{t-1} \|_2^2 + ( a_t + b_t )^T ( 2 W_{t-1} + a_t + b_t )
                               = \| W_{t-1} \|_2^2 + (1 - \eta_t)\, a_t^T ( 2 W_{t-1} + (1 - \eta_t) a_t )   (3.76)
(3.77)
(3.78)
and s^- and s^+ are given by (3.54) and (3.55). There are still an infinite number
of possibilities for η_t. Hence, the question of what is the optimal solution of η_t
arises. To answer this question, let us consider (3.61). The first term on the
right side of (3.61) can be calculated as
(3.81)
where
(3.82)
Now, d_t consists of two parts. The first is 2η_t a_t β_t e_t ε_t, which is the uncertain
part because the modelling error is unknown. The second is g_t(η_t), which is
computable. It is also known from the Lyapunov technique that the more
negative d_t is, the faster the reduction of the function V_t. Thus, the function
g_t(η_t) is used as the performance index for choosing the optimal solution of η_t.
The function g_t(η_t) is a parabola opening upwards and has only one minimum. The
optimal η_t^* which minimises g_t and the minimum g_t(η_t^*) are given by
(3.83)
(3.84)
the second and third terms on the right-hand side of (3.86) will be negative.
Using (3.50) and (3.52) gives

    a_t \gamma_{t-1} = \left( 1 + ( 2 - \delta |e_t|^{-1} )\, \varphi_{t-1}^T P_{t-1} \varphi_{t-1} \right) \left( 1 + \varphi_{t-1}^T P_{t-1} \varphi_{t-1} \right)^{-1} \le 2   (3.88)

As a result, if the following condition

(3.89)

is satisfied, then the weights converge to their optimal values since ΔV_t ≤ 0.
On the other hand, if the above condition is not satisfied, it is possible that
ΔV_t > 0. This implies that the weight vector W_t may drift away over time. In
this case, the weight learning algorithm given by (3.46) avoids divergence of the
weight vector because ‖W_t‖^2 will not be greater than M for η_t ∈ [s^-, s^+].
Thus the error |e_t| always converges.
If η_t^* ∉ [s^-, s^+], let

    \eta_t' = \begin{cases} s^+, & g_t(s^+) \le g_t(s^-) \\ s^-, & g_t(s^+) > g_t(s^-) \end{cases}   (3.90)
Then

    V_t \le V_{t-1} + a_t \beta_t ( \delta_L^2 - a_t \gamma_t e_t^2 ) + g_t(\eta_t^*) + 2 a_t \gamma_t |\eta_t e_t| \delta_L
        = V_{t-1} + a_t \gamma_t ( \delta_L^2 - a_t \gamma_t e_t^2 )
          + 2 \gamma_t a_t |\eta_t e_t| \left( \delta_L - ( 1 - (1 - 0.5 \eta_t^*) a_t \varphi_{t-1}^T P_t \varphi_{t-1} )\, \mathrm{sgn}(\eta_t^*) |e_t| \right)   (3.91)
Similarly, if the following condition is satisfied

    |e_t| > \max \left\{ \sqrt{2}\, \delta_L, \; ( 1 - (1 - 0.5 \eta_t^*) a_t \varphi_{t-1}^T P_t \varphi_{t-1} )^{-1} \mathrm{sgn}(\eta_t^*)\, \delta_L \right\}   (3.92)
then the weights converge to their optimal values since ΔV_t ≤ 0. However, it
is possible that ΔV_t > 0 if the above condition is not satisfied. This indicates
that the weight vector W_t may drift away over time. But the weight learning
algorithm given by (3.46) constrains ‖W_t‖^2 to be not greater than M. So, the
error |e_t| always converges.
In the light of the above analysis, the design of η_t may be given by

(3.93)

The analysis of the algorithm for the weight adaptation laws clearly shows
that if the minimal upper bound h of the approximation error is not known,
both the weights and the estimation error are still bounded.
3.5 Examples
Three examples will be used to illustrate recursive nonlinear identification.
The first is a system described by an input output model. The second is a
system described by a state-space model. The third is the data set of the
Santa Fe time series prediction and analysis competition.
Example 3.1
    y_t = \frac{ y_{t-1}\, y_{t-2}\, y_{t-3}\, u_{t-2}\, ( y_{t-3} - 1 ) + u_{t-1} }{ 1 + y_{t-2}^2 + y_{t-3}^2 }   (3.94)
The input u_t was set to be a random sequence between -0.5 and 0.5. Based
on the input output data, the orthogonal least squares algorithm was used for
off-line structure selection of the VPBF network. The order of selection and the
corresponding weights are given in Table 3.1.
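For reference, the data generation for this example can be sketched as follows, using the form of (3.94) reconstructed above, y_t = (y_{t-1} y_{t-2} y_{t-3} u_{t-2} (y_{t-3} - 1) + u_{t-1}) / (1 + y_{t-2}^2 + y_{t-3}^2); the additive u_{t-1} term, the seed and the record length are assumptions.

```python
import numpy as np

def simulate_example_3_1(T=1000, seed=0):
    """Simulate the Example 3.1 plant driven by a uniformly distributed
    random input in [-0.5, 0.5], starting from zero initial conditions."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(-0.5, 0.5, size=T)
    y = np.zeros(T)
    for t in range(3, T):
        num = y[t-1] * y[t-2] * y[t-3] * u[t-2] * (y[t-3] - 1.0) + u[t-1]
        y[t] = num / (1.0 + y[t-2] ** 2 + y[t-3] ** 2)
    return u, y

u, y = simulate_example_3_1()
```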
On-line structure selection was then applied and the recursive weight learning
algorithm was used. The input was defined as
Fig. 3.2. Output y_t and estimated output ŷ_t by on-line identification (Example 3.1)
Fig. 3.3. Estimation error e_t using on-line identification (Example 3.1)
Fig. 3.4. 2-norm of the weight vector W t using on-line identification (Example 3.1)
Fig. 3.5. Estimation error et using off-line identification with 20 VPBFs (Example
3.1)
Example 3.2
    x_1(t+1) = \frac{ (\,\cdot\,) }{ 1 + x_1^2(t) + x_2^2(t) }   (3.96)

    x_2(t+1) = \frac{ 1.8\, x_1(t)\, x_2(t) }{ 1 + x_1^2(t) } + 1.4\, u^3(t)   (3.97)
The input u was set to be a random sequence between -0.5 and 0.5, as in
Example 3.1. Using the input output data, the priority of the VPBFs was
obtained using the orthogonal least squares algorithm. The order of the VPBFs
and the corresponding weights are given in Table 3.2.
The on-line structure selection technique and the recursive weight learning
algorithm were applied with the input given by
Fig. 3.6. System output y_t and estimated output ŷ_t using on-line identification
(Example 3.2)
Fig. 3.7. Estimation error e_t using on-line identification (Example 3.2)
Fig. 3.8. 2-norm of the weight vector W t using on-line identification (Example 3.2)
Fig. 3.9. Estimation error et using off-line identification with 30 VPBFs (Example
3.2)
Example 3.3
The algorithm developed in this chapter is applied to the data set D of the
Santa Fe time series prediction and analysis competition. The data set was
obtained from ftp.cs.colorado.edu/pub/Time-Series/SantaFe. Using the first
500 data points of the data set, the priority of the VPBFs was obtained using the
orthogonal least squares algorithm. The order of the VPBFs and the corresponding weights are given in Table 3.3.
The on-line structure selection technique and the recursive weight learning
algorithm were applied to the first 1000 items of the data set. The growing
VPBF network started with the first three best VPBFs, and stopped when the
number of VPBFs reached six. The simulation results are shown in Figures
3.10-3.13. In Figure 3.10, the sub-figure (b) is a larger scale version of the
sub-figure (a).
To test the disturbance rejection of the algorithm, a uniformly distributed
random noise (with magnitude 0.05) was added to the data set of the Santa Fe
time series. The estimation error is shown in Figure 3.13. It is clear that the
algorithm still gives good estimation.
Fig. 3.10. System output y_t and estimated output ŷ_t using on-line identification (Example 3.3)
Fig. 3.12. 2-norm of the weight vector W_t using on-line identification (Example 3.3)
Fig. 3.13. Estimation error e_t for the data with random noise (Example 3.3)
The results of the above three examples show that in terms of the estimation
error the performance of the proposed recursive identification scheme is much
better than an off-line approach. Although the minimal upper bound of the
3.6 Summary
4.1 Introduction
The modelling of nonlinear systems has been posed as the problem of selecting an approximating nonlinear function between the inputs and the outputs of the systems. For a single-input single-output system, it can be expressed by the nonlinear auto-regressive moving average model with exogenous inputs (NARMAX) (Chen and Billings, 1989), that is,

y(t) = f(y(t-1), y(t-2), ..., y(t-n_y), u(t-1), u(t-2), ..., u(t-n_u)) + e(t)   (4.1)
where
x = [y(t-1), y(t-2), ..., y(t-n_y), u(t-1), u(t-2), ..., u(t-n_u)]   (4.3)
φ_k(x, d_k) (k = 1, 2, ..., N) is the basis function and p is the parameter vector containing the weights w_k and the basis function parameter vectors d_k. If the basis functions φ_k(x, d_k) do not have the parameters d_k, then it is denoted by
φ_k(x). Two sets of basis functions are used: a set of Volterra polynomial basis functions (VPBF) and a set of Gaussian radial basis functions (GRBF).
Multivariate polynomial expansions have been suggested as a candidate for nonlinear system identification using the NARMAX model (Billings and Chen, 1992). The Volterra polynomial expansion (Schetzen, 1980) has been
cast into the framework of nonlinear system approximations and neural net-
works (Rayner and Lynch, 1989). A network whose basis functions consist
of the Volterra polynomials is named the Volterra polynomial basis function
network. Its functional representation is given by
f(x; p) = Σ_{k=1}^{N} w_k φ_k(x) + o(x³)   (4.5)

where p = {w_k} is the parameter vector containing the linear weights of the network, {φ_k(x)} is the set of basis functions being linearly combined, and o(x³) denotes the approximation error caused by the high-order (≥ 3) terms of the input vector. The basis functions are essentially polynomials of zero, first and higher orders of the input vector x ∈ R^n.
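As a quick sketch of how such a basis can be enumerated (the function below is illustrative, not from the book), the zero-, first- and second-order Volterra terms for an n-dimensional input number (n + 1)(n + 2)/2 in all:

```python
from itertools import combinations_with_replacement

def volterra_basis(x):
    """Volterra polynomial basis functions of order <= 2 for an input
    vector x in R^n: the constant, the n first-order terms x_i and the
    n(n + 1)/2 second-order terms x_i * x_j, i.e. (n + 1)(n + 2)/2 in all."""
    terms = [1.0]                                   # zero-order term
    terms += list(x)                                # first-order terms x_i
    terms += [xi * xj for xi, xj in
              combinations_with_replacement(x, 2)]  # second-order terms
    return terms

print(len(volterra_basis([2.0, 3.0, 5.0])))  # → 10, i.e. (3 + 1)(3 + 2)/2
```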
Radial basis functions were introduced as a technique for multivariable
interpolation (Powell, 1987), which can be cast into an architecture similar to
that of the multilayer perceptron (Broomhead and Lowe, 1988). Radial basis
function networks provide an alternative to the traditional neural network
architectures and have good approximation properties. One commonly used
radial basis function network is the Gaussian radial basis function (GRBF)
neural network. The nonlinear function approximated by the GRBF network
is expressed by
f(x; p) = Σ_{k=1}^{N} w_k exp(-‖C_k(x - d_k)‖²)   (4.8)

where C_k is the weighting matrix of the k-th basis function, and p is the parameter vector containing the weights w_k and the centres d_k (k = 1, 2, ..., N). For the sake of simplicity, it is assumed that C_k = I.
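A minimal evaluation sketch of such a network under the simplifying assumption C_k = I stated above (the function name, weights and centres are illustrative):

```python
import numpy as np

def grbf_network(x, weights, centres):
    """Evaluate f(x; p) = sum_k w_k exp(-||x - d_k||^2), i.e. a GRBF
    network under the assumption C_k = I made in the text."""
    x = np.asarray(x, dtype=float)
    activations = np.exp(-np.sum((np.asarray(centres) - x) ** 2, axis=1))
    return float(np.asarray(weights) @ activations)

# A unit centred exactly at x contributes its full weight.
centres = [[0.0, 0.0], [1.0, 1.0]]
weights = [2.0, 3.0]
print(grbf_network([0.0, 0.0], weights, centres))  # 2 + 3 e^{-2} ≈ 2.406
```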
where ‖·‖_2 and ‖·‖_∞ are the L_2- and L_∞-norms of the function (·), and η(f(x; p)) is the complexity measure of the model.
For model selection and identification of nonlinear systems, there are good
reasons for giving attention to the performance functions ¢i(p) (i = 1,2,3).
The practical reasons for considering the performance function ¢1 (p) is even
stronger than the other performance functions ¢2 (p) and ¢3 (p). Statistical con-
siderations show that it is the most appropriate choice for data fitting when
errors in the data have a normal distribution. Often the performance function
¢1 (p) is preferred because it is known that the best approximation calculation
is straightforward to solve. The performance function ¢2 (p) provides the foun-
dation of much of approximation theory. It shows that when this is small, the
performance function ¢1 (p) is small also. But the converse statement may not
be true. A practical reason for using the performance function ¢2 (p) is based
on the following. In practice, an unknown complicated nonlinear function is
often estimated by one that is easy to calculate. Then it is usually necessary to
ensure that the greatest value of the error function is less than a fixed amount,
which is just the required accuracy of the approximation. The performance
function ¢3 (p) is used as a measure of the model complexity. A smaller per-
formance function ¢3 (p) indicates a simpler model in terms of the number
of unknown parameters used. Under similar performances in ¢1 (p) and ¢2 (p)
by two models, the simpler model is statistically likely to be a better model
(Geman et at., 1992).
In order to give a feel for the usefulness of the multiobjective approach as opposed to single-objective design techniques, let us consider the minimisation of the cost functions φ_i(p) (i = 1, 2, 3). Let the minimum value of φ_i be given by φ_i*, for i = 1, 2, 3, respectively. For these optimal values φ_i* there exist corresponding values given by φ_j[φ_i*] (j ≠ i, j = 1, 2, 3), for i = 1, 2, 3, respectively, and the following relations hold:
There are many methods available to solve the above multiobjective optimisation problem (Liu et al., 2001). Following the method of inequalities (Zakian and Al-Naib, 1973; Whidborne and Liu, 1993), we reformulate the optimisation into a multiobjective problem as
where the positive real number ε_i represents the numerical bound on the performance function φ_i(p) and is determined by the designer. Generally speaking, the number ε_i is chosen to be a reasonable value corresponding to the performance function φ_i according to the requirements of the practical system. For example, ε_1 should be chosen between the minimum of φ_1 and the practically tolerable value of φ_1. The minimum of φ_1 can be found by the least squares algorithm. The practically tolerable value means that if φ_1 is greater than it, the modelling result cannot be accepted. In addition, if ε_i is chosen to be an unreachable value, Section 4.4 will show how to deal with this problem.
Many different techniques are available for optimising the design space as-
sociated with various systems. Recently, direct-search techniques, which are
problem-independent, have been proposed as a possible solution for the diffi-
culties associated with the traditional techniques. One direct-search method is
the genetic algorithm (GA) (Goldberg, 1989). Genetic algorithms are search procedures which emulate natural genetics. They are different from traditional search methods encountered in engineering optimisation (Davis, 1991).
In Goldberg (1989), it is stated that (a) the GA searches from a population
of points, not a single point and (b) the GA uses probabilistic and not deter-
ministic transition rules.
(a) The evolution process operates on chromosomes rather than on the living
beings which they encode.
(b) The natural selection process causes the chromosomes that encode suc-
cessful structures to reproduce more often than ones that do not.
(c) The reproduction process is the point at which evolution takes place. The
recombination process may create quite different chromosomes in children
by combining material from the chromosomes of two parents. Mutations
may result in the chromosomes of biological children being different from
those of their biological parents.
(d) Biological evolution has no memory. Whatever it knows about producing
individuals that will function well in their environment is contained in
the gene pool, which is the set of chromosomes carried by the current
individuals, and in the structure of the chromosome decoders.
In the early 1970s, the above features of natural evolution intrigued the sci-
entist John Holland (1975). He believed that it might yield a technique for
solving difficult problems to appropriately incorporate these features in a com-
puter algorithm in the way that nature has done through evolution. So, he
began the research on algorithms that manipulated strings of binary digits (1s and 0s) that represent chromosomes. Holland's algorithms carried out simulated evolution on populations of such chromosomes. Using simple encodings
and reproduction mechanisms, his algorithms displayed complicated behaviour
and solved some extremely difficult problems. Like nature, they knew nothing
about the type of problems they were solving. They were simple manipulators
of simple chromosomes. When the descendants of those algorithms are used
today, it is found that they can evolve better designs, find better schedules
and produce better solutions to a variety of other important problems that we
cannot solve using other techniques.
When Holland first began to study these algorithms, they did not have a
name. As these algorithms began to demonstrate their potential, however, it
was necessary to give them a name. In reference to their origins in the study
of genetics, Holland named them genetic algorithms. A great amount of re-
search work in this field has been carried out to develop genetic algorithms.
Now, the genetic algorithm is a stochastic global search method that mimics
the metaphor of natural biological evolution. Applying the principle of sur-
vival of the fittest to produce better and better approximations to a solution,
genetic algorithms operate on a population of potential solutions. A new set
of approximations at each generation is created by the process of selecting
individuals, which actually are chromosomes in GAs, according to their fitness
level in the problem domain and breeding them using operators borrowed from
natural genetics, for example, crossover and mutation. This process results in
the evolution of populations of individuals that are better suited to their envi-
ronment than the individuals that they were created from, just as in natural
adaptation.
It is well known that natural phenomena can be abstracted into an algo-
rithm in many ways. Similarly, there are a number of ways to embody the
begin
    t = t + 1;
    select P(t) from P(t - 1);
    reproduce pairs in P(t) by
    begin
        crossover;
        mutation;
        reinsertion;
    end
    evaluate P(t);
end
end
If all goes well through this process of simulated evolution, an initial population
of unexceptional chromosomes will improve as the chromosomes are replaced
by better and better ones. The best individual in the final population produced
can be a highly evolved solution to the problem.
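This cycle can be sketched as a minimal runnable loop; the population size, operator rates and the one-max fitness function below are illustrative placeholders, not the algorithm developed later in this chapter:

```python
import random

def genetic_algorithm(fitness, n_bits=10, pop_size=20, generations=50):
    """Minimal GA cycle: selection, one-point crossover, bit mutation,
    reinsertion, evaluation."""
    random.seed(0)  # reproducible run
    pop = [[random.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # select: the fitter half of P(t-1) survives into P(t)
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = random.sample(parents, 2)
            cut = random.randrange(1, n_bits)   # one-point crossover
            child = p1[:cut] + p2[cut:]
            if random.random() < 0.1:           # mutation
                child[random.randrange(n_bits)] ^= 1
            children.append(child)
        pop = parents + children                # reinsertion, then evaluate
    return max(pop, key=fitness)

# Toy fitness: maximise the number of 1-bits ("one-max").
best = genetic_algorithm(fitness=sum)
print(sum(best))
```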
The genetic algorithm differs substantially from more traditional search
and optimisation methods, for example, gradient-based optimisation. The most
significant differences are the following.
(a) GAs search a population of points in parallel rather than a single point.
(b) GAs do not require derivative information on an objective function or
other auxiliary knowledge. Only the objective function and corresponding
fitness levels influence the directions of search.
(c) GAs use probabilistic transition rules, not deterministic ones.
(d) GAs can work on different encodings of the parameter set rather than the
parameter set itself.
It is important to note that the GA provides many potential solutions to a
given problem and the choice of the final solution is left to the designer. In
cases where a particular optimisation problem does not have one individual solution, the GA is potentially useful for identifying alternative solutions simultaneously.
Recently, genetic algorithms have been applied to control system design (see,
e.g., Davis, 1991; Patton and Liu, 1994; Liu and Patton, 1998). GAs have
also been successfully used with neural networks to determine the network
parameters (Schaffer et al., 1990; Whitehead and Choate, 1994), with NAR-
MAX models (Fonseca et al., 1993) and for nonlinear basis function selection
for identification using Bayesian criteria (Kadirkamanathan, 1995). Here the
GA approach is applied to the model selection and identification of nonlinear
systems using multiobjective criteria as the basis for selection.
The model selection can be seen as a subset selection problem. For the
model represented by the VPBF network, the principle of model selection using
the genetic algorithms can be briefly explained as follows. For the vector x ∈ R^n, the maximum number of the model terms is given by N = (n+1)(n+2)/2. Thus, there are N basis functions, which are the combinations of 1 and the elements of the vector x. Then there are 2^N possible models for selection.
Each model is expressed by an N-bit binary model code c, i.e., a chromosome
representation in genetic algorithms. If some bits of the binary model code c
are zeros, it means that the basis functions corresponding to these zero bits
are not included in the model.
For example, if the vector x ∈ R³, the maximum number of the model terms is 10, so there are 1024 possible models, and each model can be expressed by a 10-bit binary model code. If, for instance, only bits 1, 4, 7 and 9 of the model code c are nonzero, the selected model built from the Volterra polynomial basis functions is

f(x; p) = p^T diag(c) φ(x) = [w_1, w_4, w_7, w_9][φ_1, φ_4, φ_7, φ_9]^T
For the model represented by GRBF networks, the maximum number of the model terms is given by N, the number of Gaussian functions, and there are 2^N possible models for selection and also N possible radial basis functions with their centres d_k. Thus a chromosome representation in genetic algorithms consists of an N-bit binary model code c and N real-valued basis function centres d_k (k = 1, 2, ..., N), i.e.,
(4.19)
(4.20)
It is evident from the above that only the basis functions corresponding to the
nonzero bits of the binary model code c are included in the selected model.
Given a parent set of binary model codes and basis function parameter vectors,
a model satisfying a set of performance criteria is sought by the numerical
algorithm.
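The masking of basis functions by the binary model code, f(x; p) = p^T diag(c) φ(x), can be sketched as follows; the weights and basis function values are placeholders:

```python
import numpy as np

def masked_model(weights, code, basis_values):
    """Evaluate f(x; p) = p^T diag(c) phi(x): only basis functions whose
    bit in the binary model code c equals 1 enter the selected model."""
    return float(np.asarray(weights, dtype=float)
                 @ (np.asarray(code) * np.asarray(basis_values, dtype=float)))

# 10-bit model code whose nonzero bits are 1, 4, 7 and 9 (1-based),
# so only phi_1, phi_4, phi_7 and phi_9 are kept.
code = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
weights = list(range(1, 11))   # placeholder weights w_1, ..., w_10
phi = [1.0] * 10               # placeholder basis function values
print(masked_model(weights, code, phi))  # → 21.0 (= w_1 + w_4 + w_7 + w_9)
```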
With three objectives (or cost functions) for model selection and identification,
the numerical algorithm for this multiobjective identification problem is not
a straightforward optimisation algorithm, such as for the least squares algo-
rithm. This section develops a multiobjective identification algorithm which
uses genetic algorithm approaches and the method of inequalities to get a
numerical solution satisfying the performance criteria.
Now, let us normalise the multiobjective performance functions as follows:

ψ_i(p) = φ_i(p)/ε_i,   i = 1, 2, 3   (4.21)
Let Γ_i be the set of parameter vectors p for which the i-th performance criterion is satisfied:

Γ_i = {p : ψ_i(p) ≤ 1}   (4.22)
Then the admissible or feasible set of parameter vectors for which all the performance criteria hold is the intersection

Γ = Γ_1 ∩ Γ_2 ∩ Γ_3   (4.23)
(4.24)
which shows that the search for an admissible p can be pursued by optimisa-
tion, in particular by solving
subject to (4.24).
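A minimal sketch of this reformulation, assuming the normalisation ψ_i(p) = φ_i(p)/ε_i used with the method of inequalities: a parameter vector is admissible exactly when every normalised performance is at most one. The numerical values are the bounds ε_i from Table 4.1 and the VPBF performance values of Example 4.1:

```python
def normalised_costs(phis, eps):
    """psi_i(p) = phi_i(p) / eps_i: the design criterion phi_i(p) <= eps_i
    becomes psi_i(p) <= 1."""
    return [phi / e for phi, e in zip(phis, eps)]

def admissible(phis, eps):
    """p is admissible when every normalised performance satisfies
    psi_i <= 1, i.e. when max_i psi_i(p) <= 1."""
    return max(normalised_costs(phis, eps)) <= 1.0

eps = [1.5, 0.3, 7.0]  # bounds eps_1, eps_2, eps_3 from Table 4.1
print(admissible([1.8000, 0.3965, 3.0], eps))  # → False: phi_1, phi_2 exceed bounds
print(admissible([1.4, 0.25, 3.0], eps))       # → True
```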
The optimisation needs to be carried out using iterative schemes. Now, let p_q be the value of the parameter vector at the q-th iteration step of the optimisation, and define
where
(4.27)
and also define
Γ^q = Γ_1^q ∩ Γ_2^q ∩ Γ_3^q   (4.28)

E_q = ψ_1(p_q) + ψ_2(p_q) + ψ_3(p_q)   (4.29)
Γ^q is the q-th set of parameter vectors for which all performance functions satisfy
(4.30)
and
(4.34)
so that the boundary of the set in which the parameters are located has been
moved towards the admissible set, as shown in Figure 4.1.
The process of finding the optimisation solution is terminated when both Δ_q and E_q cannot be reduced any further. But the process of finding an admissible parameter vector p stops when
(4.35)
E_j = Σ_{i=1}^{3} ψ_i(s_j)   (4.37)
Step 4: Selection
According to the fitness of the performance functions for each chromosome, delete the (M - 1)/2 weaker members of the population and reorder the chromosomes. The fitness of the performance functions is measured by
Step 5: Crossover
Offspring binary model codes are produced from two parent binary model
codes so that their first half elements are preserved. The second half elements in
each parent are exchanged. The average crossover operator is used to produce
offspring basis function parameter vectors. The average crossover function is
defined as
for j = 1, 2, ..., (M - 1)/2   (4.39)
Step 6: Mutation
A mutation operator, called creep (Davis, 1991), is used. For the binary
model codes, it randomly replaces one bit in each offspring binary model code
with a random number 1 or O. For the offspring basis function parameter
vectors, the mutation operation is defined as
for j = 1, 2, ..., (M - 1)/2   (4.40)
Step 7: Elitism
The elitist strategy copies the best chromosome into the succeeding gen-
eration. It prevents the best chromosome being lost in the next generation. It
may increase the speed of domination of a population by a super individual,
but on balance it appears to improve genetic algorithm performance. The best
chromosome is defined as one satisfying
where
(4.42)
E_m and E_z correspond to Δ_m and Δ_z, which are defined in (4.36) and (4.37); α > 1 and δ << α is a small positive number, both given by the designer. α and δ are chosen such that αδ > δ, e.g., α = 1.1 and δ = 0.05. This means that sacrificing Δ_m a little gives a significant improvement in E_b. Thus, the best chromosome is one that has the smallest E_b in the neighbourhood of Δ_m.
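Steps 5 and 6 can be sketched as follows; the binary crossover follows the first-half/second-half exchange described in Step 5, while the creep step size is an illustrative assumption:

```python
import random

def binary_crossover(p1, p2):
    """Step 5 for binary model codes: offspring keep each parent's first
    half; the second halves are exchanged."""
    h = len(p1) // 2
    return p1[:h] + p2[h:], p2[:h] + p1[h:]

def average_crossover(d1, d2):
    """Average crossover for basis function parameter (centre) vectors."""
    return [(a + b) / 2.0 for a, b in zip(d1, d2)]

def creep_mutation(code, centres, step=0.1):
    """Step 6 (creep): replace one random bit of the model code with a
    random 0 or 1, and perturb the centres slightly (step is an assumption)."""
    code = list(code)
    code[random.randrange(len(code))] = random.randint(0, 1)
    return code, [c + random.uniform(-step, step) for c in centres]

a, b = binary_crossover([1, 1, 1, 1], [0, 0, 0, 0])
print(a, b)                                       # → [1, 1, 0, 0] [0, 0, 1, 1]
print(average_crossover([0.0, 2.0], [1.0, 4.0]))  # → [0.5, 3.0]
```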
Take the best solution in the converged generation and place it in a second "initial generation". Generate the other M - 1 chromosomes in this second initial generation at random and begin the cycle again until a satisfactory solution is obtained or Δ_b and E_b cannot be reduced any further. In addition, for a mixed noise distribution, the least squares algorithm in Step 3 should be replaced by a more robust modified least squares algorithm as suggested in Chen and Jain (1994).
4.5 Examples
This section introduces two applications. The first considers identification of a real system. The second demonstrates approximation of a nonlinear function corrupted by mixed noise with different variances.
Table 4.1. Parameters for the multiobjective identification algorithm

                        VPBF network                GRBF network
  variable vector x     [y(t-1), y(t-2),            [y(t-1), y(t-2),
                         y(t-3), y(t-4),             u(t-1), u(t-2)]
                         u(t-1), u(t-2),
                         u(t-3), u(t-4)]
  ε_1                   1.5                         1.5
  ε_2                   0.3                         0.3
  ε_3                   7                           7
Example 4.1
We use the data generated by a large pilot-scale liquid level nonlinear system
with zero mean Gaussian input signal (Fonseca et al., 1993): 1000 pairs of
input output data were collected. The first 500 pairs were used in the model
selection and identification of the system, while the remaining 500 pairs were
used for validation tests. The Volterra polynomial basis function network and
the Gaussian radial basis function network were applied to select and iden-
tify the model of the system using the multiobjective identification algorithm
developed in Section 4.4.
The time lags ny and nu were obtained by a trial and error process based on
estimation of several models. During the simulation, it was found that for the
VPBF network, if ny and nu were greater than 4, the performance functions
improved very little. Similarly, for the GRBF network, if ny and nu were
greater than 2 the performance functions did not reduce significantly. It is clear
that the time lags ny and nu for the VPBF network are different from those for
the GRBF network. The main reason is that those two networks use different
kinds of basis functions which have different properties. The parameters for
the algorithm are given in Table 4.1.
VPBF Network
Since the maximum number of model terms is 45, there are 2^45 possible models for selection. But, after 210 generations, an optimal model had been found by the algorithm. The performance functions are
φ_1(p) = 1.8000   (4.44)

φ_2(p) = 0.3965   (4.45)

φ_3(p) = 3   (4.46)
The model represented by the VPBF network is
y(t) = 1.3234 y(t-1) - 0.3427 y(t-2) + 0.075 y(t-4) u(t-2)   (4.47)
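The identified model (4.47) can be simulated directly; the input sequence and initial conditions in this sketch are arbitrary, not the experimental data:

```python
def simulate_vpbf_model(u, y0=(0.1, 0.1, 0.1, 0.1), n=50):
    """Simulate the identified model (4.47):
    y(t) = 1.3234 y(t-1) - 0.3427 y(t-2) + 0.075 y(t-4) u(t-2)."""
    y = list(y0)
    for t in range(len(y), n):
        y.append(1.3234 * y[t - 1] - 0.3427 * y[t - 2]
                 + 0.075 * y[t - 4] * u[t - 2])
    return y

# Small constant input from arbitrary initial conditions.
y = simulate_vpbf_model([0.1] * 50)
print(y[:5])
```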
The convergence of the performance functions with respect to generations is given in Figure 4.2. It shows that the performance functions converge in about 100 generations. In fact, in generation 94 the performance functions are φ_1(p) = 1.8119, φ_2(p) = 0.4071, and φ_3(p) = 3. After that, no improvement is made until, in generation 208, φ_1(p) = 1.8, φ_2(p) = 0.3965 and φ_3(p) = 3. The measured and estimated outputs, and the residual error of the system for the training data are shown in Figure 4.3. The measured and estimated outputs, and the estimation error of the system for the validation test of the model identified via the VPBF network are illustrated in Figure 4.4. Clearly, the performance functions φ_1(p) and φ_2(p) are very close to the desired requirements, but they do not satisfy them. This may result from the general drawback (premature convergence) of genetic algorithms.
Fig. 4.2. Convergence of the performance functions using the VPBF network
Fig. 4.3. Training results for the system using the VPBF network (measured and estimated outputs, and estimation error)
Fig. 4.4. Validation results for the system using the VPBF network (measured and estimated outputs, and estimation error)
GRBF Network
Although the maximum number of model terms is only 10 (i.e., 1024 possible models for selection), the search dimension of the basis function centre parameters is 40 in real number space (i.e., infinitely many possibilities for selection). After 700 generations the performance criteria are almost satisfied. At this stage,
φ_3(p) = 5   (4.50)
In order to obtain a better performance, the basis function parameter vector was searched for another 100 generations using the algorithm with a fixed number of model terms, i.e., φ_3(p) = 5 for this case. Finally, the performance functions are

φ_1(p) = 1.2957   (4.51)

φ_3(p) = 5   (4.53)
The model represented by the GRBF network is
y(t) = Σ_{i=1}^{5} w_i exp{ -Σ_{j=1}^{2} (y(t-j) - d_ij)² - Σ_{j=1}^{2} (u(t-j) - d_ij)² }   (4.54)
where
[w_1, w_2, w_3, w_4, w_5]^T = [-2.6363, -1.2470, -1.7695, 0.9437, -0.5341]^T   (4.55)
φ_3(p) = 4   (4.58)

[w_1, w_2, w_3, w_4]^T = [1.2394, -2.4092, -2.8293, -2.5141]^T   (4.59)
Fig. 4.5. Convergence of the performance functions using the GRBF network
Fig. 4.6. Training results for the system using the GRBF network (measured and estimated outputs, and estimation error)
Fig. 4.7. Validation results for the system using the GRBF network (measured and estimated outputs, and estimation error)
3.5,----,----,----,------, 9
3 8
1L--~--~--~--~
o 200 400 600 800
5
4
0 200
In 400
phi3(p)
600 800
Fig. 4.8. Convergence of the performance functions using the GRBF network without φ_2
Fig. 4.9. Training results for the system using the GRBF network without φ_2 (measured and estimated outputs, and estimation error)
Fig. 4.10. Validation results for the system using the GRBF network without φ_2 (measured and estimated outputs, and estimation error)
The simulation results are shown in Figures 4.8-4.10. It is clear from the results that although the performance functions φ_1(p) and φ_3(p) are reduced, the maximum difference φ_2(p) of the approximation for the identification and validation tests is much greater than in the previous case. So, it shows that if the performance functions φ_1(p) and φ_3(p) are sacrificed somewhat, the performance function φ_2(p) is improved significantly.
The selection, identification and validation results for the large pilot-scale
liquid level nonlinear system show that the VPBF network is simpler than the
GRBF network, but the performance of the latter is better than that of the
former. However, it is difficult to conclude that the GRBF model is better than
the VPBF model or vice versa. On the same set of experiments, Bayesian model selection and identification with Gaussian noise assumptions leads to very similar performance to the above, but needed 11 and 16 basis functions (hidden units) for the VPBF and GRBF networks, respectively (Kadirkamanathan, 1995).
The identified model here is much simpler.
Example 4.2
Consider the following underlying nonlinear function to be approximated.
(4.61)
where x is a scalar variable. A random sampling of the interval [-4, 4] is used to obtain the 40 input output data items for approximation.
In order to see the effect of noise, the output of the function f to a given
input x is given by
f(x) = f*(x) + e   (4.62)
where e is a mixed noise. The noise consists of uniformly and normally dis-
tributed noises, i.e.,
(4.63)
where e_U[0, σ] is a zero mean uniform noise with finite variance σ and e_N[0, σ] is a zero mean normal noise with finite variance σ. It is assumed that the uniform noise e_U[0, σ] and the normal noise e_N[0, σ] are uncorrelated. Thus, the mean and variance of the mixed noise e are zero and σ, respectively.
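One way to generate such a mixed noise can be sketched as follows. Since the construction in (4.63) is not reproduced here, the split of the total variance is an assumption: each component is given variance σ/2 so that the uncorrelated sum has variance σ:

```python
import numpy as np

def mixed_noise(sigma, size, rng=None):
    """Zero-mean mixed uniform + normal noise with total variance sigma.

    Assumption: each component carries variance sigma / 2, so their
    uncorrelated sum has variance sigma.  A zero-mean uniform on [-a, a]
    has variance a^2 / 3, hence a = sqrt(3 sigma / 2)."""
    rng = np.random.default_rng(rng)
    a = np.sqrt(3.0 * sigma / 2.0)
    e_u = rng.uniform(-a, a, size)                     # uniform component
    e_n = rng.normal(0.0, np.sqrt(sigma / 2.0), size)  # normal component
    return e_u + e_n

e = mixed_noise(sigma=0.1, size=100_000, rng=0)
print(float(e.mean()), float(e.var()))  # close to 0 and 0.1
```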
Here, the Gaussian radial basis function network was used to approximate
the nonlinear function by the multiobjective identification algorithm developed
in Section 4.4. Three cases were considered in this simulation. The first used
three performance functions during approximation. The second considered two
performance functions. The third used only one performance function. Actually, the following cases were taken into account: (a) Case 1: [φ_1(p), φ_2(p), φ_3(p)], (b) Case 2: [φ_1(p), φ_3(p)] and (c) Case 3: [φ_1(p)].
The effects of the mixed noise with different variances on the performance functions φ_1(p), φ_2(p) and φ_3(p) for the above three cases are illustrated in Figures 4.11-4.13, respectively. It can be seen from the simulation results that the performance of the approximation of the nonlinear function changes little at low levels of noise variance, and the multiobjective case using three performance criteria gives a good approximation even though the three performance functions conflict with each other.
Fig. 4.11. Performance function φ_1(p) against noise variance σ
Fig. 4.12. Performance function φ_2(p) against noise variance σ
Fig. 4.13. Performance function φ_3(p) against noise variance σ
4.6 Summary
This chapter has addressed the problems of model selection and identification
of nonlinear systems using neural networks, genetic algorithms and multiob-
jective optimisation techniques. Three performance functions that measure
approximation accuracy and model complexity are proposed as the multiob-
jective criteria in the identification task. They are the L_2- and L_∞-norms of
the difference measurements between the real nonlinear system and the non-
linear model, and the number of nonlinear units in the nonlinear model. The
optimisation is carried out using genetic algorithms which select the nonlinear
function units to arrive at the simplest model necessary for approximation,
along with optimising the multiobjective performance criteria. Volterra poly-
nomial basis function networks and Gaussian radial basis function networks
are subjected to the algorithm in the task of a liquid level nonlinear system
identification. The model selection procedure results in determining the rel-
evant linear and second order nonlinear terms for the VPBF model and in
selection of the basis function centres for the GRBF model. The experimental
results demonstrate the convergence of the developed algorithm and its ability
to arrive at a simple model which approximates the nonlinear system well.
The approach discussed in this chapter can also be extended in many ways,
for example, to adaptively modify the numerical bounds on the performance
functions. Furthermore, cross-validation techniques can be used to guide the
optimisation and also in the adaptation of bounds on the performance func-
tions.
CHAPTER 5
WAVELET BASED NONLINEAR IDENTIFICATION
5.1 Introduction
The approximation of general continuous functions by nonlinear networks has
been widely applied to system modelling and identification. Such approxima-
tion methods are particularly useful in the black-box identification of nonlinear
systems where very little a priori knowledge is available. For example, neu-
ral networks have been established as a general approximation tool for fitting
nonlinear models from input output data on the basis of the universal approx-
imation property of such networks. There has also been considerable recent
interest in identification of general nonlinear systems based on radial basis networks (Poggio and Girosi, 1990a,b), fuzzy sets and rules (Zadeh, 1994), neural-fuzzy networks (Brown and Harris, 1994; Wang et al., 1995) and hinging hyperplanes (Breiman, 1993).
The recently introduced wavelet decomposition (Grossmann and Morlet, 1984; Daubechies, 1988; Mallat, 1989a; Chui, 1992; Meyer, 1993; IEEE, 1996) also emerges as a new powerful tool for approximation. In recent years, wavelets have become a very active subject in many scientific and engineering research areas. Wavelet decompositions provide a useful basis for localised
approximation of functions with any degree of regularity at different scales
and with a desired accuracy. Recent advances have also shown the existence of
orthonormal wavelet bases, from which follows the variability of rates of con-
vergence for approximation by wavelet based networks. Wavelets can therefore
be viewed as a new basis for representing functions. Wavelet based networks
(or simply wavelet networks) are inspired by both feedforward neural networks
and wavelet decompositions. They have been introduced for the identification
of nonlinear static systems (Zhang and Benveniste, 1992) and nonlinear dy-
namical systems (Coca and Billings, 1997; Liu et al., 1998, 2000).
This chapter presents a wavelet network based identification scheme for
nonlinear dynamical systems. Two kinds of wavelet networks are studied: fixed
and variable wavelet networks. The former are used for the case where the esti-
mation accuracy is assumed to be achieved by a known resolution scale. But, in
practice, this assumption is not realistic because the nonlinear function to be
identified is unknown and the system operating point may change with time.
Thus, variable wavelet networks are introduced to deal with this problem. The
basic principle of the variable wavelet network is that the number of wavelets
in the network can either be increased or decreased over time according to a design strategy.
Wavelets are a class of functions that have some interesting and special prop-
erties. Some basic concepts about orthonormal wavelet bases will be intro-
duced initially. Then the wavelet series representation of one-dimensional and
multidimensional functions will be considered. Finally, wavelet networks are
introduced.
5.2 Wavelet Networks

Given a mother wavelet ψ, the dilated and translated family ψ_{jk}(x) = 2^{j/2} ψ(2^j x − k), k ∈ Z, generates the subspaces

W_j = clos_{L2}(span{ψ_{jk} : k ∈ Z})   (5.1)

which satisfy

W_j ∩ W_i = {0},   j ≠ i   (5.2)
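These definitions can be made concrete with the simplest orthonormal wavelet, the Haar function. The sketch below is an illustration added here (not part of the original text); it checks numerically that the dilated and translated family ψ_{jk}(x) = 2^{j/2} ψ(2^j x − k) is orthonormal across shifts and scales:

```python
import numpy as np

# Haar mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere.
def psi(x):
    return np.where((x >= 0) & (x < 0.5), 1.0,
                    np.where((x >= 0.5) & (x < 1), -1.0, 0.0))

def psi_jk(x, j, k):
    # psi_{jk}(x) = 2^{j/2} psi(2^j x - k)
    return 2.0 ** (j / 2) * psi(2.0 ** j * x - k)

# Numerical inner product on a fine dyadic grid over [0, 1)
x = np.linspace(0.0, 1.0, 2 ** 16, endpoint=False)
dx = x[1] - x[0]

def inner(f, g):
    return np.sum(f * g) * dx

# <psi_{jk}, psi_{j'k'}> = 1 if (j,k) = (j',k'), else 0
print(round(inner(psi_jk(x, 2, 1), psi_jk(x, 2, 1)), 3))  # 1.0  (unit norm)
print(round(inner(psi_jk(x, 2, 1), psi_jk(x, 2, 2)), 3))  # 0.0  (different shift)
print(round(inner(psi_jk(x, 1, 0), psi_jk(x, 2, 0)), 3))  # 0.0  (different scale)
```

The orthonormality is what makes the subspaces W_j intersect only in {0}, as stated in (5.2).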
Any wavelet generates a direct sum decomposition of L2(R). For each j EN,
let us consider the closed subspaces:
V_j = ··· ⊕ W_{j−2} ⊕ W_{j−1}   (5.3)
of L2(R), where ⊕ denotes the direct sum. These subspaces have the following properties:
(ii) clos_{L2}(∪_{j∈N} V_j) = L2(R)
where φ_{j0k}(x) = 2^{j0/2} φ(2^{j0} x − k), ψ_{jk}(x) = 2^{j/2} ψ(2^j x − k), and the wavelet coefficients a_{j0k} and b_{jk} are
(5.5)
(5.6)
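Because the basis is orthonormal, the coefficients a_{j0k} and b_{jk} in (5.5)–(5.6) are plain inner products, and truncating the series at a resolution 2^N gives an approximation whose L2 error shrinks as N grows. A numerical sketch using the Haar pair (my choice of wavelet for illustration, not the book's):

```python
import numpy as np

# Truncated Haar series:
#   f ≈ sum_k a_{j0 k} phi_{j0 k} + sum_{j=j0}^{N} sum_k b_{jk} psi_{jk},
# with every coefficient computed as an inner product (cf. (5.5)-(5.6)).
x = np.linspace(0.0, 1.0, 2 ** 14, endpoint=False)
dx = x[1] - x[0]
f = np.sin(2 * np.pi * x) + 0.5 * x  # an arbitrary test function

phi = lambda t: np.where((t >= 0) & (t < 1), 1.0, 0.0)
psi = lambda t: np.where((t >= 0) & (t < 0.5), 1.0,
                         np.where((t >= 0.5) & (t < 1), -1.0, 0.0))

def approx(f, j0, N):
    fh = np.zeros_like(f)
    for k in range(2 ** j0):                  # scaling part at resolution 2^{j0}
        g = 2 ** (j0 / 2) * phi(2 ** j0 * x - k)
        fh += np.sum(f * g) * dx * g
    for j in range(j0, N + 1):                # wavelet details up to resolution 2^N
        for k in range(2 ** j):
            g = 2 ** (j / 2) * psi(2 ** j * x - k)
            fh += np.sum(f * g) * dx * g
    return fh

for N in (2, 4, 6):
    err = np.sqrt(np.sum((f - approx(f, 0, N)) ** 2) * dx)
    print(N, round(err, 4))
# The L2 error decreases as the truncation resolution N grows.
```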
(5.7)
(5.8)
(5.11)
For system identification, f(x) is unknown. Then the wavelet coefficients a_{j0k} and b_{jk}^{(i)} cannot be calculated simply by (5.12) and (5.13). As (5.8) shows, constructing and storing orthonormal wavelet bases involves a prohibitive cost for large dimensions n. In addition, it is not realistic to use an infinite number of wavelets to represent the function f(x). So, we consider the following wavelet representation of the function f(x):
f(x) ≈ Σ_{k∈A_{j0}} a_{j0k} Φ_{j0k}(x) + Σ_{j=j0}^{N} Σ_{k∈B_j} Σ_{i=1}^{2^n−1} b_{jk}^{(i)} Ψ_{jk}^{(i)}(x)   (5.14)

where A_{j0}, B_j ⊂ N^n are finite sets of integer vectors and N is a finite integer. Since the convergence of the series in (5.11) is in L2(R^n),
5.3 Identification Using Fixed Wavelet Networks
Hence, given ε > 0, there exist a number N* and vector sets A*_{j0}, B*_{j0}, B*_{j0+1}, …, B*_N such that for N ≥ N*, A_{j0} ⊇ A*_{j0}, B_{j0} ⊇ B*_{j0}, B_{j0+1} ⊇ B*_{j0+1}, …, B_N ⊇ B*_N,
where n = r + d, A_{j0k} and B_{jk}^{(i)} are the wavelet coefficient vectors, and the scaling function Φ_{j0k}(x, u) and the wavelet functions Ψ_{jk}^{(i)}(x, u) are defined similarly to Φ_{j0k}(x) and Ψ_{jk}^{(i)}(x), respectively, by replacing x with (x, u).
Here, it is assumed that the number N and the vector sets A_{j0}, B_j are given. So, the wavelet network (5.18) for the estimation of the nonlinear function f(x, u) is called a fixed wavelet network. Based on the estimation f̂(x, u) by the fixed wavelet network, the nonlinear function f(x, u) can be expressed by
f(x, u) = Σ_{k∈A_{j0}} A*_{j0k} Φ_{j0k}(x, u) + Σ_{j=j0}^{N} Σ_{k∈B_j} Σ_{i=1}^{2^n−1} B_{jk}^{(i)*} Ψ_{jk}^{(i)}(x, u) + E_N   (5.19)
where the optimal wavelet coefficient vectors A*_{j0k} and B_{jk}^{(i)*} are
A*_{j0k} = [⟨f_1(x, u), Φ_{j0k}(x, u)⟩, …, ⟨f_d(x, u), Φ_{j0k}(x, u)⟩]^T   (5.20)

B_{jk}^{(i)*} = [⟨f_1(x, u), Ψ_{jk}^{(i)}(x, u)⟩, …, ⟨f_d(x, u), Ψ_{jk}^{(i)}(x, u)⟩]^T   (5.21)
E_N = [E_{N1}, E_{N2}, …, E_{Nd}]^T is the modelling error vector, which is assumed to be bounded by
(5.22)
Modelling the nonlinear function vector f(x, u) using wavelet networks gives
the following identification model for the nonlinear dynamical system (5.17):
where x̂ denotes the state vector of the network model and A ∈ R^{d×d} is a Hurwitz or stability matrix (i.e., all its eigenvalues are in the open left-half complex plane).
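The Hurwitz requirement on A is easy to verify numerically. A generic check (the example matrices are arbitrary illustrations, not from the text):

```python
import numpy as np

def is_hurwitz(A):
    """True if all eigenvalues of A lie in the open left-half complex plane."""
    return bool(np.all(np.linalg.eigvals(A).real < 0))

A = np.array([[-2.0, 1.0],
              [0.0, -3.0]])
print(is_hurwitz(A))                         # True: eigenvalues -2 and -3
print(is_hurwitz(np.array([[0.0, 1.0],
                           [-1.0, 0.0]])))   # False: eigenvalues on the imaginary axis
```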
Define the state error vector and wavelet coefficient error vectors as

e_x = x̂ − x   (5.24)

Ã_{j0k} = Â_{j0k} − A*_{j0k}   (5.25)

B̃_{jk}^{(i)} = B̂_{jk}^{(i)} − B_{jk}^{(i)*}   (5.26)
ė_x = A e_x + Σ_{k∈A_{j0}} Ã_{j0k} Φ_{j0k}(x, u) + Σ_{j=j0}^{N} Σ_{k∈B_j} Σ_{i=1}^{2^n−1} B̃_{jk}^{(i)} Ψ_{jk}^{(i)}(x, u) + E_N   (5.27)
(5.28)
(5.29)
(5.30)
(5.32)
(5.33)
(5.36)
(5.37)

where M_{j0k} and M_{jk}^{(i)} are the allowed largest values of ||Â_{j0k}|| and ||B̂_{jk}^{(i)}||, respectively. It is clear that if the initial parameter vectors are chosen such that Â_{j0k}(0) ∈ F^−(Φ_{j0k}, M_{j0k}) ∪ F^+(Φ_{j0k}, M_{j0k}) and B̂_{jk}^{(i)}(0) ∈ F^−(Ψ_{jk}^{(i)}, M_{jk}^{(i)}) ∪ F^+(Ψ_{jk}^{(i)}, M_{jk}^{(i)}), then the vectors Â_{j0k} and B̂_{jk}^{(i)} are confined to the sets F^−(Φ_{j0k}, M_{j0k}) ∪ F^+(Φ_{j0k}, M_{j0k}) and F^−(Ψ_{jk}^{(i)}, M_{jk}^{(i)}) ∪ F^+(Ψ_{jk}^{(i)}, M_{jk}^{(i)}), respectively. Using the adaptive laws (5.36) and (5.37), (5.30) becomes
(5.39)
For nonlinear systems, the system operation can change with time. This will result in an estimation error for the fixed wavelet network that is beyond the required accuracy.
5.4 Identification Using Variable Wavelet Networks
Generally speaking, a variable wavelet network has the property that the num-
ber of wavelons in the network can be either increased or decreased over time
according to a design strategy. For the problem of nonlinear modelling, the
variable wavelet network is initialised with a small number of wavelons. As
observations are received, the network grows by adding new wavelons or is
pruned by removing old ones.
According to the multiresolution approximation theory, increasing the res-
olution of the network will improve the approximation. To improve the ap-
proximation accuracy, the growing network technique (Kadirkamanathan and
Niranjan, 1993; Liu et al., 1996c) is applied. This means that the wavelets at a
higher resolution need to be added to the network. Here it is assumed that at
the resolution 2^N the approximation of the function f by the wavelet network is denoted as f̂^{(N)}. Based on the growing network technique and the structure of the function in (5.18), the adding operation is defined as
f̂^{(N+1)} = f̂^{(N)} ⊕ Σ_{k∈B_{N+1}} Σ_{i=1}^{2^n−1} B̂_{(N+1)k}^{(i)} Ψ_{(N+1)k}^{(i)}(x, u)   (5.40)
where ⊕ denotes the adding operation. Equation 5.40 means that wavelets at
the resolution 2N+1 are added to the network. To add new wavelons to the
network the following two conditions must be satisfied: (a) The modelling error
must be greater than the required accuracy. (b) The period between the two
adding operations must be greater than the minimum response time of the
adding operation.
The removing operation is defined as
(5.41)
where ⊖ denotes the removing operation. Equation 5.41 implies that wavelets
at the resolution 2N are removed from the network. Similarly, to remove some
old wavelons from the network, the following two conditions must be satisfied:
(a) The modelling error must be less than the required accuracy. (b) The pe-
riod between the two removing operations must be greater than the minimum
response time of the removing operation.
In both the adding and the removing operations, condition (a) means that
the change of the modelling error in the network must be significant. Condition
(b) says the minimum response time of each operation must be considered.
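Conditions (a) and (b) amount to a small piece of decision logic. A sketch (the function and threshold names are my own, not the book's):

```python
# Sketch of the add/remove decision for a variable wavelet network
# (illustrative only; names and thresholds are assumptions).
def decide(model_err, err_upper, err_lower, t, t_last_change, t_min_response):
    """Return 'add', 'remove' or 'keep' for the current resolution 2^N.

    Condition (a): the modelling error is outside the required accuracy band.
    Condition (b): at least t_min_response has elapsed since the last change.
    """
    if t - t_last_change < t_min_response:   # condition (b) fails
        return 'keep'
    if model_err > err_upper:                # (a): too inaccurate -> grow
        return 'add'
    if model_err < err_lower:                # (a): over-fitted -> prune
        return 'remove'
    return 'keep'

print(decide(0.30, 0.20, 0.05, t=10.0, t_last_change=2.0, t_min_response=5.0))  # add
print(decide(0.30, 0.20, 0.05, t=4.0, t_last_change=2.0, t_min_response=5.0))   # keep
print(decide(0.01, 0.20, 0.05, t=10.0, t_last_change=2.0, t_min_response=5.0))  # remove
```

Requiring a minimum response time between structural changes prevents the network from oscillating between adding and removing the same resolution level.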
From the set Θ(ε_N), which gives a relationship between the state error e_x and the modelling error E_N, it can be shown that the state error depends on the modelling error. If the upper bound ε_N of the modelling error is known, then the set Θ(ε_N) to which the state error will converge is also known. However, in most cases the upper bound ε_N is unknown.
In practice, systems are usually required to keep the state errors within
prescribed bounds, that is,
(5.42)
(5.43)
is then tried, where η_i^u(t) and η_i^L(t) are monotonically decreasing functions of time t. For example,

where β_U, β_L are positive constants and η_i^u(0), η_i^L(0) are the initial values. It is clear that η_i^u(t), η_i^L(t) decrease with time t. As t → ∞, η_i^u(t), η_i^L(t) approach 0.
Thus, in this way the state errors reach the required accuracy given in (5.42).
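Any monotonically decreasing bound that tends to zero serves the purpose; an exponential form is one natural choice (an assumption for illustration, since only its qualitative behaviour matters here):

```python
import math

# A monotonically decreasing bound, eta(t) = eta0 * exp(-beta * t), beta > 0
# (assumed exponential form; the book only requires monotone decrease to 0).
def eta(t, eta0, beta):
    return eta0 * math.exp(-beta * t)

eta0, beta = 0.5, 0.8
values = [eta(t, eta0, beta) for t in (0.0, 1.0, 5.0, 20.0)]
print([round(v, 6) for v in values])
# Strictly decreasing toward 0, so the error band eta_u(t) + e_i
# shrinks toward the prescribed final accuracy.
```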
From the relationship between the modelling error and the state error, and given the lower and upper bounds η_i^L(t), η_i^u(t) + e_i of the state errors, the corresponding modelling error should be
(5.46)
From (5.39), the area that the set Θ(ζ) covers is a hyperellipsoid with centre

(5.47)
It can also be deduced from the set Θ(ε_N(t)) that the upper bound ε_U(t) and the lower bound ε_L(t) are given by
(5.49)
Thus, given the upper and lower bounds of the state error, the corresponding
values for the modelling error can be estimated by (5.48) and (5.49).
To smooth the identification performance when the adding and removing op-
erations are used, the decomposition and reconstruction algorithms of a mul-
tiresolution decomposition are applied to the initial calculation of the wavelet
coefficients. Here two important relations are introduced. First, since the family {Φ_k} spans V_0, then {Φ(2x − k)} spans the next finer scale V_1 = V_0 ⊕ W_0. Both the scaling function and the wavelet function can be expressed in terms of the scaling function at the resolution 2^j = 2^1, i.e.,
(5.50)
where c_k and d_k^{(i)} are known as the two-scale reconstruction sequences. Second,
any scaling function Φ(2x) in V_1 can alternatively be written using the scaling function Φ(x) in V_0 and the wavelet functions Ψ^{(i)}(x) in W_0 as
(5.52)
where a_k and b_k^{(i)} are known as the decomposition sequences, and l ∈ N^n.
In addition, in terms of multiresolution decompositions, the approximation
of the function f(x, u) at the resolution 2^j can be written as
f̂^{(j)}(x, u) = Σ_{k∈A_{j−1}} Â_{(j−1)k} Φ_{(j−1)k}(x, u) + Σ_{k∈B_{j−1}} Σ_{i=1}^{2^n−1} B̂_{(j−1)k}^{(i)} Ψ_{(j−1)k}^{(i)}(x, u)   (5.53)
where
Hence, if the state error e_x ∉ Θ(ε_U(t)), the network needs more wavelets. Add the wavelets at the resolution 2^{N+1} into the network. Following the adding operation (5.40) and the expression (5.53) of f̂^{(j)}(x, u) at j = N, the structure of the approximated function f̂(x, u) is of the form
(5.55)
The parameter vectors Â_{Nk} and B̂_{Nk}^{(i)} are adapted by the laws (5.36) and (5.37). Using the sequences c_k and d_k^{(i)}, the initial values after the adding operation are then given by the reconstruction algorithm (Mallat, 1989b) below.
(5.56)
(5.57)
where Â_{(N−1)k} and B̂_{(N−1)k}^{(i)} are the estimated values before the adding operation.
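The decomposition and reconstruction algorithms are pure filter operations on the coefficient arrays. One level of Mallat's algorithm with the orthonormal Haar filters (the simplest instance, shown for illustration; the chapter itself uses B-spline sequences) looks like:

```python
import numpy as np

# One level of Mallat's algorithm with the Haar filters:
# decomposition maps fine-scale coefficients c1 to (c0, d0);
# reconstruction maps (c0, d0) back to c1.  This is the mechanism used to
# initialise coefficients when the network resolution changes.
s = 1.0 / np.sqrt(2.0)

def decompose(c1):
    c0 = s * (c1[0::2] + c1[1::2])   # averages -> coarse scaling coefficients
    d0 = s * (c1[0::2] - c1[1::2])   # details  -> wavelet coefficients
    return c0, d0

def reconstruct(c0, d0):
    c1 = np.empty(2 * len(c0))
    c1[0::2] = s * (c0 + d0)
    c1[1::2] = s * (c0 - d0)
    return c1

c1 = np.array([4.0, 2.0, 5.0, 7.0])
c0, d0 = decompose(c1)
print(np.allclose(reconstruct(c0, d0), c1))  # True: perfect reconstruction
```

Because the filters invert each other exactly, initialising the new coefficients this way leaves the network output unchanged at the instant the resolution changes, which is what smooths the identification performance.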
If the state error e_x ∈ Θ(ε_L(t)), some wavelets need to be removed because the network may be overfitted. In this case remove the wavelets associated with the resolution 2^N. In terms of the removing operation (5.41) and the expression (5.53) of f̂^{(j)}(x, u) at j = N, the structure of the approximated function f̂(x, u) is of the following form:
The adaptive laws for the parameters Â_{(N−2)k} and B̂_{(N−2)k}^{(i)} are still given by (5.36) and (5.37). But, using the sequences a_k and b_k^{(i)}, the initial values after the removing operation are then changed by the decomposition algorithm (Mallat, 1989b) as follows:
(5.60)
where Â_{Nk} and B̂_{Nk}^{(i)} are the estimated values before the removing operation.
Clearly, in both the above cases, the adaptive laws of the parameters are still given in the form of (5.36) and (5.37), based on the above changed parameters. It also follows that the convergence area of the state error vector begins with Θ(ε_U(0)) − Θ(ε_L(0)) and ends with Θ(ε̄), where ε̄ = ε_U(∞).
The determination of the vector sets A_j and B_j, for j = j0, …, N is also important but simple. The basic rule for choosing these sets is to make sure that 2^j x − k, for k ∈ A_j or B_j, is not outside the valid range of the arguments of the scaling function Φ(·) or the wavelets Ψ^{(i)}(·), respectively.
5.5 Identification Using B-spline Wavelets

B_1(x) = 1,  x ∈ (0, 1);   0,  otherwise   (5.62)
Let the m-th B-spline function be the scaling function, that is,
(5.63)
Then both the scaling function and the wavelets can be expressed in terms of the scaling function at the resolution 2^j = 2^1:

φ(x) = Σ_{k=0}^{m} c_k B_m(2x − k)   (5.64)

ψ(x) = Σ_{k=0}^{3m−2} d_k B_m(2x − k)   (5.65)
where the two-scale reconstruction sequences c_k and d_k are given by (Chui, 1992)

c_k = 2^{1−m} C(m, k)   (5.66)

d_k = (−1)^k 2^{1−m} Σ_{l=0}^{m} C(m, l) B_{2m}(k + 1 − l)   (5.67)
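The sequences (5.66)–(5.67) are straightforward to tabulate. The sketch below (an added illustration) evaluates the cardinal B-spline by its standard recursion and computes c_k and d_k for m = 4, the order used in Section 5.6:

```python
from math import comb

def bspline(m, x):
    """Cardinal B-spline B_m(x), supported on [0, m]."""
    if m == 1:
        return 1.0 if 0 <= x < 1 else 0.0
    return (x * bspline(m - 1, x) + (m - x) * bspline(m - 1, x - 1)) / (m - 1)

m = 4
# Two-scale sequence for the scaling function, c_k = 2^{1-m} C(m, k):
c = [2 ** (1 - m) * comb(m, k) for k in range(m + 1)]
print(c)  # [0.125, 0.5, 0.75, 0.5, 0.125]

# Two-scale sequence for the wavelet,
# d_k = (-1)^k 2^{1-m} sum_l C(m, l) B_{2m}(k + 1 - l):
d = [(-1) ** k * 2 ** (1 - m) * sum(comb(m, l) * bspline(2 * m, k + 1 - l)
                                    for l in range(m + 1))
     for k in range(3 * m - 1)]
print(len(d))  # 11 coefficients, k = 0, ..., 3m-2
```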
Also, the relationship between the scaling functions B_m(2x) and B_m(x) and the wavelet ψ(x) can be expressed as

a_k = (1/2) g_{−k}   (5.69)

b_k = (1/2) h_{−k}   (5.70)
G(z) = (1/2) Σ_k g_k z^k = z^{−1} ((1 + z)/2)^m E_{2m−1}(z) / E_{2m−1}(z²)   (5.71)

H(z) = (1/2) Σ_k h_k z^k = −z^{−1} ((1 − z)/2)^m (2m − 1)! / E_{2m−1}(z²)   (5.72)
which leads to
In this case,
where d_k^{(2)} = d_{k_1} c_{k_2} d_{k_3} ··· d_{k_n}. Thus, all sequences {d_k^{(i)}} can be calculated in the same way as d_k^{(2)}.
With (5.68), the relationship between the scaling functions Φ(2x) and Φ(x) and the wavelets Ψ^{(i)}(x) can be expressed as

Φ(2x − l) = Π_{i=1}^{n} Σ_{k_i} ( a_{l_i−2k_i} B_m(x_i − k_i) + b_{l_i−2k_i} ψ(x_i − k_i) ),   l ∈ N^n   (5.78)

which results in the following compact form

(5.79)
5.6 An Example
(5.80)
where the input u = 0.5(cos(1.2t) sin(1.7t) + exp(−sin(t^4))). Since n = 2, we will need 2-D B-spline wavelets for the wavelet network to identify this nonlinear dynamical system. Fourth-order B-splines were used as the scaling function. Thus, the 2-D scaling function is given by
(5.81)
ψ(x) = Σ_{k=0}^{10} [ ((−1)^k / 8) Σ_{l=0}^{4} C(4, l) B_8(k + 1 − l) ] B_4(2x − k)   (5.85)
The 2-D scaling function Φ(x, u) and the three 2-D wavelets Ψ^{(1)}(x, u), Ψ^{(2)}(x, u), Ψ^{(3)}(x, u) are shown in Figures 5.1–5.4. The state x and the nonlinear function f(x, u) (or the state derivative ẋ) are shown in Figures 5.5 and 5.6, respectively.
Wavelet networks at the resolutions 2^j, for j = 0, 1, 2, 3, were used for the identification with 16, 81, 146 and 278 wavelons, respectively. The state errors and the modelling errors for the different resolutions are shown in Figures 5.7–5.14. All figures denoted (b) are larger-scale versions of the figures denoted (a).
As expected, at the beginning of the identification larger state errors and
modelling errors exist. After a while, these errors become smaller and smaller,
and finally they converge to certain ranges. It is clear from the simulation
results that the whole identification scheme is stable from the beginning to
the end. It has also been shown that the state error and the modelling error
decrease with increase in the resolution of the wavelet networks. But, the state
error and the modelling error are improved only slightly when the resolution
becomes adequate. Thus, for nonlinear dynamical system identification using
wavelet networks, a proper resolution should be chosen so as to achieve the
desired practical identification requirements.
[Figures 5.1–5.4 show the 2-D scaling function and wavelets; Figures 5.5–5.6 show the state and the nonlinear function against time t; Figures 5.7–5.14 show the state errors and modelling errors against time t for each resolution.]
5.7 Summary
A wavelet network based identification scheme has been presented for nonlinear
dynamical systems. Two kinds of wavelet networks, fixed and variable wavelet
networks, were studied. Parameter adaptation laws were derived to achieve
the required estimation accuracy for a suitable sized network and to adapt to
variations of the characteristics and operating points in nonlinear systems. The
parameters of the wavelet network were adjusted using laws developed by the
Lyapunov synthesis approach. The identification algorithm was performed over
the network parameters by taking advantage of the decomposition and recon-
struction algorithms of a multiresolution decomposition when the resolution
scale changes in the variable wavelet network. By combining wavelet networks
with Lyapunov synthesis techniques, adaptive parameter laws were developed
which guarantee the stability of the whole identification scheme and the con-
vergence of both the network parameters and the state errors. The wavelet
network identification scheme was realised using B-spline wavelets and the
calculation of the decomposition and reconstruction sequences using variable
wavelet networks was given. A simulated example was used to demonstrate
the operation of the identification scheme.
CHAPTER 6
NONLINEAR ADAPTIVE NEURAL CONTROL
6.1 Introduction
Neural networks are capable of learning and reconstructing complex nonlinear
mappings and have been widely studied by control researchers in the design
of control systems. A large number of control structures have been proposed,
including supervised control (Werbos, 1990), direct inverse control (Miller et al., 1990), model reference control (Narendra and Parthasarathy, 1990), internal model control (Hunt and Sbarbaro, 1991), predictive control (Hunt et al., 1992; Willis et al., 1992), gain scheduling (Guez et al., 1988), optimal decision control (Fu, 1970), adaptive linear control (Chi et al., 1990), reinforcement learning control (Anderson, 1989; Barto, 1990), indirect adaptive control (Narendra and Parthasarathy, 1990; Liu et al., 1999a) and direct adaptive control (Polycarpou and Ioannou, 1991; Sanner and Slotine, 1992; Karakasoglu et al., 1993; Sadegh, 1993; Lee and Tan, 1993). The principal types of neural networks used for control problems are the multilayer perceptron neural networks with sigmoidal units (Psaltis et al., 1988; Miller et al., 1990; Narendra and Parthasarathy, 1990) and the radial basis function neural networks (Powell, 1987; Niranjan and Fallside, 1990; Poggio and Girosi, 1990a).
Most of the neural network based control schemes view the problem as
deriving adaptation laws using a fixed structure neural network. However, choosing network structure details such as the number of basis functions (hidden units in a single hidden layer) in the neural network must be done a priori, which often leads to either an overdetermined or an underdetermined network
structure. The problem with these control schemes is that they require all
observations to be available and hence are difficult for on-line control tasks,
especially adaptive control. In addition, fixed structure neural networks often
need a large number of basis functions even for simple problems.
This chapter is concerned with the adaptive control of continuous-time
nonlinear dynamical systems using variable neural networks. In variable neural
networks, the number of basis functions can be either increased or decreased
with time according to specified design strategies so that the network will not
overfit or underfit the data set. Based on Gaussian radial basis function variable
neural networks, an adaptive control scheme is presented. Weight adaptive laws
developed using the Lyapunov synthesis approach ensure the overall control
scheme is stable, even in the presence of modelling error. The tracking errors
between the reference inputs and outputs converge to the required accuracy.

6.2 Adaptive Control

Theorem 6.2.1. Let the function V(x, t) : R^{n+1} → R satisfy the following conditions:
(a) V(0, t) = 0 ∀t ∈ R.
(b) V(x, t) is differentiable in x ∈ R^n and t ∈ R.
(c) V(x, t) is positive definite.
A sufficient condition for uniform asymptotic stability of the system in (6.1) is then that the time derivative V̇(x, t) is negative definite.
The proof of the theorem can be found in Vidyasagar (1978). When applying Lyapunov stability theory to an adaptive control problem, we will get a time derivative of the Lyapunov function V(x, t), which depends on the control signal and other signals in the system. If these signals are bounded, system stability can be ensured by the condition that V̇ is negative semidefinite.
To illustrate that the Lyapunov stability theorem can be used to design an adaptive control law that guarantees the stability of the closed-loop system, consider a linear system described by
where y ∈ R and u ∈ R are the output and the input of the system, respectively, y^{(i)} is the i-th derivative of the output with respect to time, and a_i and b_i are the unknown coefficients of the system.
Also, it is assumed that the reference model is
where y_m ∈ R is the output of the model, α_i and β_i are the known coefficients of the model.
Let the error be defined as
Subtracting (6.3) from (6.2) results in the following error differential equation:
e^{(n)} + Σ_{i=0}^{n−1} α_i e^{(i)} = Σ_{i=0}^{n−1} (α_i − a_i) y^{(i)} + Σ_{j=0}^{m} (b_j − β_j) u^{(j)}   (6.5)
(6.10)

A = [0 1 0 ⋯ 0; 0 0 1 ⋯ 0; ⋯ ; −a_0 −a_1 ⋯ −a_{n−1}]   (6.11)

(6.12)

Δb = [0 ⋯ 0 Σ_{j=0}^{m} b̃_j u^{(j)}(t)]^T   (6.13)
where Λ ∈ R^{n×n} and Γ ∈ R^{(m+1)×(m+1)} are positive definite diagonal weighting matrices:

Λ = diag[λ_0, λ_1, …, λ_{n−1}]   (6.15)

Γ = diag[γ_0, γ_1, …, γ_m]   (6.16)
(6.21)
is negative. Thus the function V will decrease as long as the error x is different from zero, so the error converges to zero. This means that the closed-loop adaptive control system is stable.
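As a worked instance of this argument (a standard first-order model-reference sketch of my own, not the book's equations (6.2)–(6.21)), the Lyapunov-derived adaptation law stabilises a plant with one unknown coefficient:

```python
import numpy as np

# Minimal first-order model-reference adaptive control sketch: plant
#   ydot = a*y + u, with a unknown; reference model ymdot = -am*ym + am*r.
# Control u = -ahat*y - am*y + am*r, adaptation ahatdot = g*e*y, e = y - ym.
# Then edot = -am*e + (a - ahat)*y, and V = e^2/2 + (a - ahat)^2/(2g)
# gives Vdot = -am*e^2 <= 0, so the tracking error converges.
a, am, g, dt = 2.0, 3.0, 5.0, 1e-3
y = ym = 0.0
ahat = 0.0
for k in range(60000):                     # simulate 60 s with Euler steps
    t = k * dt
    r = np.sin(0.5 * t)                    # persistently exciting reference
    e = y - ym
    u = -ahat * y - am * y + am * r
    ahat += g * e * y * dt                 # Lyapunov-based adaptation law
    y += (a * y + u) * dt                  # plant
    ym += (-am * ym + am * r) * dt         # reference model
print(round(abs(y - ym), 4), round(abs(ahat - a), 4))  # both small after 60 s
```

The persistently exciting reference also drives the parameter estimate toward the true value; with a constant reference only the tracking error is guaranteed to converge.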
where
(6.25)
where

K = Σ_{i=1}^{m} m_i   (6.28)
6.3 Adaptive Neural Control
c_{i+j} is the j-th element of the set C_i, m_i the number of its elements, f*_{i+j} and g*_{i+j} the optimal weights, x = [x_1, x_2, …, x_n]^T the variable vector, c_k the k-th centre, d_k the k-th width, ε(K) the modelling error, and K the number of basis functions. The modelling of the nonlinear function G(x)u − F(x) by neural networks is shown in Figure 6.2. So, the next step is how to obtain estimates of the weights.
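A Gaussian RBF regressor vector, and its key property of being linear in the weights, can be sketched as follows (the centres, widths and sizes here are arbitrary illustrations, not the book's grid construction):

```python
import numpy as np

# Gaussian radial basis function regressor vector (a sketch):
def grbf(x, centres, widths):
    """phi_k(x) = exp(-||x - c_k||^2 / d_k^2) for each unit k."""
    diffs = x[None, :] - centres          # shape (K, n)
    return np.exp(-np.sum(diffs ** 2, axis=1) / widths ** 2)

rng = np.random.default_rng(0)
centres = rng.uniform(-1, 1, size=(20, 2))   # K = 20 units in R^2
widths = np.full(20, 0.5)

# The network output f^T phi(x) is linear in the weight vector f,
# which is what makes Lyapunov-based adaptation laws tractable.
f1 = rng.standard_normal(20)
f2 = rng.standard_normal(20)
x = np.array([0.3, -0.2])
phi = grbf(x, centres, widths)
print(phi.shape)                                               # (20,)
print(bool(np.isclose((f1 + f2) @ phi, f1 @ phi + f2 @ phi)))  # True: linear in weights
```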
Fig. 6.2. Modelling of the nonlinear function G(x)u − F(x) using neural networks
Thus the nonlinear part G(x)u − F(x) of the system can be described by the following compact form:

G(x)u − F(x) = (g*(K)u − f*(K))^T Φ(x, K) + ε(K)   (6.30)
where
(6.31)
(6.33)
It is known from approximation theory that the modelling error can be reduced arbitrarily by increasing the number K, i.e., the number of linearly independent basis functions φ(x; c_i, d_i) in the network model. Thus, it is reasonable to assume that the modelling error ε(K) is bounded by a constant ε_K, which represents the accuracy of the model and is defined as
e = x − y_d   (6.35)

f̃(K) = f*(K) − f̂(K)   (6.36)

g̃(K) = g*(K) − ĝ(K)   (6.37)

where f̂(K) and ĝ(K) are the estimated weight vectors. From (6.23), it can
be shown that
ẋ = Ax + b(ĝ^T(K)u − f̂^T(K))Φ(x; K) + b(g̃^T(K)u − f̃^T(K))Φ(x; K) + bε(K)   (6.38)
(6.40)
where the vector a = [a_1, a_2, …, a_n]^T makes the following matrix
6.3 Adaptive Neural Control 133
(6.41)

stable, i.e., all the eigenvalues are in the open left half-plane. The control input consists of a linear combination of the tracking errors a^T e, the adaptive part f̂^T(K)Φ(x, K), which will attempt to estimate and cancel the unknown function F(·), and y_d^{(n)}, a feedforward of the n-th derivative of the desired trajectory.
Consider the following Lyapunov function
(6.42)
In the presence of a modelling error ε(K), to ensure the stability of the system, many algorithms, e.g., the fixed or switching σ-modification (Ioannou and Kokotovic, 1983; Ioannou and Tsakalis, 1986), ε-modification (Narendra and Annaswamy, 1987), the dead-zone methods (Narendra and Annaswamy, 1989; Sastry and Bodson, 1989) and projection algorithms (Goodwin and Mayne, 1987; Ioannou and Datta, 1991; Polycarpou and Ioannou, 1991), can be applied to modify the above standard adaptation laws.
Define the following sets:
(6.47)
It is clear that if the initial weights are chosen such that f̂(K, 0) ∈ F_1 ∪ F_2 and ĝ(K, 0) ∈ G_1 ∪ G_2, then the weight vectors f̂ and ĝ are confined to the sets F_1 ∪ F_2 and G_1 ∪ G_2, respectively. With use of the adaptive laws (6.50) and (6.51), Equation 6.43 becomes
V̇(e, f̃, g̃) ≤ −e^T Q e + 2 Σ_{i=1}^{n} |p_{ni}| |e_i| ε_K   (6.52)
For the sake of simplicity, the positive definite matrix Q is assumed to be
diagonal, i.e., Q = diag[q_1, q_2, …, q_n], where q_i > 0, for i = 1, 2, …, n. Also
define
(6.53)
The above clearly shows that V̇ is negative semidefinite. Hence the stability of the overall identification scheme is guaranteed and

e → 0, f̃ → 0, g̃ → 0   (6.55)
On the other hand, in the presence of modelling error, Equation 6.52 can be expressed as

V̇(e, f̃, g̃) ≤ −Σ_{i=1}^{n} q_i ( |e_i| − |p_{ni}| ε_K / q_i )² + Σ_{i=1}^{n} p_{ni}² ε_K² / q_i   (6.56)
6.4 Adaptation Algorithm with Variable Networks 135
It is easy to show from the above that if e ∉ Θ(ε_K), V̇ is still negative and the tracking errors will converge to the set Θ(ε_K). But, if e ∈ Θ(ε_K), it is possible that V̇ > 0, which implies that the weight vectors f̂(K) and ĝ(K) may drift to infinity over time. The adaptive laws (6.50) and (6.51) avoid this drift by limiting the upper bounds of the weights. Thus the tracking error always converges to the set Θ(ε_K) and the overall control scheme will remain stable in the case of modelling error.
The set Θ(ε_K), which gives a relationship between the tracking and modelling errors, indicates that the tracking error depends on the modelling error. If the upper bound ε_K of the modelling error is known, then the set Θ(ε_K) to which the tracking error will converge can be worked out. However, in most cases the upper bound ε_K is unknown.
In practice, control systems are usually required to keep the tracking errors
within prescribed bounds, that is,
where Δ_i^u(t) and Δ_i^L(t) are monotonically decreasing functions of time t. Those bounds are usually defined as

where β_U and β_L are positive constants less than 1, and Δ_i^u(0) and Δ_i^L(0) are the initial values. It is clear that Δ_i^u(t) and Δ_i^L(t) decrease with time t. As t → ∞, Δ_i^u(t) and Δ_i^L(t) approach 0. Thus, in this way the tracking errors reach the required accuracy given in (6.57).
The relationship between the modelling error and the tracking error shows that, given the lower bound Δ_i^L(t) and upper bound Δ_i^u(t) + ε_{i0} of the tracking errors, the corresponding modelling error should be
Since the area that the set Θ(ζ) covers is a hyperellipsoid with the centre

(6.62)

it can be deduced from the set Θ(ε_K(t)) given by (6.53) that the upper bound ε_U(t) and the lower bound ε_L(t) are given by

ε_L(t) = max_{i=1,2,…,n} ( |p_{ni}|/q_i + ( Σ_{j=1}^{n} p_{nj}²/(q_i q_j) )^{0.5} )^{−1} Δ_i^L(t)   (6.63)

ε_U(t) = min_{i=1,2,…,n} ( |p_{ni}|/q_i + ( Σ_{j=1}^{n} p_{nj}²/(q_i q_j) )^{0.5} )^{−1} ( Δ_i^u(t) + ε_{i0} )   (6.64)
It has been shown in Section 2.2 that a variable GRBF network is determined
by a set of parameters, which are as follows:
(a) the possible centre set P_i provided by the i-th order subgrid,
(b) the chosen centres from the centre set P_i,
(c) the total number K of the network units,
(d) the radius σ_i of the i-th hypersphere corresponding to the i-th subgrid, used to choose the centres of the basis functions,
(e) the edge length δ_i of the hypercube corresponding to the i-th subgrid,
(f) the width d_i of the basis functions associated with the i-th subgrid.
Hence, if the tracking error e ∉ Θ(ε_U(t)), the network needs more basis functions. Add the (m+1)-th order subgrid to the grid. The parameters associated with the GRBF units are then changed as follows:

K = Σ_{i=1}^{m+1} m_i   (6.70)
where γ_i, for i = 1, 2, 3, is a constant less than 1.
But, if the tracking error e ∈ Θ(ε_L(t)), the network needs to remove some basis functions. Just remove the units associated with the m-th subgrid. The parameters associated with the GRBF units are then changed as follows:
P = ∪_{i=1}^{m−1} P_i   (6.71)

C = ∪_{i=1}^{m−1} C_i   (6.72)

K = Σ_{i=1}^{m−1} m_i   (6.73)
In both the above cases, the adaptive laws of the weights are still given in the form of (6.50) and (6.51), based on the above changed parameters. For the two-dimensional case, the convergence area is shown in Figure 6.3. At the beginning, the convergence area of the tracking error is E_0. Finally it approaches the expected convergence area E, that is, |e_i| ≤ ε_{i0}, for i = 1, 2.
6.5 Examples
This section considers two examples. The first is concerned with adaptive con-
trol of a time-invariant nonlinear system. The second considers adaptive con-
trol of a time-variant nonlinear system.
Example 6.1
The dynamical system used in the simulation example is given by (Sanner and
Slotine, 1992)
ÿ − 4 (sin(4πy)/(4πy)) (sin(πẏ)/(πẏ))² = (2 + sin(3πy − 1.5π)) u   (6.74)

which is a second-order time-invariant nonlinear system.
The parameter values used in this example are as follows: the reference input y_d = sin(t); the initial value of the output y(0) = 0.5; the initial value of the output derivative ẏ(0) = 0; the required accuracy of the tracking error vector [ε_{10}, ε_{20}] = [0.05, 0.1]; the constants β_U = β_L = 0.96; the initial values Δ_i^L(0) = 0.005, Δ_i^u(0) = 0.05, for i = 1, 2; the required minimum angle between the GRBFs cos(θ_min) = 0.951; the edge length of the rectangles in the first subgrid δ_1 = 0.5; the radius of centre selection in the first subgrid σ_1 = 0.99; the width of the GRBF units corresponding to the first subgrid d_1 = 1.11; activation threshold δ_min = 0.45; the initial number of units in the variable network 45; vector a = [1, 1]; matrix P = [[0.75, 0.5]^T, [0.5, 1]^T]; adaptation rates α = 1.5 and β = 3.
The parameters associated with the variable network are
δ_i = 0.618 δ_{i−1}   (6.75)

σ_i = 0.618 σ_{i−1}   (6.76)

d_i = 0.618 d_{i−1}   (6.77)

for i = 2, 3, …, m. The maximum of m (the number of subgrids) is limited to be 11.
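The resulting subgrid parameter sequences are simple geometric progressions; tabulating them is a direct transcription of (6.75)–(6.77) with the Example 6.1 initial values:

```python
# Subgrid parameters for Example 6.1: each finer subgrid scales the edge
# length, selection radius and width by 0.618 (cf. (6.75)-(6.77)).
delta, sigma, d = [0.5], [0.99], [1.11]
for i in range(1, 11):               # up to m = 11 subgrids
    delta.append(0.618 * delta[-1])
    sigma.append(0.618 * sigma[-1])
    d.append(0.618 * d[-1])
print(round(delta[1], 4), round(sigma[1], 4), round(d[1], 4))  # 0.309 0.6118 0.686
print(round(delta[10], 6))                                     # 0.004063
```

The rapid geometric shrinkage is why only a handful of subgrids is needed before further refinement stops improving the tracking accuracy noticeably.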
The weights are adaptively adjusted by the laws (6.50) and (6.51). The
adaptive control law is given by (6.40). The results of the simulation are shown
in Figures 6.4-6.6. Though the difference between the system output and the
desired output is very large at the beginning, the system is still stable and the
tracking error asymptotically converges to the expected range, which is also
shown in Figure 6.5. As can be seen from Figure 6.6, the number of GRBF
units in the neural network also converges in a period of time.
Example 6.2
Consider a time-variant nonlinear dynamical system given by
This plant is different from that in Example 6.1. The functions F(·) and G(·) in Example 6.1 are time-invariant nonlinear functions, while here the functions F and G are time-variant.
All parameter values, the structure of the variable networks, the weight learning laws, and the adaptive control laws used in this example are exactly the same as in Example 6.1. The tracking error between the reference input and the output of
the system is shown in Figure 6.7. Although the plant to be controlled is time-
variant, the convergence of the tracking error in this example is still similar
to that in Example 6.1. This shows that the adaptive control scheme using
variable neural networks also works well for time-variant nonlinear systems.
6.5 Examples 139
.~ 1.5
g
~
<J)
i5
-g
'"
"5 -0.5
Q.
~ -1
<J) _ the system output --- the reference input
F -15o"--------'------1-"-o----1"-5----2LO---~25----30
time t (sec)
.~ 3
-g
'=>"
%
o
'0
~ -1
Fig. 6.4. Reference input Yd(t), output y(t), reference input derivative Yd(t) and
output derivative y(t) of the system
0.5
0.4
0.3
2
Q;
g'
~ 0.2
jg
~
f-
0.1
o 10 15 20 25 30
time t (sec)
Fig. 6.6. The number K of GRBF units in the variable neural network
Fig. 6.7. Tracking error of the system with the time-variant plant
6.6 Summary
Nonlinear adaptive neural control has been studied in this chapter. After the
introduction of adaptive control for linear continuous-time systems, adaptive
neural control was presented by combining the variable Gaussian radial ba-
sis function network and Lyapunov synthesis techniques. This guarantees the
stability of the control system and the convergence of the tracking errors.
The number of GRBF units in the variable neural network also converges by
introducing mono decreasing upper and lower bounds on the tracking errors.
Simulation examples illustrate the operation of the variable neural network for
adaptive nonlinear system control.
CHAPTER 7
NONLINEAR PREDICTIVE NEURAL CONTROL
7.1 Introduction
Predictive control is now widely used by industry and a large number of imple-
mentation algorithms, including generalised predictive control (Clarke et al.,
1987), dynamic matrix control (Cutler and Ramaker, 1980), extended predic-
tion self-adaptive control (Keyser and Cauwenberghe, 1985), predictive func-
tion control (Richalet et al., 1987), extended horizon adaptive control (Ydstie,
1984) and unified predictive control (Soeterboek et al., 1990), have appeared
in the literature. Most predictive control algorithms are based on a linear
model of the process. However, industrial processes usually contain complex
nonlinearities and a linear model may be acceptable only when the process is
operating around an equilibrium point. If the process is highly nonlinear, a
nonlinear model will be necessary to describe the behaviour of the process.
Recently, neural networks have been used in some predictive control al-
gorithms that utilise nonlinear process models (Hunt et al., 1992; Willis et
al., 1992; Liu and Daley, 2001). Alternative design of nonlinear predictive
control algorithms has also been studied (McIntosh et al., 1991; Morningred
et al., 1991; Proll and Karim, 1994; Liu et al., 1996a, 1998b). However, most
nonlinear predictive control algorithms minimise their performance functions
using nonlinear programming techniques to compute the future manipulated
variables in on-line optimisation. This can make real-time implementation of
the algorithms very difficult.
This chapter considers neural network based affine nonlinear predictors so
that the predictive control algorithm is simple and easy to implement. The use
of nonlinear programming techniques to solve the on-line optimisation problem
is avoided and a neural network based on-line weight learning algorithm is
given for the affine nonlinear predictors. It is shown that using this algorithm,
both the weights in the neural networks and the estimation error converge and
never drift to infinity over time.
The chapter is organised as follows. Section 7.2 gives a brief introduction
to linear predictive control. Section 7.3 presents the structure of the affine
nonlinear predictors using neural networks. The predictive neural controller
is described in Section 7.4. Section 7.5 develops the on-line weight learning
algorithm for the neural networks used for the predictors and includes analysis
of the properties of the algorithm. The design of nonlinear predictive control
using 'growing' neural networks is illustrated in Section 7.6. Finally, Section
7.7 gives a simulated example to show the operation of the neural network
based predictive control.
7.2 Predictive Control
Based on an assumed model of the process and an assumed scenario for the
future control signals, predictive control gives a sequence of control signals for
discrete systems. Only the first control signal is applied to the process and
a new sequence of control signals is calculated when new measurements are
obtained. For continuous systems, the predictive control concept is also sim-
ilar. Clearly, predictive control belongs to the class of model-based controller
design concepts, where a model of the process is explicitly used to design the
controller.
One of the important features of predictive control is that its controller is
relatively easy to tune. This makes predictive control very attractive to a wide
class of control engineers and even to people who are not control engineers.
Predictive control has other features as follows:
(a) The predictive control concept can be used to control a wide variety of
processes without taking special precautions, for example, SISO or MIMO
processes, stable or unstable processes, minimum or non-minimum-phase
processes, and linear or nonlinear processes.
(b) Predictive control can handle process constraints in a systematic way dur-
ing the design of the controller, which is rather important for industrial
process control.
(c) Within the framework of predictive control there are many ways to design
predictive controllers, for example, generalised predictive control, dynamic
matrix control, and unified predictive control.
(d) Feedforward control action is introduced to predictive control in a nat-
ural way to compensate measurable disturbances and to track reference
trajectories.
(e) Predictive control can easily deal with pre-scheduled reference trajectories
or set points of processes by making use of prediction.
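The receding-horizon operation described above can be sketched in a few lines; `optimise_sequence` and `plant_step` are illustrative stand-ins for whatever optimiser and process model are actually used.

```python
# Sketch of the receding-horizon idea: at each step a whole future control
# sequence is computed, only its first element is applied, and everything is
# recomputed once a new measurement arrives. The helper names are assumptions.

def receding_horizon_control(y0, steps, optimise_sequence, plant_step):
    y, log = y0, []
    for t in range(steps):
        u_seq = optimise_sequence(y, t)   # full future control sequence
        u = u_seq[0]                      # only the first control is applied
        y = plant_step(y, u)              # new measurement from the process
        log.append((u, y))
    return log

# Toy usage: drive a first-order plant y+ = 0.9 y + u toward zero.
log = receding_horizon_control(
    1.0, 20,
    optimise_sequence=lambda y, t: [-0.9 * y, 0.0],  # deadbeat first move
    plant_step=lambda y, u: 0.9 * y + u)
```

The loop structure is the same whether the inner optimiser is an analytic formula, a quadratic programme, or a nonlinear search.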
The way predictive controllers operate for single-input single-output systems is
illustrated by Figure 7.1. It shows that the control sequences 1 and 2 designed
using the past input-output data produce different output sequences 1 and
2, respectively. This implies that if the future control sequence is planned
correctly at time t, the system output will closely or exactly follow the desired
reference trajectory. Predictive controllers are usually used in discrete time.
It is also possible to design predictive controllers for use in continuous time.
This section gives a brief introduction to predictive control for linear discrete
systems.
Let us consider the following single-input single-output discrete-time linear
system:
Fig. 7.1. Predictive control: past input-output data and candidate future control
sequences (1 and 2) with their output sequences over time t
(7.1)
(7.5)
where
R_{t+L1}, Ŷ_{t+L1} and ΔU_{t+M1} are vectors of the future reference input r_t, predicted
output ŷ_t and control input u_t, respectively, L1 = d + L − 1, M1 = M − 1, L is
the output horizon, M the control horizon and α the weight.
The future reference input is the desired process output, which is often
called the reference trajectory, and can be an arbitrary sequence of points.
Then the predictive controller calculates the future controller output sequence
so that the predictive output of the process is close to the desired process
output.
Now the optimal controller output sequence u* over the predictive horizon
is obtained by minimisation of the performance function Jp with respect to u,
that is
u* = arg min_u J_p   (7.9)
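For a linear prediction model this minimisation has a closed form. As a hedged sketch, assuming the standard stacked form Ŷ = G ΔU + f (which is not spelled out in this excerpt), the minimiser of a quadratic cost ||R − Ŷ||² + α||ΔU||² is ΔU* = (GᵀG + αI)⁻¹Gᵀ(R − f):

```python
# Sketch: closed-form minimiser of a quadratic predictive-control cost for a
# linear predictor Y = G dU + f. The matrix G and free response f are assumed
# to be available; they are not the book's exact symbols.

import numpy as np

def gpc_increment(G, f, R, alpha):
    """Optimal control-increment vector for Jp = ||R - Y||^2 + alpha||dU||^2."""
    n = G.shape[1]
    return np.linalg.solve(G.T @ G + alpha * np.eye(n), G.T @ (R - f))

# Toy check: with G = I and f = 0 the solution shrinks R by 1/(1 + alpha).
dU = gpc_increment(np.eye(2), np.zeros(2), np.array([1.0, 2.0]), alpha=1.0)
```

Note how the weight α trades tracking accuracy against control effort: larger α shrinks the computed increments.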
ŷ_{t+1} = (B/(ΔA)) Δu_{t-d+1}   (7.11)

ŷ_{t+k} = (B/(ΔA)) Δu_{t-d+k}   (7.12)
Now the output y(t + k) for k ≥ 1 can be computed recursively using (7.13),
starting with the following equation for k = 1:
The k-step-ahead predictor (7.13) and (7.14) runs independently of the process.
This predictor is not suitable for practical purposes because there always exist
differences between the prediction and the real process output. For example,
model mismatch or a disturbance at the output of the process may result
in a prediction error. One way to improve the predictions is to calculate the
predictions using (7.13) and (7.14) with ŷ_t on the right-hand side of (7.14)
replaced by the measured process output y_t. Thus equation (7.14) becomes
(7.15)
(7.17)
Several methods can be used to solve the above equation, for example, a re-
cursive approach (Clarke et al., 1987).
The optimal controller output sequence over the prediction horizon is ob-
tained by minimising the performance index J p with respect to the control
input vector. This can be carried out by setting
(7.18)
In predictive control the assumption is made that all the future control incre-
ments Δu_{t+i}, for i < M, are non-zero. Since, in practice, the control horizon in
predictive control need not be large, here we set M = 2. Let
P_k = B F_k = Σ_{i=0}^{d+m+k-1} p_{k,i} q^{-i}   (7.19)
(7.20)
where
Q_k = E_k y_t + Σ_{i=k+1}^{d+m+k-1} p_{k,i} q^{-i} Δu_{t-1}   (7.21)

g_k = p_{k,k-1}   (7.22)

h_k = p_{k,k}   (7.23)

with p_{k,-1} = 0.
Application of (7.18) results in the following predictive controller:

u_t = u_{t-1} + [1 0] [ α + Σ_{k=0}^{L-1} g_k²     Σ_{k=0}^{L-1} g_k h_k ]⁻¹ [ Σ_{k=0}^{L-1} (r_{t+d+k} − Q_k) g_k ]
                      [ Σ_{k=0}^{L-1} g_k h_k     α + Σ_{k=0}^{L-1} h_k² ]   [ Σ_{k=0}^{L-1} (r_{t+d+k} − Q_k) h_k ]   (7.24)
It is clear from the above that the predictive controller only involves the in-
version of a 2 x 2 matrix. This makes the implementation of the predictive
control very easy.
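The 2×2 solve in (7.24) can be sketched directly; the coefficient vectors g, h and the free responses Q passed in below are illustrative placeholders for the quantities defined in (7.19)-(7.23).

```python
# Sketch of the M = 2 predictive controller (7.24): only a 2x2 linear system
# is solved at each step. The numeric inputs here are illustrative only.

import numpy as np

def predictive_control_M2(u_prev, g, h, r, Q, alpha):
    """u_t = u_{t-1} + first component of the 2x2 least-squares solution."""
    A = np.array([[alpha + g @ g, g @ h],
                  [g @ h, alpha + h @ h]])
    b = np.array([(r - Q) @ g, (r - Q) @ h])
    du = np.linalg.solve(A, b)      # inversion of a 2x2 matrix only
    return u_prev + du[0]

u = predictive_control_M2(0.0,
                          g=np.array([1.0, 0.5]), h=np.array([0.0, 1.0]),
                          r=np.array([1.0, 1.0]), Q=np.array([0.0, 0.0]),
                          alpha=0.1)
```

Because the matrix is fixed in size regardless of the output horizon L, the per-step cost grows only with the length of the sums, which is what makes the implementation cheap.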
It has been shown in the previous section that the fundamental idea in predic-
tive control is to predict the vector of future tracking errors and minimise its
norm over a given number of future control moves. It is therefore clear that
predictive controller design mainly consists of two parts: prediction and min-
imisation. This section discusses the prediction part. The minimisation part
will be considered in the next section.
Only discrete-time affine nonlinear control systems will be considered, with
an input-output relation described by
where F(.) and G(.) are nonlinear functions, y is the output and u the control
input, respectively, the vector Y_t = [y_{t-1}, y_{t-2}, ..., y_{t-n}], n is the order of Y_t
and d is the time delay of the system. It is assumed that the order n and the
time delay d are known but the nonlinear functions F(.) and G(.) are unknown.
Clearly, the future output can generally be expressed by the NARMA model
(Leontaritis and Billings, 1985; Narendra and Mukhopadhyay, 1997)
(7.26)
for i = 0, 1, ..., L, where F_i(x_t) and G_ij(x_t) are nonlinear functions of the vector
x_t to be estimated, and the vector x_t = [y_t, y_{t-1}, ..., y_{t-n+1}, u_{t-1}, u_{t-2}, ...,
u_{t-d}]. The key feature of these predictors is that the present and future control
inputs u_t, u_{t+1}, ..., u_{t+i} occur linearly in (7.27). It can be seen from (7.27)
that the linearised predictors for nonlinear systems which are widely used in the
literature (see, e.g., Wang et al., 1995; Xie and Evans, 1984) are a special case
of the above.
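The affine structure of (7.27) — nonlinear in the regressor but linear in the controls — can be sketched as follows; the particular functions used are illustrative stand-ins, not the book's F_i and G_ij.

```python
# Sketch of the affine predictor structure (7.27): the prediction is
# y_{t+d+i} = F_i(x) + sum_j G_ij(x) * u_{t+j}, i.e. linear in the controls.
# The lambdas below are toy choices for F_i and G_ij.

def affine_prediction(Fi, Gij, x, u_future):
    """Affine-in-control prediction: F_i(x) + sum_j G_ij(j, x) * u_j."""
    return Fi(x) + sum(Gij(j, x) * u for j, u in enumerate(u_future))

# Toy example with F(x) = x^2 and G_ij(x) = 1/(1+j):
y = affine_prediction(lambda x: x ** 2,
                      lambda j, x: 1.0 / (1 + j),
                      x=0.5, u_future=[1.0, 2.0])
```

Because the controls enter linearly, minimising a quadratic cost over them reduces to linear algebra, which is exactly what Section 7.4 exploits.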
Due to the arbitrary approximation feature of neural networks, the nonlin-
ear functions Fi(xt) and Gij(Xt) can both be approximated by single hidden
layer networks. This is expressed by
F_i(x_t) = Σ_{k=1}^{N_i} f_{i,k} φ_{i,k}(x_t)   (7.28)

G_ij(x_t) = Σ_{k=1}^{N_ij} g_{ij,k} φ_{ij,k}(x_t)   (7.29)

for j ≤ i and i, j = 0, 1, ..., L, where φ_{i,k}(x_t) and φ_{ij,k}(x_t) are basis functions
of the networks, and N_i and N_ij denote the sizes of the networks. Define the weight
and basis function vectors of the neural networks as
for i = 0, 1, ... , L.
It is well known from the universal approximation theory for neural net-
works that the modelling error of the predictor can be reduced arbitrarily
by properly selecting the basis functions and adjusting the weights. There are
many types of basis functions which can be selected, including radial functions,
sigmoid functions, polynomial functions and so on. Section 7.6 will discuss the
selection of basis functions using a radial basis function network. An on-line
learning algorithm for the weight adjustment of the networks used in the pre-
dictors will be given in Section 7.5.
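The single-hidden-layer approximation of (7.28) can be illustrated with a least-squares weight fit; the Gaussian basis, centres and target function below are assumptions chosen purely for the demonstration.

```python
# Sketch: approximating an unknown nonlinear function by a weighted sum of
# fixed basis functions, cf. (7.28). Basis choice (Gaussians), centres, width
# and the target sin(2x) are illustrative assumptions.

import numpy as np

centres = np.linspace(-1.0, 1.0, 9)
width = 0.5

def phi(x):
    """Row of basis-function activations for scalar input x."""
    return np.exp(-((x - centres) / width) ** 2)

x_train = np.linspace(-1.0, 1.0, 50)
target = np.sin(2.0 * x_train)                    # the "unknown" function
Phi = np.vstack([phi(x) for x in x_train])
w, *_ = np.linalg.lstsq(Phi, target, rcond=None)  # weight adjustment

approx = Phi @ w
max_err = np.max(np.abs(approx - target))
```

With the basis fixed, the approximation problem is linear in the weights, which is why recursive least-squares-type learning laws such as those of Section 7.5 apply.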
7.4 Predictive Neural Control
To define how well the predicted process output tracks the reference trajectory,
a number of cost functions are employed for predictive control. This section
uses a cost function which is of the following quadratic form.
(7.35)
where
R_{t+d+L}, Ŷ_{t+d+L} and U_{t+L} are the future reference input, predicted output
and control input vectors, respectively, L is the control horizon, L + d is the
prediction horizon, and α > 0 is the weight.
The optimal controller output sequence over the prediction horizon is ob-
tained by minimising the performance index J_np with respect to U_{t+L}. This
can be carried out by setting
(7.39)
(7.40)
Using the neural network based predictors (7.34), the derivatives of Ŷ_{t+d+L}
with respect to the control input vector U_{t+L} are given by

(7.41)
Let
(7.42)
(7.43)
where h = [1, 0, ..., 0] is an identity vector and the matrix D_L is of the form

D_L = [  1   0  ...  0 ]
      [ -1   1  ...  0 ]
      [  .   .   .   . ]
      [  0  ... -1   1 ]   (7.44)
It is clear from (7.43) that the control input vector U_{t+L} can be calculated
by
(7.45)
(7.46)
The predictive neural controller is therefore relatively simple and easy to im-
plement using the affine nonlinear predictors. There is no need to solve a
nonlinear programming problem to obtain the optimal control input Ut unless
additional constraints are imposed on the control signal and/or output of the
system.
Here, we consider the on-line adjustment of the weights of the i-th predictor.
The weight estimations of the other predictors are similar. It will be assumed
that the basis functions of all the networks which are used in the predictors
are given and the required prediction accuracy can be achieved by adjusting
the corresponding weights to those functions.
Using the available output data y_{t-d-i}, ..., y_{t-d-i-n+1} and the input data
u_{t-d-i-1}, ..., u_{t-2d-i}, the output of the i-th predictor at time t can be written
as

(7.47)

where F_i* and G_ij* are the optimal estimates of the weight vectors F_i and G_ij,
for j = 0, 1, ..., i, respectively, ε_t is the approximation error of the predictor
using the neural network and is assumed to be bounded by a positive number
δ for all time, that is
(7.48)
where the weight vector W_t and the basis function vector Φ_t are

W_t = [F_i^T, G_i0^T, G_i1^T, ..., G_ii^T]^T   (7.50)

Φ_{t-1} = [φ_i(x_{t-d-i}), φ_i0(x_{t-d-i}) u_{t-d-i}, φ_i1(x_{t-d-i}) u_{t-d-i+1}, ...]^T   (7.51)
The estimation problem is then to find a vector W belonging to the set defined
by
(7.52)
Theorem 7.5.1. Consider the i-th predictor and the learning algorithm:
W_t = W_{t-1} + a_t β_t P_{t-1} Φ_{t-1} e_t   (7.53)

P_t = P_{t-1} − a_t β_t P_{t-1} Φ_{t-1} Φ_{t-1}^T P_{t-1}   (7.54)
(7.57)
(7.58)
Then
(i) (7.59)
(ii) lim_{t→∞} ||W_t − W_{t-1}|| = 0   (7.60)
(iii) (7.61 )
where
(7.62)
λ_max(.) and λ_min(.) denote the maximum and the minimum eigenvalues of the
matrix (.), respectively, and W* is the optimal estimate of the weight vector
W_t.
Proof: (i) Consider the Lyapunov function
(7.63)
(7.64)
Since it is assumed that the approximation error ε_t of the predictor satisfies
|ε_t| ≤ δ, it is known from the above that
(7.65)
(7.66)
(7.68)
||W_t − W_{t-1}||² <   (7.69)
It is clear from (7.54) that λ_max(P_t) ≤ λ_max(P_{t-1}) ≤ ... ≤ λ_max(P_0). Then
(7.69) can be written as
Then
λ_min(P_t^{-1}) ≥ λ_min(P_{t-1}^{-1}) ≥ ... ≥ λ_min(P_0^{-1})   (7.72)
Equation 7.67, together with the above, gives
(7.73)
which results in
(7.74)
Thus
(7.75)
7.6 Sequential Predictive Neural Control
This section discusses predictive neural controller design using growing neural
networks. Consider the i-th predictor to show how to design the predictive con-
trol. For the sake of simplicity, the basis function vectors of the i-th predictor
are assumed to be
(7.76)
This means all neural networks for the i-th predictor have the same basis
functions, which are of the form
(7.77)
where N_i is the number of basis functions and φ_ik(x_t) is the Gaussian radial
basis function, i.e.

(7.78)

where r_ik is the width of the (ik)-th basis function and c_ik is its centre.
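Since the exact expression of (7.78) is not reproduced in this excerpt, the sketch below assumes the common Gaussian form exp(−||x − c||²/r²) with centre c_ik and width r_ik as named in the text.

```python
# Sketch of one Gaussian radial basis function unit, assuming the common form
# exp(-||x - c||^2 / r^2); centre and width follow the text's c_ik and r_ik.

import numpy as np

def grbf(x, centre, width):
    """Gaussian radial basis unit: exp(-||x - c||^2 / r^2)."""
    d = np.asarray(x, dtype=float) - np.asarray(centre, dtype=float)
    return float(np.exp(-(d @ d) / width ** 2))

v_at_centre = grbf([0.2, -0.1], centre=[0.2, -0.1], width=0.5)  # peak value 1
```

The unit peaks at 1 at its centre and decays radially, with the width r_ik controlling how quickly neighbouring units overlap.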
The i-th predictor is now given by

ŷ_{t+d+i} = Σ_{k=1}^{N_i} f_{i,k} φ_{ik}(x_t) + Σ_{j=0}^{i} Σ_{k=1}^{N_i} g_{ij,k} φ_{ik}(x_t) u_{t+j}   (7.79)
If the prediction error of the i-th predictor is greater than required, according
to approximation theory more basis functions should be added to the networks
to improve approximation. Based on the structure of the function Yt+d+i in
Equation 7.34, the structure of the i-th predictor using the growing neural
networks now becomes
where ŷ_{t+d+i}^{(t-1)} denotes the structure of the i-th predictor at time t − 1 and
ŷ_{t+d+i}^{(t)} the structure after the addition of a basis function at time t, and f_{i,Ni+1}
and g_{ij,Ni+1} are the weights corresponding to the new (N_i + 1)-th Gaussian
radial basis function φ_{i(Ni+1)}(x_t).
A new basis function unit is added to the networks only when the current
observation satisfies both of the following novelty conditions:
(i) min_{k=1,...,N_i} ||x_t − c_ik||² > δ_c   (7.81)

(ii) |e_i(t)| > δ_max   (7.82)

where e_i(t) is the prediction error of the i-th predictor, which may approxi-
mately be measured by e_t defined by (7.57), δ_c is the required distance between
the basis functions and δ_max is chosen to represent the desired maximum tol-
erable accuracy of the predictor estimation. Criterion (i) says that the current
observation must be far from existing centres. Criterion (ii) means that the
approximation error in the network must be significant.
If the above conditions are satisfied, the new centre is set to be c_{i(Ni+1)} =
x_t. To assign a new basis function φ_{i(Ni+1)}(x_t) that is nearly orthogonal to all
existing basis functions, the angle between the GRBFs should be as large as
possible, which means reducing the width r_{i(Ni+1)}. However, a smaller r_{i(Ni+1)}
increases the curvature of φ_{i(Ni+1)}(x_t), which in turn gives a less smooth
function and can lead to overfitting problems. Thus, to obtain both good
orthogonality and smoothness, a choice for the width r_{i(Ni+1)} which ensures
that the angles between GRBF units are approximately equal to the required
angle θ_min is (Liu et al., 1998b)

(7.83)

where θ_min is the required minimum angle between Gaussian radial basis func-
tions.
When a new unit is added to the network at time t, the dimensions of the
vectors W_t and Φ_t and the matrix P_t increase by 1. The on-line learning
algorithm for the i-th predictor is still the same as that given in Section 7.5.
With the above modification, the matrices Q_L and H_L take the following form:

(7.84)

(7.85)

The predictive controller is still given by (7.46). In this way, the design of the
nonlinear predictive neural control is completed using growing networks.
7.7 An Example
In this section, consider the following affine nonlinear system (Chen and Khalil,
1995):
y_t = 2.5 y_{t-1} y_{t-2} / (1 + y_{t-1}² + y_{t-2}²) + 0.3 cos(0.5 (y_{t-1} + y_{t-2})) + 1.2 u_{t-1}   (7.86)
The reference input is r(t) = sin(πt/500). The initial condition of the plant is
(y_{-1}, y_{-2}) = (0, 0).
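The plant recursion (7.86) is easy to simulate directly; the zero input sequence used below is purely to illustrate the recursion and initial condition, not the controlled behaviour.

```python
# Open-loop sketch of the benchmark plant (7.86) from the zero initial
# condition. A zero control input is used here only to exercise the recursion.

import math

def plant_step(y1, y2, u1):
    """y_t from y_{t-1}, y_{t-2} and u_{t-1}, following (7.86)."""
    return (2.5 * y1 * y2 / (1.0 + y1 ** 2 + y2 ** 2)
            + 0.3 * math.cos(0.5 * (y1 + y2)) + 1.2 * u1)

y1 = y2 = 0.0                     # initial condition (y_-1, y_-2) = (0, 0)
trajectory = []
for t in range(100):
    y = plant_step(y1, y2, 0.0)   # u_{t-1} = 0
    trajectory.append(y)
    y1, y2 = y, y1
```

Note that the first nonlinear term is bounded by 1.25 and the cosine term by 0.3, so with bounded inputs the open-loop response stays bounded, which makes this a convenient test plant.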
The goal is to control the plant (7.86) so that it tracks the reference input r(t)
using a predictive control strategy that minimises the following quadratic cost
function:
(7.87)
δ_c = 0.008   (7.89)

δ_max = 0.012   (7.90)
[Simulation result figures: the system output and reference input, the tracking
error, and the model errors of the system over time t (0-1000)]
Fig. 7.7. Number of basis functions in the growing neural network for the two-step-
ahead predictor
In addition, the simulation results using the design procedure discussed in this
chapter were compared with those provided by other neural network based
predictive control techniques, for example, neural network based adaptive pre-
dictive control (Tan and Keyser, 1994) and robust nonlinear self-tuning pre-
dictive control using neural networks (Zhu et al., 1997). It was shown that the
design procedure given in this chapter has three significant advantages. The
first is that the computation for optimisation in the new design procedure is
simpler and faster than other techniques because it uses a simple analytical
solution to the minimisation of the performance function. The second is that
the new design procedure has better reference tracking performance as a result
of the use of a set of affine nonlinear predictors. The third is that the design
procedure provides an appropriate sized neural network by introducing the
growing network technique.
7.8 Summary
This chapter has discussed the neural network based predictive controller de-
sign of nonlinear systems. The fundamental principle of predictive control was
explained on the basis of linear discrete-time systems. A set of affine nonlinear
neural predictors was used to predict the output of the nonlinear process so
that the difficulty of minimising the performance function for nonlinear pre-
dictive control, usually handled by nonlinear programming techniques, is
avoided.

CHAPTER 8
VARIABLE STRUCTURE NEURAL CONTROL
8.1 Introduction
Variable structure control with sliding modes was first proposed in the early
1950s (Utkin, 1964; Emelyanov, 1967; Itkis, 1976) and has subsequently been
used in the design of a wide spectrum of system types including linear and
nonlinear systems, large-scale and infinite-dimensional systems, and stochastic
systems. It has also been applied to a wide variety of engineering systems.
The most distinctive feature of variable structure control based on sliding
modes is its ability to improve the robustness of systems which are subject
to uncertainty. If, however, the uncertainty exceeds the values allowed for the
design, the sliding mode cannot be attained and this results in an undesirable
response (Utkin, 1964). In the continuous-time case this problem was solved by
combining variable structure and adaptive control (Slotine and Li, 1991), but
this requires that all the system variables are available and can be measured.
This case has also been discussed for linear discrete systems using input-output
plant models (Furuta, 1990, 1993; Hung et al., 1993; Pan and Furuta, 1995)
and for nonlinear discrete systems where the input-output model is unknown
(Liu et al., 1997b, 1999b).
This chapter presents a neural network based variable structure controller
design procedure for unknown nonlinear discrete systems. A neural network
based affine nonlinear predictor is introduced so that the control algorithm is
simple and easy to implement. Two performance functions are considered for the design of
variable structure neural control. The first performance function is concerned
with minimisation of the prediction error. The second performance function
includes minimisation of the prediction error and the control input. A recursive
learning algorithm for neural networks for the neural network affine nonlinear
predictor is also discussed. This algorithm can be used for both on-line and off-
line weight training. It is shown that both the weights of the neural networks
and the estimation error converge.
A brief introduction to variable structure control for linear systems is given
in Section 8.2. Section 8.3 then studies variable structure neural control for
nonlinear systems, based on the structure of an affine nonlinear predictor built
from neural networks. Generalised variable structure neural control is discussed
in Section 8.4. Section 8.5 develops the recursive learning algorithm for the
neural networks used for the d-step-ahead predictor and analyses the properties
of the algorithm. Finally, simulation results are given in Section 8.6.
8.2 Variable Structure Control

A(q^{-1}) y_t = q^{-d} B(q^{-1}) u_t   (8.1)

where y_t is the output, u_t is the control input, d is the time delay, and A and B
are polynomials in the backward shift operator q^{-1}:
where n and m are the orders of the polynomials. It is assumed that the above
system is minimum phase, that is, all zeros of the polynomial B lie within the
unit disk.
The structure of a variable structure control system is governed by the
sign of the switching function. A switching function is generally assumed to
be linear. For discrete-time systems, a simple switching function is defined as
(8.4)
(8.5)
SHd = 0 (8.6)
To transfer the system state onto the switching surface, let us construct a
Lyapunov function as
(8.7)
(8.8)
where
(8.9)
(8.10)
(8.11)
which implies that as the time t approaches infinity, the system will reach the
switching surface, i.e.,

lim_{t→∞} s_t = 0   (8.12)
(8.13)
(8.14)
A E + q^{-d} F = C   (8.15)
(8.16)
Thus
(8.17)
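The Diophantine identity (8.15) can be solved by polynomial long division of C by A carried out for d steps. The sketch below assumes polynomials represented as coefficient lists in q^{-1}, lowest power first; this representation is an implementation choice, not the book's notation.

```python
# Sketch: solve A*E + q^{-d}*F = C for E (degree d-1) and F by long division
# of C by A for d steps. Polynomials are coefficient lists in q^{-1},
# lowest power first (an assumed representation).

def diophantine(A, C, d):
    rem = list(C) + [0.0] * max(0, len(A) + d - len(C))
    E = []
    for i in range(d):
        e = rem[i] / A[0]             # next coefficient of E
        E.append(e)
        for j, a in enumerate(A):     # subtract e * q^{-i} * A from remainder
            rem[i + j] -= e * a
    F = rem[d:]                       # remaining part equals q^{-d} * F
    return E, F

# Check on A = 1 - 0.5 q^{-1}, C = 1, d = 2:
E, F = diophantine([1.0, -0.5], [1.0], 2)
```

For this example E = 1 + 0.5 q^{-1} and F = 0.25, and indeed (1 − 0.5q^{-1})(1 + 0.5q^{-1}) + q^{-2}·0.25 = 1 = C.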
We consider the use of the above conventional minimum variance control and
of the variable structure control on the inside or outside of the sector defined
later. The following control input is considered:
(8.18)
St+d = Vt + St (8.19)
The auxiliary control input v_t is chosen as output feedback with the variable
coefficients
(8.20)
where H is a polynomial
(8.21 )
The following control law gives the stable closed-loop control system.
Theorem 8.2.1. For the plant (8.1) with the control law (8.18), the closed-
loop system is stable if the coefficients of the polynomial H are chosen as
(8.24)
Proof: By substituting (8.20) into (8.18), the control law can be rewritten as
(8.25)
It has been shown that using the above control law yields
(8.26)
which gives
(8.27)
For the choice of the coefficients of the polynomial H, if s_t e_{t-k} ∉ S(δ_k), the
following holds:

s_t Σ_{k=0}^{n-1} h_k e_{t-k} < − Σ_{k=0}^{n-1} δ_k |h_k|
                            < − (a/2) (Σ_{k=0}^{n-1} |e_{t-k}| |h_k|)²
                            < − (a/2) (Δs_{t+d})²   (8.28)
A Lyapunov function is chosen as (8.7). Using the above inequality leads to
< (8.29)
As a > 1,
(8.30)
This shows that s_t decreases when s_t e_{t-k} ∉ S(δ_k), for k = 0, 1, ..., n − 1,
and Δs_{t+d} → 0 as t → ∞ yields either that, if s_t e_{t-k} ∉ S(δ_k),

lim_{t→∞} e_t = 0   (8.31)

or

(8.32)
For the latter case, the closed-loop system is controlled so that St+d = O. Since
(8.33)
the Schur polynomial C will make the error et decrease to zero. Therefore, the
closed-loop system becomes stable.
(8.42)
It is well known from the universal approximation theory for neural networks
that the modelling error of the predictor can be reduced arbitrarily by properly
choosing the basis functions and adjusting the weights. There are many ba-
sis functions available, e.g., radial functions, sigmoidal functions, polynomial
functions and so on. This chapter does not intend to discuss how to choose
between these. But a recursive learning algorithm for the weight adjustment
of the networks used in the predictors will be presented in Section 8.5.
Based on the d-step-ahead affine nonlinear predictor modelled using the
neural networks described above, we now consider the variable structure neural
control using sliding modes. It will be assumed that all the basis functions
in the neural network predictor are given but the weights of the network are
unknown. The objective of the control is to minimise the following performance
function.
J_s = (1/2) (ŷ*_{t+d} − r_{t+d})²   (8.43)

where r_t is the reference input and ŷ*_{t+d} is the optimal d-step-ahead prediction
of the output y_t.
For the given neural network structure, the optimal d-step-ahead predictor
is given by
(8.44)
where F* and G* are the optimal estimates of the weights which yield a pre-
diction error within the required accuracy.
Based on the optimal d-step-ahead predictor given by (8.44), the control
input to minimise J s can be solved analytically and is expressed as
(8.45)
(8.46)
(8.47)
(8.48)
(8.49)
(8.50)
where
(8.51)
b_k = { −d_0 sign(b̂_k ψ_t s_t)   s_t ∉ S(σ_t)
      { 0                        otherwise
(8.53)
(8.54)
ψ_t = Σ_{k=1}^{N_1} |γ_k(x_t)(Ĝ^T Γ_t)^{-1}| μ + 1   (8.55)
(8.56)
(8.57)
(8.60)
where
(8.61 )
(8.64)
The above relation implies that Δs_{t+d} converges to zero as t approaches infin-
ity. This shows that s_{t+d} is brought to the inside of the set Σ(σ_t).
8.4 Generalised Variable Structure Neural Control

In the previous section, the performance function J_s involves only the difference
between the reference and the optimal prediction. For many practical systems,
the control input of the system should be taken into account in the performance
function. Thus, the objective of the control in this section is to minimise the
following performance function, which includes the control input:
(8.65)
(8.66)
To avoid the difficulty of finding the optimal weight vectors F* and G* in the
affine nonlinear predictor, the use of a predictive neural controller and variable
structure control are considered. Similar to the previous section, the following
control input is used:
(8.67)
(8.68)
where
For this case, the following theorem gives the design of the auxiliary control
input v_t so that s_{t+d} converges from the outside to the inside of the set Σ(τ_t).

c_1 = { …   s_t ∉ Σ(τ_t)
      { 0   otherwise
(8.74)
(8.75)
κ_t = Σ_{k=1}^{N_1} |γ_k(x_t)(Ĝ^T Γ_t)^{-1}| μ + 1   (8.76)
ζ and d_0 are positive numbers; then s_{t+d} converges to the set Σ(τ_t).
(8.78)
It is shown from the proof of Theorem 8.2 that the difference of the above
Lyapunov function is given by (8.58). With (8.46) and (8.67), s_{t+d} can be
expressed as
(8.79)

Moving the term s_t from the right-hand side to the left-hand side of the equa-
tion above gives
where
(8.81 )
The upper bound of the absolute value of Δs_{t+d} is estimated by

|Δs_{t+d}| < Σ_{k=1}^{N_0} |u_k + a_k(t) φ_k(x_t)| + …
where F* and G* are the optimal estimates of the weights, and ε_t is the approxima-
tion error of the predictor, which is assumed to be bounded, i.e., max|ε_t| ≤ δ_L,
but the upper bound δ_L is not known exactly.
The estimated d-step-ahead predictor can be written compactly as
(8.86)
where the weight vector W_{t-1} and the basis function vector Φ_{t-1} are
Based on the recursive least squares algorithm for a bounded noise (Whidborne
and Liu, 1993; Wang et al., 1995), the recursive weight learning algorithm for
the neural network is proposed as

W_t = W_{t-1} + λ_t P_t Φ_{t-1} e_t   (8.89)

P_t = P_{t-1} − λ_t a_t P_{t-1} Φ_{t-1} Φ_{t-1}^T P_{t-1}   (8.90)

λ_t = ς_t (δ_e Φ_{t-1}^T P_{t-1} Φ_{t-1})^{-1} (|e_t| − δ_e)   (8.91)

a_t = δ_e |e_t|^{-1}   (8.92)

e_t = y_t − ŷ_t   (8.93)

ς_t = { 1   |e_t| > δ_e
      { 0   otherwise   (8.94)

where the positive number δ_e is assumed not to be less than the upper bound
δ_L of the approximation error, P(0) is a positive definite matrix and λ_max(.) is
the maximum eigenvalue of its argument matrix.
Consider the Lyapunov function
(8.95)
(8.96)
(8.97)
Thus
(8.99)
where
(8.100)
Since the bound δ_L of the estimation error ε_t is assumed not to be greater
than δ_e, it is easy to show that f(δ_e) > 0 until the error |e_t| = δ_e. So, the
error |e_t| converges to δ_e. On the other hand, if |e_t| < δ_e, it is possible that
ΔV_t > 0. This implies that the weight vector W_t may drift away over time.
In this case, ς_t = 0 in the weight learning algorithm given by (8.89)-(8.94),
which avoids divergence of the weight vector. Thus the error |e_t| converges to
the range [0, δ_e].
The analysis above shows that if the upper bound δ_L is known, then the
error |e_t| will converge to δ_L by simply setting δ_e = δ_L. In the case where the
upper bound δ_L of the estimation error ε_t is not known exactly, the error
|e_t| still converges to δ_e if δ_e is set to be greater than δ_L. Thus, the closer the
number δ_e is chosen to the upper bound δ_L, the more accurate the estimation
of the predictor is.
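The dead-zone idea behind (8.89)-(8.94) — update the weights only while the prediction error exceeds the assumed noise bound δ_e — can be sketched as follows. The normalised gain used here is a simplified assumption, not the book's exact λ_t.

```python
# Sketch of dead-zone recursive learning in the spirit of (8.89)-(8.94): the
# update is frozen when |e| <= delta_e, which prevents weight drift driven by
# bounded disturbances. The gain is a simplified normalised step.

import numpy as np

def dead_zone_step(W, phi, y, delta_e):
    e = y - phi @ W
    if abs(e) <= delta_e:                  # inside the dead zone: no update
        return W
    step = (abs(e) - delta_e) / (1.0 + phi @ phi)
    return W + np.sign(e) * step * phi

W = np.zeros(2)
rng = np.random.default_rng(1)
true_w = np.array([1.0, -2.0])
for _ in range(2000):
    phi = rng.normal(size=2)
    noise = rng.uniform(-0.05, 0.05)       # bounded disturbance, |n| <= delta_e
    W = dead_zone_step(W, phi, phi @ true_w + noise, delta_e=0.05)
```

As the analysis above suggests, the error is driven down only until it falls within the dead zone, so the final parameter error is of the order of δ_e rather than zero.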
8.6 An Example
In this section, consider the following affine nonlinear system (Chen and Khalil,
1995):
y_t = 2.5 y_{t-1} y_{t-2} / (1 + y_{t-1}² + y_{t-2}²) + 0.3 cos(0.5 (y_{t-1} + y_{t-2})) + 1.2 u_{t-1}   (8.101)

The initial condition of the plant is (y_{-1}, y_{-2}) = (0, 0) and the reference input
is

r(t) = { 6 cos(πt/80)   0 < t ≤ 160
       { 0              t > 160   (8.102)
Since the structure and parameters of the functions F(Xt) and G(Xt) in the
affine nonlinear system are assumed to be unknown, a growing Gaussian radial
basis function (GRBF) neural network was used to approximate the functions.
The growing GRBF network was initialised with no basis function units. As
observations are received the network grows by adding new units. The decision
to add a new unit depends on the observation novelty for which two condi-
tions must be satisfied. The first condition states that the approximation error
between the real output and the estimated output must be significant. The
second condition states that the new centre of the GRBF must be far away
from existing centres. In this way, the approximation accuracy of the functions
F(Xt) and G(Xt) will converge to the required bound.
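The two novelty conditions can be expressed as a short test. The threshold names and values below are illustrative assumptions, not the parameters used in the book.

```python
import numpy as np

def is_novel(x, error, centres, e_min=0.1, d_min=0.5):
    """Decide whether observation x should spawn a new GRBF unit.

    Both growth conditions must hold:
      1. the approximation error at x is significant: |error| > e_min;
      2. x is far from every existing centre: nearest distance > d_min.
    """
    if abs(error) <= e_min:
        return False
    if not centres:
        return True                       # empty network: add the first unit
    nearest = min(np.linalg.norm(x - c) for c in centres)
    return bool(nearest > d_min)

# usage: start with no units and grow as observations arrive
centres = []
x = np.array([1.0, 2.0])
if is_novel(x, error=0.4, centres=centres):
    centres.append(x)                     # add a unit centred at the observation
```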
Fig. 8.2. Output y_t and reference input r_t of the system
In the simulation, the recursive weight algorithm was used for off-line train-
ing of the growing GRBF network. When the variable structure neural control
was applied, the recursive weight algorithm was then used for on-line train-
ing of the growing GRBF network. The generalised variable structure neural
control strategy was used. The parameters were α = 0.5, T = 1.1, μ = 0.1 and
d_0 = 0.5. The performance of the system is shown in Figures 8.2 and 8.3 for
neural network based variable structure control. Figure 8.2 shows the output
y_t and the reference input r_t of the system. The tracking error r_t - y_t is shown
in Figure 8.3. The results of the simulation indicate that the tracking error of
the system using variable structure control is small and converges rapidly.
8.7 Summary
This chapter has considered a neural network based variable structure con-
troller design for unknown nonlinear discrete systems. The basic idea of vari-
able structure control was discussed for linear discrete-time systems. A neural
network based affine nonlinear predictor was introduced to predict the outputs
of the nonlinear process, and a variable structure control algorithm was devel-
oped which is simple and easy to implement. In order to improve the stability
and robustness performance of the system, a discrete sliding mode control
technique was applied. Two cases were considered for the variable structure
neural control. The first was based on minimisation of the squared prediction
error. The second was based on combined minimisation of both the squared
prediction error and the squared control input. A recursive weight learning
algorithm for the affine nonlinear predictors was also developed which can be
used for both on-line and off-line weight training. Analysis of the weight learn-
ing algorithm demonstrated that both the weights of the neural networks and
the estimation errors converge.
CHAPTER 9
NEURAL CONTROL APPLICATION TO COMBUSTION
PROCESSES
9.1 Introduction
Combustion processes exist in many applications related to power generation,
heating and propulsion; for example, steam and gas turbines, domestic and
industrial burners, and jet and ramjet engines. The characteristics of these
processes include not only several interacting physical phenomena but also a
wide variety of dynamic behaviour. In terms of their impact on system per-
formance, pressure oscillations are of most significance. In some applications,
pressure oscillations are undesirable since they result in excessive vibration,
causing high levels of acoustic noise and in extreme cases mechanical failure.
In the frequency domain, the pressure is characterised by dominant peaks at
discrete frequencies which correspond to the acoustic modes of the combustion
chamber.
Generally speaking, there are two types of strategies that can be employed
to solve this problem: passive control and active control. Passive control has
been used in most practical combustors and approaches include changing the
flame anchoring point, the burning mechanism and the acoustic boundary con-
ditions, and installing baffles and acoustic dampers (Schadow and Gutmark,
1989). As a result of changes in the dynamic properties with operating point
it is difficult to optimise performance using passive means alone. Active con-
trol is the term that is widely used to describe the situation where control is
effected by adding energy to the system via the use of actuators in a way that
counteracts the oscillation.
Many active control approaches have been applied to combustion systems;
for example, phase-lead control (McManus et al., 1990), robust control (Tierno
and Doyle, 1992), LQG control (Annaswamy and Ghoniem, 1995), and adap-
tive control (Padmanabhan et al., 1995). In order to develop an effective active
controller for unstable combustors, Fung and Yang (1991) proposed a strategy
using state feedback theory. The unstable combustor was modelled as a linear
system, and a classical observer was used to estimate the states of the system
from the pressure measurement. This observer requires the parameters of each
mode (e.g., the damping and frequency) to enable estimation of the system
states. These parameters are generally difficult to obtain in advance because
they vary with the operating point and the ambient conditions. Neumeier
and Zinn (1995) presented an active control approach for combustion systems,
which attenuates the largest mode at each instant and ignores the other less
important modes. This can lead to poor pressure prediction, and also to a control
action that accentuates the less naturally dominant modes.
In this chapter another active control approach is studied, which addresses
all the modes independently. It is based on an output model which is estimated
by a neural network. This output model is then used to predict the system
pressure, to overcome the combustion system time delay. Finally, the prediction
is used by a controller to optimally attenuate the oscillating modes. Results
from the control of a simulated unstable combustor and a combustion test rig
are given.
9.2 Model of Combustion Dynamics

The dynamics of the combustion process are governed by the conservation
equations

∂ρ/∂t + v ∂ρ/∂x + ρ ∂v/∂x - m = 0    (9.1)

ρ ∂v/∂t + ρv ∂v/∂x + ∂p/∂x - f = 0    (9.2)

∂p/∂t + v ∂p/∂x + γp ∂v/∂x + (1 - γ)q - e = 0    (9.3)
where ρ, v and p refer to the density of the mixture, the velocity of the gas, and
the pressure, respectively, m, f and q are the rates of added mass, momentum,
and heat release per unit volume, e denotes other sources of energy, and γ is the
specific heat ratio. Let the above variables be separated into two parts: their
mean and perturbed components; for example, the pressure can be expressed
in the form p(x, t) = p̄(x) + p′(x, t). Also assume that the mean flow is steady,
the perturbed components are small variations about the means, the mean
heat release is small, and the Mach number of the mean flow is also small.
Then the following dynamic relations can be obtained
where c is a coefficient. The above neglects the mean flow effects. In the presence
of non-negligible mean flow and mean heat release rate, the underlying
acoustic relations are more complex than the above equations, but can also be
analysed in a similar manner.
It is assumed that in the combustor there are no external sources except
the heat release rate. This means m = f = e = 0. The unsteady pressure p′
(variations about a mean) can be expressed by the following physically based
model (Annaswamy and Ghoniem, 1995; Neumeier and Zinn, 1995):
p′(x, t) = Σ_{i=1}^{n} sin(k_i x + φ_{i0}) η_i(t)    (9.6)
where k_i and φ_{i0} are determined by the boundary conditions, and correspond
to the spatial mode shapes, η_i represents the acoustic dynamics, n is the
number of modes and x is the displacement along the chamber length. Let
K = diag[ k_1  k_2  ...  k_n ]    (9.10)
The acoustic dynamics of the combustion system can then be written as (An-
naswamy and Ghoniem, 1995)
(9.12)
where a_0 = (γ − 1)(γm̄)^{-1}, Ω = cK, γ is the specific heat ratio, and the
perturbed localised heat release rate is of the form q(x, t) = δ(x − x_0)q_0(t).
Thus the fundamental dynamics are described by (9.12) with output (9.6).
where w_i is the weight, f_i(ξ_i, t) the basis function with parameter vector ξ_i, N
the number of basis functions and δ_t the modelling error.
(Figure: combustor with a neural network learning mechanism)
For a different position x on the combustor, the weights will be different. The
model can be rewritten as

p_t = F_t^T W_t + δ_t    (9.14)

where

W_t = [ w_1  w_2  ...  w_N ]^T    (9.15)

F_t = [ f_1  f_2  ...  f_N ]^T    (9.16)

and Ŵ_t denotes the estimate of the weight vector W_t (9.17).
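A learning mechanism for the weights of this linear-in-weights model can be sketched with a normalised least-mean-squares step. This particular update and its gain are assumptions for illustration, not necessarily the learning algorithm used in the book.

```python
import numpy as np

def lms_step(W, F, p, mu=0.5):
    """Normalised LMS update for the output model p_t = F_t^T W_t + delta_t.

    W   current weight estimate
    F   vector of basis-function values f_1, ..., f_N at this instant
    p   measured pressure sample
    """
    e = p - F @ W                           # modelling residual
    return W + mu * e * F / (1e-9 + F @ F)  # step along F, normalised by ||F||^2

# usage: recover assumed weights from noise-free pressure samples
rng = np.random.default_rng(1)
W_true = np.array([2.0, -1.0])
W = np.zeros(2)
for _ in range(500):
    F = rng.standard_normal(2)
    W = lms_step(W, F, F @ W_true)
```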
In combustion systems with active elements, there is a large time delay between
the actuator's input and the actual measured pressure. To control a system
with time delay, a predictor is needed to estimate the future output. The
output predictor using the neural network is readily constructed as

p̂(t + T) = Σ_{i=1}^{N} ŵ_i(t) f_i(ξ_i, t + T)

where T is the time delay of the combustion system. This implies that the
accuracy of the output prediction depends on the weights estimated at time t.
It is therefore assumed that this time dependence is low in frequency compared
with the mode frequencies.
Based on the mode predictor, the output of a controller that is used to
cancel the effect of the inherent time delay is generally of the form

u(t) = f(p̂(t + T))

where f(·) is a function of the output prediction. The design of this function
depends on what control performance is under consideration, and it can be a
linear or nonlinear function.
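The predictor-plus-controller structure can be sketched generically. The basis functions, the gain and the choice of a linear f(·) below are assumptions for illustration.

```python
import math

def predict(w_hat, basis, t, T):
    """T-ahead prediction from the weights estimated at time t:
    p_hat(t + T) = sum_i w_hat_i(t) * f_i(t + T)."""
    return sum(wi * f(t + T) for wi, f in zip(w_hat, basis))

def control(w_hat, basis, t, T, Kc=0.8):
    """Delay-compensating controller u(t) = f(p_hat(t + T));
    here f(.) is chosen linear: u = -Kc * p_hat."""
    return -Kc * predict(w_hat, basis, t, T)

# usage with two assumed sinusoidal basis functions
basis = [lambda t: math.sin(2.0 * t), lambda t: math.sin(3.0 * t + 0.1)]
u = control([0.5, -0.2], basis, t=1.0, T=0.05)
```

Evaluating the basis functions at t + T rather than t is what compensates the delay: the control applied now is matched to the pressure it will actually meet.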
In this section, the active control of a simulated combustor with six modes is
considered. The output model based predictive control using neural networks
is shown in Figure 9.2.
Fig. 9.2. Output model based predictive control using neural networks
p(t) = Σ_{i=1}^{N} w_i sin(η_i t + φ_i) + δ_t    (9.21)

where w_i is the weight and sin(η_i t + φ_i) is the basis function. For this system,
the basis parameters η_i and φ_i were chosen to match the modes of the simulated
combustor.
(Figure: pressures of the combustor and the model; time 0.26–0.3 sec)
(Figure: weights W1, W2, ..., W6 of the neural network)
Fig. 9.6. Weights W7, W8, ..., W12 of the neural network
The performance of the output predictor is very good before the active control
is used, as shown in Figure 9.7. After the active control is applied to the system,
the prediction performance of the output predictor becomes worse, but is still
quite good. This is because the active control changes the characteristics of
the closed-loop system with time. The difference between the measured and
predicted pressures is small, as shown in Figure 9.8. This also shows that
though there is noise in the system, the output predictor has disturbance-
rejection properties.
To stabilise and attenuate the system oscillation, a simple controller is
introduced which is a linear function of the output prediction, that is,
u(t) = -K_c Σ_{i=1}^{n} ŵ_i(t) sin(η_i(t + T) + φ_i)    (9.28)
where Kc is the feedback gain. The active control response is given in Figure
9.9, where it can be seen that when the active control is switched on at 0.5
seconds, the pressure is rapidly reduced. The amplitudes of the combustor
acoustic modes increase with time before the active control is switched on at 0.5
seconds. When the active control is applied, these amplitudes reduce gradually.
The behaviour of the six modes is illustrated in Figure 9.10.
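Controller (9.28) is straightforward to implement once the weights, frequencies and phases are available. The mode parameters in the usage line below are illustrative assumptions, not the simulation's actual values.

```python
import math

def control_9_28(w_hat, eta, phi, t, T, Kc):
    """Feedback law (9.28): u(t) = -Kc * sum_i w_i(t) sin(eta_i (t + T) + phi_i).

    w_hat  estimated mode weights at time t
    eta    mode angular frequencies
    phi    mode phases
    T      time delay of the combustion system
    Kc     feedback gain
    """
    return -Kc * sum(w * math.sin(e * (t + T) + p)
                     for w, e, p in zip(w_hat, eta, phi))

# usage with two hypothetical modes
u = control_9_28(w_hat=[1.0, 0.3],
                 eta=[2 * math.pi * 210, 2 * math.pi * 740],
                 phi=[0.0, 0.5],
                 t=0.5, T=0.002, Kc=0.6)
```

Because each mode enters the sum independently, the law attenuates all modes at once rather than only the currently dominant one.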
Fig. 9.7. Pressures of the combustor and the predictor
Fig. 9.8. Pressures of the combustor and the predictor after active control is applied
Fig. 9.9. Active control response of the system

Fig. 9.10. Amplitudes of the six modes of the system
Fig. 9.11. Power spectrum of the system, with and without active control
9.6 Active Control of an Experimental Combustor

Output model based predictive control using neural networks has also been
evaluated using an atmospheric combustion test rig with a commercial com-
bustor. A schematic diagram of the active control system for the combustor
test rig is shown in Figure 9.12.
The active controller consists mainly of the weight-learning algorithm for
the neural network based output model, the output predictor and the feedback
controller. These were designed using SIMULINK C-code S-functions. They
were implemented using the MathWorks Real-Time Workshop, connected to
a dSPACE board based around the TMS320. The actuator was a loudspeaker,
which was installed on the outer wall of the combustion chamber. After scaling
for commercial reasons, the main results of active control for the combustor
with two modes are shown in Figure 9.13 and Figure 9.14. When the active
control is switched on at 1.5 seconds, the pressure is reduced. It can be seen
that these test results are consistent with the simulation results.
Fig. 9.12. Schematic of the active control system for the combustor test rig
Fig. 9.13. Pressure of the combustor with active control switched on at 1.5 seconds
Fig. 9.14. Power spectral density of the combustor, with and without control
9.7 Summary
An active control strategy for combustion systems has been presented. The
strategy is based on an output model, an output predictor and a feedback
controller. Neural networks were used to reconstruct the measured output
accurately, using only limited knowledge of the combustion system. Unlike a
classical observer, only a measured output signal is required. To overcome the time
delay of the system which is often very large compared with the sampling pe-
riod, an output predictor has been developed. An output-feedback controller
was introduced which uses the output of the predictor to suppress instability
in the combustion process. The active control of a simulated unstable combus-
tor system with six modes was used to demonstrate how each mode can be
extracted and dealt with separately. The performance of the strategy was also
illustrated in a combustor test rig with two dominant modes. Since the output
prediction is accurate despite the need for only limited a priori knowledge,
the approach will be useful for combustion systems where the behaviour is not
fully understood.
REFERENCES
Liu, G. P. and S. Daley (1999b). Output model based predictive control for
unstable combustion systems using neural networks. Control Engineering
Practice, vol. 7, pp. 591-600.
Liu, G. P. and S. Daley (1999c). Neural network based predictive control
of unstable combustion systems. Proceedings of the 14th IFAC World
Congress, Beijing, vol. J, pp. 421-426.
Liu, G. P. and S. Daley (1999d). Adaptive predictive control of combustor
NOx emissions. Proceedings of the 14th IFAC World Congress, Beijing,
vol. O, pp. 91-96.
Liu, G. P. and S. Daley (2001) Adaptive predictive control of combustor NOx
emissions. Control Engineering Practice, vol. 9., no. 6, pp. 631-638.
Liu, G. P. and V. Kadirkamanathan (1995). Learning with Multiobjective Cri-
teria. Proceedings of the Fourth International Conference on Artificial Neu-
ral Networks, Cambridge, UK, pp. 53-58.
Liu, G. P. and V. Kadirkamanathan (1999). Multiobjective criteria for non-
linear model selection and identification with neural networks. IEE Pro-
ceedings, Part D, vol. 146, no. 5, pp. 373-382.
Liu, G. P., V. Kadirkamanathan and S. A. Billings (1995). Sequential iden-
tification of nonlinear systems by neural networks. Proceedings of the 3rd
European Control Conference, Rome, pp. 2408-2413.
Liu, G. P., V. Kadirkamanathan and S. A. Billings (1996a). Nonlinear pre-
dictive control using neural networks. Proceedings of the UKACC Interna-
tional Conference on Control '96, UK, vol. 2, pp. 746-751.
Liu, G. P., V. Kadirkamanathan and S. A. Billings (1996b). Variable neu-
ral networks for nonlinear adaptive control. Preprints of the 13th IFAC
World Congress, San Francisco, vol. F, pp. 181-186.
Liu, G. P., V. Kadirkamanathan and S. A. Billings (1996c). Stable sequential
identification of continuous nonlinear dynamical systems by growing RBF
networks. International Journal of Control, vol. 65, no. 1, pp. 53-69.
Liu, G. P., V. Kadirkamanathan and S. A. Billings (1997a). On-line identifi-
cation of nonlinear systems using Volterra polynomial basis function neural
networks. Proceedings of the 4th European Control Conference, Brussels.
Liu, G. P., V. Kadirkamanathan and S. A. Billings (1997b). Variable structure
control for nonlinear discrete systems using neural networks. Proceedings
of the 4th European Control Conference, Brussels.
Liu, G. P., V. Kadirkamanathan and S. A. Billings (1998a). On-line identifi-
cation for nonlinear systems using Volterra polynomial neural networks.
Neural Networks, vol. 11, pp. 1645-1657.
Liu, G. P., V. Kadirkamanathan and S. A. Billings (1998b). Predictive con-
trol for nonlinear systems using neural networks. International Journal of
Control, vol. 71, no. 6, pp. 1119-1132.
Liu, G. P., V. Kadirkamanathan and S. A. Billings (1999a). Variable neural
networks for adaptive control of nonlinear systems. IEEE Transactions on
Systems, Man, and Cybernetics, vol. 29, no. 1, pp. 34-43.
Psaltis, D., A. Sideris and A. A. Yamamura (1988). A multilayered neural
network controller. IEEE Control Systems Magazine, vol. 8, pp. 17-21.
Qian, S., Y. C. Lee, R. D. Jones, C. W. Barnes and K. Lee (1990). The
function approximation with an orthogonal basis net. Technical Report,
Los Alamos National Laboratory.
Qin, S. Z., H. T. Su and T. J. McAvoy (1992). Comparison of four net learning
methods for dynamic system identification. IEEE Transactions on Neural
Networks, vol. 3, no. 1, pp. 122-130.
Rayner, P. J. and M. Lynch (1989). A new connectionist model based on a
nonlinear adaptive filter. Proceedings of the International Conference on
Acoustics, Speech and Signal Processing, pp. 1191-1194, Glasgow, Scotland.
Renals, S. (1989). Radial basis function network for speech pattern classifica-
tion. Electronics Letters, vol. 27, no. 7, pp. 437-439.
Richalet, J., S. Abu el Ata-Doss, Ch. Arber, H. B. Kuntze, A. Jacubasch and
W. Schill (1987). Predictive functional control application to fast and ac-
curate robots. Proceedings of the 10th IFAC Congress, Munich, Germany.
Rioul, O. (1993). Regular wavelets: a discrete-time approach. IEEE Transac-
tions on Signal Processing, vol. 41, no. 12, pp. 3572-3579.
Rioul, O. and M. Vetterli (1991). Wavelets and signal processing. IEEE Signal
Processing Magazine, vol. 8, pp. 14-38.
Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry, World Sci-
entific, Singapore.
Robbins, H. and S. Monro (1951). A stochastic approximation method. Annals
of Mathematical Statistics, vol. 22, pp. 400-407.
Roscheisen, M., R. Hofmann and V. Tresp (1992). Neural control for rolling
mills: incorporating domain theories to overcome data deficiency. Advances
in Neural Information Processing Systems, vol. 4, pp. 659-666.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information
storage and organisation in the brain. Psychological Review, vol. 65, pp.
386-408.
Rumelhart, D. E., G. E. Hinton and R. J. Williams (1986). Learning internal
representations by error propagation. In Parallel Distributed Processing:
Explorations in the Microstructure of Cognition, D. E. Rumelhart and J.
L. McClelland (eds), vol. 1: Foundations, Bradford Books/MIT Press, Cam-
bridge, MA.
Ruskai, M. (1991). Wavelets and Their Applications. Jones and Bartlett Pub-
lishers.
Sadegh, N. (1993). A perceptron network for functional identification and con-
trol of nonlinear systems. IEEE Transactions on Neural Networks, vol. 4,
no. 6, pp. 982-988.
Sanner, R. M. and J. J. E. Slotine (1992). Gaussian networks for direct adap-
tive control. IEEE Transactions on Neural Networks, vol. 3, no. 6, pp.
837-863.