
RECURSIVE PARAMETER ESTIMATION

Prof. Dr. ir. Robin DE KEYSER


Ghent University, Faculty of Engineering
EeSA Department of Electrical energy, Systems and Automation
e-mail: rdk@autoctrl.UGent.be

1. Introduction
2. The Structure of Recursive Estimators
3. System and Signal Models
4. Recursive Least Squares
5. Recursive Instrumental Variables
6. Recursive Extended Least Squares
7. Time Varying Systems
8. Potential Operating Problems
9. Application of Recursive Identification
10. Conclusions
11. Appendix: formulae derivations
12. References
§1. INTRODUCTION

System Identification means modeling of dynamic systems and signals from experimental data (Ljung, 1999). There is a large interest in parametric methods, i.e. identification methods which give models characterized by a parameter vector. For example this vector can consist of the coefficients in a difference equation or the parameters in a transfer function.

Identification is a link between the real world and the mathematical world. As such it is of considerable conceptual as well as practical interest. Some form of identification technique will be a necessary step in any application of theory to real-world systems.

Identification can basically be applied in an off-line or an on-line mode.

In an off-line (or batch) mode an experiment is carried out and afterwards all the
data are processed simultaneously. The methods employed for off-line system
identification are thus based on information from the plant which has been obtained
previously. This usually means that statistical tests are applied to a set of plant input-
output data in order to make an estimation of the model order and subsequently of
the values of the parameters within a model of that particular order.

In an on-line (or recursive) mode the data are used as soon as they are available.
The parameters can therefore be continuously estimated during the experiment. The
parameter estimates are recalculated each time new data becomes available. Thus
when the model is updated periodically, with reference to its past values, this is called
recursive identification or recursive parameter estimation. It is employed not only
within control algorithms, but also for many signal processing and filtering
problems.

Recursive or on-line methods have become increasingly important. Over the years,
many identification methods have been proposed. For the newcomer to the field it is
hard to see how the various methods are related. The field of identification was once called a “fiddler’s paradise” (Åström and Eykhoff, 1971). It is still often viewed as a long confusing list of methods and tricks. Coherence and
unification in the field of identification is not immediate. One reason is that methods
and algorithms have been developed in different areas with different applications in
mind.

The term “recursive identification” is taken from control literature. In statistical and
econometric literature the field is usually called “sequential parameter estimation”
and in signal processing and telecommunication the methods are known as
“adaptive filtering algorithms”.

Within these areas algorithms have been developed and analyzed over the last 30 years. However, only recently has there been a noticeably increased interest in the field from practitioners and industrial users. This is due to the construction of more
complex systems, where adaptive techniques (adaptive control, adaptive signal
processing) may be useful or necessary, and of course to the availability of
microprocessors for the easy implementation of more advanced algorithms.

In this text the most important forms of recursive identification are discussed.
Emphasis is placed on implementation and application aspects rather than on
theoretical (convergence) analysis. Moreover no attempt is made to make the list of
algorithms comprehensive.

§2. THE STRUCTURE OF RECURSIVE ESTIMATORS

Let us set recursive identification algorithms against batch identification by means of a simple example. Suppose the relationship between two time series {y(·)} and {u(·)} is given by a constant gain K, i.e. the process transforming the measured input u into the measured output y has no dynamic elements.

Assuming that the data acquisition takes place in discrete time, as is normally the case, we have at time t received a sequence of measurements {u(1), y(1), u(2), y(2), ..., u(i), y(i), ..., u(t), y(t)}. As indicated we are dealing with a static process model, thus ideally

y(·) = K·u(·)    (1)

where K is the unknown gain we want to estimate. To be realistic however, we accept that this ideal relationship will in practice be disturbed (measurement errors, modeling errors, secondary effects we did neglect, ...), such that we better take into account a stochastic disturbance term v(·) in the model:

y(·) = K·u(·) + v(·)    (2)
Due to these errors it would be unwise to estimate the gain K at time t by K̂_t = y(t)/u(t). To diminish the erroneous effect of the random disturbances we could start from the relationships:

y(1) = K·u(1) + v(1)
y(2) = K·u(2) + v(2)
  ...
y(i) = K·u(i) + v(i)    (3)
  ...
y(t) = K·u(t) + v(t)

which after summing up becomes:

Σ_{i=1}^{t} y(i) = K·Σ_{i=1}^{t} u(i) + Σ_{i=1}^{t} v(i)    (4)

Now supposing that the average value for the random disturbances is zero, we have Σ_{i=1}^{t} v(i) ≈ 0 and a good estimate for K based on t pairs of measurement data {y(·), u(·)} is given by

K̂_t = Σ_{i=1}^{t} y(i) / Σ_{i=1}^{t} u(i)    (5)

One could think of the following identification experiment:

- measure the necessary data at the process during a sufficiently long period of time
- store all measured values {y(·), u(·)} in a computer memory
- afterwards use the data to identify the unknown process gain K by means of formula (5)

This is an off-line or batch identification procedure.

It is an easy exercise to show that formula (5) can also be written as

K̂_t = K̂_{t-1} + (1/α_t)·[y(t) - K̂_{t-1}·u(t)]    (6)

with α_t = α_{t-1} + u(t)    (7)

Based on the alternative algorithm (6)&(7) we could then think of a second identification experiment:
- start from a set of initial values K̂_0, α_0
- measure y(1), u(1), compute K̂_1, α_1 by means of (6)&(7) and forget about y(1), u(1)
- measure y(2), u(2), compute K̂_2, α_2 by means of (6)&(7) and forget about y(2), u(2)
- ...
- measure y(t), u(t), compute K̂_t, α_t by means of (6)&(7) and forget about y(t), u(t).

This alternative way to obtain the estimate K̂_t is a recursive procedure (on-line or recursive identification). The algorithm can be interpreted as follows. Keeping in mind the process model (1), the term K̂_{t-1}·u(t) in (6) is a prediction of the process output y(t), say ŷ(t), based on the latest parameter estimate K̂_{t-1}. The difference y(t) - K̂_{t-1}·u(t) = y(t) - ŷ(t) is thus the prediction error. According to (6) the latest parameter estimate K̂_{t-1} is adapted strongly if the prediction error is large (indicating that the estimate K̂_{t-1} is indeed not a good estimate for the process). It is adapted weakly if the prediction error is small (indicating that K̂_{t-1} is already a good estimate and so it does not need to be changed much). This structure is typical for recursive identification algorithms.

The estimator gain 1/α_t is computed by means of a second recursion equation (7). Other forms are possible for this gain updating formula, leading to algorithms with different characteristics.
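As an illustration (not part of the original text; the true gain, the input sequence and the noise level are arbitrary assumptions), the following Python sketch runs both experiments on the same simulated data: the batch estimate (5) and the recursive form (6)&(7). Both arrive at the same value, the recursive one without ever storing the full data record.

import numpy as np

rng = np.random.default_rng(0)
K_true = 2.5                                       # assumed process gain
u = rng.uniform(0.5, 1.5, size=200)                # measured input
y = K_true * u + 0.1 * rng.standard_normal(200)    # measured output, model (2)

# off-line (batch) estimate, formula (5)
K_batch = y.sum() / u.sum()

# on-line (recursive) estimate, formulas (6) and (7)
K_hat, alpha = 0.0, 0.0
for ut, yt in zip(u, y):
    alpha = alpha + ut                             # gain recursion (7)
    K_hat = K_hat + (yt - K_hat * ut) / alpha      # estimate recursion (6)

print(K_batch, K_hat)    # identical up to rounding, both close to K_true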

§3. SYSTEM AND SIGNAL MODELS

Linearization of the process model is a generally accepted procedure and thus has
been the basis of most algorithms. A typical single input – single output (SISO) model
is a linear difference equation.

Consider a dynamical system with input signal {u(t)} and output signal {y(t)}. Suppose that these signals are sampled in discrete time t = 1, 2, 3, ... and that the sampled values can be related through the linear difference equation:

y(t) + a_1·y(t-1) + ... + a_{na}·y(t-na) = b_1·u(t-1) + ... + b_{nb}·u(t-nb) + v(t)    (8)

where v(t) is some disturbance of unspecified character. We shall use operator notation for conveniently writing difference equations. Thus let q^{-1} be the backward shift (or delay) operator:

q^{-1}·y(t) = y(t-1)    (9)

Then (8) can be rewritten as

A(q^{-1})·y(t) = B(q^{-1})·u(t) + v(t)    (10)

where A(q^{-1}) and B(q^{-1}) are polynomials in the delay operator:

A(q^{-1}) = 1 + a_1·q^{-1} + ... + a_{na}·q^{-na}
B(q^{-1}) = b_1·q^{-1} + b_2·q^{-2} + ... + b_{nb}·q^{-nb}

The model (8) or (10) describes the dynamic relationship between the input and the
output signals. It is expressed in terms of the parameter vector
θ^T = [a_1 ... a_{na}; b_1 ... b_{nb}]    (11a)

We shall frequently express the relation (8) or (10) in terms of the parameter vector. Introduce the vector of lagged input-output data,

φ^T(t) = [-y(t-1) ... -y(t-na); u(t-1) ... u(t-nb)]    (11b)

Then (8) can be rewritten as:

y(t) = φ^T(t)·θ + v(t)    (12)
This model describes the observed variable y(t) as an unknown linear combination of
the components of the observed vector φ(t ) plus noise. Such a model is called a
linear regression in statistics and is a very common type of model. The components
of φ(t ) are called regression variables or regressors. In control systems, φ(t ) is
also called the measurement vector and θ the parameter vector.

If the character of the disturbance term v(t) is not specified, we can think of

ŷ(t) = φ^T(t)·θ    (13)

as a natural guess or “prediction” of what y(t) is going to be, having observed previous values of y(k), u(k), k = t-1, t-2, ... This guess depends of course on the model parameters θ. The expression (13) becomes a prediction in the exact statistical sense, if {v(t)} in (12) is a sequence of uncorrelated random variables with zero mean values. We shall use the term “white noise” for such a sequence and denote it by {e(t)}.

If no input is present in (8) (nb = 0) and {v(t)} is considered to be white noise, then (8) becomes a model of the signal {y(t)}:

y(t) + a_1·y(t-1) + ... + a_{na}·y(t-na) = e(t)    (14)

Such a signal is commonly known as an autoregressive process of order na, or an AR process.

An important feature of the set of models discussed until now is that the “prediction”
yˆ (t ) in (13) is linear in the parameter vector θ . This makes the estimation of θ
simple.

Since the disturbance term v(t) in the model (12) corresponds to an “equation error”
in the difference equation (8), methods to estimate θ in (12) are often known as
equation error methods.
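To make the linear regression form (12) concrete, here is a short Python sketch (illustrative only; the model orders na = nb = 2 and the parameter values are assumptions) that builds the regression vector (11b) and evaluates the prediction (13) for a simulated ARX process.

import numpy as np

def phi(y, u, t, na=2, nb=2):
    # regression vector (11b) at time t: [-y(t-1) ... -y(t-na); u(t-1) ... u(t-nb)]
    return np.array([-y[t - i] for i in range(1, na + 1)] +
                    [u[t - j] for j in range(1, nb + 1)])

theta = np.array([-1.5, 0.7, 1.0, 0.5])     # assumed [a1 a2 b1 b2]
rng = np.random.default_rng(1)
N = 100
u = rng.standard_normal(N)
y = np.zeros(N)
for t in range(2, N):
    y[t] = phi(y, u, t) @ theta + 0.05 * rng.standard_normal()   # model (12)

y_hat = phi(y, u, 50) @ theta               # prediction (13) at t = 50
print(y[50], y_hat)                         # differ only by the noise term v(50)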

We could add flexibility to the model (10) by also modeling the disturbance term v(t). Suppose that this can be described as a moving average (MA) of a white noise sequence {e(t)}:

v(t) = C(q^{-1})·e(t)
C(q^{-1}) = 1 + c_1·q^{-1} + ... + c_{nc}·q^{-nc}

Then the resulting model is:

A(q^{-1})·y(t) = B(q^{-1})·u(t) + C(q^{-1})·e(t)    (15)

This is known as an ARMAX model. The reason for this term is that the model is a combination of an autoregressive (AR) part A(q^{-1})·y(t), a moving average (MA) part C(q^{-1})·e(t), and a control part B(q^{-1})·u(t). The control signal is in the econometric literature known as the eXogenous variable, hence the X. In the control literature also the name CARMA model is used (Controlled AutoRegressive Moving Average).

The dynamics of the model (15) are expressed in terms of the parameter vector:
θ^T = [a_1 ... a_{na}; b_1 ... b_{nb}; c_1 ... c_{nc}]

Since the model (15) also provides us with a statistical description of the
disturbances, we can compute a properly defined prediction of the output y(t).

When no input is present, the use of (15) means that we are describing the signal
{y (t )} as an ARMA process. This is a very common type of model for stochastic

signals.

Notice that the models are used to describe a stochastic dynamical system with an
input {u(t )} and an output {y (t )} , as well as to describe the properties of a stochastic

signal {y (t )} , where no input is present.
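A small simulation sketch of the ARMAX model (15) may help to see how the three parts combine; the polynomial coefficients below are purely illustrative assumptions (A = 1 - 1.2q^{-1} + 0.5q^{-2}, B = 0.8q^{-1}, C = 1 + 0.6q^{-1}).

import numpy as np

a = [-1.2, 0.5]      # a1, a2
b = [0.8]            # b1
c = [0.6]            # c1

rng = np.random.default_rng(2)
N = 300
u = np.sign(rng.standard_normal(N))          # excitation input
e = 0.1 * rng.standard_normal(N)             # white noise
y = np.zeros(N)
for t in range(2, N):
    ar = -a[0] * y[t-1] - a[1] * y[t-2]      # autoregressive part
    x  =  b[0] * u[t-1]                      # eXogenous (control) part
    ma =  e[t] + c[0] * e[t-1]               # moving average part
    y[t] = ar + x + ma                       # ARMAX model (15)/(30)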

§4. RECURSIVE LEAST SQUARES (RLS)

Many methods are possible for off-line system identification, where a finite amount of
plant data must be obtained and subsequently employed to obtain system parameter
estimates. An obvious approach to recursive identification is to take any off-line
method and modify it, so that it meets the constraints

θ̂(t) = f[θ̂(t-1), P(t), φ(t)]    (16)

P(t) = g[P(t-1), θ̂(t-1), φ(t)]    (17)

which is a generalization of the structure (6,7). Here f(·,·,·) and g(·,·,·) are known functions of the previous estimate θ̂, the current data φ and the auxiliary variable P. The only thing that needs to be stored at time t is consequently the information {θ̂(t), P(t)}. This quantity is updated with a fixed algorithm, with a number of operations that does not depend on time t. The choice of the functions f and g in (16,17) leads to several recursive identification methods. In this text the intention is to take a brief look at some commonly encountered algorithms.

The first technique to be discussed is in fact that of Recursive Least Squares (RLS)
which is popular not only because of its relatively low computational requirements but
also because it is straightforward to understand.

Consider the difference equation model (12). At time instant t-1, we actually know not only θ̂(t-1) but also φ^T(t) = [-y(t-1), ...; u(t-1), ...]. With regard to (12) as the system equation, a guess can therefore be made as to what the next output signal, ŷ(t), will be, i.e.

ŷ(t) = φ^T(t)·θ̂(t-1)    (18)

As the character of the disturbance v(t) in (12) is not specified, the best we can do as a first step is to forget about it in the prediction (as its average is zero).

Once the new output signal is measured, the error in prediction can be found as

ε(t) = y(t) - ŷ(t)    (19)

It is then fairly intuitive that, when the noise signal v(t) is relatively small, if our parameter estimates θ̂ are fairly close to their actual values θ then the error ε(t) should also be small; if however our estimates θ̂ are a pretty poor approximation to θ then one would expect ε(t) to be large. By taking into account the magnitude of the error ε(t) it is possible to improve the parameter estimates by means of the equation:

θ̂(t) = θ̂(t-1) + K(t)·ε(t)    (20a)
     = θ̂(t-1) + K(t)·[y(t) - φ^T(t)·θ̂(t-1)]    (20b)

such that for any particular K(t), if ε(t) is small, very little change is made to our estimates whereas for a large ε(t) a lot of alteration is required (Prediction Error Identification Method).

It is now apparent that the choice of K(t) is important; e.g. K(t) = 0 for all t is perhaps not ideal! It does suggest, though, that some other choice of K(t) will be better; so how do we find a better choice, and is it possible to calculate a ‘best’ choice?

Initially let us investigate the difference between the actual and estimated parameter vectors,

Δ(t) = θ(t) - θ̂(t)    (21)

It is straightforward to show that

Δ(t) = [I - K(t)·φ^T(t)]·Δ(t-1) - K(t)·v(t)    (22)

where I is the identity matrix. Equation (22) then shows that if rapid updating of the
parameter estimates is required then K(t ) must be large, although this will result in
large perturbations due to the K(t )v (t ) term. Conversely if K(t ) is chosen to be small,
in order to achieve better noise rejection, then the updating of the estimates will also
be sluggish.

A sensible choice for K(t) is one which minimizes the sum of the squared error terms ε²(t) (least squares principle). Then if the ‘covariance matrix’ P(t) is defined to be

P(t) = (1/t)·R^{-1}(t)  with  R(t) = E{φ(t)·φ^T(t)}    (23)

where E{...} signifies the stochastic expected value, it is shown in the appendix that the recursively calculated least squares estimate is found if

K(t) = P(t-1)·φ(t) / [1 + φ^T(t)·P(t-1)·φ(t)]    (24a)
     = P(t)·φ(t)    (24b)

and also

P(t) = [I - K(t)·φ^T(t)]·P(t-1)    (25a)
     = P(t-1) - P(t-1)·φ(t)·φ^T(t)·P(t-1) / [1 + φ^T(t)·P(t-1)·φ(t)]    (25b)

The RLS parameter estimator consists of three equations, namely (20), (24) and (25)
and these are all recalculated at each time instant.

It is thought best to make some remarks here:

- It is shown in the appendix that if {v(t)} = {e(t)}, i.e. if the disturbance is a zero mean white noise sequence, then the estimate sequence {θ̂(t)} will converge to the actual vector θ as t → ∞. It is apparent though that if the disturbance v(t) is colored noise then the sequence {θ̂(t)} will converge, but will be biased away from θ.
- The equations (24) and (25) may result in divergence due to numerical problems alone if care is not taken. This problem of numerical instability might occur after several ten-thousands of samples. In such a case it is suggested that a numerically stable algorithm such as UD factorization be used in order to avoid this possibility (Bierman, 1977).
- Some initial values must be selected in order to get the recursive estimator under way, i.e., at time t = 0 values must be given to the parameter estimates θ̂(0) and the covariance matrix P(0). The latter of these gives an indication of our uncertainty in terms of the estimated parameter values: the covariance matrix of the estimates is given by σ_v²·P(t), where σ_v² = E{v²(t)}, the disturbance variance. Relatively high values, e.g. P(0) = 1000·I, are normal practice; such a choice points to little confidence in our initial parameter estimates θ̂(0) and causes the first few recursions of the estimator to fluctuate wildly before steadier estimates are obtained.
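To make the RLS equations (20), (24) and (25) concrete, here is a compact Python sketch. It is only an illustration: the model orders, the excitation signal and the noise level are assumptions, and the initialization follows the P(0) = 1000·I, θ̂(0) = 0 practice mentioned above.

import numpy as np

def rls(y, u, na, nb, P0=1000.0):
    n = na + nb
    theta = np.zeros(n)                  # theta_hat(0) = 0
    P = P0 * np.eye(n)                   # P(0) = 1000 I
    for t in range(max(na, nb), len(y)):
        # regression vector (11b): [-y(t-1) ... -y(t-na); u(t-1) ... u(t-nb)]
        phi = np.concatenate((-y[t-na:t][::-1], u[t-nb:t][::-1]))
        eps = y[t] - phi @ theta                       # prediction error (18),(19)
        K = P @ phi / (1.0 + phi @ P @ phi)            # gain (24a)
        theta = theta + K * eps                        # parameter update (20)
        P = P - np.outer(K, phi @ P)                   # covariance update (25a)
    return theta

# try it on a simulated second-order ARX process with white equation noise
rng = np.random.default_rng(3)
N = 500
u = np.sign(rng.standard_normal(N))
y = np.zeros(N)
theta_true = np.array([-1.5, 0.7, 1.0, 0.5])           # assumed [a1 a2 b1 b2]
for t in range(2, N):
    phi = np.concatenate((-y[t-2:t][::-1], u[t-2:t][::-1]))
    y[t] = phi @ theta_true + 0.05 * rng.standard_normal()
print(rls(y, u, na=2, nb=2))                           # close to theta_true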

§5. RECURSIVE INSTRUMENTAL VARIABLES (RIV)

When the data vector φ(t) and the disturbance v(t) are uncorrelated, the RLS technique results in parameter estimates which converge to their true values. However, when φ(t) and v(t) are correlated, the expectation is that the estimates will converge to a parameter set which is biased away from the true values.

The method of Instrumental Variables is intended to produce estimates which converge to their true values, whether or not φ(t) and v(t) are correlated (see appendix).

Instrumental Variables employs the prediction error term ε(t) obtained from (18,19):

ε(t) = y(t) - φ^T(t)·θ̂(t-1)

This error is then used in the standard estimate update equation (20), i.e.

θ̂(t) = θ̂(t-1) + K(t)·ε(t)

in a similar fashion to the method of RLS. However, K(t) is obtained from the equation (appendix, equation A14b)

K(t) = P(t-1)·z(t) / [1 + φ^T(t)·P(t-1)·z(t)]    (26)

in which z(t) is the pseudo-data vector defined as

z^T(t) = [-y_m(t-1) ... -y_m(t-na); u(t-1) ... u(t-nb)]    (27)

and found from

y_m(t) = z^T(t)·θ̂(t)    (28)
Also P(t) is calculated from the standard RLS equation (25a):

P(t) = [I - K(t)·φ^T(t)]·P(t-1)    (29a)
     = P(t-1) - P(t-1)·z(t)·φ^T(t)·P(t-1) / [1 + φ^T(t)·P(t-1)·z(t)]    (29b)

In summary then, as far as implementation is concerned, there is no difference between RLS and RIV except for the calculation of the gain K(t), which in the case of RIV is obtained by means of the pseudo-data vector z(t), found in terms of the model output y_m(t).
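A hedged Python sketch of the RIV recursion follows; it assumes the common choice (27)-(28) for the pseudo-data vector, i.e. the instruments are built from the model output y_m driven by the actual input, and it reuses the RLS-style initialization (both assumptions, not prescriptions from the text).

import numpy as np

def riv(y, u, na, nb, P0=1000.0):
    n = na + nb
    theta = np.zeros(n)
    P = P0 * np.eye(n)
    ym = np.zeros(len(y))                    # model output used in z(t)
    for t in range(max(na, nb), len(y)):
        phi = np.concatenate((-y[t-na:t][::-1], u[t-nb:t][::-1]))
        z   = np.concatenate((-ym[t-na:t][::-1], u[t-nb:t][::-1]))   # (27)
        eps = y[t] - phi @ theta                     # prediction error
        denom = 1.0 + phi @ P @ z
        K = P @ z / denom                            # gain (26)
        theta = theta + K * eps                      # update (20)
        P = P - np.outer(P @ z, phi @ P) / denom     # covariance update (29b)
        ym[t] = z @ theta                            # model output (28)
    return theta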

§6. RECURSIVE EXTENDED LEAST SQUARES (RELS)

In the section on ‘System and Signal Models’ it was already indicated that, if it is known a priori that the disturbance v(·) is coloured instead of white noise, we could add flexibility to the model (10) by also modelling the disturbance term v(t). This led to the ARMAX model (15) (with e(·) being white noise):

y(t) + a_1·y(t-1) + ... + a_{na}·y(t-na) = b_1·u(t-1) + ... + b_{nb}·u(t-nb) + e(t) + c_1·e(t-1) + ... + c_{nc}·e(t-nc)    (30)

Let us introduce the data vector

ξ^T(t) = [-y(t-1) ... -y(t-na); u(t-1) ... u(t-nb); e(t-1) ... e(t-nc)]

and the parameter vector

θ^T = [a_1 ... a_{na}; b_1 ... b_{nb}; c_1 ... c_{nc}]

With this notation, (30) can be rewritten as

y(t) = ξ^T(t)·θ + e(t)    (31)

This model looks just like the linear regression (12), and we can try to apply the recursive least squares algorithm (20, 24, 25) to it for estimating θ̂:

θ̂(t) = θ̂(t-1) + P(t)·ξ(t)·[y(t) - ξ^T(t)·θ̂(t-1)]    (32a)

P(t) = P(t-1) - P(t-1)·ξ(t)·ξ^T(t)·P(t-1) / [1 + ξ^T(t)·P(t-1)·ξ(t)]    (32b)

Notice that this would lead to unbiased parameter estimates, as the disturbance e(t) in (31) is white noise. The problem is, of course, that the variables e(·) entering the ξ vector are not measurable, and hence (32) cannot be implemented as it stands. We have to replace the components e(·) with some estimate of them. From (30) we have

e(t) = y(t) + a_1·y(t-1) + ... + a_{na}·y(t-na) - b_1·u(t-1) - ... - b_{nb}·u(t-nb) - c_1·e(t-1) - ... - c_{nc}·e(t-nc)

If we have a sequence of estimates

θ̂^T(t) = [â_1(t) ... â_{na}(t); b̂_1(t) ... b̂_{nb}(t); ĉ_1(t) ... ĉ_{nc}(t)]

available, it seems natural to estimate e(t) by ê(t), computed according to

ê(t) = y(t) + â_1(t)·y(t-1) + ... + â_{na}(t)·y(t-na) - b̂_1(t)·u(t-1) - ... - b̂_{nb}(t)·u(t-nb) - ĉ_1(t)·ê(t-1) - ... - ĉ_{nc}(t)·ê(t-nc)    (33)

With

φ^T(t) = [-y(t-1) ... -y(t-na); u(t-1) ... u(t-nb); ê(t-1) ... ê(t-nc)]    (34)

the equation (33) can be written

ê(t) = y(t) - φ^T(t)·θ̂(t)    (35)

An obvious algorithm for estimating θ̂ is now obtained from (32) by replacing ξ(t) by φ(t), computed according to (34,35). This gives the recursive extended least squares (RELS) algorithm.

An advantage of this algorithm is that it is computationally equivalent to the usual recursive least squares algorithm. The same program can be used, as soon as it is complemented with the recursion (35).
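Below is a minimal Python sketch of RELS along these lines; the orders na, nb, nc, the initialization and the start-up handling are illustrative assumptions. Compared with the RLS sketch of §4, the only changes are the extra residual entries in the data vector (34) and the residual recursion (35).

import numpy as np

def rels(y, u, na, nb, nc, P0=1000.0):
    n = na + nb + nc
    theta = np.zeros(n)
    P = P0 * np.eye(n)
    e_hat = np.zeros(len(y))                         # estimated white noise e_hat(.)
    for t in range(max(na, nb, nc), len(y)):
        phi = np.concatenate((-y[t-na:t][::-1],
                              u[t-nb:t][::-1],
                              e_hat[t-nc:t][::-1]))  # data vector (34)
        eps = y[t] - phi @ theta
        K = P @ phi / (1.0 + phi @ P @ phi)
        theta = theta + K * eps                      # update (32a) with phi
        P = P - np.outer(K, phi @ P)                 # update (32b), in the form (25a)
        e_hat[t] = y[t] - phi @ theta                # residual recursion (35)
    return theta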

§7. TIME VARYING SYSTEMS

Throughout the discussion on recursive techniques this far it has been assumed that
a vector of parameter estimates will converge, under certain conditions, to a vector
which consists of the true values. The underlying implication in this is that the
parameters within the true vector will remain where they are in order to be converged
upon. However, in many practical situations the system under consideration will be
affected by ageing, modifications and unmodelled ambient conditions or unmodelled dynamics. Each of these can cause the ‘actual’ or ‘true’ system parameters to vary
with respect to time. Usually, parameter variations will simply be in terms of a steady
drift; although, where modifications are made or when faults occur, a rapid alteration
can occur. The result of this is that if a recursive parameter estimator is required to
have an up-to-date picture of the system characteristics then it must be able to track
any system parameter variations.

In this section the method of RLS is reconsidered in order to show how it can be
modified to cope with time-varying systems. Similar modifications can however be
made to the other algorithms.

The most straightforward way of dealing with time-varying systems is based on the
reasoning that when the system itself is time varying, information from some time
earlier will not be as representative of the system as the data just obtained, the
earlier information being based on what the system was like in the past, rather than
what it is now like.

A common modification of the original RLS method is thus to weight new data more
heavily than old data. This can be done by including an exponential weighting factor
(called forgetting factor) in the performance index (appendix equation A2):
V[θ] = Σ_{k=1}^{t} λ^{t-k}·[y(k) - φ^T(k)·θ]²    (36)

where λ is the exponential weighting factor, 0 < λ ≤ 1. When λ = 1, all data are weighted equally. For 0 < λ < 1, more weight is placed on recent measurements than on older measurements. Following the derivation shown in the appendix for the original algorithm, the performance index given by (36) results in the following recursive least squares algorithm:

θ̂(t) = θ̂(t-1) + P(t)·φ(t)·[y(t) - φ^T(t)·θ̂(t-1)]    (37a)

P(t) = (1/λ)·{ P(t-1) - P(t-1)·φ(t)·φ^T(t)·P(t-1) / [λ + φ^T(t)·P(t-1)·φ(t)] }    (37b)

It can be seen from (37b) that the effect of the exponential weighting factor λ (which is < 1) is to prevent the elements of P from becoming too small. This maintains the sensitivity of the algorithm and allows new data to continue to affect the parameter estimates. On the other hand, when y and u are close to zero, then P(t-1)·φ(t) → 0, and P(t) ≈ P(t-1)/λ. Hence P grows exponentially until φ changes. Equation (37a) shows how bursts in θ̂(t) can occur for large P, especially when a perturbation signal is introduced. This phenomenon is known as estimator windup or covariance windup.
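A minimal sketch of the exponentially weighted algorithm (37a,b), obtained from the RLS sketch of §4 by inserting λ; the value λ = 0.98 is purely an illustrative assumption.

import numpy as np

def rls_forgetting(y, u, na, nb, lam=0.98, P0=1000.0):
    n = na + nb
    theta = np.zeros(n)
    P = P0 * np.eye(n)
    for t in range(max(na, nb), len(y)):
        phi = np.concatenate((-y[t-na:t][::-1], u[t-nb:t][::-1]))
        eps = y[t] - phi @ theta
        K = P @ phi / (lam + phi @ P @ phi)          # equals P(t)·phi(t) in (37a)
        theta = theta + K * eps                      # update (37a)
        P = (P - np.outer(K, phi @ P)) / lam         # update (37b)
    return theta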

§8. POTENTIAL OPERATING PROBLEMS

A certain amount of tuning and operational experience with parameter estimation algorithms is required to make them successful, since certain operational problems may occur during implementation, due to real world conditions.

When very little excitation of the process signals occurs, as discussed earlier, small
model errors can lead to large parameter changes (see (37a)).

Another particularly difficult problem in estimation occurs when a setpoint change is implemented in a nonlinear system; an equivalent situation arises when an
unmeasured disturbance suddenly changes. This change in operating point imparts a
sudden or jump change to the estimated model parameters, as opposed to the slowly
changing parameters normally assumed (parameter drift).

The performance of a parameter estimation algorithm is a function of the use of the forgetting factor, λ. If λ is selected equal to 1, the algorithm becomes progressively
more insensitive to parameter changes. The sensitivity of the algorithm to parameter
changes can be improved by selecting λ<1. Although this strategy improves the
sensitivity of the algorithm, it has two serious disadvantages.

First, if λ<1, the algorithm is more sensitive to noise, as well as parameter changes,
which causes the parameter estimates to drift erroneously. The quality of the
estimates can be improved if a perturbation signal is added to the process input.

The second disadvantage is that with λ<1, the elements of P may become
excessively large with time. This in turn causes the algorithm to become overly
sensitive to parameter changes and noise, resulting in large fluctuations and drifting
in the parameter estimates.

It is apparent that simply selecting a constant value for λ will yield unsatisfactory
performance for one reason or another. The use of criteria to adapt the value of the
forgetting factor according to the current situation is a must for successful
applications.
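One simple possibility, given purely as an illustrative sketch and not as a prescription from the text, is to keep λ near 1 as long as the prediction error stays at its usual level and to lower it temporarily when the error grows:

def adapt_lambda(eps, sigma_e, lam_min=0.95, lam_max=0.999):
    # illustrative rule: lower lambda when the normalized prediction error is large
    ratio = min(abs(eps) / (3.0 * sigma_e + 1e-12), 1.0)
    return lam_max - (lam_max - lam_min) * ratio

Here sigma_e would be a running estimate of the normal prediction-error level; many other adaptation rules are used in practice.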

§9. APPLICATIONS OF RECURSIVE IDENTIFICATION

To give a better feeling for the role recursive identification plays in applications we
shall consider some problems from different areas.

Example 1 (Ship Steering)


A ship’s heading angle and position are controlled using the rudder angle. For a large
ship, such as a supertanker, this position control could be a fairly difficult problem.
The main reason is that the ship’s response to a change in the rudder angle is so
slow that it is affected by random components of wind and wave motion. Most ships
therefore have an autopilot, i.e. a regulator, which measures relevant variables, and,
based on these and on information about the desired heading angle, determines the
rudder angle. The design of such a regulator must be related to the dynamic
properties of the ship. This can be achieved either by basing its design upon a
mathematical model of the ship or by experimentally “tuning” its parameters until it
yields the desired behavior.

Now the steering dynamics of a ship depends on a number of things. The ship’s
shape and size, its loading and trim, as well as the water depth, are important factors.
Some of these may vary (loading, water depth) during a journey. Obviously, the wind
and wave disturbances that affect the steering may also rapidly change. Therefore
the regulator must be constantly retuned to match the current dynamics of the

system; in fact, it is desirable that the regulator retunes itself. This can be done by
estimating the ship parameters by means of a recursive parameter estimator.

Many control problems exhibit features similar to the foregoing example. Airplanes,
missiles and automobiles have dynamic properties that depend on speed, loading,
etc. The dynamic properties of electric-motor drives change with the load. Machinery
such as that in paper-making plants is affected by many factors that change in an
unpredictable manner.

Chemical process control is another major field of application. The area of adaptive
control is concerned with the study and design of controllers and regulators that
adjust to varying properties of the controlled object. This is currently a very active
research area. A specific technique (EPSAC) is described in a separate text (De
Keyser, 2003).

Example 2 (Short-Term Prediction of Power Demand)


The demand for electrical power from a power system varies over time. The demand
changes in a more or less predictable way with time-of-day and over the course of the week, month or year. There is also, however, a substantial random component in
the demand. The efficient production of electricity requires good predictions of the
power load a few hours ahead, so that the operation of the different power plants in
the system can be effectively coordinated.

Now, prediction of the power demand of course requires some sort of a model of its
random component. It seems reasonable to suppose that the mechanism that
generates this random contribution to the power load depends on circumstances e.g.,
the weather, which themselves may vary with time. Therefore it would be desirable to
use a predictor that adapts itself to changing properties of the signal to be predicted.

The foregoing is an example of adaptive prediction; it has been found that adaptive
prediction can be applied to a wide variety of problems. The operator guide
application described in a separate paper is another example (De Keyser & Van
Cauwenberghe, 1982).

Example 3 (Digital Transmission of Speech)
Consider the transmission of speech over a communication channel. This is now
more often done digitally, which means that the analog speech signal is quantized to
a number of bits, which are transmitted. The transmission line has limited capacity,
and it is important to use it as efficiently as possible. If one predicts the “next sample
value” of the signal both at the transmitter and at the receiver, one need transmit only
the difference between the actual and the predicted value (the “prediction error”).
Since the prediction error is typically much smaller than the signal itself, it requires
fewer bits when transmitted; hence the line is more efficiently used. This technique is
known as predictive coding in communication theory. Now the prediction of the next
value very much depends on the character of the transmitted signal. In the case of
speech, this character significantly varies with the different sounds (phonemes) being
pronounced. Efficient use of the predictive encoding procedure therefore requires
that the predictor is based on real-time recursive identification of the signal
characteristic parameters.

Example 4 (Channel Equalization)


In a communication network the communication channels distort the transmitted
signal. Each channel can be seen as a linear filter with a certain impulse response
that in practice differs from the ideal delta function response. If the distortion is
serious, the signal must be restored at the receiver. This is accomplished by passing
it through a filter whose impulse response resembles the inverse of that of the
channel. Such a filter is known as a channel equalizer. If the properties of the
communication channel are known, this is a fairly straightforward problem. However,
in a network the line between the transmitter and receiver can be quite arbitrary, and
then it is desirable that the equalizer can adapt itself to the actual properties of the
chosen channel.
The adaptive equalizer treated in this example belongs to the wide class of
algorithms commonly known as adaptive signal processing or adaptive filtering.

19
Example 5 (Monitoring and Failure Detection)
Many systems must be constantly monitored to detect possible failures, or to decide
when a repair or replacement must be made. Such monitoring can sometimes be
done by manual intervention. However, in complex highly automated systems with
stringent safety requirements, the monitoring itself must be computerized. This
means that measured signals from the systems must be processed to infer the
current (dynamic) properties of the system: based on this data, it is then decided
whether the system has undergone critical or undesired changes. The procedure
must of course be applied on-line so that any decision is not unnecessarily delayed.

§10. CONCLUSIONS

While the development of on-line estimation algorithms is still an active research area, such algorithms have been successfully implemented in real-life situations.

Based on the results to date, there are several conclusions that can be drawn about
the features of a successful estimation scheme:

1. The Recursive Least Squares (RLS) method is the most popular estimation
technique and appears to exhibit rapid convergence when properly applied.
Recursive Extended Least Squares (RELS) using pseudolinear regression
seems to be a satisfactory way to treat the non-white noise case, although the
parameters in the C-polynomial do not always need to be estimated. In the
latter situation Recursive Instrumental Variables (RIV) might be a good
alternative.

2. A variable forgetting factor is required to keep the estimator running properly. A constant forgetting factor has many drawbacks. Monitoring the
process signals and the prediction error is a suitable means to decide how the
forgetting factor should be adjusted.

3. There is a large application field for recursive parameter estimation, among others adaptive control, fault detection, quality control, speech processing, data filtering, trend forecasting. This is certainly not an exhaustive list.

§11. APPENDIX: Formulae Derivations

We consider the difference equation model (12) (the linear regression):

y(t) = φ^T(t)·θ + v(t)    (A1)

The parameter vector θ is to be estimated from measurements of y(t), φ(t); t = 1, 2, ..., N. A common and natural way is to choose this estimate by minimizing what is left unexplained by the model, the “equation error” v(t). That is, we write down a criterion function

V_N(θ) = (1/N)·Σ_{t=1}^{N} α_t·[y(t) - φ^T(t)·θ]²    (A2)

then we minimize this with respect to θ. Here {α_t} is a sequence of positive numbers. The inclusion of the coefficients α_t in the criterion (A2) allows us to give different weights to different observations. In applications, most often α_t is chosen equal to 1. We already remarked that ŷ(t) = φ^T(t)·θ can be seen as a natural “guess” or “prediction” of y(t), based upon the parameter vector θ. Thus the criterion (A2) can be seen as an attempt to choose a model that produces the best predictions of the output signal. The criterion V_N(θ) is quadratic in θ. Therefore it can be minimized analytically (ref. the off-line Least Squares identification method), which gives

θ̂(N) = [Σ_{t=1}^{N} α_t·φ(t)·φ^T(t)]^{-1} · Σ_{t=1}^{N} α_t·φ(t)·y(t)    (A3)

provided the inverse exists. This is the celebrated least squares estimate. For our current purposes it is important to note that the expression (A3) can be rewritten in a recursive fashion. To prove this, we proceed as follows. Denote
S(t) = Σ_{k=1}^{t} α_k·φ(k)·φ^T(k)

Then, from (A3), we have that

Σ_{k=1}^{t-1} α_k·φ(k)·y(k) = S(t-1)·θ̂(t-1).

From the definition of S(t) it follows that

S(t-1) = S(t) - α_t·φ(t)·φ^T(t).

Hence

θ̂(t) = S^{-1}(t)·[Σ_{k=1}^{t-1} α_k·φ(k)·y(k) + α_t·φ(t)·y(t)]
     = S^{-1}(t)·[S(t-1)·θ̂(t-1) + α_t·φ(t)·y(t)]
     = S^{-1}(t)·{S(t)·θ̂(t-1) + α_t·φ(t)·[-φ^T(t)·θ̂(t-1) + y(t)]}

θ̂(t) = θ̂(t-1) + S^{-1}(t)·φ(t)·α_t·[y(t) - φ^T(t)·θ̂(t-1)]    (A4a)

and

S(t) = S(t-1) + α_t·φ(t)·φ^T(t)    (A4b)
The algorithm (A4) is not, however, well suited for computation as it stands, since a matrix has to be inverted in each time step. It is more natural to introduce P(t) = S^{-1}(t) and update P(t) directly, instead of using (A4b). This is accomplished by the so-called matrix inversion lemma, which we now state.

Matrix Inversion Lemma: let A, B, C and D be matrices of compatible dimensions, so that the product BCD and the sum A + BCD exist. Then

[A + BCD]^{-1} = A^{-1} - A^{-1}·B·[D·A^{-1}·B + C^{-1}]^{-1}·D·A^{-1}    (A5)

Proof: Multiply the right-hand side of (A5) by A + BCD from the right. This gives

I + A^{-1}·B·C·D - A^{-1}·B·[D·A^{-1}·B + C^{-1}]^{-1}·D - A^{-1}·B·[D·A^{-1}·B + C^{-1}]^{-1}·D·A^{-1}·B·C·D
= I + A^{-1}·B·[D·A^{-1}·B + C^{-1}]^{-1}·{[D·A^{-1}·B + C^{-1}]·C·D - D - D·A^{-1}·B·C·D}
= I + A^{-1}·B·[D·A^{-1}·B + C^{-1}]^{-1}·{O} = I

which proves (A5).
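A quick numerical check of (A5), with randomly chosen matrices of arbitrary (assumed) dimensions:

import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # kept well conditioned
B = rng.standard_normal((4, 2))
C = rng.standard_normal((2, 2)) + 2 * np.eye(2)
D = rng.standard_normal((2, 4))

Ai = np.linalg.inv(A)
lhs = np.linalg.inv(A + B @ C @ D)
rhs = Ai - Ai @ B @ np.linalg.inv(D @ Ai @ B + np.linalg.inv(C)) @ D @ Ai
print(np.allclose(lhs, rhs))                       # True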

Applying (A5) to (A4b) with

A = P^{-1}(t-1),  B = φ(t),  C = α_t,  D = φ^T(t)

gives

P(t) = [P^{-1}(t-1) + φ(t)·α_t·φ^T(t)]^{-1}
     = P(t-1) - P(t-1)·φ(t)·[φ^T(t)·P(t-1)·φ(t) + 1/α_t]^{-1}·φ^T(t)·P(t-1)

P(t) = P(t-1) - P(t-1)·φ(t)·φ^T(t)·P(t-1) / [1/α_t + φ^T(t)·P(t-1)·φ(t)]    (A6)
The advantages of (A6) over (A4b) are obvious. The inversion of a square matrix is replaced by inversion of a scalar. From (A6) we also find that

α_t·P(t)·φ(t) = α_t·P(t-1)·φ(t) - α_t·P(t-1)·φ(t)·φ^T(t)·P(t-1)·φ(t) / [1/α_t + φ^T(t)·P(t-1)·φ(t)]
             = P(t-1)·φ(t) / [1/α_t + φ^T(t)·P(t-1)·φ(t)]    (A7)

Thus the least squares estimate θ̂(t) defined by (A3) can be recursively calculated by means of

θ̂(t) = θ̂(t-1) + K(t)·[y(t) - φ^T(t)·θ̂(t-1)]    (A8a)

K(t) = P(t-1)·φ(t) / [1/α_t + φ^T(t)·P(t-1)·φ(t)]    (A8b)

P(t) = P(t-1) - P(t-1)·φ(t)·φ^T(t)·P(t-1) / [1/α_t + φ^T(t)·P(t-1)·φ(t)]    (A8c)

These formulas are known as the recursive least squares (RLS) algorithm. This is one of the most widely used recursive identification methods. It is robust and easily implemented.
Let us here only point out two aspects that must be considered in any application of the algorithm:

- Initial Conditions
Any recursive algorithm requires some initial value to be started up. In (A8) we need θ̂(0) and P(0). Since we derived (A8) from (A3) under the assumption that S(t) is invertible, an exact relationship between these two expressions can hold only if (A8) is initialized at a time t_0 when S(t_0) is invertible. Typically, S(t) becomes invertible at time t_0 = dim φ(t) = dim θ(t). Thus, strictly speaking, the proper initial values for (A8) are obtained if we start the recursion at time t_0, for which

P(t_0) = [Σ_{k=1}^{t_0} α_k·φ(k)·φ^T(k)]^{-1}
θ̂(t_0) = P(t_0)·Σ_{k=1}^{t_0} α_k·φ(k)·y(k)

It is more common, though, to start the recursion at t = 0 with some invertible matrix P(0) and a vector θ̂(0). The estimates resulting from (A8) are then

θ̂(t) = [P^{-1}(0) + Σ_{k=1}^{t} α_k·φ(k)·φ^T(k)]^{-1} · [P^{-1}(0)·θ̂(0) + Σ_{k=1}^{t} α_k·φ(k)·y(k)]    (A9)

This can be seen by verifying that (A9) obeys the recursion (A8) with these initial conditions.

By comparing (A9) to (A3), we see that the relative importance of the initial values decays with time, as the magnitudes of the sums increase. Also, as P^{-1}(0) → 0 the recursive estimate approaches the off-line estimate. Therefore, a common choice of initial values is to take P(0) = c·I and θ̂(0) = 0, where c is some large constant.
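The equivalence between the recursion (A8) started from (θ̂(0), P(0)) and the closed-form expression (A9) is easy to verify numerically; in the sketch below α_t ≡ 1 and all data are arbitrary illustrative assumptions.

import numpy as np

rng = np.random.default_rng(5)
N, n = 50, 3
Phi = rng.standard_normal((N, n))                   # rows are phi^T(t)
yv = Phi @ np.array([1.0, -0.5, 2.0]) + 0.1 * rng.standard_normal(N)

theta, P = np.zeros(n), 100.0 * np.eye(n)           # theta_hat(0), P(0)
P0_inv, theta0 = np.linalg.inv(P), theta.copy()
for t in range(N):
    phi = Phi[t]
    K = P @ phi / (1.0 + phi @ P @ phi)             # (A8b) with alpha_t = 1
    theta = theta + K * (yv[t] - phi @ theta)       # (A8a)
    P = P - np.outer(K, phi @ P)                    # (A8c)

batch = np.linalg.solve(P0_inv + Phi.T @ Phi,
                        P0_inv @ theta0 + Phi.T @ yv)   # closed form (A9)
print(np.allclose(theta, batch))                    # True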

- Asymptotic Properties
To investigate how the estimate (A3) behaves when N becomes large, we assume that the data actually have been generated by

y(t) = φ^T(t)·θ_0 + v(t)    (A10)

Inserting this expression for y(t) into (A3) gives

θ̂(N) = [Σ_{t=1}^{N} α_t·φ(t)·φ^T(t)]^{-1} · {Σ_{t=1}^{N} α_t·[φ(t)·φ^T(t)·θ_0 + φ(t)·v(t)]}
     = θ_0 + [(1/N)·Σ_{t=1}^{N} α_t·φ(t)·φ^T(t)]^{-1} · (1/N)·Σ_{t=1}^{N} α_t·φ(t)·v(t)    (A11)

Desired properties of θ̂(N) would be that it is close to θ_0, and that it converges to θ_0 as N → ∞. We see that if the “disturbance” v(t) in (A10) is small compared to φ(t), then θ̂(N) will be close to θ_0. The sum (1/N)·Σ_{t=1}^{N} α_t·φ(t)·v(t) will, under weak conditions, converge to its expected value as N → ∞, according to the law of large numbers. This expected value depends on the correlation between the disturbance term v(t) and the data vector φ(t). It is zero only if v(t) and φ(t) are uncorrelated. This is true if {v(t)} is a sequence of uncorrelated random variables with zero mean values (white noise). Then v(t) does not depend on what happened up to time t-1, and hence E{v(t)·φ(t)} = O.

When {v(t)} is not white noise, then (usually) E{v(t)·φ(t)} ≠ O. This follows since φ(t) contains y(t-1), while y(t-1) contains the term v(t-1) that is correlated with v(t). This means that we may expect θ̂(N) not to tend to θ_0 as N → ∞, i.e. the estimates are biased.
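This bias mechanism can be checked numerically. The sketch below (all numbers are illustrative assumptions) generates data from a first-order model whose disturbance is a moving average of white noise, so that v(t) is correlated with y(t-1), and computes the least squares estimate (A3) with α_t ≡ 1; the estimate of a_1 then settles away from its true value even for a long data record.

import numpy as np

rng = np.random.default_rng(6)
N = 20000
a1, b1, c1 = -0.8, 1.0, 0.7                 # theta_0 = [a1, b1]; v(t) = e(t) + c1 e(t-1)
e = rng.standard_normal(N)
u = np.sign(rng.standard_normal(N))
y = np.zeros(N)
for t in range(1, N):
    v = e[t] + c1 * e[t-1]                  # colored equation noise
    y[t] = -a1 * y[t-1] + b1 * u[t-1] + v

Phi = np.column_stack((-y[:-1], u[:-1]))    # phi^T(t) = [-y(t-1), u(t-1)]
theta_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ y[1:])
print(theta_ls)                             # the a1 estimate is biased away from -0.8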

One technique to overcome the bias problem is to replace φ(t) in (A11) by a vector z(t) such that z(t) and v(t) are uncorrelated. That is, instead of (A3) we try the estimate (here we take α_t ≡ 1):

θ̂(N) = [Σ_{t=1}^{N} z(t)·φ^T(t)]^{-1} · Σ_{t=1}^{N} z(t)·y(t)    (A12)

By inserting the expression (A10) for y(t) into (A12) we obtain:

θ̂(N) = [Σ_{t=1}^{N} z(t)·φ^T(t)]^{-1} · Σ_{t=1}^{N} [z(t)·φ^T(t)·θ_0 + z(t)·v(t)]
     = θ_0 + [(1/N)·Σ_{t=1}^{N} z(t)·φ^T(t)]^{-1} · (1/N)·Σ_{t=1}^{N} z(t)·v(t)    (A13)

We see that θ̂(N) is likely to tend to θ_0 as N → ∞ under the following three conditions:
• z(t) and v(t) are uncorrelated
• v(t) has zero mean
• the matrix lim_{N→∞} (1/N)·Σ_{t=1}^{N} z(t)·φ^T(t) is invertible.
The estimate (A12) is known as the instrumental variable (IV) estimate. The vectors
z(t ) are referred to as the instrumental variables.

It is obvious that the estimate (A12) can be rewritten in a recursive fashion, just as
the least squares estimate in (A8). We then find that

θ̂(t) = θ̂(t-1) + K(t)·[y(t) - φ^T(t)·θ̂(t-1)]    (A14a)

K(t) = P(t-1)·z(t) / [1 + φ^T(t)·P(t-1)·z(t)]    (A14b)

P(t) = P(t-1) - P(t-1)·z(t)·φ^T(t)·P(t-1) / [1 + φ^T(t)·P(t-1)·z(t)]    (A14c)

We have not yet discussed the choice of the instrumental variables z(t). Loosely speaking, they should be sufficiently correlated with φ(t) to ensure the invertibility condition, but uncorrelated with the system noise terms. A common choice is

z(t) = [-y_m(t-1) ... -y_m(t-na); u(t-1) ... u(t-nb)]

where y_m(t) is the output of a deterministic system driven by the actual input u(t):

y_m(t) + a_1·y_m(t-1) + ... + a_{na}·y_m(t-na) = b_1·u(t-1) + ... + b_{nb}·u(t-nb)    (A15)

For the recursive algorithm (A14) an often used approach is to let a_i and b_i be time-dependent. Then the current estimates â_i(t), b̂_i(t) obtained from (A14) can be used at time t in (A15). That is, we can write:

y_m(t) = z^T(t)·θ̂(t)    (A16)
§12. REFERENCES:

K. Åström, P. Eykhoff (1971). “System Identification – A Survey”, Automatica 7, 123-167.
G. Bierman (1977). “Factorization Methods for Discrete Sequential Estimation”,
Academic Press, New York.
R. De Keyser (2003). “Model Based Predictive Control”. Invited Chapter in UNESCO
Encyclopaedia of Life Support Systems (EoLSS). Article contribution
6.43.16.1, Eolss Publishers Co Ltd, Oxford, ISBN 0 9542 989 18-26-34
(www.eolss.net).
R. De Keyser, A. Van Cauwenberghe (1982). “A Self-Tuning Multistep Predictor –
Application as an Operator Guide in Blast Furnace Control”, Automatica 17(1),
167-174
L. Ljung (1999). “System Identification: Theory for the user”, Prentice Hall, NJ
L. Ljung, T. Söderström (1983). “Theory and Practice of Recursive Identification”, MIT
Press, Cambridge Massachusetts
P. Young (1984). “Recursive Estimation and Time-Series Analysis”, Springer-Verlag,
Berlin
