
IEEE Instrumentation and Measurement Technology Conference
Budapest, Hungary, May 21-23, 2001

Black-box Models from Input-output Measurements

Lennart Ljung
Div. of Automatic Control
Linköping University
SE-58183 Linköping, Sweden
email: ljung@isy.liu.se

Abstract - A black-box model of a system is one that does not use any particular prior knowledge of the character or physics of the relationships involved. It is therefore more a question of "curve-fitting" than "modelling". In this presentation several examples of such black-box model structures will be given. Both linear and non-linear structures are treated. Relationships between linear models, fuzzy models, neural networks and classical non-parametric models are discussed. Some reasons for the usefulness of these model types will also be given.

Ways to fit black box structures to measured input-output data are described, as well as the more fundamental (statistical) properties of the resulting models.

I. INTRODUCTION: MODELS OF DIFFERENT COLORS

At the heart of any estimation problem is to select a suitable model structure. A model structure is a parameterized family of candidate models of some sort, within which the search for a model is conducted.

A basic rule in estimation is not to estimate what you already know. In other words, one should utilize prior knowledge and physical insight about the system when selecting the model structure. It is customary to color-code - in shades of grey - the model structure according to what type of prior knowledge has been used:

- White-box models: This is the case when a model is perfectly known; it has been possible to construct it entirely from prior knowledge and physical insight.
- Grey-box models: This is the case when some physical insight is available, but several parameters remain to be determined from observed data. It is useful to consider two sub-cases:
  - Physical Modeling: A model structure can be built on physical grounds, which has a certain number of parameters to be estimated from data. This could, e.g., be a state space model of given order and structure.
  - Semi-physical modeling: Physical insight is used to suggest certain nonlinear combinations of measured data signals. These new signals are then subjected to model structures of black box character.
- Black-box models: No physical insight is available or used, but the chosen model structure belongs to families that are known to have good flexibility and have been "successful in the past".

This paper deals with black-box models for dynamical systems, for which inputs and outputs can be measured. We shall deal with both linear and non-linear systems.

First, in Section II we list the basic features of black-box modeling in terms of a simple example. In Section III we outline the basic estimation techniques used. Section IV deals with the basic trade-offs to decide the size of the structure used (essentially how many parameters are to be used: the "fineness of approximation"). So far the discussion is quite independent of the particular structure used. In Section V we turn to particular examples of linear black-box models and to issues of how to choose between the possibilities, while Section VI similarly deals with non-linear black-box models for dynamical systems.

II. BLACK-BOX MODELS: BASIC FEATURES

To bring out the basic features of a black-box estimation problem, let us study a simple example. Suppose the problem is to estimate an unknown function $g_0(x)$, $-1 \le x \le 1$. The observations we have are noisy measurements $y(k)$ at points $x_k$, which we may or may not choose ourselves:

    y(k) = g_0(x_k) + e(k)                                   (1)

How to approach this problem? One way or another we must decide "where to look for" $g$. We could, e.g., have the information that $g$ is a third order polynomial. This would lead to the - in this case - grey box model structure

    g(x, \theta) = \theta_1 + \theta_2 x + \theta_3 x^2 + \cdots + \theta_n x^{n-1}      (2)

with $n = 4$, and we would estimate the parameter vector $\theta$ from the observations $y$, using e.g. the classical least squares method.
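As a concrete illustration (added here, not part of the original paper): since the predictor (2) is linear in $\theta$, the grey-box fit with $n = 4$ is an ordinary linear least squares problem. A minimal sketch in Python/NumPy, where the "true" function, the noise level and the sample points are all invented for the example:

```python
import numpy as np

# Invented example setup: noisy samples y(k) = g0(x_k) + e(k) as in (1)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)                  # sample points x_k
g0 = lambda x: 1.0 - 2.0 * x + 0.5 * x**3         # assumed "true" third order polynomial
y = g0(x) + 0.1 * rng.standard_normal(x.size)     # noisy observations

# Regression matrix for the model (2) with n = 4: columns 1, x, x^2, x^3
Phi = np.column_stack([x**j for j in range(4)])

# Classical least squares estimate of theta
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta)  # approximately [1.0, -2.0, 0.0, 0.5]
```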

Now suppose that we have no structural information at all about $g$. We would then still have to assume something about it, e.g. that it is an analytic function, or that it is piecewise constant, or something like that. In this situation, we could still use (2), but now as a black-box model: if we assume $g$ to be analytic we know that it can be approximated arbitrarily well by a polynomial. The necessary order $n$ would not be known, and we would have to find a good value of it using some suitable scheme.

Note that there are several alternatives in this black-box situation: We could use rational approximations:

    g(x, \theta) = \frac{\theta_1 + \theta_2 x + \theta_3 x^2 + \cdots + \theta_n x^{n-1}}{1 + \theta_{n+1} x + \theta_{n+2} x^2 + \cdots + \theta_{n+m-1} x^{m-1}}      (3)

or Fourier series expansions

    g(x, \theta) = \theta_0 + \sum_{\ell=1}^{n} \left( \theta_{2\ell-1} \cos(\ell\pi x) + \theta_{2\ell} \sin(\ell\pi x) \right)      (4)

Alternatively, we could approximate the function by piecewise constant functions, as illustrated in Figure 1. We shall in Section VI return to a formal model description of the parameterization in this figure.

Fig. 1. A piece-wise constant approximation

It is now clear that the basic steps of black-box modeling are as follows:

1. Choose the "type" of model structure class. (For example, in the case above, Fourier transform, rational function, or piecewise constant.)
2. Determine the "size" of this model (i.e. the number of parameters, $n$). This will correspond to how "fine" the approximation is.
3. Use observed data both to estimate the numerical values of the parameters and to select a suitable value of $n$.

We shall generally denote a model structure for the observations by

    \hat{y}(k|\theta) = g(x_k, \theta)      (5)

It is to be interpreted so that $\hat{y}(k|\theta)$ is the predicted value of the observation $y(k)$, assuming the function can be described by the parameter vector $\theta$. We shall also generally denote all measurements available up to time $N$ by

    Z^N      (6)

III. ESTIMATION TECHNIQUES AND BASIC PROPERTIES

In this section we shall deal with issues that are independent of model structure. Principles and algorithms for fitting models to data, as well as the general properties of the estimated models, are all model-structure independent and equally well applicable to, say, linear ARMAX models and Neural Network models, to be discussed later in this paper.

A. Criterion of Fit

It suggests itself that the basic least-squares like approach is a natural approach, even when the predictor $\hat{y}(t|\theta)$ is a more general function of $\theta$:

    \hat{\theta}_N = \arg\min_\theta V_N(\theta, Z^N)      (7)

where

    V_N(\theta, Z^N) = \frac{1}{N} \sum_{t=1}^{N} \| y(t) - \hat{y}(t|\theta) \|^2      (8)

We shall also use the following notation for the discrepancy between measurement and predicted value:

    \varepsilon(t, \theta) = y(t) - \hat{y}(t|\theta)      (9)

This procedure is natural and pragmatic - we can think of it as "curve-fitting" between $y(t)$ and $\hat{y}(t|\theta)$. It also has several statistical and information theoretic interpretations. Most importantly, if the noise source in the system is supposed to be a Gaussian sequence of independent random variables $\{e(t)\}$, then (7) becomes the Maximum Likelihood estimate (MLE).

If $\hat{y}(t|\theta)$ is a linear function of $\theta$, the minimization problem (7) is easily solved. In more general cases the minimization will have to be carried out by an iterative (local) search for the minimum:

    \hat{\theta}^{(i+1)} = \hat{\theta}^{(i)} + \mu_i f(\hat{\theta}^{(i)}, Z^N)      (10)

where $f$ typically is related to the gradient of $V_N$, like the Gauss-Newton direction. See, e.g., [1], Chapter 10.

It is also quite useful to work with a modified criterion

    W_N(\theta, Z^N) = V_N(\theta, Z^N) + \delta \|\theta\|^2      (11)

with $V_N$ defined by (8). This is known as regularization. It may be noted that stopping the iterations (i) in (10) before the minimum has been reached has the same effect as regularization. See, e.g., [2].

B. Convergence as N \to \infty

An essential question is, of course, what will be the properties of the estimate resulting from (8). These will naturally depend on the properties of the data record $Z^N$. It is in general a difficult problem to characterize the quality of $\hat{\theta}_N$ exactly. One formally has to be content with the asymptotic properties of $\hat{\theta}_N$ as the number of data, $N$, tends to infinity.

It is an important aspect of the general identification method (8) that the asymptotic properties of the resulting estimate can be expressed in general terms for arbitrary model parameterizations.

The first basic result is the following one:

    \hat{\theta}_N \to \theta^* \text{ as } N \to \infty, \text{ where}      (12)

    \theta^* = \arg\min_\theta E \| \varepsilon(t, \theta) \|^2      (13)

That is, as more and more data become available, the estimate converges to that value $\theta^*$ that would minimize the expected value of the "norm" of the prediction errors. This is in a sense the best possible approximation of the true system that is available within the model structure. The expectation $E$ in (13) is taken with respect to all random disturbances that affect the data, and it also includes averaging over the "input properties" (in the example of Section II, this would be the distribution of the values of $x_k$). This means, in particular, that $\theta^*$ will make $\hat{y}(t|\theta^*)$ a good approximation of $y(t)$ with respect to those aspects of the system that are enhanced by the conditions at hand when the data were collected.

C. Asymptotic Distribution

Once the convergence issue has been settled, the next question is how fast the limit is approached. This is dealt with by considering the asymptotic distribution of the estimate. The basic result is the following one: If $\{\varepsilon(t, \theta^*)\}$ is approximately white noise, then the random vector $\sqrt{N}(\hat{\theta}_N - \theta^*)$ converges in distribution to the normal distribution with zero mean, and the covariance matrix of $\hat{\theta}_N$ is approximately given by

    P_\theta = \lambda \left[ E \psi(t) \psi^T(t) \right]^{-1}      (14)

where

    \lambda = E \varepsilon^2(t, \theta^*), \qquad \psi(t) = \frac{d}{d\theta} \hat{y}(t|\theta) \Big|_{\theta = \theta^*}      (15)

This means that the convergence rate of $\hat{\theta}_N$ towards $\theta^*$ is $1/\sqrt{N}$. Think of $\psi$ as the sensitivity derivative of the predictor with respect to the parameters. It is also used in the actual numerical search algorithm (10). Then (14) says that the covariance matrix for $\hat{\theta}_N$ is proportional to the inverse of the covariance matrix of this sensitivity derivative. This is a quite natural result.

The result (14) - (15) is general and holds for all model structures, both linear and non-linear ones, subject only to some regularity and smoothness conditions. It is also fairly natural, and will give the guidelines for all user choices involved in the process of identification. Of particular importance is that the asymptotic covariance matrix (14) equals the Cramer-Rao lower bound if the disturbances are Gaussian. That is to say, prediction error methods give the optimal asymptotic properties. See [1] for more details around this.
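In practice (14)-(15) are used with sample averages replacing expectations. A sketch (added for illustration, with all names invented) that estimates the covariance of $\hat{\theta}_N$ from the residuals and numerically computed sensitivity derivatives $\psi(t)$:

```python
import numpy as np

def covariance_estimate(theta_hat, predict, y, u, h=1e-6):
    """Sample version of (14)-(15): Cov(theta_hat) ~ lambda * [sum_t psi psi^T]^{-1}.

    predict(theta, u) -> predicted outputs; theta_hat is the minimizer of (8).
    """
    N, d = y.size, theta_hat.size
    eps = y - predict(theta_hat, u)            # residuals (9) at the estimate
    lam = np.mean(eps**2)                      # noise variance estimate, cf. (15)
    psi = np.zeros((N, d))                     # sensitivity derivatives psi(t)
    for j in range(d):
        dp = np.zeros(d); dp[j] = h
        psi[:, j] = (predict(theta_hat + dp, u) - predict(theta_hat - dp, u)) / (2 * h)
    # Covariance of theta_hat: P/N with P as in (14), estimated by sample means
    return lam * np.linalg.inv(psi.T @ psi)

# parameter standard errors: np.sqrt(np.diag(covariance_estimate(...)))
```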
IV. CHOICE OF "TYPE" AND "SIZE" OF MODEL

A. Bias-Variance Trade-off

The obtained model $g(x, \hat{\theta}_N)$ will be in error in two ways:

1. First, there will be a discrepancy between the limit model $g(x, \theta^*)$ and the true function $g_0(x)$, since our structure assumptions are not correct, e.g. the function is not piecewise constant. This error we call a bias error, or a model-mismatch error.
2. Second, there will be a discrepancy between the actual estimate $\hat{\theta}_N$ and the limit value. This is due to the noise-corrupted measurements (the term $e(k)$ in (1)). This error will be called a variance error, and can be measured by the covariance matrix (14).

It should be clear that there is a trade-off between these aspects: To get a smaller bias error, we should, e.g., use a finer grid for the piece-wise approximation. This in turn would mean that fewer observed data can be used to estimate each level of the piece-wise function approximation, which leads to a larger variance for these estimates.

To deal with the bias-variance trade-off is thus at the heart of black-box model structure selection.

Some quite general expressions for the expected model fit, that are independent of the model structure, can be developed, and these will be dealt with in this section.

B. An Expression for the Expected Mean-Square Error

Let us measure the (average) fit between any model (5) and the true system as

    \bar{V}(\theta) = E |y(t) - \hat{y}(t|\theta)|^2      (16)

Here expectation $E$ is over the data properties (i.e. expectation over "$Z^\infty$" with the notation (6)).

Before we continue, let us note the very important aspect that the fit $\bar{V}$ will depend not only on the model and the true system, but also on data properties. In the simple case of Section II this would be the distribution of the observation points

$x_k$, $k = 1, 2, \ldots$. In the case of dynamical systems it involves input spectra, possible feedback, etc. We shall say that the fit depends on the experimental conditions.

The estimated model parameter $\hat{\theta}_N$ is a random variable, because it is constructed from observed data that can be described as random variables. To evaluate the model fit, we then take the expectation of $\bar{V}(\hat{\theta}_N)$ with respect to the estimation data. That gives our measure

    F_N = E \bar{V}(\hat{\theta}_N)      (17)

The rather remarkable fact is that if $F_N$ is evaluated for data with the same properties as those of the estimation data, then, asymptotically in $N$ (see, e.g., [1], Chapter 16),

    F_N \approx \bar{V}(\theta^*) \left( 1 + \frac{\dim \theta}{N} \right)      (18)

Here $\theta^*$ is the value that minimizes the expected value of the criterion (8). The notation $\dim \theta$ means the number of estimated parameters. The result also assumes that the model structure is successful in the sense that $\varepsilon(t)$ is approximately white noise.

It is quite important to note that the number $\dim \theta$ in (18) will be changed to the number of eigenvalues of $\bar{V}''(\theta)$ (the Hessian of $\bar{V}$) that are larger than $\delta$ in case the regularized loss function (11) is minimized to determine the estimate. We can think of this number as the efficient number of parameters. In a sense, we are "offering" more parameters in the structure than are actually "used" by the data in the resulting model.

Despite the reservations about the formal validity of (18), it carries a most important conceptual message: If a model is evaluated on a data set with the same properties as the estimation data, then the fit will not depend on the data properties, and it will depend on the model structure only in terms of the number of parameters used and of the best fit offered within the structure.

The expression (18) clearly shows the trade-off between variance and bias. The more parameters used by the structure (corresponding to a higher dimension of $\theta$ and/or a lower value of the regularization parameter $\delta$), the higher the variance term, but at the same time the lower the fit $\bar{V}(\theta^*)$. The trade-off is thus to increase the efficient number of parameters only to that point where the improvement of fit per parameter exceeds $\bar{V}(\theta^*)/N$. This can be achieved by estimating $F_N$ in (17) by evaluating the loss function at $\hat{\theta}_N$ for a validation data set. It can also be achieved by Akaike (or Akaike-like) procedures, [3], balancing the variance term in (18) against the fit improvement.

The expression can be rewritten as follows. Let $\hat{y}_0(t)$ denote the "true" one step ahead prediction of $y(t)$, and let

    W(\theta) = E |\hat{y}_0(t) - \hat{y}(t|\theta)|^2      (19)

and let

    \lambda = E |y(t) - \hat{y}_0(t)|^2      (20)

In Section II this is just the variance of $e(t)$. In the more general cases of dynamic models to be considered later, $\lambda$ is the innovations variance, i.e., that part of $y(t)$ that cannot be predicted from the past. Moreover, $W(\theta^*)$ is the bias error, i.e. the discrepancy between the true predictor and the best one available in the model structure. Under the same assumptions as above, (18) can be rewritten as

    F_N \approx \lambda + W(\theta^*) + \lambda \frac{\dim \theta}{N}      (21)

The three terms constituting the model error then have the following interpretations:

- $\lambda$ is the unavoidable error, stemming from the fact that the output cannot be exactly predicted, even with perfect system knowledge.
- $W(\theta^*)$ is the bias error. It depends on the model structure, and on the experimental conditions. It will typically decrease as $\dim \theta$ increases.
- The last term is the variance error. It is proportional to the (efficient) number of estimated parameters and inversely proportional to the number of data points. It does not depend on the particular model structure or the experimental conditions.

C. A Procedure in Practice

A pragmatic and quite useful way to strike a good balance between variance and bias, i.e., to minimize (21) w.r.t. $\dim \theta$, is as follows (a sketch in code follows below):

- Split the observed data into an estimation set and a validation set.
- Estimate models, using the estimation data set, for a number of different sizes of the parameter vector.
- Compute the value of the criterion for each of these models, using the validation data.
- Pick as your final model the one that minimizes the fit for validation data.

An alternative is to offer a large number of parameters in a fixed model structure and use the regularized criterion (11). Estimate several models, as you turn the "knob" $\delta$ from larger values towards zero, and evaluate the obtained models as above.
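A minimal sketch of this procedure (added for illustration, using the polynomial class (2) and invented data): the order $n$ is chosen by the validation fit, not by the estimation fit, which by (18) would keep improving with $n$.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=200)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(x.size)   # invented "true system" + noise

# Split into estimation and validation sets
xe, ye, xv, yv = x[:100], y[:100], x[100:], y[100:]

best_n, best_fit = None, np.inf
for n in range(1, 12):                                   # candidate model sizes dim(theta) = n
    Phi_e = np.column_stack([xe**j for j in range(n)])   # polynomial structure (2)
    theta, *_ = np.linalg.lstsq(Phi_e, ye, rcond=None)   # fit on estimation data
    Phi_v = np.column_stack([xv**j for j in range(n)])
    fit_v = np.mean((yv - Phi_v @ theta)**2)             # criterion (8) on validation data
    if fit_v < best_fit:
        best_n, best_fit = n, fit_v
print(best_n, best_fit)
```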
V. LINEAR BLACK-BOX MODELS

A. Linear Models and Estimating Frequency Functions

A linear system is uniquely defined and described by its frequency function $G(e^{i\omega})$ (i.e. the Fourier transform of its impulse response). We could therefore link estimation of linear systems directly to the function estimation problem (1), taking

$x_k = e^{i\omega_k}$ and allowing $g$ to be complex-valued. With observations of the input-output being directly taken from, or transformed to, the frequency domain ($y$ would here be uncertain observations of the frequency response at certain frequencies), we have a straightforward function estimation problem.

One question is then how to parameterize $G(e^{i\omega})$. It could be done by piece-wise constant functions, which is closely related to standard spectral analysis (see Problem 7G.2 in [1]). See also e.g. [4]. Otherwise it is more common to parameterize them as rational functions as in (3). Direct fits of rational frequency functions to observed frequency domain data are treated comprehensively in [5], and in, e.g., Section 7.7 of [1].

B. Time-domain Data and General Linear Models

If the observations $y$ to be used for the model fit are input-output data in the time domain, we proceed as follows: Assume that the data have been generated according to

    y(t) = G(q, \theta) u(t) + H(q, \theta) e(t)      (22)

where $e$ is white noise (unpredictable), $q$ is the forward shift operator and $H$ is monic (that is, its expansion in $q^{-1}$ starts with the identity matrix). We also assume that $G$ contains a delay. Rewrite (22) as

    y(t) = [1 - H^{-1}(q, \theta)] y(t) + H^{-1}(q, \theta) G(q, \theta) u(t) + e(t)      (23)

Since $e(t)$ is unpredictable, the natural predictor is

    \hat{y}(t|\theta) = [1 - H^{-1}(q, \theta)] y(t) + H^{-1}(q, \theta) G(q, \theta) u(t)      (24)

It will be required that $\theta$ is constrained to values such that the filters $H^{-1}G$ and $H^{-1}$ are stable.

C. Linear Input-output Black-box Models

In the black-box case, a very natural approach is to describe $G$ and $H$ in (22) as rational transfer functions in the shift (delay) operator with unknown numerator and denominator polynomials, as in (3). We would then have

    G(q, \theta) = \frac{B(q)}{F(q)}      (25)

with

    B(q) = b_1 q^{-n_k} + b_2 q^{-n_k-1} + \cdots + b_{n_b} q^{-n_k-n_b+1}      (26)
    F(q) = 1 + f_1 q^{-1} + \cdots + f_{n_f} q^{-n_f}      (27)

Then $w(t) = G(q, \theta) u(t)$ is a shorthand notation for the relationship

    w(t) + f_1 w(t-1) + \cdots + f_{n_f} w(t - n_f) = b_1 u(t - n_k) + \cdots + b_{n_b} u(t - (n_b + n_k - 1))      (28)

Here, there is a time delay of $n_k$ samples.

In the same way the disturbance transfer function can be written

    H(q, \theta) = \frac{C(q)}{D(q)}      (29)

The parameter vector $\theta$ thus contains the coefficients $b_i$, $c_i$, $d_i$, and $f_i$ of the transfer functions. This model is thus described by five structural parameters: $n_b$, $n_c$, $n_d$, $n_f$, and $n_k$, and is known as the Box-Jenkins (BJ) model.

An important special case is when the properties of the disturbance signals are not modeled, and the noise model $H(q)$ is chosen to be $H(q) \equiv 1$; that is, $n_c = n_d = 0$. This special case is known as an output error (OE) model, since the noise source $e(t)$ will then be the difference (error) between the actual output and the noise-free output $w(t)$.

A common variant is to use the same denominator for $G$ and $H$:

    F(q) = D(q) = A(q) = 1 + a_1 q^{-1} + \cdots + a_{n_a} q^{-n_a}      (30)

Multiplying (22) by $A(q)$ then gives

    A(q) y(t) = B(q) u(t) + C(q) e(t)      (31)

This model is known as the ARMAX model. The name is derived from the fact that $A(q)y(t)$ represents an AutoRegression and $C(q)e(t)$ a Moving Average of white noise, while $B(q)u(t)$ represents an extra input (or, with econometric terminology, an eXogenous variable).

The special case $C(q) = 1$ gives the much used ARX model. In Figure 2 the different models are depicted.
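The ARX case is particularly convenient: its predictor is linear in $\theta$, so (7) is again an ordinary least squares problem. A sketch with invented data and orders ($n_a = n_b = 2$, $n_k = 1$), not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500
u = rng.standard_normal(N)
a, b = [-1.5, 0.7], [1.0, 0.5]                    # invented "true" ARX coefficients
y = np.zeros(N)
for t in range(2, N):                             # A(q)y = B(q)u + e with na = nb = 2, nk = 1
    y[t] = (-a[0] * y[t-1] - a[1] * y[t-2]
            + b[0] * u[t-1] + b[1] * u[t-2] + 0.1 * rng.standard_normal())

# ARX predictor yhat(t|theta) = -a1*y(t-1) - a2*y(t-2) + b1*u(t-1) + b2*u(t-2)
# is linear in theta = (a1, a2, b1, b2): form the regression matrix
Phi = np.column_stack([-y[1:N-1], -y[0:N-2], u[1:N-1], u[0:N-2]])
theta, *_ = np.linalg.lstsq(Phi, y[2:N], rcond=None)
print(theta)   # approximately [-1.5, 0.7, 1.0, 0.5]
```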
D. Linear State-space Black-box Models

An important use of state-space models is to incorporate any available physical insight into the system. They can however also be used for black-box modeling, both in continuous and discrete time. A discrete time innovations form is

    x(t+1) = A(\theta) x(t) + B(\theta) u(t) + K(\theta) e(t)      (32)
    y(t) = C(\theta) x(t) + e(t)      (33)

If the parameterization is done so that (32) covers all linear systems of a certain order as the parameter $\theta$ ranges over a certain set, then this will give a black-box model. For example, if $\theta$ parameterizes every matrix element in an independent way, this will clearly correspond to a black box model. We shall
call that a full parameterization. The only structure parameter to determine is then the model order, $\dim x$. While this may be wasteful in the sense that $\dim \theta$ will be much larger than necessary, it gives the important advantage that it also covers all choices of coordinate basis in the state space, and thus includes those parameterizations that are best conditioned.

(A canonical parameterization only needs $\dim \theta = n(2p + m)$ parameters, with $n = \dim x$ and $p$ and $m$ being the number of inputs and outputs. On the other hand, some overlapping such parameterizations will be necessary to cover all $n$:th order systems if $p > 1$, and they will typically not be very well conditioned. See e.g. Appendix 4A in [1].)

An advantage with a full parameterization of (32) is that so called subspace methods (see e.g. [6]) allow efficient non-iterative estimation of all the matrices involved. The model can then be further refined by minimization over $\theta$ in the criterion (8). This can still be done efficiently, even if the parameterization is full, see [7].

E. Some Practical Aspects

To choose the order(s) of the linear black box models one essentially follows the procedure described at the end of Section IV. That is, a number of different orders are tried out, and the resulting models' ability to reproduce validation data is inspected. For ARX models, this can be done very efficiently over many models simultaneously.

Fig. 2. The different linear black-box models (ARX, OE, ARMAX, BJ)

The choice between the different members of the black-box families in Figure 2 can be guided by the following comments:

- Models that have a common denominator for $G$ and $H$, that is ARMAX and ARX, are suitable when the dominating disturbances are load disturbances that enter through the same dynamics as the input.
- The BJ model is suitable when there is no natural connection between the system's dynamics and the disturbances (like measurement noise).
- The OE model has the advantage that a consistent estimate of $G$ can be obtained even if the disturbance properties are not correctly modeled (provided the system operates in open loop).

VI. NON-LINEAR BLACK-BOX MODELS

There is clearly a wide variety of non-linear models. One possibility that allows inclusion of detailed physical prior information is to build non-linear state space models, analogous to (32). Another possibility of the grey-box modeling type is to use the semi-physical modeling mentioned in the introduction. This section however deals with black-box models. We follow the presentation in Chapter 5 of [1], to which we refer for any details.

A. Non-linear Black-box Models

We shall generally let $\varphi$ contain a finite number of past input-output data:

    \varphi(t) = [y(t-1), \ldots, y(t-n), u(t-k), \ldots, u(t-k-m)]^T      (34)

and the predictor for $y(t)$ is allowed to be a general function of $\varphi(t)$:

    \hat{y}(t) = g(\varphi(t))      (35)

We are thus, again, back at the simple example (1). The mapping $g$ can be parameterized as a function expansion

    g(\varphi, \theta) = \sum_{k=1}^{n} \alpha_k \kappa(\beta_k(\varphi - \gamma_k))      (36)

Here, $\kappa$ is a "mother basis function", from which the actual functions in the function expansion are created by dilation (parameter $\beta$) and translation (parameter $\gamma$). For example, with $\kappa = \cos$ we would get a Fourier series expansion with $\beta$ as frequency and $\gamma$ as phase. More common are cases where $\kappa$ is a unit pulse. With that choice, (36) can describe any piecewise constant function, where the granularity of the approximation is governed by the dilation parameter $\beta$. Compared to Figure 1 we would in that case have
- $n = 4$,
- $\gamma_1 = 1$, $\gamma_2 = 2$, $\gamma_3 = 3.5$, $\gamma_4 = 7$,
- $\beta_1 = 1$, $\beta_2 = 2/3$, $\beta_3 = 1/3.5$, $\beta_4 = 1/3$,
- $\alpha_1 = 0.79$, $\alpha_2 = 1.1$, $\alpha_3 = 1.3$, $\alpha_4 = 1.43$.

A related choice is a soft version of a unit pulse, such as the Gaussian bell. Alternatively, $\kappa$ could be a unit step (which also gives piecewise constant functions), or a soft step, such as the sigmoid.

Typically $\kappa$ is in all cases a function of a scalar variable. When $\varphi$ is a column vector, the interpretation of the argument of $\kappa$ can be made in different ways:

- If $\beta$ is a row vector, $\beta(\varphi - \gamma)$ is a scalar, so the term in question is constant along a hyperplane. This is called the ridge approach, and is typical for sigmoidal neural networks.
- Interpreting the argument as $\|\varphi - \gamma\|_\beta$, a quadratic norm with the positive semidefinite matrix $\beta$ as a quadratic form, gives terms that are constant on spheres (in the $\beta$ norm) around $\gamma$. This is called the radial approach. Radial basis neural networks are common examples of this.
- Letting $\kappa$ be interpreted as the product of $\kappa$-functions applied to each of the components of $\varphi$ gives yet another approach, known as the tensor approach. The functions used in (neuro-)fuzzy modeling are typical examples of this.

Note that the model structure is entirely determined by:

- The scalar valued function $\kappa(x)$ of a scalar variable $x$.
- The way the basis functions are expanded to depend on a vector $\varphi$.

The parameterization in terms of $\theta$ can be characterized by three types of parameters:

- The coordinates $\alpha$
- The scale or dilation parameters $\beta$
- The location parameters $\gamma$

These three parameter types affect the model in quite different ways. The coordinates enter linearly, which means that (36) is a linear regression for fixed scale and location parameters.
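This last observation is worth a small illustration (added here; the basis choice and the data are invented): with $\beta$ and $\gamma$ fixed, say as Gaussian bells on a grid, the coordinates $\alpha$ in (36) follow from one linear least squares fit.

```python
import numpy as np

rng = np.random.default_rng(4)
phi = rng.uniform(0, 10, size=300)                     # scalar regressor for simplicity
y = np.sin(phi) + 0.1 * rng.standard_normal(phi.size)  # invented function to approximate

# Fixed scale and location parameters: Gaussian-bell kappa on a uniform grid
gamma = np.linspace(0, 10, 15)                         # location parameters
beta = 2.0                                             # one common dilation parameter
kappa = lambda x: np.exp(-x**2)                        # Gaussian bell mother function

# With beta and gamma fixed, (36) is linear in alpha: solve by least squares
Phi = kappa(beta * (phi[:, None] - gamma[None, :]))    # N x n basis matrix
alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)

g = lambda p: kappa(beta * (p[:, None] - gamma[None, :])) @ alpha   # fitted model (36)
print(np.mean((y - g(phi))**2))                        # training fit
```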
B. Approximation Issues

A key issue is how well the function expansion is capable of approximating any possible "true system" $g_0(\varphi)$. There is a rather extensive literature on this subject. See, e.g., [8] for an identification oriented survey.

The bottom line can be expressed as follows: For almost any choice of $\kappa(x)$ - except $\kappa$ being a polynomial - the expansion (36) can approximate any "reasonable" function $g_0(\varphi)$ arbitrarily well for sufficiently large $n$.

It is not difficult to understand this. It is sufficient to check that the delta function - or the indicator function for arbitrarily small areas - can be arbitrarily well approximated within the expansion. Then clearly all reasonable functions can also be approximated. For a simple unit pulse function $\kappa$ with radial construction this is immediate: it is itself a delta function approximator.

The question of how efficient the expansion is, i.e., how large $n$ is required to achieve a certain degree of approximation, is more difficult, and has no general answer. See, e.g., [9]. We may point to the following aspects:

- If the scale and location parameters $\beta$ and $\gamma$ are allowed to depend on the function $g_0$ to be approximated, then the number of terms $n$ required for a certain degree of approximation is much smaller than if $\beta_k, \gamma_k$, $k = 1, \ldots$ is an a priori fixed sequence. To realize this, consider the following simple example:

EXAMPLE VI.1: Piece-wise Constant Functions
(From [1]) Suppose we use, as in Figure 1, piece-wise constant functions to approximate any scalar valued function of a scalar variable $\varphi$:

    g_0(\varphi), \quad 0 \le \varphi \le B      (37)

    \hat{g}_n(\varphi) = \sum_{k=1}^{n} \alpha_k \kappa(\beta_k(\varphi - \gamma_k))      (38)

where $\kappa$ is the unit pulse. Suppose that we require $\sup_\varphi |g_0(\varphi) - \hat{g}_n(\varphi)| \le \epsilon$ and that we know a bound on the derivative of $g_0$: $\sup |g_0'(\varphi)| \le C$. For (38) to be able to deliver such a good approximation for any such function $g_0$, we need to take $\gamma_k = k\Delta$, $\beta_k = 1/\Delta$ with $\Delta \le 2\epsilon/C$, i.e., we need $n \ge CB/(2\epsilon)$. That is, we need a fine grid $\Delta$ that is prepared for the worst case $|g_0'(\varphi)| = C$ at any point.
If the actual function to be approximated turns out to be much smoother, and has a large derivative only in a small interval, we can adapt the choice of $\beta_k = 1/\Delta_k$ so that $\Delta_k \approx 2\epsilon/|g_0'(\varphi^*)|$ for the interval around $\gamma_k = k\Delta_k \approx \varphi^*$, which may give the desired degree of approximation with much fewer terms in the expansion.

- For the local, radial approach the number of terms required to achieve a certain degree of approximation $\delta$ of a $p$ times differentiable function in $R^d$ ($\dim \varphi = d$) is proportional to

    \left( \frac{1}{\delta} \right)^{d/p}      (39)

It thus increases exponentially with the number of regressors. This is often referred to as the curse of dimensionality.

C. Neural Networks

Neural networks have become a very popular choice of model structure in recent years. The name refers to certain structural similarities with the neural synapse system in animals. From our perspective these models correspond to certain choices in the general function expansion.

Sigmoid Neural Networks.

The combination of the model expansion (36) with a ridge interpretation of $\beta(\varphi - \gamma)$ and the sigmoid choice $\kappa(x) = 1/(1 + e^{-x})$ for mother function gives the celebrated one hidden layer feedforward sigmoid neural net.
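In the notation of (36), such a network is just a particular choice of $\kappa$ and of the ridge argument. A sketch of the resulting predictor (added for illustration; the dimensions and parameter values are invented, and the parameters would in practice be estimated by the criterion (7)):

```python
import numpy as np

def sigmoid(x):
    # mother basis function kappa(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def one_hidden_layer_net(phi, alpha, beta, gamma):
    """Ridge version of (36): g(phi) = sum_k alpha_k * kappa(beta_k (phi - gamma_k)).

    phi   : (d,) regression vector, e.g. past y's and u's as in (34)
    alpha : (n,) coordinates, enter linearly
    beta  : (n, d) dilation row vectors, one per hidden unit
    gamma : (n, d) location vectors
    """
    ridge = np.sum(beta * (phi - gamma), axis=1)   # scalar arguments beta_k(phi - gamma_k)
    return alpha @ sigmoid(ridge)

# Invented example: n = 3 hidden units, d = 2 regressors [y(t-1), u(t-1)]
rng = np.random.default_rng(5)
alpha = rng.standard_normal(3)
beta, gamma = rng.standard_normal((3, 2)), rng.standard_normal((3, 2))
print(one_hidden_layer_net(np.array([0.2, -1.0]), alpha, beta, gamma))
```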
Wavelet and Radial Basis Networks.

The combination of the Gaussian bell type mother function and the radial interpretation of $\beta(\varphi - \gamma)$ is found in both wavelet networks and radial basis neural networks.

D. Wavelets

Wavelet decomposition is a typical example of the use of local basis functions. Loosely speaking, the "mother basis function" is dilated and translated to form a wavelet basis. In this context it is common to let the expansion (36) be doubly indexed according to scale and location, and to use the specific choices (for the one dimensional case) $\beta_j = 2^j$ and $\gamma_k = k$. This gives, in our notation,

    g(\varphi, \theta) = \sum_{j} \sum_{k} \alpha_{j,k} \kappa(2^j(\varphi - k))      (40)

Note the multi-resolution capabilities, i.e. several different scale parameters are used simultaneously, so that the intervals are multiply covered using basis functions of different resolutions (i.e. different scale parameters).

E. Nonparametric Regression in Statistics

Estimation of an unstructured, unknown function as in (1) is a much studied problem in statistics. See, among many references, e.g. [10] and [11].

Kernel Estimators.

A well known example of the use of local basis functions is kernel estimators. A kernel function $\kappa(\cdot)$ is typically a bell-shaped function, and the kernel estimator has the form

    g(\varphi) = \sum_{k} \alpha_k \kappa\!\left( \frac{\varphi - \gamma_k}{h} \right)      (41)

where $h$ is a small positive number and $\gamma_k$ are given points in the space of the regression vector $\varphi$. This clearly is a special case of (36) with one fixed scale parameter $h = 1/\beta$ for all the basis functions. This scale parameter is typically tuned to the problem, though. A common choice of $\kappa$ in this case is the Epanechnikov kernel:

    \kappa(x) = \frac{3}{4}(1 - x^2) \text{ for } |x| \le 1, \qquad \kappa(x) = 0 \text{ for } |x| > 1      (42)
give the temperature around which this transition takes place.
Finally a in (36) would the the typical expected output for the
where h is a small positive number, yk are given points in the particular combination of regressor values given by (43).
space of regression vector cp. This clearly is a special case of
We are thus back to the basic situation of (36) and where the
(36) with one fixed scale parameter h = 1/P for all the ba-
expansion into the d-dimensional regressor space is obtained
sis functions. This scale parameter is typically tuned to the
by the tensor product construction.
problem, though. A common choice of 6 in this case is the
Epanechnikov kernel: Normally, not all of the parameters a k , p k and ~k should be
freely adjustable. If the fuzzy partition is fixed and not ad-
justable (i.e ,B and y fixed), then we get a particular case of the
kernel estimate (41), which is also a linear regression model.

Thus fuzzy models are just particular instances of the general model structure (36).

One of the major potential advantages is that the fuzzy rule basis may give functions $g_j$ that reflect (verbal) physical insight about the system. This may be useful also to come up with reasonable initial values of the dilation and location parameters to be estimated. One should realize, though, that the knowledge encoded in a fuzzy rule base may be nullified if many parameters are let loose.
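As a small illustration of (43) (added here; the membership shapes, the rule outputs and the "temperature" reading of the regressors are all invented): two regressors with piecewise linear memberships give tensor-product basis functions, and the rule outputs $\alpha$ enter linearly, as in (36).

```python
import numpy as np

def membership(x, beta, gamma):
    # piecewise linear soft step from 0 (e.g. "cold") to 1 (e.g. "hot"):
    # beta sets how quickly the transition takes place, gamma where it is centered
    return np.clip(0.5 + beta * (x - gamma), 0.0, 1.0)

def fuzzy_model(phi, alpha, beta, gamma):
    """Tensor-product construction (43) inside the expansion (36).

    phi: (d,) regressors; each rule k combines one membership per component j.
    """
    n = alpha.size
    g = np.empty(n)
    for k in range(n):
        # basis function g_k = product over components of memberships, as in (43)
        g[k] = np.prod([membership(phi[j], beta[k, j], gamma[k, j])
                        for j in range(phi.size)])
    return alpha @ g            # alpha: typical output for each rule combination

# Invented 2-regressor example with 4 rules at different membership centers
alpha = np.array([0.0, 1.0, 2.0, 5.0])        # rule outputs
beta = np.ones((4, 2))
gamma = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
print(fuzzy_model(np.array([0.5, 0.8]), alpha, beta, gamma))
```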

VII. CONCLUSIONS

We have illustrated the kinship between many model structures used to identify dynamical systems. We have treated the case where there is no particular physical knowledge about the system's properties: the black-box case. The archetypical problem of estimating an unknown function from noisy observations of its values serves as a good guide for both linear and non-linear dynamical systems.

A key issue is to find a flexible enough model parameterization.


There are many such parameterizations available, known under
a variety of names. It is not possible to point to any particular
one that would be uniformly best. Experience and available
software dictate the choice in practice.
Another key issue is to find a suitable “size” or “fineness of ap-
proximation” of the model structure. The basic advice in this
respect is to estimate models of different complexity and eval-
uate them using validation data. A good way of constraining
the flexibility of certain model classes is to use a regularized
criterion of fit.
References

[1] L. Ljung, System Identification - Theory for the User, Prentice-Hall, Upper Saddle River, N.J., 2nd edition, 1999.
[2] J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P.Y. Glorennec, H. Hjalmarsson, and A. Juditsky, "Nonlinear black-box modeling in system identification: A unified overview," Automatica, vol. 31, no. 12, pp. 1691-1724, 1995.
[3] H. Akaike, "A new look at the statistical model identification," IEEE Transactions on Automatic Control, vol. AC-19, pp. 716-723, 1974.
[4] L.P. Wang and W.R. Cluett, "Frequency-sampling filters: An improved model structure for step-response identification," Automatica, vol. 33, no. 5, pp. 939-944, May 1997.
[5] J. Schoukens and R. Pintelon, Identification of Linear Systems: A Practical Guideline to Accurate Modeling, Pergamon Press, London (U.K.), 1991.
[6] P. Van Overschee and B. De Moor, Subspace Identification of Linear Systems: Theory, Implementation, Applications, Kluwer Academic Publishers, 1996.
[7] T. McKelvey and A. Helmersson, "State-space parametrizations of multivariable linear systems using tridiagonal matrix forms," in Proceedings of the 35th IEEE Conference on Decision and Control, Kobe, Japan, December 1996, pp. 3654-3659.
[8] A. Juditsky, H. Hjalmarsson, A. Benveniste, B. Delyon, L. Ljung, J. Sjöberg, and Q. Zhang, "Nonlinear black-box modeling in system identification: Mathematical foundations," Automatica, vol. 31, no. 12, pp. 1725-1750, 1995.
[9] A. Barron, "Universal approximation bounds for superpositions of a sigmoidal function," IEEE Trans. Inf. Theory, vol. IT-39, pp. 930-945, 1993.
[10] C.J. Stone, "Consistent non-parametric regression (with discussion)," Ann. Statist., vol. 5, pp. 595-645, 1977.
[11] M.P. Wand and M.C. Jones, Kernel Smoothing, Number 60 in Monographs on Statistics and Applied Probability, Chapman & Hall, 1995.

