PREDICTIVE MODULAR NEURAL NETWORKS
Applications to Time Series

THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE
by
Vassilios Petridis
Aristotle University of Thessaloniki, Greece
petridis@vergina.eng.auth.gr
and
Athanasios Kehagias
American College of Thessaloniki
and Aristotle University of Thessaloniki, Greece
kehagias@egnatia.ee.auth.gr
Springer Science+Business Media, LLC
ISBN 978-1-4613-7540-1 ISBN 978-1-4615-5555-1 (eBook)
DOI 10.1007/978-1-4615-5555-1
Library of Congress Cataloging-in-Publication Data
Preface ix
1. INTRODUCTION 1
1.1 Classification, Prediction and Identification: an Informal Description 1
1.2 Part I: Known Sources 3
1.3 Part II: Applications 5
1.4 Part III: Unknown Sources 5
1.5 Part IV: Connections 7
4. MATHEMATICAL ANALYSIS 59
4.1 Introduction 59
4.2 Convergence Theorems for Fixed Source Algorithms 60
Part II Applications 99
Appendices 270
A Mathematical Concepts 271
A.1 Notation 271
A.2 Probability Theory 272
A.3 Sequences of Bernoulli Trials 279
A.4 Markov Chains 280
References 283
Index 313
Preface
The subject of this book is predictive modular neural networks and their ap-
plication to time series problems: classification, prediction and identification.
The intended audience is researchers and graduate students in the fields of
neural networks, computer science, statistical pattern recognition, statistics,
control theory and econometrics. Biologists, neurophysiologists and medical
engineers may also find this book interesting.
In the last decade the neural networks community has shown intense interest
in both modular methods and time series problems. Similar interest has been
expressed for many years in other fields as well, most notably in statistics,
control theory, econometrics etc. There is a considerable overlap (not always
recognized) of ideas and methods between these fields.
Modular neural networks come by many other names, for instance multiple
models, local models and mixtures of experts. The basic idea is to independently
develop several "subnetworks" (modules), which may perform the same or re-
lated tasks, and then use an "appropriate" method for combining the outputs
of the subnetworks. Some of the expected advantages of this approach (when
compared with the use of "lumped" or "monolithic" networks) are: superior
performance, reduced development time and greater flexibility. For instance, if
a module is removed from the network and replaced by a new module (which
may perform the same task more efficiently), it should not be necessary to
retrain the aggregate network.
In fact, the term "modular neural networks" can be rather vague. In its
most general sense, it denotes networks which consist of simpler subnetworks
(modules). If this point of view is taken to the extreme, then every neural
network can be considered to be modular, in the sense that it consists of neurons
which can be seen as elementary networks. We believe, however, that it is more
profitable to think of a continuum of modularity, placing complex nets of very
simple neurons at one end of the spectrum, and simple nets of very complex
neurons at the other end.
We have been working along these lines for several years and have developed
a family of algorithms for time series problems, which we call PREMONNs (i.e.
Predictive Modular Neural Networks).
Aristotle University,
Thessaloniki, Greece,
1 INTRODUCTION
phonemes; for instance to be able to say that (for some T and n) the segment
y_{T+1}, y_{T+2}, ..., y_{T+n} corresponds to the phoneme [oo]. Other instances of time
series classification arise in connection to processing of radar, sonar or seismic
signals, the analysis of electrocardiographic or electroencephalographic signals,
DNA sequencing and so on. An extensive list of classification applications can
be found in (Hertz, Krogh and Palmer, 1991).
The model of a time series generated by a collection of sources can be (and
has been) used not only in classification problems, but also for prediction and
identification. Consider the case of prediction: if the input / output behavior of
the sources is known, then a predictor can be developed for each source. If it is
known that a segment of the observed time series is generated by a particular
source, future observations within this segment can be predicted by using the
corresponding predictor. Various methods of predictor combination build upon
this idea. In identification problems it is required to obtain, for instance, an
input / output model of the time series. Such a model may be global, i.e. hold
for all possible values of the observed time series, or local, i.e. several models
may be combined, each describing the input / output behavior for a particular
range of observations. In the latter case, each local model may be considered as
a source; under this interpretation the time series is generated by the resulting
collection of sources. It can be seen that in all problems discussed above the
primary task is establishing, at every time step t, the active source. In other
words, classification is a prerequisite for prediction and identification.
Multiple sources and local models are well suited to modular methods, a topic
which has recently attracted a great deal of attention in the neural networks
community. There is no universally accepted definition of what a modular
neural network is, but generally the term denotes a network which is composed
of subnetworks (modules). Each module may be specialized in a particular task,
or several modules may perform the same task in slightly different manner
(perhaps because each underwent a different training process). In the latter
case the term ensemble of networks is sometimes preferred. At any rate, it is
hoped that the combination of modules will yield superior performance, greater
noise robustness, a shorter training cycle and so on. This is a reasonable hope,
since modular neural networks implement the time-honored divide-and-conquer
approach to problem solving. The results reported in the literature indicate
that modular neural networks indeed outperform "lumped" or "monolithic"
networks.
In this book we present a family of modular neural algorithms which we have
found to work well in practical time series problems. In addition to experimen-
tal results, we present a mathematical framework to explain the success of the
proposed algorithms. This framework is based on a phenomenological point of
view which can be summarized thus.
If a module predicts a time series with greater accuracy than competing mod-
ules, then it should receive higher credit as a potential model of the time series;
however, credit must be assigned in connection with past, as well as present,
predictive accuracy.
This rather simple principle can yield mathematically precise results such as
convergence to correct classification.
Part I of the book is mainly devoted to the presentation and analysis of
a family of recursive, online predictive modular time series classification algo-
rithms. The proposed algorithms are modular in the sense that they combine
a collection of modules; each module can be developed independently of the
rest and replaced or removed from the system, without affecting the remaining
modules. The algorithms are characterized as predictive because the modules
are actually predictors (one predictor corresponding to each active source) and
classification is performed by the use of credit functions which are obtained
from the modules' predictive error.
Part II presents the application of the proposed algorithms to real world
problems of time series classification, prediction and identification.
All the methods presented in the first two parts of the book, are based on
the assumption that the collection of active sources is known in advance and
that an input / output model of each source is available. If these assumptions
are not satisfied, a different approach is required; this is discussed in Part III.
Namely, we propose a family of algorithms which can be used to identify the
active sources and develop one predictive module per source. Hence the results
of Part III complement those of Part I.
Finally, in Part IV we discuss the use of multiple models methods (which may
be considered a superset of modular methods) in the fields of neural networks,
statistical pattern recognition, control theory, fuzzy set theory, econometrics
and statistics, and provide a framework which allows a unified interpretation
of such methods; extensive bibliographic references are also provided.
We will now discuss each part of the book in more detail.
In the offline learning phase K predictors are trained, each predictor using
data generated by one of the K sources.
At t = 0, equal credit is assigned to all sources.
The online phase (for t = 1, 2, ...) consists of the following steps.
Since previous credit values are used to compute the current ones, the above
algorithm is recursive. In addition, as will become obvious later, credit assign-
ment is competitive: the new credit of a source does not depend on the absolute
magnitude of the respective prediction error, but on relative magnitude of all
K prediction errors. The algorithm is characterized as predictive, since credit
assignment depends on predictive error. In addition, the algorithm is modular:
each predictor is an independent, separately trained module and can be easily
replaced or removed from the system, without affecting the operation of the
remaining modules.
As discussed in Chapter 3, each predictor is trained independently (during
the offline phase) and can be implemented by several different kinds of neural
networks (for instance, feedforward or recurrent networks employing sigmoid,
linear, RBF or polynomial neurons). The credit assignment module is also
independent of the prediction modules, and may be replaced without affecting
their operation. Various credit computation schemes can be used, which may
be, for instance, of a multiplicative, additive or "fuzzy" character. By varying
the characteristics of the predictive and credit assignment modules, a family of
Predictive Modular Neural Networks (PREMONN) is developed, which can be
used to perform time series classification, prediction and parameter estimation
(of dynamical systems). We present numerical experiments to compare the
performance of the various algorithms introduced.
In Chapter 4, we prove mathematically (for most of the algorithms intro-
duced) that, under mild assumptions, the credit functions converge to the "cor-
rect" classification. Roughly speaking, this means that credit function of the
module with maximum predictive power converges to one and all the remaining
credit functions converge to zero.
4. and use these models (in conjunction with the algorithms of Part I) to
perform classification.
Observe y_t.
For k = 1, 2, ..., K compute y_t^k (the prediction of y_t by predictor no. k).
Compute the prediction errors |y_t - y_t^k|.
Set k* = arg min_{k=1,2,...,K} |y_t - y_t^k|.
If |y_t - y_t^{k*}| < d, allocate y_t to predictor no. k*. Else increase K by one
and allocate y_t to predictor no. K.
Retrain each predictor on all data assigned to it.
Next t.
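A minimal sketch of this allocation loop follows. The "predictors" here are hypothetical stand-ins, not the book's implementation: predictor no. k simply predicts the mean of the samples allocated to it, so "retraining" reduces to recomputing that mean; the threshold value d is likewise illustrative.

```python
def identify_sources(y, d):
    """Allocate each sample to its best predictor; grow the predictor bank
    whenever the smallest prediction error exceeds the threshold d."""
    data = [[y[0]]]                      # samples assigned to each predictor
    models = [y[0]]                      # predictor no. k predicts models[k]
    for y_t in y[1:]:
        errors = [abs(y_t - m) for m in models]
        k_star = min(range(len(models)), key=lambda k: errors[k])
        if errors[k_star] < d:
            data[k_star].append(y_t)     # allocate y_t to predictor no. k*
        else:
            data.append([y_t])           # increase K by one
        # retrain each predictor on all data assigned to it
        models = [sum(s) / len(s) for s in data]
    return models

# Two well-separated constant sources yield a bank of two predictors:
bank = identify_sources([0.0, 0.1, 1.0, 0.05, 0.95, 1.05], d=0.5)
```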
In other words, at time t it is claimed that y_1, y_2, ..., y_t has been produced
by source z_t, where z_t maximizes the posterior probability. This is called the
Maximum A Posteriori (MAP) estimate of z. Note that, while the source
variable is fixed, its estimate may be time varying.
The classification problem has now been reduced to computing p_t^k, for t =
1, 2, ... and k = 1, 2, ..., K. This computation can be performed recursively.
The main result in this direction is the following theorem.
(2.3)
Then the posterior probabilities p_t^k evolve in time according to the following
equation:
$$p_t^k = \frac{p_{t-1}^k \, d_{y_t}(y_t \mid y_1, \ldots, y_{t-1}, z = k)}{\sum_{j=1}^{K} p_{t-1}^j \, d_{y_t}(y_t \mid y_1, \ldots, y_{t-1}, z = j)}. \qquad (2.4)$$
variable, its probability density function is the same as its probability function.
Then, for k = 1, 2, ..., K and t = 0, 1, 2, ... we have¹
$$p_t^k = P(z = k \mid y_1, y_2, \ldots, y_{t-1}, a) = \frac{d_{y_t,z}(a, k \mid y_1, y_2, \ldots, y_{t-1})}{d_{y_t}(a \mid y_1, y_2, \ldots, y_{t-1})}. \qquad (2.5)$$
The above equation is a form of Bayes' rule: it says that the conditional density
of z is the same as the joint conditional density of y_t and z, divided by the
conditional density of y_t (where all conditioning is on y_1, y_2, ..., y_{t-1}). A
rigorous derivation of eq.(2.5) is presented in Appendix 2.A. From eq.(2.5) we
can obtain (see Appendix 2.A)
$$p_t^k = \frac{d_{y_t,z}(a, k \mid y_1, y_2, \ldots, y_{t-1})}{\sum_{j=1}^{K} d_{y_t,z}(a, j \mid y_1, y_2, \ldots, y_{t-1})}. \qquad (2.6)$$
We also show in Appendix 2.A that
$$d_{y_t,z}(a, k \mid y_1, y_2, \ldots, y_{t-1}) = d_{y_t}(a \mid y_1, y_2, \ldots, y_{t-1}, z = k) \cdot p_{t-1}^k,$$
and we assume that
$$d_{y_t}(a \mid y_1, y_2, \ldots, y_{t-1}, z = k) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-|a - y_t^k|^2 / 2\sigma^2}; \qquad (2.10)$$
¹ To denote probability densities of random variables, we use the following notation: the
probability density of the random variable x is denoted by d_x(a); the joint probability density
of the random variables x, y is denoted by d_{x,y}(a, b); the conditional probability density of
the random variable x, given that y = b, is denoted by d_x(a | y = b), or also by d_x(a | b) or
d_x(a | y).
It must also be kept in mind that the probability density of a continuous valued random
variable does not always exist, but if it does, then it satisfies the relationship
$P(a_1 \le x \le a_2) = \int_{a_1}^{a_2} d_x(a)\, da$.
in other words, y_t, conditioned on y_1, y_2, ..., y_{t-1} and z = k, has a Gaussian
probability density with mean y_t^k and standard deviation σ. (Extensions for
vector valued y_t are obvious.) Setting a = y_t in eq.(2.10), substituting in
eq.(2.9) and cancelling the $\frac{1}{\sqrt{2\pi}\,\sigma}$ terms, we obtain the desired posterior
probabilities update equation:
$$p_t^k = \frac{p_{t-1}^k \, e^{-|y_t - y_t^k|^2 / 2\sigma^2}}{\sum_{n=1}^{K} p_{t-1}^n \, e^{-|y_t - y_t^n|^2 / 2\sigma^2}}. \qquad (2.11)$$
the numerator. Indeed, upon dividing the update equations for p_t^k and for p_t^m,
the denominators cancel and we obtain
$$\frac{p_t^k}{p_t^m} = \frac{p_{t-1}^k}{p_{t-1}^m} \cdot \frac{e^{-|y_t - y_t^k|^2 / 2\sigma^2}}{e^{-|y_t - y_t^m|^2 / 2\sigma^2}}. \qquad (2.13)$$
Eq.(2.13) shows that the likelihood ratio of two sources is updated at every time
step according to the "prediction" error of each source; namely, sources with
larger error become less likely to have produced the observations. However,
the likelihood ratio at time t also depends on the likelihood ratio at time t - 1.
Hence, the operation of the update equation is essentially the following: at
every time step eq.(2.11) penalizes more heavily sources with higher prediction
error; but past performance of each source is also taken into account.
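The update of eq.(2.11) is a one-line computation per time step. The following sketch (with illustrative values, not taken from the book's experiments) shows a source whose predictions track the observations accumulating credit over a few steps:

```python
import math

def credit_update(p_prev, y_t, preds, sigma):
    """One step of eq.(2.11): multiply each credit p_{t-1}^k by a Gaussian
    function of the prediction error |y_t - y_t^k|, then renormalize."""
    w = [p * math.exp(-abs(y_t - y_k) ** 2 / (2.0 * sigma ** 2))
         for p, y_k in zip(p_prev, preds)]
    total = sum(w)
    return [x / total for x in w]

# Predictor no. 1 stays close to the observations, predictor no. 2 does not,
# so the credit of no. 1 converges towards one:
p = [0.5, 0.5]
for y_t in [0.80, 0.82, 0.79]:
    p = credit_update(p, y_t, preds=[0.80, 0.30], sigma=0.2)
```

Note that the renormalization makes the update competitive: only the relative magnitudes of the K errors matter.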
From the above analysis it is obvious that eq.(2.11) performs a "sensible"
update of the posterior probabilities. In fact, the update is sensible even when
the probabilistic assumptions are dropped. The only assumptions necessary to
justify the update equation are the following.
1. The time series y_1, y_2, ... is produced by some unknown source functions
F_k(·), k = 1, 2, ..., K; a noise process (of unspecified characteristics) may
distort the observations.
3. The k-th sample time series is used to train offline a sigmoid neural predictor
f_k(·); in light of the well known universal approximation properties of
sigmoid neural networks, it is reasonable to assume that, given a sufficient
sample of the time series, f_k(·) approximates F_k(·).
² Note that, if we have (for all k) 0 < p_0^k < 1 and Σ_{k=1}^K p_0^k = 1, then we also have (for all k
and t) that 0 < p_t^k < 1 and Σ_{k=1}^K p_t^k = 1.
For k = 1, 2, ..., K train (offline) sigmoid neural network predictors f_k(·).
At t = 0 choose initial values p_0^k which are arbitrary, except for the fact that
they satisfy
$$0 < p_0^k < 1, \qquad \sum_{k=1}^{K} p_0^k = 1. \qquad (2.14)$$
Next k.
At time t classify the entire time series to the source no. z_t, where
$$z_t = \arg\max_{k=1,2,\ldots,K} p_t^k.$$
Next t.
PREMONN CLASSIFICATION AND PREDICTION
$$\frac{p_t^k}{p_t^m} = \frac{p_{t-1}^k}{p_{t-1}^m} \cdot \frac{e^{-|e_t^k|^2 / 2\sigma^2}}{e^{-|e_t^m|^2 / 2\sigma^2}}. \qquad (2.18)$$
This ratio at time t is equal to the same ratio at time t - 1, multiplied by the
term $e^{-|e_t^k|^2/2\sigma^2} / e^{-|e_t^m|^2/2\sigma^2}$, which expresses the relative error of the k-th and m-th
predictors. More specifically, if the m-th predictor has higher error than the
k-th one, the exponential term is greater than one, and the ratio of the k-th credit
to the m-th one increases from time t - 1 to time t; the reverse situation holds
in case the m-th predictor has lower error than the k-th one. Hence the credit
ratio evolves dynamically in time, depending on past values of itself (feedback)
as well as on the currently observed prediction errors. By repeatedly applying
eq.(2.18) for times t - 1, t - 2, ..., 1, we obtain
$$\frac{p_t^k}{p_t^m} = \frac{p_0^k}{p_0^m} \cdot \frac{e^{-\sum_{s=1}^{t} |e_s^k|^2 / 2\sigma^2}}{e^{-\sum_{s=1}^{t} |e_s^m|^2 / 2\sigma^2}}. \qquad (2.19)$$
It now becomes obvious that the source with highest credit at time t is the one
with minimum cumulative prediction error.
The above arguments are, of course, quite informal. A more solid justification
of the use of eq.(2.17) and its variants will be given in Chapter 5, where it
will be proved (for a variety of credit update algorithms) that: under very mild
assumptions on the time series y_t, if the k-th source has "smallest prediction
error", then lim_{t→∞} p_t^k = 1 and lim_{t→∞} p_t^m = 0 for all m ≠ k. The exact
meaning of "smallest prediction error", as well as the exact sense of convergence,
will be described in Chapter 5.
In conclusion, let us list some of the advantages of the phenomenological
point of view. First, as already mentioned, it renders assumptions about the
nature of the time series unnecessary. Second, it allows a simple treatment
of switching sources, as will be seen in the next section. Third, it allows the
introduction of variant classification algorithms; we have found such variants
(a number of which will be presented in Chapter 4) to be, in some instances,
more efficient than the basic algorithm presented above.
[Figure: credit function evolution; horizontal axis: time steps 0 to 375; vertical
axis: credit function values between 0.00 and 1.00.]
will become 1. But, since before t_s p_t^2 is very close to 0, after t_s it starts from
a very unfavorable initial condition. Therefore it is most likely that a new
source switch will take place before p_t^2 becomes sufficiently large. In such a
case, classification to the second source fails. In the extreme case, because of
numerical computer underflow p_t^2 is set to zero before t_s; referring to eq.(2.17)
we observe that p_t^2 will remain 0 for all subsequent time steps. To resolve
this problem, whenever p_t^k falls below a specified threshold h, it is reset to h.
Then the usual normalization of the p_t^k's is performed; this ensures that the
thresholded p_t^k's remain approximately within the [h, 1] range and add to 1. In
essence, this thresholding is equivalent to introducing a forgetting factor; the
argument to show this goes as follows. Suppose that several samples of the
time series are observed, which have not been produced by the k-th source.
For each such sample, the k-th predictor produces a large error and, as can be
seen from eq.(2.17), p_t^k is multiplied by a number close to zero. If this process
continued for several time steps, p_t^k would soon become zero, as explained
above. If we never let p_t^k go below h, we essentially stop penalizing predictor k
for further bad predictions; these are, in effect, "forgotten". If h is small, then
p_t^k will also be small and will not essentially alter the classification results. On
the other hand, when source no. k becomes active, p_t^k can recover quickly. An
alternative way of looking at the use of the threshold h is that whenever one or
more p_t^k's fall below h (because the corresponding predictors perform poorly) we
restart the algorithm using new initial values for the credit functions, obtained
by resetting the corresponding p_t^k's to h. Under this interpretation, our
prior belief in any of the source models never goes below h. In the experiments
we present in Section 2.6, we always chose h = 0.01; this choice is arbitrary but
consistent and gives good results.
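The reset-and-renormalize step just described can be sketched as follows (a minimal sketch; as noted above, after renormalization the thresholded credits stay only approximately within [h, 1]):

```python
def threshold_credits(p, h=0.01):
    """Reset any credit below h to h, then renormalize so the credits add
    to 1 (the thresholding described above; h = 0.01 as in Section 2.6)."""
    clipped = [p_k if p_k > h else h for p_k in p]
    total = sum(clipped)
    return [p_k / total for p_k in clipped]

# A credit driven to numerical zero is restored to (approximately) h,
# so the corresponding source can recover quickly when it becomes active:
p = threshold_credits([0.9999, 0.0, 1e-4])
```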
In this section we discuss some issues which are related to the implementation
of the PREMONN algorithm and also introduce some possible variations of the
basic algorithm. These issues are treated here briefly; a detailed presentation of
variant algorithms appears in Chapter 4 and a full discussion of implementation
issues (for the standard and variant algorithms) appears in Chapter 7.
[Figure: PREMONN block diagram. The observation y_t feeds the prediction
modules, whose outputs drive the credit assignment module; the outputs are the
credit functions p_t^1, p_t^2, ..., p_t^K.]
PREMONNs perform well even when the individual prediction modules have poor
performance, as long as they are clearly separated in the parameter space: the
decision module will simply pick the "least bad" prediction module. In par-
ticular, PREMONN is immune to a high level of noise in the data (in this
connection see also Section 2.6). Finally, the modularity of the PREMONN
algorithm introduces parallelism naturally. Prediction modules can execute in
parallel and send the results to the decision module. Hence execution time is
independent of the number of classes in the classification problem.
Kalman filters etc. In fact, several different predictor types can be used within
the same PREMONN.
In general, the probabilistic analysis presented in Section 2.1 breaks down in
case the source functions F_k(·) are different from the predictor functions f_k(·).
This, however, presents no difficulty in the phenomenological framework, where
the functions F_k(·) can be completely ignored; in this context it is only required
that predictions y_t^k are available for k = 1, 2, ..., K and t = 1, 2, .... As soon
as the y_t^k are available, no matter how they were generated, credit assignment
can take place. Hence it can be seen that the phenomenological point of view
yields considerable freedom in the design of PREMONNs.
After a particular type of predictor is chosen, values must be selected for
various predictor parameters; for instance, in a sigmoid feedforward neural
network, the number of layers and neurons, as well as the predictor order
M. The usual common sense rules should be applied to the selection of such
parameters. However, it must be stressed that such choices are not crucial for
the performance of the algorithm, because PREMONNs are particularly robust
to faulty predictions. This is easily understood by considering eq.(2.19) once
again. It can be seen that what matters in credit assignment is not absolute,
but relative prediction performance. Consider, for instance, the case where the
k-th predictor is tuned to the currently active source, but is not well trained.
In this case we may expect the errors e_t^k to be large, but consistently smaller
than those of the remaining predictors. Considering eq.(2.19) it is clear that
in this case p_t^k will dominate p_t^m for m ≠ k, resulting in correct classification.
This observation is corroborated by the experiments presented in Section 2.6,
as well as by the theoretical analysis of Chapter 5.
A general multiplicative credit update scheme has the form
$$p_t^k = \frac{p_{t-1}^k \, h(e_t^k)}{\sum_{m=1}^{K} p_{t-1}^m \, h(e_t^m)}, \qquad (2.20)$$
where h(·) is a strictly positive, decreasing function of the prediction error;
the choice
$$h(e) = e^{-|e|^2 / 2\sigma^2} \qquad (2.21)$$
recovers eq.(2.11). This scheme again penalizes more heavily predictors with
higher error, while keeping track of past predictor performance. Credit update
schemes of the form (2.20) are termed multiplicative schemes (for obvious
reasons).
The cumulative squared prediction error can also be used directly, rather
than in its exponential form; in this case we can obtain credit update schemes
of the form
$$p_t^k = p_{t-1}^k + g(e_t^k),$$
where g(·) is a strictly positive and increasing function. In this case, however,
we lose several attractive features of the credit function. In particular, the
properties p_t^k < 1 and Σ_{k=1}^K p_t^k = 1, for all k and t, do not hold any longer. In
addition, p_t^k now describes the discredit, rather than the credit, of the k-th
model. We call such schemes additive.
We have also implemented incremental credit assignment schemes (which
resemble a steepest ascent procedure) and schemes which implement fuzzy rea-
soning (by using the minimum operator in place of the product and the maxi-
mum operator in place of the sum). These schemes will be presented in Chapter
4. Note that the variants presented previously refer to the credit assignment
scheme and are completely independent of the type of predictive modules used.
2.5 PREDICTION
A prediction algorithm can be easily obtained from the PREMONN classification
algorithm. The key idea is predictor combination, a methodology which
has become increasingly popular in the last decade (Clemen, 1989; Farmer and
Sidorowich, 1988; Perrone and Cooper, 1993; Quandt, 1958; Quandt and Ramsey,
1978; Tong and Lim, 1980). Generally speaking, predictor combination
consists in the use of a collection of predictors (operating in parallel) in
conjunction with a predictor selection mechanism; the final outcome is a
prediction ŷ_t of y_t.
Several predictor selection mechanisms have appeared in the literature. The
PREMONN architecture is particularly suitable for a predictor combination
approach. The predictor collection (i.e. y_t^k = f_k(y_{t-1}, y_{t-2}, ..., y_{t-M}), k =
1, 2, ..., K) is available as an integral component of the PREMONN; furthermore,
prediction combination can be effected in a natural manner by use of the
credit functions p_t^k. We have experimented with the following two combination
methods.
Our experience indicates that, while each of the two predictor combination
methods may be advantageous in particular situations, neither method has a
clear, universal advantage over the other. At any rate, both methods give quite
satisfactory prediction results.
Finally, it is worth noting that in both the weighted and winner-take-all
methods, the combined prediction ŷ_t depends only on quantities (p_{t-1}^k, y_t^k for
k = 1, 2, ..., K) which can be computed at time t - 1.
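The two combination methods can be sketched as follows. The weighted form assumed here (a credit-weighted average ŷ_t = Σ_k p_{t-1}^k y_t^k) and the first-maximum tie-breaking in the winner-take-all method are readings consistent with the text, not quotations of the book's exact formulas:

```python
def weighted_prediction(p_prev, preds):
    """Credit-weighted combination: sum over k of p_{t-1}^k * y_t^k
    (assumed form of the "weighted" method)."""
    return sum(p_k * y_k for p_k, y_k in zip(p_prev, preds))

def winner_take_all_prediction(p_prev, preds):
    """Winner-take-all: the prediction of the module with highest credit
    at time t - 1 (first maximum taken in case of ties)."""
    k_star = max(range(len(p_prev)), key=lambda k: p_prev[k])
    return preds[k_star]

p_prev = [0.7, 0.2, 0.1]   # credits p_{t-1}^k
preds = [1.0, 2.0, 3.0]    # predictions y_t^k
y_hat_weighted = weighted_prediction(p_prev, preds)    # ~ 1.4
y_hat_wta = winner_take_all_prediction(p_prev, preds)  # prediction of module 1
```

Both combined predictions use only quantities available at time t - 1, as noted above.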
2.6 EXPERIMENTS
In this section we apply the basic PREMONN Classification Algorithm to
several time series classification tasks, using computer synthesized logistic,
Mackey-Glass and sequential logic gates time series. We explore the effect of
varying parameter values and observation noise levels on classification accuracy.
For values of a greater than 3.67, eq.(2.23) yields a chaotic time series (see
Figs. 2.3, 2.4 and 2.5).
A total of twelve predictor modules have been trained. All the predictors
used are 18-5-1 sigmoid feedforward neural networks. The 18 inputs are the
values y_{t-1}, ..., y_{t-18} and the target output is y_t. Eleven of the predictors have
been trained on sample logistic time series, generated according to eq.(2.23),
with a = 3.0, 3.1, ..., 4.0. The mean square error of these predictors varies, but is
generally between 0.1 and 0.3. The twelfth neural network predictor is trained
on a Gaussian white noise time series, with mean μ_w = 0.50 and standard
deviation σ_w = 0.25. In this case the mean square error of the predictor is 0.3.³
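Assuming eq.(2.23) is the logistic map in its usual form, y_{t+1} = a · y_t · (1 - y_t) (the equation itself is stated earlier in the chapter and not reproduced in this excerpt), the logistic training series can be generated as follows; the initial value y0 = 0.3 is an arbitrary choice:

```python
def logistic_series(a, y0, n):
    """Iterate the logistic map y_{t+1} = a * y_t * (1 - y_t), the assumed
    form of eq.(2.23); the orbit is chaotic for a greater than about 3.67."""
    y = [y0]
    for _ in range(n - 1):
        y.append(a * y[-1] * (1.0 - y[-1]))
    return y

# One sample series per parameter value a = 3.0, 3.1, ..., 4.0, as used to
# train the eleven logistic predictors:
training_series = [logistic_series(a=3.0 + 0.1 * i, y0=0.3, n=200)
                   for i in range(11)]
```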
Let us now present a few representative experiments. In the first experiment
a test time series is generated, using eq.(2.23), a = 4.0 and different initial
conditions than the ones used to generate the training time series. 182 samples
of the time series have been generated.⁴ A PREMONN employing two predictors
has been used: the first predictor is the one trained on the logistic with a = 4.0,
and the second is the one trained on white noise. The PREMONN is required to
discover that the actual time series is logistic rather than noise. The algorithm
³ Note that 0.25, the noise variance, is the minimum theoretically attainable MSE, which can
be attained by using the constant predictor y_t = 0.5.
⁴ Actually 200 samples, of which the first 18 are used as initial values for the sigmoid
predictors.
Figure 2.3. Credit function evolution for a classification experiment involving a logistic
time series and two predictors.
parameters are σ = 0.15, h = 0.01. The results of this experiment are presented
in Fig. 2.3. It can be seen that classification to the logistic is correct and very
fast.
In the second experiment a composite time series has been used. The first
half of the time series consists of 82 samples of a logistic time series with a = 4.0;
the second half consists of 100 samples of white noise with mean μ_w = 0.5 and
standard deviation σ_w = 0.25. The PREMONN employs the same two predictor
modules used in the previous experiment; σ = 0.20, h = 0.01. The task is
to classify the first 82 samples as belonging to a logistic time series and the
next 100 as being white noise. The results of this experiment are presented
in Fig. 2.4. In the beginning of the time series, classification to the logistic is
almost instantaneous. Then at the switching point t_s = 82, a very quick switch
to the noise module is observed.
In the third experiment ten predictor modules are used, corresponding to
logistics with a = 3.0, 3.1, ..., 3.9. 200 time steps of a test logistic have been
generated with a = 3.8 (and new initial conditions); σ = 0.15, h = 0.01. The
results of this experiment are presented in Fig. 2.5. It can be seen that
classification to the true logistic is very fast. This demonstrates PREMONN's
ability to deal with a large number of sources.
In order to evaluate the dependence of classification accuracy on the σ and
h parameters, as well as on the level of noise in the observations of y_t, we have
performed many additional experiments, which are presented in tabular form.
Figure 2.4. Credit function evolution for a classification experiment involving a time series
with two components (logistic and noise) and two predictors. Switching time t_s = 82.
Figure 2.5. Credit function evolution for a classification experiment involving a logistic
time series and ten predictors.
Table 2.1. Classification accuracy c for various values of σ. Dataset A is a logistic time
series; dataset B is a composite (logistic and noise) time series.

Dataset   σ_v     σ       h       c
Table 2.2. Classification accuracy c for various values of h. Dataset A is a logistic time
series; dataset B is a composite (logistic and noise) time series.

Dataset   σ_v     σ       h       c
Table 2.3. Classification accuracy c for noisy observation of the time series. Dataset A is
a logistic time series; dataset B is a composite (logistic and noise) time series. Both time
series are mixed with additive white noise (observation noise) with zero mean and standard
deviation σ_v.

Dataset   σ_v     σ       h       c
A         0.000   0.100   0.010   0.994
A         0.050   0.150   0.010   0.994
A         0.100   0.200   0.010   0.994
A         0.200   0.300   0.010   0.989
A         0.300   0.400   0.010   0.961
A         0.500   0.600   0.010   0.741
B         0.000   0.100   0.010   0.961
B         0.050   0.150   0.010   0.928
B         0.100   0.200   0.010   0.873
B         0.200   0.300   0.010   0.879
B         0.300   0.400   0.010   0.522
B         0.500   0.600   0.010   0.516
Using first τ = 10 and then τ = 17 we obtain two chaotic time series. More
specifically, for each value of τ we integrate eq.(2.24) and then sample at times
s = 5, 10, 15, ... secs to obtain a time series y_1, y_2, ..., where y_t = ψ(5 · t), t = 1,
2, ... . We use the above data to train two sigmoid feedforward predictors
(both of size 5-5-1). Input is y_{t-1}, y_{t-2}, ..., y_{t-5} and target output is y_t. The
mean square prediction error is (for both predictors) approximately 0.04.
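Eq.(2.24) is not reproduced in this excerpt; the sketch below assumes the standard Mackey-Glass delay equation, dψ/dt = 0.2 ψ(t-τ)/(1 + ψ(t-τ)^10) - 0.1 ψ(t), integrated with a simple Euler step (the step size and constant initial history are arbitrary choices) and sampled every 5 seconds as described above:

```python
def mackey_glass(tau, n_samples, dt=0.1, sample_every=5.0, psi0=1.2):
    """Euler-integrate the (assumed) Mackey-Glass delay equation
    d psi / dt = 0.2 psi(t - tau) / (1 + psi(t - tau)**10) - 0.1 psi(t)
    and sample y_t = psi(5 t) for t = 1, 2, ..., n_samples."""
    delay = int(round(tau / dt))
    hist = [psi0] * (delay + 1)          # constant initial history
    samples, t, next_sample = [], 0.0, sample_every
    while len(samples) < n_samples:
        lagged = hist[-(delay + 1)]      # psi(t - tau)
        psi = hist[-1] + dt * (0.2 * lagged / (1.0 + lagged ** 10)
                               - 0.1 * hist[-1])
        hist.append(psi)
        t += dt
        if t >= next_sample - 1e-9:
            samples.append(psi)
            next_sample += sample_every
    return samples

# One chaotic series per delay value, as in the experiment above:
series_tau10 = mackey_glass(tau=10, n_samples=200)
series_tau17 = mackey_glass(tau=17, n_samples=200)
```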
Figure 2.6. Credit function evolution for a classification experiment involving a Mackey-Glass
time series.
[Figure: the time series is plotted together with the credit function; vertical axis
0.00 to 1.50, horizontal axis time steps 0 to 150.]
Figure 2.7. Credit function evolution for a classification experiment involving a composite
time series with two Mackey-Glass components.
Table 2.4. Classification accuracy c for various values of σ. Dataset A is a time series with
one Mackey-Glass component; dataset B is a composite time series with two Mackey-Glass
components.

Dataset   σ_w     σ       h       c
A         0.000   0.010   0.010   0.850
A         0.000   0.050   0.010   0.995
A         0.000   0.100   0.010   0.995
A         0.000   0.200   0.010   0.950
A         0.000   0.300   0.010   0.930
B         0.000   0.010   0.010   0.955
B         0.000   0.050   0.010   0.993
B         0.000   0.100   0.010   0.980
B         0.000   0.200   0.010   0.895
B         0.000   0.300   0.010   0.755
values of the σ and h parameters. Hence the conclusions of the previous section
are further corroborated.
PREMONN CLASSIFICATION AND PREDICTION 31
Table 2.5. Classification accuracy c for various values of h. Dataset A is a time series with
one Mackey-Glass component; dataset B is a composite time series with two Mackey-Glass
components.
Dataset  σ_v  σ  h  c
Table 2.6. Classification accuracy c for various values of σ_v. Dataset A is a time series with
one Mackey-Glass component; dataset B is a composite time series with two Mackey-Glass
components. Both time series are mixed with additive white noise (observation noise) with
zero mean and standard deviation σ_v.
Dataset  σ_v  σ  h  c
A 0.000 0.040 0.010 0.995
A 0.050 0.090 0.010 0.995
A 0.100 0.140 0.010 0.995
A 0.200 0.240 0.010 0.965
A 0.300 0.340 0.010 0.975
A 0.500 0.540 0.010 0.855
B 0.000 0.040 0.010 0.993
B 0.050 0.090 0.010 0.990
B 0.100 0.140 0.010 0.980
B 0.200 0.240 0.010 0.888
B 0.300 0.340 0.010 0.875
B 0.500 0.540 0.010 0.558
$$y_t = \mathrm{NOT}(u_t). \tag{2.26}$$

Here XOR, NOT, NOR and NAND are the usual logic gates; $y_t$ and $u_t$ are
Boolean variables, taking values in {0, 1}; in particular, $u_1, u_2, \ldots$ is a sequence
of independent random variables, each of which takes the values 0 and 1 with
probability 0.50.
We first run each of the above equations separately, with randomly generated
$u_1, u_2, \ldots$ sequences, and obtain four (training) $y_t$ time series, each of length 200. These
time series are used to train four sigmoid 2-3-1 feedforward predictors of the
form (k = 1, 2, 3, 4)

$$y_t^k = f_k(y_{t-1}, u_t)$$

(notice the slight change from the previous model; the predictions now depend
not only on the past value $y_{t-1}$, but also on the input $u_t$).
In the test phase we run two experiments. The first involved a time series
generated by a source switching sequence of the form XOR → NOT → NOR →
NOT (illustrated in Fig. 2.8). The second time series was generated by a source
switching sequence of the form XOR → NAND → NOR → NAND (illustrated
in Fig. 2.10). The classification results for the two experiments are illustrated
in Figs. 2.9 and 2.11. It can be seen that classification is almost perfect.
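The test-phase data generation can be reproduced in outline as follows. This is a hedged sketch: the book states $y_t = \mathrm{NOT}(u_t)$ explicitly, but the exact argument pattern of the two-input gates is not spelled out in this excerpt, so feeding each gate the pair $(u_t, y_{t-1})$ is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Feeding every gate the pair (u_t, y_{t-1}) is an assumption; the book
# only gives y_t = NOT(u_t) explicitly.
GATES = {
    "XOR":  lambda u, y: u ^ y,
    "NOT":  lambda u, y: 1 - u,
    "NOR":  lambda u, y: 1 - (u | y),
    "NAND": lambda u, y: 1 - (u & y),
}

def switching_series(schedule, seg_len, rng):
    """Generate y_t by applying each gate in `schedule` for seg_len steps."""
    y_prev, us, ys = 0, [], []
    for gate in schedule:
        for _ in range(seg_len):
            u = int(rng.integers(0, 2))   # fair Bernoulli input u_t
            y = GATES[gate](u, y_prev)
            us.append(u)
            ys.append(y)
            y_prev = y
    return us, ys

# first experiment: XOR -> NOT -> NOR -> NOT
us, ys = switching_series(["XOR", "NOT", "NOR", "NOT"], 100, rng)
print(len(ys))  # 400
```

The segment length (100 steps per gate) is illustrative; the resulting Boolean series is what the four trained 2-3-1 predictors would then classify.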
2.7 CONCLUSIONS
In this chapter we have presented the basic PREMONN classification algorithm.
Classification is performed by recursive computation of the credit functions pf,
which indicate the likelihood of each candidate source having generated the
observed time series.
The motivation for developing the algorithm has been probabilistic, but it
has been established by informal arguments that its use can be justified by
purely phenomenological arguments, without recourse to probabilistic consid-
erations. Further justification for the use of this and related algorithms will
be provided in Chapter 5, where convergence to correct classification will be
established mathematically. For the time being, the following remarks are in
order. The phenomenological point of view offers greater flexibility than the
probabilistic one, in the sense that the PREMONN algorithm can be applied
to a wider range of problems and modified in various ways (some of these mod-
ification will be presented in Chapter 3 and their possible advantages will be
discussed). In addition, the phenomenological point of view allows classification
of time series generated by switching sources.
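The recursive, competitive credit computation summarized above can be sketched numerically. This is a minimal illustration, not the book's implementation; the value of σ and the constant error sequences are ours.

```python
import numpy as np

def premonn_credits(errors, sigma=0.1, p0=None):
    """Recursive competitive credit assignment (basic PREMONN).
    errors: (T, K) array of one-step prediction errors of K predictors.
    Returns the trajectory of credit functions p_t^k, shape (T, K)."""
    T, K = errors.shape
    p = np.full(K, 1.0 / K) if p0 is None else np.asarray(p0, float)
    out = np.empty((T, K))
    for t in range(T):
        w = p * np.exp(-errors[t] ** 2 / (2 * sigma ** 2))
        p = w / w.sum()        # normalize so the credits sum to one
        out[t] = p
    return out

# predictor 0 consistently has the smaller error, so its credit tends to 1
err = np.column_stack([np.full(50, 0.05), np.full(50, 0.20)])
p = premonn_credits(err, sigma=0.1)
print(p[-1].round(3))
```

The credit of the consistently better predictor is driven to one, illustrating the competitive character of the update.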
Figure 2.8. The sequence of source switchings used in the first classification experiment.
XOR corresponds to source no.1, NOT corresponds to source no.2, NOR corresponds to
source no.3, NAND corresponds to source no.4.
Figure 2.9. Credit function evolution for the first logic gates classification experiment.
Figure 2.10. The sequence of source switchings used in the second classification experiment.
XOR corresponds to source no.1, NOT corresponds to source no.2, NOR corresponds to
source no.3, NAND corresponds to source no.4.
Figure 2.11. Credit function evolution for the second logic gates classification experiment.
[Plot: credit function evolution; y-axis 0.00 to 1.00, x-axis Time Steps 101 to 401.]
$$p_t^k = \frac{p_{t-1}^k \cdot e^{-\frac{|y_t - y_t^k|^2}{2\sigma^2}}}{\sum_{n=1}^{K} p_{t-1}^n \cdot e^{-\frac{|y_t - y_t^n|^2}{2\sigma^2}}} \tag{2.A.1}$$

(2.A.2)

(these are the relationships introduced in Section 2.1). Then the above rela-
tionships will be used to prove eq.(2.A.1).
Step 1. Since $e_t$ is independent of z and $y_1, y_2, \ldots$, it follows that
(recall that $y_t^k$ is a deterministic function of $y_1, \ldots, y_{t-1}$ and $e_t$ is zero mean,
Gaussian)

$$\int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{y^2}{2\sigma^2}}\, dy = \Pr(y_t - y_t^k < a \mid y_1, \ldots, y_{t-1}, z = k) \;\Rightarrow$$

$$\int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{y^2}{2\sigma^2}}\, dy = \Pr(y_t < a + y_t^k \mid y_1, \ldots, y_{t-1}, z = k) \;\Rightarrow$$

(substituting $a - y_t^k$ in place of a)

$$\int_{-\infty}^{a - y_t^k} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{y^2}{2\sigma^2}}\, dy = \Pr(y_t < a \mid y_1, \ldots, y_{t-1}, z = k). \tag{2.A.5}$$
$$\int_E \frac{r(a,k)}{G}\, dP(a,m) = \Pr(z = k,\; y_t \in E). \tag{2.A.14}$$

$$\int_E \frac{r(a,k)}{G}\, dP(a,m) = \int_E \frac{r(a,k)}{\sum_{m=1}^{K} r(a,m)}\, da = \ldots$$

By cancelling the $\sqrt{2\pi}\,\sigma$ in the numerator and denominator, the proof of the theorem
is complete. ∎
3 GENERALIZATIONS OF THE BASIC PREMONN
It has been assumed so far that the predictors $f_k(\cdot)$ are feedforward sigmoid
neural networks. However, as has already been pointed out, the $f_k(\cdot)$'s may
represent various other functional forms, such as linear functions, polynomials,
radial basis functions, splines etc. In addition, the predictions can be obtained
by feedforward or feedback (recurrent) calculations. Generally, the only require-
ment for the operation of the PREMONN algorithms is that the predictions $y_t^k$
are available for k = 1, 2, ..., K and t = 1, 2, ...; the method by which these are
obtained is not important. In fact, a bank of predictive modules may include
predictors of different types; for instance a sigmoid and a linear predictor may
coexist in the same PREMONN. The use of different predictor types may be
advantageous if it is known that different sources are approximated better by
different predictor types.
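The point that the bank may freely mix predictor types can be illustrated as follows. The two modules below are deliberately trivial stand-ins of our own; PREMONN only consumes their predictions, so any collection of callables is admissible.

```python
import numpy as np

# Two illustrative modules of different functional form.
linear  = lambda hist: hist[-1]                         # trivial linear predictor
sigmoid = lambda hist: 1.0 / (1.0 + np.exp(-hist[-1]))  # trivial sigmoidal one

bank = [linear, sigmoid]

def one_step_errors(y, bank):
    """|y_t - yhat_t^k| for every module k and t = 1 .. T-1."""
    return np.array([[abs(y[t] - f(y[:t])) for f in bank]
                     for t in range(1, len(y))])

y = np.linspace(0.0, 1.0, 20)
E = one_step_errors(y, bank)
print(E.shape)  # (19, 2)
```

The resulting error array is all that any of the credit-update schemes of this chapter require.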
the only difference being that now $|\cdot|$ signifies the Euclidean norm. Variants in-
volving the exponential form $e^{-E_t' Q E_t}$ (with Q a positive definite matrix) are
also possible and obvious. These do not induce substantial modifications in the
credit update algorithm.
(3.1)
where k = 1, 2, ..., K and s = 1, 2, ... . Note the change of the time variable:
time is now denoted by s rather than t. This is done because we reserve the
variable t to denote data blocks rather than single data points. Specifically, we
define observation blocks $Y_t$ and prediction blocks $\hat Y_t^k$ as follows:

$$Y_t \equiv \begin{bmatrix} y_{(t-1)N+1} \\ y_{(t-1)N+2} \\ \vdots \\ y_{(t-1)N+N} \end{bmatrix}, \qquad \hat Y_t^k \equiv \begin{bmatrix} y^k_{(t-1)N+1} \\ y^k_{(t-1)N+2} \\ \vdots \\ y^k_{(t-1)N+N} \end{bmatrix}. \tag{3.2}$$
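The block construction of eq.(3.2) amounts to reshaping the series into non-overlapping length-N segments; a minimal sketch (the helper name is ours):

```python
import numpy as np

def to_blocks(y, N):
    """Group a scalar series into non-overlapping blocks Y_t of length N
    (eq. 3.2): block t collects samples (t-1)N+1 .. (t-1)N+N."""
    T = len(y) // N
    return np.asarray(y[:T * N]).reshape(T, N)

y = np.arange(1.0, 13.0)    # 12 samples
Y = to_blocks(y, N=4)
print(Y.shape)  # (3, 4)
```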
Finally, the credit functions are redefined in terms of block prediction errors.
It has already been remarked in Chapter 2 that there is nothing special about
the quadratic term in the exponential $e^{-\frac{|E_t^k|^2}{2\sigma^2}}$. Any function of the form
$e^{-g(|E_t^k|)}$ will do, as long as g(·) is a strictly positive and increasing function.
Hence a credit update equation of the following form may be used:

$$p_t^k = \frac{e^{-g(|E_t^k|)}\, p_{t-1}^k}{\sum_{n=1}^{K} e^{-g(|E_t^n|)}\, p_{t-1}^n}. \tag{3.5}$$
This equation is presented in (Kehagias and Petridis, 1997a). We have experi-
mented with several such functions and obtained results comparable to those
of the quadratic error function². The general idea is clear: large errors result

²In fact, any strictly positive function G(·) can be written in the form $G(\cdot) = e^{\log[G(\cdot)]}$;
hence, using $g(\cdot) = -\log[G(\cdot)]$, the update

$$p_t^k = \frac{G(|E_t^k|)\, p_{t-1}^k}{\sum_{n=1}^{K} G(|E_t^n|)\, p_{t-1}^n} \tag{3.6}$$
Hence, if at time step t the k-th predictor has a larger error than the m-th one,
then the ratio $p_t^k / p_t^m$ is reduced relative to the ratio $p_{t-1}^k / p_{t-1}^m$. In fact, by repeating the
above argument for times t − 1, t − 2, ..., 1, we obtain

$$\frac{p_t^k}{p_t^m} = \frac{p_0^k}{p_0^m}\cdot e^{-\sum_{s=1}^{t}\left[g(|E_s^k|) - g(|E_s^m|)\right]}. \tag{3.8}$$

It becomes obvious from eq.(3.8) that multiplicative credit update schemes fur-
nish a method for evaluating credits according to the exponentiated cumulative
square error. Namely, if after t observations of the time series the k-th predictor
has a larger error than the m-th one (as measured by the function g(·)), then the
ratio $p_t^k / p_t^m$ is less than one. In short: predictors of larger error receive smaller
credit.
(3.10)

Note that in eqs.(3.9) and (3.10) the function $p_t^k$ actually is the discredit (rather
than the credit) of the k-th predictor: a large value of $p_t^k$ indicates that the
respective predictor is performing poorly. Changing slightly eq.(3.9) we obtain
the recursion

$$p_t^k = \frac{t-1}{t}\, p_{t-1}^k + \frac{|E_t^k|^2}{t}, \tag{3.11}$$

in which case $p_t^k$ yields the running average of the cumulative square error of
the k-th model.
can be used with any strictly positive and decreasing function G(.).
Additive credit update schemes of the forms (3.9), (3.11) are easier to im-
plement than multiplicative ones; however, our experience indicates that their
performance is somewhat inferior to that of multiplicative schemes. In addi-
tion, some attractive properties of the credit function are lost when an additive
algorithm is used. For instance, the properties $0 < p_t^k < 1$ and $\sum_{k=1}^{K} p_t^k = 1$,
for all k and t, do not hold any longer. However, the difference in performance
is not that great and the simplicity of implementation makes additive schemes
an attractive alternative to the multiplicative ones.
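The additive running-average scheme is simple enough to state in a few lines. This is an illustrative sketch; note that $p_t^k$ here is a discredit (the running mean squared error), so the smaller value identifies the better module.

```python
import numpy as np

def running_avg_discredit(errors):
    """Additive scheme of eq.(3.11): p_t = ((t-1)/t) p_{t-1} + |E_t|^2 / t.
    p_t^k is a *discredit*: the running average of the squared error."""
    p = np.zeros(errors.shape[1])
    for t, e in enumerate(errors, start=1):
        p = (t - 1) / t * p + e ** 2 / t
    return p

err = np.column_stack([np.full(10, 0.1), np.full(10, 0.3)])
p = running_avg_discredit(err)
print(p)  # [0.01 0.09]: the mean squared error of each module
```

Unlike the multiplicative schemes, these values neither lie in (0, 1) nor sum to one, in line with the remarks above.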
Fuzzy Set Formulation. The source set is a finite crisp set Θ = {1, 2, ..., K}.
The source variable z takes, as usual, values in Θ. The estimate of z is called
$\hat z_t$ and also takes values in Θ. The computation of $\hat z_t$ at time t is based on a
process of fuzzy inference.
Consider the attribute: "source no. z has been active from time s up to time
t". A crisp set of elements that satisfy this attribute must include exactly one
member of Θ; this is so because it is assumed that the time series is generated
by a single source. However, we propose to use a fuzzy set:

$$A(s,t) = \{(z, \mu_{A(s,t)}(z)) \mid z \in \Theta\}.$$
The fuzzy set A(s, t) consists of the crisp set Θ (the set of possible values of
the source parameter) and the membership function $\mu_{A(s,t)}(z)$; for a given z,
$\mu_{A(s,t)}(z)$ is the membership grade of the attribute "source no. z has been active
from time s to time t". Obviously A(s, t) has a time dependence on times s
and t. Now, consider the k-th member of Θ: $\mu_{A(1,t)}(k)$ is the membership
grade of "source no. k has been active from time 1 to time t", or equivalently,
"observations $y_1, y_2, \ldots, y_t$ have been generated by source no. k". For economy
of space, and also for compatibility with the previous analysis, we use the notation

$$p_t^k \triangleq \mu_{A(1,t)}(k).$$

What is required here is to provide and justify a method for updating $p_t^k$ at
every time step. This will be derived presently; but first note that, for a given
time t, it is natural to set

$$\hat z_t = \arg\max_{k \in \Theta} p_t^k.$$

In other words the time series is classified to the source no. $\hat z_t$ which achieves
maximum membership grade.
where g(·) is any positive increasing function; for instance we can use $g(|E_t^k|) = \frac{|E_t^k|^2}{2\sigma^2}$
to obtain

$$\mu_{A(t-1,t)}(k) = e^{-\frac{|E_t^k|^2}{2\sigma^2}}. \tag{3.13}$$

In eq.(3.13) the membership grade is expressed in terms of predictive accu-
racy: for instance, when $|E_t^k|$ is large, $\mu_{A(t-1,t)}(k) = e^{-\frac{|E_t^k|^2}{2\sigma^2}}$ is small. Now
eqs.(3.12), (3.13) result in the following recursive equation:
rate, (3.14) shows that when $|E_t^k|$ is large, then $e^{-g(|E_t^k|)}$ (and consequently
$\mu_{A(t-1,t)}(k)$) is small; this implies that $\mu_{A(1,t-1)}(k)$ AND $e^{-g(|E_t^k|)}$ is also small.
In fact, a little reflection shows that eq.(3.14) results in a decreasing sequence
of membership grades $p_t^k$. This may result in various implementation problems
(e.g. numerical underflow), so a normalized form will be used in what follows:

$$p_t^k = \frac{p_{t-1}^k \text{ AND } e^{-g(|E_t^k|)}}{\text{OR}_{n=1}^{K}\left(p_{t-1}^n \text{ AND } e^{-g(|E_t^n|)}\right)}. \tag{3.15}$$
The previous comments about the influence of $|E_t^k|$ on $p_t^k$ apply to eq.(3.15)
as well, but now the relative, not absolute, magnitude of $|E_t^k|$ influences $p_t^k$, since
the computation of membership grades is competitive. Hence, a large $|E_t^k|$
does not necessarily imply a small membership grade $p_t^k$; the value of $p_t^k$ may be
large if $|E_t^n| > |E_t^k|$ for n ≠ k, that is, if other predictors perform even worse.
Note that the form of the decision module has not yet been specified; this will
depend on the implementation of the fuzzy AND and OR inference, to be
discussed in the next section.
Modes of Fuzzy Inference. The form of the fuzzy credit assignment de-
pends on the implementation of the fuzzy AND and OR in eq.(3.15). In
fuzzy set theory there are two standard ways to implement such logical oper-
ators (Bezdek, Coray, Gunderson and Watson, 1981a): AND is implemented
by a product and OR is implemented by a sum; alternatively, AND is
implemented by a minimum and OR is implemented by a maximum. Two
"hybrid" combinations are also possible: AND implemented by a product
and OR by a maximum; AND implemented by a minimum and OR by a sum.
Only the first two cases are dealt with here. The Sum/Product Fuzzy
PREMONN Algorithm is based on the equation

$$p_t^k = \frac{p_{t-1}^k \cdot e^{-g(|E_t^k|)}}{\sum_{n=1}^{K} p_{t-1}^n \cdot e^{-g(|E_t^n|)}}. \tag{3.16}$$
This is, of course, exactly the basic PREMONN algorithm. The Max/Min
Fuzzy PREMONN algorithm is based on the equation

$$p_t^k = \frac{p_{t-1}^k \wedge e^{-g(|E_t^k|)}}{\bigvee_{n=1}^{K}\left(p_{t-1}^n \wedge e^{-g(|E_t^n|)}\right)}, \tag{3.17}$$

where ∧ indicates the minimum operator and ∨ indicates the maximum opera-
tor. In addition to the sum/product and max/min algorithms, one can use the
max/product algorithm (in the sum/product algorithm replace the sums with
max operators) and the sum/min algorithm (in the sum/product algorithm
replace the product with min operators). Since these algorithms are obvious
modifications of the sum/product and max/min algorithms, we do not give
their descriptions here. We have introduced these algorithms in (Petridis and
Kehagias, 1997b) under the name Predictive Modular Fuzzy Systems (PRE-
MOFS).
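The max/min update of eq.(3.17) can be simulated directly; in the sketch below the quadratic g and the error values are illustrative choices of ours.

```python
import numpy as np

def maxmin_update(p, err, g=lambda e: e ** 2 / 0.02):
    """Fuzzy max/min credit update of eq.(3.17): AND = minimum,
    OR = maximum, normalized so the largest credit equals one."""
    w = np.minimum(p, np.exp(-g(err)))
    return w / w.max()

p = np.array([1.0, 1.0])
for _ in range(20):
    p = maxmin_update(p, np.array([0.05, 0.20]))
print(p.round(3))
```

Note the property discussed below: after normalization the maximum credit is always exactly one, while the worse predictor settles at a smaller value.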
Phenomenological Point of View. Having obtained the membership grade
update algorithms, we can now rename membership grade as credit function
and the fuzzy point of view can be abandoned in favor of the phenomenolog-
ical one. Let us then rename the $p_t^k$ quantities as credit functions. It will
be observed that the sum/product algorithm is exactly the basic PREMONN
algorithm, while the max/min, sum/min and max/product algorithms are vari-
ations. For instance, consider the max/min algorithm. This says that, at time
t, the credit function $p_t^k$ is a normalized version of $p_{t-1}^k \wedge e^{-g(|E_t^k|)}$. Ignoring the
normalization for the time being, this says that $p_t^k$ will be no greater than $p_{t-1}^k$
or $e^{-g(|E_t^k|)}$. Hence, if either the previous value of the credit function or the
current error term is small, the new value of the credit function will also be small.
Now let us consider the scaling effected by the denominator in eq.(3.17). This
results in the maximum of the $p_t^k$'s being equal to 1. Hence, the max/min al-
gorithm updates the credit functions in the following manner: at every step all
credit functions decrease (or at least do not increase) but the credit of predic-
tors with larger errors decreases more; then the credit functions are rescaled, so
the maximum credit becomes equal to one. It is clear that here we have once
again a case of recursive, competitive credit assignment, just like in the basic
PREMONN algorithm. The usual attractive properties of the credit functions
are preserved. The normalized form of equations (3.16) and (3.17) ensures
that for both algorithms we have $0 < p_t^k \le 1$ for all t and k. In the case
of the sum/product PREMONN $\sum_{k=1}^{K} p_t^k = 1$ for every t and in the case of
the max/min PREMONN $\bigvee_{k=1}^{K} p_t^k = 1$ for every t. Hence the normalization
ensures that at least one $p_t^k$ never becomes too small. In fact, it will be seen
in Chapter 4 that, under appropriate conditions, one $p_t^k$ will tend to one, both
for the sum/product and max/min algorithms.
rithm. This will be used to motivate the "Bayesian" incremental credit assign-
ment scheme. Consider the one-step errors $e_t^k$, k = 1, 2, ..., K, and the following
difference equation

$$q_t^k - q_{t-1}^k = e^{-\frac{|e_t^k|^2}{2\sigma^2}}\, p_{t-1}^k - \left(\sum_{n=1}^{K} p_{t-1}^n\, e^{-\frac{|e_t^n|^2}{2\sigma^2}}\right) q_{t-1}^k. \tag{3.18}$$

Hence, the $q_t^k$'s are defined by the above recursion and some initial conditions $q_0^k$
(k = 1, 2, ..., K) which satisfy

$$q_0^k > 0, \qquad \sum_{k=1}^{K} q_0^k = 1;$$
the $p_t^k$'s are the original Bayesian posterior probabilities defined in Chapter 2.
We claim that if the $q_t^k$'s (as given by the above equation) converge, then,
at least at equilibrium, they approximate the $p_t^k$'s. Indeed, at equilibrium we
have $q_t^k \approx q_{t-1}^k$ and from eq.(3.18) we obtain
Since

$$p_t^k = \frac{p_{t-1}^k\, e^{-\frac{|e_t^k|^2}{2\sigma^2}}}{\sum_{n=1}^{K} p_{t-1}^n\, e^{-\frac{|e_t^n|^2}{2\sigma^2}}}, \tag{3.19}$$

it follows that $q_t^k \approx p_t^k$ (for k = 1, 2, ..., K). The point of introducing the $q_t^k$'s
is to avoid the computation of $p_t^k$ by equation (3.19). In this case the $p_{t-1}^k$'s in
(3.18) are unknown, so let us substitute them by the $q_{t-1}^k$'s, which approximate
them. After some rewriting, eq.(3.18) becomes

$$q_t^k = q_{t-1}^k + \gamma\left[e^{-\frac{|e_t^k|^2}{2\sigma^2}} - \sum_{n=1}^{K} q_{t-1}^n\, e^{-\frac{|e_t^n|^2}{2\sigma^2}}\right] q_{t-1}^k. \tag{3.20}$$

Since the original $p_t^k$'s have disappeared from the picture, let us rewrite eq.(3.20)
replacing the $q_{t-1}^k$'s by $p_{t-1}^k$'s and, in addition, using now the N-step error. We then
obtain the credit update equation (3.21). Finally, let us rename the $q_t^k$'s as $p_t^k$'s (since we expect that $q_t^k \approx p_t^k$) and in
eq.(3.21) let us replace the quadratic function $\frac{|E_t^k|^2}{2\sigma^2}$ by the more general positive
and increasing function $g(|E_t^k|)$. Then eq.(3.21) becomes
Multiplicative:
$$p_t^k = \frac{p_{t-1}^k\, e^{-g(|E_t^k|)}}{\sum_{n=1}^{K} p_{t-1}^n\, e^{-g(|E_t^n|)}}$$

Fuzzy Sum/Product: (identical to the multiplicative update)

Fuzzy Max/Min:
$$p_t^k = \frac{p_{t-1}^k \wedge e^{-g(|E_t^k|)}}{\bigvee_{n=1}^{K}\left(p_{t-1}^n \wedge e^{-g(|E_t^n|)}\right)}$$
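The incremental scheme of eq.(3.20) can be simulated directly; as the text argues, the q's drift toward the Bayesian posteriors. In this sketch (γ, σ and the error values are illustrative) the credit of the lower-error module is driven toward one.

```python
import numpy as np

def icra_step(q, err, gamma=0.5, sigma=0.1):
    """One incremental credit step, after eq.(3.20): the unknown posteriors
    p are replaced by their running approximations q."""
    w = np.exp(-err ** 2 / (2 * sigma ** 2))   # e^{-|e_t^k|^2 / 2 sigma^2}
    return q + gamma * (w - q @ w) * q

q = np.array([0.5, 0.5])
for _ in range(200):
    q = icra_step(q, np.array([0.05, 0.20]))
print(q.round(3))
```

Note that the update preserves the normalization $\sum_k q_t^k = 1$ exactly when the initial conditions sum to one, matching the stated conditions on $q_0^k$.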
Here $e_1, e_2, \ldots, e_t, \ldots$ is a Gaussian white noise process with zero mean
and standard deviation σ. Note that in place of the previously used z we
now have $z_t$. This is a stochastic process taking values in the set Θ =
{1, 2, ..., K}, according to a Markovian law described by $P_{mk}$, the transition
probability matrix, defined by

$$P_{mk} = \Pr(z_t = k \mid z_{t-1} = m).$$

$$\sum_{m=1}^{K}\sum_{n=1}^{K} d_{y_t, z_{t-1}, z_t}(a, n, m \mid y_{t-1}, \ldots, y_1)$$
Theorem 3.1 If the Markovian time series model presented above holds, the
posterior probabilities $p_t^k$ evolve in time according to the following equation:

$$p_t^k = \frac{e^{-g(|E_t^k|)}\sum_{n=1}^{K} P_{nk}\, p_{t-1}^n}{\sum_{m=1}^{K} e^{-g(|E_t^m|)}\sum_{n=1}^{K} P_{nm}\, p_{t-1}^n}. \tag{3.33}$$

Eq.(3.33) has been introduced in (Petridis and Kehagias, 1998). Note that
it is compatible with the fixed-source case; in that case we would have $P_{kk} = 1$
and $P_{nk} = 0$ for n ≠ k and eq.(3.33) would reduce to

$$p_t^k = \frac{e^{-g(|E_t^k|)}\, p_{t-1}^k}{\sum_{n=1}^{K} e^{-g(|E_t^n|)}\, p_{t-1}^n}, \tag{3.34}$$

which is, as expected, the original basic PREMONN credit update equation
(2.17) of Section 2.2.
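A sketch in the spirit of the switching-source update (3.33): credits are first propagated through the transition matrix and then reweighted by prediction performance. The transition matrix, the quadratic g and the error values are illustrative assumptions of ours.

```python
import numpy as np

def markov_premonn_step(p, err, P, g=lambda e: e ** 2 / 0.02):
    """Propagate credits through the transition matrix P, reweight by the
    prediction errors, and normalize. With P = I this reduces to the basic
    fixed-source update, as the text notes."""
    prior = P.T @ p                      # sum_n P_{nk} p_{t-1}^n
    w = prior * np.exp(-g(err))
    return w / w.sum()

P = np.array([[0.95, 0.05],
              [0.05, 0.95]])            # "sticky" two-source switching law
p = np.array([0.5, 0.5])
for _ in range(30):
    p = markov_premonn_step(p, np.array([0.05, 0.20]), P)
print(p.round(3))
```

Because the transition matrix keeps leaking a little probability to the other source, the winning credit settles slightly below one rather than converging to it, unlike the fixed-source case.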
1. There are K sources which are activated successively in time to generate the
time series Yl, Y2, ... .
2. If the Yt sample of the time series is generated by the k-th source, then Yt
may be approximated by
for any measurable sets A ⊂ Θ and B ⊂ ℝ and for any $y_{t-n} \in \mathbb{R}$, $z_{t-n} \in \Theta$
(n > 0). In addition, the process $y_t$ obviously is a (deterministic) function of
$(z_t, y_t)$. Hence the process $[(y_t, z_t), y_t]$ falls within the definition of HMMs. In
fact, the posterior probability update eq.(3.33) is simply the Viterbi decoding
algorithm (Rabiner, 1988). This connection between Markovian PREMONNs
and HMMs is further discussed in Appendix A.
1. Updating the credit at times N, 2N, 3N, ... implies that $z_t$ also changes at
times N, 2N, 3N, ... . Then it is appropriate to replace the state transition
matrix P (which expresses the likelihood of state transitions in one time step)
by the matrix $R = P^N$ (which expresses the likelihood of state transitions
in N time steps). In addition, we will use a source variable $Z_t$ in place of $z_t$;
however, this variable (unlike $Y_t$, $Y_t^k$ and so on) will not be considered as
a vector variable with N components, but as a scalar one. In short, we are
introducing the source variable $Z_t$ which describes source switchings at times
N, 2N, 3N, ... and we will assume that source switchings at intermediate
times are not possible.
2. The "transition" matrix formulation applies to multiplicative and incremen-
tal schemes; it also applies to fuzzy schemes, with the understanding that,
if products are replaced by min operators and sums by max operators, then
matrix multiplication must be accordingly modified.
3. Regarding additive and counting schemes, in place of the "transition" matrix
R , a transition function w( n, k) is used. This function penalizes source
Multiplicative

Additive³

Fuzzy Max/Min:
$$p_t^k = \frac{\left[\bigvee_{n=1}^{K}\left(p_{t-1}^n \wedge R_{nk}\right)\right] \wedge e^{-g(|E_t^k|)}}{\bigvee_{m=1}^{K}\left\{\left[\bigvee_{n=1}^{K}\left(p_{t-1}^n \wedge R_{nm}\right)\right] \wedge e^{-g(|E_t^m|)}\right\}}$$

Incremental
3.6 EXPERIMENTS
In this section we present some comparative experiments on classification of
computer generated time series. The goal of the experiments is to compare
the performance of the PREMONN classification algorithms presented in the
previous sections. In particular, we are interested in comparing classification
accuracy and noise robustness of the PREMONN algorithms. In addition we
want to explore the difference in performance between fixed and switching
sources versions of the same algorithms.
The data used for training and testing the PREMONN algorithms are gen-
erated by four chaotic sources (dynamical systems). Namely, the data are
generated by sources of the form
[Plot: the test time series; y-axis 0.00 to 0.50, x-axis Time Steps 201 to 401.]
4The fuzzy sum/product algorithm is the same as the multiplicative one and hence is not
listed separately.
of classifying the 5000 observations of the test time series to one of four possible
categories (i.e. logistic, tent map, double logistic and double tent map). Both
the fixed source and Markovian switching source versions of each algorithm
are used. All algorithms are used with the quadratic error function $g(e) = \frac{|e|^2}{2\sigma^2}$.
We superimpose on the time series observations additive white noise
3.7 CONCLUSIONS
We have presented several variations of the basic PREMONN classification
algorithm, argued for their phenomenological justification and performed nu-
merical experiments to compare the algorithms. While a relatively small set of
experiments has been presented here, the results corroborate our experience,
which can be summarized as follows.
First, there does not appear to be a significant advantage in the use of
the (Markovian) switching source algorithms; their fixed source counterparts
exhibit comparable and in fact usually better classification accuracy. Hence
the added complexity which must be incorporated in the algorithms to handle
the switching source situation does not appear to be justified.
Second, while the additive algorithms perform better than the multiplicative
ones under noise free conditions, they are not noise robust. The slightly more
complex multiplicative algorithm performs better.
A good combination of complexity and performance is offered by the count-
ing algorithm: while it is extremely simple to implement, it is quite accurate
and noise robust. The fuzzy algorithm has comparable performance; while it
is rather more complex than the counting algorithm, it may be easier to im-
plement than the multiplicative algorithm; the fuzzy interpretation may be
epistemologically more appealing to certain users.
Finally, the ICRA algorithm probably combines the most attractive features.
It has the highest accuracy and noise robustness and can be implemented in
hardware by a neural network.
The experiments presented in this chapter are only meant to impart to the
reader a general idea of the PREMONN algorithms performance. For a better
evaluation of the PREMONN potential two steps are required. In the following
chapter we will present a mathematical analysis of the algorithms and show
that in every case convergence to the "correct" classification is guaranteed
under mild and reasonable assumptions. Then, after a discussion of identification
problems (this is presented in Chapter 5), in Chapters 6 to 9 we will present
applications of the PREMONN algorithms to real world problems.⁵

⁵Before concluding this chapter let us give the answer to the exercise posed on page 55: the
switching times in Figure 3.2 are $t_1 = 101$ and $t_2 = 301$.
4 MATHEMATICAL ANALYSIS
In this chapter we formulate and prove convergence theorems for most of the
credit assignment schemes introduced in Chapter 3. The presentation style is
quite uniform: in every case a theorem is stated, which has the general form:
"the credit of the best model converges to one as time goes to infinity"; what
is meant by "best model" in every case is explained and various remarks are
made regarding the algorithm, the conditions necessary for convergence and so
on. The actual proofs are presented in several Appendices at the end of the
chapter.
4.1 INTRODUCTION
$$g(E_t^k) = \frac{|E_t^k|^2}{2\sigma^2},$$

$$p_t^k = \frac{p_{t-1}^k\, e^{-\frac{|E_t^k|^2}{2\sigma^2}}}{\sum_{n=1}^{K} p_{t-1}^n\, e^{-\frac{|E_t^n|^2}{2\sigma^2}}}. \tag{4.3}$$

For example purposes we examine first the special case of eq.(4.3) and then the
more general case of eq.(4.2). The following theorem is proved in Appendix
4.A.
Now, for k = 1, 2, ..., K, define $c_k \triangleq e^{-E(g(E_t^k))}$ and suppose $c_m$ is the unique
maximum of the $c_k$ for k = 1, 2, ..., K. Then, for the $p_t^k$'s defined by eq.(4.2) we
have, with probability 1, and for k ≠ m,

$$\lim_{t\to\infty} p_t^m = 1, \qquad \lim_{t\to\infty} \frac{p_t^k}{p_t^m} = 0. \tag{4.5}$$

Remark. The boundedness assumptions B2 and B3 are required in order
to establish square integrability of $E_t^k$; they can be replaced by any other
appropriate conditions that yield the same result.
C2 For k = 1, 2, ..., K the function $f_k(z_1, \ldots, z_M)$ is measurable and there is a
constant $a_k$ such that $|f_k(z_1, \ldots, z_M)| \le a_k \cdot \{|z_1| + |z_2| + \cdots + |z_M|\}$.
Remark. Note that for this theorem minimal assumptions are required. On
the other hand, the conclusion is that the credit function of the best model is
greater than that of all other models, but it does not necessarily converge to
one.
Remark. Condition E2 requires that the limits Dk must exist for all k (in a
probabilistic context, this condition would hold for an ergodic time series; but
our formulation avoids any reference to probabilistic concepts.) If the above
conditions are satisfied, convergence to the "best" class is guaranteed and, in
the limit, the largest membership grade is attained by the m-th class which
minimizes prediction error in the sense of limit Cesaro average.
$$\lim_{t\to\infty} p_t^k = 0, \qquad \lim_{t\to\infty} p_t^m = 1, \qquad \lim_{t\to\infty} \frac{p_t^k}{p_t^m} = 0. \tag{4.17}$$

Remark 1. $c_k = E(e^{-g(E_t^k)})$, i.e. the expectation of $e^{-g(E_t^k)}$. Since g(|e|)
is an increasing function of |e|, a large value of $c_k$ implies good predictive
performance. In this sense, $c_k$ can be viewed as a prediction quality index and
it is natural to consider as optimal the predictor m that has maximum $c_m$.
Remark 2. The theorem can be generalized to the case where there is more
than one predictor that achieves the maximum $c_m$; then the total posterior prob-
ability of all such predictors will converge to 1.
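The role of the quality indices $c_k$ can be checked numerically: simulate two error processes, estimate $c_k$ empirically, and verify that the predictor with the larger index absorbs all the credit. All constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
g = lambda e: e ** 2 / 0.02

# two predictors with different error scales; 2000 test steps
E = np.column_stack([rng.normal(0.0, 0.05, 2000),
                     rng.normal(0.0, 0.15, 2000)])

c = np.exp(-g(E)).mean(axis=0)   # empirical quality indices c_k = E(e^{-g(E^k)})

p = np.array([0.5, 0.5])         # run the basic multiplicative credit update
for e in E:
    w = p * np.exp(-g(e))
    p = w / w.sum()

print(c.round(3), p.round(3))
```

The run illustrates the theorem's conclusion: the module with the larger empirical $c_k$ ends with credit essentially equal to one.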
(4.18)

note that $0 < c_{kn} < 1$ for all k, n. Now consider the quantities $\pi_t^{kn}$, defined for
t = 1, 2, ... and k = 1, 2, ..., K by the recursion
presented in Section 4.4. Then the $\pi_t^{kn}$'s, as given by eq.(4.19), are convergent;
this is the conclusion of the following theorem.
Theorem 4.8 Consider the system defined by eq.(4.20), with $c_{kn}$ defined by
eq.(4.18) for k, n = 1, 2, ..., K. Suppose that for a fixed n (1 ≤ n ≤ K) the
following conditions hold.
Then, for the $\pi_t^{kn}$ defined in eq.(4.19), $\lim_{t\to\infty} \pi_t^{kn}$ exists for k = 1, 2, ..., K
and $\lim_{t\to\infty} \pi_t^{nn} > \lim_{t\to\infty} \pi_t^{kn}$ for all k ≠ n.
Remark 1. Condition H1 states that the n-th source remains active for all
time. Of course this will not be true in a switching sources situation. However,
the point of the theorem is to show that the switching sources algorithm will
converge during time intervals between source switchings.
Remark 2. In H2, it is assumed that the n-th prediction quality index, $c_{nn}$,
is the maximum one. In other words, H2 is stated so as to imply that the n-th
source is matched to the n-th (best) predictor. Our usual assumption has been
that the n-th source and n-th predictor are matched. However, if for some
reason the source-to-predictor correspondence were permuted, so that the n-th
active source corresponded to the (say) i-th predictor, then the conclusion of
the theorem would still hold true, provided $c_{in} > \frac{1+\varepsilon}{1-\varepsilon}\, c_{mn}$ is true for all
m ≠ i.
Remark 3. Note that H2 requires that for all m ≠ n the inequality $c_{nn} > \frac{1+\varepsilon}{1-\varepsilon}\, c_{mn}$
is true; this is somewhat stronger than simply requiring the n-th
predictor to be the best one (which would be expressed as $c_{nn} > c_{mn}$).
Remark 5. The above remark about slow switching is related to the rela-
tionship between $p_t^k$ and $\pi_t^{kn}$. Suppose that $z_t$ is fixed to n for some time, say
$T_s$ time steps. Suppose also that (for k = 1, 2, ..., K) the quantities $c_{kn}$ are
close to their limiting values
(this assumption will be true if the original system has ergodic behavior and
N is large). Finally, suppose that the convergence of $\pi_t^{kn}$ (which is guaranteed
by the theorem) takes place (up to desirable accuracy) within some time, say
$T_c$ time steps. If $N \ll T_c \ll T_s$, then it is reasonable that $p_t^k$, as given by
(4.20), is approximated by $\pi_t^{kn}$, as given by (4.19).
4.4 CONCLUSIONS
We have presented convergence proofs for all the fixed source algorithms in-
troduced in the previous chapter, and also for one of the switching sources
algorithms. It can be seen that under reasonable conditions, any of the above
algorithms can be expected to converge to correct classification. Hence we may
use these algorithms confidently. It is important that no probabilistic interpre-
tation of the algorithms was necessary to prove convergence.
$$\frac{p_t^k}{p_t^m} = \frac{p_{t-1}^k}{p_{t-1}^m}\cdot\frac{e^{-\frac{|E_t^k|^2}{2\sigma^2}}}{e^{-\frac{|E_t^m|^2}{2\sigma^2}}} = \cdots = \frac{p_0^k}{p_0^m}\cdot\prod_{s=1}^{t}\frac{e^{-\frac{|E_s^k|^2}{2\sigma^2}}}{e^{-\frac{|E_s^m|^2}{2\sigma^2}}} \tag{4.A.1}$$
By the continuity of the exponential function and (4.A.1) we have that, for all
ε > 0 and almost all $y_1, y_2, \ldots$, there is a $t_\varepsilon$ (depending on $y_1, y_2, \ldots$) such
that for all $t \ge t_\varepsilon$
$$0 \le \frac{p_t^k}{p_t^m} \le \frac{p_0^k}{p_0^m}\cdot\left(\frac{c_k}{c_m}+\varepsilon\right)^{t}. \tag{4.A.3}$$
The third part of eq.(4.4) follows easily from eq.(4.A.3). Note that the term
$p_0^k / p_0^m$ does not affect convergence, as long as neither $p_0^k$ nor $p_0^m$ is zero. Hence
the initial values of the credit functions are not crucial to the convergence of
the algorithm, as long as they are not zero. Now, from eq.(4.A.3) we also have

(4.A.4)

Since, for k = 1, 2, ..., K, the $p_0^k$'s are given, the first bracket in eq.(4.A.4)
is fixed. Since, for k = 1, 2, ..., K, the $c_k$'s are given, and for k ≠ m we have
$c_k < c_m$, we can choose ε small enough so that the second bracket in eq.(4.A.4)
is less than one. So it follows that $\lim_{t\to\infty} \max_{k\neq m} \left(\frac{p_t^k}{p_t^m}\right) = 0$ with prob. 1. Then
we have (with prob. 1)
$$\lim_{t\to\infty} \frac{\sum_{s=1}^{t} g(E_s^k)}{t} = E\big(g(E_t^k)\big). \tag{4.A.6}$$
By the continuity of the exponential function and (4.A.6) we see that for all
ε > 0 and almost all $y_1, y_2, \ldots$, there is a $t_\varepsilon$ (depending on $y_1, y_2, \ldots$) such
that, raising to the t-th power, we have that for all $t \ge t_\varepsilon$ and almost all $y_1, y_2, \ldots$
The rest of the proof is exactly like that of Theorem 4.1 and hence is omitted. ∎
second part of eq.(4.7) follows from the assumption regarding the $c_k$'s. ∎
$$E\big(\mathbf{1}(E_t^k < E_t^n \text{ for } n \neq k)\big) = \Pr(E_t^k < E_t^n,\; n \neq k) \tag{4.C.2}$$
(repeating for times t − 1, t − 2, ..., 1)

$$\frac{p_t^k}{p_t^m} = \frac{p_0^k}{p_0^m}\cdot\frac{e^{-\sum_{s=1}^{t} g(E_s^k)}}{e^{-\sum_{s=1}^{t} g(E_s^m)}} \;\Rightarrow\; \frac{p_t^k}{p_t^m} = \frac{p_0^k}{p_0^m}\cdot\left[\frac{e^{-\frac{1}{t}\sum_{s=1}^{t} g(E_s^k)}}{e^{-\frac{1}{t}\sum_{s=1}^{t} g(E_s^m)}}\right]^{t}. \tag{4.D.2}$$
Using the limit of Cesaro average of g(E:), g(E;') and the continuity of the
exponential function, we conclude that for every E > 0 there is some t€ such
that for all t > t€ we have
Since, by assumption E2, Cm is strictly larger than Ck for all k =1= m, one can
find some E small enough that the bracketed term above is less than one. Raise
eq.(4.D.3) to the t-th power; for every t > tf we get
p_t^k / p_t^m < (p_0^k / p_0^m) · (c_k/c_m + ε)^t,    (4.D.4)

lim_{t→∞} p_t^k / p_t^m = 0 for all k ≠ m;
this proves the third part of eq.(4.12). The first and second part of the same
equation are proved in much the same way as in Theorem 4.2. ∎
Proof of Theorem 4.6: The credit update equation for the fuzzy max/min
algorithm can be written using c_{k,t} as

p_t^k = (p_{t−1}^k ∧ c_{k,t}) / ⋁_{n=1}^K (p_{t−1}^n ∧ c_{n,t}).    (4.D.5)
Suppose that for some time s we have p_s^m < c_{m,s}; then, since p_s^m ∧ c_{m,s} is the
minimum of p_s^m, c_{m,s}, we must have p_s^m ∧ c_{m,s} = p_s^m and eq.(4.D.5) yields

p_{s+1}^m = p_s^m / ⋁_{n=1}^K (p_s^n ∧ c_{n,s}).    (4.D.6)
On the other hand, for n = 1, 2, ..., K

p_s^n ∧ c_{n,s} ≤ c_{n,s}  ⇒  ⋁_{n=1}^K (p_s^n ∧ c_{n,s}) ≤ ⋁_{n=1}^K c_{n,s} = c_{m,s},    (4.D.7)
where the last maximum equals em,s by (4.14). Use (4.D.7) in the right hand
side denominator of (4.D.6):
p_{s+1}^m = p_s^m / ⋁_{n=1}^K (p_s^n ∧ c_{n,s}) ≥ p_s^m · (1/c_{m,s}) ≥ p_s^m · (1/d).    (4.D.8)
The last inequality follows from (4.14). Now, applying (4.D.8) τ times we get

p_{s+τ}^m ≥ p_s^m · (1/d)^τ

and, taking τ large enough, p_{s+τ}^m will get larger than c_{m,s+τ}, which is bounded
above by eq.(4.14). In short: for some t_0 = s + τ we have p_{t_0}^m ≥ c_{m,t_0} (4.D.9)
and, for k ≠ m,

p_{t_0}^k ∧ c_{k,t_0} ≤ c_{k,t_0} ≤ γ.    (4.D.10)
Combining (4.D.9) and (4.D.10) (and using the assumption β > γ) it is con-
cluded that

⋁_{n=1}^K (p_{t_0}^n ∧ c_{n,t_0}) ≤ ⋁_{n=1}^K c_{n,t_0} = c_{m,t_0}  ⇒  p_{t_0+1}^m ≥ c_{m,t_0}/c_{m,t_0} = 1.    (4.D.11)
Since by construction we also have p_{t_0+1}^m ≤ 1, it follows that p_{t_0+1}^m = 1. But
then p_{t_0+1}^m ≥ c_{m,t_0+1} and the argument can be repeated from (4.D.9) on.
From this it follows that there is some t_0 such that for all t ≥ t_0 we have p_t^m = 1.
This yields the first part of eq.(4.15). Similarly, one sees that for all t ≥ t_0 and
k ≠ m
To prove positivity of p_t^k, we will again use induction. Suppose that for t = s
we have 0 < p_{s−1}^k < 1 for k = 1, 2, ..., K. Now we will show that
since p_{s−1}^k < 1 and e^{−g(E_s^m)} < 1. Then from eq.(4.E.4) it follows that
¹A sigma-field F generated by random variables u_1, u_2, ... is defined to be the set of all sets
of events dependent only on u_1, u_2, .... A random variable v is said to be F-measurable if
knowledge of u_1, u_2, ... completely determines v; in other words, either v is one of u_1, u_2,
... or it is a function of them: v(u_1, u_2, ...). Note that the total number of u_1, u_2, ... may
be finite, countably infinite or even uncountably infinite. For more details see (Billingsley,
1986).
(4.E.5)
(4.E.6)
the important point is that the quantity in curly brackets has a limit. Since
p^m > 0, it can be cancelled on both sides of (4.E.7); then we get
(4.E.8)
(4.F.1)
(Actually the q_t^k's depend on n as well, but since we will consider n fixed in
what follows, we suppress this dependence from the notation.) Comparing
eq.(4.19) and eq.(4.F.1), we see that the q_t^k's are simply the unscaled versions
of the π_t^{mn}'s. This is actually proved in the following Lemma.
π_1^{1n}/q_1^1 = π_1^{2n}/q_1^2 = ... = π_1^{Kn}/q_1^K = (π_1^{1n} + ... + π_1^{Kn})/(q_1^1 + ... + q_1^K) = 1/λ_1.

Now suppose that the proposition holds for t = r. Then π_r^{mn} = (1/λ_r) · q_r^m for
m = 1, 2, ..., K and
Now, to prove convergence we work for a while with the auxiliary quantities
q_t^k rather than the π_t^{mn}'s. Define q_t = [q_t^1, q_t^2, ..., q_t^K] and Q = RD.

Σ_{m=1}^K w_m (c_m − λ) = 0.
This leads to a contradiction, because we assumed that λ ≥ 1 and, for all m,
w_m > 0 and c_m < 1; so we have a sum of strictly negative numbers that equals
zero. Hence we must have λ < 1 and the proof is complete. ∎
[Q^t]_{ni} / λ^t → w_n · v_i  ⇒  [Q^t]_{ni} / [Q^t]_{nj} → v_i / v_j = γ_{ij}.    (4.F.3)
Since q_t = q_{t−1}Q, q_{t−1} = q_{t−2}Q etc., finally q_t = q_0 Q^t. Then we have for
m, l = 1, 2, ..., K

q_t^m = q_0^1[Q^t]_{1m} + q_0^2[Q^t]_{2m} + ... + q_0^K[Q^t]_{Km},    (4.F.4)

q_t^m / q_t^l = (q_0^1[Q^t]_{1m} + ... + q_0^K[Q^t]_{Km}) / (q_0^1[Q^t]_{1l} + ... + q_0^K[Q^t]_{Kl}).    (4.F.5)
Taking i = 1, 2, ..., K and j = n in (4.F.2) we obtain

(4.F.7)

(q_t^m / q_t^l) · ([Q^t]_{nl} / [Q^t]_{nm}) → 1.    (4.F.9)
In other words

Σ_{m=1}^K π_t^{mn} = 1  ⇒  1/π_t^{ln} → Σ_{m=1}^K γ_{ml}  ⇒  π_t^{ln} → 1 / Σ_{m=1}^K γ_{ml}.
(π^{mn} / π^{nn}) · (Σ_{k=1}^K R_{km} / R_{nn}) · (d_{mn} / d_{nn}) ≤ (π^{mn} / π^{nn}) · ((1 + Kε)/(1 − ε)) · (d_{mn} / d_{nn}).
Usually the "best" φ or θ value is taken to be the value which optimizes some
appropriate function J(φ) or J(θ). Hence, in case the identification problem
is solved offline, it reduces to a standard optimization problem, which can
be solved by steepest descent methods (especially Levenberg-Marquardt type
algorithms), genetic algorithms and simulated annealing, to mention only a few
possibilities. For the online case, some of the preferred methods are recursive
least squares, extended Kalman filtering, instrumental variable methods and so
on. An excellent overview of the subject can be found in (Ljung, 1987).
While the formulation of the identification problem is straightforward, its
solution is not always easy, especially in case eq.(5.1) is nonlinear. Depending
on the nature of the nonlinearity, the offline case may require the solution of an
arbitrarily hard optimization problem, while generally efficient online methods
are available only for linear modelling functions. In short, at this point in
time the general system identification problem may be considered practically
unsolved.
This is merely a change in notation, and has been employed so that eq.(5.2) re-
sembles eq.(2.1). Indeed the two equations are identical except for the presence
in eq.(5.2) of the input term Ut. We have already remarked that such an input
term does not alter in any way either the applicability of the PREMONN al-
gorithms or their phenomenological justification. Hence the similarity between
the classification and identification problems becomes apparent. In classifica-
tion terms, the parameter estimation and black box identification problems can
be described as follows.
(5.3)
Let us illustrate the method for the case d = 1. In this case, suppose that it is
known that θ takes values in an interval Θ = [a, b] ⊂ R^1. [a, b] can be quantized
into a K-member set Θ̄ = {a, a+δx, a+2δx, ..., b}, where δx = (b−a)/(K−1). In other
words, we replace Θ with Θ̄ = {θ_1, θ_2, ..., θ_K}, where, for k = 1, 2, ..., K,

θ_k = a + ((b − a)/(K − 1)) · (k − 1).
y_t^k = f_{θ_k}(y_{t−1}, y_{t−2}, ..., y_{t−M})

and run any of the usual PREMONN credit assignment algorithms; the only
modification is that now p_t^k refers to the credit of the k-th model (with para-
meter θ_k) rather than the k-th predictor. It is clear that this is only a change
of nomenclature.
If the quantization is sufficiently fine, then it is reasonable to expect that
for some k we will have |θ_k − θ_0| sufficiently small, so that y_t^k approximates
y_t much better than the remaining y_t^m, m ≠ k.² Hence, by the analysis of
Chapter 4 it is reasonable to expect that, given sufficient time, p_t^k will approach
one and hence θ_k, the "best" approximation to θ_0, will be identified.
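To make the scheme concrete, the quantized-parameter method can be sketched as follows. This is a minimal illustration, not the book's implementation: the logistic-map "source", the Gaussian credit function, σ and the grid values are all our own choices.

```python
import math

def estimate_parameter(observations, f, thetas, sigma=0.1):
    """Score K candidate models theta_1..theta_K by a multiplicative
    (Gaussian) predictive credit update and return the winning theta."""
    K = len(thetas)
    p = [1.0 / K] * K                               # initial credits p_0^k, all nonzero
    for t in range(1, len(observations)):
        y_t = observations[t]
        for k, theta in enumerate(thetas):
            y_hat = f(theta, observations[t - 1])   # one-step prediction y_t^k
            err = y_t - y_hat                       # prediction error E_t^k
            p[k] *= math.exp(-err * err / (2 * sigma ** 2))
        s = sum(p)
        p = [pk / s for pk in p]                    # normalize credits each step
    k_star = max(range(K), key=lambda k: p[k])
    return thetas[k_star], p

# Toy source: logistic map y_t = theta_0 * y_{t-1} * (1 - y_{t-1}), theta_0 = 3.7
f = lambda th, y: th * y * (1 - y)
ys = [0.3]
for _ in range(60):
    ys.append(f(3.7, ys[-1]))
thetas = [3.5 + 0.1 * k for k in range(6)]          # quantized grid {3.5, ..., 4.0}
best, credits = estimate_parameter(ys, f, thetas)
```

The grid value closest to the true parameter accumulates the smallest prediction errors and therefore the highest credit, exactly as argued above.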
This method really amounts to exhaustively searching the parameter set
by simulating several candidate models and scoring them according to their
prediction error. In case the true system has fixed parameter θ_0, this is probably
the simplest parameter estimation method available. In case the true system
parameter is θ_0(t), a time varying quantity, then the recursive nature of the
PREMONN credit update is rather useful, because it allows for online scoring.
Conceptually it is quite obvious that the above method can be applied for
any value of d. However, if we assume Q levels of quantization per dimension
of the parameter space, the total number of models becomes K = Q^d and it is clear
that the method is practicable only as long as d and Q are relatively small. In
particular, even for a very coarse quantization (i.e. small Q), when d is bigger
than three or four, the curse of dimensionality results in an unmanageably
large number K.
In the case of large d, quantization may still be used in conjunction with a
method of moving grids. This method starts with a coarse quantization; as soon
as a promising value of θ is identified, a finer quantization is effected around
this value and a new value of θ is obtained. In every iteration the θ value
with highest p_t^k is selected; since p_t^k is related to a cumulative error function
(see Chapters 3 and 4), it is reasonable to expect that successive iterations of
the above procedure yield θ values with progressively decreasing cumulative
square error. Essentially this leads to a steepest-descent-like approximation
to the true parameter value θ_0. Iterative application may yield a sequence
²In other words we assume that proximity in the parameter space results in proximity in
the output space. While this assumption will generally not be true (especially in the case of
nonlinear systems) it holds true in a number of cases large enough to make it useful.
θ^(1), θ^(2), ... which converges to θ_0. However, because of the greedy nature of
the algorithm, convergence to a locally optimal estimate (which may in fact be
quite distant from θ_0) is also possible. To overcome this problem, as well as the
curse of dimensionality, a more sophisticated search strategy is needed; this is
presented in the next section.
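The moving-grid idea just described can be sketched as follows. In the book the score of a candidate θ would be its credit p_t^k; here a simple illustrative scoring function stands in for it, and the grid size, shrink factor and "true" parameter are hypothetical choices.

```python
def moving_grid(score, lo, hi, K=11, iterations=6, shrink=0.3):
    """Coarse-to-fine grid search: quantize the current interval into K
    values, keep the best-scoring one, then re-grid a shrunken interval
    around it (a greedy search that can stop at a local optimum)."""
    center = None
    for _ in range(iterations):
        step = (hi - lo) / (K - 1)
        grid = [lo + step * k for k in range(K)]
        center = max(grid, key=score)          # theta value with highest score
        width = (hi - lo) * shrink
        lo, hi = center - width / 2, center + width / 2
    return center

# Toy score: negative squared distance from a "true" parameter value 2.37
best = moving_grid(lambda th: -(th - 2.37) ** 2, 0.0, 10.0)
```

With a unimodal score this converges quickly; with a multimodal one it can lock onto a local optimum, which is precisely the weakness the genetic search of the next section addresses.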
In "genetic" terms, the final goal (discovering the true parameter vector θ_0) is
translated into finding the "fittest individual".
In practice it is certain that the j-th parameter will lie within an interval
[A_j, B_j], j = 1, 2, ..., d. Each interval is discretized, using 2^n steps. Since θ is a
vector of length d, there are d parameters, each taking 2^n values; hence there
are (2^n)^d = 2^{n·d} parameter combinations, resulting in 2^{n·d} different models,
each coded by an n·d-bit string; this is the "genotype" of the particular model.
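The genotype-to-parameter mapping can be sketched as follows; the bounds, bit count and d = 2 below are hypothetical, chosen only to illustrate the binary encoding.

```python
def decode(bits, bounds, n):
    """Map an n*d-bit genotype string to a d-dimensional parameter vector:
    the j-th n-bit slice is an integer level in [0, 2^n - 1], scaled
    linearly into the interval [A_j, B_j]."""
    theta = []
    for j, (a, b) in enumerate(bounds):
        level = int(bits[j * n:(j + 1) * n], 2)
        theta.append(a + (b - a) * level / (2 ** n - 1))
    return theta

bounds = [(0.0, 1.0), (-5.0, 5.0)]               # hypothetical [A_j, B_j] ranges, d = 2
theta = decode("1111111100000000", bounds, n=8)  # a 16-bit genotype
```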
Initialization
    Create randomly Θ^(0) = {θ_1^(0), θ_2^(0), ..., θ_K^(0)}, the first generation of K models.
Main
    Set r ← 1.
        add it to Θ^(i);
        set r ← r + 1;
    else
        discard it.
    End if.
End while
Next i
Set θ* ← θ_k^(i−1), where k = arg max_{k=1,2,...,K} p_t^k.
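The printed listing abbreviates the acceptance test and the genetic operators. As an illustration only, a generic credit-proportional genetic search in the spirit of the algorithm can be sketched as follows; the one-point crossover, bit-flip mutation, fitness function, rates and sizes are our own assumptions, not the book's values.

```python
import random

def credit_ga(fitness, n_bits, pop_size=20, generations=40, p_mut=0.02, seed=0):
    """Credit-guided genetic search over bit-string genotypes: parents are
    drawn with probability proportional to their credit (fitness score),
    children are produced by one-point crossover and bit-flip mutation."""
    rng = random.Random(seed)
    pop = ["".join(rng.choice("01") for _ in range(n_bits)) for _ in range(pop_size)]
    for _ in range(generations):
        credits = [fitness(g) for g in pop]            # must be positive
        new_pop = []
        while len(new_pop) < pop_size:
            ma, pa = rng.choices(pop, weights=credits, k=2)
            cut = rng.randrange(1, n_bits)             # one-point crossover
            child = ma[:cut] + pa[cut:]
            child = "".join(b if rng.random() > p_mut else "10"[int(b)]
                            for b in child)            # bit-flip mutation
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# Toy fitness: "one-max" (count of 1-bits, shifted to stay positive)
best = credit_ga(lambda g: 1 + g.count("1"), n_bits=16)
```

In the book's setting the fitness of a genotype would be the predictive credit of the model it decodes to, rather than a bit count.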
5.5 EXPERIMENTS
We now present two sets of parameter estimation experiments. Both sets utilize
computer simulations of dynamical systems; the first system involves a small
parameter set and the second a large parameter set (hence requires the use of
the Credit Assignment / Genetic Parameter Estimation Algorithm).
w_t = w_{t−1} + δt · { (3P L_o / 2J) · (i_{t−1}^{qs} i_{t−1}^{dr} − i_{t−1}^{ds} i_{t−1}^{qr}) − T_L / J },    (5.5)

where

(5.6)

and A, B are 4×4 matrices: A is the constant inductance matrix, with entries built
from L_s, L_o and L_r (5.8), and B is the resistance/speed matrix, with entries built
from R_s, R_r and the speed-dependent terms ±L_o w_{t−1}, ±L_r w_{t−1} (5.9).
Here i_{qs}, i_{ds} are stator currents, i_{qr}, i_{dr} are rotor currents, w_t is angular velocity,
V_{qs}, V_{ds} are stator voltages and T_L is the load torque. δt is the integration step; R_s,
SYSTEM IDENTIFICATION BY THE PREDICTIVE MODULAR APPROACH 91
R_r are stator and rotor resistances, L_s, L_r, L_o are stator, rotor and mutual
inductances; J is moment of inertia and P is number of pole pairs. Y_t and w_t
are the states; U_t and T_L are the inputs.
All the parameters are assumed known, except for R_r. However, R_r is
necessary for the determination of the motor time constant T_r, which in turn
is required for efficient and economic angular velocity control. In addition, R_r
depends on operating conditions; in other words, it may be time varying. Hence
we are faced with a typical online parameter estimation problem, for which
various methods of solution have been proposed in the past; see (Krishnan and
Doran, 1987) for a specific method and (Krishnan and Doran, 1991) for a good
review.
Using the predictive credit approach to parameter estimation, we consider
that the vector sequence [i_t^{qs} i_t^{ds}]′ constitutes the time series Y_1, Y_2, ... which
is assumed to have been generated by a system of the form (5.4), (5.5). What
is unknown is the value of R_r, which plays the role of the source parameter θ_0.
To obtain a sample of the Y_1, Y_2, ... time series we integrate eqs.(5.4),
(5.5) with the integration step set to δt = 0.5 msec. However, for the parameter
estimation experiments we actually use subsampled time series, with several
values of the subsampling time δs; namely δs is taken to be 0.5 (full sampling),
1, 2, 3 msec.³ Finally, the error order N is taken equal to 10.
Each simulation is run for 10000 time steps, each step corresponding to 0.5
milliseconds of real time; hence the operation of the motor is simulated for a
5 second time interval. Input is a three phase AC voltage of 220 Volts RMS
value and torque T L =1.5 N·m. The actual motor has the following parameters:
Rs=11.58 Ohm, Ls =0.071 Henry, Lr=0.072 Henry, Lo=0.069 Henry, J=0.089
kg·m², B=0 Nt·sec/m, P=2. In the experiment the following strategy is used
to simulate the effect of R_r variation: the value of R_r is changed every
1000 steps (i.e. every 0.5 sec); hence ten R_r values are used: from time t=0.0
to 0.5 seconds R_r = 4.9 Ohm, from 0.5 to 1.0 seconds R_r = 5.9 Ohm and so on
until the value 13.9 Ohm. Finally, the time series observations of the stator
current were mixed with additive noise at various noise levels, indicated by the
signal to noise ratio SNR.
We use ten candidate parameter values (K=10), tuned to R_r values of 5, 6, ...,
14 Ohm. When the actual R_r value is 4.9, the best estimate is 5 Ohm; similarly
for R_r = 5.9, 6.9, .... Hence we can evaluate the results of the parameter
estimation experiments by listing the usual c figure of classification accuracy,
where a correct classification occurs when the algorithm picks the parameter
value which is closest to the currently active true parameter. The c figures for
various noise levels and sampling times are presented in Table 5.1.
3A larger sampling time implies that less information is obtained about the operation of the
motor and fewer comparisons are performed between the true system and the predictors.
Presumably, this makes the identification task harder.
Table 5.1. Classification figure of merit c for AC motor parameter estimation experiment.
A_11 = I_1 + I_2 + m_2 l_1 l_2 cos(φ_2) + ¼ m_1 l_1² + m_2 l_1² + ¼ m_2 l_2²
A_12 = I_2 + ¼ m_2 l_2² + ½ m_2 l_1 l_2 cos(φ_2)
A_22 = I_2 + ¼ m_2 l_2²
B_12 = −m_2 l_1 l_2 sin(φ_2)
B_21 = m_2 l_1 l_2 sin(φ_2)

A = [A_11 A_12; A_12 A_22],    (5.10)

x_t = [φ_{1,t−1} φ_{2,t−1}]′,

y_t = [l_1 cos(φ_{1t}) + l_2 cos(φ_{1t} − φ_{2t});  l_1 sin(φ_{1t}) + l_2 sin(φ_{1t} − φ_{2t})]    (5.12)
and u_t = [φ_{1t} φ_{2t}]′ and θ = [m_1 m_2 l_1 l_2]′. This completes the discrete
time description of the system. Next we integrate eq.(5.11) and add white,
zero-mean and uniformly distributed noise to obtain the final time series Y_1,
Y_2, ....
The genetic algorithm parameters are as follows: number of bits per parameter
n=13, crossover probability q=0.8, population size K=50, number of break
points m=4, number of generations I_max=5000. The parameter space Θ is a
subset of R^4, namely Θ = [A_1,B_1] × [A_2,B_2] × [A_3,B_3] × [A_4,B_4]. Here, for
j=1, 2, 3, 4, A_j is the lower bound of the j-th parameter, chosen to be 30%
of the true parameter value, and B_j is the upper bound of the j-th parameter,
chosen to be 200% of the true parameter value.
The parameter estimation experiment is run at various noise levels; in other
words the additive noise mixed into the series is uniformly distributed in the
interval [−A, A], where A takes the values 0° (noise free), 0.25°, 0.50°, 1°, 2°,
3° and 5°. For every one of the above noise levels one hundred experiments are
run. The accuracy of the parameter estimates for every experiment is expressed
by the following quantity:
S = ( |δm_1|/m_1 + |δm_2|/m_2 + |δl_1|/l_1 + |δl_2|/l_2 ) / 4,
where δm_1 is the error in the estimate of m_1 and similarly for the remaining
parameters. In other words, S is the average relative error in the parameter
estimates.
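The quantity S can be computed as follows; the true values and estimates below are hypothetical numbers chosen only to exercise the formula.

```python
def avg_relative_error(true_params, estimates):
    """S: mean of |delta_theta_j| / theta_j over the parameters (the text
    writes it out for the d = 4 manipulator parameters m1, m2, l1, l2)."""
    return sum(abs(e - t) / t
               for t, e in zip(true_params, estimates)) / len(true_params)

# Hypothetical true values (m1, m2, l1, l2) and their estimates
S = avg_relative_error([1.0, 2.0, 0.5, 0.4], [1.1, 1.9, 0.5, 0.38])
```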
The experiment results are presented in Table 5.3, in cumulative form. For
comparison purposes, we also list in Table 5.4 similar results for a more tradi-
tional genetic algorithm which uses selection probabilities proportional to the
inverse of total square error. In both tables the presentation format is the
same. Each row lists (for various noise levels) the number of experiments (out
of a total of one hundred) for which the final S error figure is less than the
number indicated in the first column. As can be seen, the credit assignment /
Table 5.3. Parameter Estimation by Credit Function Genetic Algorithm; Accuracy Results.
(Figure: frequency histogram of parameter estimates; horizontal axis: % of correct value.)
We conjecture that this is due to the relative insensitivity of the manipulator
behavior to mass values.
Finally, let us mention that the average duration of one run of the parameter
estimation algorithm is 7 minutes on an HP Apollo 735 workstation.
5.6 CONCLUSIONS
In this chapter we have formulated the system identification problem in classi-
fication terms and seen how it can be solved by using the "predictive modular
approach" , which essentially is the PREMONN approach minus the neural net-
works terminology. It is rather clear at this point that the important part of
PREMONN is the "predictive modular" credit assignment; the "neural" part
is rather incidental since the "predictors" or "models" can be implemented in
non-neural ways.
In previous chapters we have applied the predictive modular credit assign-
ment approach to finite source or model sets. In this chapter we have seen
that in some cases of parameter estimation problems the above approach also
applies to infinite source or model sets. We have presented experiments which
indicate that such infinite sets can be searched efficiently by the divide-and-
conquer approach, whereby only a finite subset is examined at any given stage
of the search process. However, the algorithms we have presented so far are
rather ad hoc and their convergence properties cannot be analyzed in a rigorous
manner. In Part III we will treat in considerably greater detail the black box
(Figures: frequency histograms of parameter estimates; horizontal axis: % of correct value.)
(Figure: frequency histogram of parameter estimates; horizontal axis: % of correct value.)
(Figure: block diagram; the observation y_t feeds the credit assignment module, which outputs the credits p_t^1, p_t^2, ..., p_t^K.)
6.2 PREDICTION
on the fifth time series observation, the earliest moment at which this source
switch can be registered is after the next credit update, i.e. five time steps
later. If two source switches take place, say on the fifth and seventh time steps,
then the first source switch will probably pass completely unnoticed.
Essentially, a high value of N results in "low resolution" of the algorithm
along the time scale. The converse situation holds when low values of N are
used. The algorithm has increased resolution, at the expense of being less
robust to noise, since random error predictions are not averaged out.
Choosing a value for h has already been discussed in Section 2.3. Let us
now present some remarks regarding the choice of the R matrix. In case we
have some prior knowledge about the source switching mechanism, this may
be incorporated in R. For example, consider a case of two sources: source no.1
is active at times t = 1, 3, 5, ... and source no.2 is active at times t = 2, 4, 6, ...;
in this case we would set R_kk = 0 (k = 1, 2) and R_nk = 1 (n, k = 1, 2,
n ≠ k). In some cases only partial information is available about the source
switching mechanism. For instance, suppose we know that source switching is
slow, i.e. once a source is activated, it will remain activated for a fairly long
time. This information can be incorporated in R by setting R_kk = 1 − (K−1)·ε
(k = 1, ..., K) and R_nk = ε (n, k = 1, ..., K, n ≠ k), where ε is a small positive
number. The above remarks, intended for multiplicative algorithms, carry over
in exactly analogous manner to the case of fuzzy and incremental algorithms.
Regarding additive algorithms, an upper threshold A may be used in a manner
analogous to that of h: if the discredit becomes greater than A, it is no longer
increased. Similarly, in the case of counting algorithms, an upper threshold A
can be used: whenever credit becomes greater than A, it is no longer increased.
Similar remarks can be made regarding choice of w in the additive and counting
switching source algorithms. For instance, in the counting scheme (Section 3.5),
large values Wkk and small values Wkn, n i=- k are used when it is known that
source switching is slow.
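The "slow switching" choice of R can be sketched as follows; K and ε below are illustrative values.

```python
def slow_switching_R(K, eps=0.01):
    """Source transition matrix encoding 'switching is slow':
    R[k][k] = 1 - (K - 1) * eps, off-diagonal entries eps; rows sum to one."""
    return [[1 - (K - 1) * eps if n == k else eps for k in range(K)]
            for n in range(K)]

R = slow_switching_R(K=3, eps=0.05)   # heavy diagonal: sources tend to persist
```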
(6.1)
When σ is small, the expression (6.1) will be small for every k; what is more
important, however, is that any differences in the magnitude of the |E_t^k|² terms
are amplified in the ratio

e^{−E(|E_t^k|²)/2σ²} / e^{−E(|E_t^n|²)/2σ²}

(k, n = 1, ..., K). If predictor k has large mean square error and predictor n
has small mean square error, then for a small σ the above ratio is close to 0,
and convergence is fast. If, on the other hand, σ is large, the above ratio is
close to one and convergence is slow. In this sense, σ operates in a manner
similar to N, the block size parameter.
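The effect of σ on the credit ratio can be checked numerically; the mean square errors below are hypothetical.

```python
import math

def credit_ratio(mse_k, mse_n, sigma):
    """Ratio of per-step credit factors for predictors with mean square
    errors mse_k (large) and mse_n (small); values near 0 mean the bad
    predictor's credit decays quickly, i.e. fast convergence."""
    return math.exp(-mse_k / (2 * sigma ** 2)) / math.exp(-mse_n / (2 * sigma ** 2))

fast = credit_ratio(0.5, 0.1, sigma=0.2)   # small sigma: ratio near 0
slow = credit_ratio(0.5, 0.1, sigma=2.0)   # large sigma: ratio near 1
```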
In this case, |E_t^k|² reduces to the proportional error of the k-th predictor with
respect to the total error, at time t.
not on absolute, but on relative predictive accuracy. In other words, the time
series will be classified to a source even if the respective predictor performs
poorly, as long as it consistently outperforms the remaining predictors. This
results in considerable robustness to prediction error. Good predictors gener-
ally result in superior classification performance, but, within certain bounds,
high prediction error does not affect classification too much: since some clas-
sification must always take place, if no good class is available, the "least bad"
one will be chosen.
6.3.5 Modularity
All the remarks made in Section 6.2.5, regarding modularity of the predictive
modules, can be repeated here with reference to the credit assignment
module. In particular, at any point in the classification or prediction process,
the credit assignment module can be replaced by a new one, implementing a
different credit assignment algorithm. This will not require retraining of the
remaining modules.
7.1 INTRODUCTION
Visually Evoked Potential (VEP), or Visually Evoked Response (VER), is an
electroencephalographic signal. Specifically, VER is the total electrical response
of the visual cortex to noninvasive visual stimulation. Electrical responses are
evoked from human subjects (by presenting them with a stimulus, usually a
flash of light or alternating checkerboards (Sokol, 1976)) and measured by appro-
priately placed electrodes. The VER signal has rather low amplitude (0.1-20
μV) and is superimposed on other components, caused by the normal activity
of the brain. Recording VER involves extracting the relevant information from
the ongoing EEG, with a signal to noise ratio of about -5 dB.
Abnormal VER responses, which can be used for diagnosing neuroophthal-
mological disorders, can be detected by evaluating features of the response
¹The work reported here was carried out by M. Swiercz and M. Grusza of the Technical
University of Bialystok, Poland and P. Sobolewski, of the Ophthalmology Division, Suwalki
District Hospital, Poland. We want to thank these researchers for giving us permission to
present their work in this book.
1. When the "flash" stimulation is used there is a slight initial negative deflec-
tion followed by a W-shaped wave. The positive components (peaks) of that
wave are called P1 and P2 waves. The time elapsed from the stimulation
to the first positive component (P1) is called the latency and the difference
between the minimum and the maximum values of the waveform is called the
amplitude.
The morphology and the topography are the qualitative (not quantitative)
features of the waveform. There are no precise definitions of these features
(and the criteria of their comparison); generally speaking they relate to the
location of the positive and negative components of the waveform, the time
distance between them, the "flatness" of the waveform, the fluctuations of the
VER after the major peaks etc.
In short, while latency and amplitude can be easily quantified, they do not
furnish sufficient information for diagnosis. Wave morphology and topography,
on the other hand, are more informative but their evaluation is harder and
requires subjective expert interpretation. Some diseases are characterized by
specific VER patterns, which result in clear clinical diagnosis when correlated
with other clinical symptoms (Halliday and Kriss, 1976); however, in many
cases the morphological analysis of a complex VER signal is not reliable, because
of the lack of objective evaluation rules and the irregular features of each
individual VER waveform.
VER-based diagnosis of neuroopthalmological disorders can be seen as a
classification task: the subject's VER waveform must be assigned to one of
several possible classes, one class corresponding to each neuroopthalmological
disorder (or to the healthy state). Neural network classification methods can
be applied either to static feature vectors obtained by preprocessing aVER
signal (static pattern classification) or by considering the dynamically evolving
time series (time series classification).
Swiercz, Grusza and Sobolewski initially adopted the static pattern clas-
sification approach, using lumped neural network classifiers. This approach
provided useful results: it was found that lumped neural network classifiers are
capable of modelling the VER / disorder association up to a certain level of pre-
cision and can be useful in setting up more objective diagnosis procedures for
several neuroophthalmological disorders. However, the performance of lumped
classifiers has not been absolutely satisfactory and hence Swiercz, Grusza and
Sobolewski decided to apply the PREMONN methodology to the classification
problem, hoping for an improvement of classification rates.
Clearly the PREMONN classification algorithm is well suited to the VER
classification problem. Specific neuroophthalmological disorders appear to have
typical patterns, each of which can be considered as generated by a specific
source; the set of disorders (and corresponding sources) may be considered
finite. As will become clear in Section 7.4, the PREMONN algorithm yielded
higher classification accuracy, as compared to previously used lumped neural
classifiers (Swiercz, Grusza and Sobolewski, 1997).
Also, lumped ANN classifiers can classify chromatic visual evoked potentials
(Swihart and Matheny, 1992) and hence distinguish between the responses of
normal and color blind individuals. Subtle differences in the VER waveform,
which play an important role in the diagnosis of serious retinal pathologies,
have been recognized by such a system.
As already mentioned, Swiercz, Grusza and Sobolewski (Swiercz, Grusza and
Sobolewski, 1997) have applied lumped ANN classifiers to the VER classifica-
tion problem. A separate network was used for each of the classified disorders;
the network input was the wavelet decomposition coefficients of the VER wave-
forms (Strang and Nguyen, 1996; Thakor and Sherman, 1996) and the output
was the type of disorder. Various network architectures were tried, including
The best results Swiercz, Grusza and Sobolewski obtained with the above
networks will be reported a little later.
Finally, neural networks have been also used for diagnostic interpretation of
visual field data from PC-based video-campimeters (Mutlukan and Keating,
1994). High classification accuracies suggest that neural networks incorporated
into PC-based video-campimeters may enable correct interpretation of results
by non-specialists.
⁴The Ganzfeld sphere is a special construction, with an open front and a stroboscopic lamp
or checkerboard monitor at the opposite end. The patient places his or her head inside the
sphere; the visual field is limited only to a small angle (inside this sphere) and the patient is
exposed to the visual stimulation. The examination is performed in a dark room.
(Figures: credit function trajectories (vertical axis 0-1.2) over 251 time steps.)
CLASSIFICATION OF VISUALLY EVOKED RESPONSES 115
(Figures: credit function trajectories (vertical axis 0-1.2) over 251 time steps.)
that additional noise will be introduced to the data by individual habits of the
technician or doctor who collects the data. The clearest pictures appear for
the waveforms of healthy subjects and for sclerosis multiplex patients. Looking
at the curves for sclerosis multiplex, two distinct maxima can be seen; however
for some patients they are shifted in time by as much as 25 ms. For the healthy
subjects the shapes are also well defined, so the differences between the average
and the enveloping curves are smaller, both at the beginning and the end of
the waveform. However the latencies (the time elapsed to the main positive
peak) for individual curves differ significantly. Probably the maxima for optic
neuritis are better concentrated in time and the variations in the first part of
the curve (before it reaches maximum) are smaller. Analyzing chiasmal optic
neuropathy one can observe rather flat average curves at the beginning, but the
variations for individual curves are quite substantial. The differences in local
values and global parameters between the curves belonging to the same classes
make this a quite difficult classification task.
For further processing, the VER waveforms were smoothed using wavelet ap-
proximation. A fourth level decomposition by discrete fourth order Daubechies
filters was employed and the coarse approximation of the signal was used for
classification purposes 5. This method of data preprocessing removed the tiny
fluctuations of the waveforms which could result in deteriorated prediction ac-
curacy.
7.3.2 Predictors
Seven predictors were built: one for healthy subjects, one for each of the five
classified disorders and one for non-classified disorders. The predictors were
neural networks with two hidden layers. While the number of neurons in each
hidden layer varies, generally 3 or 4 sigmoid neurons were used in the first layer
and 2 to 4 sigmoid neurons in the second layer. The inputs used were y_{t-1},
y_{t-2}, ..., y_{t-M}. The output layer used one linear neuron and the target output
was y_t. Prediction quality did not depend significantly on network architecture,
i.e. number of neurons and the type of sigmoid activation function (logistic or
hyperbolic tangent).
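The predictor architecture just described can be sketched as follows. This shows only the forward pass: the weights below are random placeholders (the actual networks were trained offline in MATLAB), and the layer sizes merely match the ranges quoted in the text.

```python
import math, random

def make_predictor(M, h1=4, h2=3, seed=0):
    """Architecture sketch of one VER predictor: inputs y_{t-1}..y_{t-M},
    two sigmoid hidden layers (h1, h2 neurons), one linear output neuron.
    The weights are random placeholders, not trained values."""
    rng = random.Random(seed)
    def layer(n_out, n_in):
        # each neuron: (list of input weights, bias)
        return [([rng.gauss(0, 1) for _ in range(n_in)], rng.gauss(0, 1))
                for _ in range(n_out)]
    L1, L2, L3 = layer(h1, M), layer(h2, h1), layer(1, h2)
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    def dense(neurons, x, act):
        return [act(sum(w * xi for w, xi in zip(ws, x)) + b) for ws, b in neurons]
    def predict(y_lags):                       # y_lags = [y_{t-1}, ..., y_{t-M}]
        a1 = dense(L1, y_lags, sigmoid)        # first sigmoid hidden layer
        a2 = dense(L2, a1, sigmoid)            # second sigmoid hidden layer
        return dense(L3, a2, lambda z: z)[0]   # linear output: prediction of y_t
    return predict

predict = make_predictor(M=5)
y_hat = predict([0.8, 0.75, 0.7, 0.72, 0.74])
```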
70% of the data were randomly selected and used for off-line training of
the predictors; the MATLAB 4.2 environment and the Neural Networks Tool-
box training routines (Demuth and Beale, 1994) were used. The disorders are
indexed by i = 1, 2, ..., 7 (healthy state, five classified diseases and a class of
unclassified disorders). The neural network prediction is denoted by ŷ_t and the
total number of cases in each class by N_i. The average square prediction error
of each class is E_i, defined by
E_i = (1 / ((256 − M) · N_i)) · Σ_{j=1}^{N_i} Σ_{t=M+1}^{256} (y_t − ŷ_t)².
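The per-class error E_i can be computed as follows; the function is written for an arbitrary waveform length T (the text uses T = 256), and the toy waveforms below are hypothetical, kept short only to make the example small.

```python
def class_avg_sq_error(waveforms, predictions, M):
    """E_i: average square one-step prediction error over the N_i waveforms
    of class i, skipping the first M lag samples of each waveform."""
    N_i, T = len(waveforms), len(waveforms[0])
    total = sum((y[t] - y_hat[t]) ** 2
                for y, y_hat in zip(waveforms, predictions)
                for t in range(M, T))
    return total / ((T - M) * N_i)

# Two toy 6-sample "waveforms" whose predictions are off by 0.1 everywhere
ys = [[0.6, 0.7, 0.8, 0.9, 0.8, 0.7], [0.6, 0.7, 0.8, 0.9, 0.8, 0.7]]
yh = [[v + 0.1 for v in w] for w in ys]
E = class_avg_sq_error(ys, yh, M=2)
```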
5See the Matlab Wavelet Toolbox (Demuth and Beale, 1994) for details.
For all classes Ei is around 0.002; hence the average absolute error is equal to
about 0.045. Since the VER samples have values in the range 0.6 to 1.0, the
average absolute prediction error is about 5% of the signal.
It must be noted that Swiercz, Grusza and Sobolewski deliberately chose to
use neural predictors of a small size (i.e. with few neurons and connections),
which resulted in short training times but also in relatively high prediction
error. As already discussed, PREMONN robustness to noise allows for correct
classification even in the presence of high prediction error.
7.4 RESULTS
A total of almost seventy-five experiments were performed using the PREMONN architecture, with the aim of determining the following.
1997)). It must be noted that the lumped classifiers could not handle success-
fully as many disorders as the PREMONN classifiers; hence certain entries in
the last column of Table 7.1 are left blank.
Classification accuracy for a particular disorder is the number of cases of
this disorder which were correctly identified (by the classifier) as belonging
to this disorder divided by the number of cases (belonging to this disorder)
which are present in the test set. The two last rows of the table show the
average classification accuracy results, weighted by the number of cases of each
disorder. We present two such averages. The row marked "weighted average 1" shows averages over the disorders to which both methods have been applied. The row marked "weighted average 2" shows the average over all disorders; this row has entries only for the PREMONN classifiers, since the lumped neural classifier has only been applied to four out of the seven classes.
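The weighted averaging just described can be made concrete as follows (the counts in the usage example are hypothetical, not the values of Table 7.1):

```python
def weighted_average_accuracy(correct, totals):
    """Per-disorder accuracy = correctly identified cases / test cases of
    that disorder; the overall figure is the average of these accuracies,
    weighted by the number of test cases of each disorder."""
    accuracies = [c / n for c, n in zip(correct, totals)]
    return sum(a * n for a, n in zip(accuracies, totals)) / sum(totals)
```

Note that weighting by case counts makes the result equal to the overall fraction of correctly classified cases, sum(correct)/sum(totals).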
It can be seen that the PREMONN classifiers outperform the lumped FF
neural one for every disorder in which a comparison is possible, except for
optic neuritis. In addition, the overall average performance of the PREMONN
classifiers is better than that of the lumped classifier. Finally, the PREMONN
classifiers have been applied to a larger set of disorders. In particular, it would
be extremely difficult to apply the FF neural classifier to the "non-classified"
disorders. The credit function profiles for two representative experiments are
presented in Figures 7.9 and 7.10.
Figure 7.9 corresponds to the VER time series collected from a subject with
oedema of the optic nerve and Figure 7.10 corresponds to the VER time series
collected from a subject with optic neuritis.
7.5 CONCLUSIONS
The results indicate that PREMONN is an efficient tool for the analysis of
VER patterns. The classification accuracy was higher for PREMONN than for
Figure 7.9. Credit function profile for a subject with oedema of the optic nerve. [Plot of the credit functions against time steps 1-251.]
CLASSIFICATION OF VISUALLY EVOKED RESPONSES 121
Figure 7.10. Credit function profile for a subject with optic neuritis. [Plot of the credit functions against time steps 1-251.]
lumped classifiers trained separately on preprocessed VER data for all classes
of disorders, except optic neuritis. Classification accuracy averaged over all
cases is also higher for PREMONN. Finally the PREMONN classifier was ap-
plied to a wider class of disorders. While it could theoretically be possible to
train and "tune" very carefully a lumped ANN architecture to classify a single
disorder with better accuracy than PREMONN does, the predictive approach
gives significantly better results for the full set of disorders classified at the
same time.
Such reliability of classification, obtained simultaneously for all classes of disorders, is very promising and indicates that PREMONNs can be successfully used at the first stage of diagnosis of major ophthalmological disorders. They may be regarded as a key element of an expert system supporting the doctor's decision on referring the patient to more sophisticated and more expensive diagnostic methods.
An important feature of PREMONN classification, which emerges from this
application is noise robustness. As has already been discussed, what matters
in PREMONN classification is not the absolute predictive accuracy but the
relative one; in other words even if a predictor predicts poorly, classification
will be accurate as long as the predictor predicts better than its competitors.
It can be seen from Table 7.1 that the lowest classification accuracy is
observed in the category of unclassified disorders. We find this particularly in-
teresting, because this is not really a single separate category; rather it contains
all cases which cannot be further classified. It would be interesting to consider
8 PREDICTION OF SHORT TERM ELECTRIC LOADS
8.1 INTRODUCTION
Short term load forecasting refers to the prediction of hourly electric loads in a
power system. Generally, predictions must be made one day ahead of time; for
instance every evening predictions must be made of 24 values, corresponding
to the electric loads of every hour of the next day. Accurate predictions are
required so that the operation of the power system generators can be sched-
uled for the next day and the security of the system (probability of failure to
satisfy power requirements) can be assessed. Hence, the formulation of eco-
nomic, reliable and secure operating strategies requires accurate short term
load predictions.
Electric loads are influenced by a variety of factors; for instance previous
loads, weather and temperature conditions, the day of the week for which fore-
1 We prefer to use the term "prediction", following our usage in the rest of this book; however
the problem discussed in this chapter has traditionally been described as a "forecasting"
problem. The terms "forecasting" and "prediction" may be considered to be equivalent.
2The work reported here was carried out jointly by us and A. Bakirtzis and S. Kiartzis, both
of the electrical and computer engineering department, Aristotle University of Thessaloniki;
we want to thank them for allowing us to present this work here.
casts are required (for instance loads are lighter during weekends) and so on.
System operators have an intuitive appreciation of such factors and are able to
apply expert knowledge to scheduling power generation. However, because of
the economic importance of solving the STLF problem, more formal approaches
have been attempted and a large number of computational techniques have been
applied. Statistical models, expert systems, artificial neural networks and hy-
brid fuzzy neural networks are some of the approaches that have been tried.
No completely satisfactory solution to the problem has been found; since the
improvement of prediction accuracy by even a fraction of one percent results
in very significant savings the STLF problem is the subject of intense research.
[Figure: load versus hour of the day.]
8.3.2 Predictors
Three STLF predictors were developed, each using a different motivation. We
call these lumped predictors, to distinguish them from the "combined" PRE-
MONN predictor. Let us consider the characteristics of each lumped predictor.
3 Actually, we used data up to the year 1994, when this study was conducted.
PREDICTION OF SHORT TERM ELECTRIC LOADS 127
[Figure: load versus day of the year.]
E = (1 / (24 · T)) · Σ_{t=1}^{T} Σ_{m=1}^{24} |y_{m,t} − ŷ_{m,t}| / y_{m,t},
in other words it is the ratio of prediction error divided by the actual load, and
averaged over all days and hours of the training set; this turned out to be 2.30%.
We observed a "ceiling" effect regarding the possible reduction of forecast error: while the training error could be reduced below 2.30% by the introduction of more regression coefficients, this improvement was not reflected in the test error.
This is a typical case of overfitting.
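The error measure E defined above can be computed directly; a sketch, with the load matrices assumed stored as days × hours:

```python
import numpy as np

def average_relative_error(Y, Y_hat):
    """E = (1/(24*T)) * sum_{t=1..T} sum_{m=1..24} |y_{m,t} - yhat_{m,t}| / y_{m,t}.

    Y, Y_hat: arrays of shape (T, 24) -- T days of actual and forecasted
    hourly loads. Returned as a fraction (multiply by 100 for the percent
    figures quoted in the text)."""
    Y = np.asarray(Y, dtype=float)
    Y_hat = np.asarray(Y_hat, dtype=float)
    return float((np.abs(Y - Y_hat) / Y).mean())
```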
Neural Predictor. The final lumped predictor used was a fully connected
feedforward neural network, with sigmoid neurons and one hidden layer. The
neural network comprised 57 input neurons, 24 hidden neurons and 24 output neurons representing the next day's 24 hourly forecasted loads. The first 48 inputs
represent past hourly load data for today and yesterday. Inputs 49-50 are
maximum and minimum daily temperatures for today. The last seven inputs,
51-57, represent the day of the week; for instance Mondays are encoded by
setting input no.51 equal to one and inputs 52 to 57 equal to zero. Other
input variables were also tested but they did not improve the performance
of the predictor. The neural network was trained by minimizing the total
squared prediction error, using an incremental back propagation algorithm;
i.e. input/output patterns were presented until the average error between the
desired and the actual outputs of the neural network over all training patterns
became less than a predefined threshold. Once again, "irregular days" were removed from the training data. The training data set consisted of 90·4 + 30 = 390 input/output patterns created from the current year's and the four past years' historical data, as follows: 90 patterns were created for the 90 days of the current year prior to the forecast day. For every one of the 4 previous years, another 30 patterns were created around the dates of the previous years
that correspond to the current year forecast day. After an initial offline training
phase was completed, the neural network parameters were updated daily (using
the day's incoming data) for an additional one month period. The network was
trained continuously, until the average training error became less than 2.5%.
It was observed that further training of the network (for example to a training
error threshold equal to 1.5% ) did not improve the accuracy of the prediction
on the validation data. We believe this is also evidence of overfitting.
where m = 1,2, ... ,24 corresponds to the hour of the day. We also developed 24
distinct PREMONN combined predictors. The m-th predictor combined the
output of the three lumped predictors by the formula
ŷ*_{m,t} = p^1_{m,t} · ŷ^1_{m,t} + p^2_{m,t} · ŷ^2_{m,t} + p^3_{m,t} · ŷ^3_{m,t},
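The combination formula, together with the Gaussian-error Bayesian credit update used by PREMONN in Part I, can be sketched as follows (sigma is a design parameter, chosen here arbitrarily):

```python
import numpy as np

def update_credits(credits, errors, sigma=1.0):
    """One step of the Bayesian credit (posterior probability) update with
    a Gaussian error model: p_i <- p_i * exp(-e_i^2 / (2 sigma^2)),
    renormalized so the credits sum to one."""
    likelihood = np.exp(-np.asarray(errors, dtype=float) ** 2
                        / (2.0 * sigma ** 2))
    posterior = np.asarray(credits, dtype=float) * likelihood
    return posterior / posterior.sum()

def combine(credits, predictions):
    """Credit-weighted combined forecast: y* = sum_i p_i * yhat_i."""
    return float(np.dot(credits, predictions))
```

A predictor with small recent errors accumulates credit and dominates the combined forecast, which is exactly the tracking behavior discussed in the results below.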
8.4 RESULTS
Let us now compare the performance of the PREMONN predictor with the
three lumped predictors. Twenty four separate cases must be considered, and
these are listed in Table 8.1. The results presented in this table correspond to
the period from April to June 1994. In particular, we present prediction errors
(averaged over this period) for the four types of predictors used, and for the 24
hours of the day.
The reader can observe that for most hours of the day, and on average performance, the PREMONN outperforms all lumped predictors; even in cases where a lumped predictor outperforms PREMONN, the difference is small. The main
point, however, is that, for a given hour of the day, it cannot be known in
advance for which time period a particular predictor will yield the best predic-
tion; it is exactly the role of PREMONN to track predictor performance online
and to select the best performing predictor. These points can be appreciated
by considering Figure 8.3, which presents a comparative plot of true loads and
forecasts for a representative day.
An even better understanding of the above point can be reached by looking
at the evolution of credit functions. Consider for example Fig. 8.4, where the evolution of posterior probabilities for the predictors of 1pm load is plotted for the period July 1st, 1994 to September 30th, 1994. Similarly, in Fig. 8.5 the evolution of posterior probabilities is plotted for the predictors of 1am load, over the same time period.
The reader will observe that in Fig. 8.4 the highest credit is generally as-
signed to the LP LR predictor, even though over short time intervals one of the
Table 8.1. Hourly average relative errors for June to September 1994 (error is expressed
in percent units).
other two predictors may outperform it. Similarly, in Fig. 8.5 the highest credit is generally assigned to the SP LR predictor, even though over short time intervals one of the other two predictors may outperform it. These results are consistent with the overall test errors of Table 8.1; the additional information presented in Figs. 8.4 and 8.5 is that a predictor which generally performs poorly may still outperform its competitors over short time intervals; in such cases the PREMONN will take this improved performance into account, as evidenced by the adaptively changing posterior probabilities. This explains why
the PREMONN is generally better than the best lumped predictor.
Finally, it is quite instructive to compare average total errors for training
and test data as given for various values of the total number of regression coef-
ficients. These are presented in Table 8.2. The rows in bold letters correspond
to the lumped predictors actually used in the PREMONN combination.
Figure 8.3. A typical daily load curve and the lumped and combined predictions.
Table 8.2. Dependence of relative prediction error on number of parameters (error is given
in percent units and training time is given in seconds).
Figure 8.4. Evolution of posterior probabilities for the predictors of 1pm load is plotted
for the period July 1st, 1994 to September 30th, 1994. The solid line corresponds to the
credit of the "Long Past" linear predictor, the dotted line to the credit of the "Short Past"
linear predictor and the dashed line to the credit of the neural predictor.
The reader can see that an increase in the number of regression coefficients yields improved training errors, but test errors remain the same or even increase. This is an instance of overfitting. On the other hand, the PREMONN also uses an increased number of coefficients, namely the sum of the numbers of coefficients of the three predictors. In our case this would be 1992+2400+1200 = 5592.
While we have not tried to train any lumped predictor with 5592 free regres-
sion coefficients, extrapolating from Table 8.2, one expects the test error to
be actually larger than that of any lumped predictor with fewer coefficients.
However, PREMONN increases the number of coefficients in a judicious and
structured way, resulting in the marked decrease of test error to 2.07%. Simi-
larly, training time scales very efficiently for the PREMONN. It is equal to the
total time for training the three lumped predictors, which on a 66 MHz 486
PC was 1.38+0.86+2.35 = 4.59 seconds. To compute the time for training one lumped predictor with 5592 coefficients, one could use the data in Table 8.2 and, extrapolating linearly, obtain an expected training time between 7 and 25 seconds, i.e. roughly 1.5 to 5.5 times the PREMONN training time.
In fact, however, linear extrapolation is probably too optimistic. For the LR
predictors, it is known that matrix inversion time scales cubically with the size
Figure 8.5. Evolution of posterior probabilities for the predictors of 1am load, plotted for the period July 1st, 1994 to September 30th, 1994. The solid line corresponds to the credit of the "Long Past" linear predictor, the dotted line to the credit of the "Short Past" linear predictor and the dashed line to the credit of the neural predictor.
of the problem; as for the neural network predictor, increasing the size of the
network may result in either exponentially long training time or, even worse,
complete failure of the training procedure (e.g. entrapment at local minima).
8.5 CONCLUSIONS
The results indicate that PREMONN predictor combination outperforms all
conventional, "lumped" prediction methods in the test problem we have con-
sidered and yields a significant decrease of STLF error, which may have serious
economic impact. The use of PREMONN enables us to pick the best fea-
tures of each lumped predictor in a dynamic and unsupervised manner. From
a somewhat different point of view, PREMONN can be seen as a judicious
and systematic method for combining a large number of regression coefficients,
avoiding overfitting problems.
9 PARAMETER ESTIMATION FOR AN ACTIVATED SLUDGE PROCESS
9.1 INTRODUCTION
While the method of activated sludge is very widely used for waste water treat-
ment, it appears that the process is so complex that it has not yet been accurately modeled. Nevertheless the so-called IAWPRC Model no.1 (Henze et al., 1983) is widely used by chemical engineers for computer simulation purposes.
Indeed it is stated in (Henze et al., 1983) that: "[Computer] Modeling is an
inherent part of the design of a wastewater treatment system, regardless of the
approach used.... [because] ... limitations of time and money prevent explo-
ration of all potentially feasible solutions." The final goal of using the model
is gaining a better understanding of the activated sludge process under various
operating conditions.
1This is joint work we have carried out with Manos Paterakis, doctoral candidate at the
Department of Electrical and Computer Engineering, Aristotle University; we want to thank
him for allowing us to incorporate this work here, as well as for his assistance in the writing
of this chapter.
[Figure: schematic of the activated sludge process. Waste water enters the aeration tank and is forwarded to the precipitation tank; a recirculation pump returns recirculated sludge, while surplus sludge is removed.]
The process can be briefly described as follows. The waste water is placed in an aeration tank and brought in contact with a microorganism-containing solution. The organic material comes in contact with the microorganisms and is removed from the liquid phase; then, hydrolytic enzymes are added to the solution which enable the microorganisms to metabolize the organic matter. The
mixture of organic matter and microorganisms is forwarded from the aeration
tank to a precipitation tank, where the microorganisms settle down and then
are recycled into the waste water treatment plant, while the processed waste
water is removed from the system.
Several complex chemical processes are involved in this procedure and con-
siderable effort has been devoted to modeling these. A major landmark in
modelling the activated sludge process was the introduction of the so-called
IAWPRC Model no.l (Henze et al., 1987) which is described by the following
nonlinear differential equations.
dS_NH/dt = −i_xb · M_H · (S / (S + k_S)) · (S_O / (S_O + k_OH)) · X_H − ···

dS_O/dt = −((1 − Y_H) / Y_H) · M_H · (S / (S + k_S)) · (S_O / (S_O + k_OH)) · X_H − ···

dS_1/dt = −(1 / Y_H) · M_H · (S_1 / (S_1 + k_S)) · (k_OH / (k_OH + S_O)) · (S_NO1 / (S_NO1 + k_NO)) · n_g · X_H1 + ···
It can be seen that the model is described by ten state variables: S, X_H, S_O, X_A, S_NH1, S_NH, S_NO, S_1, X_H1, S_NO1. Four of these states are observable,
namely: S (readily biodegradable substrate); XH (active heterotrophic bio-
mass); So (oxygen concentration); SNO (nitrate and nitrite nitrogen concen-
tration). Because of space limitations it is not possible to give an explanation
of the physical significance of the above equations and variables; the reader is
referred to (Henze et al., 1983) for details.
It can be seen that a large number of parameters are also involved in the
above equations. Of these parameters some are known a priori, some can be
estimated using experimental methods and, finally, some must be estimated algorithmically. Table 9.1 lists some important parameters of the latter category, their physical significance and the values we have used in our simulations. These values are provided in (Henze et al., 1987) and are customarily
used in studies of the model.
While the IAWPRC Model no.1 has been widely used as a reasonable model
of the activated sludge process, it is generally accepted that using it to describe
the operation of a particular waste water treatment plant is not an easy task.
One of the main difficulties is estimation of the model parameters. The values
appearing in Table 9.1 are simply "reasonable" values; parameter estimation
must be performed to obtain the values corresponding to a particular plant.
During the last decade, various methods have been applied to the parame-
ter estimation problem and considerable effort expended towards its solution.
Many methods have been used. In the original report (Henze et al., 1987)
various experimental methods have been proposed for measuring some of the
model parameters. Apart from the considerable effort required to perform the
necessary measurements, the obtained results may be fairly inaccurate. Hence
PARAMETER ESTIMATION FOR AND ACTIVATED SLUDGE PROCESS 139
other approaches have been tried, involving the use of various parameter estimation algorithms. For instance, (Jeppson and Olson, 1994) use extended
Kalman filtering to perform parameter estimation for a reduced model; (Ayesa
et al., 1994) also applies extended Kalman filtering in conjunction with a sen-
sitivity analysis; (von Sperling, 1994) uses multiple Monte Carlo simulations
of the model (with various parameter values) and then applies a classification
algorithm to separate satisfactory from nonsatisfactory (in terms of accuracy)
simulations; (Finnson, 1994) essentially applies a trial and error strategy in-
volving multiple simulations.
In all of the above cases, it is generally accepted that a very accurate estima-
tion of all parameters is not feasible. In fact, for some parameters estimation
errors of up to 50% appear to be acceptable; the general goal is to obtain pa-
rameter values which capture the qualitative behavior of the actual (in fact
computer simulated) system; of course the final goal is to obtain accurate esti-
mates, which will yield quantitatively correct results.
Taking θ = θ_0, where θ_0 is obtained by using the values listed in Table 9.1, and using eqs. (9.1)-(9.10), we obtain a particular instance of the activated sludge model; this is then discretized in time and simulated on the computer, resulting in a vector time series of system outputs: y_1, y_2, ..., y_T, where
The system is simulated for a real time interval of 20 days; the time discretization step used is T_i = 1.5 min and the sampling step T_s = 3 min. This results in a total number of T = 9600 observations. Representative time series of biomass X_H and oxygen concentration S_O are plotted in Figures 9.2 and 9.3.
On the other hand, picking any value of θ and running the discretized system for times t = 1, 2, ..., T, we obtain a new time series: y_1^θ, y_2^θ, ..., y_T^θ. Now we can write the cumulative squared output error as a function of θ:

J(θ) = Σ_{t=1}^{T} |y_t − y_t^θ|².
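Given two simulated output sequences, J(θ) is essentially a one-liner; a sketch, with the observations stored as a T × d array (d = 4 observable outputs here):

```python
import numpy as np

def cumulative_output_error(Y_true, Y_theta):
    """J(theta) = sum_{t=1..T} |y_t - y_t^theta|^2, with the observation
    sequences stored as (T, d) arrays (d observed outputs per sample)."""
    diff = np.asarray(Y_true, dtype=float) - np.asarray(Y_theta, dtype=float)
    return float((diff ** 2).sum())
```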
[Figures 9.2 and 9.3: representative time series of X_H and S_O, plotted against time steps (×10).]
9.4 RESULTS
We have run several parameter estimation experiments using the above setup.
Following the same noise model used in (Ayesa et al., 1994; von Sperling, 1994), the measurements are contaminated with multiplicative, white, zero-mean noise with Gaussian distribution. The noise level is characterized by the standard deviation σ_η of the noise; this is taken to be 0, 0.01, 0.03 and 0.05, resulting in four experiment groups. For every choice of noise level 50 experiments are run; the accuracy of the parameter estimates for every experiment is expressed by a relative error δ, defined as follows:
δ = (1/8) · Σ_{n=1}^{8} |ε_θ^(n)| / |θ_0^(n)|.

Here θ_0^(n) is the value of the n-th true parameter, while ε_θ^(n) is the n-th parameter error, i.e. the difference between the true n-th parameter and the one estimated at the conclusion of the algorithm.
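The relative error δ can be computed as follows (a sketch; the example parameter values in the test are hypothetical):

```python
import numpy as np

def relative_estimation_error(theta_true, theta_est):
    """delta = (1/8) * sum_{n=1..8} |eps_theta^(n)| / |theta_0^(n)|;
    written as a mean, so it generalizes to any number of parameters."""
    theta_true = np.asarray(theta_true, dtype=float)
    eps = theta_true - np.asarray(theta_est, dtype=float)
    return float(np.mean(np.abs(eps) / np.abs(theta_true)))
```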
For comparison purposes we have also attempted to solve the parameter es-
timation (error minimization) problem using a genetic algorithm with selection
probabilities proportional to the inverse of the mean square error (we call this an MSE genetic algorithm). In Table 9.2 we present the performance of the predictive modular genetic algorithm for four different noise levels (σ_η = 0.00, 0.01, 0.03 and 0.05) and of the standard (MSE) genetic algorithm with noise-free observations. Each row in the table shows the number of experiments (out of a total of fifty) for which the relative error δ was less than the indicated value.
It can be seen that the predictive modular genetic algorithm gives significantly
better results than the MSE genetic algorithm. In addition, this is true even
when we compare the performance of the predictive modular algorithm with
noisy observations to that of the standard one with noise free observations.
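A minimal MSE genetic algorithm of the kind used for comparison can be sketched as follows. A one-parameter logistic map stands in for the ten-state sludge model (purely an illustrative assumption), selection probabilities are proportional to 1/MSE, and elitism keeps the best individual; the predictive modular variant, which replaces the MSE fitness with a predictive criterion, is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(theta, T=200, y0=0.3):
    """Toy one-parameter system standing in for the sludge model:
    a logistic map y_{t+1} = theta * y_t * (1 - y_t)."""
    y = [y0]
    for _ in range(T - 1):
        y.append(theta * y[-1] * (1.0 - y[-1]))
    return np.array(y)

def mse(theta, y_true):
    return float(np.mean((simulate(theta, len(y_true)) - y_true) ** 2))

def mse_genetic_algorithm(y_true, pop_size=30, gens=40, lo=2.0, hi=3.0):
    """Genetic search with selection probabilities proportional to 1/MSE,
    Gaussian mutation, and elitism (the best individual always survives)."""
    pop = rng.uniform(lo, hi, pop_size)
    best = min(pop, key=lambda th: mse(th, y_true))
    for _ in range(gens):
        fitness = np.array([1.0 / (mse(th, y_true) + 1e-12) for th in pop])
        parents = rng.choice(pop, size=pop_size, p=fitness / fitness.sum())
        pop = np.clip(parents + rng.normal(0.0, 0.02, pop_size), lo, hi)
        pop[0] = best                       # elitism
        gen_best = min(pop, key=lambda th: mse(th, y_true))
        if mse(gen_best, y_true) < mse(best, y_true):
            best = gen_best
    return float(best)
```

Running it on data simulated with a known parameter recovers that parameter to within the mutation scale, which is the basic behavior the table quantifies for the full model.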
To better illustrate the performance of the predictive modular genetic algorithm, we also present histograms of the M_H estimates for four levels of noise (σ_η = 0.00, 0.01, 0.03, 0.05). These are presented in Figures 9.4 to 9.7.
In Figures 9.8 and 9.9 we present graphs of the S_O (oxygen concentration) time series obtained from the true and the estimated model (Fig. 9.8 corresponds
Figure 9.4. Histogram of M_H estimates (over all experiments at noise level σ_η = 0.00).
Figure 9.5. Histogram of M_H estimates (over all experiments at noise level σ_η = 0.01).
Figure 9.6. Histogram of M_H estimates (over all experiments at noise level σ_η = 0.03).
Figure 9.7. Histogram of M_H estimates (over all experiments at noise level σ_η = 0.05).
Table 9.2. Parameter estimation results for the predictive modular genetic algorithm and
(last column) for the standard genetic algorithm.
Figure 9.8. S_O time series obtained from the true and the estimated (at noise level 0.00) model.
Figure 9.9. S_O time series obtained from the true and the estimated (at noise level 0.05) model.
tion. Experiment duration depends on the noise level: more noisy experiments
take longer because the genetic algorithm requires more epochs to converge.
9.5 CONCLUSIONS
Estimation of the eight parameters of the IAWPRC Model no.1 is a hard problem. Hence the results which we have obtained are highly satisfactory, since
they provide parameter estimates which are quite accurate and capture both the
qualitative and quantitative behavior of the true system, even in the presence
of noise in the observations. In short, the predictive modular genetic para-
meter estimation algorithm works well on a challenging problem. It would be
worthwhile to test the algorithm on real (rather than simulated) data and check
whether the resulting parameter estimates capture the behavior of a real world
activated sludge process.
Part III Unknown Sources
10 SOURCE IDENTIFICATION
ALGORITHMS
In this chapter we explore the problem of black box time series identification
for the case of source switching. This amounts to unsupervised development of
models for a time series which is generated by a collection of alternately acti-
vated, initially unknown sources. We present two algorithms which accomplish
this task and present guidelines which can be used to develop variations of these
algorithms. A concept which is central to our presentation is data allocation.
Numerical experiments are presented to illustrate our point of view.
10.1 INTRODUCTION
Up to this point we have been mainly concerned with classification and pre-
diction of time series which are generated with known sources. However, in
Chapter 5 we have considered the identification problem, where one or more
models of the input/output behavior of the time series must be developed. In this part of the book we will concentrate on this problem. We will consider the case where initially no information at all is available regarding such input/output behavior. The only assumption we will make is that the time series is produced by more than one source.
Under the circumstances, our goal is to discover the number of sources involved in the generation of the time series and to develop a black box input/output model for each such source. We refer to this problem as source identification.
2. Their basic components are a data allocation phase and a predictor training
phase.
6. Predictors may be added as needed, until several well trained predictors are
obtained, one predictor corresponding to each active source.
Hence the proposed source identification algorithms are modular (since sev-
eral predictors are involved) and predictive (since data allocation depends on
a predictive criterion). In our presentation we assume that neural predictors
are used, but this is not a crucial point; according to previous remarks, any
convenient predictor model can be employed.
We consider predictor training to be a straightforward task and are mostly
concerned with the data allocation component. In other words, we believe
that if the training data are separated into groups, each group containing data
generated by a single source, it will be an easy matter to use one of the many
available neural network training algorithms so as to obtain a well trained
predictor for every source.
Hence the critical component of source identification is the data allocation
scheme; in the rest of this chapter we consider in detail two such schemes, one
implementing parallel data allocation and the other implementing serial data
allocation. These terms will be explained in detail later; for the time being it suffices to say that they stand at the two extremes of a spectrum which also contains various hybrid (partly serial, partly parallel) data allocation schemes. In
the next two chapters we will consider the convergence properties of data allo-
cation schemes and we will prove that, subject to certain reasonable conditions,
"correct" data allocation can be successfully performed.
As soon as the source identification phase is completed (i.e. as soon as
sufficiently well trained predictors become available) any of the PREMONN
classification algorithms can also be executed. Actually, the source identifica-
tion and time series classification (or prediction) algorithms will usually run in
parallel. Given the convergence properties of the identification algorithms, it
may be expected that well trained predictors will always be available to the
classification algorithm, excluding transient periods which are associated with
the activation of new sources. Hence it may be expected that the classifica-
tion (and prediction) algorithms will perform successfully, in accordance with
terized by the use of multiple models which are compared according to their
predictive performance. This approach is also followed in the data allocation
problem: data are allocated to predictors according to their predictive perfor-
mance.
To illustrate this point, consider a very simple example, which involves a time series generated by two sources, and two predictors, each modeling (imperfectly) one source. Now suppose that one of the two sources is activated and generates y_t, the next observation of the time series. Assume that the active source is the one corresponding to predictor no.1; then we can use y_t to retrain predictor no.1 and hence improve its predictive (modeling) accuracy. How can we test our assumption? Well, if predictor no.1 is "reasonably well trained", then we can expect the prediction error |y_t − ŷ_t^1| to be "small". Finally, there are at least two ways to make the term "small" operationally meaningful. We can compare |y_t − ŷ_t^1| either to a fixed number d or to |y_t − ŷ_t^2|, the error of the second predictor. In other words, two strategies can be followed for allocating each datum y_t to one of the two available predictors.
1. The errors can be compared to each other and Yt allocated to the predictor
with minimum prediction error:
If |y_t − ŷ_t^1| ≤ |y_t − ŷ_t^2| then y_t is allocated to pred. no.1;
if |y_t − ŷ_t^1| > |y_t − ŷ_t^2| then y_t is allocated to pred. no.2.
Because in this case the two predictor errors are used simultaneously, we
refer to this data allocation scheme (and its generalizations to the case of
more predictors) as parallel data allocation.
2. The errors can be compared, one at a time, to a threshold d and y_t allocated
to the first predictor with error less than the threshold:
If |y_t − ŷ_t^1| < d then y_t is allocated to pred. no.1;
else if |y_t − ŷ_t^2| < d then y_t is allocated to pred. no.2.
Because in this case the two predictor errors are used one at a time, we refer
to this data allocation scheme (and its generalizations to the case of more
predictors) as serial data allocation.
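The two strategies can be sketched in code. This is an illustrative sketch only; the function names and the serial rule's rejection behavior (returning None when no error is below d) are our assumptions:

```python
def allocate_parallel(y_t, yhat_1, yhat_2):
    """Parallel rule: compare both prediction errors at once and
    allocate y_t to the predictor with the smaller error."""
    e1 = abs(y_t - yhat_1)
    e2 = abs(y_t - yhat_2)
    return 1 if e1 <= e2 else 2

def allocate_serial(y_t, yhat_1, yhat_2, d):
    """Serial rule: examine the errors one at a time and allocate y_t
    to the first predictor whose error is below the threshold d."""
    if abs(y_t - yhat_1) < d:
        return 1
    if abs(y_t - yhat_2) < d:
        return 2
    return None  # neither error is below d: reject (or spawn a new predictor)

print(allocate_parallel(0.50, 0.45, 0.80))        # errors 0.05 vs 0.30 -> 1
print(allocate_serial(0.50, 0.90, 0.52, d=0.10))  # pred. no.1 fails, no.2 passes -> 2
```

With more predictors the parallel rule generalizes to an argmin over all errors, while the serial rule simply extends the chain of threshold tests.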
What can be said about the behavior of these data allocation schemes? In
particular, what can we expect in case the predictors are not well trained? The
answer is that either of the above data allocation strategies is self reinforcing:
even if initially the predictors are not well trained, eventually each predictor will
tend to collect data which "predominantly" originate from one source, rejecting
all other data.
We will attempt to justify the above claim in Section 10.3 and (more rigor-
ously) in Chapters 11 and 12. However, the basic idea should be clear at this
point. Let us again consider the case of two sources and two predictors. In both
the parallel and serial case, each predictor initially is not "specialized" in any
particular source. However, if one predictor happens to collect more data from
one source, as soon as it is trained on such data it will tend to accept more data
from the same source and reject data from the other source. This will result
in further specialization in the same source, which will lead to the predictor
collecting more data generated by it; at the same time, the other predictor
will start collecting data from the other source and hence start specializing in
it. Under appropriate conditions this process will be self reinforcing and hence
lead to complete specialization of each predictor in one source.
Schemes C and D are hybrids, lying between the purely parallel scheme A
and the purely serial scheme B. Since the labeling of predictors is arbitrary, the
above list essentially exhausts all possibilities involving three predictors. For
instance, a scheme comparing predictors 2 and (1,3) serially and then predictors
1 and 3 in parallel is essentially equivalent to scheme C. Hence, in the case of
three predictors, there are essentially four possible arrangements, which can be
illustrated graphically in Figures 10.1-10.4.
As the number of predictors increases, so does the number of possible com-
binations of serial and parallel comparisons, resulting in a multitude of hybrid
data allocation schemes. The possibilities are further increased if the options
of rejecting data and/or adding new predictors are also included. For instance,
an algorithm may be devised which adds a new predictor in case the smallest
error of all existing predictors is above the error threshold d.
Figure 10.1. Scheme A: a fully parallel architecture for data allocation to three predictors.
Figure 10.2. Scheme B: a fully serial architecture for data allocation to three predictors.
Figure 10.3. Scheme C: a hybrid (serial/parallel) architecture for data allocation to three
predictors.
Main Routine
If the size of the k-th training set is larger than Nc and Σ_i |E_t^i| > d
    Set K = K + 1
    Replace the k-th predictor by two identical copies of itself.
    Allocate all data of the replaced predictor to both new predictors.
End If
Next k
Next t
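The growing step of this routine can be rendered as a runnable sketch; the dict-based predictor representation and the error bookkeeping are our assumptions:

```python
import copy

def maybe_split(predictors, k, Nc, d):
    """Growing step: if the k-th training set holds more than Nc samples and the
    accumulated error exceeds d, replace predictor k by two identical copies
    (K <- K + 1); both copies inherit the replaced predictor's data."""
    p = predictors[k]
    if len(p["data"]) > Nc and sum(p["errors"]) > d:
        twin = copy.deepcopy(p)   # identical copy of the k-th predictor
        predictors.append(twin)   # the bank grows: K <- K + 1
        p["errors"] = []          # fresh error bookkeeping for both copies
        twin["errors"] = []
    return predictors

bank = [{"data": [0.1] * 6, "errors": [0.3, 0.4]}]
maybe_split(bank, 0, Nc=5, d=0.5)
print(len(bank))  # 2: the predictor was split into two identical copies
```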
the input/output behavior of source no. 2 at the same time. Hence, if source
no. 2 is activated at a later time, predictor no. 1 has a high likelihood of
accepting yt's generated by source no. 2. In this manner, we may obtain a
predictor which is a satisfactory input/output model of both sources, but we
will never be aware of the fact that two sources have been active. In case we
are interested in classification applications, this may be a serious problem.
So it appears that the success of the data allocation scheme will depend on
the similarity of input/output behavior of the active sources. It must be em-
phasized that this similarity is relative, not absolute. In particular, it depends
strongly on the type of predictors used. A predictor with rich structure and a
large number of parameters may be capable of simultaneously capturing the in-
put/output behavior of two fairly distinct sources; conversely a predictor with
few parameters may furnish a poor model of even two fairly similar sources.
This may be expressed more formally: it can be expected that the data allo-
cation will succeed when the predictor capacity is not much higher than the
source complexity.
Euclidean distance of each datum from each cluster's centroid. This is quite
similar to our approach, except that we compute "distance" from each cluster
through the use of prediction error. In k-means the centroids are periodically
recomputed using the new cluster members; this corresponds to the periodic
retraining of predictors, which our algorithms employ. Also, there are variants
of the k-means algorithm, for instance the ISODATA algorithm (Ball and Hall,
1965) which allow for splitting or merging clusters, similarly to the mecha-
nism we provide for adding new predictors. k-means is mainly related to the
parallel data allocation algorithm; however it is possible to set up a k-means
algorithm which assigns incoming data to clusters by serial comparisons; this
would correspond to our serial algorithm.
k-means is usually employed for the clustering of a fixed data set, rather
than for online clustering of an incoming data stream. This suggests that our
algorithm could also be used for offline tasks, involving fixed data sets. Con-
versely, there are online versions of k-means. A notable example is Kohonen's
self organizing maps (SOM), which can be interpreted as an online k-means
algorithm.
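The correspondence with k-means can be made concrete: substituting distance-to-centroid for prediction error turns the parallel rule into the standard nearest-centroid assignment, and the serial rule into a first-fit variant. A minimal sketch (illustrative names, one-dimensional data):

```python
def assign_parallel(x, centroids):
    """k-means style assignment: x joins the cluster with the nearest centroid.
    This mirrors the parallel rule, with distance playing the role of
    prediction error."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))

def assign_serial(x, centroids, d):
    """Serial variant: x joins the first cluster whose centroid is within d."""
    for i, c in enumerate(centroids):
        if abs(x - c) <= d:
            return i
    return None  # no cluster close enough: reject or open a new cluster

centroids = [0.0, 1.0]
print(assign_parallel(0.8, centroids))       # 1 (closer to centroid 1.0)
print(assign_serial(0.1, centroids, d=0.2))  # 0 (within d of centroid 0.0)
```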
If the hidden Markov model interpretation of our time series model is adopted
(as explained in Part I), then the source identification problem can be seen as a
joint state and parameter estimation task, with the state Zt taking values in a
discrete, finite set. This is essentially a nonlinear filtering problem. Because
the state space has no particular structure (for instance a notion of distance) it
is a rather hard problem. A possible method of solution (for a fixed data set)
appears in (Levin, 1993) and is essentially a version of the EM (Dempster et al.,
1977) algorithm. It is conceivable that some online HMM parameter estima-
tion algorithm may be modified into a form suitable for the source identification
problem; see for instance (Baldi and Chauvin, 1996).
Finally, there is an obvious connection with constructive and/or growing
algorithms which have appeared in the neural networks literature of the last
decade. Tree growing algorithms are especially relevant. We do not furnish
any bibliographic references at this point; the subject is treated in Chapter 13,
where abundant references are provided.
10.3.5 Implementation
In practical implementation of the parallel and serial algorithms, certain para-
meter values must be chosen carefully to optimize performance. We list some
of these parameters below and discuss some issues which are related to the
determination of their values.
1. N, the length of the data block. This is related to the switching rate of Z_t. Let
Ts denote the minimum number of time steps between two successive source
switchings. While Ts is unknown, we operate on the assumption of slow
switching, which means that Ts will be large compared to N. Since the N
data points included in a block will all be assigned to the same predictor,
it is obviously desirable that they have been generated by the same source.
In practice this cannot be guaranteed. In general, a small value of N will
increase the likelihood that most blocks contain data from a single source.
On the other hand, it has been found that small N leads to an essentially
random assignment of data points to sources, especially in the initial stages
of segmentation, when the predictors have not specialized sufficiently. The
converse situation holds for large N. In practice, one needs to guess a value
for Ts and then take N somewhere between 1 and Ts. This choice is consis-
tent with the hypothesis of slow switching rate. The practical result is that
most blocks contain data from exactly one source, and a few blocks contain
data from two sources. It should be stressed that the exact value of Ts is
not required to be known; a rough guess suffices.
2. L, the retraining period. If this is too large, then retraining requires a long
time. If it is too small, then not enough data are available for the retraining
of the predictors, which may result in overfitting, especially in the early
stages of the algorithm. Of course, if N is relatively large, meaning that
each data block contains relatively many data, then L can also be small,
since L counts data blocks, rather than isolated data points.
3. J, the number of training iterations. This should be taken relatively small,
since the predictors must not be overspecialized in the early phases of the
algorithm, when relatively few data points are available. The choice of J is
closely connected to that of L; if L is relatively small (frequent retraining)
then J can be small, too; i.e. it may be preferable to retrain the predictors
often and by small increments.
4. Finally, there are the growing parameters: Nc and d in the case of the parallel
algorithm and d in the case of the serial algorithm. As already remarked,
we have no specific recommendations to make regarding the choice of these
parameters, but we have found that, within reasonable bounds, their exact
values are not crucial to the performance of the algorithms.
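To make the roles of N, L and J concrete, here is a schematic sketch of the block-based loop; the dict-based predictor representation and the `predict`/`retrain` callables are our own placeholders, not the book's implementation:

```python
def run_allocation(series, predictors, N=10, L=100, J=5):
    """Block-based parallel allocation.  All N points of a block go to the
    predictor with the smallest total prediction error on that block; every
    L blocks each predictor is retrained for J iterations on its own data."""
    blocks_seen = 0
    for start in range(1, len(series) - N + 1, N):
        prev = series[start - 1:start - 1 + N]   # inputs y_{t-1}
        block = series[start:start + N]          # targets y_t
        errors = [sum(abs(y - p["predict"](x)) for x, y in zip(prev, block))
                  for p in predictors]
        k = min(range(len(predictors)), key=lambda i: errors[i])  # parallel rule
        predictors[k]["data"].extend(block)
        blocks_seen += 1
        if blocks_seen % L == 0:                 # retrain every L blocks ...
            for p in predictors:
                p["retrain"](p["data"], J)       # ... for J iterations each

# toy demo: a pure logistic-map series is collected by the logistic predictor
make = lambda f: {"predict": f, "retrain": lambda data, J: None, "data": []}
p_log = make(lambda x: 4 * x * (1 - x))
p_tent = make(lambda x: 2 * x if x < 0.5 else 2 * (1 - x))
series = [0.3]
for _ in range(40):
    series.append(4 * series[-1] * (1 - series[-1]))
run_allocation(series, [p_log, p_tent], N=5, L=2, J=1)
print(len(p_log["data"]), len(p_tent["data"]))  # all blocks go to the logistic predictor
```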
10.4 EXPERIMENTS
In this section we present three groups of data allocation experiments which
were used to evaluate the performance of the parallel and serial data allocation
schemes.
y_t = f_{Z_t}(y_{t−1});
in other words the time series is generated by functions f_1(·), f_2(·). Specifically,
we have
f_1(x) = 4x·(1 − x)   (a logistic function);
f_2(x) = 2x if x ∈ [0, 0.5);  f_2(x) = 2·(1 − x) if x ∈ [0.5, 1]   (a tent-map function).
The two sources are activated consecutively, each for 200 time steps, resulting
in a period of 400 time steps. The data allocation task consists in discovering
that two sources are active and separating the data y_1, y_2, ... into two groups,
one group corresponding to each source. 200 time steps of the composite time
series are presented in Figure 10.5.
This particular segment of the time series includes a source switching. The
reader may be interested in guessing where this source switching takes place,
by looking at the Yt values (the times shown in the graph are not the real ones).
The answer is given in footnote 2, at the end of the chapter.
A number of experiments are performed using the time series described
above, observed at various levels of noise, i.e. at every step Yt is mixed with ad-
ditive white noise uniformly distributed in the interval [-A/2, A/2]. Six values
of A are used: 0.00 (noise free case), 0.02, 0.04, 0.10, 0.14, 0.20. The predictors
used are 1-4-1 sigmoid neural networks which are trained using a Levenberg-
Marquardt algorithm¹; the algorithm parameters are taken to be as follows:
block length N = 10, retrain period L = 100 and J = 5, Nc = 500 and d = 0.1.
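The setup of this experiment group can be sketched in code. The following is our own illustrative reconstruction (the seed and the noise placement are assumptions; here noise is added to the observations while the underlying state evolves noiselessly):

```python
import random

def f1(x):                      # logistic map
    return 4 * x * (1 - x)

def f2(x):                      # tent map
    return 2 * x if x < 0.5 else 2 * (1 - x)

def generate(T, A=0.0, seed=0):
    """Composite time series: sources f1/f2 alternate every 200 steps
    (a 400-step period); observations carry uniform noise on [-A/2, A/2]."""
    rng = random.Random(seed)
    x, ys, zs = 0.3, [], []
    for t in range(T):
        z = 1 if (t // 200) % 2 == 0 else 2   # index of the active source
        x = f1(x) if z == 1 else f2(x)
        ys.append(x + rng.uniform(-A / 2, A / 2))
        zs.append(z)
    return ys, zs

ys, zs = generate(400, A=0.02)
print(zs[0], zs[199], zs[200], zs[399])  # 1 1 2 2: switch at t = 200
```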
Data allocation is performed using both parallel and serial data allocation
schemes. In every experiment performed, both schemes succeed in discovering
Figure 10.6. Average classification accuracy c as a function of noise level A.
the existence of two sources and proceed in allocating data to the two corre-
sponding predictors. Two quantities are of interest: the time Tc at which the
sources are discovered and the classification accuracy c after time Tc. Tc is
computed as follows: a running average of prediction errors is calculated for
every time t; if for every data allocation the predictor which receives the in-
coming sample Yt has prediction error less than one half that of the remaining
predictors, and if this condition holds for 50 consecutive data allocations (i.e.
for 500 observations), then it is assumed that all sources have been discov-
ered and all predictors specialized and Tc is set equal to the current t. Then
classification accuracy is computed for the 200 data blocks (2000 observations)
corresponding to times Tc + 1, Tc + 2, ..., Tc + 200; i.e. c is set equal to T/200
where T is the number of correctly classified data blocks.
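The Tc criterion can be sketched as follows; the record format (one entry per data allocation, holding the receiving predictor's running-average error and those of the remaining predictors) is our own framing:

```python
def detect_Tc(alloc_records, needed=50):
    """alloc_records: list of (winner_error, other_errors) per data allocation,
    using running-average prediction errors.  Tc is the first allocation index
    at which the winner's error has been below half of every other predictor's
    error for `needed` consecutive allocations."""
    streak = 0
    for t, (win, others) in enumerate(alloc_records):
        if all(win < 0.5 * e for e in others):
            streak += 1
            if streak == needed:
                return t
        else:
            streak = 0
    return None  # criterion never satisfied (as for the serial scheme at high noise)

records = [(0.4, [0.5])] * 10 + [(0.1, [0.5])] * 60
print(detect_Tc(records))  # 59: the 50th consecutive success
```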
For both parallel and serial data allocation, six experiments are performed
at every noise level and the resulting c and Tc values are averaged. The average
c is plotted as a function of noise level A in Figure 10.6 and the average Tc is
plotted as a function of noise level A in Figure 10.7.
It can be seen that both schemes perform very well at low to medium noise
levels. At high noise levels the parallel data allocation scheme still shows very
good performance; the serial scheme achieves a relatively low level of correct
classification and, in particular, fails to satisfy the Tc computation criterion
(hence the respective part of the graph is missing in Figure 10.7). On the other
hand, at low to middle noise levels, the serial scheme achieves classification
faster: the average values of Tc are lower than those of the parallel scheme.
Figure 10.7. Average convergence time Tc as a function of noise level A.
[Figures: actual source (solid) and assigned predictor (dotted) plotted against time steps (0-800), showing the specialization of each predictor in one source.]
in other words the time series is generated by functions f_1(·), f_2(·), f_3(·).
The first two functions are as described in the previous section, while f_3(·) =
f_1(f_1(·)) (i.e. a double logistic).
The three sources are activated consecutively, each for 200 time steps, re-
sulting in a period of 600 time steps. The data allocation task consists in
discovering that three sources are active and separating the data y_1, y_2, ... into
three groups, one group corresponding to each source. 200 time steps of the
composite time series are presented in Figure 10.11.
This particular segment of the time series includes a source switching. The
reader may be interested in guessing where this source switching takes place,
by looking at the Yt values (the times shown in the graph are not the real ones).
The answer is given in footnote 2, at the end of the chapter.
Once again, experiments are performed using the above time series, observed
at various levels of noise, i.e. at every step Yt is mixed with additive white noise
uniformly distributed in the interval [- A/2, A/2]. Six values of A are used: 0.00
(noise free case), 0.02, 0.04, 0.10, 0.14, 0.20. Block length N, predictor types,
training algorithm, and algorithm parameters are taken the same as in the
previous section.
Again both the parallel and serial data allocation schemes are used. In
every experiment performed (with the exception of serial data allocation at
noise levels A ≥ 0.14), both schemes succeed in discovering the existence of
three sources and proceed in allocating data to the corresponding predictors.
It should be noticed that in the case of parallel data allocation, there is an ini-
tial phase where data are allocated to two groups; after specialization in these
two groups (composite sources) takes place, two new predictors are introduced
for each group and the new incoming data are further allocated to four sub-
groups, two for each original group. However, it is easily established (using
a predictive error comparison criterion) that in one group the two subgroups
really correspond to one source, and so these subgroups are merged, resulting
in three final groups and predictors. The quantities c and Tc are computed as
in the previous section.
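The merge test mentioned above (deciding whether two subgroups really correspond to one source) can be sketched as follows; the book does not spell out the exact comparison, so the ratio test here is an assumption:

```python
def should_merge(err_own_a, err_cross_a, err_own_b, err_cross_b, ratio=1.5):
    """Merge two subgroups when each subgroup's predictor does almost as well
    on the other subgroup's data as on its own (predictive error comparison).
    The tolerance `ratio` is an assumed tuning constant."""
    return (err_cross_a <= ratio * err_own_a and
            err_cross_b <= ratio * err_own_b)

print(should_merge(0.05, 0.06, 0.04, 0.05))  # True  -> one source, merge
print(should_merge(0.05, 0.40, 0.04, 0.35))  # False -> two distinct sources
```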
Figure 10.12. Average classification accuracy c as a function of noise level A.
For both parallel and serial data allocation, six experiments are performed
at every noise level and the resulting c and Tc values are averaged. The average
c is plotted as a function of noise level A in Figure 10.12 and the average Tc is
plotted as a function of noise level A in Figure 10.13.
Again it can be seen that both schemes perform very well at low to medium
noise levels (where the serial scheme achieves faster classification) and the par-
allel data allocation scheme also has very good performance at high noise levels.
Figure 10.13. Average convergence time Tc as a function of noise level A.
Figure 10.14. A segment of the composite time series (experiment group C).
Once again, experiments are performed using the above time series, observed
at various levels of noise, i.e. at every step Yt is mixed with additive white noise
uniformly distributed in the interval [- A/2, A/2]. Six values of A are used: 0.00
(noise free case), 0.02, 0.04, 0.10, 0.14, 0.20. Block length N is equal to 10,
the predictors are 5-5-1 sigmoid neural networks. The training algorithm is the
same as in the previous sections; the algorithm parameters are taken to be as
follows: L = 100, J = 10, Nc = 100 and d = 0.05.
Once again both the parallel and serial data allocation schemes are used
and in every experiment both schemes succeed in discovering the existence of
three sources and allocating data to the corresponding predictors. Similarly to
experiment group B, in the case of parallel data allocation there is an initial
phase (first level data allocation) where data are allocated to two groups and a
second level phase, where two new predictors are introduced for each group. By
using the predictive error comparison criterion, it is established that the two
subgroups of one group really correspond to one source, and so these subgroups
are merged, resulting in three final groups and predictors. The quantities c and
Tc are computed as in the previous sections.
For both parallel and serial data allocation, six experiments are performed
at every noise level and the resulting c and Tc values are averaged. The average
c is plotted as a function of noise level A in Figure 10.15 and the average Tc
is plotted as a function of noise level A in Figure 10.16. Both data alloca-
tion schemes perform very well at all noise levels. The serial scheme achieves
considerably faster classification.
Figure 10.16. Average convergence time Tc as a function of noise level A.
It should be added that the computation requirements are quite modest for both
schemes; for instance in experiment group B, allocating 1000 observations takes
on the average 3 minutes of processing, for both the serial and parallel data
allocation schemes. This processing time corresponds to implementation using
MatLab 5, running on a 200 MHz Pentium II computer. Optimized C code
would undoubtedly result in much shorter execution times. Hence the above
schemes are suitable for online implementation (keep in mind that MATLAB
is an interpreted language)2.
2. Let us give here the answer regarding the source switching times in Figures 10.5, 10.11 and
10.14. In Figure 10.5 the switching time is t = 79; in Figure 10.11 the switching time is t =
52; finally, in Figure 10.14 the switching time is t = 51.
10.6 CONCLUSIONS
We have presented two online unsupervised PREMONN algorithms for source
identification. These algorithms, used in conjunction with a PREMONN clas-
sification or prediction algorithm can solve time series problems involving un-
known sources.
The two PREMONN source identification algorithms are quite similar. Both
of them employ a bank of (neural network) predictors and both consist of a
data allocation component and a predictor training component. Any neural
network training algorithm can be used to implement the predictor training
component. The crucial component is the one performing data allocation. The
first algorithm presented here uses "parallel" data allocation, i.e. incoming
data are allocated by comparing all prediction errors concurrently. The second
algorithm uses "serial" data allocation, i.e. prediction errors are examined one
at a time and an incoming datum is allocated to the first predictor with error
below a threshold.
Both algorithms produce "correct" data allocation and "well trained" predic-
tors, which reproduce accurately the input/output behavior of the time series
generating sources. Both algorithms are fast (the serial algorithm is faster than
the parallel one) and have light computation requirements. The parallel algo-
rithm is very robust to observation noise. All of the above facts have been
established by numerical experimentation.
Moreover, the source identification algorithms follow the general PREMONN
philosophy, in that they operate on the basis of predictive error and are mod-
ular: training and prediction are performed by independent modules, each of
which can be removed from the system, without affecting the properties of
the remaining modules. This results in short training and development times.
Finally, the algorithms have good convergence properties; this fact will be es-
tablished by mathematical analysis in the following two chapters.
11 CONVERGENCE OF PARALLEL
DATA ALLOCATION
and
V_t ≐ X_t − X_{t−1}.
It should be obvious that the above definitions immediately imply the following
relationships (for i, j = 1, 2)
N_t^{ij} = Σ_{s=1}^{t} M_s^{ij}
and
X_t = Σ_{s=1}^{t} V_s.
While N_t^{ij} and X_t are the primary variables, it will be more convenient to work
with M_t^{ij} and V_t. The following variables will also be useful for the convergence
analysis:
M_t^d ≐ M_t^{12} + M_t^{21},
N_t^d ≐ N_t^{12} + N_t^{21}.
Again, it is rather obvious that
N_t^d = Σ_{s=1}^{t} M_s^d.
V_t = 1 ⟺ M_t^d = 0;  V_t = −1 ⟺ M_t^d = 1.   (11.2)
In other words, Σ_{s=1}^{t} M_s^d counts the number of times when V_s = −1.
Finally, we will need the following processes
Nl: No. of times source no.l has been activated up to time t
Ni No. of times source no.2 has been activated up to time t
In other words,
and
1. As the specialization level increases to plus infinity (which means that either
predictor 1 has received a lot more data from source 1 than from source 2,
or that predictor 2 has received a lot more data from source 2 than from
source 1, or both):
It must be stressed that whether the above assumptions A1 and A2 are sat-
isfied depends on three factors: (a) the input/output behavior of the sources;
(b) the type of predictors used; (c) the training algorithm used. In short, as-
sumptions A1, A2 characterize the sources/predictors/training combination.
Given the above consideration and taking into account that X_t = Σ_{s=1}^{t} V_s,
and also eqs. (11.1), (11.2), we see that the specialization process is a species of
Figure 11.1. The specialization process is an inhomogeneous random walk on the integers.
Pr(lim_{t→∞} |X_t| = +∞) = 1,   (11.4)
Pr(lim_{t→∞} X_t = +∞) + Pr(lim_{t→∞} X_t = −∞) = 1.   (11.5)
X_t → +∞: Either predictor no.1 will accumulate a lot more source no.1 samples
than source no.2 samples, or predictor no.2 will accumulate a lot more source
no.2 samples than source no.1 samples, or both.
X_t → −∞: Either predictor no.1 will accumulate a lot more source no.2 samples
than source no.1 samples, or predictor no.2 will accumulate a lot more source
no.1 samples than source no.2 samples, or both.
The total probability that one of these two events will take place is one,
i.e. predictor no. 1 will certainly specialize in one of the two sources. Reverting
to the random walk interpretation, the particle will either wander off to plus
infinity, or will wander off to minus infinity; in either case, after a finite time
it will never fall below any given level of specialization.
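This random walk picture is easy to simulate. The position-dependent step probabilities below are only a qualitative stand-in for assumptions A1, A2 (the book derives the actual probabilities from the sources and predictors):

```python
import random

def specialization_walk(T, seed=1):
    """Inhomogeneous +/-1 random walk on the integers: the probability of
    stepping away from the origin grows with |x| (self-reinforcement), so
    the particle eventually wanders off to plus or minus infinity."""
    rng = random.Random(seed)
    x, path = 0, []
    for _ in range(T):
        if x == 0:
            step = rng.choice([-1, 1])               # symmetric at the origin
        else:
            p_out = 0.5 + min(0.45, 0.02 * abs(x))   # outward drift grows with |x|
            outward = 1 if x > 0 else -1
            step = outward if rng.random() < p_out else -outward
        x += step
        path.append(x)
    return path

path = specialization_walk(5000)
print(abs(path[-1]))  # the walk has wandered far from the origin
```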
Notice that Theorem 11.1 does not quite say that both predictors will spe-
cialize. But in fact, they will specialize, each in a different source and the
specialization is stronger than implied by Theorem 11.1. This is stated in the
next theorem.
1. If Pr(lim_{t→∞} X_t = +∞) > 0 then
Pr(lim_{t→∞} N_t^{21}/N_t^{11} = 0 | lim_{t→∞} X_t = +∞) = 1,   (11.6)
Pr(lim_{t→∞} N_t^{12}/N_t^{22} = 0 | lim_{t→∞} X_t = +∞) = 1.   (11.7)
2. If Pr(lim_{t→∞} X_t = −∞) > 0 then
Pr(lim_{t→∞} N_t^{11}/N_t^{21} = 0 | lim_{t→∞} X_t = −∞) = 1,   (11.8)
Pr(lim_{t→∞} N_t^{22}/N_t^{12} = 0 | lim_{t→∞} X_t = −∞) = 1.   (11.9)
Theorem 11.2 states that, with probability one, both predictors will special-
ize, one in each source and in a "strong" sense. For instance, if X_t → +∞,
then the proportion N_t^{21}/N_t^{11} (no. of source 2 samples divided by no. of source
1 samples assigned to predictor no.1) goes to zero; this means that "most" of
the samples on which predictor 1 was trained come from source 1 and, also,
that "most" of the time a sample of source 1 is assigned (classified) to the pre-
dictor which is specialized in this source. Hence we can identify source 1 with
predictor no.1. Furthermore the proportion N_t^{12}/N_t^{22} (no. of source 1 samples
divided by no. of source 2 samples assigned to predictor no.2) also goes to zero;
this means that "most" of the samples on which predictor no.2 was trained
come from source 2 and, also, that "most" of the time a sample of source 2 is
assigned (classified) to the predictor which is specialized in this source. Hence
we can identify source 2 with predictor no.2. A completely symmetric situation
holds when X_t → −∞, with predictor no.1 specializing in source no.2 and pre-
dictor no.2 specializing in source no.1. Since, by Theorem 11.1, X_t goes either
to +∞ or to −∞, it follows that specialization of both predictors (one in each
source) is guaranteed.
It must be stressed that for the conclusions of the above theorems to ma-
terialize, it is necessary that conditions A0, A1, A2 hold. Since the validity
of these conditions will depend not only on the behavior of the sources, but
also on the user's choice of predictors and training algorithm, it follows that
considerable skill is required to ensure actual convergence of the data allocation
scheme. Hence the above theorems are mostly of theoretical value, i.e. they
furnish conditions sufficient to ensure convergence. The actual enforcement of
these conditions is left to the user.
Using the form of eqs. (11.10), (11.11) it follows that the generation of the time
series can be described by an equation of the form
y_t = g_{Z_t}(y_{t−1}).   (11.12)
In other words, we can consider y_t to be produced by a new ensemble of two suc-
cessively activated sources, where source activation is denoted by the variable
Z_t, taking values in {1, 2}, and each of the two new sources is actually a com-
posite of simpler sources. Now it follows from eq. (11.12) that the two-sources
analysis presented in the previous section also applies to the many sources case,
as long as each of the sets S_1, S_2 is considered as a composite source. In partic-
ular, if predictor type, training algorithm and allocation threshold are selected
so that the partition/predictors/training algorithm/threshold combination
satisfies assumptions A0, A1, A2, then the parallel data allocation scheme
will be convergent in the sense of Theorem 11.2. Hence the incoming data will
be separated into two sets; one set will contain predominantly data generated
by the composite source no.1 and the other set will contain predominantly data
generated by the composite source no.2. To be more precise, if the variables
N_t^{ij} (i, j = 1, 2) have the meaning explained in the previous section, but with
respect to the composite sources no.1 and no.2, then with probability one we will
have either
lim_{t→∞} N_t^{21}/N_t^{11} = 0,  lim_{t→∞} N_t^{12}/N_t^{22} = 0,
or
lim_{t→∞} N_t^{11}/N_t^{21} = 0,  lim_{t→∞} N_t^{22}/N_t^{12} = 0.
Consider, to be specific, the first case. In this case, the proportion of source
no.2 generated data that is found in the training data of source no.1 goes to
zero. Suppose now that the parallel data allocation scheme is applied once
again, only to these data, using a new combination of predictors/training algo-
rithm. Suppose that there is a further partition of the source subset S_1 into
sets S_11, S_12. If conditions A0, A1, A2 hold true for this new combination of
composite sources, predictors and training algorithm, and given that data from
sources belonging to set S_2 will be contained in a vanishing proportion in the
training data, it follows from Theorem 11.2 that the training data will be fur-
ther separated into two subsets, one corresponding to each source subset S_11,
S_12. Of course, exactly the same argument applies to source set S_2 which will
be separated into subsets S_21, S_22, each with a corresponding training data
subset. This procedure continues until the original source set S is hierarchi-
cally partitioned into a number of sets for which no further partitions satisfying
conditions A0, A1, A2 are possible. By judicious choice of the predictors and
training algorithm it is possible to reduce the sets of the final partition to sin-
gletons, i.e. break down the original set S to K subsets of the form {k_1}, {k_2},
..., {k_K} where (k_1, k_2, ..., k_K) is a permutation of (1, 2, ..., K). In other
words, exactly one predictor corresponds to each subset/source.
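The hierarchical procedure just described can be sketched abstractly. In this illustration the two-way split (a mean split of scalar data) and the stopping test stand in for one convergent run of the parallel scheme and for the failure of conditions A0, A1, A2, respectively:

```python
def hierarchical_partition(data, can_split):
    """Recursively split `data` into two groups while `can_split` approves;
    each approved split stands for one convergent run of the parallel scheme."""
    if len(data) < 2:
        return [data]
    mean = sum(data) / len(data)
    left = [x for x in data if x < mean]     # stand-in for the two groups
    right = [x for x in data if x >= mean]   # produced by data allocation
    if not can_split(left, right):
        return [data]                        # no further admissible partition
    return (hierarchical_partition(left, can_split)
            + hierarchical_partition(right, can_split))

# split only while the two halves are clearly separated (assumed criterion)
ok = lambda a, b: bool(a) and bool(b) and (min(b) - max(a)) > 0.5
parts = hierarchical_partition([0.1, 0.2, 1.1, 1.2, 5.0, 5.1], ok)
print(parts)  # [[0.1, 0.2], [1.1, 1.2], [5.0, 5.1]]
```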
It must be pointed out that, for the above procedure to succeed, judicious
selection of the predictors and training algorithm is necessary at every stage of
the data allocation scheme. Some guidelines for fine tuning the data allocation
scheme have been presented in Section 10.3.5 of Chapter 10. As already remarked
for the two sources case, the value of the above convergence argument is mostly
theoretical, in pointing out conditions sufficient to ensure convergence.
11.3 CONCLUSIONS
In terms of the random walk interpretation of the specialization process, the particle will not oscillate around the
origin but will wander off either to plus or minus infinity.
Finally, let us note that while our data allocation scheme is based on pre-
dictive modular credit assignment, the convergence results presented here may
be applied in a more general context, encompassing static, as well as dynamic
(time series) classification problems. The generality of our conclusions follows
from the generality of the assumptions on which the data allocation analysis
rests.
2. At every time $t$ the particle will move only one step, either to the left or to the right.
3. The probability of moving to the left or to the right changes with the position of the particle.
Using the source activation probabilities $\pi_1$ and $\pi_2$ and the data allocation probabilities $a_n$ and $b_n$, we have obtained the transition probabilities of $X_t$. Our first goal is to establish that $X_t$ does not pass through any particular state infinitely often. Technically, this is expressed by saying that $X_t$ is a transient Markov chain. This result is established using the classical Theorem 11-11.B.1 and Lemma 11-11.B.2, which shows that the conclusions of Theorem 11-11.B.1 can be applied to the process $X_t$.
Using Lemma 11-11.B.2 we show that $X_t$ does not pass through any state infinitely often; this is the first conclusion of Theorem 11.1. Then it follows that $X_t$ must spend most of its time at either plus or minus infinity; this is the second conclusion of the theorem. Finally, this is used to prove that $X_t$ cannot oscillate between plus and minus infinity, since then it would have to pass through the intermediate states infinitely often! Hence it is concluded that either $X_t \to +\infty$ or $X_t \to -\infty$, which is the final part of Theorem 11.1.
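The dichotomy just described is easy to observe numerically. The following sketch simulates an inhomogeneous specialization walk; the step-probability schedule `p_right` is a hypothetical choice (not taken from the book), selected only so that each step reinforces the current specialization, in the spirit of conditions A1, A2.

```python
import random

def simulate_specialization(steps=20000, seed=0):
    """Simulate an inhomogeneous random walk X_t on the integers.

    p_right(n) is a hypothetical step probability: it tends to 1 as
    n -> +infinity and to 0 as n -> -infinity, so every step reinforces
    whichever specialization the particle has already acquired.
    """
    def p_right(n):
        return 0.5 + 0.5 * n / (abs(n) + 10)

    rng = random.Random(seed)
    x = 0
    for _ in range(steps):
        x += 1 if rng.random() < p_right(x) else -1
    return x
```

Running this for several seeds yields final states far from the origin; over many seeds both signs occur, but the walk does not keep oscillating around 0, which is exactly the behavior asserted by the theorem.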
Proof of Theorem 11.2
Theorem 11.2 describes the behavior of $N^{ij}_t$, $i, j = 1, 2$, i.e. it tells us how many samples each predictor collects from each source. At every time $t$, the $N^{ij}_t$ processes increase or remain unchanged (but cannot decrease) according to certain probabilities which depend on $X_t$, i.e. on the current specialization.
Rather than examining the $N^{ij}_t$, we work with the process $V_t$ and the associated process $M^d_t$. As already remarked, at every $t$, $V_t$ is either $-1$ or $1$. However, because the associated probabilities depend on $X_{t-1}$, which in turn depends on $V_{t-1}, V_{t-2}, \dots$, the random variables $V_1, V_2, \dots, V_t, \dots$ are not independent, and this renders an analysis of their behavior difficult. Rather than working with the $V_t$'s directly, we relate their behavior to that of an auxiliary process $\tilde V^m_1, \tilde V^m_2, \dots, \tilde V^m_t, \dots$. The random variables $\tilde V^m_1, \tilde V^m_2, \dots$ are constructed in such a manner that they are independent and take the values $-1$, $1$ with appropriate time-invariant probabilities. For instance, in case $X_t \to \infty$, $\tilde V^m_t$ is constructed in such a manner that we always have $\Pr(V_t = -1 \mid \text{for all } \tau \ge n,\ X_\tau \ge m) \le \Pr(\tilde V^m_t = -1)$; here $m$ and $n$ are appropriately selected.
It is easy to prove (this is shown in Lemma 11-11.C.1) that with probability one $\frac{\sum_{s=1}^{t}\tilde M^m_s}{t}$ goes to a quantity $\bar\gamma(m)$, which tends to zero as $m$ goes to infinity. Then, in Lemma 11-11.C.2 we show that the probability (conditioned on the event "for all $\tau \ge n$, $X_\tau \ge m$") of $\frac{\sum_{s=n+1}^{n+t} M^d_s}{t}$ exceeding any number is less than the probability of $\frac{\sum_{s=1}^{t}\tilde M^m_s}{t}$ exceeding the same number. Combining the results of Lemmas 11-11.C.1 and 11-11.C.2 we obtain Lemma 11-11.C.3: the probability (conditioned on the event "for all $\tau \ge n$, $X_\tau \ge m$") of $\frac{\sum_{s=n+1}^{n+t} M^d_s}{t}$ exceeding $2\bar\gamma(m)$ infinitely often is zero, for any $m$.
Now, using Lemma 11-11.C.3 we can prove Theorem 11.2. For instance, we show that the probability (conditional on $X_t$ going to plus infinity, i.e. either predictor no.1 specializing in source no.1, or predictor no.2 specializing in source no.2, or both) of all the following events is one.
1. $\frac{N^d_t}{t}$ goes to zero (i.e. the total number of "wrong" allocations is very small);
2. from (1) it follows that $\frac{N^{12}_t}{t}$ goes to zero (i.e. the total number of "wrong" allocations of the type source no.1 $\to$ predictor no.2 is very small);
3. from (1) it also follows that $\frac{N^{11}_t}{t}$ goes to $\pi_1$ (i.e. the total number of "correct" allocations of the type source no.1 $\to$ predictor no.1 is very large).
We repeat that all of the above events (and similar ones corresponding to predictor no.2) happen with probability one, conditional on $X_t$ tending to $+\infty$. Similar results are obtained for the case of $X_t$ tending to $-\infty$.
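The limiting ratios listed above can be checked empirically with a toy version of parallel data allocation. The allocation probability `g` below is a hypothetical stand-in for the credit-assignment probabilities $a_n$, $b_n$ (a correct allocation becomes more likely as the specialization level grows); `N[i][j]` counts samples from source $i{+}1$ allocated to predictor $j{+}1$.

```python
import random

def allocation_counts(steps=50000, pi1=0.6, seed=1):
    """Toy parallel data allocation: returns (N, x), where N[i][j] is
    the number of samples from source i+1 allocated to predictor j+1
    and x is the final specialization level."""
    def g(x):  # hypothetical prob. of a correct allocation at level x
        return 0.5 + 0.5 * x / (abs(x) + 10)

    rng = random.Random(seed)
    x = 0
    N = [[0, 0], [0, 0]]
    for _ in range(steps):
        src = 0 if rng.random() < pi1 else 1       # active source
        correct = rng.random() < g(x)              # allocation outcome
        pred = src if correct else 1 - src
        N[src][pred] += 1
        x += 1 if correct else -1
    return N, x
```

Conditional on the sign of the final $x$, the empirical ratios approximate the limits above: e.g. when $x \to +\infty$, $N[0][0]/t \approx \pi_1$ while $N[0][1]/t$ is small.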
$$u_n = \sum_{k \neq 0} p_{n,k}\, u_k, \qquad n \in Z - \{0\} \qquad (11.B.1)$$
1. $\forall n$ we have $0 \le u_n \le 1$;
$$r^{(0)}_m \equiv 1, \qquad r^{(t)}_m = \sum_{k \neq 0} q^{(t)}_{m,k}, \quad t = 1, 2, \dots \qquad (11.B.2)$$
An interpretation of $r^{(t)}_m$ will be given a little later. For the time being, note that for every $m, n \in Z - \{0\}$ we have $q^{(1)}_{m,n} = p_{m,n} = \Pr(X_t = n \mid X_{t-1} = m)$; it follows that $\sum_{k \neq 0} p_{m,k} \le 1$. Then, from eq.(11.B.2) it follows that
$$r^{(t+1)}_m = \sum_{n \neq 0} q^{(t+1)}_{m,n} = \sum_{n \neq 0} \sum_{k \neq 0} q^{(t)}_{m,k}\, p_{k,n} = \sum_{k \neq 0} q^{(t)}_{m,k} \sum_{n \neq 0} p_{k,n} \le \sum_{k \neq 0} q^{(t)}_{m,k} = r^{(t)}_m.$$
So we see that for every $m$ we have
$$1 = r^{(0)}_m \ge r^{(1)}_m \ge \dots \ge r^{(t)}_m \ge r^{(t+1)}_m \ge \dots \ge 0.$$
When $t \to \infty$, this bounded and decreasing sequence has a limit. For any $m$ in $Z - \{0\}$, define
$$r_m = \lim_{t \to \infty} r^{(t)}_m \ge 0;$$
note that for all $m$ in $Z - \{0\}$, and for $t = 0, 1, \dots$, we have $r^{(t)}_m \ge r_m$. Now, note that
$$r^{(t+1)}_m = \sum_{n \neq 0} q^{(t+1)}_{m,n} = \sum_{n \neq 0} \sum_{k \neq 0} p_{m,k}\, q^{(t)}_{k,n} = \sum_{k \neq 0} p_{m,k} \sum_{n \neq 0} q^{(t)}_{k,n} = \sum_{k \neq 0} p_{m,k}\, r^{(t)}_k. \qquad (11.B.3)$$
Take the limit as $t \to \infty$ in the above equation and interchange the order of limit and summation (using the Bounded Convergence Theorem). Then eq.(11.B.3) becomes
$$r_m = \sum_{k \neq 0} p_{m,k}\, r_k. \qquad (11.B.4)$$
Continuing in this manner, it is seen that $s_m \le r^{(t)}_m$ for all $m, t$, which implies
$$r_m = \lim_{t \to \infty} r^{(t)}_m \ge s_m. \qquad (11.B.6)$$
Also,
$$r^{(2)}_m = \sum_{k \neq 0} p_{m,k}\, r^{(1)}_k.$$
Continuing in this manner, for $t = 3, 4, \dots, T$, it is seen that for any $T$, $r^{(T)}_m = \Pr(X_{t+1} \neq 0,\ \dots,\ X_{t+T} \neq 0 \mid X_t = m)$; hence
$$r_m = \lim_{T \to \infty} r^{(T)}_m = \Pr(X_{t+1} \neq 0,\ X_{t+2} \neq 0,\ \dots \mid X_t = m).$$
In other words,
$$r_m = \Pr(X_t \text{ never reaching } 0 \mid X_t \text{ starting at } m).$$
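The limits $r_m$ can be approximated numerically by iterating the recursion $r^{(t+1)}_m = \sum_{k \neq 0} p_{m,k}\, r^{(t)}_k$ from $r^{(0)} \equiv 1$ on a truncated state space. The sketch below does this for a hypothetical nearest-neighbour kernel; for a homogeneous walk with up-probability $p > 1/2$ it recovers the classical gambler's-ruin value $r_m = 1 - ((1-p)/p)^m$.

```python
def never_hit_zero(p_right, M=200, iters=2000):
    """Approximate r_m = Pr(X never reaches 0 | X_0 = m) by iterating
    r^{(t+1)}_m = sum_{k != 0} p_{m,k} r^{(t)}_k on states -M..M
    (m != 0), starting from r^{(0)}_m = 1.  The walk is held at the
    truncation boundary +-M (a harmless approximation when the drift
    points away from 0)."""
    states = [m for m in range(-M, M + 1) if m != 0]
    r = {m: 1.0 for m in states}
    for _ in range(iters):
        new = {}
        for m in states:
            up = min(m + 1, M)
            down = max(m - 1, -M)
            pr = p_right(m)
            # stepping onto state 0 means "hitting 0" and contributes 0
            new[m] = pr * r.get(up, 0.0) + (1 - pr) * r.get(down, 0.0)
        r = new
    return r
```

For example, with constant up-probability $0.8$ above the origin (and the mirror-image below it), the iteration converges to $r_1 = 1 - 0.25 = 0.75$ and $r_2 = 1 - 0.25^2 = 0.9375$.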
Lemma 11-11.B.2 Suppose that conditions A0, A1, A2 hold and that the specialization process $X_t$ has transition probability matrix $P = [p_{m,n}]_{m,n \in Z}$. Then the system
$$u_n = \sum_{k \neq 0} p_{n,k}\, u_k, \qquad n \in Z - \{0\} \qquad (11.B.7)$$
has a nontrivial admissible solution.
$$(u_{n+1} - u_n) = \frac{p_{n,n-1}}{p_{n,n+1}}\,(u_n - u_{n-1}) \;\Longrightarrow\;
\begin{cases}
u_3 - u_2 = \dfrac{p_{2,1}}{p_{2,3}}\,(u_2 - u_1), \\[4pt]
u_4 - u_3 = \dfrac{p_{3,2}}{p_{3,4}}\,(u_3 - u_2) = \dfrac{p_{3,2}\,p_{2,1}}{p_{3,4}\,p_{2,3}}\,(u_2 - u_1), \\[4pt]
\quad\vdots \\
u_N - u_{N-1} = \dfrac{p_{N-1,N-2}\,p_{N-2,N-3}\cdots p_{3,2}\,p_{2,1}}{p_{N-1,N}\,p_{N-2,N-1}\cdots p_{3,4}\,p_{2,3}}\,(u_2 - u_1).
\end{cases}$$
Summing up,
$$u_N = u_2 + \left\{\sum_{n=3}^{N} \frac{p_{n-1,n-2}\,p_{n-2,n-3}\cdots p_{3,2}\,p_{2,1}}{p_{n-1,n}\,p_{n-2,n-1}\cdots p_{3,4}\,p_{2,3}}\right\}\cdot(u_2 - u_1). \qquad (11.B.12)$$
Choose any $u_1$ such that $0 < u_1 < 1$. Then, since $p_{1,0}, p_{1,2} > 0$, also
$$u_2 - u_1 > 0 \;\Longrightarrow\; u_2 > u_1 > 0.$$
Then, from eq.(11.B.12), for $N = 3, 4, \dots$ we also have $u_N > 0$. So a solution to eqs.(11.B.8), (11.B.9) has been obtained, which satisfies $u_N > 0$ for $N = 1, 2, \dots$. Now, if
$$\left\{\sum_{n=3}^{\infty} \frac{p_{n-1,n-2}\,p_{n-2,n-3}\cdots p_{3,2}\,p_{2,1}}{p_{n-1,n}\,p_{n-2,n-1}\cdots p_{3,4}\,p_{2,3}}\right\}\cdot(u_2 - u_1) < \infty, \qquad (11.B.13)$$
then the $u_N$'s are bounded and can be normalized, defining $u'_N \equiv u_N / \sup_{N'} u_{N'}$. It is evident that for $N = 1, 2, \dots$ the $u'_N$'s satisfy eqs.(11.B.8), (11.B.9) and $0 < u'_N \le 1$. So, it only needs to be shown that the inequality (11.B.13) will always be true if conditions A0, A1, A2 hold. To show this, note that
$$p_{n-1,n-2} = \pi_1\,(1 - a_{n-1}) + \pi_2\,(1 - b_{n-1}), \qquad p_{n-1,n} = \pi_1\, a_{n-1} + \pi_2\, b_{n-1}.$$
If we define
$$h(n) \equiv \frac{p_{n-1,n-2}}{p_{n-1,n}},$$
then, since $\lim_{n \to \infty} a_n = 1$ and $\lim_{n \to \infty} b_n = 1$, it follows that $\lim_{n \to \infty} h(n) = 0$. Hence for any $0 < \rho < 1$ there is some $n_0$ such that for all $n \ge n_0$ we have $h(n) < \rho$. Then we can write
$$\sum_{n=3}^{\infty} \frac{p_{n-1,n-2}\,p_{n-2,n-3}\cdots p_{3,2}\,p_{2,1}}{p_{n-1,n}\,p_{n-2,n-1}\cdots p_{3,4}\,p_{2,3}} = G(n_0) + H(n_0)\cdot\sum_{n=n_0}^{\infty} \frac{p_{n-1,n-2}\,p_{n-2,n-3}\cdots p_{n_0-1,n_0-2}}{p_{n-1,n}\,p_{n-2,n-1}\cdots p_{n_0-1,n_0}}, \qquad (11.B.15)$$
where $G(n_0)$ and $H(n_0)$ only depend on $n_0$. Since every factor in the products of the right-hand sum is of the form $h(n')$ with $n' \ge n_0$, hence smaller than $\rho$, it follows that the expression (11.B.15) is less than
$$G(n_0) + H(n_0)\cdot\sum_{n=n_0}^{\infty} \rho^{\,n - n_0 + 1} < \infty;$$
hence
$$\sum_{n=3}^{\infty} \frac{p_{n-1,n-2}\,p_{n-2,n-3}\cdots p_{3,2}\,p_{2,1}}{p_{n-1,n}\,p_{n-2,n-1}\cdots p_{3,4}\,p_{2,3}} < \infty.$$
Hence it has been proved that if A0, A1, A2 hold, eqs.(11.B.8), (11.B.9) and so also eq.(11.B.7) have a nontrivial admissible solution; consequently $X_t$ is transient. It can also be proved that eqs.(11.B.10), (11.B.11) have a nontrivial admissible solution. The method of proof is quite similar to the one already used and will not be presented here. •
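The convergence of the series in (11.B.13) can also be checked numerically. In the sketch below, the schedules `a(k)`, `b(k)` are hypothetical examples satisfying A1, A2 (both tend to 1), and the transition probabilities take the form used in the proof: $p_{k,k-1} = \pi_1(1-a_k) + \pi_2(1-b_k)$, $p_{k,k+1} = \pi_1 a_k + \pi_2 b_k$.

```python
def transience_series(a, b, pi1=0.5, terms=500):
    """Partial sum of the series in (11.B.13):
    sum_{n=3}^{terms+2} prod_{k=2}^{n-1} p_{k,k-1} / p_{k,k+1}."""
    pi2 = 1.0 - pi1
    total, prod = 0.0, 1.0
    for n in range(3, terms + 3):
        k = n - 1  # one new factor of the product per term
        down = pi1 * (1 - a(k)) + pi2 * (1 - b(k))
        up = pi1 * a(k) + pi2 * b(k)
        prod *= down / up
        total += prod
    return total
```

With $a(k) = b(k) = k/(k+1)$, each factor equals $1/k$, the $n$-th term is $1/(n-1)!$, and the series sums to $e - 2$, illustrating the convergence guaranteed by the lemma.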
We now prove Theorem 11.1 using Theorem 11-11.B.1 and Lemma 11-11.B.2.
Proof of Theorem 11.1: In Lemma 11-11.B.2 it has been proved that eq.(11.B.7) has an admissible solution, so by Theorem 11-11.B.1, 0 is a transient state of $X_t$. Then, by Theorem A.10, for all $m, i \in Z$
$$\Pr(X_t = m \text{ i.o.} \mid X_0 = i) = 0 \;\Longrightarrow\; \Pr(X_t = m \text{ i.o.} \mid X_0 = i)\cdot\Pr(X_0 = i) = 0 \;\Longrightarrow\;$$
$$\Pr(X_t = m \text{ i.o. and } X_0 = i) = 0 \;\Longrightarrow\; \sum_{i \in Z} \Pr(X_t = m \text{ i.o. and } X_0 = i) = 0 \;\Longrightarrow\; \Pr(X_t = m \text{ i.o.}) = 0.$$
(Recall that "a.a." means "almost always".) Define the event $A_M = \{|X_t| > M \text{ a.a.}\}$. Clearly, if $N > M$ then $A_N \subset A_M$. So $A_1, A_2, \dots$ is a decreasing sequence of sets; defining $A = \bigcap_{M=1}^{\infty} A_M$, we have (by Lemma A.2): $\Pr(A) = \lim_{M \to \infty} \Pr(A_M) = 1$. But
$$A = \bigcap_{M=1}^{\infty} \{|X_t| > M \text{ a.a.}\} = \left\{\lim_{t \to \infty} |X_t| = \infty\right\}.$$
Consider now a path with $\lim_{t \to \infty} |X_t| = \infty$ along which $X_t > 0$ at some arbitrarily large times $t$ and $X_t < 0$ at other times $t_1, t_2, \dots$; say, without loss of generality, that $t < t_1$. Then, since $X_t$ must move through neighboring states, there must also be a time $\bar{t}$ with $t < \bar{t} < t_1$ and $X_{\bar{t}} = 0$; but this contradicts eq.(11.B.17). So it has been proved that for all paths such that $\lim_{t \to \infty} |X_t| = \infty$, either $\lim_{t \to \infty} X_t = +\infty$ or $\lim_{t \to \infty} X_t = -\infty$. From (ii) it follows that the set of all paths for which it is not true that $\lim_{t \to \infty} |X_t| = \infty$ has probability zero. This completes the proof of eq.(11.5), hence the proof of the theorem is complete. •
$$q(y \mid x) \equiv \begin{cases} p(y \mid x) & \text{if } x \ge m, \\ p(y \mid m) & \text{if } x < m. \end{cases}$$
In other words, $q(y \mid x)$ is identical to $p(y \mid x)$, except when $x$ is less than $m$. Both $p(\cdot \mid \cdot)$ and $q(\cdot \mid \cdot)$ are probability functions for any $x$, provided we limit $y$ to values such that $|x - y| = 1$.
Two Auxiliary Stochastic Processes
Before examining the properties of the $N^{ij}_t$ we need to define two auxiliary stochastic processes. These depend on certain probabilities which will now be defined. Define (for all $m \in Z$) the following:
$$\bar\gamma(m) \equiv \sup_{n \ge m} \gamma(n), \qquad \bar\alpha(m) \equiv 1 - \bar\gamma(m).$$
Obviously $\bar\alpha(m) + \bar\gamma(m) = 1$. Note that $0 \le \bar\gamma(m) \le 1$ and hence $0 \le \bar\alpha(m) \le 1$. In addition, $\bar\gamma(m)$ is monotonically decreasing, with $\lim_{m \to \infty} \bar\gamma(m) = 0$; it follows that $\bar\alpha(m)$ is monotonically increasing, with $\lim_{m \to \infty} \bar\alpha(m) = 1$. Note that for all $m \in Z$ we have
$$\bar\gamma(m) \ge \gamma(m) \quad \text{and} \quad \bar\alpha(m) \le \alpha(m).$$
Since $\bar\alpha(m)$ and $\bar\gamma(m)$ are nonnegative and add to one, they can be considered to be probabilities. Now, consider $m$ fixed and for $z \in \{-1, 1\}$ define
$$\bar p(z) \equiv \begin{cases} \bar\alpha(m) & \text{if } z = 1, \\ \bar\gamma(m) & \text{if } z = -1. \end{cases}$$
(We have suppressed, for brevity of notation, the dependence of $\bar p(\cdot)$ on $m$.) Consider a sequence of independent random variables $\tilde V^m_1, \tilde V^m_2, \dots$ which (for $t = 1, 2, \dots$ and $z \in \{-1, 1\}$) satisfy:
$$\Pr(\tilde V^m_t = z) = \bar p(z).$$
In other words, the $\tilde V^m_t$'s are a sequence of Bernoulli trials; in particular they are independent and identically distributed. Finally, define the stochastic process $\tilde M^m_t$ by the following relationship:
$$\tilde M^m_t = \begin{cases} 0 & \text{if } \tilde V^m_t = 1, \\ 1 & \text{if } \tilde V^m_t = -1. \end{cases} \qquad (11.C.1)$$
Clearly, for $z \in \{0, 1\}$ we have $\Pr(\tilde M^m_t = 0) = \bar\alpha(m)$ and $\Pr(\tilde M^m_t = 1) = \bar\gamma(m)$.
1. $V_t$ defines an inhomogeneous random walk and the random variables $V_1, V_2, \dots$ are dependent;
2. $\tilde V^m_t$ defines a homogeneous random walk and the random variables $\tilde V^m_1, \tilde V^m_2, \dots$ are independent;
3. $\sum_{s=1}^{t} M^d_s$ counts the number of $V_s$ moves to the left; $\sum_{s=1}^{t} \tilde M^m_s$ counts the number of $\tilde V^m_s$ moves to the left;
4. if for all $t$ greater than some $t_0$ we have $X_t \ge m$, then the probability of $V_t$ taking a move to the left is no greater than $\bar\gamma(m)$; that of $\tilde V^m_t$ taking a move to the left is always $\bar\gamma(m)$.
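Observation 4 is the heart of the comparison: it lets the dependent left-move indicators be dominated pathwise by i.i.d. ones. A standard way to see this is to drive both processes with the same uniform draws; the specific form of `gamma(x)` below is a hypothetical example whose supremum `gamma_bar` dominates it at every state.

```python
import random

def coupled_left_moves(steps=10000, seed=0):
    """Couple the walk's left-move indicators M^d_s with i.i.d.
    dominating indicators ~M_s via common uniforms: since
    gamma(x) <= gamma_bar for every state x, a left move of V
    forces a left move of ~V, so sum M^d_s <= sum ~M_s pathwise."""
    gamma = lambda x: 0.3 / (1 + abs(x))  # hypothetical left-move prob.
    gamma_bar = 0.3                       # sup over all states

    rng = random.Random(seed)
    x, md, mtilde = 0, 0, 0
    for _ in range(steps):
        u = rng.random()
        left = u < gamma(x)       # dependent process V_t
        tleft = u < gamma_bar     # i.i.d. process ~V_t
        md += left
        mtilde += tleft
        x += -1 if left else 1
    return md, mtilde
```

The coupling makes the domination hold on every sample path, not merely in distribution, which is exactly what the comparison lemmas below exploit.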
Proof. This is essentially one half of the Central Limit Theorem. For any $m$, take a positive $\delta$ and define the following sets:
$$C_t = \left\{ \left| \frac{\sum_{s=1}^{t} \tilde M^m_s - t\,\bar\gamma(m)}{\sqrt{t\,\bar\gamma(m)\,(1 - \bar\gamma(m))}} \right| > \sqrt{2\delta \log(t)} \right\}, \qquad \bar C_t = \left\{ \sum_{s=1}^{t} \tilde M^m_s \ge 2\,t\,\bar\gamma(m) \right\}.$$
Since $\bar\gamma(m) \to 0$, it follows that for $t$ large enough (say for all $t$ greater than some appropriate $t'_m$) we have
$$t\,\bar\gamma(m) > \sqrt{2\delta\,\bar\gamma(m)\,(1 - \bar\gamma(m))}\cdot\sqrt{t \log(t)}.$$
Hence for all $t \ge t'_m$ we have $\bar C_t \subset C_t$. Now, it is clear that $\tilde M^m_1, \tilde M^m_2, \dots$ is a sequence of Bernoulli trials, with expectation $E(\tilde M^m_t) = \bar\gamma(m)$. Then, using Theorem A.9 (see Mathematical Appendix), it follows that there is some $t''_m$ such that for all $t > t''_m$ we have $\Pr(C_t) < \epsilon$. Then it follows that $\Pr(\bar C_t) \le \Pr(C_t) < \epsilon$. In short, for every $m$ there is some $t_m = \max(t'_m, t''_m)$ such that for all $t \ge t_m$ we have
$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge \epsilon \,\Big|\, \forall \tau \ge n\ X_\tau \ge m \right) \le \Pr\left( \frac{\sum_{s=1}^{t} \tilde M^m_s}{t} \ge \epsilon \right). \qquad (11.C.2)$$
(If $L$ is not an integer, then "$L$ times" should be taken to mean "$\lceil L \rceil$ times", where $\lceil L \rceil$ means the integer part of $L$ plus one.)
Now, choose any $x_0 \in Z$ and define the following conditions on sequences $(x_1, x_2, \dots, x_t) \in Z^t$.
C1 For $s = 1, 2, \dots, t$ we have $x_s < x_{s-1}$ at least $L$ times.
C2 For $s = 1, 2, \dots, t$ we have $|x_s - x_{s-1}| = 1$.
C3 For $s = 1, 2, \dots, t$ we have $x_s \ge m$.
1. $A^m(x_0)$: the set of sequences that satisfy C1, C2, C3; and
Recall that
3. $q(x_s \mid x_{s-1}) \ge 0$;
CONVERGENCE OF PARALLEL DATA ALLOCATION 195
then, defining
$$Q_t(x_0) \equiv \sum_{(x_1, \dots, x_t) \in A(x_0)} q(x_1 \mid x_0)\, q(x_2 \mid x_1) \cdots q(x_t \mid x_{t-1}), \qquad (11.C.3)$$
it follows that
$$Q = \sum_{x_0 = m, m+1, \dots} R(x_0)\, Q_t(x_0).$$
Let us now bound $Q_t(x_0)$; then it will be easy to bound $Q$ as well. As can be seen from eq.(11.C.3), $Q_t(x_0)$ consists of a sum of products of $q(\cdot \mid \cdot)$ terms. We will proceed in $t$ steps, at every step producing a greater expression by replacing one of the $q(\cdot \mid \cdot)$ terms by a $\bar p(\cdot)$ term. Namely, at step 1 we will replace $q(x_t \mid x_{t-1})$ by $\bar p(x_t - x_{t-1})$; at step 2 we will replace $q(x_{t-1} \mid x_{t-2})$ by $\bar p(x_{t-1} - x_{t-2})$, and so on, until at the $t$-th step we obtain an expression which is greater than $Q_t(x_0)$ and consists entirely of $\bar p(\cdot)$ terms.
To bound $Q_t(x_0)$, it will be useful to define some additional sets of sequences $(y_1, y_2, \dots, y_t) \in \{-1, 1\}^t$. We define
$$B \equiv \{(y_1, \dots, y_t):\ \text{for } s = 1, \dots, t \text{ we have } y_s \in \{-1, 1\},\ \text{no. of } {-1}\text{'s} \ge L\}.$$
Note that the sets $A(x_0)$ and $B$ are in a one-to-one correspondence: for $x_0$ fixed and any $(x_1, x_2, \dots, x_t) \in A(x_0)$, a unique $(y_1, y_2, \dots, y_t) \in B$ is defined by taking $y_s = x_s - x_{s-1}$ $(s = 1, 2, \dots)$; conversely, for $x_0$ fixed and any $(y_1, y_2, \dots, y_t) \in B$, $x_s = x_{s-1} + y_s$ defines a unique $(x_1, x_2, \dots, x_t) \in A(x_0)$. Hence there are one-to-one functions $Y: A(x_0) \to B$ and $X: B \to A(x_0)$, where $X = Y^{-1}$. Note also that $B$ is independent of $x_0$, i.e. for any $x_0, x'_0 \in Z$ we have $Y(A(x_0)) = Y(A(x'_0))$. Now, define three sets as follows:
$$\bar B^1_t \equiv \{(y_1, \dots, y_t) \in B:\ y_t = 1\}, \qquad \bar B^2_t \equiv \{(y_1, \dots, y_t) \in B:\ y_t = -1 \text{ and } (y_1, \dots, y_{t-1}, 1) \in B\},$$
$$\bar B_t \equiv B - \bar B^1_t - \bar B^2_t.$$
Then $Q_t(x_0)$ splits into three corresponding sums, (11.C.5), (11.C.6) and (11.C.7), whose total is expression (11.C.4):
$$Q_t(x_0) = \sum_{(x_1,\dots,x_t) \in X(\bar B^1_t)} q(x_1 \mid x_0) \cdots q(x_t \mid x_{t-1}) + \sum_{(x_1,\dots,x_t) \in X(\bar B^2_t)} q(x_1 \mid x_0) \cdots q(x_t \mid x_{t-1}) + \sum_{(x_1,\dots,x_t) \in X(\bar B_t)} q(x_1 \mid x_0) \cdots q(x_t \mid x_{t-1}).$$
Now, each $q(x_1 \mid x_0)\, q(x_2 \mid x_1) \cdots q(x_t \mid x_{t-1})$ term in the above expressions corresponds to a sequence $(x_1, x_2, \dots, x_t)$. Recall the following facts.
1. Sequences $(x_1, x_2, \dots, x_t) \in A(x_0)$ are in a one-to-one correspondence with sequences $(y_1, y_2, \dots, y_t) \in B$; this correspondence is expressed by $x_s = y_s + x_{s-1}$ $(s = 1, 2, \dots, t)$.
2. Sequences from $\bar B^1_t$ and $\bar B^2_t$ are also in a one-to-one correspondence: for every $(y_1, y_2, \dots, y_{t-1}, 1) \in \bar B^1_t$ there is a $(y_1, y_2, \dots, y_{t-1}, -1) \in \bar B^2_t$, where $y_1, y_2, \dots, y_{t-1}$ are the same in both sequences.
It follows that the terms in expressions (11.C.5), (11.C.6) are also in a one-to-one correspondence, which can be expressed by the following rule: exactly one $q(x_1 \mid x_0)\, q(x_2 \mid x_1) \cdots q(x_{t-1} - 1 \mid x_{t-1})$ in expression (11.C.6) corresponds to every $q(x_1 \mid x_0)\, q(x_2 \mid x_1) \cdots q(x_{t-1} + 1 \mid x_{t-1})$ in expression (11.C.5). Using the above facts, we can rewrite expression (11.C.4), which is the sum of expressions (11.C.5), (11.C.6), (11.C.7), as an expression (11.C.8) in which corresponding terms are grouped together. In expression (11.C.8), the terms in square brackets add to one, so they can be replaced by $[\bar p(1) + \bar p(-1)]$ (which also equals one) without altering the value of the expression. Suppose now that in the resulting expression (11.C.9), each term in the sum is replaced by $q(x_1 \mid x_0) \cdots q(x_{t-1} \mid x_{t-2})\,\bar p(x_t - x_{t-1})$, i.e. $q(x_t \mid x_{t-1})$ is replaced by $\bar p(x_t - x_{t-1})$. Recall that for sequences in $X(\bar B_t)$ we have $x_t = x_{t-1} - 1$; hence
$$\bar p(x_t - x_{t-1}) = \bar p(-1), \qquad q(x_t \mid x_{t-1}) = q(x_{t-1} - 1 \mid x_{t-1}).$$
On the other hand, $q(x_{t-1} - 1 \mid x_{t-1}) \le \bar\gamma(m) = \bar p(-1)$, so each such replacement can only increase the corresponding term.
Recall that $\bar B_{t-1}$, $\bar B^1_{t-1}$, $\bar B^2_{t-1}$ are defined analogously to $\bar B_t$, $\bar B^1_t$, $\bar B^2_t$. It is also true that the elements of $\bar B^1_{t-1}$ and $\bar B^2_{t-1}$ are in a one-to-one correspondence, that $\bar B_{t-1}$ is the set of sequences with $-1$ in the $t-1$ position for which the no. of $-1$'s is exactly equal to $L$, and that $\bar B^2_{t-1}$ is the set of sequences for which the no. of $-1$'s is greater than $L$. The arguments to prove these claims are much the same as the ones regarding $\bar B_t$, $\bar B^1_t$, $\bar B^2_t$ and will not be repeated.
Now let us continue the replacement procedure in the same way as in the previous step, obtaining expressions (11.C.12), (11.C.13), (11.C.14). In eq.(11.C.13), the terms in square brackets add to one, so they can be replaced by $[\bar p(1) + \bar p(-1)]$, which also equals one. Also, in the sum in eq.(11.C.14), each term can be replaced by
$$q(x_1 \mid x_0)\, q(x_2 \mid x_1) \cdots q(x_{t-2} \mid x_{t-3})\,\bar p(x_{t-1} - x_{t-2})\,\bar p(x_t - x_{t-1}),$$
for the same reasons as previously. Hence, by replacing all the $q(x_{t-1} \mid x_{t-2})$ terms with $\bar p(x_{t-1} - x_{t-2})$, we find that the expression obtained at this step is no greater than
$$\sum_{(x_1, x_2, \dots, x_t) \in A(x_0)} q(x_1 \mid x_0)\, q(x_2 \mid x_1) \cdots q(x_{t-2} \mid x_{t-3})\,\bar p(x_{t-1} - x_{t-2})\,\bar p(x_t - x_{t-1}).$$
Iterating the replacement procedure down to the first step and recalling the definition of $Q$, we obtain
$$Q = \sum_{x_0 = m, m+1, \dots} R(x_0) \sum_{(x_1, \dots, x_t) \in A(x_0)} q(x_1 \mid x_0) \cdots q(x_t \mid x_{t-1}) \le \sum_{x_0 = m, m+1, \dots} R(x_0) \sum_{(x_1, \dots, x_t) \in A(x_0)} \bar p(x_1 - x_0) \cdots \bar p(x_t - x_{t-1})$$
$$\le \sum_{x_0 = m, m+1, \dots} R(x_0)\,\Pr\left(\sum_{s=1}^{t} \tilde M^m_s \ge L\right) = \Pr\left(\sum_{s=1}^{t} \tilde M^m_s \ge L\right), \qquad (11.C.16)$$
since $\sum_{x_0 = m, m+1, \dots} R(x_0) = 1$. Recalling that $L = \epsilon\, t$, this yields
$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge \epsilon \,\Big|\, \forall \tau \ge n\ X_\tau \ge m \right) \le \Pr\left( \frac{\sum_{s=1}^{t} \tilde M^m_s}{t} \ge \epsilon \right)$$
and the lemma has been proved. •
Lemma 11-11.C.3 If conditions A0, A1, A2 hold, then for all $m \in Z$ and for all $n \in N$ we have
$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge 2\bar\gamma(m) \text{ i.o.} \,\Big|\, \forall \tau \ge n\ X_\tau \ge m \right) = 0.$$
Proof. The idea is to show that
$$\sum_{t=1}^{\infty} \Pr\left( \frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge 2\bar\gamma(m) \,\Big|\, \forall \tau \ge n\ X_\tau \ge m \right) < \infty.$$
Then, the conclusion of the lemma will follow immediately from the Borel-Cantelli Lemma (see Mathematical Appendix). Now, from Lemma 11-11.C.2, setting $\epsilon = 2\bar\gamma(m)$, for any $m \in Z$ and for any $n, t \in N$, we have that
$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge 2\bar\gamma(m) \,\Big|\, \forall \tau \ge n\ X_\tau \ge m \right) \le \Pr\left( \frac{\sum_{s=1}^{t} \tilde M^m_s}{t} \ge 2\bar\gamma(m) \right). \qquad (11.C.17)$$
Combining this with Lemma 11-11.C.1, we clearly have
$$\sum_{t=1}^{\infty} \Pr\left( \frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge 2\bar\gamma(m) \,\Big|\, \forall \tau \ge n\ X_\tau \ge m \right) \le \sum_{t=1}^{\infty} \frac{1}{t^2} < \infty,$$
which is the required summability. •
$$\Pr\left( \lim_{t\to\infty} \frac{N^d_t}{t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1, \qquad (11.C.18)$$
$$\Pr\left( \lim_{t\to\infty} \frac{N^{12}_t}{t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1, \qquad (11.C.19)$$
and that
$$\Pr\left( \lim_{t\to\infty} \frac{N^{21}_t}{t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1, \qquad (11.C.20)$$
$$\Pr\left( \lim_{t\to\infty} \frac{N^{11}_t}{t} = \pi_1 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1, \qquad (11.C.21)$$
and that
$$\Pr\left( \lim_{t\to\infty} \frac{N^{22}_t}{t} = \pi_2 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1, \qquad (11.C.22)$$
$$\Pr\left( \lim_{t\to\infty} \frac{N^{12}_t}{N^{11}_t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1, \qquad (11.C.23)$$
and that
$$\Pr\left( \lim_{t\to\infty} \frac{N^{21}_t}{N^{22}_t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1. \qquad (11.C.24)$$
",n+t 1 Mds )
Pr ( L..s-n
- : < 2-y(m) a.a. IAmn = 1~
204
Pr (
I
L
n+t
M: < 27{m)·t IAmn
)
= 1 =>
s=n+l
Pr (
I
From the above equation follows that, for all m, n ~ 0, conditional on the event
A mn , we have (with probability 1)
"n+tMd
L.Js-l s <1={) t n
n + t < .&."( m . n + t + n + t'
In addition, the following inequalities are true (obviously with probability 1
and for all m, n ~ 0 )
t
\It -t- . 27{m) < 27{m), (1l.C.25)
+n
n
- - < ;:Y{m). (1l.C.26)
n+t
Taking $t_{nm} = \max(t^1_{nm}, t^2_{nm})$, it follows that, for all $m, n \ge 0$ and all $t \ge t_{nm}$, $\frac{\sum_{s=1}^{n+t} M^d_s}{n+t} < 3\bar\gamma(m)$. Define
$$A_m \equiv \bigcup_{n=1}^{\infty} A_{mn} = \bigcup_{n=1}^{\infty} \{\forall \tau \ge n:\ X_\tau \ge m\} = \{\exists n:\ \forall \tau \ge n\ X_\tau \ge m\} = \{X_\tau \ge m \text{ a.a.}\};$$
also, define
$$A \equiv \bigcap_{m=1}^{\infty} A_m = \bigcap_{m=1}^{\infty} \{X_\tau \ge m \text{ a.a.}\} = \{\forall m \ge 1:\ X_\tau \ge m \text{ a.a.}\},$$
$$B_m \equiv \left\{ \frac{N^d_t}{t} \le 3\bar\gamma(m) \text{ a.a.} \right\}.$$
On the other hand, for a fixed $m$ and for $n < n'$ we have $B_m \cap A_{mn} \subset B_m \cap A_{mn'}$. Hence
$$\Pr(B_m \cap A_m) = \lim_{n \to \infty} \Pr(B_m \cap A_{mn}). \qquad (11.C.30)$$
Since
$$\Pr(B_m \cap A_{mn}) = \Pr(A_{mn}) \quad \text{for all } m, n \ge 0,$$
it follows from (11.C.28), (11.C.29) and (11.C.30) that for all $m > m_1$
$$\Pr(A_m) = \Pr(A_m \cap B_m). \qquad (11.C.31)$$
For $m < m'$ we have: (a) $\{X_t \ge m' \text{ a.a.}\} \subset \{X_t \ge m \text{ a.a.}\}$, hence $A_{m'} \subset A_m$; and (b) since $\bar\gamma(m)$ decreases monotonically to 0, $B_{m'} \subset B_m$. It follows that
$$\Pr(A \cap B) = \lim_{m \to \infty} \Pr(A_m \cap B_m). \qquad (11.C.33)$$
Then, from eqs.(11.C.31), (11.C.32), (11.C.33) and the assumption that $\Pr(A) > 0$, it follows that
$$0 < \Pr(A) = \Pr(A \cap B) \;\Longrightarrow\; \frac{\Pr(A \cap B)}{\Pr(A)} = 1 \;\Longrightarrow\; \Pr(B \mid A) = 1.$$
In other words,
$$\Pr\left( \forall m \ge 1:\ \frac{N^d_t}{t} < 3\bar\gamma(m) \text{ a.a.} \,\Big|\, \forall m \ge 1:\ X_\tau \ge m \text{ a.a.} \right) = 1 \;\Longrightarrow\;$$
$$\Pr\left( \lim_{t\to\infty} \frac{N^d_t}{t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1. \qquad (11.C.34)$$
Since $0 \le N^{12}_t \le N^d_t$, it follows that
$$\Pr\left( 0 \le \limsup_{t\to\infty} \frac{N^{12}_t}{t} \le \lim_{t\to\infty} \frac{N^d_t}{t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1 \;\Longrightarrow\;$$
$$\Pr\left( \lim_{t\to\infty} \frac{N^{12}_t}{t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1. \qquad (11.C.35)$$
By the law of large numbers applied to the source activations,
$$\Pr\left( \lim_{t\to\infty} \frac{N^{11}_t + N^{12}_t}{t} = \pi_1 \right) = 1, \quad \text{i.e.} \quad \Pr\left( \lim_{t\to\infty} \frac{N^1_t}{t} = \pi_1 \right) = 1. \qquad (11.C.36)$$
Since
$$\Pr\left( \lim_{\tau\to\infty} X_\tau = +\infty \right) > 0, \qquad (11.C.37)$$
it also holds that
$$\Pr\left( \lim_{t\to\infty} \frac{N^1_t}{t} = \pi_1 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1. \qquad (11.C.38)$$
Since
$$\frac{N^{11}_t}{t} = \frac{N^{11}_t + N^{12}_t}{t} - \frac{N^{12}_t}{t} = \frac{N^1_t}{t} - \frac{N^{12}_t}{t}, \qquad (11.C.39)$$
by taking limits in eq.(11.C.39) and using eqs.(11.C.35), (11.C.38) it follows that
$$\Pr\left( \lim_{t\to\infty} \frac{N^{11}_t}{t} = \pi_1 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1.$$
Since the limits exist with conditional probability one, we conclude that
$$\Pr\left( \lim_{t\to\infty} \frac{N^{12}_t}{N^{11}_t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1.$$
By an exactly analogous procedure it can be shown that
$$\Pr\left( \lim_{t\to\infty} \frac{N^{21}_t}{N^{22}_t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1.$$
This completes the proof of eq.(11.C.23) and hence the part of the theorem which refers to the case $\lim_{\tau\to\infty} X_\tau = +\infty$ is complete.
The part of the proof concerning the case $\lim_{\tau\to\infty} X_\tau = -\infty$ follows exactly the same pattern as the previously presented results, requiring the proof of additional lemmas, corresponding to Lemmas 11-11.C.1, 11-11.C.2 and 11-11.C.3. This is omitted for the sake of brevity. •
12 CONVERGENCE OF SERIAL DATA ALLOCATION
In this chapter we examine the convergence of the serial data allocation scheme presented in Chapter 10. This chapter evolves along lines parallel to those of Chapter 11, where the parallel data allocation scheme was treated. We will provide conditions which are sufficient to ensure the convergence of the serial data allocation scheme.
The study of convergence of the serial data allocation scheme starts by consid-
ering the case of two sources and two predictors. In other words, we consider,
for the time being, a slightly modified form of the serial data allocation scheme,
where all incoming data are either accepted by the first predictor, or passed
to the second predictor, which must necessarily accept them. The general ver-
sion of the scheme, with K sources and a variable number of predictors will be
discussed in the next section.
$$V_t \equiv X_t - X_{t-1}. \qquad (12.1)$$
$V_t$ and $X_t$ satisfy, as previously, the relationships introduced in Chapter 11. We will not use the process $M_t$ in this chapter; instead we will work with the processes $M^{ij}_t$.
At the same time, source no.2 data (rejected by predictor no.1) will certainly be accepted by predictor no.2, which will result in specialization of predictor no.2 to source no.2. It may be expected that, under certain conditions, this process will reinforce itself, resulting, in the limit of infinitely many samples, in "absolute" specialization of both predictors.
To test this conjecture mathematically, in complete analogy to the parallel data allocation case, we introduce three assumptions.
B0. For $i = 1, 2$ the following is true:
$$\Pr(\text{Pred. no.}i \text{ accepts } y_t \mid z_t, z_{t-1}, \dots, X_{t-1}, X_{t-2}, \dots, y_{t-1}, y_{t-2}, \dots) = \Pr(\text{Pred. no.}i \text{ accepts } y_t \mid X_{t-1}).$$
Notice that, while $a_n$ is exactly the same as in Chapter 11, now $b_n$ is the probability that predictor no.1 (rather than predictor no.2) accepts a sample from source no.2, given that so far it has accepted $n$ more samples from source no.1 than from source no.2.
From assumption B0 it follows that $X_t$ is a Markovian process on $Z$. The transition probabilities of $X_t$ are defined by
$$p_{n,n+1} = \pi_1\, a_n, \qquad p_{n,n-1} = \pi_2\, b_n, \qquad p_{n,n} = \pi_1\,(1 - a_n) + \pi_2\,(1 - b_n).$$
Transitions for all other $m$, such that $|n - m| > 1$, must be equal to zero. In short we have
$$p_{n,n+1} = \pi_1 a_n, \qquad p_{n,n-1} = \pi_2 b_n, \qquad p_{n,n} = \pi_1(1 - a_n) + \pi_2(1 - b_n), \qquad p_{n,m} = 0 \ \text{ if } |n - m| > 1.$$
Now, regarding the probabilities $a_n$, $b_n$, the following assumptions are made.
1. As the specialization level increases to plus infinity (which means that predictor no.1 has received a lot more data from source no.1 than from source no.2), predictor no.1 is very likely to accept an additional sample from source no.1, while it is very unlikely to accept a sample from source no.2, i.e. $\lim_{n\to+\infty} a_n = 1$ and $\lim_{n\to+\infty} b_n = 0$.
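The serial specialization walk can be simulated directly from the transition probabilities $p_{n,n+1} = \pi_1 a_n$, $p_{n,n-1} = \pi_2 b_n$, $p_{n,n} = \pi_1(1-a_n) + \pi_2(1-b_n)$ stated above. The schedules `a(n)`, `b(n)` below are hypothetical choices (not from the book) with $a_n \to 1$ and $b_n \to 0$ as $n \to +\infty$, and mirrored behavior as $n \to -\infty$.

```python
import random

def serial_specialization(steps=30000, pi1=0.5, seed=0):
    """Simulate the serial-scheme walk X_t: a step is +1 when predictor
    no.1 accepts a source-1 sample, -1 when it accepts a source-2
    sample, and 0 when the incoming sample is rejected."""
    def a(n):  # hypothetical acceptance prob. for source-1 samples
        return 0.5 + 0.5 * n / (abs(n) + 10)

    def b(n):  # hypothetical acceptance prob. for source-2 samples
        return 0.5 - 0.5 * n / (abs(n) + 10)

    rng = random.Random(seed)
    pi2 = 1.0 - pi1
    x = 0
    for _ in range(steps):
        u = rng.random()
        if u < pi1 * a(x):
            x += 1
        elif u < pi1 * a(x) + pi2 * b(x):
            x -= 1
        # otherwise the sample was rejected: x is unchanged
    return x
```

In accordance with Theorem 12.1 below, the walk ends up far from the origin for essentially every seed, despite the additional stay-in-place moves.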
Figure 12.1. The specialization process is an inhomogeneous random walk on the integers .
Now there are three possibilities for the imaginary particle: it may move one step to the right or left, or stay in place.
Generally, when the particle is "far to the right", it is more likely to keep
moving to the right, than to stay in place or move to the left. Conversely,
when the particle is "far to the left", moves to the left are preferred. While it
seems reasonable that the particle will wander off either to the far right or to
the far left, the possibility of oscillation cannot be excluded and a more precise
analysis is required. The results of the analysis are Theorems 12.1 and 12.2
presented in the next section; the proof of these theorems is presented at the
end of the chapter.
$$\Pr\left( \lim_{t\to\infty} |X_t| = +\infty \right) = 1, \qquad (12.3)$$
$$\Pr\left( \lim_{t\to\infty} X_t = +\infty \right) + \Pr\left( \lim_{t\to\infty} X_t = -\infty \right) = 1. \qquad (12.4)$$
$X_t \to +\infty$: Predictor no.1 will accumulate a lot more source no.1 samples than source no.2 samples.
$X_t \to -\infty$: Predictor no.1 will accumulate a lot more source no.2 samples than source no.1 samples.
The total probability that one of these two events will take place is one, i.e. predictor no.1 will certainly specialize in one of the two sources.
Notice that Theorem 12.1 does not quite say that both predictors will spe-
cialize. But in fact, Theorem 12.2 implies that both predictors will specialize,
each in a different source and the specialization is stronger than that implied
by Theorem 12.1.
1. If $\Pr\left(\lim_{t\to\infty} X_t = +\infty\right) > 0$ then
$$\Pr\left( \lim_{t\to\infty} \frac{N^{21}_t}{N^{11}_t} = 0 \,\Big|\, \lim_{t\to\infty} X_t = +\infty \right) = 1, \qquad (12.5)$$
$$\Pr\left( \lim_{t\to\infty} \frac{N^{12}_t}{N^{22}_t} = 0 \,\Big|\, \lim_{t\to\infty} X_t = +\infty \right) = 1. \qquad (12.6)$$
2. If $\Pr\left(\lim_{t\to\infty} X_t = -\infty\right) > 0$ then
$$\Pr\left( \lim_{t\to\infty} \frac{N^{11}_t}{N^{21}_t} = 0 \,\Big|\, \lim_{t\to\infty} X_t = -\infty \right) = 1, \qquad (12.7)$$
$$\Pr\left( \lim_{t\to\infty} \frac{N^{22}_t}{N^{12}_t} = 0 \,\Big|\, \lim_{t\to\infty} X_t = -\infty \right) = 1. \qquad (12.8)$$
Theorem 12.2 states that with probability one both predictors will specialize,
one in each source and in the "strong" ratio sense, as already discussed in
Chapter 11. Since, by Theorem 12.1, X t goes either to +00 or to -00, it follows
that specialization of both predictors (one in each source) is guaranteed.
$$\lim_{t\to\infty} \frac{N^{21}_t}{N^{11}_t} = 0, \quad \lim_{t\to\infty} \frac{N^{12}_t}{N^{22}_t} = 0, \qquad (12.9)$$
or
$$\lim_{t\to\infty} \frac{N^{11}_t}{N^{21}_t} = 0, \quad \lim_{t\to\infty} \frac{N^{22}_t}{N^{12}_t} = 0.$$
Consider, to be specific, the first case. In this case, the proportion of data generated by composite source no.2 and collected by predictor no.1 goes to zero. Suppose now that after sufficient time has elapsed, a third predictor is added to the serial data allocation scheme. The data reaching the second and third predictors will contain a vanishingly small proportion of source no.1 samples. Hence Theorems 12.1 and 12.2 can now be applied to the pair of predictors no.2 and no.3, and we can conclude that, if conditions B0, B1, B2 hold true for this new combination of source no.2, predictors and training algorithm, predictor no.2 will specialize either in a simple source no.$k_2$ or in the composite source $\{2, 3, 4, \dots, K\} - \{k_2\}$. In fact, without loss of generality, we can assume that $k_2 = 2$. Then it follows that either predictor no.2 or predictor no.3 will specialize in source no.2. We can continue adding predictors after sufficient time has elapsed and, by the previous argument, we can expect that, as long as conditions B0, B1, B2 are satisfied for each active source (and the predictors / training algorithm / threshold combination), and given sufficient time, the serial data allocation algorithm will identify the $K$ sources.
Since assumptions B0, B1, B2 have to be satisfied by each active source separately, it may be considered that serial data allocation has a higher chance of success than parallel data allocation. On the other hand, the enhanced competition of parallel data allocation (recall that all predictors compete for the same data) may result in improved performance.
12.3 CONCLUSIONS
Theorem 12.1 states that the particle whose movement represents the evolution of the specialization process will not oscillate around the origin but will wander off either to plus or minus infinity.
Theorem 12.2 describes the behavior of the processes $N^{ij}_t$. At every time $t$, these processes increase or remain unchanged (but cannot decrease) with certain probabilities which depend on $X_t$, i.e. on the current specialization. Rather than examining the $N^{ij}_t$ processes directly, we introduce auxiliary processes which are independent and hence fairly easy to analyze. Specifically, we introduce three random walks: $V_t$ and two auxiliary independent-increment processes derived from it.
Lemmas 12-12.D.3, 12-12.E.3 and 12-12.F.3 are then used to prove Theorem 12.2, by showing that the following events, conditional on $X_t$ tending to $+\infty$, happen with probability one:
1. $\frac{N^{21}_t}{t}$ goes to zero;
2. hence $\frac{N^{22}_t}{t}$ goes to $\pi_2$;
3. also $\frac{N^{11}_t}{t}$ goes to $\pi_1$;
4. hence $\frac{N^{12}_t}{t}$ goes to zero;
5. and finally both $\frac{N^{21}_t}{N^{11}_t}$ and $\frac{N^{12}_t}{N^{22}_t}$ go to zero.
Lemma 12-12.B.1 Suppose that conditions B0, B1, B2 hold and the specialization process $X_t$ has transition probability matrix $P = [p_{m,n}]_{m,n \in Z}$. Then the system
$$u_n = \sum_{k \neq 0} p_{n,k}\, u_k, \qquad n \in Z - \{0\} \qquad (12.B.1)$$
has a nontrivial admissible solution.
$$\begin{cases}
u_3 - u_2 = \dfrac{p_{2,1}}{p_{2,3}}\,(u_2 - u_1), \\[4pt]
u_4 - u_3 = \dfrac{p_{3,2}}{p_{3,4}}\,(u_3 - u_2) = \dfrac{p_{3,2}\,p_{2,1}}{p_{3,4}\,p_{2,3}}\,(u_2 - u_1), \\[4pt]
\quad\vdots \\
u_N - u_{N-1} = \dfrac{p_{N-1,N-2}\,p_{N-2,N-3}\cdots p_{3,2}\,p_{2,1}}{p_{N-1,N}\,p_{N-2,N-1}\cdots p_{3,4}\,p_{2,3}}\,(u_2 - u_1).
\end{cases}$$
Choose any $u_1$ such that $0 < u_1 < 1$. Then, since $p_{1,0}, p_{1,2} > 0$, also
$$u_2 - u_1 > 0 \;\Longrightarrow\; u_2 > u_1 > 0.$$
Then, from eq.(12.B.6), for $N = 3, 4, \dots$ we also have $u_N > 0$. So a solution to eqs.(12.B.2), (12.B.3) has been obtained, which satisfies $u_N > 0$ for $N = 1, 2, \dots$. Now, if relationship (12.B.7) holds then, evidently, for $N = 1, 2, \dots$ the $u'_N$'s satisfy both eqs.(12.B.2), (12.B.3) and $0 < u'_N \le 1$. So, it only needs to be shown that relationship (12.B.7) will always be true if conditions B0, B1, B2 hold. To show this, note that
$$p_{n-1,n-2} = \pi_2\, b_{n-1}, \qquad p_{n-1,n} = \pi_1\, a_{n-1}.$$
If we define
$$h(n) \equiv \frac{p_{n-1,n-2}}{p_{n-1,n}} = \frac{\pi_2\, b_{n-1}}{\pi_1\, a_{n-1}},$$
it is easy to see that $\lim_{n\to\infty} h(n) = 0$, so for any $0 < \rho < 1$ there is some $n_0$ such that for all $n \ge n_0$ we have $h(n) < \rho$. Consider
$$\sum_{n=3}^{\infty} \frac{p_{n-1,n-2}\,p_{n-2,n-3}\cdots p_{3,2}\,p_{2,1}}{p_{n-1,n}\,p_{n-2,n-1}\cdots p_{3,4}\,p_{2,3}} = G(n_0) + H(n_0)\cdot\sum_{n=n_0}^{\infty} \frac{p_{n-1,n-2}\,p_{n-2,n-3}\cdots p_{n_0-1,n_0-2}}{p_{n-1,n}\,p_{n-2,n-1}\cdots p_{n_0-1,n_0}}, \qquad (12.B.9)$$
where $G(n_0)$ and $H(n_0)$ depend only on $n_0$. Since every factor in the products of the right-hand sum is smaller than $\rho$, it follows that expression (12.B.9) is less than $G(n_0) + H(n_0)\cdot\sum_{n=n_0}^{\infty} \rho^{\,n - n_0 + 1} < \infty$; hence
$$\sum_{n=3}^{\infty} \frac{p_{n-1,n-2}\,p_{n-2,n-3}\cdots p_{3,2}\,p_{2,1}}{p_{n-1,n}\,p_{n-2,n-1}\cdots p_{3,4}\,p_{2,3}} < \infty.$$
Hence it has been proved that if B0, B1, B2 hold, eqs.(12.B.2), (12.B.3) and so also eq.(12.B.1) have a nontrivial admissible solution; consequently $X_t$ is transient. It can also be proved that eqs.(12.B.4), (12.B.5) have a nontrivial admissible solution. The method of proof is quite similar to the one already used and will not be presented here. •
Now Theorem 12.1 can be proved using Theorem 11-11.B.1 and Lemma 12-12.B.1.
Proof of Theorem 12.1: In fact the proof is now exactly the same as that of Theorem 11.1, so it is omitted. •
Let us first define some useful quantities. Recall that the transition probability $\Pr(X_t = n \mid X_{t-1} = m)$ is denoted by $p_{m,n}$. Define, for $m \in Z$, the following quantities:
$$\alpha(m) \equiv p_{m,m+1}, \qquad \gamma(m) \equiv p_{m,m-1}, \qquad \beta(m) \equiv p_{m,m}.$$
These are just more convenient symbols for the transition probabilities of $X_t$. Now define the following for $x, y \in Z$:
$$p(y \mid x) \equiv \begin{cases} \alpha(x) & \text{if } x < y, \\ \beta(x) & \text{if } x = y, \\ \gamma(x) & \text{if } x > y, \end{cases} \qquad q(y \mid x) \equiv \begin{cases} p(y \mid x) & \text{if } x \ge m, \\ p(y \mid m) & \text{if } x < m. \end{cases}$$
In other words, $q(y \mid x)$ is identical to $p(y \mid x)$, except when $x$ is less than $m$. When we restrict $y$ to values such that $|x - y| \le 1$, both $p(\cdot \mid \cdot)$ and $q(\cdot \mid \cdot)$ are probability functions for any $x$.
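The definitions of $p(y \mid x)$ and $q(y \mid x)$ can be transcribed directly into code. One point deserves care: for $x < m$ we read $q(\cdot \mid x)$ as using the increment distribution of level $m$ (i.e. $q(x + \delta \mid x) = p(m + \delta \mid m)$) — an interpretive assumption on our part, but the one that makes $q(\cdot \mid x)$ a probability function when $y$ is restricted to $|x - y| \le 1$. The kernel functions passed in are arbitrary illustrative choices.

```python
def make_kernels(p_up, p_stay, m):
    """Build p(y|x) and q(y|x) from alpha(x) = p_{x,x+1},
    beta(x) = p_{x,x}, gamma(x) = p_{x,x-1} = 1 - alpha - beta."""
    def p(y, x):
        if y == x + 1:
            return p_up(x)                     # alpha(x)
        if y == x:
            return p_stay(x)                   # beta(x)
        if y == x - 1:
            return 1.0 - p_up(x) - p_stay(x)   # gamma(x)
        return 0.0

    def q(y, x):
        # identical to p for x >= m; frozen at level m below it
        return p(y, x) if x >= m else p(m + (y - x), m)

    return p, q
```

For instance, with `p_up(x)` equal to 0.3 below the origin and 0.6 elsewhere and `p_stay` constant at 0.2, $q(\cdot \mid x)$ sums to one for every $x$, agrees with $p$ for $x \ge m$, and uses the level-$m$ increments below $m$.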
Here is an outline of the argument which we will follow in the next sections. This is essentially the same argument as the one presented in Section 11.C of Chapter 11. In Sections 12.D, 12.E and 12.F we do the following.
Sec. 12.D: the probability (conditional on the specialization event) that the relevant sample average is at least $c$ is shown to be less than the probability that the sample average of the corresponding auxiliary i.i.d. process is at least $c$;
Sec. 12.E: the analogous comparison is established for the event that the relevant sample average is at most $\epsilon$;
Sec. 12.F: the comparison of Sec. 12.D is established for a third pair of processes.
$$\bar p(z) = \begin{cases} \bar\alpha(m) & \text{if } z = 1, \\ \bar\beta(m) & \text{if } z = 0, \\ \bar\gamma(m) & \text{if } z = -1. \end{cases}$$
(We have suppressed, for brevity of notation, the dependence of $\bar p(\cdot)$ on $m$.) Consider a sequence of independent random variables $\tilde V^m_1, \tilde V^m_2, \dots$ which, for $t = 1, 2, \dots$ and $z \in \{-1, 0, 1\}$, satisfy:
$$\Pr(\tilde V^m_t = z) = \bar p(z).$$
In other words, the $\tilde V^m_t$'s are a sequence of independent, identically distributed random variables. Finally, define the stochastic process $\tilde M^m_t$ by the following relationship:
$$\tilde M^m_t = \begin{cases} 0 & \text{if } \tilde V^m_t \in \{0, 1\}, \\ 1 & \text{if } \tilde V^m_t = -1. \end{cases}$$
1. $V_t$ defines an inhomogeneous random walk and the random variables $V_1, V_2, \dots$ are dependent;
2. $\tilde V^m_t$ defines a homogeneous random walk and the random variables $\tilde V^m_1, \tilde V^m_2, \dots$ are independent;
3. $\sum_{s=1}^{t} M^{21}_s$ counts the number of $V_s$ moves to the left; $\sum_{s=1}^{t} \tilde M^m_s$ counts the number of $\tilde V^m_s$ moves to the left;
4. If $X_t \ge m$ for all $t$ greater than some $t_0$, then the probability that $V_t$ takes a move to the left is no greater than $\bar\gamma(m)$; the probability that $\tilde V^m_t$ takes a move to the left is always $\bar\gamma(m)$.
The last observation is very useful. Our ultimate goal is to prove that when $X_t \to \infty$, the number of times that $V_t$ equals $-1$ will be small. More specifically, we want to show that $\frac{\sum_s M^{21}_s}{t} \to 0$. Because $V_1, V_2, \dots$ are dependent, it is difficult to analyze the behavior of $\frac{\sum_s M^{21}_s}{t}$. Because $\tilde V^m_1, \tilde V^m_2, \dots$ are independent, it is easier to analyze the behavior of $\frac{\sum_s \tilde M^m_s}{t}$. This is the reason for introducing the processes $\tilde V^m_t$, $\tilde M^m_t$.
In particular, it is easy to obtain the following useful lemma, which describes the behavior of the stochastic process $\tilde M^m_t$.
Lemma 12-12.D.2 If conditions B0, B1, B2 hold, then, for any m > m̂,
and for the associated process M̄_t^m we have (for any n ≥ 0, ε > 0, t ≥ 0)

Proof. Choose some m > m̂ and some n ≥ 0, ε > 0, t ≥ 0; consider these fixed
for the rest of the proof. Recall that the choice of m determines V̄_t^m, through
the probability p̄(·), and that V̄_t^m determines M̄_t^m. Now define L ≜ ε · t. We
have
(If L is not an integer, then "L times" should be taken to mean "⌈L⌉ times",
where ⌈L⌉ means the integer part of L plus one.)
Now, choose any x_0 ∈ Z and define the following conditions on sequences
(x_1, x_2, ..., x_t) ∈ Z^t.
C1 For s = 1, 2, ..., t we have x_s < x_{s−1} at least L times.
C2 For s = 1, 2, ..., t we have |x_s − x_{s−1}| ≤ 1.
C3 For s = 1, 2, ..., t we have x_s ≥ m.
1. A_m(x_0): the set of sequences that satisfy C1, C2, C3;

and

R(x_0) ≜ Pr(X_n = x_0 | ∀τ ≥ n, X_τ ≥ m).

It follows that

Σ_{x_0 ∈ Z} R(x_0) = 1.

Now define

Q ≜ Σ_{x_0 = m, m+1, ...} Q_t(x_0) · R(x_0).
CONVERGENCE OF SERIAL DATA ALLOCATION 225
Let us now bound Q_t(x_0); then it will be easy to bound Q as well. As can be
seen from eq. (12.D.3), Q_t(x_0) consists of a sum of products of q(·|·) terms. We
will follow the same replacement procedure as in Chapter 11, taking t steps,
at every step producing an expression which is greater than the previous one,
by replacing one of the q(·|·) terms by a p̄(·|·) term. Namely, at step 1 we
will replace q(x_t|x_{t−1}) by p̄(x_t|x_{t−1}); at step 2 we will replace q(x_{t−1}|x_{t−2}) by
p̄(x_{t−1}|x_{t−2}); and so on, until at the t-th step we obtain an expression which is
greater than Q_t(x_0) and consists entirely of p̄(·|·) terms.

Once again we need auxiliary sets. The set B̄ is defined in a manner similar
to that of Chapter 11. Define

B̄ ≜ {(y_1, ..., y_t) : y_s ∈ {−1, 0, 1} for s = 1, 2, ..., t, and the no. of −1's is at least L}.
Note that the sets A(x_0) and B̄ are in a one-to-one relationship: for x_0 fixed and
any (x_1, x_2, ..., x_t) ∈ A(x_0), a unique (y_1, y_2, ..., y_t) ∈ B̄ is defined by taking
y_s = x_s − x_{s−1} (s = 1, 2, ..., t); conversely, for x_0 fixed and any (y_1, y_2, ..., y_t) ∈ B̄,
x_s = x_{s−1} + y_s defines a unique (x_1, x_2, ..., x_t) ∈ A(x_0). Hence there are one-
to-one functions Y : A(x_0) → B̄ and X : B̄ → A(x_0), where X = Y^{−1}. Notice
also that B̄ is independent of x_0, i.e. for any x_0, x_0′ ∈ Z, we have Y(A(x_0)) =
Y(A(x_0′)). Now, define four sets as follows.
B̄_t^1 ≜ {(y_1, y_2, ..., y_t) ∈ B̄ and y_t = 1},
B̄_t^2 ≜ {(y_1, y_2, ..., y_t) ∈ B̄ and y_t = 0},
B̄_t^3 ≜ {(y_1, y_2, ..., y_t) ∈ B̄ and y_t = −1 and (y_1, y_2, ..., y_{t−1}, 1) ∈ B̄_t^1},
B̄_t^4 ≜ B̄ − B̄_t^1 − B̄_t^2 − B̄_t^3.
The sets B̄_t^i, i = 1, 2, 3, 4 partition the set B̄ for the same reasons as in Chapter
11. Also, the elements of B̄_t^1, B̄_t^2 and B̄_t^3 are in a one-to-one correspondence.
This is clear for the sets B̄_t^1 and B̄_t^3. Regarding the sets B̄_t^1 and B̄_t^2, note that
if some (y_1, y_2, ..., 1) ∈ B̄_t^1, and the no. of −1's is L′, then L′ ≥ L; clearly
none of the −1's can be in the t-th position. But the same remarks hold for the
sequence (y_1, y_2, ..., 0). This shows that (y_1, y_2, ..., 0) ∈ B̄_t^2; so we have shown
that for every (y_1, y_2, ..., 1) ∈ B̄_t^1, there is exactly one (y_1, y_2, ..., 0) ∈ B̄_t^2. The
argument can be reversed to show that for every (y_1, y_2, ..., 0) ∈ B̄_t^2, there is
exactly one (y_1, y_2, ..., 1) ∈ B̄_t^1. So B̄_t^1, B̄_t^2 are in a one-to-one correspondence;
since B̄_t^1, B̄_t^3 are also in a one-to-one correspondence, the same holds for B̄_t^2,
B̄_t^3.

Finally, by the same arguments as in Chapter 11, B̄_t^4 is the set of sequences
ending in −1 for which the total no. of −1's is exactly equal to L, and B̄_t^3 is
the set of sequences ending in −1 for which the no. of −1's is greater than L.
Let us now proceed to implement the first step of the replacement procedure.
Since the argument is the same as that used in Chapter 11, it is presented briefly.
We have
(12.D.4)
(12.D.5)
(12.D.6)
(12.D.7)
(12.D.8)
Now, each q(x_1|x_0) · q(x_2|x_1) · ... · q(x_t|x_{t−1}) term in the above expressions cor-
responds to a sequence (x_1, x_2, ..., x_t). Since sequences from B̄_t^1, B̄_t^2 and B̄_t^3 are
in a one-to-one correspondence, to every q(x_1|x_0) · q(x_2|x_1) · ... · q(x_{t−1} + 1|x_{t−1})
in expression (12.D.5) we can correspond exactly one q(x_1|x_0) · q(x_2|x_1) · ... ·
q(x_{t−1}|x_{t−1}) in expression (12.D.6) and exactly one q(x_1|x_0) · q(x_2|x_1) · ... ·
q(x_{t−1} − 1|x_{t−1}) in expression (12.D.7). Using the above facts, we can rewrite
expression (12.D.4) as
+ (12.D.10)
In expression (12.D.9), the terms in square brackets add to one, so they can
be replaced by [p̄(+1) + p̄(0) + p̄(−1)] (which also equals one) without altering
the value of the expression. Suppose now that in the expression (12.D.10)
each term in the sum is replaced by q(x_1|x_0) · ... · q(x_{t−1}|x_{t−2}) · p̄(x_t − x_{t−1}), i.e.
q(x_t|x_{t−1}) is replaced by p̄(x_t − x_{t−1}). Recall that for the sequences appearing
in this sum we have x_t = x_{t−1} − 1; hence

p̄(x_t − x_{t−1}) = p̄(x_{t−1} − 1 − x_{t−1}) = p̄(−1),        q(x_t|x_{t−1}) = q(x_{t−1} − 1|x_{t−1}).
+ (12.D.12)
+
(Xl ,X2 , ... ,Xt)EX (B:)
Recall that
Qt(xo) =
228
and
Qo(xo) =
Then, using exactly the same argument as in Appendix 11.C, it follows that
R(xo)'
xo=m,m+l, ...
(12.D.14)

Expression (12.D.14) is the probability that, for s = 1, 2, ..., t, M̄_s^m = 1 at least
L = ε · t times. In short, what has been proved is that

Pr( (Σ_{s=n+1}^{n+t} M_s^{21})/t ≥ ε | ∀τ ≥ n, X_τ ≥ m ) ≤ Pr( (Σ_{s=1}^t M̄_s^m)/t ≥ ε ).
Proof. This is proved in exactly the same way as Lemma 11-11.C.2 in Chapter
11 and so the proof is omitted .•
Note that α(m) + β(m) + γ(m) = 1, 0 ≤ γ(m) ≤ 1, 0 ≤ β(m) ≤ 1. Also note
that β(m) is monotonically decreasing with m, with lim_{m→∞} β(m) = π_2, and
γ(m) is monotonically decreasing with m, with lim_{m→∞} γ(m) = 0, and that
for all m we have

M̃_t^m = { 1  if Ṽ_t^m = 1,
          0  if Ṽ_t^m = 0 or Ṽ_t^m = −1.        (12.E.1)
Lemma 12-12.E.1 For any m > m̂, ∃t_m such that for all t ≥ t_m we have

Proof. We follow the usual method of proof. Choose some m > m̂ and some
n ≥ 0, ε > 0, t ≥ 0; consider these fixed for the rest of the proof. Set L = ε · t.
We have

Pr( (Σ_{s=n+1}^{n+t} M_s^{11})/t ≤ ε | ∀τ ≥ n, X_τ ≥ m )
= Pr( Σ_{s=n+1}^{n+t} M_s^{11} ≤ L | ∀τ ≥ n, X_τ ≥ m )
= Pr( X_s > X_{s−1} at most L times, for s = n+1, ..., n+t | ∀τ ≥ n, X_τ ≥ m ).
Now, for any x_0 ∈ Z, define the following conditions on strings x_1 x_2 ... x_t ∈ Z^t.
D1 For s = 1, 2, ..., t we have x_s > x_{s−1} at most L times.
D2 For s = 1, 2, ..., t we have |x_s − x_{s−1}| ≤ 1.
D3 For s = 1, 2, ..., t we have x_s ≥ m.
Taking into account the dependence on x_0, let us define
1. A_m(x_0): the set of sequences that satisfy D1, D2, D3;
Now, define
(12.E.2)
it follows that
xo=m,m+l, ...
As usual, our goal is to bound Q and we will achieve this through a replace-
ment procedure. We define some additional sets of sequences (y_1, y_2, ..., y_t) ∈
{−1, 0, 1}^t. We define

B̃ ≜ {(y_1, ..., y_t) : for s = 1, 2, ..., t we have y_s ∈ {−1, 0, 1}, no. of 1's ≤ L}.

As usual, the sets A(x_0) and B̃ are in a one-to-one correspondence: for x_0 fixed
and any (x_1, x_2, ..., x_t) ∈ A(x_0), a unique (y_1, y_2, ..., y_t) ∈ B̃ is defined by taking
y_s = x_s − x_{s−1} (s = 1, 2, ..., t); conversely, for x_0 fixed and any (y_1, y_2, ..., y_t) ∈ B̃,
x_s = x_{s−1} + y_s defines a unique (x_1, x_2, ..., x_t) ∈ A(x_0). Hence there are one-
to-one functions Y : A(x_0) → B̃ and X : B̃ → A(x_0), where X = Y^{−1}. Note
also that B̃ is independent of x_0, i.e. for any x_0, x_0′ ∈ Z, we have Y(A(x_0)) =
Y(A(x_0′)). Now, define five sets as follows.
B̃_t^5 ≜ {(y_1, y_2, ..., y_t) ∈ B̃ and y_t = −1 and (y_1, y_2, ..., y_{t−1}, 1) ∉ B̃}
By arguments similar to those of the previous section, it can be shown that the
sets B̃_t^i, i = 1, ..., 5 partition B̃. In addition, it can be shown that
1. B̃_t^1 is the set of B̃ sequences with y_t = 0 and no. of 1's equal to L, while
2. B̃_t^2 is the set of B̃ sequences with y_t = 0 and no. of 1's less than L;
similarly
3. B̃_t^3 is the set of B̃ sequences with y_t = −1 and no. of 1's equal to L, while
4. B̃_t^4 is the set of B̃ sequences with y_t = −1 and no. of 1's less than L.
It is also clear that the elements of B̃_t^1, B̃_t^3 and B̃_t^5 are in a one-to-one
correspondence.
Let us now proceed to implement the usual replacement procedure. We have
(12.E.3)
(12.E.4)
(12.E.5)
(12.E.6)
(12.E.7)
(12.E.8)
(12.E.9)
+
In expression (12.E.9), the terms in square brackets add to one, so they can
be replaced by [p̃(1) + p̃(0) + p̃(−1)] (which also equals one) without altering
the value of the expression. Suppose now that in the expression (12.E.10) each
term in the sum is replaced by q(x_1|x_0) · ... · q(x_{t−1}|x_{t−2}) · p̃(x_t − x_{t−1}), i.e.
q(x_t|x_{t−1}) is replaced by p̃(x_t − x_{t−1}). Recall that for the sequences appearing
in this sum we have x_t = x_{t−1}; hence

p̃(x_t − x_{t−1}) = p̃(0) = β(m)

and

q(x_{t−1}|x_{t−1}) = { p(x_{t−1}|x_{t−1}) = β(x_{t−1})  if x_{t−1} ≥ m,
                    p(m|m) = β(m)              if x_{t−1} < m.
+
234
and
Qo(xo) =
Then, using exactly the same argument as in Appendix 11.C, it follows that
R(xo)·
xo=m,m+l, ...
(12.E.13)
since Σ_{x_0 ∈ Z} R(x_0) = 1. Expression (12.E.12) is exactly
Pr( (Σ_{s=n+1}^{n+t} M_s^{11})/t ≤ ε | ∀τ ≥ n, X_τ ≥ m ), while expression (12.E.13) is the
probability that, for s = 1, 2, ..., t, M̃_s^m = 1 at most L = ε · t times. In short,
what has been proved is that

Pr( (Σ_{s=n+1}^{n+t} M_s^{11})/t ≤ ε | ∀τ ≥ n, X_τ ≥ m ) ≤ Pr( (Σ_{s=1}^t M̃_s^m)/t ≤ ε ).
Lemma 12-12.E.3 If conditions A0, A1, A2 hold, for all m ≥ M_m, and for
all n ∈ N, we have

Proof. This is proved in exactly the same way as Lemma 11-11.C.2 in Chapter
11 and so the proof is omitted. ∎
Note that ᾱ(m) + β̄(m) + γ̄(m) = 1, that 0 ≤ ᾱ(m) ≤ 1 and that 0 ≤ β̄(m) ≤ 1;
obviously, ᾱ(m), β̄(m), γ̄(m) are probabilities. Note also that ᾱ(m) is
decreasing with m, with lim_{m→∞} ᾱ(m) = π_1 and lim_{m→∞} β̄(m) = π_2; also, for
all m we have

ᾱ(m) ≥ α(m),    γ̄(m) ≤ γ(m).

Define p̄ as follows (for z ∈ {−1, 0, 1}):

p̄(z) = { ᾱ(m)  if z = 1,
         β̄(m)  if z = 0,
         γ̄(m)  if z = −1.

(We have suppressed, for brevity of notation, the dependence of p̄(·) on m.)
Consider a sequence of independent random variables V̄_1^m, V̄_2^m, ... which, for
t = 1, 2, ... and z ∈ {−1, 0, 1}, satisfy:

Pr(V̄_t^m = z) = p̄(z).

In other words, the V̄_t^m's are a sequence of independent, identically distributed
random variables. Finally, define the stochastic process M̄_t^m by the following
relationship:

M̄_t^m = { 0  if V̄_t^m = 0 or V̄_t^m = −1,
          1  if V̄_t^m = 1.                (12.F.1)
Because V_1, V_2, ... are dependent, while V̄_1^m, V̄_2^m, ... are independent, it is
easier to analyze the behavior of Σ_s M̄_s^m than that of Σ_s M_s^{11}.

In particular, it is easy to obtain the following useful lemma, which describes
the behavior of the stochastic process M̄_t^m.
Lemma 12-12.F.1 For any m > M_m, ∃t_m such that for all t ≥ t_m we have

Pr( (Σ_{s=1}^t M̄_s^m)/t ≥ ᾱ(m) + 1/m ) < 1/t².
Lemma 12-12.F.2 If conditions B0, B1, B2 hold, then for any m ∈ Z, and
for the associated process M̄_t^m we have (for any n ≥ 0, ε > 0, t ≥ 0)

Proof. Choose some m, n ≥ 0, ε > 0, t ≥ 0; consider these fixed for the rest of
the proof. Set L = ε · t. We have
Now, for any Xo E Z, define the following conditions on strings XIX2 ... Xt E zt.
Q ≜ Σ_{x_0 = m, m+1, ...} Q_t(x_0) · R(x_0).
As usual, the sets A(x_0) and B̄ are in a one-to-one correspondence, which can
be expressed by one-to-one functions Y : A(x_0) → B̄ and X : B̄ → A(x_0), where
X = Y^{−1} and B̄ is independent of x_0. Next, define four sets as follows.

B̄_t^1 ≜ {(y_1, y_2, ..., y_t) ∈ B̄ and y_t = −1},
B̄_t^2 ≜ {(y_1, y_2, ..., y_t) ∈ B̄ and y_t = 0},
B̄_t^3 ≜ {(y_1, y_2, ..., y_t) ∈ B̄ and y_t = 1 and (y_1, y_2, ..., y_{t−1}, −1) ∈ B̄_t^1},
B̄_t^4 ≜ B̄ − B̄_t^1 − B̄_t^2 − B̄_t^3.

As usual, the sets B̄_t^i, i = 1, 2, 3, 4 partition B̄ and the elements of B̄_t^1, B̄_t^2 and
B̄_t^3 are in a one-to-one correspondence. In addition (by the usual arguments)
B̄_t^3 is the set of B̄ sequences ending in 1 with no. of 1's > L, while B̄_t^4 is the set of
B̄ sequences ending in 1 with no. of 1's = L. Now let us proceed with the replacement
procedure.
(12.F.2)
In expression (12.F.7) the terms in the square bracket add up to one, so they can
be replaced by [p̄(+1) + p̄(0) + p̄(−1)] (which also equals one) without altering
the value of the expression. Suppose now that in expression (12.F.8) each
term in the sum is replaced by q(x_1|x_0) · ... · q(x_{t−1}|x_{t−2}) · p̄(x_{t−1} + 1 − x_{t−1}),
i.e. q(x_{t−1} + 1|x_{t−1}) is replaced by p̄(x_{t−1} + 1 − x_{t−1}) = p̄(1). By definition,
p̄(1) = ᾱ(m). We have ᾱ(m) ≥ q(x_{t−1} + 1|x_{t−1}), because: if x_{t−1} < m, then
q(x_{t−1} + 1|x_{t−1}) = α(m) ≤ ᾱ(m); whereas if x_{t−1} ≥ m, then q(x_{t−1} + 1|x_{t−1}) =
α(x_{t−1}) ≤ ᾱ(m). Hence, replacing all the q(x_{t−1} + 1|x_{t−1}) terms with
p̄(x_{t−1} + 1 − x_{t−1}) = p̄(x_t − x_{t−1}), the expression is not decreased and it follows
that Q_t(x_0) is no greater than
Recall that
Qt(xo) =
and define
and
Then, using exactly the same argument as in Appendix 11.C, it follows that
R(xo)·
xo=m,m+l, ...
(12.F.10)
Expression (12.F.10) is the probability that, for s = 1, 2, ..., t, M̄_s^m = 1 at least
L = ε · t times. In short, what has been proved is that

Pr( (Σ_{s=n+1}^{n+t} M_s^{11})/t ≥ ε | ∀τ ≥ n, X_τ ≥ m ) ≤ Pr( (Σ_{s=1}^t M̄_s^m)/t ≥ ε )

and the proof of the lemma is complete. ∎
Long Term Behavior of M^{11}

The previous lemma compared the behavior of M_t^{11} with that of M̄_t^m over
finite times. The next lemma tells us something about the behavior of M_t^{11} in
the long run (and without connection to M̄_t^m).

Lemma 12-12.F.3 If conditions A0, A1, A2 hold, for all m ∈ Z, and for
all n ∈ N, we have

Proof. This is proved exactly like Lemma 11-11.C.1 and so the proof is
omitted. ∎
Pr( lim_{t→∞} N_t^{11}/t = π_1 | lim_{τ→∞} X_τ = +∞ ) = 1;    (12.G.3)

Pr( lim_{t→∞} N_t^{12}/N_t^{11} = 0 | lim_{τ→∞} X_τ = +∞ ) = 1.    (12.G.6)

Pr( (Σ_{s=n+1}^{n+t} M_s^{21})/t < 2γ̄(m) a.a. | A_mn ) = 1 ⇒

Pr( ∃t′_nm : ∀t > t′_nm, Σ_{s=n+1}^{n+t} M_s^{21} < 2γ̄(m) · t | A_mn ) = 1 ⇒

Pr( ∃t′_nm : ∀t > t′_nm, Σ_{s=1}^{n+t} M_s^{21} < 2γ̄(m) · t + n | A_mn ) = 1 ⇒

Pr( ∃t′_nm : ∀t > t′_nm, (Σ_{s=1}^{n+t} M_s^{21})/(n+t) < 2γ̄(m) · t/(n+t) + n/(n+t) | A_mn ) = 1.

It follows that, for all m ≥ m̂ and for all n ≥ 0, and conditional on the event
A_mn, we have (with probability 1)

(Σ_{s=1}^{n+t} M_s^{21})/(n+t) < 2γ̄(m) · t/(n+t) + n/(n+t).
In addition, the following inequalities are true (obviously with probability 1
and for all m and n):

∀t: (t/(n+t)) · 2γ̄(m) < 2γ̄(m),    (12.G.7)

and, for all sufficiently large t, n/(n+t) < γ̄(m).    (12.G.8)
A_m ≜ ∪_{n=1}^∞ A_mn = ∪_{n=1}^∞ {∀τ ≥ n : X_τ ≥ m} = {X_τ ≥ m a.a.};

also, define

A ≜ ∩_{m=m̂}^∞ A_m = ∩_{m=m̂}^∞ {X_τ ≥ m a.a.} = {∀m ≥ m̂ : X_τ ≥ m a.a.}.
Note that, since lim_{m→∞} γ̄(m) = 0, and since N_t^{21} ≥ 0, we have
On the other hand, for a fixed m and for n < n′ we have B_mn ∩ A_mn ⊂ B_mn ∩ A_mn′.
Hence

(12.G.12)

Since Pr(B_mn ∩ A_mn) = Pr(A_mn) for all m > m̂ and n ≥ 0, it follows from
(12.G.10), (12.G.11) and (12.G.12) that for all m > m̂

(12.G.13)
For m < m′ we have: {X_t ≥ m′ a.a.} ⇒ {X_t ≥ m a.a.}, hence A_m′ ⊂ A_m
and B_m′ ⊂ B_m, since γ̄(m) decreases monotonically. It follows that

lim_{m→∞} Pr(A_m) = Pr( ∩_{m=m̂}^∞ A_m ) = Pr(A),    (12.G.14)

(12.G.15)
Pr( ∀m ≥ m̂ : N_t^{21}/t < 3γ̄(m) a.a. | ∀m ≥ m̂ : X_τ ≥ m a.a. ) = 1 ⇒

Pr( lim_{t→∞} N_t^{21}/t = 0 | lim_{τ→∞} X_τ = +∞ ) = 1.    (12.G.16)
Pr( lim_{t→∞} N_t^2/t = π_2 ) = 1;

Since by assumption Pr( lim_{τ→∞} X_τ = +∞ ) > 0, it follows by Lemma A.6 that

Pr( lim_{t→∞} N_t^2/t = π_2 | lim_{τ→∞} X_τ = +∞ ) = 1,    (12.G.17)

Pr( lim_{t→∞} N_t^{22}/t = π_2 | lim_{τ→∞} X_τ = +∞ ) = 1.
C_m ≜ { N_t^{11}/t ≥ ᾱ(m) − 2/m a.a. },

C ≜ ∩_{m=m̂}^∞ C_m = { ∀m ≥ m̂ : N_t^{11}/t ≥ ᾱ(m) − 2/m a.a. }.

Since lim_{m→∞} ᾱ(m) = π_1, it follows that
D_m ≜ { N_t^{11}/t ≤ ᾱ(m) + 2/m a.a. },

D ≜ ∩_{m=1}^∞ D_m = { ∀m ≥ 1 : N_t^{11}/t ≤ ᾱ(m) + 2/m a.a. }.

Since lim_{m→∞} ᾱ(m) = π_1, it follows that
D = { ∀m ∈ Z : N_t^{11}/t ≤ ᾱ(m) + 2/m a.a. } = { lim sup_{t→∞} N_t^{11}/t ≤ π_1 }.    (12.G.20)
Pr( lim inf_{t→∞} N_t^{11}/t ≥ π_1 and lim sup_{t→∞} N_t^{11}/t ≤ π_1 | lim_{τ→∞} X_τ = +∞ ) = 1 ⇒

Pr( lim_{t→∞} N_t^{11}/t = π_1 | lim_{τ→∞} X_τ = +∞ ) = 1
Pr( lim_{t→∞} N_t^{12}/t = 0 and lim_{t→∞} N_t^{11}/t = π_1 | lim_{τ→∞} X_τ = +∞ ) = 1 ⇒

Pr( lim_{t→∞} N_t^{12}/N_t^{11} = 0 | lim_{τ→∞} X_τ = +∞ ) = 1,
In this chapter we review the literature on modular and, more generally, multiple
models methods. We discuss a number of related approaches from the neural
network literature, as well as from the areas of statistical pattern recognition,
econometrics, statistics, fuzzy sets and control theory.
13.1 INTRODUCTION
samples ({y_{t−M}, y_{t−M+1}, ..., y_{t−1}}, y_t) can be considered as static patterns.
Hence there is considerable overlap between the methods used for static and
dynamic problems. So far in this book we have restricted ourselves to time
series problems. However, because of the considerable overlap between static
and dynamic methods, in this and the next chapter we will consider together
both the static and dynamic case.
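The windowing just described can be sketched in a few lines (the function name is ours):

```python
def make_patterns(series, M):
    """Turn a scalar time series into static (window, target) pairs
    for one-step-ahead prediction."""
    return [(series[t - M:t], series[t]) for t in range(M, len(series))]

y = [0, 1, 2, 3, 4, 5]
print(make_patterns(y, M=3))
# → [([0, 1, 2], 3), ([1, 2, 3], 4), ([2, 3, 4], 5)]
```

Each pair is a fixed-size pattern, which is exactly what lets static classification and regression methods be applied to dynamic problems.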
A large amount of work has been carried out on the subject of multiple network
architectures. A good starting point is two special issues of the journal
Connection Science: vol. 8, no. 3/4 is devoted to ensemble approaches
and vol. 9, no. 1 is devoted to modular approaches. The leading article in each
issue (Sharkey, 1996; Sharkey, 1997) discusses the difference between the two
approaches; according to Sharkey's definition, modular approaches are charac-
terized by the use of specialized networks, while ensemble approaches employ
nonspecialized networks. While Sharkey does not explicitly introduce the in-
terchangeability criterion, it more or less follows from her elaboration of the
above categorization.
Keeping in mind that the above distinction is fuzzy to a considerable extent,
we can still consider two broad categories of multiple neural networks systems:
those which employ task-specialized networks and those which employ ensem-
bles of networks which perform the same task.
1996; Saito and Nakano, 1996; Waterhouse and Robinson, 1996) are particularly
interesting.
An interesting reformulation of the mixtures of experts idea in terms of
"managers" relegating tasks to "sub-managers" appears in (Dayan and Hinton,
1993); other interesting points of view appear in (Dayan and Zemel, 1995;
Esteves and Nakano, 1995; Schaal and Atkeson, 1996; Xu, Hinton and Jordan,
1995).
Tree topologies can be fixed in advance, but one of the most useful characteristics
of tree networks is that they can grow as necessary during training (offline
or online). This property has been exploited both within the neural networks
community, as will be seen in the next section, and in other disciplines (see
Section 13.3.3).
13.2.4 ART

The ART networks of Carpenter and Grossberg also utilize and process several
prototypes concurrently, so they can be considered to belong in the category of
multiple models. Basic papers are (Carpenter and Grossberg, 1990; Carpenter,
Grossberg and Rosen, 1991; Carpenter, Grossberg and Reynolds, 1991). The
convergence properties of the ART networks have been examined thoroughly;
see for example (Georgiopoulos, Heileman and Huang, 1990).
and Dubes, 1988). Naturally, the basic algorithm can be applied recursively,
with cluster splitting, to provide tree-shaped hierarchical clustering.
The k-means algorithm and its variations are often used in the neural networks
context (Chinrungrueng, 1995). In particular, this approach is often used
for initialization of RBF networks (Moody and Darken, 1989).

It is clear that the k-means algorithm and its variations fall within the multiple
models context, with each cluster/centroid corresponding to one model.
In Chapter 10 we have remarked on a possible generalization of k-means, where
clustering is performed according to the degree of constraint satisfaction.
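As an illustration, here is a minimal sketch of the clustering step (plain Lloyd-iteration k-means on scalar data); it is not the specific variant used in any of the cited papers, and the data are synthetic:

```python
import random

def kmeans(points, k, iters=20):
    """Plain Lloyd-iteration k-means on scalar data (a minimal sketch of the
    clustering step often used to initialize RBF centers)."""
    centers = random.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # update step: each center moves to its cluster's mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

random.seed(1)
data = ([random.gauss(0, 0.5) for _ in range(100)]
        + [random.gauss(10, 0.5) for _ in range(100)])
centers = sorted(kmeans(data, 2))
print(centers)  # one center near each cluster: one near 0, one near 10
```

In the RBF-initialization reading, each final center would become the location of one radial basis unit, i.e. one local model.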
While decision trees are usually treated separately from classification and regres-
sion trees (CART), they are rather similar. In both cases some incoming datum
must be processed by one of several available models; the appropriate model
is chosen by traversing a tree where a decision (selection of a model subset)
is taken at every node. This results in successive refinement of the candidate
models set, until a single model is selected.
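The node-by-node refinement can be sketched as follows; the tree, the features and the model names are invented for illustration.

```python
# A sketch of model selection by tree traversal; thresholds, features
# and model names below are hypothetical.
def make_leaf(model):
    return {"model": model}

def make_node(feature, threshold, left, right):
    return {"feature": feature, "threshold": threshold,
            "left": left, "right": right}

def select_model(tree, x):
    """Walk from the root to a leaf; each decision refines the candidate
    model set until a single model remains."""
    node = tree
    while "model" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["model"]

# a two-level tree over a two-feature datum
tree = make_node(0, 0.5,
                 make_leaf("model-A"),
                 make_node(1, 1.0, make_leaf("model-B"), make_leaf("model-C")))
print(select_model(tree, [0.9, 2.3]))  # → model-C
```

The traversal visits at most one path of the tree, which is what makes the tree an efficient organization of a large model bank.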
This approach has been used in the neural networks context, as already re-
marked (Section 13.2.1). Similar work has appeared in the context of statistical
pattern recognition, as well as statistics proper. For instance an early example
of a regression tree is presented in (Friedman, 1979). The seminal work on
classification and regression trees is (Breiman, Friedman, Olshen and Stone,
1984). An interesting recent paper is (Brodley and Utgoff, 1995). For decision
trees, see (Quinlan, 1993).
We have already remarked that trees are essentially a device for organizing
multiple models in an efficient manner. These models can be simple or rel-
atively complex (e.g. neural networks). Regarding the construction of trees,
many methods are available. (Breiman, Friedman, Olshen and Stone, 1984)
gives a very complete exposition for supervised construction of CARTs. For
a more recent review see (Buntine, 1994a); also see (Chaudhuri, Huang, Loh
and Yao, 1994; Chaudhuri, Lo, Loh and Yangi, 1995) and, for an application
to time series prediction (Farmer and Sidorowich, 1988). For decision trees,
see (Quinlan, 1986; Quinlan, 1993); a recent review appears in (Breslow and
Aha, 1997). In addition the methods presented in Section 13.2.2 (for neural
networks combination) also apply here. Finally, for the sake of completeness,
note that some useful techniques for pruning trees can be found in (Kubat and
Flotzinger, 1995).
Great effort has been expended for the theoretical analysis of the properties
of trees. Here we only list a few samples of such work (Breiman, 1996c; Ehren-
feucht and Haussler, 1989; Helmbold and Schapire, 1995; Quinlan and Rivest,
1989).
In a certain sense all fuzzy rule systems can be considered as multiple mod-
els systems: one rule corresponds to one model. The analogy becomes more
obvious if we consider the usual implementation of fuzzy rules by radial basis
functions; in this case all the remarks of Section 13.2.1 apply here as well. Sev-
eral papers can be cited which discuss this point of view; consider, for example,
(Hunt and Brown, 1995; Hunt, Haas and Murray-Smith, 1996; Jang and Sun,
1993; Kim and Mendel, 1995).
BIBLIOGRAPHIC REMARKS 261
then possible to obtain recursive equations which describe the evolution of this
mixture. An early example of this idea appears in (Srinivasan, 1969). Bucy
uses a similar but perhaps more general idea in (Bucy, 1969; Bucy and Senne,
1971). This approach is related to the mixtures of experts discussed in Section
13.2.1 and can be considered a multiple models method for the same reasons.
We pay special attention to Lainiotis' work because of his prolific output and
also because it influenced to a considerable extent the development of our own
methods. Lainiotis originally presented his idea in a pattern recognition con-
text (Hancock and Lainiotis, 1965; Hilborn and Lainiotis, 1968; Hilborn and
lainiotis, 1969a; Hilborn and Lainiotis, 1969b; Lainiotis, 1970). In all of these
cases Lainiotis essentially considered the problem of classifying a time series
generated by an unobservable source. He first applied his results to a control
theoretic problem in (Sims, Lainiotis and Magill, 1969), which was a rsponse
to Magill's 1965 paper. Finally, (Lainiotis, 1971b) presented the essentials of
a methodology to treat the problems of state and parameter estimation and
control of a system with unknown parameters. This methodology depended
crucially on the use of multiple models, namely a bank of Kalman filters, each
filter being tuned to one of the candidate parameter values of the actual sys-
tem. Later contributions in the control theory context include (Lainiotis, 1971a;
Lainiotis, 1971b; Henderson and Lainiotis, 1972; Park and Lainiotis, 1972;
Lainiotis and Deshpande and Upadhyay, 1972; Lainiotis, 1973); this is just a
small sample of the great number of Lainiotis' papers. The theory is presented
in comprehensive form in the collection (Lainiotis, 1974a) which includes the
papers (Lainiotis, 1974b; Lainiotis, 1974c; Lainiotis, 1974d). Later contribu-
tions include (Petridis, 1981; Lainiotis and Likothanasis, 1987), an application
to seismic signals (Lainiotis, Katsikas and Likothanasis, 1988) and, recently,
applications related to neural networks problems (Lainiotis and Plataniotis,
1994a; Lainiotis and Plataniotis, 1994b; Lainiotis and Plataniotis, 1994c).
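The core recursion shared by these partition-type methods — a bank of candidate models, each producing a prediction, with posterior probabilities updated from the prediction errors — can be sketched generically. In this illustration simple Gaussian AR(1) predictors stand in for a bank of Kalman filters, and all numerical values are invented:

```python
import math
import random

def multi_model_posteriors(ys, thetas, sigma=0.5):
    """Bayesian posterior over candidate models theta, updated recursively
    from one-step prediction errors (a generic sketch of the partition-type
    recursion, not any specific published algorithm)."""
    post = [1.0 / len(thetas)] * len(thetas)  # uniform prior over the bank
    y_prev = ys[0]
    for y in ys[1:]:
        # likelihood of the new observation under each candidate model
        likes = [math.exp(-(y - th * y_prev) ** 2 / (2 * sigma ** 2))
                 for th in thetas]
        post = [p * l for p, l in zip(post, likes)]
        s = sum(post)
        post = [p / s for p in post]          # Bayes normalization
        y_prev = y
    return post

# data generated by an AR(1) source with theta = 0.9; the bank holds three candidates
random.seed(2)
ys, y = [], 1.0
for _ in range(200):
    y = 0.9 * y + random.gauss(0, 0.5)
    ys.append(y)

post = multi_model_posteriors(ys, thetas=[0.2, 0.5, 0.9])
print(post)  # the posterior should concentrate on the true model theta = 0.9
```

Replacing the scalar predictors with Kalman filters tuned to candidate parameter values recovers the structure of the filter-bank methods discussed above.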
Several versions of the multiple models idea have appeared in the control liter-
ature of the last twenty-five years; for instance see the book (Mariton, 1990).
A classical applications-oriented paper in this direction is (Athans et al., 1977);
(Kashyap, 1977) is also relevant. Theoretical analysis appears in (Tugnait
and Haddad, 1980; Greene and Willsky, 1980; Anderson, 1985) among other
places. More recent developments in the use of multiple models are described in
(Caputi and Moose, 1995; Kulkarni and Ramadge, 1996; Murray-Smith, 1994;
Murray-Smith and Gollee, 1994; Narendra and Balakrishnan, 1994; Narendra,
Balakrishnan and Ciliz, 1995; Narendra and Balakrishnan, 1997; Pottman,
Unbehauen and Seborg, 1993; Skepstedt, Ljung and Millnert, 1992; Xiaorong
and Bar-Shalom, 1996). A recent book-length treatment of multiple models
et al., 1994; Tan et al., 1995). The approach we have presented in Chapter 9
in connection to the waste water treatment plant is also relevant.
13.7 STATISTICS
Most of the work we have discussed in the previous sections could equally be
classified as statistical procedures. There are however two important statistical
methodologies which remain to be discussed: Hidden Markov Models (HMM)
and Graphical Models.
A powerful set of training methods is available for training graphical models; these
include the EM algorithm and variational methods (Jordan, Ghahramani and
Saul, 1998).
It is of particular interest that a number of papers document the relation-
ships between graphical models and neural networks, notably (Ghahramani
and Jordan, 1997; Ghahramani, 1998; Hofman and Tresp, 1995; Jordan, 1994;
Jordan, Ghahramani and Saul, 1997; Meila and Jordan, 1997; Neal, 1992;
Waterhouse and Robinson, 1995). Finally, a good overview of learning methods
appears in (Buntine, 1995; Buntine, 1996).
14 EPILOGUE
We have presented the PREMONN family of algorithms for time series classifi-
cation, prediction and identification. The PREMONN algorithms are modular,
in the sense that they concurrently employ a number of time series models,
each of which may be modified or removed from the PREMONN system with-
out affecting the remaining modules. Hence PREMONNs belong to the larger
family of modular or multiple models algorithms which, as we have seen in the
previous chapter, have a long and successful history in various disciplines.
We believe the success of multiple models methods is due to the employment
of two important components. The first component is, quite obviously, the use
of multiple models, which may be considered as an implementation of the fun-
damental problem solving approach of divide-and-conquer. The advantages of
this approach are so well understood that there is no need for further comments
here. However, this is not the whole story. The use of multiple models in itself
would be ineffective if there were not an organizing framework within which the
multiple models can be employed to advantage. Regarding the choice of such
a framework there is considerable diversity of opinion; hence the plethora of
approaches and algorithms which is evident from the bibliographic references
of the previous chapter.
In our view, graphical models provide a good framework for reconciling most,
if not all, of the approaches which we have discussed. The operation of multiple
models or modules is organized along the edges of a graph, which delineates
the flow of information and computation. Classification and prediction can be
1. We are interested in obtaining a more rigorous convergence proof for the case
of many sources and / or predictors; the arguments presented in Chapters
11 and 12 can be considered as heuristic. A more rigorous treatment may
require more delicate tools.
3. We believe that our convergence conditions are sensible but hard to verify
for a practical problem, since they depend on the combination of active
sources, training algorithm and error threshold. We would like to develop
more applicable conditions. In addition we would like to understand the
existing ones better.
4. To this goal it will probably be useful to consider issues relating to the com-
plexity of the sources we want to identify and the capacity of the models we
employ. Concepts and tools from PAC, complexity and information theory
will probably be useful.
5. Finally, on a more applied note, we would like to further compare the perfor-
mance of serial and parallel data allocation algorithms and also to examine
the potential advantages of hybrid data allocation.
A.1 NOTATION
Here are a few notational conventions which we use throughout the book.
4. We will often make use of the indicator function, denoted by 1(A), where
A is some event (see also the next section). When the event is true, the
indicator function takes the value one; otherwise it is zero. More formally,

Θ^N = Θ × Θ × ... × Θ    (N times).

For instance, R^N denotes the set of N-tuples of real numbers (x_1, x_2, ..., x_N),
x_n ∈ R for n = 1, 2, ..., N.
A.2.1 Fundamentals

We will use the standard setup of probability theory. Our exposition is brief;
for more details the reader is referred to (Billingsley, 1986).

We start with a probability space (Ω, F, P). Here Ω is the universal set. F
is a sigma-field in Ω, i.e. a collection of subsets of Ω which contains Ω and
is closed under complements and countable unions. P is a probability measure
defined on elements of F, i.e. a set function P : F → [0, 1], which satisfies
P(∅) = 0, P(Ω) = 1 and is countably additive.
Random variables are P-measurable functions X(ω) of the form

X : Ω → Φ,

where Φ is an appropriate range. Stochastic processes are sequences of random
variables, for instance: X_0(ω), X_1(ω), X_2(ω), .... Following standard usage
(for reasons of brevity) we usually omit denoting the dependence on ω, writing
a random variable as X, and a stochastic process as X_0, X_1, ....

Events are simply elements of F, i.e. P-measurable sets. For instance we
may talk of the event that X_t < 1 and write something like

A = {X_t < 1};

this actually is a shorthand for

A = {ω such that X_t(ω) < 1}.

On many occasions we will consider the probability of an event A under the
probability measure P. This is denoted as P(A) or, more generically, as Pr(A)
(if it is clear from the context which measure P is referred to).

By fixing a particular point ω_0 ∈ Ω, we obtain X_0(ω_0), X_1(ω_0), X_2(ω_0), ...,
which is a sample path or realization of the stochastic process X_0(ω), X_1(ω),
X_2(ω), ....
Next we define stationary and ergodic stochastic processes.
Definition 2 A stochastic process X_0, X_1, X_2, ... is called stationary if for
every t ∈ N and for every set A ∈ F^∞ we have

Pr([X_0, X_1, ...] ∈ A) = Pr([X_{t+1}, X_{t+2}, ...] ∈ A).

Consider some set Φ and the set Φ^∞, i.e. the set of infinite sequences
from Φ; finally take a set A ⊂ Φ^∞. A is called shift-invariant if for every
(φ_0, φ_1, φ_2, ...) ∈ A we have

(φ_0, φ_1, φ_2, ...) ∈ A ⇔ (φ_1, φ_2, ...) ∈ A.
APPENDIX A: MATHEMATICAL CONCEPTS 273
Definition 3 A stochastic process X_0, X_1, X_2, ... is called ergodic if for every
shift-invariant set A, we have that Pr([X_0, X_1, X_2, ...] ∈ A) is equal to either 0
or 1.
A.2.2 Limits
Lemma A.2 (i) If for all n we have D_{n+1} ⊂ D_n then lim_{n→∞} Pr(D_n) = Pr(D),
where D = ∩_{n=1}^∞ D_n.
(ii) If for all n we have E_{n+1} ⊃ E_n then lim_{n→∞} Pr(E_n) = Pr(E), where
E = ∪_{n=1}^∞ E_n.
Proof. Part (i) is proved in (Royden, 1968). This is essentially the Monotone
Convergence Theorem (which will be stated more generally in the next section)
applied to the indicator functions fn(w) =l(w E D n ), which have as limit the
indicator function f(w) =l(w E D). Part (i) can then be used to prove part
(ii). Consider the sets Dn = E~, n = 1,2,.... Then Dn+l C Dn and so
lim Pr(D n ) = Pr(D), where D =
n~CXJ
n
Dn. Next it is shown that D = E C (and
n=l
hence Pr(E) = 1- Pr(D)). Indeed:
1. If xED, then x E Dn. for all n. This means that x E E~. for all nand
so x rf- En for any n and so x rf- n~l En. So, x E
DC EC.
C~l En r = EC. Hence,
2. If, on the other hand, x E E C = C~l En) C , then x rf- En for any n and so
= E~, for every n. =n=l
00
xED In other words, xED n Dn. Hence E C cD.
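A quick numerical illustration of part (i) (our addition, not from the text): for U uniform on [0, 1), the events D_n = {U < 1/n} decrease, their intersection is empty, and Monte Carlo estimates of Pr(D_n) shrink toward zero accordingly.

```python
import random

rng = random.Random(0)
samples = [rng.random() for _ in range(100_000)]  # draws of U, uniform on [0, 1)

def pr(event):
    """Monte Carlo estimate of Pr(event) over the fixed sample set."""
    return sum(1 for u in samples if event(u)) / len(samples)

# D_n = {U < 1/n} satisfies D_{n+1} subset of D_n, and the intersection over
# all n is empty on [0, 1); Lemma A.2(i) then says Pr(D_n) -> 0.
probs = [pr(lambda u, n=n: u < 1.0 / n) for n in (1, 10, 100, 1000)]
```

Because the same sample set is reused, the estimates are automatically monotone, mirroring the nesting of the events.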
Definition 4 Given a sequence of events A_1, A_2, ..., the event A = "A_n occurs
infinitely often" ("i.o.") is defined to be A = ∩_{n=1}^∞ ∪_{k=n}^∞ A_k. We also write
A = {A_n i.o.}. Another way to describe "i.o." is this: A = {ω : ∀n ∃k_n ≥ n such that
ω ∈ A_{k_n}}. Note that "A_n occurs infinitely often" does not depend on n.
Definition 5 Given a sequence of events A_1, A_2, ..., the event A = "A_n occurs
almost always" ("a.a.") is defined to be A = ∪_{n=1}^∞ ∩_{k=n}^∞ A_k. We also write
A = {A_n a.a.}. Another way to describe "a.a." is this: A = {ω : ∃n_0 such that ∀n ≥ n_0,
ω ∈ A_n}. Note that "A_n occurs almost always" does not depend on n.
Lemma A.3 The negation of "infinitely often" is "almost always", i.e. {A_n
i.o.}^c = {A_n^c a.a.}.

Proof. Take a sequence of events A_1, A_2, ... ∈ F. First we show that:

∩_{n=1}^∞ ∪_{k=n}^∞ A_k^c ⊂ (∪_{n=1}^∞ ∩_{k=n}^∞ A_k)^c    (A.1)

and then the reverse inclusion

(∪_{n=1}^∞ ∩_{k=n}^∞ A_k)^c ⊂ ∩_{n=1}^∞ ∪_{k=n}^∞ A_k^c.    (A.2)

Together, (A.1) and (A.2) give ({A_n a.a.})^c = {A_n^c i.o.}, which (applied to the
sequence A_1^c, A_2^c, ...) is equivalent to the statement of the lemma. To see (A.2),
consider

ω ∈ (∪_{n=1}^∞ ∩_{k=n}^∞ A_k)^c ⇒
ω ∉ ∪_{n=1}^∞ ∩_{k=n}^∞ A_k ⇒
there is no n_0 such that ∀n ≥ n_0, ω ∈ A_n ⇒
∀n ∃k_n ≥ n such that ω ∉ A_{k_n} ⇒
∀n ∃k_n ≥ n such that ω ∈ A_{k_n}^c ⇒
ω ∈ ∩_{n=1}^∞ ∪_{k=n}^∞ A_k^c;

reading the implications in the reverse direction (each one is in fact an equivalence)
gives (A.1), and so we have proved eq. (A.2) and the proof of the lemma is finished. ∎
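Lemma A.3 can be sanity-checked mechanically when membership of a fixed ω in A_n is periodic in n, since the tail events then depend only on one period. The Python sketch below is our construction (the event family A_n = {ω : ω ≥ n mod 4} is an arbitrary example); it verifies both forms of the duality on a six-point sample space.

```python
OMEGA = set(range(6))
PERIOD = 4

def in_A(w, n):
    # Example event family: A_n = {w : w >= n mod 4}.
    # Membership is periodic in n, so "i.o." and "a.a." reduce to one period.
    return w >= n % 4

def io(member):
    # {A_n i.o.}: w belongs to A_n for infinitely many n.
    return {w for w in OMEGA if any(member(w, n) for n in range(PERIOD))}

def aa(member):
    # {A_n a.a.}: w belongs to A_n for all sufficiently large n.
    return {w for w in OMEGA if all(member(w, n) for n in range(PERIOD))}

def complement(s):
    return OMEGA - s

# Lemma A.3: {A_n i.o.}^c = {A_n^c a.a.}, and the symmetric identity for a.a.
lhs_1, rhs_1 = complement(io(in_A)), aa(lambda w, n: not in_A(w, n))
lhs_2, rhs_2 = complement(aa(in_A)), io(lambda w, n: not in_A(w, n))
```

Here {A_n a.a.} works out to {3, 4, 5} (those ω belong to every A_n), while {A_n i.o.} is all of Ω.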
We will use the Borel-Cantelli Lemma several times in connection to events
occurring infinitely often or almost always. This Lemma is stated as follows.

Lemma A.4 (Borel-Cantelli) If ∑_n Pr(A_n) < ∞, then Pr(A_n i.o.) = 0.

Proof. It appears in (Billingsley, 1968, pp. 53-55). ∎
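The Borel-Cantelli Lemma can be seen at work in simulation (an illustration of ours): with independent events A_n = {U_n < 1/n²}, the probabilities are summable, so along almost every run the occurrences die out quickly.

```python
import random

rng = random.Random(42)
RUNS, N = 2_000, 1_000

late_hit_runs = 0
for _ in range(RUNS):
    # A_n = {U_n < 1/n^2}: the sum of Pr(A_n) converges, so by Borel-Cantelli
    # Pr(A_n i.o.) = 0. In a finite simulation, occurrences past n = 100
    # should already be rare (their total probability is about 0.01).
    hits = [n for n in range(1, N + 1) if rng.random() < 1.0 / n**2]
    if hits and hits[-1] > 100:
        late_hit_runs += 1

fraction_late = late_hit_runs / RUNS
```

The fraction of runs with any "late" occurrence should be on the order of one percent.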
It follows immediately from the above definition that for a random variable
X taking values in R we have

Pr(X ≤ x) = ∫_{-∞}^x d_X(y) dy.

If X has a probability density, then certainly it does not take values in a
countable set. Conversely, a countable-valued random variable does not have a
probability density in the sense of the above definition. However, it is convenient
to define for a countable-valued random variable X the following quantity

d_X(x) ≜ Pr(X = x).    (A.4)

We use the same symbol d_X(x) because the quantities defined in eqs. (A.3),
(A.4) are, in a sense, analogous.
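The two uses of the symbol d_X can be contrasted concretely (our example, using an exponential density and a fair die): in the continuous case Pr(X ≤ x) is recovered by integrating d_X, while in the countable case d_X is a probability mass function and sums to one over the possible values.

```python
import math

def d_X(y):
    # Continuous case: an exponential probability density (rate 1).
    return math.exp(-y) if y >= 0 else 0.0

def cdf_by_integration(x, steps=10_000):
    # Pr(X <= x) = integral of d_X from -infinity to x; the density
    # vanishes below 0, so the trapezoid rule on [0, x] suffices.
    h = x / steps
    ys = [d_X(i * h) for i in range(steps + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

closed_form = 1.0 - math.exp(-2.0)      # Pr(X <= 2) in closed form

# Countable case: d_X(x) = Pr(X = x) for a fair die.
die = {k: 1.0 / 6.0 for k in range(1, 7)}
```

The numerical integral agrees with the closed-form CDF to high accuracy, and the discrete "density" sums to one.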
A.2.4 Expectation

The expectation of a random variable X is defined as

E(X) ≜ ∫_Ω X(ω) dP(ω).

It should be pointed out that in case Ω is countable, the expectation reduces
to a sum:

E(X) = ∑_{ω∈Ω} X(ω) P(ω).
We also define the variance of a random variable X as follows:

Var(X) ≜ E((X - E(X))²).
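For a countable (here finite) Ω the expectation and variance become finite sums, which can be computed exactly; the fair-die example below is ours.

```python
from fractions import Fraction

# A fair die: Omega = {1, ..., 6}, X(w) = w, P(w) = 1/6 for every w.
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}

E_X = sum(w * P[w] for w in omega)                 # E(X) = sum_w X(w) P(w)
Var_X = sum((w - E_X) ** 2 * P[w] for w in omega)  # Var(X) = E((X - E(X))^2)
```

Exact rational arithmetic gives E(X) = 7/2 and Var(X) = 35/12.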
Theorem A.5 (Bounded Convergence) If X_0, X_1, ... and X are random
variables and there is a constant M such that with probability one and for
n = 0, 1, ... we have |X_n| < M, then

lim_{n→∞} X_n = X  ⇒  lim_{n→∞} E(X_n) = E(X).
A.2.5 Conditioning

The reader is probably familiar with the notion of conditional probability,
defined as follows: for two events A and B with Pr(A) > 0,

Pr(B|A) ≜ Pr(B ∩ A) / Pr(A).

The following lemma will be useful in the proof of Theorems 11.2, 12.2.

Lemma A.6 Consider two sets A and B such that Pr(A) = a > 0 and
Pr(B) = 1. Then we have

Pr(B|A) = 1.

Proof. By definition, Pr(B|A) = Pr(B ∩ A)/Pr(A). It is clear that

(B ∩ A) ∪ (B^c ∩ A) = A,

and that

(B ∩ A) ∩ (B^c ∩ A) = ∅.

It then follows that

Pr(B ∩ A) = Pr(A) - Pr(B^c ∩ A).    (A.6)

Moreover,

Pr(B^c ∩ A) ≤ Pr(B^c) = 0    (A.7)

(since Pr(B) = 1). Hence from eqs. (A.6) and (A.7) we have that

Pr(B ∩ A) = Pr(A) ⇒ Pr(B|A) = Pr(B ∩ A)/Pr(A) = 1. ∎
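Lemma A.6 is easy to verify by exact enumeration on a small sample space (our construction: a three-point Ω where one point carries probability zero, so that Pr(B) = 1 without B being all of Ω).

```python
from fractions import Fraction

# Three-point sample space; w3 has probability zero.
P = {'w1': Fraction(1, 2), 'w2': Fraction(1, 2), 'w3': Fraction(0)}

def pr(event):
    return sum(P[w] for w in event)

A = {'w1', 'w3'}   # Pr(A) = 1/2 > 0
B = {'w1', 'w2'}   # Pr(B) = 1, although B is not all of Omega

# Pr(B | A) = Pr(B n A) / Pr(A), which Lemma A.6 says must equal 1.
pr_B_given_A = pr(B & A) / pr(A)
```

Note that B ∩ A = {w1} has probability 1/2 = Pr(A), exactly as the proof of the lemma predicts.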
Consider now two random variables X and Y; for sets A, B the events

{X ∈ A}, {Y ∈ B}

are F-measurable. Furthermore, we can define a sigma field G ≜
{{Y ∈ B}, {Y ∈ B}^c, Ω, ∅}. Then it makes sense to talk about the conditional
probability of X given Y, denoted by Pr(X|Y). We say that two random
variables are dependent if they are not independent. We can also define a
conditional probability density:

d_X(x|Y ∈ B) ≜ (d/dx) F_X(x|Y ∈ B).

In particular, we will be interested in the case when the set B is {y}. Then
we have

d_X(x|Y = y) ≜ (d/dx) F_X(x|Y = y).

We will sometimes use in place of d_X(x|Y = y) a shorter (and somewhat
abusive) notation, writing the conditional probability density of X, given Y = y,
as d_X(x|Y); the meaning should be clear from the context.
Theorem A.7 (Strong Law of Large Numbers) If X_0, X_1, ... are inde-
pendent and identically distributed and E(|X_0|) < ∞, then

Pr ( lim_{n→∞} (X_0 + X_1 + ... + X_{n-1})/n = E(X_0) ) = 1

and, for any measurable function f(·, ·, ...) such that E(|f(X_0, X_1, ...)|) < ∞,
we also have

Pr ( lim_{n→∞} (f(X_0, X_1, ...) + ... + f(X_n, X_{n+1}, ...))/n = E(f(X_0, X_1, ...)) ) = 1.
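A simulation sketch of the Strong Law (ours): for i.i.d. uniform variables the sample average approaches E(X_0) = 1/2, and the average of the shifted products f(X_n, X_{n+1}) = X_n · X_{n+1} approaches E(X_0 X_1) = 1/4, illustrating the second statement with a function f of (the first two coordinates of) the sequence.

```python
import random

rng = random.Random(7)
xs = [rng.random() for _ in range(200_000)]   # i.i.d. uniform on [0, 1)

# Plain average: should approach E(X_0) = 1/2 along the sample path.
running_avg = sum(xs) / len(xs)

# f(x_0, x_1, ...) = x_0 * x_1: the averaged shifts should approach
# E(f(X_0, X_1, ...)) = E(X_0) E(X_1) = 1/4.
pair_avg = sum(xs[i] * xs[i + 1] for i in range(len(xs) - 1)) / (len(xs) - 1)
```

With 200,000 draws both averages land within a fraction of a percent of their limits.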
and

E(X_0 + X_1 + ... + X_{n-1}) = nγ,   Var(X_0 + X_1 + ... + X_{n-1}) = nαγ.

Hence, rather than studying the behavior of (X_0 + X_1 + ... + X_{n-1})/n, we will
consider instead the normalized average

(X_0 + X_1 + ... + X_{n-1} - nγ) / √(nαγ),

which has expectation equal to zero and variance equal to one. For this nor-
malized average we can prove the following.
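The normalization can be checked empirically (a sketch of ours, with γ = 0.3 chosen arbitrarily): across many independent runs, the normalized average should have mean near zero and variance near one.

```python
import random

rng = random.Random(3)
gamma = 0.3                  # Pr(X_t = 1); alpha = Pr(X_t = 0) = 1 - gamma
alpha = 1.0 - gamma
n, runs = 500, 5_000

def normalized_average():
    s = sum(1 if rng.random() < gamma else 0 for _ in range(n))
    # (X_0 + ... + X_{n-1} - n*gamma) / sqrt(n * alpha * gamma)
    return (s - n * gamma) / (n * alpha * gamma) ** 0.5

vals = [normalized_average() for _ in range(runs)]
mean = sum(vals) / runs
var = sum((v - mean) ** 2 for v in vals) / runs
```

By the central limit theorem these normalized averages are in fact approximately standard normal, which is what makes the normalization the natural one.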
Theorem A.9 If X_0, X_1, ... is a sequence of Bernoulli trials with Pr(X_t =
0) = α and Pr(X_t = 1) = γ, and δ is any number greater than one, then we
have

∑_{n=1}^∞ ...

Proof. The theorem is stated and proved (in a somewhat more general form)
in (Billingsley, 1986). ∎
In view of the above theorem we can characterize the entire Markov chain
(provided it is irreducible) as persistent or transient.
References
Carpenter, G. A., Grossberg, S., and Rosen, D. (1991). ART 2-A: An adap-
tive resonance algorithm for rapid category learning and recognition. Neural
Networks, 4:493-504.
Castellano, G., Fanelli, A. M., and Pelillo, M. (1997). An iterative pruning
algorithm for feedforward neural networks. IEEE Trans. on Neural Networks,
8:519-531.
Catfolis, T. and Meert, K. (1997). Hybridization and specialization of real time
recurrent learning based networks. Connection Science, 9:51-69.
Chaer, W. S., Bishop, R. H., and Ghosh, J. (1997). A mixture-of-experts frame-
work for adaptive Kalman filtering. IEEE Trans. on Systems, Man and Cy-
bernetics, Part B, 27:452-464.
Chaudhuri, P., Huang, M. C., Loh, W. Y., and Yao, R. (1994). Piecewise-
polynomial regression trees. Statistica Sinica, 4:143-167.
Chaudhuri, P., Lo, W. D., Loh, W. Y., and Yang, C. C. (1995). Generalized
regression trees. Statistica Sinica, 5:641-666.
Chee, P. L. and Harrison, R. F. (1997). An incremental adaptive network for
on-line supervised learning and probability estimation. Neural Networks,
10:925-939.
Giles, C. L. et al. (1995). Constructive learning of recurrent neural networks:
Limitations of recurrent cascade correlation and a simple solution. IEEE
Trans. on Neural Networks, 6:829-836.
Chen, K., Wang, L., and Chi, H. (1997). Methods of combining multiple classi-
fiers with different features and their applications to text-independent speaker
identification. International Journal of Pattern Recognition and Artificial In-
telligence, 11:417-445.
Chen, S., Cowan, C. F. N., and Grant, P. M. (1991). Orthogonal least squares
learning algorithm for radial basis function networks. IEEE Trans. on Neural
Networks, 2:302-309.
Chen, S., Yu, D., and Moghaddamjo, A. (1992). Weather sensitive short-term
load forecasting using nonfully connected artificial neural network. IEEE
Trans. on Power Systems, 7:1098-1105.
Chen, T. and Chen, H. (1995). Approximation capability to functions of several
variables nonlinear functionals and operators by radial basis function neural
networks. IEEE Trans. on Neural Networks, 6:904-910.
Cheng, E. S., Chen, S., and Mulgrew, B. (1996). Gradient radial basis function
networks for nonlinear and nonstationary time series prediction. IEEE Trans.
on Neural Networks, 7:190-194.
Cheng, W., Fadlalla, A., and Lin, C.-H. (1996). Improve forecasting perfor-
mance of neural networks through the use of a combined model. In World
Congress on Neural Networks, pages 447-450.
Chiang, C. C. and Fu, H. C. (1994). A divide-and-conquer methodology for
modular supervised neural network design. In Int. Conf. on Neural Networks,
pages 119-124.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood
estimation via the EM algorithm. J. of the Roy. Stat. Soc. B, 39:1-38.
Deng, L. (1992). A generalized hidden Markov model with state-conditioned
trend functions of time for the speech signal. Signal Processing, 27:65-78.
Deng, L. et al. (1991). Phonemic hidden Markov models with continuous mix-
ture output densities for large vocabulary word recognition. IEEE Trans. on
Signal Processing, 39:1677-1681.
Deng, L. et al. (1992). Modelling acoustic transitions in speech by state-inter-
polation hidden Markov models. IEEE Trans. on Signal Processing, 40:265-
272.
Dersch, D. R. and Tavan, P. (1995). Asymptotic level density in topological
feature maps. IEEE Trans. on Neural Networks, 6:230-236.
Deutsch, M., Granger, C., and Terasvirta, T. (1994). The combination of fore-
casts using changing weights. Int. Journal of Forecasting, 10:47-57.
Dharmadikari, S. (1963). Functions of finite Markov chains. Ann. of Math. Stat.,
34:1022-1032.
Dickinson, J. (1973). Some statistical results in the combination of forecasts.
Operational Research Quarterly, 24:252-260.
Dickinson, J. (1975). Some comments on the combination of forecasts. Opera-
tional Research Quarterly, 26:205-210.
Doob, J. (1953). Stochastic Processes. Wiley.
Drabe, T., Bressgott, W., and Bartscht, E. (1996). Genetic task clustering
for modular neural networks. In Proc. of International Workshop on Neural
Networks for Identification, Control, Robotics, and Signal/Image Processing,
NICROSP, pages 339-347.
Drucker, H. and Cortes, C. (1996). Boosting decision trees. In Advances in
Neural Information Processing Systems 8, pages 479-485.
Drucker, H. et al. (1994). Boosting and other ensemble methods. Neural Com-
putation, 6:1289-1301.
Drucker, H., Schapire, R., and Simard, P. (1993a). Boosting performance in
neural networks. International Journal of Pattern Recognition and Artificial
Intelligence, pages 61-76.
Drucker, H., Schapire, R., and Simard, P. (1993b). Improving performance in
neural networks using a boosting algorithm. In Advances in Neural Infor-
mation Processing Systems 5, pages 42-49.
Duda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis. Wiley.
Dufour, F. and Bertrand, P. (1994). The filtering problem for continuous-time
linear systems with Markovian switching coefficients. System and Control
Letters, 23:453-461.
Dumitras, A. et al. (1994). A quantitative study of evoked potential estimation
using a feedforward neural network. In IEEE Workshop: Neural Networks
for Signal Processing, pages 606-613.
Dunn, J. (1973). A fuzzy relative of the ISODATA process and its use in de-
tecting compact well-separated clusters. Journal of Cybernetics, 3:32-57.
Frean, M. (1990). The Upstart algorithm: A method for constructing and train-
ing feed-forward neural networks. Neural Computation, 2:198-209.
Freeman, J. S. and Saad, D. (1995). Learning and generalization in radial basis
function networks. Neural Computation, 7:1000-1020.
Freund, Y., Schapire, R. E., Singer, Y., and Warmuth, M. K. (1997). Using and
combining predictors that specialize. In 29th Annual ACM Symposium on
Theory of Computing, pages 334-343.
Friedman, J. H. (1979). A tree structured approach to nonparametric multi-
ple regression. In Smoothing techniques for curve estimation, pages 5-22.
Springer.
Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of
Statistics, 19: 1-67.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984).
Classification and Regression Trees. Wadsworth International.
Fritsch, J. (1996). Modular Neural Networks for Speech Recognition. Tech.
Report CMU-CS-96-203. Department of Computer Science, Carnegie Mellon
Univ.
Fritsch, J., Finke, M., and Waibel, A. (1997a). Adaptively growing hierarchical
mixtures of experts. In Advances in Neural Information Processing Systems
9.
Fritsch, J., Finke, M., and Waibel, A. (1997b). Context-dependent hybrid HME
/ HMM speech recognition using polyphone clustering decision trees. In Int.
Conf. on Acoustics, Speech and Signal Processing.
Fritzke, B. (1991). Unsupervised clustering with growing cell structures. In Int.
Joint Conf. on Neural Networks, pages 531-536.
Fritzke, B. (1994a). Growing cell structures - A self-organizing network for
unsupervised and supervised learning. Neural Networks, 7:1441-1460.
Fritzke, B. (1994b). Supervised learning with growing cell structures. In Ad-
vances in Neural Information Processing Systems 6, pages 255-262.
Fun, M. H. and Hagan, M. T. (1996). Levenberg-Marquardt training for mod-
ular networks. In Int. Conf. on Neural Networks, pages 468-473.
Funahashi, K. (1989). On the approximate realization of continuous mappings
by neural networks. Neural Networks, 2:183-192.
Fung, K. et al. (1996). Visual evoked potential enhancement by an artificial
neural network filter. Biomedical Mater. Engin., 6:1-10.
Gallant, S. I. (1986). Three constructive algorithms for network learning. In
Proc. of the 8th Annual Conf. of the Cognitive Science Society, pages 652-
660.
Gath, I. and Geva, A. (1989). Unsupervised optimal fuzzy clustering. IEEE
Trans. on Pattern Analysis and Machine Intelligence, 11:773-781.
Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the
bias/variance dilemma. Neural Computation, 4:1-58.
Georgiopoulos, M., Heileman, G. L., and Huang, J. (1990). Convergence prop-
erties of learning in ART1. Neural Computation, 2:502-509.
Helmick, R., Blair, W., and Hoffman, S. (1996). One-step fixed-lag smoothers
for Markovian switching systems. IEEE Trans. on Automatic Control,
41:1051-1056.
Henderson, T. and Lainiotis, D. (1972). Digital matched filters for detecting
gaussian signals in gaussian noise. Information Science, 4:233-249.
Henze, M., Grady, C. P. L., Jr., Gujer, W., Marais, G., and Matsuo, T. (1983).
Activated sludge model. Tech. report No. 1, IAWPRC.
Hergert, F., Finnoff, W., and Zimmermann, H. G. (1992). A comparison of
weight elimination methods for reducing complexity in neural networks. In
Proc. Int. Joint Conf. on Neural Networks, pages 980-987.
Hering, K., Haupt, R., and Villmann, T. (1995). An improved mixture of ex-
perts approach for model partitioning in VLSI-design using genetic algo-
rithms. Tech. report. Universität Leipzig, Fakultät für Mathematik und In-
formatik, 1995.
Hering, K., Haupt, R., and Villmann, T. (1996). Hierarchical strategy of model
partitioning for VLSI-design using an improved mixture of experts approach.
In Tenth Workshop on Parallel and Distributed Simulation - PADS 96, Proc.,
pages 106-113.
Hilborn, C. and Lainiotis, D. (1968). Optimal unsupervised learning multicate-
gory dependent hypotheses pattern recognition. IEEE Trans. on Information
Theory, 14:468-470.
Hilborn, C. and Lainiotis, D. (1969a). Optimal estimation in the presence of
unknown parameters. IEEE Trans. on Systems Science and Cybernetics,
1:38-43.
Hilborn, C. and Lainiotis, D. (1969b). Unsupervised learning minimum risk pat-
tern classification for dependent hypotheses and dependent measurements.
IEEE Trans. on Systems Science and Cybernetics, 5:109-115.
Hilhorst, R., van Amerongen, J., and Lohnberg, P. (1991). Intelligent adaptive
control of mode-switch processes. In Proc. IFAC International Symposium
on Intelligent Tuning and Adaptive Control, Singapore.
Ho, K., Hsu, Y., Chen, C., Lee, T., Liang, C., Lai, T., and Chen, K. (1990).
Short term load forecasting of Taiwan power system using a knowledge-
based expert system. In IEEE/PES 1990 Winter Meeting. Paper 90 WM
259-2 PWRS.
Ho, K., Hsu, Y., and Yang, C. C. (1992). Short term load forecasting using a
multilayer neural network with an adaptive learning algorithm. IEEE Trans.
on Power Systems, 7:141-149.
Hochberg, M., Cook, G., Renals, S., and Robinson, A. J. (1994). Connectionist
model combination for large vocabulary speech recognition. Neural Networks
for Signal Processing, pages 269-278.
Hofmann, R. and Tresp, V. (1995). Discovering structure in continuous vari-
ables using Bayesian networks. In Advances in Neural Information Process-
ing Systems.
Hogarth, R. (1989). On combining diagnostic forecasts: thoughts and some
evidence. Int. Journal of Forecasting, 5:593-597.
Holmstrom, L., Koistinen, P., and Laaksonen, J. (1997). Neural and sta-
tistical classifiers - taxonomy and two case studies. IEEE Trans. on Neural
Networks, 8:5-17.
Hong, L. and Lynch, A. (1993). Recursive temporal-spatial information fusion
with applications to target identification. IEEE Trans. on Aerospace and
Electronic Systems, 29:435-444.
Hrycej, T. (1992). Modular learning in neural networks. Wiley.
Hu, Y. H., Palreddy, S., and Tompkins, W. J. (1995). Customized ECG beat
classifier using mixture of experts. Neural Networks for Signal Processing,
pages 459-464.
Hunt, K. J., Kalkkuhl, J. C., Fritz, H., and Johansen, T. A. (1996). Construc-
tive empirical modeling of longitudinal vehicle dynamics using local model
networks. Control Engineering Practice, 4:167-178.
Hunt, K. J., Haas, R., and Murray-Smith, R. (1996). Extending the functional
equivalence of radial basis function networks and fuzzy inference systems.
IEEE Trans. on Neural Networks, 7:776-781.
Hunt, K. J., Haas, R., and Brown, M. (1995). On the functional equivalence of fuzzy
inference systems and spline-based networks. Int. Journal of Neural Systems,
6:171-184.
Hwang, J.-N., You, S.-S., Lay, S.-R., and Jou, I.-C. (1993). What's wrong with
a cascaded correlation learning network: A projection pursuit learning per-
spective. In Int. Symposium on Artificial Neural Networks, pages E11-E20.
Iso, K. and Watanabe, T. (1990). Speaker-independent word recognition using
a neural prediction model. In IEEE ICSP, pages 441-444.
Jacobs, R. A. (1989). Initial experiments on constructing domains of expertise
and hierarchies in connectionist systems. In Proc. of the 1988 Connectionist
Models Summer School, pages 144-153.
Jacobs, R. A. (1995). Methods for combining experts' probability assessments.
Neural Computation, 7:867-888.
Jacobs, R. A. (1997). Bias/variance analyses of mixtures-of-experts architec-
tures. Neural Computation, 9:369-383.
Jacobs, R. A. and Jordan, M. I. (1991). A competitive modular connectionist
architecture. In Advances in Neural Information Processing Systems 3, pages
767-773.
Jacobs, R. A. and Jordan, M. I. (1993). Learning piecewise control strategies in
a modular neural-network architecture. IEEE Trans. on Systems, Man and
Cybernetics, 23:337-345.
Jacobs, R. A., Jordan, M. I., and Barto, A. G. (1991). Task decomposition
through competition in a competitive modular connectionist architecture:
The what and where vision tasks. Cognitive Science, 15:219-250.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive
mixtures of local experts. Neural Computation, 3:79-87.
Jacobs, R. A., Peng, F. C., and Tanner, M. A. (1997). A Bayesian approach
to model selection in hierarchical mixtures-of-experts architectures. Neural
Networks, 10:231-241.
Kiartzis, S., Petridis, V., Bakirtzis, A., and Kehagias, A. (1997). Short term
load forecasting using a Bayesian combination algorithm. Electrical Power
and Energy Systems, 19:171-177.
Kim, H. and Mendel, J. (1995). Fuzzy basis functions: Comparison with other
basis functions. IEEE Trans. on Fuzzy Systems, 3:158-168.
Klagges, H. and Soegtrop, M. (1992). Limited fan-in random wired Cascade-
Correlation. Ftp from archive.cis.ohio-state.edu in /pub/neuroprose
Kohonen, T. (1982). Analysis of a simple self-organizing process. Biol. Cyber-
netics,44:135-140.
Kohonen, T. (1988a). An introduction to neural computing. Neural Networks,
1:3-16.
Kohonen, T. (1988b). Self-Organization and Associative Memory. Springer.
Kohonen, T. (1990). The self-organizing map. Proc. of the IEEE, 78:1464-1480.
Kohonen, T. (1995). Self-Organizing Maps. Springer.
Kosko, B. (1991a). Neural Networks and Fuzzy Systems. Prentice-Hall.
Kosko, B. (1991b). Stochastic competitive learning. IEEE Trans. on Neural
Networks, 2:522-529.
Krishnan, R. and Doran, F. (1987). Study of parameter sensitivity in high-
performance and inverter-fed induction motor drive systems. IEEE Trans.
on Ind. Appl., 23:263-265.
Krishnan, R. and Doran, F. (1991). A review of parameter sensitivity and
adaptation in indirect vector controlled induction motor drive. IEEE Trans.
on Power Electronics, 6:695-703.
Krogh, A. and Sollich, P. (1997). Statistical mechanics of ensemble learning.
Physical Review E, 55:811-825.
Krogh, A. and Vedelsby, J. (1995). Neural network ensembles, cross validation,
and active learning. In Advances in Neural Information Processing Systems
7, pages 231-238.
Krolzig, H. (1997). Markov switching vector autoregressions. Springer.
Krzysztofowicz, R. (1990). Fusion of detection probabilities and compari-
son of multisensor systems. IEEE Trans. on Systems, Man and Cybernetics,
20:665-677.
Krzyzak, A., Linder, T., and Lugosi, G. (1996). Nonparametric estimation and
classification using radial basis function nets and empirical risk minimization.
IEEE Trans. on Neural Networks, 7:475-487.
Kubat, M. and Flotzinger, D. (1995). Pruning multivariate decision trees by
hyperplane merging. Lecture Notes in Artificial Intelligence, 912:190-199.
Kulkarni, S. and Ramadge, P. (1996). Model and controller selection poli-
cies based on output prediction errors. IEEE Trans. on Automatic Control,
41:1594-1604.
Kung, S. Y. and Taur, J. S. (1995). Decision - based neural networks with
signal/image classification applications. IEEE Trans. on Neural Networks,
6:170-181.
Kuensch, H., Geman, S., and Kehagias, A. (1995). Hidden Markov random fields.
The Annals of Applied Probability, 5:577-602.
Liu, Y. and Yao, X. (1997). Evolving modular neural networks which generalize
well. In Proc. of the IEEE Conference on Evolutionary Computation, pages
605-610.
Ljung, L. (1987). System Identification: Theory for the User. Prentice Hall.
Lo, Z. and Bavarian, B. (1991). On the rate of convergence in topology pre-
serving neural networks. Biol. Cybernetics, 65:55-63.
Lo, Z.-P., Yu, Y., and Bavarian, B. (1993). Analysis of the convergence prop-
erties of topology preserving neural networks. IEEE Trans. on Neural Net-
works, 4:207-220.
Lu, C., Wu, H., and Vemuri, S. (1993). Neural network based short term load
forecasting. IEEE Trans. on Power Systems, 8:336-342.
Luttrell, S. P. (1991). Code vector density in topographic mappings: Scalar
case. IEEE Trans. on Neural Networks, 2:427-436.
Luttrell, S. (1994). A Bayesian analysis of self-organizing maps. Neural Com-
putation, 6:767-794.
Luttrell, S. (1997). Self organization of multiple winner take all neural networks.
Connection Science, 9:11-30.
MacKay, D. (1996). Equivalence of linear Boltzmann chains and hidden Markov
models. Neural Computation, 8:178-181.
MacQueen, J. (1965). Some methods for classification and analysis of multi-
variate observations. In Proc. of the Berkeley Symposium on Math. Statistics
and Probability.
Magill, D. (1965). Optimal adaptive estimation of sampled stochastic processes.
IEEE Trans. on Automatic Control, 10:434-439.
Makridakis, S. (1989). Why combining works? Int. Journal of Forecasting,
5:601-603.
Mangeas, M., Muller, C., and Weigend, A. S. (1995). Forecasting electricity de-
mand using a mixture of nonlinear experts. In World Congress on Neural
Networks, 2:48-53.
Mani, G. (1991). Lowering variance of decisions by using artificial neural net-
work portfolios. Neural Computation, 3:484-486.
Mariton, M. (1990). Jump linear systems in automatic control. Marcel Dekker.
McGillem, C., Aunon, J., and Pomalaza-Raez, C. (1985). Improved waveform
estimation procedures for event related potentials. IEEE Trans. on Biomed-
ical Engineering, 39:371-379.
McGillem, C., Aunon, J., and Yu, K. (1985). Signals and noise in evoked brain
potentials. IEEE Trans. on Biomedical Engineering, 32:371-379.
Meila, M. and Jordan, M. I. (1997). Markov mixtures of experts. In Multiple
Model Approaches to Modelling and Control. Taylor and Francis.
Meir, R. (1995). Bias, variance and the combination of least squares estimators.
In Advances in Neural Information Processing Systems 7, pages 295-302.
Mezard, M. and Nadal, J.-P. (1989). Learning in feedforward layered networks:
The Tiling algorithm. Journal of Physics A: Math. Gen., 22:2191-2203.
Miller, D. and Rose, K. (1996). Hierarchical unsupervised learning with growing
phase transitions. Neural Computation, 8:425-450.
Sims, F., Lainiotis, D., and Magill, D. (1969). Recursive algorithm for the calcu-
lation of the adaptive Kalman filter coefficients. IEEE Trans. on Automatic
Control, 14:215-218.
Sin, S.-K. and DeFigueiredo, R. J. (1993). Efficient learning procedures for
optimal interpolative nets. Neural Networks, 6:99-113.
Sirat, J. A. and Nadal, J.-P. (1990). Neural trees: a new tool for classification.
Network-Computation in Neural Systems, 1:423-438.
Sjogaard, S. (1991). A Conceptual Approach to Generalisation in Dynamic
Neural Networks. PhD thesis, Aarhus University.
Sjogaard, S. (1992). Generalization in Cascade-Correlation networks. In Work-
shop on Neural Networks for Signal Processing 1992, Vol. 2, pages 59-68.
Skeppstedt, A., Ljung, L., and Millnert, M. (1992). Construction of composite
models from observed data. Int. Journal of Control, 55:141-152.
Smieja, F. (1996). The pandemonium system of reflective agents. IEEE Trans.
on Neural Networks, 7:97-106.
Smotroff, Friedman, and Conolly (1991). Self organizing modular neural net-
works. In Proc. Int. Joint Con! on Neural Networks, pages 187-192.
Sokol, S. (1976). Visually evoked potentials: theory, techniques and clinical
applications. Surv. Ophthalm., 21:18-44.
Sorheim, E. (1990). A combined network architecture using ART2 and back
propagation for adaptive estimation of dynamical processes. Modeling, Iden-
tification and Control, 11:191-199.
Srinivasan, D., Chang, C., and Liew, A. (1995). Demand forecasting using fuzzy
neural computation, with special emphasis on weekend and public holiday
forecasting. Paper 95 WM 158-6-PWRS presented at IEEE/PES 1995 Win-
ter Meeting.
Srinivasan, K. (1969). State estimation by orthogonal expansion of probability
distributions. IEEE Trans. on Automatic Control, 15:3-10.
Strang, G. and Nguyen, T. (1996). Wavelets and Filter Banks. Wellesley-
Cambridge Press.
Stromberg, J., Gustaffson, F., and Ljung, L. (1991). Trees as black box model
structures for dynamical systems. In European Control Conference, Greno-
ble, pages 1175-1180.
Sugeno, M. and Kang, G. (1986). Fuzzy modelling and control of multilayer
incinerator. Fuzzy sets and systems, 18:329-346.
Sugeno, M. and Kang, G. (1988). Structure identification of fuzzy model. Fuzzy
sets and systems, 26:15-33.
Sugeno, M., Murofushi, T., Mori, T., Tatematsu, T., and Tanaka, J. (1989).
Fuzzy algorithmic control of a model car by oral instructions. Fuzzy Sets and
Systems, 32:207-219.
Sugeno, M. and Yasukawa, T. (1993). A fuzzy logic-based approach to qualita-
tive modeling. IEEE Trans. on Fuzzy Systems, 1:7-32.
Swiercz, M., Grusza, M., and Sobolewski, P. (1997). Analysis of visual evoked
potentials using neural networks. In 4th Int. Conference on Computers in
Medicine, pages 127-132.
Weigend, A. S., Mangeas, M., and Srivastava, N. (1995). Nonlinear gated ex-
perts for time-series - discovering regimes and avoiding overfitting. Interna-
tional Journal of Neural Systems, 6:373-399.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. (1990). Back-propaga-
tion, weight-elimination and time series prediction. In Proc. Connectionist
Models Summer School, pages 105-116.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. (1991). Generalization
by weight-elimination with application to forecasting. In Advances in Neural
Information Processing Systems 3, pages 875-882.
Weymaere, N. and Martens, J.-P. (1991). A fast and robust learning algorithm
for feedforward neural networks. Neural Networks, 4:361-369.
Whitehead, B. (1996). Genetic evolution of radial basis function coverage using
orthogonal niches. IEEE Trans. on Neural Networks, 7:1525-1528.
Whitehead, B. A. and Choate, T. D. (1994). Evolving space-filling curves to dis-
tribute Radial Basis Functions over an input space. IEEE Trans. on Neural
Networks, 5:15-23.
Whitehead, B. and Choate, T. D. (1996). Cooperative-competitive genetic evo-
lution of radial basis function centers and widths for time series prediction.
IEEE Trans. on Neural Networks, 7:869-880.
Whittaker, J. (1990). Graphical models in applied multivariate statistics. Wiley.
Windham, M. (1982). Cluster validity for the fuzzy c-means clustering algo-
rithm. IEEE Trans. on Pattern Analysis and Machine Intelligence, 4:357-
363.
Winkler, R. (1989). Combining forecasts: a philosophical basis and some current
issues. Int. Journal of Forecasting, 5:605-609.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5:241-259.
Wong, Y. (1993). Clustering data by melting. Neural Computation, 5:89-104.
Wynne-Jones, M. (1992). Node splitting: A constructive algorithm for feed-
forward neural networks. Advances in Neural Information Processing Sys-
tems 4, pages 1072-1079.
Xiao-Rong, L. and Bar-Shalom, Y. (1996). Multiple model estimation with
variable structure. IEEE Trans. on Automatic Control, 41:478-493.
Xu, L., Hinton, G., and Jordan, M. I. (1995). An alternative model for mixtures
of experts. In Advances in Neural Information Processing Systems 7, pages
633-640.
Xu, L. and Jordan, M. I. (1993). EM learning on a generalized finite mixture
model for combining multiple classifiers. In World Congress on Neural Net-
works, pages 227-230.
Xu, L., Krzyzak, A., and Oja, E. (1993). Rival penalized competitive learning
for clustering analysis, RBF net and curve detection. IEEE Trans. on Neural
Networks, 4:636-649.
Xu, L., Krzyzak, A., and Suen, C. (1992). Methods of combining multiple clas-
sifiers and their applications to handwriting recognition. IEEE Trans. on
Systems, Man and Cybernetics, 22:418-434.