
PREDICTIVE

MODULAR NEURAL
NETWORKS
Applications to
Time Series
THE KLUWER INTERNATIONAL SERIES IN ENGINEERING
AND COMPUTER SCIENCE

by

Vassilios Petridis
Aristotle University of Thessaloniki, Greece
petridis@vergina.eng.auth.gr

and

Athanasios Kehagias
American College of Thessaloniki
and Aristotle University of Thessaloniki, Greece
kehagias@egnatia.ee.auth.gr

Springer Science+Business Media, LLC
ISBN 978-1-4613-7540-1 ISBN 978-1-4615-5555-1 (eBook)
DOI 10.1007/978-1-4615-5555-1
Library of Congress Cataloging-in-Publication Data

A C.I.P. Catalogue record for this book is available from the Library of Congress.

Copyright © 1998 by Springer Science+Business Media New York


Originally published by Kluwer Academic Publishers in 1998
Softcover reprint of the hardcover 1st edition 1998

All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the
publisher, Springer Science+Business Media, LLC.

Printed on acid-free paper.


Contents

Preface

1. INTRODUCTION
1.1 Classification, Prediction and Identification: an Informal Description
1.2 Part I: Known Sources
1.3 Part II: Applications
1.4 Part III: Unknown Sources
1.5 Part IV: Connections

Part I Known Sources

2. PREMONN CLASSIFICATION AND PREDICTION
2.1 Bayesian Time Series Classification
2.2 The Basic PREMONN Classification Algorithm
2.3 Source Switching and Thresholding
2.4 Implementation and Variants of the PREMONN Algorithm
2.5 Prediction
2.6 Experiments
2.7 Conclusions

3. GENERALIZATIONS OF THE BASIC PREMONN
3.1 Predictor Modifications
3.2 Prediction Error Modifications
3.3 Credit Assignment Modifications
3.4 Markovian Source Switching
3.5 Markovian Modifications of Credit Assignment Schemes
3.6 Experiments
3.7 Conclusions

4. MATHEMATICAL ANALYSIS
4.1 Introduction
4.2 Convergence Theorems for Fixed Source Algorithms
4.3 Convergence Theorem for a Markovian Switching Sources Algorithm
4.4 Conclusions

5. SYSTEM IDENTIFICATION BY THE PREDICTIVE MODULAR APPROACH
5.1 System Identification
5.2 Identification and Classification
5.3 Parameter Estimation: Small Parameter Set
5.4 Parameter Estimation: Large Parameter Set
5.5 Experiments
5.6 Conclusions

Part II Applications

6. IMPLEMENTATION ISSUES
6.1 PREMONN Structure
6.2 Prediction
6.3 Credit Assignment
6.4 Simplicity of Implementation

7. CLASSIFICATION OF VISUALLY EVOKED RESPONSES
7.1 Introduction
7.2 VER Processing and Classification
7.3 Application of PREMONN Classification
7.4 Results
7.5 Conclusions

8. PREDICTION OF SHORT TERM ELECTRIC LOADS
8.1 Introduction
8.2 Short Term Load Forecasting Methods
8.3 PREMONN Prediction
8.4 Results
8.5 Conclusions

9. PARAMETER ESTIMATION FOR AN ACTIVATED SLUDGE PROCESS
9.1 Introduction
9.2 The Activated Sludge Model
9.3 Predictive Modular Parameter Estimation
9.4 Results
9.5 Conclusions

Part III Unknown Sources

10. SOURCE IDENTIFICATION ALGORITHMS
10.1 Introduction
10.2 Source Identification and Data Allocation
10.3 Two Source Identification Algorithms
10.4 Experiments
10.5 A Remark about Local Models
10.6 Conclusions

11. CONVERGENCE OF PARALLEL DATA ALLOCATION
11.1 The Case of Two Sources
11.2 The Case of Many Sources
11.3 Conclusions

12. CONVERGENCE OF SERIAL DATA ALLOCATION
12.1 The Case of Two Sources
12.2 The Case of Many Sources
12.3 Conclusions

Part IV Connections

13. BIBLIOGRAPHIC REMARKS
13.1 Introduction
13.2 Neural Networks
13.3 Statistical Pattern Recognition
13.4 Econometrics and Forecasting
13.5 Fuzzy Systems
13.6 Control Theory
13.7 Statistics

14. EPILOGUE

Appendices
A. Mathematical Concepts
A.1 Notation
A.2 Probability Theory
A.3 Sequences of Bernoulli Trials
A.4 Markov Chains

References

Index
Preface

The subject of this book is predictive modular neural networks and their application to time series problems: classification, prediction and identification.
The intended audience is researchers and graduate students in the fields of
neural networks, computer science, statistical pattern recognition, statistics,
control theory and econometrics. Biologists, neurophysiologists and medical
engineers may also find this book interesting.
In the last decade the neural networks community has shown intense interest
in both modular methods and time series problems. Similar interest has been
expressed for many years in other fields as well, most notably in statistics,
control theory, econometrics etc. There is a considerable overlap (not always
recognized) of ideas and methods between these fields.
Modular neural networks are known by many other names, for instance multiple
models, local models and mixtures of experts. The basic idea is to independently
develop several "subnetworks" (modules), which may perform the same or re-
lated tasks, and then use an "appropriate" method for combining the outputs
of the subnetworks. Some of the expected advantages of this approach (when
compared with the use of "lumped" or "monolithic" networks) are: superior
performance, reduced development time and greater flexibility. For instance, if
a module is removed from the network and replaced by a new module (which
may perform the same task more efficiently), it should not be necessary to
retrain the aggregate network.
In fact, the term "modular neural networks" can be rather vague. In its
most general sense, it denotes networks which consist of simpler subnetworks
(modules). If this point of view is taken to the extreme, then every neural
network can be considered to be modular, in the sense that it consists of neurons
which can be seen as elementary networks. We believe, however, that it is more
profitable to think of a continuum of modularity, placing complex nets of very
simple neurons at one end of the spectrum, and simple nets of very complex
neurons at the other end.
We have been working along these lines for several years and have developed
a family of algorithms for time series problems, which we call PREMONNs (i.e.
PREdictive MOdular Neural Networks). Similar algorithms and systems have
also been presented by other authors, under various names. We will generally
use the acronym PREMONN to refer to our own work and retain "predictive
modular neural networks" as a generic term.
This book is divided into four parts. In Part I we present some of our work
which has appeared in various journals such as IEEE Transactions on Neural
Networks, IEEE Transactions on Fuzzy Systems, Neural Computation, Neural
Networks etc. We introduce the family of PREMONN algorithms. These
algorithms are appropriate for online time series classification, prediction and
identification. We discuss these algorithms at an informal level and we also
analyze mathematically their convergence properties.
In Part II we present applications (developed by ourselves and other re-
searchers) of PREMONNs to real world problems. In both these parts a basic
assumption is that models are available to describe the input/output behavior
of the sources generating the time series of interest. This is the known sources
assumption.
In Part III we remove this assumption and deal with time series generated
by completely unknown sources. We present algorithms which operate online
and discover the number of sources involved in the generation of a time series
and develop input/output models for each source. These source identification
algorithms can be used in conjunction with the classification and prediction
algorithms of Part I. The results of Part III have not been previously published.
Finally, in Part IV we briefly review work on modular and multiple models
methods which has appeared in the literature of neural networks, statistical
pattern recognition, econometrics, fuzzy systems, control theory and statistics.
We argue that there is a certain unity of themes and methods in all these fields
and provide a unified interpretation of the multiple models idea. We hope that
this part will prove useful by pointing out and elucidating similarities between
the multiple models methodologies which have appeared in several disparate
fields.
Indeed, we believe that there is an essential unity in the modular approach,
which cuts across disciplinary boundaries. A good example is the work re-
ported in this book. While we present our work in "neural" language, its
essential characteristic is the combination of simple processing elements which
can be combined to form more complex (and efficient) computational struc-
tures. There is nothing exclusively neural about this theme; it has appeared in
all the above mentioned disciplines and this is why we believe that a detailed
literature search can yield rich dividends in terms of outlook and technique
cross fertilization.
The main prerequisite for reading this book is the basics of neural network
theory (and a little fuzzy set theory). In Part I, the mathematically involved
sections are relegated to appendices, which may be left for a second reading,
or omitted altogether. The same is true of Part III: convergence proofs (which
are rather involved) are presented in appendices, while the main argument
can be followed quite independently of the mathematics. Parts II and IV are
nonmathematical. We have also provided an appendix, which contains the
basic mathematical concepts used throughout the book.
Many people helped us in writing this book. We want to especially acknowledge the help given by A. Bakirtzis, A. Greene, M. Grusza, V. Kaburlazos, L. Kamarinopoulos, S. Kiartzis, P. Lincoln, M. Swiercz, M. Paterakis, P. Sobolewski, S. St.Clair, K. Vezeridis, D. Vlahodimitropoulos. Finally, we
dedicate this book to our families.

VASSILIOS PETRIDIS AND ATHANASIOS KEHAGIAS

Aristotle University,
Thessaloniki, Greece,
1 INTRODUCTION

1.1 CLASSIFICATION, PREDICTION AND IDENTIFICATION: AN INFORMAL DESCRIPTION

Consider the following problem of time series classification. A source is a
mechanism for generating a time series. K sources are available; one of them
is selected at time t = 0; the selected source generates a time series y_1, y_2,
..., y_t. The time series is observable, but the source is hidden; however, the
number of sources (K) and the input/output behavior of each one is known
in advance. It is required to find which source generates the time series. This
can be considered as a classification task: it is required to classify the entire
time series y_1, y_2, ..., y_t, ... to one of K classes, each class corresponding to
one source.
A generalization of the above problem is obtained by assuming that, out of
the K sources, a new one is selected at every time step t to generate the current
observation y_t. In this case it is required to identify the selected source at every
time step t; in other words it is required to classify each y_t to one of the K
possible classes.
This type of problem arises in many applications. For example, in speech
recognition the observed time series y_1, y_2, ..., y_t, ... is the speech signal
(either in its raw form, or after some preprocessing, such as extracting the FFT
or LPC coefficients). One source corresponds to each phoneme (such as [ah],
[oo], etc.). What is required is to assign segments of the time series to

phonemes; for instance to be able to say that (for some T and n) the segment
y_{T+1}, y_{T+2}, ..., y_{T+n} corresponds to the phoneme [oo]. Other instances of time
series classification arise in connection to processing of radar, sonar or seismic
signals, the analysis of electrocardiographic or electroencephalographic signals,
DNA sequencing and so on. An extensive list of classification applications can
be found in (Hertz, Krogh and Palmer, 1991).
The model of a time series generated by a collection of sources can be (and
has been) used not only in classification problems, but also for prediction and
identification. Consider the case of prediction: if the input/output behavior of
the sources is known, then a predictor can be developed for each source. If it is
known that a segment of the observed time series is generated by a particular
source, future observations within this segment can be predicted by using the
corresponding predictor. Various methods of predictor combination build upon
this idea. In identification problems it is required to obtain, for instance, an
input/output model of the time series. Such a model may be global, i.e. hold
for all possible values of the observed time series, or local, i.e. several models
may be combined, each describing the input/output behavior for a particular
range of observations. In the latter case, each local model may be considered as
a source; under this interpretation the time series is generated by the resulting
collection of sources. It can be seen that in all problems discussed above the
primary task is establishing, at every time step t, the active source. In other
words, classification is a prerequisite for prediction and identification.
Multiple sources and local models are well suited to modular methods, a topic
which has recently attracted a great deal of attention in the neural networks
community. There is no universally accepted definition of what a modular
neural network is, but generally the term denotes a network which is composed
of subnetworks (modules). Each module may be specialized in a particular task,
or several modules may perform the same task in a slightly different manner
(perhaps because each underwent a different training process). In the latter
case the term ensemble of networks is sometimes preferred. At any rate, it is
hoped that the combination of modules will yield superior performance, greater
noise robustness, a shorter training cycle and so on. This is a reasonable hope,
since modular neural networks implement the time-honored divide-and-conquer
approach to problem solving. The results reported in the literature indicate
that modular neural networks indeed outperform "lumped" or "monolithic"
networks.
In this book we present a family of modular neural algorithms which we have
found to work well in practical time series problems. In addition to experimen-
tal results, we present a mathematical framework to explain the success of the
proposed algorithms. This framework is based on a phenomenological point of
view which can be summarized thus.

If a module predicts a time series with greater accuracy than competing mod-
ules, then it should receive higher credit as a potential model of the time series;
however, credit must be assigned in connection with past, as well as present,
predictive accuracy.

This rather simple principle can yield mathematically precise results such as
convergence to correct classification.
Part I of the book is mainly devoted to the presentation and analysis of
a family of recursive, online predictive modular time series classification algo-
rithms. The proposed algorithms are modular in the sense that they combine
a collection of modules; each module can be developed independently of the
rest and replaced or removed from the system, without affecting the remaining
modules. The algorithms are characterized as predictive because the modules
are actually predictors (one predictor corresponding to each active source) and
classification is performed by the use of credit functions which are obtained
from the modules' predictive error.
Part II presents the application of the proposed algorithms to real world
problems of time series classification, prediction and identification.
All the methods presented in the first two parts of the book are based on
the assumption that the collection of active sources is known in advance and
that an input/output model of each source is available. If these assumptions
are not satisfied, a different approach is required; this is discussed in Part III.
Namely, we propose a family of algorithms which can be used to identify the
active sources and develop one predictive module per source. Hence the results
of Part III complement those of Part I.
Finally, in Part IV we discuss the use of multiple models methods (which may
be considered a superset of modular methods) in the fields of neural networks,
statistical pattern recognition, control theory, fuzzy set theory, econometrics
and statistics, and provide a framework which allows a unified interpretation
of such methods; extensive bibliographic references are also provided.
We will now discuss each part of the book in more detail.

1.2 PART I: KNOWN SOURCES

As already mentioned, we consider classification to be the primary problem,
with prediction and identification building upon classification results. Hence,
most of Part I is devoted to the development of a family of classification algo-
rithms and the study of their convergence properties. Let us repeat that all
these algorithms are based on the assumption that the source set is known and
training data are available for each source.
Chapter 2 is devoted to the development of a "basic" time series classification
algorithm. The main idea is to use a source credit function. This is recomputed
at every time step t, according to the predictive accuracy of each source model.
If a model is better than the remaining ones in predicting the currently observed
y_t, the credit of the respective source increases. The credit of a source, however,
does not depend only on the current predictive performance of the respective
model; credit computation also depends on previous credit values.
More specifically, we discuss variations of the following general credit assign-
ment algorithm, which operates on a collection of K known sources.

In the offline learning phase K predictors are trained, each predictor using
data generated by one of the K sources.
At t = 0, equal credit is assigned to all sources.
The online phase (for t = 1, 2, ...) consists of the following steps.

1. Each predictor computes a prediction of the next observation y_t.

2. When the actual y_t becomes available, the prediction error of each predictor is computed.

3. The credit of each source is updated, taking into account: the relative magnitude of the respective error (compared with the errors of the remaining predictors) and the credit values of all sources, as computed at t - 1.

4. The time series is classified (for the present time t) to the source of highest credit.
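As an illustration, the online phase above might be sketched in a few lines of code. The particular multiplicative credit update shown below is only one of the many variants developed in Chapters 2 and 3, and the constant predictors in the usage example are hypothetical, chosen purely to keep the sketch self-contained:

```python
import math

def classify_online(y, predictors, credit_update):
    """Generic online phase of the credit assignment algorithm.

    y             : observed time series (list of floats)
    predictors    : K trained predictor functions (offline phase assumed done),
                    each mapping the past observations to a prediction of y_t
    credit_update : rule mapping (errors, previous credits) -> new credits
    Yields, at each time step t, the index of the highest-credit source.
    """
    K = len(predictors)
    credits = [1.0 / K] * K                              # equal credit at t = 0
    for t in range(1, len(y)):
        preds = [f(y[:t]) for f in predictors]           # predictions of y_t
        errors = [abs(y[t] - p) for p in preds]          # prediction errors
        credits = credit_update(errors, credits)         # recursive, competitive
        yield max(range(K), key=lambda k: credits[k])    # classify to top credit

def multiplicative_update(errors, credits, sigma=0.5):
    """One possible competitive rule: weight each previous credit by a
    Gaussian function of the current error, then renormalize."""
    w = [c * math.exp(-e ** 2 / (2 * sigma ** 2)) for e, c in zip(errors, credits)]
    total = sum(w)
    return [wi / total for wi in w]

# Hypothetical usage: two constant 'predictors'; the series matches source 0.
predictors = [lambda past: 1.0, lambda past: 0.0]
decisions = list(classify_online([1.0] * 6, predictors, multiplicative_update))
```

Because the update multiplies the previous credit by a function of the current error, the scheme is both recursive (past credit matters) and competitive (only relative error magnitudes survive the renormalization).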

Since previous credit values are used to compute the current ones, the above
algorithm is recursive. In addition, as will become obvious later, credit assign-
ment is competitive: the new credit of a source does not depend on the absolute
magnitude of the respective prediction error, but on the relative magnitude of all
K prediction errors. The algorithm is characterized as predictive, since credit
assignment depends on predictive error. In addition, the algorithm is modular:
each predictor is an independent, separately trained module and can be easily
replaced or removed from the system, without affecting the operation of the
remaining modules.
As discussed in Chapter 3, each predictor is trained independently (during
the offline phase) and can be implemented by several different kinds of neural
networks (for instance, feedforward or recurrent networks employing sigmoid,
linear, RBF or polynomial neurons). The credit assignment module is also
independent of the prediction modules, and may be replaced without affecting
their operation. Various credit computation schemes can be used, which may
be, for instance, of a multiplicative, additive or "fuzzy" character. By varying
the characteristics of the predictive and credit assignment modules, a family of
Predictive Modular Neural Networks (PREMONN) is developed, which can be
used to perform time series classification, prediction and parameter estimation
(of dynamical systems). We present numerical experiments to compare the
performance of the various algorithms introduced.
In Chapter 4, we prove mathematically (for most of the algorithms introduced) that, under mild assumptions, the credit functions converge to the "correct" classification. Roughly speaking, this means that the credit function of the
module with maximum predictive power converges to one and all the remaining
credit functions converge to zero.
In Chapter 5, we show how PREMONNs can be applied to the problem
of time series identification. There are several ways to do this, depending
on the nature of the identification problem. We adopt a control theoretic
point of view and distinguish between black box identification and parameter
estimation. We relegate the discussion of the former to Part III. Regarding
parameter estimation we present a hybrid neural genetic algorithm which uses
PREMONNs to compute selection probabilities and we present a numerical
study of this algorithm.

1.3 PART II: APPLICATIONS


In Chapter 6 we discuss various implementation issues and present guidelines
for fine tuning the algorithms presented in Part I and selecting appropriate
values of the operating parameters.
In Chapter 7 we present an application of PREMONN to medical diagno-
sis by classification of Visually Evoked Responses (these are electroencephalo-
graphic signals).
In Chapter 8 we present an application of PREMONN to prediction of short
term electric loads. The prediction horizon is twenty-four hours.
In Chapter 9 we present an application of the hybrid PREMONN/genetic
algorithm to the problem of estimating the parameters of a waste water
treatment plant.

1.4 PART III: UNKNOWN SOURCES


In Parts I and II it is assumed that the active sources are known a priori and
that data from each source are available for training the respective predictor.
Part III examines the consequences of removing these assumptions. In short,
Part III is devoted to the problem of unknown sources time series classification.
As will become clear later, our approach to this problem results in solving not
only the classification but also the black box identification problem. In other
words, we present algorithms which

1. take as input data from a composite time series, generated by alternately
activated sources;

2. identify the active sources;

3. produce an input/output model (i.e. predictor) for each active source;

4. and use these models (in conjunction with the algorithms of Part I) to per-
form classification.

In Chapter 10 we present two basic algorithms (in addition to guidelines for
producing a family of hybrid algorithms) which perform online unsupervised
source identification and time series classification. The approach used to obtain
these algorithms rests on the following analysis.
The classification algorithms of Part I depend on the predictive error; in turn,
computation of the predictive error depends on the availability of one predictor
for every active source. Such predictors can be obtained in a straightforward
manner (using one of the many available neural network training algorithms)
providing that a set of training data, generated by a single source, is available
for each predictor. However, if the sources are unknown and the training data
unlabeled, it is not possible to obtain the required predictors, since it is not
possible to allocate the appropriate data to each predictor. In other words,
in the case of unknown sources and unlabeled data, the main problem is data
allocation, i.e. the assignment of samples y_t to predictors.
The solution of the data allocation problem is vital to the development of
online unsupervised source identification algorithms. To this end we propose
a data allocation method which depends on predictive error. Consider the
following algorithm.

Initialization: Set K = 1. Initialize predictor no. 1 with random values. Set
a threshold d.

For t = 1, 2, ...

1. Observe y_t.

2. For k = 1, 2, ..., K compute y_t^k (the prediction of y_t by predictor no. k).

3. Compute the prediction errors |y_t - y_t^k|, k = 1, 2, ..., K.

4. Set k* = arg min_{k=1,2,...,K} |y_t - y_t^k|.

5. If |y_t - y_t^{k*}| < d, allocate y_t to predictor no. k*. Else increase K by one
and allocate y_t to predictor no. K.

6. Retrain each predictor on all data assigned to it.

Next t.
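To make the procedure concrete, here is a minimal sketch of the parallel allocation loop. The "predictors" are deliberately trivial constant models (each predicts the mean of the data allocated to it, so "retraining" is just recomputing that mean), and the first predictor is seeded with the first sample rather than random values; actual PREMONN modules would be neural network predictors:

```python
def parallel_data_allocation(y, d):
    """Parallel data allocation: maintain a growing set of predictors and
    assign each observation, winner-take-all, to the minimum-error predictor;
    spawn a new predictor whenever every error exceeds the threshold d."""
    buckets = [[y[0]]]                               # data allocated to predictor no. 1
    for yt in y[1:]:
        preds = [sum(b) / len(b) for b in buckets]   # each predictor's y_t^k
        errors = [abs(yt - p) for p in preds]        # prediction errors
        k_star = min(range(len(buckets)), key=lambda k: errors[k])
        if errors[k_star] < d:
            buckets[k_star].append(yt)               # allocate to the winner...
        else:
            buckets.append([yt])                     # ...or create predictor no. K+1
        # 'retraining' happens implicitly: the means are recomputed next step
    return buckets

# Hypothetical composite series alternating between two constant sources.
series = [0.1, 0.0, 5.0, 5.1, 0.2, 4.9, 0.0, 5.0]
groups = parallel_data_allocation(series, d=1.0)
```

Run on such a series with a suitable threshold, the loop settles on one bucket (and hence one predictor) per source, which is exactly the behavior the convergence analysis of Chapter 11 makes precise.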

This is a completely unsupervised learning algorithm. Data are allocated
in a competitive, winner-take-all manner, to the minimum error predictor. We
use the name "Parallel Data Allocation Algorithm", in contradistinction to
the "Serial Data Allocation Algorithm" variation. In the serial case, at every
time step the prediction error is computed sequentially (for one predictor at a
time) until a predictor is found with error smaller than the threshold d; the
observation Yt is allocated to the first such predictor.
Both the parallel and serial algorithms allocate the observed time series data
to the predictors according to their respective prediction errors; the allocated
data is used for retraining at every time step. This procedure may, under
appropriate conditions, produce one "well-trained" predictor per source. This
is corroborated by numerical experiments presented in Chapter 10.
In Chapter 11 we examine the convergence properties of the parallel source
identification algorithm. We provide conditions sufficient to ensure convergence
to "correct" allocation, where "correctness" of allocation is defined in a math-
ematically precise sense. These conditions are of a very general nature and
apply to a large family of algorithms, which may be used not only for time se-
ries, but also for static problems. We conjecture that the satisfiability of these
conditions depends on the complexity of the source generating the time series,
on the capacity of the predictors and on the nature of the training algorithm.
The analysis we present depends on a so-called specialization variable X_t which
describes to what degree a predictor is "associated" with a source. Using the
theory of Markov processes, we show that X_t is a transient Markov chain (ac-
tually an inhomogeneous random walk). The convergence results follow from
this. Generally speaking this chapter is mathematically more demanding than
the preceding ones. However, we separate the statement and the discussion of
the relevant theorems from the proofs, which are relegated to a mathematical
appendix.
Chapter 12 contains a convergence theorem for the serial source identification
algorithm and its proof. The presentation is very similar to that of Chapter
11.

1.5 PART IV: CONNECTIONS


In Chapter 13 we provide an overview of the literature of multiple models meth-
ods. Such methods have appeared in several disciplines, including neural net-
works, fuzzy systems, statistical pattern recognition, econometrics and statis-
tics. Modular neural networks may be considered as a particular example of
the multiple models approach. We do not present an exhaustive bibliographic
analysis, since the multiple models literature in any one of the above named
fields is vast. Rather, we present our own literature map, we list papers which
we have found stimulating and provide some pointers to more specialized biblio-
graphic resources. We also point out some (perhaps not so obvious) similarities
between various approaches, especially across the aforementioned fields.
Finally, in Chapter 14 we present our personal synthetic view of multiple
models and modular methods. We indicate the position of our own methods in
this "big picture" and speculate on possible research directions for the future.
Part I Known Sources
2 PREMONN CLASSIFICATION AND
PREDICTION

In Chapter 1 we introduced the time series classification problem informally. In
this chapter a mathematically more precise formulation is presented; this leads
to the development of a probabilistically motivated time series classification
algorithm.

2.1 BAYESIAN TIME SERIES CLASSIFICATION


Consider a random variable $z$ taking values in the set $\Theta = \{1, 2, \dots, K\}$ with
probability $p_0^k$, $k = 1, 2, \dots, K$; in other words $p_0^k \doteq \Pr(z = k)$. The term source
variable is used to denote $z$; $\Theta$ is called the source set.
Suppose that appropriate values $y_{-M+1}, y_{-M+2}, \dots, y_0$ are fixed (perhaps
selected with uniform probability from a bounded subset of $\mathbb{R}^M$) and that, at
time $t = 0$, $z$ takes a value in $\Theta$, according to the probabilities $p_0^k$, $k = 1, 2, \dots, K$.
Then, at times $t = 1, 2, \dots$ a time series $y_1, y_2, \dots, y_t, \dots$ is produced according
to the following equation:
$$y_t = F_z(y_{t-1}, y_{t-2}, \dots, y_{t-M}) + e_t, \qquad (2.1)$$
where $e_t$ is a Gaussian white noise time series, taking values in $\mathbb{R}$, with zero
mean and standard deviation $\sigma$, and $F_k(\cdot): \mathbb{R}^M \to \mathbb{R}$, $k = 1, 2, \dots, K$, are
appropriate functions. Hence $y_t$ will be a random variable and $y_1, y_2, \dots, y_t$
will be a stochastic process.
The time series classification task consists in using the observations $y_1, y_2,
\dots, y_t, \dots$ to obtain an estimate of $z$. More specifically, estimates $\hat{z}_t$ will be
computed for $t = 1, 2, \dots$; in other words, the additional information provided
by every incoming observation will be used to refine the estimate of $z$.
To obtain the $\hat{z}_t$'s we will use the conditional posterior probability $p_t^k(y_1, y_2,
\dots, y_t)$ (written more briefly as $p_t^k$), which is defined for $k = 1, 2, \dots, K$ and
$t = 1, 2, \dots$ by
$$p_t^k \doteq \Pr(z = k \mid y_1, y_2, \dots, y_t). \qquad (2.2)$$

If the $p_t^k$'s are known, a natural choice for $\hat{z}_t$ is the following:
$$\hat{z}_t = \arg\max_{k=1,2,\dots,K} p_t^k.$$
In other words, at time $t$ it is claimed that $y_1, y_2, \dots, y_t$ has been produced
by source $\hat{z}_t$, where $\hat{z}_t$ maximizes the posterior probability. This is called the
Maximum A Posteriori (MAP) estimate of $z$. Note that, while the source
variable is fixed, its estimate may be time varying.
The classification problem has now been reduced to computing $p_t^k$, for $t =
1, 2, \dots$ and $k = 1, 2, \dots, K$. This computation can be performed recursively.
The main result in this direction is the following theorem.

Theorem 2.1 (Computation of Posterior Probabilities) Suppose that $y_t$
is given by eq. (2.1) and for $k = 1, 2, \dots, K$ and $t = 1, 2, \dots$ we define
$$y_t^k \doteq F_k(y_{t-1}, y_{t-2}, \dots, y_{t-M}). \qquad (2.3)$$
Then the posterior probabilities $p_t^k$ evolve in time according to the following
equation:
$$p_t^k = \frac{e^{-|y_t - y_t^k|^2 / 2\sigma^2} \, p_{t-1}^k}{\sum_{j=1}^{K} e^{-|y_t - y_t^j|^2 / 2\sigma^2} \, p_{t-1}^j}. \qquad (2.4)$$

This theorem is proved rigorously in Appendix 2.A. The proof basically
consists of an application of Bayes' rule; however, there are some mathematical
details which require careful treatment. Let us present here a short and informal
"proof", which will provide useful motivation in the remainder of the chapter.
Denote the (time invariant) Gaussian probability density of $e_t$ by
$$G(e) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{|e|^2}{2\sigma^2}}.$$
Consider now the probability density of $y_t - F_z(y_{t-1}, y_{t-2}, \dots, y_{t-M})$, conditioned on $z = k$ and on the observed values of $y_{t-1}, y_{t-2}, \dots, y_{t-M}$. Since
$e_t = y_t - F_z(y_{t-1}, y_{t-2}, \dots, y_{t-M})$ and $F_z(y_{t-1}, y_{t-2}, \dots, y_{t-M})$ is determined (not
randomly) by $z$ and $y_{t-1}, y_{t-2}, \dots, y_{t-M}$, it may be expected that $y_t - F_z(y_{t-1},
y_{t-2}, \dots, y_{t-M})$ has the same probability density (under the above conditioning)
as $e_t$ (this claim is proved in Appendix 2.A). Also, since $z$ is a discrete valued

variable, its probability density function is the same as its probability function.
Then, for $k = 1, 2, \dots, K$ and $t = 0, 1, 2, \dots$ we have¹
$$p_t^k = \Pr(z = k \mid y_1, y_2, \dots, y_{t-1}, a) = \frac{d_{y_t,z}(a, k \mid y_1, y_2, \dots, y_{t-1})}{d_{y_t}(a \mid y_1, y_2, \dots, y_{t-1})}. \qquad (2.5)$$
The above equation is a form of Bayes' rule: it says that the conditional density
of $z$ is the same as the joint conditional density of $y_t$ and $z$, divided by the
conditional density of $y_t$ (where all conditioning is on $y_1, y_2, \dots, y_{t-1}$). A
rigorous derivation of eq. (2.5) is presented in Appendix 2.A. From eq. (2.5) we
can obtain (see Appendix 2.A)
$$p_t^k = \frac{d_{y_t,z}(a, k \mid y_1, y_2, \dots, y_{t-1})}{\sum_{j=1}^{K} d_{y_t,z}(a, j \mid y_1, y_2, \dots, y_{t-1})}. \qquad (2.6)$$
We also show in Appendix 2.A that
dyt,z(a, k I Yb Y2, ... , Yt-l) = d Yt (a I Yl, Y2, ... , Yt-b z = k).

dz(k I Yl, Y2, ... , Yt-l). (2.7)


Since $z$ is discrete valued, probability functions are equivalent to probability densities and we have

$$d_z(k \mid y_1, y_2, \ldots, y_{t-1}) = \Pr(z = k \mid y_1, y_2, \ldots, y_{t-1}) = p_{t-1}^k.$$

Therefore

$$d_{y_t,z}(a, k \mid y_1, y_2, \ldots, y_{t-1}) = d_{y_t}(a \mid y_1, y_2, \ldots, y_{t-1}, z = k) \cdot p_{t-1}^k. \qquad (2.8)$$

Now (2.6), (2.8) imply the recursion (for $k = 1, 2, \ldots, K$, $t = 0, 1, 2, \ldots$):

$$p_t^k = \frac{d_{y_t}(a \mid y_1, y_2, \ldots, y_{t-1}, z = k) \cdot p_{t-1}^k}{\sum_{j=1}^{K} d_{y_t}(a \mid y_1, y_2, \ldots, y_{t-1}, z = j) \cdot p_{t-1}^j}; \qquad (2.9)$$

all that is left to complete the recursion for $p_t^k$ is to compute the quantity $d_{y_t}(a \mid y_1, y_2, \ldots, y_{t-1}, z = k)$. In Appendix 2.A it is shown that

$$d_{y_t}(a \mid y_1, y_2, \ldots, y_{t-1}, z = k) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{|a - y_t^k|^2}{2\sigma^2}}; \qquad (2.10)$$

¹To denote probability densities of random variables, we use the following notation: the probability density of the random variable $x$ is denoted by $d_x(a)$; the joint probability density of the random variables $x, y$ is denoted by $d_{x,y}(a, b)$; the conditional probability density of the random variable $x$, given that $y = b$, is denoted by $d_x(a \mid y = b)$, or also by $d_x(a \mid b)$ or $d_x(a \mid y)$.
It must also be kept in mind that the probability density of a continuous valued random variable does not always exist, but if it does, then it satisfies the following relationship

$$d_x(a) = \frac{d}{da} \Pr(x \le a).$$

The reader should check the Mathematical Appendix A for a more detailed description of the above matters.
in other words, $y_t$, conditioned on $y_1, y_2, \ldots, y_{t-1}$ and $z = k$, has a Gaussian probability density with mean $y_t^k$ and standard deviation $\sigma$. (Extensions for vector valued $y_t$ are obvious.) Setting $a = y_t$ in eq.(2.10), substituting in eq.(2.9) and cancelling the $\frac{1}{\sqrt{2\pi}\sigma}$ terms, we obtain the desired posterior probabilities update equation:

$$p_t^k = \frac{p_{t-1}^k \, e^{-\frac{|y_t - y_t^k|^2}{2\sigma^2}}}{\sum_{n=1}^{K} p_{t-1}^n \, e^{-\frac{|y_t - y_t^n|^2}{2\sigma^2}}}. \qquad (2.11)$$

This completes the "proof" of the theorem. Finally, it is reasonable to set $\hat z_t$ equal to the number of the source with maximum posterior probability:

$$\hat z_t = \arg\max_{k \in \{1, 2, \ldots, K\}} p_t^k. \qquad (2.12)$$

Hence the classification method is summarized by eqs.(2.1), (2.3), (2.11) and (2.12).
This method has been introduced by Lainiotis in a general form in (Hilborn and Lainiotis, 1969a; Hilborn and Lainiotis, 1969b) and in the context of control and estimation in (Sims, Lainiotis and Magill, 1969; Lainiotis, 1971b); this latter version is called the Partition Algorithm. The method also appears in the context of pattern recognition in (Patrick, 1972).
In all of the above references, it is assumed that $d_{y_t}(a \mid y_1, y_2, \ldots, y_{t-1}, z = k)$ can be computed explicitly. We have also used this assumption in the previous paragraphs. In particular, assuming that $e_t$ is Gaussian and that the source functions $F_k(\cdot)$ are known, we have been able to obtain an explicit form for $d_{y_t}(a \mid y_1, y_2, \ldots, y_{t-1}, z = k)$, as described in eq.(2.10). However, our main interest is in a somewhat different point of view, as will be explained in the next section.

2.2 THE BASIC PREMONN CLASSIFICATION ALGORITHM

2.2.1 Phenomenological Motivation


The noise component of a time series will not, in general, be an additive Gaussian process. In addition, the form of the functions $F_k(\cdot)$ will generally be unknown to us. In short, the probabilistic analysis of the previous section is applicable only to a very restricted class of problems. However, the update equation (2.11) can also be interpreted in a nonprobabilistic, phenomenological manner, which is applicable to a large class of problems and can be generalized in various useful ways.
The phenomenological interpretation depends only on the observed behavior of the time series, does not require any probabilistic assumptions, and is based on the following simple observations. In eq.(2.11) the denominator is the same for all $p_t^k$; it serves simply to normalize the $p_t^k$ quantities and does not influence the choice of $\hat z_t$. The important part of the update is produced by
the numerator. Indeed, upon dividing the update equations for $p_t^k$ and for $p_t^m$, the denominators cancel and we obtain

$$\frac{p_t^k}{p_t^m} = \frac{p_{t-1}^k}{p_{t-1}^m} \cdot \frac{e^{-\frac{|y_t - y_t^k|^2}{2\sigma^2}}}{e^{-\frac{|y_t - y_t^m|^2}{2\sigma^2}}}. \qquad (2.13)$$

Eq.(2.13) shows that the likelihood ratio of two sources is updated at every time step according to the "prediction" error of each source; namely, sources with larger error become less likely to have produced the observations. However, the likelihood ratio at time $t$ also depends on the likelihood ratio at time $t - 1$. Hence, the operation of the update equation is essentially the following: at every time step eq.(2.11) penalizes more heavily sources with higher prediction error; but past performance of each source is also taken into account.
From the above analysis it is obvious that eq.(2.11) performs a "sensible" update of the posterior probabilities. In fact, the update is sensible even when the probabilistic assumptions are dropped. The only assumptions necessary to justify the update equation are the following.

1. The time series $y_1, y_2, \ldots$ is produced by some unknown source functions $F_k(\cdot)$, $k = 1, 2, \ldots, K$; a noise process (of unspecified characteristics) may distort the observations.

2. The number $K$ (number of possible sources and respective source functions) is known; for each $k$ a sufficiently long sample time series (generated by the $k$-th source) is available.

3. The $k$-th sample time series is used to train offline a sigmoid neural predictor $f_k(\cdot)$; in light of the well known universal approximation properties of sigmoid neural networks, it is reasonable to assume that, given a sufficient sample of the time series, $f_k(\cdot)$ approximates $F_k(\cdot)$.

These are phenomenological assumptions: they only relate to the observed behavior of the time series. If they hold true, then, by the analysis of Section 2.2.1, we may expect that a high value of $p_t^k$ indicates that the $k$-th predictor is a good model of the observed time series behavior and hence the $k$-th source is likely to have produced the time series. Let us repeat that no probabilistic assumptions are necessary to reach this conclusion. Consequently, there is no reason to use the term "posterior probabilities"; from now on we will refer to the $p_t^k$ quantities by the neutral term credit functions; a high credit value indicates that the respective source is likely to have generated the observed time series².
In short, the phenomenological approach can be summarized thus:

²Note that, if we have (for all $k$) $0 < p_0^k < 1$ and $\sum_{k=1}^K p_0^k = 1$, then we also have (for all $k$ and $t$) that $0 < p_t^k < 1$ and $\sum_{k=1}^K p_t^k = 1$.
an observed sample $y_t$ will be classified to the source/predictor which furnishes the prediction $y_t^k$ closest to $y_t$; however, not only the current prediction error $e_t^k$, but also previous ones, $e_{t-1}^k, e_{t-2}^k, \ldots$, must be taken into account.

2.2.2 The Algorithm


Now we can present an algorithm for the recursive, online computation of the credit functions. The algorithm is named the PREdictive MOdular Neural Network (PREMONN) Classification Algorithm and is implemented by the parallel operation of $K$ predictive neural modules.

Basic PREMONN Classification Algorithm

Initialization.

For $k = 1, 2, \ldots, K$ train (offline) sigmoid neural network predictors $f_k(\cdot)$.

At $t = 0$ choose initial values $p_0^k$ which are arbitrary, except for the fact that they satisfy

$$0 < p_0^k < 1, \qquad \sum_{k=1}^{K} p_0^k = 1. \qquad (2.14)$$

Main online phase.

For $t = 1, 2, \ldots$

For $k = 1, 2, \ldots, K$ compute

predictions:
$$y_t^k = f_k(y_{t-1}, y_{t-2}, \ldots, y_{t-M}); \qquad (2.15)$$

prediction errors:
$$e_t^k = y_t - y_t^k; \qquad (2.16)$$

credit functions:
$$p_t^k = \frac{p_{t-1}^k \, e^{-\frac{|e_t^k|^2}{2\sigma^2}}}{\sum_{n=1}^{K} p_{t-1}^n \, e^{-\frac{|e_t^n|^2}{2\sigma^2}}}. \qquad (2.17)$$

Next $k$.

At time $t$ classify the entire time series to the source no. $\hat z_t$, where
$$\hat z_t = \arg\max_{k = 1, 2, \ldots, K} p_t^k.$$

Next $t$.
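To make the loop concrete, the main online phase can be sketched in a few lines of NumPy. This is our own illustration, not the authors' implementation: the predictors are plain Python callables standing in for offline-trained sigmoid networks (here, for simplicity, the two toy "sources" are exact logistic maps), and the function name `premonn_classify` is an assumption of this sketch.

```python
import numpy as np

def premonn_classify(y, predictors, M, sigma=0.15, p0=None):
    """Basic PREMONN classification (eqs. 2.14-2.17): each predictor
    forecasts y[t] from the last M samples; credits are updated from the
    squared prediction errors and the series is classified by eq. (2.12)."""
    K = len(predictors)
    p = np.full(K, 1.0 / K) if p0 is None else np.asarray(p0, dtype=float)
    labels = []
    for t in range(M, len(y)):
        window = y[t - M:t]                                  # y[t-M], ..., y[t-1]
        y_hat = np.array([f(window) for f in predictors])    # eq. (2.15)
        e = y[t] - y_hat                                     # eq. (2.16)
        w = p * np.exp(-e**2 / (2.0 * sigma**2))             # eq. (2.17), numerator
        p = w / w.sum()                                      # normalization
        labels.append(int(np.argmax(p)))                     # classification
    return labels, p

# Toy sources: logistic maps with different parameter a; the "predictors"
# here are the exact source functions, standing in for trained networks.
f1 = lambda w: 4.0 * w[-1] * (1.0 - w[-1])
f2 = lambda w: 3.5 * w[-1] * (1.0 - w[-1])

y = [0.3]
for _ in range(120):                     # series generated by source no.1
    y.append(4.0 * y[-1] * (1.0 - y[-1]))
labels, p = premonn_classify(np.array(y), [f1, f2], M=1)
```

Since the series is generated by the first source, the credit of predictor no.1 converges towards one and the series is classified to source no.1.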
We have introduced this algorithm in (Petridis and Kehagias, 1996). The motivation for this algorithm has been probabilistic, but its justification is based solely on the phenomenological approach. However, before the PREMONN algorithm can be used with confidence, some analysis of its behavior is necessary. Let us present an informal argument here; a more rigorous analysis will be presented in Chapter 5. It has already been remarked that the operation of the update equation eq.(2.17) results in the following ratio of credit functions:

$$\frac{p_t^k}{p_t^m} = \frac{p_{t-1}^k}{p_{t-1}^m} \cdot \frac{e^{-\frac{|e_t^k|^2}{2\sigma^2}}}{e^{-\frac{|e_t^m|^2}{2\sigma^2}}}. \qquad (2.18)$$

This ratio at time $t$ is equal to the same ratio at time $t-1$, multiplied by the term $e^{-\frac{|e_t^k|^2}{2\sigma^2}} / e^{-\frac{|e_t^m|^2}{2\sigma^2}}$, which expresses the relative error of the $k$-th and $m$-th predictors. More specifically, if the $m$-th predictor has higher error than the $k$-th one, the exponential term is greater than one, and the ratio of the $k$-th credit to the $m$-th one increases from time $t-1$ to time $t$; the reverse situation holds in case the $m$-th predictor has lower error than the $k$-th one. Hence the credit ratio evolves dynamically in time, depending on past values of itself (feedback) as well as on the currently observed prediction errors. By repeatedly applying eq.(2.18) for times $t-1, t-2, \ldots, 1$, we obtain

$$\frac{p_t^k}{p_t^m} = \frac{p_0^k}{p_0^m} \cdot \frac{e^{-\frac{\sum_{s=1}^{t} |e_s^k|^2}{2\sigma^2}}}{e^{-\frac{\sum_{s=1}^{t} |e_s^m|^2}{2\sigma^2}}}. \qquad (2.19)$$

It now becomes obvious that the source with highest credit at time $t$ is the one with minimum cumulative prediction error.
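The closed form of eq.(2.19) can be checked numerically: repeatedly applying the normalized update of eq.(2.17) leaves the credit ratio equal to the initial ratio times the exponential of the scaled difference of cumulative squared errors. The snippet below is a toy verification with synthetic error sequences of our own choosing; it is not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
e = rng.normal(size=(2, 30))          # synthetic error sequences |e_s^k|
p = np.array([0.5, 0.5])              # equal initial credits p_0^1 = p_0^2

for s in range(e.shape[1]):           # repeated application of eq. (2.17)
    w = p * np.exp(-e[:, s] ** 2 / (2 * sigma**2))
    p = w / w.sum()

# closed form of eq. (2.19); the initial ratio p_0^1 / p_0^2 is 1 here,
# so the final ratio depends only on the cumulative squared errors
ratio_closed = np.exp(-(np.sum(e[0] ** 2) - np.sum(e[1] ** 2)) / (2 * sigma**2))
```

Up to floating point round-off, `p[0] / p[1]` and `ratio_closed` coincide.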
The above arguments are, of course, quite informal. A more solid justification of the use of eq.(2.17) and its variants will be given in Chapter 5, where it will be proved (for a variety of credit update algorithms) that: under very mild assumptions on the time series $y_t$, if the $k$-th source has "smallest prediction error", then $\lim_{t\to\infty} p_t^k = 1$ and $\lim_{t\to\infty} p_t^m = 0$ for all $m \neq k$. The exact meaning of "smallest prediction error", as well as the exact sense of convergence, will be described in Chapter 5.
In conclusion, let us list some of the advantages of the phenomenological point of view. First, as already mentioned, it renders assumptions about the nature of the time series unnecessary. Second, it allows a simple treatment of switching sources, as will be seen in the next section. Third, it allows the introduction of variant classification algorithms; we have found such variants (a number of which will be presented in Chapter 4) to be, in some instances, more efficient than the basic algorithm presented above.
Figure 2.1. Desired behavior of credit functions in a source switching situation.

[Plot: credit functions no.1 and no.2 plotted against time steps 0-375.]

2.3 SOURCE SWITCHING AND THRESHOLDING


It has already been mentioned that the probabilistic derivation of the basic PREMONN algorithm is based on the assumption that the time series is produced by a single source. In many cases this assumption will be violated. For instance, consider a speech time series that consists of several distinct phonemes. Each phoneme is generated by a different source and this results in different portions of the time series having quite distinct characteristics. We refer to this phenomenon as source switching. As already argued, using the PREMONN algorithm in source switching situations makes sense from the phenomenological point of view. This claim will be further supported in later sections, using both experimental evidence and mathematical reasoning. However, a modification is necessary before the PREMONN algorithm can be applied to switching source situations. To illustrate this point, let us consider a specific example. Suppose that the initial segment of the time series is generated by a fixed source, say source no.1, and at time $t_s$ there is a switch to source no.2. The desired behavior close to the source switching time $t_s$ is the following. Prior to the switching time $t_s$, $p_t^1$ is close to one and the remaining $p_t^k$'s are close to zero. As we pass the switch time $t_s$, $p_t^1$ starts decreasing and $p_t^2$ starts increasing, until finally we have $p_t^2$ very close to one and the remaining $p_t^k$'s close to zero, something like the behavior depicted in Figure 2.1.
The problem with obtaining this behavior is that, before time $t_s$, $p_t^1$ is very close to 1 and $p_t^k$, $k = 2, 3, \ldots, K$, very close to 0. After $t_s$, $p_t^2$ starts increasing; theoretically, if the credit functions are updated for a long enough time, $p_t^2$
will become 1. But, since before $t_s$ $p_t^2$ is very close to 0, after $t_s$ it starts from a very unfavorable initial condition. Therefore it is most likely that a new source switch will take place before $p_t^2$ becomes sufficiently large. In such a case, classification to the second source fails. In the extreme case, because of numerical computer underflow $p_t^2$ is set to zero before $t_s$; referring to eq.(2.17) we observe that $p_t^2$ will then remain 0 for all subsequent time steps. To resolve this problem, whenever $p_t^k$ falls below a specified threshold $h$, it is reset to $h$. Then the usual normalization of the $p_t^k$'s is performed; this ensures that the thresholded $p_t^k$'s remain approximately within the $[h, 1]$ range and add to 1. In essence, this thresholding is equivalent to introducing a forgetting factor; the argument to show this goes as follows. Suppose that several samples of the time series are observed, which have not been produced by the $k$-th source. For each such sample, the $k$-th predictor produces a large error and, as can be seen from eq.(2.17), $p_t^k$ is multiplied by a number close to zero. If this process continued for several time steps, $p_t^k$ would soon become zero, as explained above. If we never let $p_t^k$ go below $h$, we essentially stop penalizing predictor $k$ for further bad predictions; these are, in effect, "forgotten". If $h$ is small, then $p_t^k$ will also be small and will not essentially alter the classification results. On the other hand, when source no.$k$ becomes active, $p_t^k$ can recover quickly. An alternative way of looking at the use of the threshold $h$ is that whenever one or more $p_t^k$'s fall below $h$ (because the corresponding predictors perform poorly) we restart the algorithm using new initial values for the credit functions, obtained by resetting the corresponding $p_t^k$'s to $h$. Under this interpretation, our prior belief in any of the source models never goes below $h$. In the experiments we present in Section 2.6, we always chose $h = 0.01$; this choice is arbitrary but consistent and gives good results.
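The threshold/reset rule amounts to a two-line modification of the credit update. The sketch below is our own illustration (the function name is ours); it shows a credit that has collapsed to 0.001 recovering within a few steps once its predictor starts fitting the data, while the losing credit is held at a floor of roughly $h$.

```python
import numpy as np

def update_credits(p, e, sigma=0.2, h=0.01):
    """One PREMONN credit update (eq. 2.17) with the threshold/reset rule
    of Section 2.3: credits below h are reset to h, then renormalized."""
    w = p * np.exp(-np.asarray(e, dtype=float) ** 2 / (2 * sigma**2))
    p = w / w.sum()
    p = np.maximum(p, h)        # never let a credit fall below the floor
    return p / p.sum()          # renormalize so the credits add to 1

# source no.2 becomes active: predictor 1 keeps failing, predictor 2 fits
p = np.array([0.999, 0.001])
for _ in range(5):
    p = update_credits(p, e=[0.8, 0.05])
```

After a handful of steps the second credit dominates, while the first one sits near $h$, ready to recover at the next switch.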

We will further consider the possibility of probabilistic and phenomenological modeling of the source switching process in Chapter 4. Some interesting variants of the PREMONN algorithm result from such considerations; in particular a connection to a species of hidden Markov models (Rabiner, 1989) is established. However, we have found that the thresholding method presented here usually yields superior classification results, so that the added complexity of modeling the source switching process is not justified or necessary (at least as far as the particular Markovian model is concerned).

2.4 IMPLEMENTATION AND VARIANTS OF THE PREMONN ALGORITHM

In this section we discuss some issues which are related to the implementation
of the PREMONN algorithm and also introduce some possible variations of the
basic algorithm. These issues are treated here briefly; a detailed presentation of
variant algorithms appears in Chapter 4 and a full discussion of implementation
issues (for the standard and variant algorithms) appears in Chapter 7.
Figure 2.2. The PREMONN architecture.

[Diagram: the input $y_t$ feeds $K$ predictor modules in parallel; their outputs feed a credit assignment module producing $p_t^1, p_t^2, \ldots, p_t^K$.]

2.4.1 PREMONN Architecture


The PREMONN algorithm has two components: prediction and credit assignment. This structure corresponds to the hierarchical network implementation illustrated in Fig. 2.2.
The left side of the network corresponds to partition, which is implemented on the basis of prediction. One predictor corresponds to each source in the source set. The predictor modules can be implemented as neural networks. Training of the bottom level is performed offline. The right side of the network corresponds to recursive credit assignment according to eq.(2.17). Credit assignment can be seen as a form of adaptive, learning process.
The network is also modular, in the strong sense that both the predictive and credit assignment modules are interchangeable. For instance, after setting up and operating a PREMONN, one may choose to retain some of the predictive modules, which have performed well, and replace some others with new modules of the same type, or even of a different type. Of course, one may decide to replace the credit assignment scheme instead, while retaining the original predictors. Hence PREMONNs are truly modular systems.

2.4.2 Parallelism and Scaling


It is well known that lumped neural networks scale badly. As the number of their parameters increases, training may take exponentially long time and/or fail to achieve a good solution. Hence, modular networks, where fixed size modules are trained in a piecewise manner, are highly desirable. PREMONNs fall in this category. Clearly, both network size and training time scale linearly with the number of sources (categories) that must be learned. Also, PREMONNs perform well even when the individual prediction modules have poor performance, as long as they are clearly separated in the parameter space: the decision module will simply pick the "least bad" prediction module. In particular, PREMONN is immune to a high level of noise in the data (in this connection see also Section 2.6). Finally, the modularity of the PREMONN algorithm introduces parallelism naturally. Prediction modules can execute in parallel and send the results to the decision module. Hence execution time is independent of the number of classes in the classification problem.

2.4.3 Variance and Threshold

Two parameters influence the performance of the basic PREMONN algorithm: $\sigma^2$ (the variance of the prediction error) and $h$ (the credit threshold). These parameters are connected to each other, as will now be explained.
For correct classification to take place, it is desired that the credit functions converge, either to one (for the true source) or to zero (for the remaining sources). As can be seen from eq.(2.19), convergence depends on the cumulative square error and on the error variance $\sigma^2$. In particular, if the cumulative square errors of all sources are considered fixed, a smaller value of $\sigma^2$ results in faster convergence. Intuitively, larger variance means that less information is generated per observation, so more observations must be collected to reach a certain level of confidence. $\sigma^2$ can be estimated during the predictor training phase, but we found by computer experimentation that sometimes it is advantageous to modify this estimate. For faster classification, the variance must be decreased.
The variance $\sigma^2$ is connected to the threshold $h$ (which was introduced in Section 2.3 in connection to source switching). In particular, both $\sigma^2$ and $h$ are related to a speed/accuracy trade-off. Large variance slows the network down and assigns little importance to individual errors; small variance speeds the network up, but also assigns more importance to instantaneous fluctuations and makes the network more prone to instantaneous classification errors. Similarly, a low threshold (or absence of threshold) incurs on the credit functions a large recovery time between source switchings. On the other hand, a high threshold tends to obliterate the significance of past performance and makes the response of the network to source switchings faster, at the cost of spuriously interspersed false classifications.

2.4.4 Predictor Variants

In addition to variance and threshold, the performance of the PREMONN algorithm is influenced by the choice of predictors. We have so far assumed that the predictor functions $f_k(\cdot)$ are sigmoid feedforward neural networks; but it is clear that any other functional form can be used. We have experimented with linear, polynomial and RBF (radial basis function) predictors; other choices include (but are not limited to) splines, recurrent neural networks, state space
Kalman filters etc. In fact, several different predictor types can be used within
the same PREMONN.
In general, the probabilistic analysis presented in Section 2.1 breaks down in case the source functions $F_k(\cdot)$ are different from the predictor functions $f_k(\cdot)$. This, however, presents no difficulty in the phenomenological framework, where the functions $F_k(\cdot)$ can be completely ignored; in this context it is only required that predictions $y_t^k$ are available for $k = 1, 2, \ldots, K$ and $t = 1, 2, \ldots$. As soon as the $y_t^k$ are available, no matter how they were generated, credit assignment can take place. Hence it can be seen that the phenomenological point of view yields considerable freedom in the design of PREMONNs.
After a particular type of predictor is chosen, values must be selected for various predictor parameters; for instance, in a sigmoid feedforward neural network, the number of layers and neurons, as well as the predictor order $M$. The usual common sense rules should be applied to the selection of such parameters. However, it must be stressed that such choices are not crucial for the performance of the algorithm, because PREMONNs are particularly robust to faulty predictions. This is easily understood by considering eq.(2.19) once again. It can be seen that what matters in credit assignment is not absolute, but relative prediction performance. Consider, for instance, the case where the $k$-th predictor is tuned to the currently active source, but is not well trained. In this case we may expect the errors $e_t^k$ to be large, but consistently smaller than those of the remaining predictors. Considering eq.(2.19), it is clear that in this case $p_t^k$ will dominate $p_t^m$ for $m \neq k$, resulting in correct classification. This observation is corroborated by the experiments presented in Section 2.6, as well as by the theoretical analysis of Chapter 5.
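The claim that only relative prediction performance matters can be illustrated numerically. In the toy setup below (our own, not from the text), both predictors have large errors, but predictor no.1 is consistently slightly better, and its credit still converges to one under eq.(2.17).

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
p = np.array([0.5, 0.5])
# both predictors are poor (errors near 1), but no.1 is consistently
# a little better than no.2; only the difference matters
e1 = 1.0 + 0.1 * rng.standard_normal(200)
e2 = 1.2 + 0.1 * rng.standard_normal(200)
for a, b in zip(e1, e2):
    w = p * np.exp(-np.array([a, b]) ** 2 / (2 * sigma**2))
    p = w / w.sum()
```

Despite both error sequences being far from zero, the credit of the relatively better predictor ends up dominating.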

2.4.5 Credit Assignment Variants

Regarding credit assignment, there is nothing special about the Gaussian error function $\frac{1}{\sqrt{2\pi}\sigma} \, e^{-\frac{|e_t^k|^2}{2\sigma^2}}$ we have used in Sections 2.1 and 2.2. Suppose a function of the form $e^{-g(|e_t^k|)}$ is used, where $g(\cdot)$ is a strictly positive, increasing function. Then the following update equation results:

$$p_t^k = \frac{p_{t-1}^k \, e^{-g(|e_t^k|)}}{\sum_{n=1}^{K} p_{t-1}^n \, e^{-g(|e_t^n|)}}. \qquad (2.20)$$

This is a meaningful credit assignment scheme; it yields the credit ratio

$$\frac{p_t^k}{p_t^m} = \frac{p_{t-1}^k}{p_{t-1}^m} \cdot \frac{e^{-g(|e_t^k|)}}{e^{-g(|e_t^m|)}}, \qquad (2.21)$$

which again penalizes more heavily predictors with higher error, while keeping track of past predictor performance. Credit update schemes of the form (2.20) are termed multiplicative schemes (for obvious reasons).
The cumulative squared prediction error can also be used directly, rather than in its exponential form; in this case we can obtain credit update schemes of the form

$$p_t^k = p_{t-1}^k + g(|e_t^k|), \qquad (2.22)$$

where $g(\cdot)$ is a strictly positive and increasing function. In this case, however, we lose several attractive features of the credit function. In particular, the properties $p_t^k < 1$ and $\sum_{k=1}^K p_t^k = 1$, for all $k$ and $t$, do not hold any longer. In addition, $p_t^k$ now describes the discredit, rather than the credit, of the $k$-th model. We call such schemes additive.
We have also implemented incremental credit assignment schemes (which resemble a steepest ascent procedure) and schemes which implement fuzzy reasoning (by using the minimum operator in place of the product and the maximum operator in place of the sum). These schemes will be presented in Chapter 4. Note that the variants presented previously refer to the credit assignment scheme and are completely independent of the type of predictive modules used. Such variants will be examined in Chapter 4.
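A multiplicative scheme of the form (2.20) is a one-line change to the basic update: the Gaussian case is recovered with $g(a) = a^2/2\sigma^2$, while e.g. $g(a) = a$ gives an absolute-error variant. The sketch below is our own illustration of the latter choice, applied to constant errors.

```python
import numpy as np

def multiplicative_update(p, e, g):
    """Multiplicative credit scheme of eq. (2.20): g is an increasing
    penalty function, so larger errors are penalized more heavily."""
    w = p * np.exp(-g(np.abs(np.asarray(e, dtype=float))))
    return w / w.sum()

p = np.array([0.5, 0.5])
for _ in range(10):
    # g(|e|) = |e|: an absolute-error variant of the Gaussian scheme
    p = multiplicative_update(p, e=[0.3, 0.6], g=lambda a: a)
```

With a constant error gap of 0.3 per step, the credit ratio grows by $e^{0.3}$ at each update, so after ten steps the better predictor holds roughly 95% of the credit.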

2.5 PREDICTION

A prediction algorithm can be easily obtained from the PREMONN classification algorithm. The key idea is predictor combination, a methodology which has become increasingly popular in the last decade (Clemen, 1989; Farmer and Sidorowich, 1988; Perrone and Cooper, 1993; Quandt, 1958; Quandt and Ramsey, 1978; Tong and Lim, 1980). Generally speaking, predictor combination consists in the use of a collection of predictors (operating in parallel) in conjunction with a predictor selection mechanism; the final outcome is a prediction $\hat y_t$ of $y_t$.
Several predictor selection mechanisms have appeared in the literature. The PREMONN architecture is particularly suitable for a predictor combination approach. The predictor collection (i.e. $y_t^k = f_k(y_{t-1}, y_{t-2}, \ldots, y_{t-M})$, $k = 1, 2, \ldots, K$) is available as an integral component of the PREMONN; furthermore prediction combination can be effected in a natural manner by use of the credit functions $p_t^k$. We have experimented with the following two combination methods.

1. Weighted Combination: $\hat y_t = \sum_{k=1}^K p_{t-1}^k \, y_t^k$.

2. Winner-take-all Combination: $\hat y_t = y_t^{\hat z_t}$, where $\hat z_t = \arg\max_{k=1,2,\ldots,K} p_{t-1}^k$.

The rationale behind these combination methods is quite obvious. For instance, the weighted combination method generates the combined prediction $\hat y_t$ by assigning greater importance to predictors of higher credit, i.e. to predictors which are consistently characterized by relatively small prediction error. The winner-take-all method can be seen as a limiting case of the weighted combination, where all importance is assigned to the currently best predictor.
Our experience indicates that, while each of the two predictor combination methods may be advantageous in particular situations, neither method has a clear, universal advantage over the other. At any rate, both methods give quite satisfactory prediction results.
Finally, it is worth noting that in both the weighted and winner-take-all methods, the combined prediction $\hat y_t$ depends only on quantities ($p_{t-1}^k$, $y_t^k$ for $k = 1, 2, \ldots, K$) which can be computed at time $t-1$.
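Given the credits and the module predictions, both combination rules are one-liners; the helper below is our own sketch (its name and the `mode` parameter are assumptions of this illustration).

```python
import numpy as np

def combine(p_prev, y_hat, mode="weighted"):
    """Combined prediction of Section 2.5 from credits p_{t-1}^k and
    module predictions y_t^k, both available at time t-1."""
    p_prev = np.asarray(p_prev, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    if mode == "weighted":
        return float(p_prev @ y_hat)        # sum_k p_{t-1}^k * y_t^k
    return float(y_hat[np.argmax(p_prev)])  # winner-take-all

y_weighted = combine([0.7, 0.2, 0.1], [0.50, 0.80, 0.20], mode="weighted")
y_winner = combine([0.7, 0.2, 0.1], [0.50, 0.80, 0.20], mode="winner")
```

Here the weighted rule returns 0.53, while winner-take-all returns 0.50, the prediction of the highest-credit module.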

2.6 EXPERIMENTS

In this section we apply the basic PREMONN Classification Algorithm to several time series classification tasks, using computer synthesized logistic, Mackey-Glass and sequential logic gates time series. We explore the effect of varying parameter values and observation noise levels on classification accuracy.

2.6.1 Logistic and Noise Time Series

Our first set of experiments deals with the problem of discriminating logistic and noise time series. We have chosen these data because of the statistical and visual resemblance of the logistic and noise time series, which makes classification of such time series a nontrivial problem.
The logistic time series is produced by the difference equation

$$y_t = a \cdot y_{t-1} \cdot (1 - y_{t-1}). \qquad (2.23)$$

For values of $a$ greater than 3.67, eq.(2.23) yields a chaotic time series (see Figs. 2.3, 2.4 and 2.5).
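For reference, eq.(2.23) is straightforward to simulate; the minimal generator below is our own code, not from the text.

```python
def logistic_series(a, y0, n):
    """Generate n samples of the logistic time series of eq. (2.23)."""
    y = [y0]
    for _ in range(n - 1):
        y.append(a * y[-1] * (1.0 - y[-1]))
    return y

# a = 4.0 lies in the chaotic regime used in the experiments below
series = logistic_series(a=4.0, y0=0.3, n=200)
```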
A total of twelve predictor modules have been trained. All the predictors used are 18-5-1 sigmoid feedforward neural networks. The 18 inputs are the values $y_{t-1}, \ldots, y_{t-18}$ and the target output is $y_t$. Eleven of the predictors have been trained on sample logistic time series, generated according to eq.(2.23), with $a = 3.0, 3.1, \ldots, 4.0$. The mean square error of these predictors varies, but is generally between 0.1 and 0.3. The twelfth neural network predictor is trained on a Gaussian white noise time series, with mean $\mu_w = 0.50$ and standard deviation $\sigma_w = 0.25$. In this case the mean square error of the predictor is 0.3.³
Let us now present a few representative experiments. In the first experiment a test time series is generated, using eq.(2.23), $a = 4.0$ and different initial conditions than the ones used to generate the training time series. 182 samples of the time series have been generated.⁴ A PREMONN employing two predictors has been used: the first predictor is the one trained on the logistic with $a = 4.0$, and the second is the one trained on white noise. The PREMONN is required to discover that the actual time series is logistic rather than noise. The algorithm

³Note that 0.25, the noise variance, is the minimum theoretically attainable MSE, which can be attained by using the constant predictor $y_t = 0.5$.
⁴Actually 200 samples, of which the first 18 are used as initial values for the sigmoid predictors.
Figure 2.3. Credit function evolution for a classification experiment involving a logistic
time series and two predictors.

[Plot: the logistic time series and the credit function of the logistic predictor over 175 time steps.]

parameters are $\sigma = 0.15$, $h = 0.01$. The results of this experiment are presented in Fig. 2.3. It can be seen that classification to the logistic is correct and very fast.
In the second experiment a composite time series has been used. The first half of the time series consists of 82 samples of a logistic time series with $a = 4.0$; the second half consists of 100 samples of white noise with mean $\mu_w = 0.5$ and standard deviation $\sigma_w = 0.25$. The PREMONN employs the same two predictor modules used in the previous experiment; $\sigma = 0.20$, $h = 0.01$. The task is to classify the first 82 samples as belonging to a logistic time series and the next 100 as being white noise. The results of this experiment are presented in Fig. 2.4. In the beginning of the time series, classification to the logistic is almost instantaneous. Then at the switching point $t_s = 82$, a very quick switch to the noise module is observed.
In the third experiment ten predictor modules are used, corresponding to logistics with $a = 3.0, 3.1, \ldots, 3.9$. 200 time steps of a test logistic have been generated with $a = 3.8$ (and new initial conditions); $\sigma = 0.15$, $h = 0.01$. The results of this experiment are presented in Fig. 2.5. It can be seen that classification to the true logistic is very fast. This demonstrates PREMONN's ability to deal with a large number of sources.
In order to evaluate the dependence of classification accuracy on the $\sigma$ and $h$ parameters, as well as on the level of noise in the observations of $y_t$, we have performed many additional experiments, which are presented in tabular form.
Figure 2.4. Credit function evolution for a classification experiment involving a time series with two components (logistic and noise) and two predictors. Switching time $t_s = 82$.

[Plot: the composite time series and credit functions no.1 and no.2 over 175 time steps.]

Figure 2.5. Credit function evolution for a classification experiment involving a logistic
time series and ten predictors.

[Plot: the logistic time series and credit function no.9 over 200 time steps.]
Table 2.1. Classification accuracy c for various values of $\sigma$. Dataset A is a logistic time series; dataset B is a composite (logistic and noise) time series.

    Dataset   $\sigma_v$   $\sigma$   h       c
    A         0.000        0.010      0.010   0.958
    A         0.000        0.050      0.010   0.994
    A         0.000        0.100      0.010   0.994
    A         0.000        0.200      0.010   0.989
    A         0.000        0.300      0.010   0.989
    B         0.000        0.010      0.010   0.897
    B         0.000        0.050      0.010   0.895
    B         0.000        0.100      0.010   0.961
    B         0.000        0.200      0.010   0.961
    B         0.000        0.300      0.010   0.934

In particular, classification results for various values of $\sigma$ are presented in Table 2.1 and results for various values of $h$ are presented in Table 2.2. In Table 2.3 we present results for experiments involving various levels of observation noise; in other words, in these experiments additive, white, zero mean Gaussian noise with standard deviation $\sigma_v$ is added to the observations. Classification performance is measured by a classification figure of merit $c$. This is computed by dividing the number of correctly classified $y_t$ samples by the total number of samples, i.e. $c = N_1/N_0$, where $N_0$ is the length of the time series, and $N_1$ is the number of time steps where the true source is discovered. Two datasets have been used: dataset A consists of 182 samples of a logistic time series with $a = 4.0$; dataset B consists of 82 logistic samples (with $a = 4.0$) followed by 100 samples of Gaussian white noise with $\mu_w = 0.50$ and $\sigma_w = 0.25$.
It can be seen from the above tables that the overall classification perfor-
mance is very good. PREMONN exhibits considerable noise robustness; for
very high noise levels a graceful degradation of performance is observed. This
illustrates the point previously discussed, namely that the performance of the
PREMONN architecture depends not on the absolute, but on the relative per-
formance of the prediction modules. Equally important is the fact that PRE-
MONN performance is relatively independent of the specific σ and h values.
This implies that fine tuning of the algorithm is not particularly important.
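The experiment above can be sketched in code. This is a minimal illustration, not the implementation used in the book: the logistic map is assumed to be known exactly (standing in for a trained predictor), the second predictor simply outputs the noise mean 0.50, and the seed, initial condition and credit-flooring scheme are illustrative assumptions.

```python
import numpy as np

def premonn_credits(y, predictors, sigma=0.1, h=0.01):
    """Recursive multiplicative credit update with flooring at threshold h."""
    K = len(predictors)
    p = np.full(K, 1.0 / K)                      # uniform initial credits
    history = []
    for t in range(1, len(y)):
        preds = np.array([f(y[t - 1]) for f in predictors])
        p = p * np.exp(-(y[t] - preds) ** 2 / (2.0 * sigma ** 2))
        p = np.maximum(p / p.sum(), h)           # normalize, then floor at h
        p = p / p.sum()                          # renormalize after flooring
        history.append(p.copy())
    return np.array(history)

# Dataset B of Table 2.1: 82 logistic samples (a = 4.0) followed by 100 noise samples.
logistic = lambda x: 4.0 * x * (1.0 - x)         # source no. 1 (exact predictor)
noise_mean = lambda x: 0.5                       # source no. 2 (predicts the noise mean)

rng = np.random.default_rng(0)                   # seed is an illustrative assumption
y = np.empty(182)
y[0] = 0.3                                       # initial condition (assumed)
for t in range(1, 82):
    y[t] = logistic(y[t - 1])
y[82:] = 0.5 + 0.25 * rng.standard_normal(100)   # switch to Gaussian noise at t_s = 82

credits = premonn_credits(y, [logistic, noise_mean])
true_src = np.array([0] * 81 + [1] * 100)        # active source at t = 1, ..., 181
c = np.mean(credits.argmax(axis=1) == true_src)  # figure of merit c = N_1 / N_0
```

Classifying each time step to the source with maximum credit and comparing against the true switching sequence yields the figure of merit c directly.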

2.6.2 Mackey-Glass Time Series


The classification task in the second set of experiments involves two Mackey-
Glass time series. These are generated according to the following equation

dψ/ds = 0.2 · ψ(s − τ) / (1 + ψ^10(s − τ)) − 0.1 · ψ(s).   (2.24)

Table 2.2. Classification accuracy c for various values of h. Dataset A is a logistic time
series; dataset B is a composite (logistic and noise) time series.

Dataset   σ_v     σ       h       c
A         0.000   0.100   0.001   0.994
A         0.000   0.100   0.010   0.994
A         0.000   0.100   0.100   0.994
A         0.000   0.100   0.200   0.994
B         0.000   0.100   0.001   0.954
B         0.000   0.100   0.010   0.961
B         0.000   0.100   0.100   0.989
B         0.000   0.100   0.200   0.994

Table 2.3. Classification accuracy c for noisy observation of the time series. Dataset A is
a logistic time series; dataset B is a composite (logistic and noise) time series. Both time
series are mixed with additive white noise (observation noise) with zero mean and standard
deviation σ_v.

Dataset   σ_v     σ       h       c
A         0.000   0.100   0.010   0.994
A         0.050   0.150   0.010   0.994
A         0.100   0.200   0.010   0.994
A         0.200   0.300   0.010   0.989
A         0.300   0.400   0.010   0.961
A         0.500   0.600   0.010   0.741
B         0.000   0.100   0.010   0.961
B         0.050   0.150   0.010   0.928
B         0.100   0.200   0.010   0.873
B         0.200   0.300   0.010   0.879
B         0.300   0.400   0.010   0.522
B         0.500   0.600   0.010   0.516

Using first τ = 10 and then τ = 17 we obtain two chaotic time series. More
specifically, for each value of τ we integrate eq. (2.24) and then sample at times
s = 5, 10, 15, ... secs to obtain a time series y_1, y_2, ..., where y_t = ψ(5 · t), t = 1,
2, ... . We use the above data to train two sigmoid feedforward predictors
(both of size 5-5-1). Input is y_{t−1}, y_{t−2}, ..., y_{t−5} and target output is y_t. The
mean square prediction error is (for both predictors) approximately 0.04.
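The data generation step can be sketched as follows: a simple Euler integration of eq. (2.24) with a delay buffer, sampled every 5 seconds as in the text. The step size dt, the constant initial history and the initial value 1.2 are illustrative assumptions, not choices stated in the book.

```python
import numpy as np

def mackey_glass(tau, n_samples, dt=0.1, sample_every=50, psi0=1.2):
    """Euler integration of eq. (2.24), sampled every sample_every*dt = 5 secs."""
    delay = int(round(tau / dt))
    total = n_samples * sample_every + delay
    psi = np.full(total, psi0)                 # constant initial history (assumed)
    for i in range(delay, total - 1):
        lagged = psi[i - delay]
        psi[i + 1] = psi[i] + dt * (0.2 * lagged / (1.0 + lagged ** 10)
                                    - 0.1 * psi[i])
    # y_t = psi(5 t): take every sample_every-th integration point
    return psi[delay::sample_every][:n_samples]

y17 = mackey_glass(tau=17, n_samples=200)
y10 = mackey_glass(tau=10, n_samples=200)
```

Each returned array gives one training (or test) series; the two values of τ produce visibly different chaotic trajectories.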

Figure 2.6. Credit function evolution for a classification experiment involving a Mackey-
Glass time series.
[Plot: solid line = time series; dashed = credit function. Horizontal axis: time steps (0-150); vertical axis: credit function (0.00-1.50).]

In the first classification experiment we use 200 samples of a Mackey-Glass
time series generated with τ = 17. The PREMONN employs the two predictors
previously trained. The task is to classify the observed time series as having τ
= 17. We use σ = 0.040, h = 0.01. The results of this experiment are presented
in Fig. 2.6. It can be seen that classification is correct and very fast.
In the second experiment a composite Mackey-Glass time series has been
used. The first 200 time steps have τ = 17 and the last 200 time steps have
τ = 10. Also, Gaussian, zero-mean white noise with standard deviation σ_w =
0.3 has been added to the data. The PREMONN employs the same predictors
as in the previous experiment; σ = 0.04, h = 0.01. The results of this experiment
are presented in Fig. 2.7. In the beginning of the time series, classification to
the τ = 17 source is almost instantaneous. Then at the switching point t_s = 200, a
very quick switch to the τ = 10 source is observed.
We have performed many additional experiments, which we present in tab-
ular form. Two datasets are used: dataset A consists of 200 samples of a
Mackey-Glass time series with τ = 17; dataset B consists of 200 samples of a
Mackey-Glass time series with τ = 10 followed by 200 samples of a Mackey-
Glass time series with τ = 17. Classification performance is measured by c,
which is computed in the same manner as in the previous section. In Table 2.4
we experiment with σ values, in Table 2.5 with h values, and in Table 2.6 with
observation noise levels.
From the above tables it can be seen that the overall classification perfor-
mance is again very good, robust to noise and not particularly sensitive to the

Figure 2.7. Credit function evolution for a classification experiment involving a composite
time series with two Mackey-Glass components.

[Plot: solid line = time series; dashed = credit function no. 1; dash-dot = credit function no. 2. Horizontal axis: time steps (0-300); vertical axis: credit function (0.00-1.50).]

Table 2.4. Classification accuracy c for various values of σ. Dataset A is a time series with
one Mackey-Glass component; dataset B is a composite time series with two Mackey-Glass
components.

Dataset   σ_w     σ       h       c
A         0.000   0.010   0.010   0.850
A         0.000   0.050   0.010   0.995
A         0.000   0.100   0.010   0.995
A         0.000   0.200   0.010   0.950
A         0.000   0.300   0.010   0.930
B         0.000   0.010   0.010   0.955
B         0.000   0.050   0.010   0.993
B         0.000   0.100   0.010   0.980
B         0.000   0.200   0.010   0.895
B         0.000   0.300   0.010   0.755

values of the σ and h parameters. Hence the conclusions of the previous section
are further corroborated.

Table 2.5. Classification accuracy c for various values of h. Dataset A is a time series with
one Mackey-Glass component; dataset B is a composite time series with two Mackey-Glass
components.

Dataset   σ_w     σ       h       c
A         0.000   0.040   0.001   0.995
A         0.000   0.040   0.010   0.995
A         0.000   0.040   0.100   0.995
A         0.000   0.040   0.200   0.995
B         0.000   0.040   0.001   0.993
B         0.000   0.040   0.010   0.993
B         0.000   0.040   0.100   0.988
B         0.000   0.040   0.200   0.978

Table 2.6. Classification accuracy c for noisy observation of the time series. Dataset A is a
time series with one Mackey-Glass component; dataset B is a composite time series with two
Mackey-Glass components. Both time series are mixed with additive white noise (observation
noise) with zero mean and standard deviation σ_v.

Dataset   σ_v     σ       h       c
A         0.000   0.040   0.010   0.995
A         0.050   0.090   0.010   0.995
A         0.100   0.140   0.010   0.995
A         0.200   0.240   0.010   0.965
A         0.300   0.340   0.010   0.975
A         0.500   0.540   0.010   0.855
B         0.000   0.040   0.010   0.993
B         0.050   0.090   0.010   0.990
B         0.100   0.140   0.010   0.980
B         0.200   0.240   0.010   0.888
B         0.300   0.340   0.010   0.875
B         0.500   0.540   0.010   0.558

2.6.3 Sequential Logic Gates


This final group of experiments is different from the previous ones in two re-
spects. First, the time series Yt, t = 1,2, ... takes discrete (0/1) rather than
continuous values. Second, the time series includes a non additive noise com-
ponent Ut, t = 1,2,.... More specifically, the time series Yt is produced by

successive activation of four sources, described by the following Boolean differ-
ence equations:

y_t = XOR(y_{t−1}, u_t),   (2.25)

y_t = NOT(u_t),   (2.26)

y_t = NOR(y_{t−1}, u_t),   (2.27)

y_t = NAND(y_{t−1}, u_t).   (2.28)

Here XOR, NOT, NOR and NAND are the usual logic gates; y_t and u_t are
Boolean variables, taking values in {0, 1}; in particular, u_1, u_2, ... is a sequence
of independent random variables, each of which takes the values 0 and 1 with
probability 0.50.
We first run each of the above equations separately, with randomly generated
u_1, u_2, ... sequences, obtaining four (training) y_t time series, each of length 200. These
time series are used to train four sigmoid 2-3-1 feedforward predictors of the
form (k = 1, 2, 3, 4)

y_t^k = f_k(y_{t−1}, u_t)

(notice the slight change from the previous model; the predictions now depend
not only on the past value y_{t−1}, but also on the input u_t).
In the test phase we run two experiments. The first involved a time series
generated by a source switching sequence of the form XOR - NOT - NOR -
NOT (illustrated in Fig. 2.8). The second time series was generated by a source
switching sequence of the form XOR - NAND - NOR - NAND (illustrated
in Fig. 2.10). The classification results for the two experiments are illustrated
in Figs. 2.9 and 2.11. It can be seen that classification is almost perfect.
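The four Boolean sources and a composite switching series of this kind can be sketched as follows. The segment length of 100 steps per source and the seed are illustrative assumptions (the figures show switches roughly every 100 steps).

```python
import random

def xor(y, u):  return y ^ u                 # eq. (2.25)
def not_(y, u): return 1 - u                 # eq. (2.26)
def nor(y, u):  return 1 - (y | u)           # eq. (2.27)
def nand(y, u): return 1 - (y & u)           # eq. (2.28)

def generate(sources, steps_per_source=100, seed=0):
    """Composite 0/1 series: each source is active for steps_per_source steps;
    u_t is an i.i.d. Bernoulli(0.5) input sequence."""
    rng = random.Random(seed)
    y, us, ys = 0, [], []
    for src in sources:
        for _ in range(steps_per_source):
            u = rng.randint(0, 1)
            y = src(y, u)
            us.append(u)
            ys.append(y)
    return ys, us

ys, us = generate([xor, not_, nor, not_])    # the first switching sequence (Fig. 2.8)
```

Swapping the source list for [xor, nand, nor, nand] produces the second test series.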

2.7 CONCLUSIONS
In this chapter we have presented the basic PREMONN classification algorithm.
Classification is performed by recursive computation of the credit functions p_t^k,
which indicate the likelihood of each candidate source having generated the
observed time series.
The motivation for developing the algorithm has been probabilistic, but it
has been established by informal arguments that its use can be justified by
purely phenomenological arguments, without recourse to probabilistic consid-
erations. Further justification for the use of this and related algorithms will
be provided in Chapter 5, where convergence to correct classification will be
established mathematically. For the time being, the following remarks are in
order. The phenomenological point of view offers greater flexibility than the
probabilistic one, in the sense that the PREMONN algorithm can be applied
to a wider range of problems and modified in various ways (some of these mod-
ifications will be presented in Chapter 3 and their possible advantages will be
discussed). In addition, the phenomenological point of view allows classification
of time series generated by switching sources.

Figure 2.8. The sequence of source switchings used in the first classification experiment.
XOR corresponds to source no.1, NOT corresponds to source no.2, NOR corresponds to
source no.3, NAND corresponds to source no.4.

[Plot: step function showing the active source number (1-4) over time steps 1-401.]

Figure 2.9. Credit function evolution for the first logic gates classification experiment.

[Plot: credit functions of the four predictors vs. time steps (1-401); vertical axis: credit function (0.00-1.00).]
34

Figure 2.10. The sequence of source switchings used in the second classification experiment.
XOR corresponds to source no.1, NOT corresponds to source no.2, NOR corresponds to
source no.3, NAND corresponds to source no.4.

[Plot: step function showing the active source number (1-4) over time steps 1-401.]

Figure 2.11. Credit function evolution for the second logic gates classification experiment.

[Plot: credit functions of the four predictors vs. time steps (1-401); vertical axis: credit function (0.00-1.00).]

The recursive nature of the PREMONN algorithm allows online operation.


In addition, preliminary evidence from numerical experiments indicates that the
algorithm is robust to noise and does not require fine tuning of its parameters.

Appendix 2.A: Proof of the Bayesian Update Formula


In this appendix we will prove Theorem 2.1, i.e. we will demonstrate the
validity of the Bayesian posterior probability update equation

p_t^k = p_{t−1}^k · e^{−|y_t − y_t^k|²/2σ²} / Σ_{n=1}^K p_{t−1}^n · e^{−|y_t − y_t^n|²/2σ²}.   (2.A.1)

Eq. (2.A.1) is Bayes' rule applied to probability densities, rather than to prob-
abilities. The intuition behind the proof is based on a straightforward appli-
cation of Bayes' rule; however some care is required because densities for both
continuous- and discrete-valued random variables are involved. We will need
to recall the definitions of the following quantities (recall that k = 1, 2, ..., K)

p_t^k = Pr(z = k | y_1, ..., y_t),

y_t = F_z(y_{t−1}, ..., y_{t−M}) + e_t,

y_t^k = F_k(y_{t−1}, ..., y_{t−M}),

g(e) = (1/(√(2π)σ)) · e^{−|e|²/2σ²}.

Also, the reader should keep in mind that for the discrete-valued variable z,
probabilities and probability densities are equivalent.
Proof of Theorem 2.1: The proof proceeds in four steps. We first demonstrate
the following relationships regarding probability densities

d_{y_t}(a | y_1, y_2, ..., y_{t−1}, z = k) = (1/(√(2π)σ)) · e^{−|a − y_t^k|²/2σ²},   (2.A.2)

d_{y_t,z}(a, k | y_1, y_2, ..., y_{t−1}) = d_{y_t}(a | y_1, y_2, ..., y_{t−1}, z = k) · p_{t−1}^k,   (2.A.3)

p_t^k = d_{y_t,z}(a, k | y_1, y_2, ..., y_{t−1}) / d_{y_t}(a | y_1, y_2, ..., y_{t−1}),   (2.A.4)

(these are the relationships introduced in Section 2.1). Then the above rela-
tionships will be used to prove eq. (2.A.1).
Step 1. Since e_t is independent of z and y_1, y_2, ..., it follows that

Pr(e_t < a) = Pr(e_t < a | y_1, ..., y_{t−1}, z = k)  ⇒

(recall that y_t^k is a deterministic function of y_1, ..., y_{t−1} and e_t is zero-mean,
Gaussian)

∫_{−∞}^{a} (1/(√(2π)σ)) · e^{−|y|²/2σ²} dy = Pr(y_t − y_t^k < a | y_1, ..., y_{t−1}, z = k)  ⇒

∫_{−∞}^{a} (1/(√(2π)σ)) · e^{−|y|²/2σ²} dy = Pr(y_t < a + y_t^k | y_1, ..., y_{t−1}, z = k)  ⇒

(substituting a − y_t^k in place of a)

∫_{−∞}^{a − y_t^k} (1/(√(2π)σ)) · e^{−|y|²/2σ²} dy = Pr(y_t < a | y_1, ..., y_{t−1}, z = k).   (2.A.5)

Now, if we differentiate both sides of eq. (2.A.5) with respect to a, we will get

(1/(√(2π)σ)) · e^{−|a − y_t^k|²/2σ²} = d_{y_t}(a | y_1, ..., y_{t−1}, z = k),   (2.A.6)

which shows that eq. (2.A.2) is true. Also, let us define (for brevity)

q^k(a) ≝ d_{y_t}(a | y_1, y_2, ..., y_{t−1}, z = k);

then we can rewrite eq. (2.A.6) as

q^k(a) = (1/(√(2π)σ)) · e^{−|a − y_t^k|²/2σ²}.
Step 2. Let us next find the conditional joint probability density of Pr(y_t <
a, z = k | y_1, y_2, ..., y_{t−1}). Using the standard Bayes' rule, we have

Pr(y_t < a, z = k | y_1, ..., y_{t−1}) = Pr(y_t < a | y_1, ..., y_{t−1}, z = k) ·
Pr(z = k | y_1, ..., y_{t−1}).   (2.A.7)

Note that eq. (2.A.7) makes sense because both probabilities in the right hand
side are well defined. In addition, as remarked in Step 1, Pr(y_t < a |
y_1, ..., y_{t−1}, z = k) has a Gaussian density, denoted by q^k(a). Hence, by dif-
ferentiating eq. (2.A.7) with respect to a, we obtain

d_{y_t,z}(a, k | y_1, ..., y_{t−1}) = q^k(a) · p_{t−1}^k.   (2.A.8)

Eq. (2.A.8) is equivalent to eq. (2.A.3). Defining (for brevity)

r(a, k) ≝ d_{y_t,z}(a, k | y_1, ..., y_{t−1}),

we can rewrite eq. (2.A.8) as

r(a, k) = q^k(a) · p_{t−1}^k.   (2.A.9)

Note that by summing eq. (2.A.7) with respect to k, we obtain

Pr(y_t < a | y_1, ..., y_{t−1}) = Σ_{k=1}^K Pr(y_t < a | y_1, ..., y_{t−1}, z = k) ·

Pr(z = k | y_1, ..., y_{t−1}),   (2.A.10)

and by differentiating eq. (2.A.10) with respect to a we obtain

d_{y_t}(a | y_1, ..., y_{t−1}) = Σ_{k=1}^K q^k(a) · p_{t−1}^k;   (2.A.11)

defining (for brevity)

s(a) ≝ d_{y_t}(a | y_1, ..., y_{t−1}),

we can rewrite eq. (2.A.11) as

s(a) = Σ_{k=1}^K q^k(a) · p_{t−1}^k.   (2.A.12)
Step 3. In view of the above definitions, eq. (2.A.4) is equivalent to

Pr(z = k | y_1, ..., y_{t−1}, y_t = a) = r(a, k)/s(a).   (2.A.13)

Hence it suffices to prove eq. (2.A.13). According to the definition of conditional
probability (see Mathematical Appendix A) this will be proved if we show that
for all G = Θ × E (where E ∈ B¹) we have

∫_G [r(a, k)/s(a)] dP(a, m) = Pr(z = k, y_t ∈ E).   (2.A.14)

Here P denotes the probability measure induced on z and y_t by conditioning
on the current values of y_1, y_2, ..., y_{t−1}. Now, the joint density of
z and y_t with respect to the measure of probability conditional on y_1, y_2, ...,
y_{t−1} is r(a, m). Using this fact, Fubini's theorem, and the fact that the integral
over a discrete-valued random variable reduces to a sum, we have

∫_G [r(a, k)/s(a)] dP(a, m) = ∫_E Σ_{m=1}^K [r(a, k)/s(a)] · r(a, m) da =

∫_E r(a, k) · {Σ_{m=1}^K r(a, m)/s(a)} da = ∫_E r(a, k) da = Pr(z = k, y_t ∈ E);

this completes the proof of eq. (2.A.14).


Step 4. Finally, since r(a, k) = q^k(a) · p_{t−1}^k and s(a) = Σ_{n=1}^K r(a, n), it follows
that

Pr(z = k | y_1, ..., y_{t−1}, y_t = a) = r(a, k)/s(a) = q^k(a) · p_{t−1}^k / Σ_{n=1}^K q^n(a) · p_{t−1}^n

or, in other words,

p_t^k = (1/(√(2π)σ)) · e^{−|y_t − y_t^k|²/2σ²} · p_{t−1}^k / Σ_{n=1}^K (1/(√(2π)σ)) · e^{−|y_t − y_t^n|²/2σ²} · p_{t−1}^n;

by cancelling the 1/(√(2π)σ) in numerator and denominator, the proof of the theorem
is complete. ∎
3 GENERALIZATIONS OF THE BASIC PREMONN

In this chapter we present variants of the basic PREMONN algorithm. Variant


algorithms can be obtained by modifying: (a) the predictive modules, (b) the
manner in which the prediction error is computed and (c) the credit update
algorithm.

3.1 PREDICTOR MODIFICATIONS

It has been assumed so far that the predictors f_k(.) are feedforward sigmoid
neural networks. However, as has already been pointed out, the f_k(.)'s may
represent various other functional forms, such as linear functions, polynomials,
radial basis functions, splines etc. In addition, the predictions can be obtained
by feedforward or feedback (recurrent) calculations. Generally, the only require-
ment for the operation of the PREMONN algorithms is that the predictions y_t^k
are available for k = 1, 2, ..., K and t = 1, 2, ...; the method by which these are
obtained is not important. In fact, a bank of predictive modules may include
predictors of different types; for instance a sigmoid and a linear predictor may
coexist in the same PREMONN. The use of different predictor types may be
advantageous if it is known that different sources are approximated better by
different predictor types.

V. Petridis et al., Predictive Modular Neural Networks


© Kluwer Academic Publishers 1998

3.2 PREDICTION ERROR MODIFICATIONS


So far it has been assumed that the observation error e_t^k is a scalar variable.
There are two ways in which a vector error can be introduced; neither of these
results in substantial modifications of the credit update algorithm.

3.2.1 Vector Valued Time Series


There is no substantial change in the PREMONN algorithm if the time series
y_t is vector valued. In this case, the predictions y_t^k are also vector valued. The
credit update algorithm is still written as

p_t^k = p_{t−1}^k · e^{−|e_t^k|²/2σ²} / Σ_{n=1}^K p_{t−1}^n · e^{−|e_t^n|²/2σ²},

the only difference being that now |·| signifies the Euclidean norm. Variants in-
volving the exponential form e^{−ε'Qε} (with Q a positive definite matrix) are
also possible and obvious. These do not induce substantial modifications in the
credit update algorithm.
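A single step of the vector-valued variant can be sketched as follows. The function name, the example errors and the choice Q = I are illustrative assumptions; with Q = I the quadratic form reduces to the squared Euclidean norm (up to the 1/2σ² factor).

```python
import numpy as np

def credit_update_vector(p_prev, errors, Q):
    """One multiplicative credit step for vector errors e^k,
    using the exponential form exp(-e' Q e)."""
    w = np.array([np.exp(-e @ Q @ e) for e in errors])
    p = p_prev * w
    return p / p.sum()                       # normalize so credits sum to 1

# Two predictors; the first has the smaller error vector.
p = credit_update_vector(np.array([0.5, 0.5]),
                         [np.array([0.1, 0.0]), np.array([1.0, 1.0])],
                         np.eye(2))
```

As expected, the predictor with the smaller quadratic-form error receives the larger credit.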

3.2.2 Slow Credit Update


In the presentation of Chapter 1 it has been assumed that the observation of
the time series and the update of the credit take place at the same time scale.
This, however, is not necessary. It may be preferable to update the credit
functions at a slower rate than that at which the time series is observed.
To describe this idea precisely, consider again the observed time series y_1,
y_2, ... and the predictor equation

y_s^k = f_k(y_{s−1}, ..., y_{s−M}),   (3.1)

where k = 1, 2, ..., K and s = 1, 2, ... . Note the change of the time variable:
time is now denoted by s rather than t. This is done because we reserve the
variable t to denote data blocks rather than single data points. Specifically, we
define observation blocks Y_t and prediction blocks Y_t^k as follows

Y_t ≝ [y_{(t−1)·N+1}, y_{(t−1)·N+2}, ..., y_{(t−1)·N+N}]',
Y_t^k ≝ [y^k_{(t−1)·N+1}, y^k_{(t−1)·N+2}, ..., y^k_{(t−1)·N+N}]'.   (3.2)

For example, t = 1 corresponds to s = 1, 2, ..., N or, in other words, Y_1 corre-
sponds to [y_1 y_2 ... y_N]'.¹ We also define the N-step block prediction error
E_t^k as follows

E_t^k ≝ Y_t − Y_t^k.   (3.3)

¹Generally, if X is a vector, X' denotes the transpose of X.


GENERALIZATIONS OF THE BASIC PREMONN 41

Finally, the credit functions are redefined, in terms of block prediction errors,
by

p_t^k = p_{t−1}^k · e^{−|E_t^k|²/2σ²} / Σ_{n=1}^K p_{t−1}^n · e^{−|E_t^n|²/2σ²}.   (3.4)

Note that |·| now signifies the Euclidean norm. Also note that the variables Y_t,
Y_t^k, E_t^k, p_t^k depend on N; however this dependence is not denoted explicitly,
for reasons of brevity.
Eq. (3.4) contains eq. (2.17) as a special case, which is obtained by setting
N = 1; in this case we also have s = t and Y_t = y_t, Y_t^k = y_t^k, E_t^k = e_t^k. Unless
otherwise mentioned, all further discussion will be based on the use of the more
general equation (3.4).
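The blocking construction and one step of eq. (3.4) can be sketched as follows. The function name, the toy series and the value N = 4 are illustrative assumptions.

```python
import numpy as np

def block_credit_update(p_prev, Y_block, Y_hat_blocks, sigma):
    """One step of eq. (3.4): credits updated from N-step block errors E_t^k,
    with |.| the Euclidean norm."""
    E2 = np.array([np.sum((Y_block - Yk) ** 2) for Yk in Y_hat_blocks])
    p = p_prev * np.exp(-E2 / (2.0 * sigma ** 2))
    return p / p.sum()

# Blocking a scalar series into N-step observation blocks Y_t (eq. 3.2):
y = np.arange(12, dtype=float)
N = 4
Y = y.reshape(-1, N)                          # row t holds the t-th block of N samples

# Predictor 1 is exact on the first block; predictor 2 is off by 1 at every step.
p = block_credit_update(np.array([0.5, 0.5]), Y[0], [Y[0], Y[0] + 1.0], sigma=1.0)
```

Setting N = 1 recovers the single-step update, since each block then contains one sample.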

3.3 CREDIT ASSIGNMENT MODIFICATIONS


Substantial and useful variants of the original PREMONN algorithm may result
by altering the credit assignment method. We characterize the original PRE-
MONN credit assignment as multiplicative; we have also experimented with
additive, counting, incremental and fuzzy credit assignment algorithms. The
meaning of the above terms will become clear in what follows.

3.3.1 Multiplicative Credit Assignment


The basic PREMONN uses a multiplicative credit update scheme. In other
words, the new credit p_t^k is a normalized version of the product

p_{t−1}^k · e^{−|E_t^k|²/2σ²}.

It has already been remarked in Chapter 2 that there is nothing special about
the quadratic term in the exponential e^{−|E_t^k|²/2σ²}. Any function of the form
e^{−g(|E_t^k|)} will do, as long as g(.) is a strictly positive and increasing function.
Hence a credit update equation of the following form may be used

p_t^k = e^{−g(|E_t^k|)} · p_{t−1}^k / Σ_{n=1}^K e^{−g(|E_t^n|)} · p_{t−1}^n.   (3.5)

This equation is presented in (Kehagias and Petridis, 1997a). We have experi-
mented with several such functions and obtained results comparable to those
of the quadratic error function². The general idea is clear: large errors result

²In fact, any strictly positive function G(.) can be written in the form G(.) = e^{log[G(.)]};
hence, using g(.) = −log[G(.)], the update

p_t^k = G(|E_t^k|) · p_{t−1}^k / Σ_{n=1}^K G(|E_t^n|) · p_{t−1}^n   (3.6)

can be used with any strictly positive and decreasing function G(.).

in a small value of the exponential; this, in turn, results in a decreased value
of the new credit p_t^k. This can be made more precise by considering the credit
ratio:

p_t^k / p_t^m = [e^{−g(|E_t^k|)} · p_{t−1}^k] / [e^{−g(|E_t^m|)} · p_{t−1}^m]
 = (p_{t−1}^k / p_{t−1}^m) · (e^{−g(|E_t^k|)} / e^{−g(|E_t^m|)}).   (3.7)

Hence, if at time step t the k-th predictor has larger error than the m-th one,
then the ratio p_t^k / p_t^m is reduced relative to the ratio p_{t−1}^k / p_{t−1}^m. In fact, by repeating the
above argument for times t − 1, t − 2, ..., 1, we obtain

p_t^k / p_t^m = (p_0^k / p_0^m) · e^{−Σ_{s=1}^t [g(|E_s^k|) − g(|E_s^m|)]}.   (3.8)

It becomes obvious from eq. (3.8) that multiplicative credit update schemes fur-
nish a method for evaluating credits according to the exponentiated cumulative
square error. Namely, if after t observations of the time series the k-th predictor
has larger error than the m-th one (as measured by the function g(.)), then the
ratio p_t^k / p_t^m is less than one. In short: predictors of larger error receive smaller
credit.

3.3.2 Additive Credit Assignment


The family of PREMONN algorithms can be further simplified by dispensing
with the use of the exponential function. Since credit assignment depends
ultimately on cumulative square error, credit assignment may be performed in
an additive manner by the following recursion

p_t^k = p_{t−1}^k + g(|E_t^k|),   (3.9)

where g(.) is any positive, increasing function. This equation is presented in
(Kehagias and Petridis, 1997a). The simplest choice for g(.) is the quadratic

g(|E_t^k|) = |E_t^k|².   (3.10)

Note that in eqs. (3.9) and (3.10) the function p_t^k actually is the discredit (rather
than the credit) of the k-th predictor: a large value of p_t^k indicates that the
respective predictor is performing poorly. Changing slightly eq. (3.9) we obtain
the recursion

p_t^k = ((t − 1)/t) · p_{t−1}^k + (1/t) · |E_t^k|²,   (3.11)

in which case p_t^k yields the running average of the cumulative square error of
the k-th model.


Additive credit update schemes of the forms (3.9), (3.11) are easier to im-
plement than multiplicative ones; however, our experience indicates that their
performance is somewhat inferior to that of multiplicative schemes. In addi-
tion, some attractive properties of the credit function are lost when an additive
algorithm is used. For instance, the properties 0 < p_t^k < 1 and Σ_{k=1}^K p_t^k = 1,
for all k and t, do not hold any longer. However, the difference in performance
is not that great and the simplicity of implementation makes additive schemes
an attractive alternative to the multiplicative ones.
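Both additive variants can be sketched in a few lines. The function name and the toy error sequences are illustrative assumptions; note that these quantities are discredits, so smaller values indicate better predictors.

```python
def additive_discredit(errors_per_step):
    """Eqs. (3.9)-(3.11) with g(|E|) = |E|^2: cumulative and running-average
    discredits (SMALL values indicate good predictors)."""
    K = len(errors_per_step[0])
    cum = [0.0] * K                            # eq. (3.9): p_t^k = p_{t-1}^k + |E_t^k|^2
    avg = [0.0] * K                            # eq. (3.11): running average of the above
    for t, errs in enumerate(errors_per_step, start=1):
        for k, e in enumerate(errs):
            cum[k] += e ** 2
            avg[k] = (t - 1) / t * avg[k] + (e ** 2) / t
    return cum, avg

# Two predictors observed for two time steps.
cum, avg = additive_discredit([[0.1, 0.3], [0.2, 0.1]])
```

By induction, the recursion (3.11) keeps avg equal to cum divided by the number of steps seen so far.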

3.3.3 Counting Credit Assignment


The additive credit assignment algorithm represents a computational simplifi-
cation over the multiplicative one. A further simplification can be obtained by
assigning credit to each predictor according to the number of times it yields
minimum (over the K predictors) prediction error. We call this a counting
algorithm; to describe it more formally, it is necessary to use the indicator
function

1(|E_t^k| < |E_t^m| for m ≠ k) = { 1 if |E_t^k| < |E_t^m| for all m ≠ k; 0 else }.

The counting credit update algorithm is then described by

p_t^k = p_{t−1}^k + 1(|E_t^k| < |E_t^m| for m ≠ k).

Hence, at time t one unit of credit is assigned to the k-th source if it has
minimum prediction error; all other sources receive zero credit. This equation
is presented in (Kehagias and Petridis, 1997a).
This scheme has minimal computational requirements; its performance is
quite good and certainly comparable to that of additive schemes; multiplicative
schemes tend to perform better, as will be seen in Section 2.A.
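The counting scheme reduces to a per-step argmin and a tally. The function name and the example error sequence are illustrative assumptions.

```python
def counting_credits(errors_per_step, K):
    """Counting scheme: one unit of credit per time step to the predictor with
    the minimum absolute error; all other predictors receive zero."""
    p = [0] * K
    for errs in errors_per_step:
        p[min(range(K), key=lambda k: abs(errs[k]))] += 1
    return p

# Predictor 0 wins at steps 1 and 3; predictor 1 wins at step 2.
p = counting_credits([[0.1, 0.5], [0.4, 0.2], [0.3, 0.9]], K=2)
```

The final tallies can be compared directly, or normalized by the number of steps to obtain credit fractions.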

3.3.4 Fuzzy Credit Assignment


We now present a credit assignment scheme motivated by fuzzy set theory.
After giving the fuzzy theoretic motivation, we present an argument to phenom-
enologically justify this scheme. In the following paragraphs we will make use
of fuzzy set terminology; the reader is referred to the Mathematical Appendix
A for a detailed explanation of some of the terms used here.

Fuzzy Set Formulation. The source set is a finite crisp set Θ = {1, 2, ..., K}.
The source variable z takes, as usual, values in Θ. The estimate of z is called
ẑ_t and also takes values in Θ. The computation of ẑ_t at time t is based on a
process of fuzzy inference.
Consider the attribute: "source no. z has been active from time s up to time
t". A crisp set of elements that satisfy this attribute must include exactly one
member of Θ; this is so because it is assumed that the time series is generated
by a single source. However, we propose to use a fuzzy set:

A(s, t) = {(z, μ_{A(s,t)}(z)) | z ∈ Θ}.

The fuzzy set A(s, t) consists of the crisp set Θ (the set of possible values of
the source parameter) and the membership function μ_{A(s,t)}(z); for a given z,
μ_{A(s,t)}(z) is the membership grade of the attribute "source no. z has been active
from time s to time t". Obviously A(s, t) has a time dependence on times s
and t. Now, consider the k-th member of Θ: μ_{A(1,t)}(k) is the membership
grade of "source no. k has been active from time 1 to time t", or equivalently,
"observations y_1, y_2, ..., y_t have been generated by source no. k". For economy
of space, and also for compatibility with the previous analysis, we use the notation

p_t^k ≝ μ_{A(1,t)}(k).

What is required here is to provide and justify a method for updating p_t^k at
every time step. This will be derived presently; but first note that, for a given
time t, it is natural to set

ẑ_t = arg max_{k=1,2,...,K} p_t^k.

In other words the time series is classified to the source no. ẑ_t which achieves
maximum membership grade.

Membership Grade Update. A reasonable membership function can be
computed in a recursive manner using the following update

μ_{A(1,t)}(k) = μ_{A(1,t−1)}(k) AND μ_{A(t−1,t)}(k).   (3.12)

Eq. (3.12) has the following meaning: the membership grade of the attribute
"the complete y_1, ..., y_t observation has been generated by source no. k" is
the same as the membership grade of the conjuncted attributes "the complete
y_1, ..., y_{t−1} observation has been generated by source no. k" AND "the
y_t observation has been generated by source no. k". Thus, p_t^k is computed
in terms of μ_{A(1,t−1)}(k) and μ_{A(t−1,t)}(k). The latter can be computed as a
function of the form

μ_{A(t−1,t)}(k) = e^{−g(|E_t^k|)},   (3.13)

where g(.) is any positive increasing function; for instance we can use g(|E_t^k|) =
|E_t^k|²/2σ² to obtain

μ_{A(t−1,t)}(k) = e^{−|E_t^k|²/2σ²}.

In eq. (3.13) the membership grade is expressed in terms of predictive accu-
racy: for instance, when |E_t^k| is large, μ_{A(t−1,t)}(k) = e^{−|E_t^k|²/2σ²} is small. Now
eqs. (3.12), (3.13) result in the following recursive equation:

p_t^k = p_{t−1}^k AND e^{−g(|E_t^k|)}.   (3.14)

The implementation of the AND conjunction in (3.14) has not yet been spec-
ified; several options are available and will be discussed presently. At any

rate, (3.14) shows that when |E_t^k| is large, then e^{−g(|E_t^k|)} (and consequently
μ_{A(t−1,t)}(k)) is small; this implies that p_{t−1}^k AND e^{−g(|E_t^k|)} is also small.
In fact, a little reflection shows that eq. (3.14) results in a decreasing sequence
of membership grades p_t^k. This may result in various implementation problems
(e.g. numerical underflow), so a normalized form will be used in what follows:

p_t^k = [p_{t−1}^k AND e^{−g(|E_t^k|)}] / OR_{n=1}^K [p_{t−1}^n AND e^{−g(|E_t^n|)}].   (3.15)

The previous comments about the influence of |E_t^k| on p_t^k apply to eq. (3.15)
as well, but now the relative, not absolute, magnitude of |E_t^k| influences p_t^k, since
the computation of membership grades is competitive. Hence, a large |E_t^k|
does not necessarily imply a small membership grade p_t^k; the value of p_t^k may be
large if |E_t^n| > |E_t^k| for n ≠ k, that is, if the other predictors perform even worse.
Note that the form of the decision module has not yet been specified; this will
depend on the implementation of the fuzzy AND and OR inference, to be
discussed in the next section.

Modes of Fuzzy Inference. The form of the fuzzy credit assignment de-
pends on the implementation of the fuzzy AND and OR in eq. (3.15). In
fuzzy set theory there are two standard ways to implement such logical oper-
ators (Bezdek, Coray, Gunderson and Watson, 1981a): AND is implemented
by a product and OR is implemented by a sum; alternatively AND is
implemented by a minimum and OR is implemented by a maximum. Two
"hybrid" combinations are also possible: AND is implemented by a product
and OR is implemented by a maximum; AND is implemented by a min-
imum and OR is implemented by a sum. Only the first two cases are dealt
with here. The Sum/Product Fuzzy PREMONN Algorithm is based on the
equation

p_t^k = p_{t−1}^k · e^{−g(|E_t^k|)} / Σ_{n=1}^K p_{t−1}^n · e^{−g(|E_t^n|)}.   (3.16)

This is, of course, exactly the basic PREMONN algorithm. The Max/Min
Fuzzy PREMONN algorithm is based on the equation

p_t^k = [p_{t−1}^k ∧ e^{−g(|E_t^k|)}] / ∨_{n=1}^K [p_{t−1}^n ∧ e^{−g(|E_t^n|)}],   (3.17)

where ∧ indicates the minimum operator and ∨ indicates the maximum opera-
tor. In addition to the sum/product and max/min algorithms, one can use the
max/product algorithm (in the sum/product algorithm replace the sums with
max operators) and the sum/min algorithm (in the sum/product algorithm
replace the product with min operators). Since these algorithms are obvious
modifications of the sum/product and max/min algorithms, we do not give
their descriptions here. We have introduced these algorithms in (Petridis and
Kehagias, 1997b) under the name Predictive Modular Fuzzy Systems (PRE-
MOFS).
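The two main inference modes can be sketched side by side. The function name and the example inputs are illustrative assumptions; the g-transformed errors are passed in directly.

```python
import numpy as np

def fuzzy_credit_step(p_prev, g_errors, mode="max/min"):
    """One step of eqs. (3.16)/(3.17): memberships e^{-g(|E^k|)} combined by
    AND/OR as product/sum (the basic PREMONN) or as min/max."""
    m = np.exp(-np.asarray(g_errors))        # e^{-g(|E_t^k|)}
    if mode == "sum/product":
        p = p_prev * m                       # AND = product
        return p / p.sum()                   # OR = sum: credits sum to 1
    p = np.minimum(p_prev, m)                # AND = minimum
    return p / p.max()                       # OR = maximum: max credit is 1

p0 = np.array([0.6, 0.4])
sp = fuzzy_credit_step(p0, [0.1, 2.0], mode="sum/product")
mm = fuzzy_credit_step(p0, [0.1, 2.0], mode="max/min")
```

Both modes reward the predictor with the smaller g-error, but normalize differently: sum/product keeps the credits summing to one, while max/min pins the maximum credit at one.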
Phenomenological Point of View. Having obtained the membership grade update algorithms, we can now rename membership grade as credit function, and the fuzzy point of view can be abandoned in favor of the phenomenological one. Let us then rename the $p_t^k$ quantities as credit functions. It will be observed that the sum/product algorithm is exactly the basic PREMONN algorithm, while the max/min, sum/min and max/product algorithms are variations. For instance, consider the max/min algorithm. This says that, at time $t$, the credit function $p_t^k$ is a normalized version of $p_{t-1}^k \wedge e^{-g(|E_t^k|)}$. Ignoring the normalization for the time being, this says that $p_t^k$ will be no greater than $p_{t-1}^k$ or $e^{-g(|E_t^k|)}$. Hence, if either the previous value of the credit function is small or the current error is large, the new value of the credit function will also be small. Now let us consider the scaling effected by the denominator in eq.(3.17). This results in the maximum of the $p_t^k$'s being equal to 1. Hence, the max/min algorithm updates the credit functions in the following manner: at every step all credit functions decrease (or at least do not increase), but the credit of predictors with larger errors decreases more; then the credit functions are rescaled, so the maximum credit becomes equal to one. It is clear that here we have once again a case of recursive, competitive credit assignment, just as in the basic PREMONN algorithm. The usual attractive properties of the credit functions are preserved. The normalized form of equations (3.16) and (3.17) ensures that for both algorithms we have $0 < p_t^k \le 1$ for all $t$ and $k$. In the case of the sum/product PREMONN, $\sum_{k=1}^{K} p_t^k = 1$ for every $t$, and in the case of the max/min PREMONN, $\bigvee_{k=1}^{K} p_t^k = 1$ for every $t$. Hence the normalization ensures that at least one $p_t^k$ never becomes too small. In fact, it will be seen in Chapter 4 that, under appropriate conditions, one $p_t^k$ will tend to one, both for the sum/product and max/min algorithms.
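As an illustration, the two normalizations can be sketched in a few lines of code (a hypothetical sketch: the function names, the value of $\sigma$ and the error values are our own choices, not from the text):

```python
import numpy as np

def sum_product_update(p, errors, g):
    """Sum/product (basic PREMONN) credit update, cf. eq.(3.16)."""
    w = p * np.exp(-g(np.abs(errors)))
    return w / w.sum()                 # normalize: credits sum to one

def max_min_update(p, errors, g):
    """Fuzzy max/min credit update, cf. eq.(3.17)."""
    w = np.minimum(p, np.exp(-g(np.abs(errors))))
    return w / w.max()                 # rescale: maximum credit equals one

g = lambda e: e ** 2 / (2 * 0.1 ** 2)  # quadratic error function, sigma = 0.1
p = np.array([0.5, 0.5])
errors = np.array([0.05, 0.2])         # predictor 1 has the smaller error
p = max_min_update(p, errors, g)       # best predictor ends up with credit 1
```

Note that in the max/min case the denominator is a maximum rather than a sum, which is precisely the rescaling-to-one behavior described above.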

3.3.5 Incremental Credit Assignment


In this section we describe an incremental credit assignment scheme which has
a certain resemblance to a steepest ascent procedure.

Incremental Credit Assignment. We will now revert temporarily to the probabilistic interpretation of the basic PREMONN credit assignment algorithm. This will be used to motivate the "Bayesian" incremental credit assignment scheme. Consider the one-step errors $e_t^k$, $k = 1, 2, \ldots, K$, and the following difference equation:

$$q_t^k - q_{t-1}^k = e^{-\frac{|e_t^k|^2}{2\sigma^2}} \cdot p_{t-1}^k - \left(\sum_{n=1}^{K} p_{t-1}^n \cdot e^{-\frac{|e_t^n|^2}{2\sigma^2}}\right) \cdot q_{t-1}^k. \qquad (3.18)$$

Hence, the $q_t^k$'s are defined by the above recursion and some initial conditions $q_0^k$ ($k = 1, 2, \ldots, K$) which satisfy

$$q_0^k > 0, \qquad \sum_{k=1}^{K} q_0^k = 1;$$
GENERALIZATIONS OF THE BASIC PREMONN 47

the $p_t^k$'s are the original Bayesian posterior probabilities defined in Chapter 2. We claim that if the $q_t^k$'s (as given by the above equation) converge, then, at least at equilibrium, they approximate the $p_t^k$'s. Indeed, at equilibrium we have $q_t^k \simeq q_{t-1}^k$ and from eq.(3.18) we obtain

$$q_t^k \simeq \frac{p_{t-1}^k \cdot e^{-\frac{|e_t^k|^2}{2\sigma^2}}}{\sum_{n=1}^{K} p_{t-1}^n \cdot e^{-\frac{|e_t^n|^2}{2\sigma^2}}}.$$

Since

$$p_t^k = \frac{p_{t-1}^k \cdot e^{-\frac{|e_t^k|^2}{2\sigma^2}}}{\sum_{n=1}^{K} p_{t-1}^n \cdot e^{-\frac{|e_t^n|^2}{2\sigma^2}}}, \qquad (3.19)$$

it follows that $q_t^k \simeq p_t^k$ (for $k = 1, 2, \ldots, K$). The point of introducing the $q_t^k$'s is to avoid the computation of $p_t^k$ by equation (3.19). In this case the $p_{t-1}^k$'s in (3.18) are unknown, so let us substitute them by the $q_{t-1}^k$'s, which approximate them. After some rewriting (and the introduction of a gain parameter $\gamma$), eq.(3.18) becomes

$$q_t^k = q_{t-1}^k + \gamma \cdot \left[e^{-\frac{|e_t^k|^2}{2\sigma^2}} - \left(\sum_{n=1}^{K} q_{t-1}^n \cdot e^{-\frac{|e_t^n|^2}{2\sigma^2}}\right)\right] \cdot q_{t-1}^k. \qquad (3.20)$$

Since the original $p_t^k$'s have disappeared from the picture, let us now use the $N$-step error $E_t^k$ in place of the one-step error $e_t^k$. We then obtain the credit update equation

$$q_t^k = q_{t-1}^k + \gamma \cdot \left[e^{-\frac{|E_t^k|^2}{2\sigma^2}} - \left(\sum_{n=1}^{K} q_{t-1}^n \cdot e^{-\frac{|E_t^n|^2}{2\sigma^2}}\right)\right] \cdot q_{t-1}^k. \qquad (3.21)$$

Finally, let us rename the $q_t^k$'s as $p_t^k$'s (since we expect that $q_t^k \simeq p_t^k$) and in eq.(3.21) let us replace the quadratic function $\frac{|E_t^k|^2}{2\sigma^2}$ by the more general positive and increasing function $g(|E_t^k|)$. Then eq.(3.21) becomes

$$p_t^k = p_{t-1}^k + \gamma \cdot \left[e^{-g(|E_t^k|)} - \left(\sum_{n=1}^{K} p_{t-1}^n \cdot e^{-g(|E_t^n|)}\right)\right] \cdot p_{t-1}^k. \qquad (3.22)$$

Eq.(3.22) is the Incremental Credit Assignment (ICRA) scheme, which we have introduced in (Petridis and Kehagias, 1996b). It will be presently argued that this makes phenomenological sense; hence the $p_t^k$'s (as defined by eq.(3.22)) can be taken to be credit functions. Eq.(3.22) is somewhat easier to implement than the basic PREMONN (it avoids the use of floating point divisions); in fact it can be implemented by a recurrent neural "network of networks"; this uses the outputs of the sigmoid predictor networks as inputs to Gaussian neurons and effects the recursive credit update as indicated in Fig. 3.1.
Hence the ICRA scheme is appropriate for hardware implementation. In
addition, it has an attractive phenomenological interpretation, which will now
be presented.
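A one-step software sketch of the ICRA update may clarify the division-free remark (hypothetical names; $\gamma$ and $\sigma$ are illustrative design parameters, not values from the text):

```python
import numpy as np

def icra_update(p, errors, g, gamma=0.1):
    """Incremental (ICRA) credit update, cf. eq.(3.22).

    No division is needed: subtracting the weighted average keeps sum(p) = 1."""
    s = np.exp(-g(np.abs(errors)))             # Gaussian-type scores of the K predictors
    return p + gamma * (s - np.dot(p, s)) * p

g = lambda e: e ** 2 / (2 * 0.1 ** 2)          # quadratic error function, sigma = 0.1
p = np.full(4, 0.25)
for _ in range(200):                           # predictor 0 is consistently best
    p = icra_update(p, np.array([0.02, 0.3, 0.4, 0.5]), g)
```

The bracketed term is positive exactly for predictors scoring above the weighted average, so credit migrates toward the best predictor.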

Figure 3.1. The ICRA architecture.

Phenomenological Justification. We have claimed that the new $p_t^k$'s (as given by eq.(3.22)) can be taken to be credit functions: in other words, that a high $p_t^k$ value indicates that source no. $k$ is likely to be the "true" source. Let us offer here an informal argument to justify this claim. A rigorous argument will be presented in Chapter 4, where it will be shown mathematically that the source with minimum (in some appropriate sense) prediction error will receive highest credit in the long run.

From eq.(3.22) we see that the credit functions $p_t^k$ are updated in an incremental manner, similar to a steepest ascent procedure. It is easy to check (this will be proved in Chapter 4) that for every time $t$ we have $\sum_{k=1}^{K} p_t^k = 1$. Hence the term $\sum_{n=1}^{K} p_{t-1}^n \cdot e^{-g(|E_t^n|)}$ is a weighted average. Now, suppose that the $k$-th model has consistently smaller error than the remaining ones. Then we will have at time $t$ that $g(|E_t^m|) > g(|E_t^k|)$ (for all $m \neq k$), hence the term $\left[e^{-g(|E_t^k|)} - \left(\sum_{n=1}^{K} p_{t-1}^n \cdot e^{-g(|E_t^n|)}\right)\right]$ will be positive. Then we see from eq.(3.22) that $p_t^k$ is increased relative to $p_{t-1}^k$. A rigorous proof of convergence based on this argument will be provided in Chapter 4. Namely, it will be proved that the $p_t^k$'s as given by eq.(3.22) are convergent; in particular, the $p_t^k$ associated with the predictor of smallest predictive error converges to one, while all other $p_t^k$'s converge to zero.

3.3.6 Summary of Fixed Source Credit Update Algorithms


All of the algorithms which have been discussed so far operate on a fixed source
assumption, i.e. it is assumed that a single source produces the entire time
series. Hence it is appropriate to describe them collectively as fixed source

Table 3.1. Summary of Fixed Source PREMONN Algorithms.

Algorithm — Credit Update Equation

Multiplicative: $p_t^k = \dfrac{p_{t-1}^k \cdot e^{-g(E_t^k)}}{\sum_{n=1}^{K} p_{t-1}^n \cdot e^{-g(E_t^n)}}$

Additive: $p_t^k = p_{t-1}^k + g(E_t^k)$

Counting: $p_t^k = p_{t-1}^k + 1(E_t^k < E_t^m \text{ for } m \neq k)$

Fuzzy Sum/Product: same as Multiplicative

Fuzzy Max/Min: $p_t^k = \dfrac{p_{t-1}^k \wedge e^{-g(E_t^k)}}{\bigvee_{n=1}^{K}\left(p_{t-1}^n \wedge e^{-g(E_t^n)}\right)}$

Incremental: $p_t^k = p_{t-1}^k + \gamma \cdot \left[e^{-g(E_t^k)} - \left(\sum_{n=1}^{K} p_{t-1}^n \cdot e^{-g(E_t^n)}\right)\right] \cdot p_{t-1}^k$

algorithms. Of course, as has already been explained, fixed source algorithms


can also be applied to switching source time series, by use of credit function
thresholding. The following table summarizes the fixed source credit update
algorithms for easy reference.

3.4 MARKOVIAN SOURCE SWITCHING


As already remarked, the fixed source algorithms described up to this point operate on the assumption that the source variable $z$ is determined at time $t = 0$ and then remains fixed. In other words, it has been assumed that a single source generates the time series $y_1, y_2, \ldots, y_t, \ldots$. Source switching is handled by thresholding.
In this section we present a different approach to handle source switching.
Namely, the source variable is assumed to be a Markovian process Zt which
takes a new value at every time step t and this assumption is used to modify the
credit update schemes previously discussed. To obtain a basic Markovian credit
update algorithm we will once again start from a probabilistic point of view;
once the basic algorithm is obtained, we will revert to the phenomenological
point of view and obtain a number of variant credit update schemes. These
developments will be in complete correspondence with the presentation of the
fixed source case.

3.4.1 Probabilistic Derivation


Assume that the mechanism by which the time series is produced is described by the equation

$$y_t = f_{z_t}(y_{t-1}, y_{t-2}, \ldots, y_{t-M}) + e_t. \qquad (3.23)$$

Here $e_1, e_2, \ldots, e_t, \ldots$ is a Gaussian white noise process with zero mean and standard deviation $\sigma$. Note that in place of the previously used $z$ we have now placed $z_t$. This is a stochastic process taking values in the set $\Theta = \{1, 2, \ldots, K\}$, according to a Markovian law described by $P_{mk}$, the transition probability matrix, defined by

$$P_{mk} \doteq \Pr(z_t = k \mid z_{t-1} = m). \qquad (3.24)$$

Define the posterior probabilities as usual:

$$p_t^k \doteq \Pr(z_t = k \mid y_t, \ldots, y_1).$$

The variable $z_t$ will be approximated by the Maximum A Posteriori (MAP) estimate $\hat{z}_t$, where

$$\hat{z}_t \doteq \arg\max_{k=1,2,\ldots,K} p_t^k. \qquad (3.25)$$

To find the MAP estimate, a recursive algorithm will be developed to obtain $p_t^k$ for $k = 1, 2, \ldots, K$ and $t = 1, 2, \ldots$. We will only present an informal derivation, analogous to that of Section 2.1; a rigorous derivation analogous to that of Appendix 2.A will be omitted for reasons of brevity.
Recall that for discrete valued variables, probability densities are identical with probabilities. Hence, in what follows we will make use of

$$d_{z_t}(k \mid y_t, y_{t-1}, \ldots, y_1) = \Pr(z_t = k \mid y_t, y_{t-1}, \ldots, y_1) = p_t^k, \qquad (3.26)$$

$$d_{z_{t-1}}(k \mid y_{t-1}, \ldots, y_1) = \Pr(z_{t-1} = k \mid y_{t-1}, \ldots, y_1) = p_{t-1}^k \qquad (3.27)$$

and (because of the Markovian nature of $z_t$)

$$d_{z_t}(k \mid z_{t-1} = n, y_{t-1}, \ldots, y_1) = \Pr(z_t = k \mid z_{t-1} = n). \qquad (3.28)$$

Let us now start by applying Bayes' rule (for densities):

$$p_t^k = d_{z_t}(k \mid y_t, \ldots, y_1) = \frac{d_{y_t, z_t}(a, k \mid y_{t-1}, \ldots, y_1)}{d_{y_t}(a \mid y_{t-1}, \ldots, y_1)} = \frac{\sum_{n=1}^{K} d_{y_t, z_{t-1}, z_t}(a, n, k \mid y_{t-1}, \ldots, y_1)}{\sum_{m=1}^{K} \sum_{n=1}^{K} d_{y_t, z_{t-1}, z_t}(a, n, m \mid y_{t-1}, \ldots, y_1)} =$$

(using eqs.(3.27), (3.28))

$$p_t^k = \frac{\sum_{n=1}^{K} d_{y_t}(a \mid z_t = k, y_{t-1}, \ldots, y_1) \cdot \Pr(z_t = k \mid z_{t-1} = n) \cdot p_{t-1}^n}{\sum_{m=1}^{K} \sum_{n=1}^{K} d_{y_t}(a \mid z_t = m, y_{t-1}, \ldots, y_1) \cdot \Pr(z_t = m \mid z_{t-1} = n) \cdot p_{t-1}^n} = \frac{\sum_{n=1}^{K} d_{y_t}(a \mid z_t = k, y_{t-1}, \ldots, y_1) \cdot P_{nk} \cdot p_{t-1}^n}{\sum_{m=1}^{K} \sum_{n=1}^{K} d_{y_t}(a \mid z_t = m, y_{t-1}, \ldots, y_1) \cdot P_{nm} \cdot p_{t-1}^n}. \qquad (3.29)$$

We have already seen in Section 2.1 that the density of $y_t$ conditioned on $z_t = k$ and $y_{t-1}, \ldots, y_1$ is given by the following expression:

$$d_{y_t}(a \mid z_t = k, y_{t-1}, \ldots, y_1) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{|a - y_t^k|^2}{2\sigma^2}}. \qquad (3.30)$$

If we replace $a$ with $y_t$, we get from eq.(3.29)

$$p_t^k = \frac{\sum_{n=1}^{K} d_{y_t}(y_t \mid z_t = k, y_{t-1}, \ldots, y_1) \cdot P_{nk} \cdot p_{t-1}^n}{\sum_{m=1}^{K} \sum_{n=1}^{K} d_{y_t}(y_t \mid z_t = m, y_{t-1}, \ldots, y_1) \cdot P_{nm} \cdot p_{t-1}^n} \qquad (3.31)$$

and from eq.(3.30)

$$d_{y_t}(y_t \mid z_t = k, y_{t-1}, \ldots, y_1) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{|y_t - y_t^k|^2}{2\sigma^2}}. \qquad (3.32)$$

Combining eqs.(3.31) and (3.32) and using $e_t^k = y_t - y_t^k$, we have obtained an informal "proof" of the following theorem.

Theorem 3.1 If the Markovian time series model presented above holds, the posterior probabilities $p_t^k$ evolve in time according to the following equation:

$$p_t^k = \frac{\left(\sum_{n=1}^{K} p_{t-1}^n \cdot P_{nk}\right) \cdot e^{-\frac{|e_t^k|^2}{2\sigma^2}}}{\sum_{m=1}^{K} \left(\sum_{n=1}^{K} p_{t-1}^n \cdot P_{nm}\right) \cdot e^{-\frac{|e_t^m|^2}{2\sigma^2}}}. \qquad (3.33)$$

Eq.(3.33) has been introduced in (Petridis and Kehagias, 1998). Note that it is compatible with the fixed source case; in that case we would have $P_{kk} = 1$ and $P_{nk} = 0$ for $n \neq k$, and eq.(3.33) would reduce to

$$p_t^k = \frac{p_{t-1}^k \cdot e^{-\frac{|e_t^k|^2}{2\sigma^2}}}{\sum_{n=1}^{K} p_{t-1}^n \cdot e^{-\frac{|e_t^n|^2}{2\sigma^2}}}, \qquad (3.34)$$

which is, as expected, the original basic PREMONN credit update equation (2.17) of Section 2.2.
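The update of eq.(3.33) amounts to mixing the previous credits through the transition matrix before reweighting them by the error likelihoods; a minimal sketch follows (the matrix entries and error values are illustrative, not from the text):

```python
import numpy as np

def markov_update(p, errors, P, sigma=0.1):
    """Markovian multiplicative credit update, cf. eq.(3.33)."""
    mixed = P.T @ p                                   # mixed[k] = sum_n p[n] * P[n, k]
    w = mixed * np.exp(-errors ** 2 / (2 * sigma ** 2))
    return w / w.sum()

P = np.array([[0.9, 0.1],                             # sources tend to persist
              [0.1, 0.9]])
p = np.array([0.5, 0.5])
for _ in range(10):                                   # source 0 predicts best
    p = markov_update(p, np.array([0.05, 0.3]), P)
```

The off-diagonal mass of $P$ keeps every credit bounded away from zero, which is what lets the scheme recover quickly after a source switch.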

3.4.2 Phenomenological Justification


Eq.(3.33) summarizes a classification algorithm for time series generated by Markovian sources. This algorithm has been obtained by probabilistic arguments. However, it can also be justified phenomenologically. Suppose no probabilistic information is available on the behavior of $y_1, y_2, \ldots$ and $z_1, z_2, \ldots$. However, the following facts are empirically known (or assumed).

1. There are $K$ sources which are activated successively in time to generate the time series $y_1, y_2, \ldots$.

2. If the $y_t$ sample of the time series is generated by the $k$-th source, then $y_t$ may be approximated by $y_t \simeq f_k(y_{t-1}, \ldots, y_{t-M})$ (where $f_k(\cdot)$ approximates the input/output behavior of $F_k(\cdot)$).

3. At any time $t$, the likelihood of the $n$-th source being deactivated and the $k$-th source being activated is given by $P_{nk}$ (values closer to one indicating higher likelihood).

Under the above assumptions, $p_t^k$ can be interpreted as the credit that the $k$-th source receives at time $t$ as a possible generator of $y_t$; $p_t^k$ will always lie in the interval $[0, 1]$, with values close to one indicating high likelihood. Then eq.(3.33) can be interpreted as a recursive scheme for credit update. This becomes clear by analyzing the terms in the numerator of eq.(3.33):

1. $p_{t-1}^n$ is the credit accumulated by the $n$-th source up to time $t - 1$;

2. this is multiplied by $P_{nk}$ to obtain the likelihood of an $n$ to $k$ transition;

3. the sum $\sum_{n=1}^{K} p_{t-1}^n P_{nk}$ indicates the likelihood of a transition to $k$ originating from any $n$; and

4. the term $e^{-\frac{|e_t^k|^2}{2\sigma^2}}$ corresponds to the likelihood of $y_t$ having been generated by the $k$-th source, which is related to the discrepancy $e_t^k = y_t - y_t^k$ between the actual observation $y_t$ and the expected value $y_t^k$.

The denominator in eq.(3.33) is used to normalize the newly computed credit in the interval $[0, 1]$. Now the probabilistic framework can be abandoned and a number of modifications are possible concerning the various terms appearing in the update equation (3.33) or, in fact, the form of the equation itself.

3.4.3 Connection with Hidden Markov Models


Hidden Markov models (HMM) are a very popular tool for modelling time series. For a good introduction see (Rabiner, 1988); for a very detailed exposition see (Elliott, Aggoun and Moore, 1995). It is worth noting that the model of a

Markovian switching source Zt combined with an observable equation of the


form (3.23) yields a (somewhat unconventional) HMM. For the sake of simplicity, let us consider the special case when $M = 1$; the case of $M > 1$ can be analyzed similarly. When $M = 1$, eq.(3.23) becomes

$$y_t = f_{z_t}(y_{t-1}) + e_t.$$

In this case, it is easy to prove that the process $(z_t, y_t)$ is Markovian, i.e.

$$\Pr(z_t \in A, y_t \in B \mid z_{t-1}, \ldots, y_{t-1}, \ldots) = \Pr(z_t \in A, y_t \in B \mid z_{t-1}, y_{t-1})$$

for any measurable sets $A \subset \Theta$ and $B \subset \mathbf{R}$ and for any $y_{t-n} \in \mathbf{R}$, $z_{t-n} \in \Theta$ ($n > 0$). In addition, the process $y_t$ obviously is a (deterministic) function of $(z_t, y_t)$. Hence the process $[(z_t, y_t), y_t]$ falls within the definition of HMMs. In fact, the posterior probability update eq.(3.33) is simply the forward (filtering) recursion of HMM state estimation (Rabiner, 1988). This connection between Markovian PREMONNs
and HMMs is further discussed in Chapter A.

3.5 MARKOVIAN MODIFICATIONS OF CREDIT ASSIGNMENT


SCHEMES
We present several possible variants of the Markovian credit update algorithm in tabular form (for the sake of brevity). All credit update algorithms presented here use the $N$-step block error defined in Section 3.2.2. Also, the algorithms presented below utilize a general error function $g(\cdot)$ rather than the quadratic function.
The modification of the fixed source credit update schemes (to account for
source switching) is fairly straightforward and there is a complete correspon-
dence between Tables 3.1 (for the fixed source case) and 3.2 (for the Markovian
source case). However, the following points require clarification.

1. Updating the credit at times $N, 2N, 3N, \ldots$ implies that $z_t$ also changes at times $N, 2N, 3N, \ldots$. Then it is appropriate to replace the state transition matrix $P$ (which expresses the likelihood of state transitions in one time step) by the matrix $R = P^N$ (which expresses the likelihood of state transitions in $N$ time steps). In addition, we will use a source variable $Z_t$ in place of $z_t$; however, this variable (unlike $Y_t$, $Y_t^k$ and so on) will not be considered as a vector variable with $N$ components, but as a scalar one. In short, we are introducing the source variable $Z_t$ which describes source switchings at times $N, 2N, 3N, \ldots$ and we will assume that source switchings at intermediate times are not possible.

2. The "transition" matrix formulation applies to multiplicative and incremental schemes; it also applies to fuzzy schemes, with the understanding that, if products are replaced by min operators and sums by max operators, then matrix multiplication must be accordingly modified.

3. Regarding additive and counting schemes, in place of the "transition" matrix $R$, a transition function $w(n, k)$ is used. This function penalizes source

Table 3.2. Summary of Markovian Source PREMONN Algorithms.

Algorithm — Credit Update Equation

Multiplicative: $p_t^k = \dfrac{\left(\sum_{n=1}^{K} p_{t-1}^n R_{nk}\right) \cdot e^{-g(E_t^k)}}{\sum_{m=1}^{K} \left(\sum_{n=1}^{K} p_{t-1}^n R_{nm}\right) \cdot e^{-g(E_t^m)}}$

Additive: $p_t^k = p_{t-1}^k + g(E_t^k) + w(\hat{k}_{t-1}, k)$

Counting: $p_t^k = p_{t-1}^k + 1(E_t^k < E_t^m \text{ for } m \neq k) + w(\hat{k}_{t-1}, k)$

Fuzzy Sum/Product: $p_t^k = \dfrac{\left(\sum_{n=1}^{K} p_{t-1}^n R_{nk}\right) \cdot e^{-g(E_t^k)}}{\sum_{m=1}^{K} \left(\sum_{n=1}^{K} p_{t-1}^n R_{nm}\right) \cdot e^{-g(E_t^m)}}$

Fuzzy Max/Min: $p_t^k = \dfrac{\bigvee_{n=1}^{K} p_{t-1}^n \wedge R_{nk} \wedge e^{-g(E_t^k)}}{\bigvee_{m=1}^{K} \bigvee_{n=1}^{K} p_{t-1}^n \wedge R_{nm} \wedge e^{-g(E_t^m)}}$

Incremental: eq.(3.22) with $p_{t-1}^n$ replaced by $\sum_{m=1}^{K} p_{t-1}^m R_{mn}$

transitions, i.e. the cost of the transition from source $n$ to source $k$ is given by $w(n, k)$.
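Point 1 above is a one-liner in practice: the $N$-step matrix $R$ is an ordinary matrix power (the matrix entries below are illustrative):

```python
import numpy as np

P = np.array([[0.9, 0.1],          # one-step source transition probabilities
              [0.2, 0.8]])
N = 5
R = np.linalg.matrix_power(P, N)   # likelihood of source transitions in N steps
```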

3.6 EXPERIMENTS
In this section we present some comparative experiments on classification of
computer generated time series. The goal of the experiments is to compare
the performance of the PREMONN classification algorithms presented in the
previous sections. In particular, we are interested in comparing classification
accuracy and noise robustness of the PREMONN algorithms. In addition we
want to explore the difference in performance between fixed and switching
sources versions of the same algorithms.
The data used for training and testing the PREMONN algorithms are generated by four chaotic sources (dynamical systems). Namely, the data are generated by sources of the form $y_t = f_k(y_{t-1})$,

Figure 3.2. One period of the test time series. (Plot of the signal value, in $[0, 1]$, against time steps 1 to 401.)

where $k = 1, 2, 3, 4$ and the functions $f_k(\cdot)$ are as follows:

$f_1(y) = 4 \cdot y \cdot (1 - y)$ (logistic);

$f_2(y) = \begin{cases} 2 \cdot y & \text{if } 0 \le y \le 0.5 \\ 2 \cdot (1 - y) & \text{if } 0.5 \le y \le 1.0 \end{cases}$ (tent map);

$f_3(y) = f_1(f_1(y))$ (double logistic);

$f_4(y) = f_2(f_2(y))$ (double tent map).
Training time series are obtained by running the recursions $y_t = f_k(y_{t-1})$ for $k = 1, 2, 3, 4$. Four predictors are trained, each on noise-free data from one of the four previously mentioned time series. The predictors are 1-4-1 sigmoid neural networks. The test time series is generated by successively activating each of the four sources, in the sequence $1 \to 2 \to 3 \to 4 \to 1$ etc. Every source remains active for 200 time steps at a time; a total of 5000 time steps is obtained. The statistical properties of the four time series are quite similar, and the same is true of their visual appearance. 400 time steps of the composite time series are presented in Figure 3.2. The reader may want to locate the number and position of the switching points in the graph (they are not at times 200 and 400).
This is an interesting and nontrivial task; the answer is given in footnote 5, at
the end of the chapter.
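The test series can be regenerated along these lines (a sketch under the stated setup; the initial condition and the phase of the switching schedule are our own choices, so the switching times need not match those visible in Figure 3.2):

```python
import numpy as np

def f1(y): return 4 * y * (1 - y)                      # logistic
def f2(y): return 2 * y if y <= 0.5 else 2 * (1 - y)   # tent map
def f3(y): return f1(f1(y))                            # double logistic
def f4(y): return f2(f2(y))                            # double tent map

def composite_series(n_steps=5000, block=200, y0=0.3):
    """Cycle 1 -> 2 -> 3 -> 4 -> 1 ..., each source active for `block` steps."""
    maps = [f1, f2, f3, f4]
    y, series, labels = y0, [], []
    for t in range(n_steps):
        k = (t // block) % 4
        y = maps[k](y)
        series.append(y)
        labels.append(k)
    return np.array(series), np.array(labels)

series, labels = composite_series()
```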
The five PREMONN algorithms (multiplicative, additive, counting, fuzzy max/min and incremental⁴) presented previously are now applied to the task

4The fuzzy sum/product algorithm is the same as the multiplicative one and hence is not
listed separately.

Table 3.3. Classification figure of merit $c$ for several PREMONN algorithms.

Noise $A$:  0.00 | 0.10 | 0.20 | 0.30 | 0.40

Fixed Source
  Multipl.: 0.9821 | 0.9640 | 0.9070 | 0.8180 | 0.6460
  Additive: 0.9893 | 0.9360 | 0.5430 | 0.3210 | 0.3210
  Counting: 0.9786 | 0.9360 | 0.9070 | 0.7390 | 0.5430
  Fuzzy:    0.9786 | 0.9360 | 0.9070 | 0.7390 | 0.5430
  ICRA:     0.9964 | 0.9820 | 0.9110 | 0.8110 | 0.6820

Switching Source
  Multipl.: 0.9821 | 0.9640 | 0.9070 | 0.8070 | 0.6390
  Additive: 0.9929 | 0.9390 | 0.5460 | 0.3210 | 0.3210
  Counting: 0.9964 | 0.9640 | 0.9390 | 0.7960 | 0.5860
  Fuzzy:    0.9786 | 0.9360 | 0.9070 | 0.7390 | 0.5430
  ICRA:     0.9893 | 0.9790 | 0.8860 | 0.6320 | 0.5290

of classifying the 5000 observations of the test time series to one of four possible categories (i.e. logistic, tent map, double logistic and double tent map). Both the fixed source and the Markovian switching source versions of each algorithm are used. All algorithms use the quadratic error function $g(E) = \frac{|E|^2}{2\sigma^2}$. We superimpose on the time series observations additive white noise distributed uniformly in the interval $[-A/2, A/2]$; the classification experiment is performed at various noise levels, which are expressed by $A$. Classification accuracy is measured by the classification figure of merit $c$, defined in Section 2.6. The results are summarized in Table 3.3.

3.7 CONCLUSIONS
We have presented several variations of the basic PREMONN classification
algorithm, argued for their phenomenological justification and performed nu-
merical experiments to compare the algorithms. While a relatively small set of
experiments has been presented here, the results corroborate our experience,
which can be summarized as follows.
First, there does not appear to be a significant advantage in the use of
the (Markovian) switching source algorithms; their fixed source counterparts
exhibit comparable and in fact usually better classification accuracy. Hence
the added complexity which must be incorporated in the algorithms to handle
the switching source situation does not appear to be justified.
Second, while the additive algorithms perform better than the multiplicative
ones under noise free conditions, they are not noise robust. The slightly more
complex multiplicative algorithm performs better.
A good combination of complexity and performance is offered by the count-
ing algorithm: while it is extremely simple to implement, it is quite accurate
and noise robust. The fuzzy algorithm has comparable performance; while it

is rather more complex than the counting algorithm, it may be easier to im-
plement than the multiplicative algorithm; the fuzzy interpretation may be
epistemologically more appealing to certain users.
Finally, the ICRA algorithm probably combines the most attractive features.
It has the highest accuracy and noise robustness and can be implemented in
hardware by a neural network.
The experiments presented in this chapter are only meant to impart to the
reader a general idea of the PREMONN algorithms performance. For a better
evaluation of the PREMONN potential two steps are required. In the following
chapter we will present a mathematical analysis of the algorithms and show
that in every case convergence to the "correct" classification is guaranteed
under mild and reasonable assumptions. Then, after a discussion of identification
problems (this is presented in Chapter 5), in Chapters 6 to 9 we will present
applications of the PREMONN algorithms to real world problems. 5

⁵Before concluding this chapter let us give the answer to the exercise posed on page 55. The switching times in Figure 3.2 are $t_1 = 101$ and $t_2 = 301$.
4 MATHEMATICAL ANALYSIS

In this chapter we formulate and prove convergence theorems for most of the
credit assignment schemes introduced in Chapter 3. The presentation style is
quite uniform: in every case a theorem is stated, which has the general form:
"the credit of the best model converges to one as time goes to infinity"; what
is meant by "best model" in every case is explained and various remarks are
made regarding the algorithm, the conditions necessary for convergence and so
on. The actual proofs are presented in several Appendices at the end of the
chapter.

4.1 INTRODUCTION

As explained in Chapter 3, we have generally found that fixed source algorithms perform better than their switching source counterparts (at least in the
case of Markovian switching). Hence the convergence theorems presented here
mostly regard the fixed source algorithms. Namely, in Section 4.2 we present
convergence theorems for the multiplicative, additive, counting, fuzzy and in-
cremental fixed source algorithms. These theorems are obviously useful for the
case where the generating sources are fixed; but they also offer useful insight
for the case of slowly switching sources. In this case the various convergence
theorems indicate that if source switching takes place at a slow rate then the
credit functions will converge within a time interval in which a single source is
active. "Slow rate" is not an absolute term; it depends on the rate of convergence of the credit functions; our theorems provide some explanation regarding factors which may influence the switching rate.

V. Petridis et al., Predictive Modular Neural Networks
© Kluwer Academic Publishers 1998
In Section 4.3 we also provide a convergence theorem for the case of the
multiplicative credit assignment algorithm with Markovian source switching
(this is the algorithm presented in Section 3.4). Similar theorems can be proved
for source switching algorithms with additive, counting etc. credit update.
However, since our interest in the source switching algorithms is limited and the
corresponding proofs fairly involved we do not proceed further in this direction.
In all cases examined here, we assume the predictors to have a general parametric form ($k = 1, \ldots, K$):

$$y_t^k = f_k(y_{t-1}, \ldots, y_{t-M}). \qquad (4.1)$$

No further special assumptions are made regarding the form of $f_k(\cdot)$. As already mentioned, $f_k(\cdot)$ can be a neural network (feedforward or recurrent; linear, sigmoid, RBF, polynomial etc.) or any other kind of predictor (Kalman filter, fuzzy system, spline regression etc.).

Finally, note that all theorems are proved for the $N$-step data block algorithms; of course the one-step case is also included by setting $N = 1$.

4.2 CONVERGENCE THEOREMS FOR FIXED SOURCE


ALGORITHMS

4.2.1 Multiplicative Credit Assignment; Quadratic Error Function


The multiplicative credit update scheme was presented in Section 3.3. The credit update equation is the following:

$$p_t^k = \frac{p_{t-1}^k \cdot e^{-g(E_t^k)}}{\sum_{n=1}^{K} p_{t-1}^n \cdot e^{-g(E_t^n)}}. \qquad (4.2)$$

As has already been explained, the function $g(\cdot)$ can be any strictly positive, increasing function. The most common choice for $g(\cdot)$ is

$$g(E_t^k) = \frac{|E_t^k|^2}{2\sigma^2},$$

in which case the credit update scheme becomes

$$p_t^k = \frac{p_{t-1}^k \cdot e^{-\frac{|E_t^k|^2}{2\sigma^2}}}{\sum_{n=1}^{K} p_{t-1}^n \cdot e^{-\frac{|E_t^n|^2}{2\sigma^2}}}. \qquad (4.3)$$

For example purposes we examine first the special case of eq.(4.3) and then the more general case of eq.(4.2). The following theorem is proved in Appendix 4.A.

Theorem 4.1 (Mult. Credit Assignment, Quadratic Error Function)


Suppose that the following assumptions hold
MATHEMATICAL ANALYSIS 61

A1 $y_1, y_2, \ldots$ is ergodic and square integrable.

A2 For $k = 1, 2, \ldots, K$ the function $f_k(z_1, \ldots, z_M)$ is measurable and there is a constant $a_k$ such that $|f_k(z_1, \ldots, z_M)| \le a_k \cdot \{|z_1| + |z_2| + \cdots + |z_M|\}$.

A3 $p_0^k > 0$ for $k = 1, 2, \ldots, K$.

Now, for $k = 1, 2, \ldots, K$, define $c_k \doteq e^{-\frac{E(|E_t^k|^2)}{2\sigma^2}}$ and suppose $c_m$ is the unique maximum of $c_k$ for $k = 1, 2, \ldots, K$. Then, for the $p_t^k$'s defined by eq.(4.3) we have, with probability 1, and for $k \neq m$,

$$\lim_{t \to \infty} p_t^m = 1, \qquad \lim_{t \to \infty} \frac{p_t^k}{p_t^m} = 0. \qquad (4.4)$$

Remark 1. The boundedness assumption A2 is required in order to establish square integrability of $E_t^k$; it can be replaced by any other appropriate condition that yields the same result.
Remark 2. The case where the maximum is not unique can also be handled; in this case the conclusion is that the total credit assigned to the set of maximum $c_k$'s converges to one.
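Theorem 4.1 is easy to check numerically. In the following sketch three hypothetical predictors produce zero-mean errors of different variances, so $c_k$ is maximized by the predictor with the smallest mean square error, and eq.(4.3) drives its credit to one (the variances and the value of $\sigma$ are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
stds = [0.2, 0.5, 0.8]          # predictor 0 has the smallest mean square error

p = np.full(3, 1 / 3)
for _ in range(2000):
    e = np.array([rng.normal(0, s) for s in stds])   # simulated one-step errors
    w = p * np.exp(-e ** 2 / (2 * sigma ** 2))       # multiplicative update, eq.(4.3)
    p = w / w.sum()
```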

4.2.2 Multiplicative Credit Assignment; General Error Function


The following theorem is proved in Appendix 4.A.

Theorem 4.2 (Multiplicative Credit Assignment) Suppose that the following assumptions hold:

B1 $y_1, y_2, \ldots$ is ergodic and integrable.

B2 For $k = 1, 2, \ldots, K$ the function $f_k(z_1, \ldots, z_M)$ is measurable and there is a constant $a_k$ such that $|f_k(z_1, \ldots, z_M)| \le a_k \cdot \{|z_1| + |z_2| + \cdots + |z_M|\}$.

B3 The function $g(z)$ is positive, increasing, measurable and there is a constant $b$ such that $|g(z)| \le b \cdot |z|$.

B4 $p_0^k > 0$ for $k = 1, 2, \ldots, K$.

Now, for $k = 1, 2, \ldots, K$, define $c_k \doteq e^{-E(g(E_t^k))}$ and suppose $c_m$ is the unique maximum of $c_k$ for $k = 1, 2, \ldots, K$. Then, for the $p_t^k$'s defined by eq.(4.2) we have, with probability 1, and for $k \neq m$,

$$\lim_{t \to \infty} p_t^m = 1, \qquad \lim_{t \to \infty} \frac{p_t^k}{p_t^m} = 0. \qquad (4.5)$$

Remark. The boundedness assumptions B2 and B3 are required in order to establish square integrability of $E_t^k$; they can be replaced by any other appropriate conditions that yield the same result.

4.2.3 Additive Credit Assignment


Let us now consider the case of an additive credit assignment algorithm where both the error function $g(\cdot)$ and the predictors $f_k(\cdot)$ are of general, nonlinear form. Here $p_t^k$ indicates discredit and is given by

$$p_t^k = p_{t-1}^k + g(E_t^k); \qquad (4.6)$$

here $g(\cdot)$ is a continuous, increasing and nonnegative function. The following theorem can be proved (see Appendix 4.B).

Theorem 4.3 (Additive Credit Assignment) Suppose that the following assumptions hold:

C1 $y_1, y_2, \ldots$ is ergodic and integrable.

C2 For $k = 1, 2, \ldots, K$ the function $f_k(z_1, \ldots, z_M)$ is measurable and there is a constant $a_k$ such that $|f_k(z_1, \ldots, z_M)| \le a_k \cdot \{|z_1| + |z_2| + \cdots + |z_M|\}$.

C3 The function $g(z)$ is positive, increasing, measurable and there is a constant $b$ such that $|g(z)| \le b \cdot |z|$.

Now define $c_k \doteq E(g(E_t^k))$ and suppose $c_m$ is the unique minimum of $c_k$ for $k = 1, 2, \ldots, K$. Then for $k \neq m$ we have, with probability 1,

$$\lim_{t \to \infty} \frac{p_t^k}{p_t^m} = \frac{E(g(E_t^k))}{E(g(E_t^m))}, \qquad \lim_{t \to \infty} \frac{p_t^m}{t} < \lim_{t \to \infty} \frac{p_t^k}{t}. \qquad (4.7)$$

Remark 1. The boundedness assumptions C2 and C3 are required in order to establish square integrability of $E_t^k$; they can be replaced by any other appropriate conditions that yield the same result.
Remark 2. Note that in this case the ratio of credit functions does not converge to zero; however the best model gets minimum discredit.

4.2.4 Counting Credit Assignment


Let us next consider the case of a counting credit assignment algorithm where both the error function $g(\cdot)$ and the predictors $f_k(\cdot)$ are of general, nonlinear form. Here $p_t^k$ is given by

$$p_t^k = p_{t-1}^k + 1(E_t^k < E_t^m \text{ for } m \neq k). \qquad (4.8)$$

The following theorem is proved in Appendix 4.C.

Theorem 4.4 (Counting Credit Assignment) Suppose that the following assumptions hold:

D1 $y_1, y_2, \ldots$ is ergodic.

D2 For $k = 1, 2, \ldots, K$ the function $f_k(z_1, \ldots, z_M)$ is measurable.

Now define $c_k \doteq \Pr(E_t^k < E_t^n \text{ for } n \neq k)$ and assume that there is an $m$ such that for all $k \neq m$ we have $c_m > c_k$. Then we have

$$\lim_{t \to \infty} \frac{p_t^k}{t} = \Pr(E_t^k < E_t^n \text{ for } n \neq k), \qquad \lim_{t \to \infty} \frac{p_t^m}{t} > \lim_{t \to \infty} \frac{p_t^k}{t}. \qquad (4.9)$$

Remark. Note that for this theorem minimal assumptions are required. On
the other hand, the conclusion is that the credit function of the best model is
greater than that of all other models, but it does not necessarily converge to
one.

4.2.5 Fuzzy Credit Assignment; Generalities


Next we provide convergence theorems for the sum/product and max/min fuzzy
credit assignment algorithms. There is a change in the point of view adopted in
proving these theorems. Unlike in previous sections, we avoid the use of prob-
abilistic concepts; in this instance we adopt the approach (Kosko, 1991) which
advocates that fuzzy set theory should be completely distinct and dissociated
from probabilistic concepts.
In a probabilistic context one would describe average behavior using expected values; but here we want to avoid probabilistic considerations. Our approach is based on the Cesaro average (Billingsley, 1986, p.572). For a sequence $x_1, x_2, \ldots$ we say that $x_t \to x$ in the Cesaro sense iff

$$\lim_{t \to \infty} \frac{x_1 + x_2 + \cdots + x_t}{t} = x. \qquad (4.10)$$

In short, $x_1, x_2, \ldots$ tends to $x$ on the average. It is easy to prove that convergence in the usual sense implies convergence in the Cesaro sense, but not conversely (Billingsley, 1986, p.572).
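For instance (a small illustration of eq.(4.10)): the sequence $(-1)^t$ does not converge, yet it converges to $0$ in the Cesaro sense:

```python
import numpy as np

def cesaro(xs):
    """Running Cesaro averages (x1 + ... + xt) / t."""
    xs = np.asarray(xs, dtype=float)
    return np.cumsum(xs) / np.arange(1, len(xs) + 1)

xs = [(-1) ** t for t in range(1, 10001)]   # -1, +1, -1, +1, ...
avg = cesaro(xs)                            # last entries are close to 0
```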

4.2.6 Fuzzy Credit Assignment; Sum/Product Algorithm


The next theorem refers to the "fuzzy" update equation

$$p_t^k = \frac{p_{t-1}^k \cdot e^{-g(|E_t^k|)}}{\sum_{n=1}^{K} p_{t-1}^n \cdot e^{-g(|E_t^n|)}} \qquad (4.11)$$

presented in Section 3.3.4. The following theorem is proved in Appendix 4.D.

Theorem 4.5 (Fuzzy Sum/Product Credit Assignment) Assume that:

E1 For $k = 1, 2, \ldots, K$ we have $0 < p_0^k < 1$.

E2 $\lim_{t \to \infty} \frac{g(|E_1^k|) + g(|E_2^k|) + \cdots + g(|E_t^k|)}{t}$ exists for $k = 1, 2, \ldots, K$.

Define the quantities $D_k \doteq \lim_{t \to \infty} \frac{g(|E_1^k|) + g(|E_2^k|) + \cdots + g(|E_t^k|)}{t}$ and $c_k \doteq e^{-D_k}$. If $c_m = \max_{k=1,2,\ldots,K} c_k$, then for the $p_t^k$ defined by eq.(4.11) and for all $k \neq m$ we have

$$\lim_{t \to \infty} p_t^m = 1, \qquad \lim_{t \to \infty} \frac{p_t^k}{p_t^m} = 0. \qquad (4.12)$$

Remark. Condition E2 requires that the limits $D_k$ exist for all $k$ (in a probabilistic context, this condition would hold for an ergodic time series; but our formulation avoids any reference to probabilistic concepts). If the above conditions are satisfied, convergence to the "best" class is guaranteed and, in the limit, the largest membership grade is attained by the $m$-th class, which minimizes the prediction error in the sense of the limit Cesaro average.

4.2.7 Fuzzy Credit Assignment; Max/Min Algorithm


The next theorem refers to the "fuzzy" update equation
k
Pt-l 1\ e-g(E;)
k (4.13)
Pt = VK pn 1\ e-g(E;')
n=l t-l
presented in Section 3.3.4. The following theorem is proved in Appendix 4.D.
Theorem 4.6 (Fuzzy Max/Min Credit Assignment) Assume that
Fl For k = 1,2, ... , K we have 0 < p~ < 1.
F2 Ck == limt->oo 9(Et)+g( Ei2+'''+ 9(En exists for k = 1,2, ... , K.
F3 For Ck,t == I~I we have (for all k and t) Ck,t ~ C k.
Now define Ck,t == e- Ck .'. If there are numbers a,j3,,",(,m such that for k #- m
and for all t we have
1>a> Cm,t > j3 > '"'( > Ck,t > 0, (4.14)
then for the pf defined by eq.(4.13) and for all k #- m we have
lim pr;' = 1, lim sup p~ < 1. (4.15)
t--oo t->oo
Remark 1. Assumption F3 requires that the N-step block error approximates the Cesaro average $\bar{C}_k$ for all k. The exact nature of the approximation is not specified. As the reader can see in Appendix 4.D, what is required is some form of approximation such that $\bar{C}_k$ can be replaced with $\bar{C}_{k,t}$.

Remark 2. If the conditions of Theorem 4.6 are satisfied exactly, then $p_t^m$ will increase monotonically and will achieve the value 1 in a finite number of steps; thereafter it will never decrease. If $p_t^m$ temporarily becomes less than one, for instance after a source switch (when it starts from a small initial value $p_0^m$) or because of random fluctuations in the observations, then it will increase monotonically until it becomes one. In realistic experiments F3 may be temporarily violated, especially if N (the $E_t^k$ block-error order) is small and/or the observations are very noisy. In such cases temporary decreases of $p_t^m$ may be observed. However, if the assumptions of Theorem 4.6 are satisfied, it can be seen from the proof of the theorem that $p_t^m = 1$ is a stable equilibrium point in the following sense: if $p_t^m = 1$ then $p_{t+1}^m = 1$; if $p_t^m < 1$ then $p_{t+n}^m = 1$ for some finite n. Therefore, the system will return to equilibrium in a finite number of steps.
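For comparison, here is a sketch of one max/min step (4.13); again the names and numbers are illustrative. Note how the winning credit is driven exactly to one, while the losing credit settles below one rather than at zero, mirroring eq.(4.15).

```python
import math

def max_min_update(p, errors, g=lambda e: abs(e) ** 2):
    """One step of the fuzzy max/min credit update, eq. (4.13).

    The product is replaced by min (fuzzy 'and') and the normalizing
    sum by max (fuzzy 'or'), so the credits are membership grades.
    """
    raw = [min(pk, math.exp(-g(e))) for pk, e in zip(p, errors)]
    denom = max(raw)  # the largest clipped credit normalizes to one
    return [r / denom for r in raw]

p = [0.5, 0.5]
for _ in range(30):
    p = max_min_update(p, errors=[0.2, 1.0])
```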
MATHEMATICAL ANALYSIS 65

4.2.8 Incremental Credit Assignment


The incremental credit assignment algorithm is based on the update equation
$$p_t^k = p_{t-1}^k + \gamma \cdot \left[ e^{-g(E_t^k)} - \left( \sum_{n=1}^K p_{t-1}^n \cdot e^{-g(E_t^n)} \right) \right] \cdot p_{t-1}^k, \qquad (4.16)$$

presented in Section 3.3.5. The following theorem is proved in Appendix 4.E

Theorem 4.7 (Incremental Credit Assignment) Assume the following.

G1 For k = 1, 2, ..., K we have $p_0^k > 0$.

G2 We have $\gamma < \frac{1}{K}$.

G3 $Y_1, Y_2, \ldots$ is ergodic.

G4 The function g(z) is positive, increasing, measurable and there is a constant a such that $|g(z)| \le a \cdot |z|$.

G5 The error process $E_1^k, E_2^k, \ldots$ is independent of $Y_1, Y_2, \ldots$

Define (for k = 1, ..., K) $c_k \equiv E(e^{-g(E_t^k)})$ and suppose $c_m$ is the unique maximum of $c_1, \ldots, c_K$. Then, with probability 1, for the $p_t^k$ defined by eq.(4.16) and for all $k \ne m$ we have
$$\lim_{t\to\infty} p_t^m = 1, \qquad \lim_{t\to\infty} p_t^k = 0, \qquad \lim_{t\to\infty} \frac{p_t^k}{p_t^m} = 0. \qquad (4.17)$$
Remark 1. $c_k = E(e^{-g(E_t^k)})$, i.e. the expectation of $e^{-g(E_t^k)}$. Since $g(|e|)$ is an increasing function of $|e|$, a large value of $c_k$ implies good predictive performance. In this sense, $c_k$ can be viewed as a prediction quality index and it is natural to consider as optimal the predictor m that has the maximum value $c_m$.

Remark 2. The theorem can be generalized to the case where there is more than one predictor that achieves the maximum $c_m$; then the total posterior probability of all such predictors will converge to 1.
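A minimal sketch of the incremental update (4.16) follows; names and numeric values are illustrative. The step size gamma is chosen below 1/K, as assumption G2 requires, which by Lemma 4-4.E.1 keeps every credit inside (0, 1).

```python
import math

def incremental_update(p, errors, gamma, g=lambda e: abs(e) ** 2):
    """One step of the incremental credit update, eq. (4.16).

    A credit grows when its own evidence e^{-g(E^k)} exceeds the
    credit-weighted average evidence, and shrinks otherwise.
    """
    evidence = [math.exp(-g(e)) for e in errors]
    avg = sum(pk * ev for pk, ev in zip(p, evidence))  # weighted mean
    return [pk + gamma * (ev - avg) * pk for pk, ev in zip(p, evidence)]

K = 3
p = [1.0 / K] * K          # uniform initial credits (G1)
gamma = 0.2                # gamma < 1/K = 1/3 (G2)
for _ in range(200):
    p = incremental_update(p, errors=[0.1, 0.5, 0.9], gamma=gamma)
```

The update is replicator-like: the sum of the credits is preserved exactly at every step, while the credit of the predictor with the best evidence grows at the expense of the others.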

4.3 CONVERGENCE THEOREM FOR A MARKOVIAN SWITCHING SOURCES ALGORITHM
Let us now consider the convergence of the switching sources multiplicative credit assignment algorithm. In this case we prove a result which is somewhat weaker than the previous theorems. Namely, rather than proving convergence of the credit functions $p_t^k$, which are random quantities, we prove convergence of some deterministic quantities $\pi_t^{kn}$, which approximate the $p_t^k$. Several possibilities exist for obtaining a stronger result (e.g. using stochastic approximation convergence methods (Kushner and Clark, 1978)), but we have not pursued this approach, since our experience is that the use of the Markovian switching mechanism does not yield significant performance improvements. For the same reason, we only consider the multiplicative case, and do not extend our analysis to the additive, counting etc. cases. For the multiplicative case, we show that the $\pi_t^{kn}$'s have the desired convergence behavior: if the "best" source is the m-th one, then $\pi_t^{mn}$ goes to one and the remaining ones to zero. To the extent that the $\pi_t^{kn}$'s approximate the $p_t^k$'s closely, the latter will also have the desired behavior that ensures correct classification and prediction.
To prove convergence of the Markovian credit update algorithm we need some preliminary definitions. Recall that the N-step error $E_t^k$ depends on the quantity N. Suppose, for the time being, that the parameter value is fixed at $Z_t = n$, for t = 1, 2, .... Define
$$c_{kn} \equiv E\!\left( e^{-g(E_t^k)} \,\middle|\, Z_t = n \right); \qquad (4.18)$$
note that $0 < c_{kn} < 1$ for all k, n. Now consider the quantities $\pi_t^{kn}$, defined for t = 1, 2, ... and k = 1, 2, ..., K by the recursion
$$\pi_t^{kn} = \frac{\left( \sum_{m=1}^K \pi_{t-1}^{mn} \cdot R_{mk} \right) \cdot c_{kn}}{\sum_{l=1}^K \left( \sum_{m=1}^K \pi_{t-1}^{mn} \cdot R_{ml} \right) \cdot c_{ln}}. \qquad (4.19)$$
Note the similarity of eq.(4.19) to the credit update equation
$$p_t^k = \frac{\left( \sum_{m=1}^K p_{t-1}^m \cdot R_{mk} \right) \cdot e^{-g(E_t^k)}}{\sum_{l=1}^K \left( \sum_{m=1}^K p_{t-1}^m \cdot R_{ml} \right) \cdot e^{-g(E_t^l)}}, \qquad (4.20)$$
presented in Chapter 3. Then the $\pi_t^{kn}$'s, as given by eq.(4.19), are convergent; this is the conclusion of the following theorem.

Theorem 4.8 Consider the system defined by eq.(4.19), with $c_{kn}$ defined by eq.(4.18) for k, n = 1, 2, ..., K. Suppose that for a fixed n ($1 \le n \le K$) the following conditions hold.

H1 For t = 1, 2, ... we have $Z_t = n$.

H2 For all $m \ne n$ we have $c_{nn} > c_{mn} \cdot \frac{1 + K\varepsilon}{1 - \varepsilon}$.

H3 $R > 0$ (i.e. for k, m = 1, 2, ..., K we have $R_{km} > 0$).

H4 There is some $\varepsilon > 0$ such that for k = 1, 2, ..., K we have $R_{kk} > 1 - \varepsilon$ and for $m \ne k$, $R_{km} < \varepsilon$.

Then, for the $\pi_t^{kn}$ defined in eq.(4.19), $\lim_{t\to\infty} \pi_t^{kn}$ exists for k = 1, 2, ..., K and $\lim_{t\to\infty} \pi_t^{nn} > \lim_{t\to\infty} \pi_t^{kn}$ for all $k \ne n$.

Remark 1. Condition H1 states that the n-th source remains active for all time. Of course this will not be true in a switching sources situation. However, the point of the theorem is to show that the switching sources algorithm will converge during time intervals between source switchings.

Remark 2. In H2, it is assumed that the n-th prediction quality index, $c_{nn}$, is the maximum one. In other words, H2 is stated so as to imply that the n-th source is matched to the n-th (best) predictor. Our usual assumption has been that the n-th source and n-th predictor are matched. However, if for some reason the source-to-predictor correspondence were permuted, so that the n-th active source corresponded to the (say) i-th predictor, then the conclusion of the theorem would still hold true, provided $c_{in} > \frac{1 + K\varepsilon}{1 - \varepsilon} \cdot c_{mn}$ is true for all $m \ne i$.

Remark 3. Note that H2 requires that for all $m \ne n$ the inequality $c_{nn} > \frac{1 + K\varepsilon}{1 - \varepsilon} \cdot c_{mn}$ holds; this is somewhat stronger than simply requiring the n-th predictor to be the best one (which would be expressed as $c_{nn} > c_{mn}$).

Remark 4. Condition H3 requires that R is positive, i.e. $R > 0$. This means that no source transition is completely impossible. In fact it would suffice to assume that R is primitive, i.e. there is some d such that $R^d > 0$; we take d = 1 to simplify the analysis, but we would reach the same conclusions for any d > 1. In addition, condition H4 requires that there is some $\varepsilon > 0$ such that for all m we have $R_{mm} > 1 - \varepsilon$ and $R_{mk} < \varepsilon$ for $k \ne m$. This is a condition for "slow switching". If parameter switching took place at a fast rate, there would not be enough time for the $p_t^k$ to converge to some limit between model switchings. Slow switching is guaranteed if $R_{mm}$ is significantly larger than $R_{mk}$.

Remark 5. The above remark about slow switching is related to the relationship between $p_t^k$ and $\pi_t^{kn}$. Suppose that $Z_t$ is fixed to n for some time, say $T_s$ time steps. Suppose also that (for k = 1, 2, ..., K) we have $c_{kn} \approx e^{-g(E_t^k)}$ (this assumption will be true if the original system has ergodic behavior and N is large). Finally, suppose that the convergence of $\pi_t^{kn}$ (which is guaranteed by the theorem) takes place (up to the desired accuracy) within some time, say $T_c$ time steps. If $N \ll T_c \ll T_s$, then it is reasonable that $p_t^k$, as given by (4.20), is approximated by $\pi_t^{kn}$, as given by (4.19).
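The Markovian update (4.20) interleaves a "time update" through the transition matrix R with the usual evidence reweighting. The sketch below uses illustrative values (the slow-switching matrix follows condition H4); note that, unlike the fixed-source schemes, the mixing step keeps every credit bounded away from zero.

```python
import math

def markov_credit_update(p, errors, R, g=lambda e: abs(e) ** 2):
    """One step of the Markovian switching credit update, eq. (4.20).

    R[m][k] is the assumed probability of switching from source m to
    source k; credits are first propagated through R, then reweighted
    by the prediction evidence e^{-g(E^k)} and normalized.
    """
    K = len(p)
    # Time update: mix the credits according to the transition matrix.
    prior = [sum(p[m] * R[m][k] for m in range(K)) for k in range(K)]
    # Measurement update: weight by the evidence and normalize.
    raw = [prior[k] * math.exp(-g(errors[k])) for k in range(K)]
    total = sum(raw)
    return [r / total for r in raw]

# Slow switching: diagonal entries close to one (condition H4).
R = [[0.98, 0.02], [0.02, 0.98]]
p = [0.5, 0.5]
for _ in range(50):
    p = markov_credit_update(p, errors=[0.1, 0.8], R=R)
```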

4.4 CONCLUSIONS

We have presented convergence proofs for all the fixed source algorithms in-
troduced in the previous chapter, and also for one of the switching sources
algorithms. It can be seen that under reasonable conditions, any of the above
algorithms can be expected to converge to correct classification. Hence we may
use these algorithms confidently. It is important that no probabilistic interpre-
tation of the algorithms was necessary to prove convergence.

Appendix 4.A: Convergence Proofs for Fixed Source Multiplicative Algorithms
Proof of Theorem 4.1: The credit update equation of the general multiplicative scheme is
$$p_t^k = \frac{p_{t-1}^k \cdot e^{-|E_t^k|^2 / 2\sigma^2}}{\sum_{m=1}^K p_{t-1}^m \cdot e^{-|E_t^m|^2 / 2\sigma^2}}.$$
It follows that for every $k \ne m$ we have
$$\frac{p_t^k}{p_t^m} = \frac{p_{t-1}^k}{p_{t-1}^m} \cdot \frac{e^{-|E_t^k|^2 / 2\sigma^2}}{e^{-|E_t^m|^2 / 2\sigma^2}}.$$
Repeating the argument for t − 1, t − 2, ..., 1 we get
$$\frac{p_t^k}{p_t^m} = \frac{p_0^k}{p_0^m} \cdot \frac{e^{-|E_1^k|^2 / 2\sigma^2}}{e^{-|E_1^m|^2 / 2\sigma^2}} \cdots \frac{e^{-|E_t^k|^2 / 2\sigma^2}}{e^{-|E_t^m|^2 / 2\sigma^2}} \Rightarrow \sqrt[t]{\frac{p_t^k}{p_t^m}} = \sqrt[t]{\frac{p_0^k}{p_0^m}} \cdot \frac{e^{-\sum_{s=1}^t |E_s^k|^2 / 2\sigma^2 t}}{e^{-\sum_{s=1}^t |E_s^m|^2 / 2\sigma^2 t}}. \qquad (4.A.1)$$

Because $f_k(\cdot)$ is a measurable function of the y's, $y_t^k$ is ergodic (see (Breiman, 1968, p.119)). The same holds for $E_t^k = y_t - y_t^k$ and for $|E_t^k|^2$. Notice that, because of the boundedness assumption A2, $y_t^k$ is square integrable. Since $y_t$ is also square integrable, it follows that $E_t^k = y_t - y_t^k$ is also square integrable: $E(|E_t^k|^2) < \infty$. Incidentally, this shows that the $c_k$'s are well defined and finite, and $c_m$ exists. Since $E_t^k$ is ergodic and square integrable, it follows that, with probability one,
$$\lim_{t\to\infty} \frac{\sum_{s=1}^t |E_s^k|^2}{t} = E(|E_t^k|^2).$$
By the continuity of the exponential function and (4.A.1) we have that, for all $\varepsilon > 0$ and almost all $y_1, y_2, \ldots$, there is a $t_\varepsilon$ (depending on $y_1, y_2, \ldots$) such that for all $t \ge t_\varepsilon$
$$\sqrt[t]{\frac{p_t^k}{p_t^m}} \le \sqrt[t]{\frac{p_0^k}{p_0^m}} \cdot \left( \frac{e^{-E(|E_t^k|^2) / 2\sigma^2}}{e^{-E(|E_t^m|^2) / 2\sigma^2}} + \varepsilon \right). \qquad (4.A.2)$$

By assumption, for all $k \ne m$ we have $e^{-E(|E_t^k|^2)/2\sigma^2} = c_k < c_m = e^{-E(|E_t^m|^2)/2\sigma^2}$, so we can choose $\varepsilon$ such that $\frac{c_k}{c_m} + \varepsilon < 1$. Then, raising (4.A.2) to the t-th power, we have that for all $t \ge t_\varepsilon$ and almost all $y_1, y_2, \ldots$
$$0 \le \frac{p_t^k}{p_t^m} \le \frac{p_0^k}{p_0^m} \cdot \left( \frac{c_k}{c_m} + \varepsilon \right)^t. \qquad (4.A.3)$$
The third part of eq.(4.4) follows easily from eq.(4.A.3). Note that the term $p_0^k / p_0^m$ does not affect convergence, as long as neither $p_0^k$ nor $p_0^m$ is zero. Hence the initial values of the credit functions are not crucial to the convergence of the algorithm, as long as they are not zero. Now, from eq.(4.A.3) we also have
$$0 \le \max_{k\ne m} \frac{p_t^k}{p_t^m} \le \left( \max_{k\ne m} \frac{p_0^k}{p_0^m} \right) \cdot \left( \max_{k\ne m} \frac{c_k}{c_m} + \varepsilon \right)^t. \qquad (4.A.4)$$
Since, for k = 1, 2, ..., K, the $p_0^k$'s are given, the first bracket in eq.(4.A.4) is fixed. Since, for k = 1, 2, ..., K, the $c_k$'s are given, and for $k \ne m$ we have $c_k < c_m$, we can choose $\varepsilon$ small enough so that the second bracket in eq.(4.A.4) is less than one. So it follows that $\lim_{t\to\infty} \max_{k\ne m} \left( \frac{p_t^k}{p_t^m} \right) = 0$ with probability 1. Then we have (with probability 1)
$$0 \le \sum_{k\ne m} \frac{p_t^k}{p_t^m} \le (K-1) \cdot \max_{k\ne m} \frac{p_t^k}{p_t^m} \to 0.$$
Hence $\sum_{k\ne m} \frac{p_t^k}{p_t^m}$ tends to 0 with probability 1. Since $p_t^m \le 1$ it follows that $\sum_{k\ne m} p_t^k$ tends to 0 with probability 1. Using this fact in conjunction with $p_t^k \ge 0$ we obtain the second part of eq.(4.4); using it in conjunction with $\sum_{k\ne m} p_t^k + p_t^m = 1$ we obtain the first part of eq.(4.4).•
Proof of Theorem 4.2: The proof is very similar to that of Theorem 4.1. The credit update equation of the general multiplicative scheme is
$$p_t^k = \frac{p_{t-1}^k \cdot e^{-g(E_t^k)}}{\sum_{m=1}^K p_{t-1}^m \cdot e^{-g(E_t^m)}}.$$
It follows that for every $k \ne m$ we have
$$\frac{p_t^k}{p_t^m} = \frac{p_{t-1}^k}{p_{t-1}^m} \cdot \frac{e^{-g(E_t^k)}}{e^{-g(E_t^m)}}; \qquad (4.A.5)$$
repeating the argument we get
$$\frac{p_t^k}{p_t^m} = \frac{p_0^k}{p_0^m} \cdot \frac{e^{-g(E_1^k)}}{e^{-g(E_1^m)}} \cdot \frac{e^{-g(E_2^k)}}{e^{-g(E_2^m)}} \cdots \frac{e^{-g(E_t^k)}}{e^{-g(E_t^m)}} \Rightarrow \sqrt[t]{\frac{p_t^k}{p_t^m}} = \sqrt[t]{\frac{p_0^k}{p_0^m}} \cdot \frac{e^{-\sum_{s=1}^t g(E_s^k)/t}}{e^{-\sum_{s=1}^t g(E_s^m)/t}}. \qquad (4.A.6)$$

Because $f_k(\cdot)$ is measurable, $y_t^k$ is ergodic, and so are $E_t^k = y_t - y_t^k$ and $g(E_t^k)$ (since $g(\cdot)$ is measurable). Because $y_t$ is integrable and $f_k(\cdot)$ is bounded (Assumption B2), it follows that $y_t^k$ is also integrable; because g is bounded (Assumption B3), it follows that $g(E_t^k)$ is also integrable. Since $g(E_t^k)$ is integrable and ergodic, we have for all k = 1, 2, ..., K and with probability one that
$$\lim_{t\to\infty} \frac{\sum_{s=1}^t g(E_s^k)}{t} = E(g(E_t^k)).$$
By the continuity of the exponential function and (4.A.6) we see that for all $\varepsilon > 0$ and almost all $y_1, y_2, \ldots$, there is a $t_\varepsilon$ (depending on $y_1, y_2, \ldots$) such that for all $t \ge t_\varepsilon$
$$\sqrt[t]{\frac{p_t^k}{p_t^m}} \le \sqrt[t]{\frac{p_0^k}{p_0^m}} \cdot \left( \frac{e^{-E(g(E_t^k))}}{e^{-E(g(E_t^m))}} + \varepsilon \right). \qquad (4.A.7)$$
By assumption, for all $k \ne m$ we have $e^{-E(g(E_t^k))} = c_k < c_m = e^{-E(g(E_t^m))}$, so we can choose $\varepsilon$ such that $\frac{c_k}{c_m} + \varepsilon < 1$. Then, raising (4.A.7) to the t-th power, we have that for all $t \ge t_\varepsilon$ and almost all $y_1, y_2, \ldots$
$$0 \le \frac{p_t^k}{p_t^m} \le \frac{p_0^k}{p_0^m} \cdot \left( \frac{c_k}{c_m} + \varepsilon \right)^t. \qquad (4.A.8)$$
The rest of the proof is exactly like that of Theorem 4.1 and hence is omitted.•
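The geometric decay established by Theorems 4.1 and 4.2 is easy to observe numerically. The following sketch (noise levels and seed are invented for the illustration) simulates two predictors whose errors are Gaussian with different variances; the credit of the lower-variance predictor converges to one.

```python
import math
import random

random.seed(0)

def multiplicative_step(p, errors, g=lambda e: abs(e) ** 2):
    # General multiplicative credit update analyzed in Theorem 4.2.
    raw = [pk * math.exp(-g(e)) for pk, e in zip(p, errors)]
    total = sum(raw)
    return [r / total for r in raw]

p = [0.5, 0.5]
for _ in range(500):
    # Predictor 0 is "matched" to the source: smaller typical errors.
    errors = [random.gauss(0.0, 0.3), random.gauss(0.0, 1.0)]
    p = multiplicative_step(p, errors)
```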

Appendix 4.B: Convergence Proof for Fixed Source Additive Algorithm


Proof of Theorem 4.3: The credit update equation of the general additive scheme is
$$p_t^k = p_{t-1}^k + g(E_t^k).$$
From the credit update equation it follows that
$$p_t^k = \sum_{s=1}^t g(E_s^k). \qquad (4.B.1)$$
Because $f_k(\cdot)$ is measurable, $y_t^k$ is ergodic; this means that $E_t^k = y_t - y_t^k$ is also ergodic. Since $g(\cdot)$ is measurable, $g(E_t^k)$ is also ergodic. Because $y_t$ is integrable and $f_k$ is bounded (Assumption C2), it follows that $y_t^k$ is also integrable; because g is bounded (Assumption C3), it follows that $g(E_t^k)$ is also integrable. Since $g(E_t^k)$ is integrable and ergodic, we have for all k = 1, 2, ..., K and with probability one that
$$\lim_{t\to\infty} \frac{p_t^k}{t} = \lim_{t\to\infty} \frac{\sum_{s=1}^t g(E_s^k)}{t} = E(g(E_t^k)).$$
Then the first part of eq.(4.7) follows immediately by dividing $p_t^k$ with t; the second part of eq.(4.7) follows from the assumption regarding the $C_k$'s.•

Appendix 4.C: Convergence Proof for Fixed Source Counting Algorithm


Proof of Theorem 4.4: The credit update equation of the general counting scheme is
$$p_t^k = p_{t-1}^k + \mathbf{1}(E_t^k < E_t^n \text{ for all } n \ne k).$$
Using the same arguments as in the previous theorems, we conclude that $E_t^k$ is ergodic for all k. From the credit update equation it follows that
$$p_t^k = \sum_{s=1}^t \mathbf{1}(E_s^k < E_s^n \text{ for all } n \ne k) \Rightarrow \frac{p_t^k}{t} = \frac{\sum_{s=1}^t \mathbf{1}(E_s^k < E_s^n \text{ for all } n \ne k)}{t}. \qquad (4.C.1)$$
Since the function $\mathbf{1}(E_t^k < E_t^n \text{ for all } n \ne k)$ is measurable and (obviously) integrable, we can use the ergodic theorem and conclude that with probability one
$$\lim_{t\to\infty} \frac{\sum_{s=1}^t \mathbf{1}(E_s^k < E_s^n \text{ for all } n \ne k)}{t} = E\big(\mathbf{1}(E_t^k < E_t^n \text{ for all } n \ne k)\big) = \Pr(E_t^k < E_t^n \text{ for all } n \ne k). \qquad (4.C.2)$$
Both parts of eq.(4.9) follow immediately from eqs.(4.C.1) and (4.C.2).•

Appendix 4.D: Convergence Proofs for Fixed Source Fuzzy Algorithms


Proof of Theorem 4.5: This proof resembles the proof of Theorem 4.2, but does not use a probabilistic analysis. Recall the credit update equation is
$$p_t^k = \frac{p_{t-1}^k \cdot e^{-g(E_t^k)}}{\sum_{m=1}^K p_{t-1}^m \cdot e^{-g(E_t^m)}}. \qquad (4.D.1)$$
Take any $k \ne m$; from eq.(4.D.1) we have (repeating the argument for s = t − 1, t − 2, ..., 1)
$$\frac{p_t^k}{p_t^m} = \frac{p_0^k}{p_0^m} \cdot \frac{e^{-\sum_{s=1}^t g(E_s^k)}}{e^{-\sum_{s=1}^t g(E_s^m)}} \Rightarrow \sqrt[t]{\frac{p_t^k}{p_t^m}} = \sqrt[t]{\frac{p_0^k}{p_0^m}} \cdot \frac{e^{-\sum_{s=1}^t g(E_s^k)/t}}{e^{-\sum_{s=1}^t g(E_s^m)/t}}. \qquad (4.D.2)$$
Using the limits of the Cesaro averages of $g(E_s^k)$, $g(E_s^m)$ and the continuity of the exponential function, we conclude that for every $\varepsilon > 0$ there is some $t_\varepsilon$ such that for all $t > t_\varepsilon$ we have
$$\sqrt[t]{\frac{p_t^k}{p_t^m}} \le \sqrt[t]{\frac{p_0^k}{p_0^m}} \cdot \left( \frac{c_k}{c_m} + \varepsilon \right). \qquad (4.D.3)$$

Since, by assumption E2, $c_m$ is strictly larger than $c_k$ for all $k \ne m$, one can find some $\varepsilon$ small enough that the bracketed term above is less than one. Raise eq.(4.D.3) to the t-th power; for every $t > t_\varepsilon$ we get
$$\frac{p_t^k}{p_t^m} \le \frac{p_0^k}{p_0^m} \cdot \left( \frac{c_k}{c_m} + \varepsilon \right)^t. \qquad (4.D.4)$$
Since $\left( \frac{c_k}{c_m} + \varepsilon \right) < 1$, taking the limit as t goes to infinity we have
$$\lim_{t\to\infty} \frac{p_t^k}{p_t^m} = 0 \quad \forall k \ne m;$$
this proves the third part of eq.(4.12). The first and second parts of the same equation are proved in much the same way as in Theorem 4.2.•
Proof of Theorem 4.6: The credit update equation for the fuzzy max/min algorithm can be written using $c_{k,t}$ as
$$p_t^k = \frac{p_{t-1}^k \wedge c_{k,t}}{\bigvee_{n=1}^K \left( p_{t-1}^n \wedge c_{n,t} \right)}. \qquad (4.D.5)$$
Suppose that for some time s we have $p_s^m < c_{m,s}$; then, since $p_s^m \wedge c_{m,s}$ is the minimum of $p_s^m$, $c_{m,s}$, we must have $p_s^m \wedge c_{m,s} = p_s^m$ and eq.(4.D.5) yields
$$p_{s+1}^m = \frac{p_s^m}{\bigvee_{n=1}^K \left( p_s^n \wedge c_{n,s} \right)}. \qquad (4.D.6)$$
On the other hand, for n = 1, 2, ..., K
$$p_s^n \wedge c_{n,s} \le c_{n,s} \Rightarrow \bigvee_{n=1}^K \left( p_s^n \wedge c_{n,s} \right) \le \bigvee_{n=1}^K c_{n,s} = c_{m,s}, \qquad (4.D.7)$$
where the last maximum equals $c_{m,s}$ by (4.14). Use (4.D.7) in the denominator of the right hand side of (4.D.6):
$$p_{s+1}^m = \frac{p_s^m}{\bigvee_{n=1}^K \left( p_s^n \wedge c_{n,s} \right)} \ge \frac{p_s^m}{c_{m,s}} \ge p_s^m \cdot \frac{1}{\alpha}. \qquad (4.D.8)$$
The last inequality follows from (4.14). Now, applying (4.D.8) $\tau$ times we get
$$p_{s+\tau}^m \ge p_s^m \cdot \frac{1}{\alpha^\tau},$$
and taking $\tau$ large enough, $p_{s+\tau}^m$ will get larger than $c_{m,s+\tau}$, which is bounded above by eq.(4.14). In short: for some $t_0 = s + \tau$ we have
$$p_{t_0}^m \ge c_{m,t_0} \Rightarrow p_{t_0}^m \wedge c_{m,t_0} = c_{m,t_0} \ge \beta; \qquad (4.D.9)$$
at the same time, for $k \ne m$
$$p_{t_0}^k \wedge c_{k,t_0} \le c_{k,t_0} \le \gamma. \qquad (4.D.10)$$

Combining (4.D.9) and (4.D.10) (and using the assumption $\beta > \gamma$) it is concluded that
$$\bigvee_{n=1}^K \left( p_{t_0}^n \wedge c_{n,t_0} \right) \le \bigvee_{n=1}^K c_{n,t_0} = c_{m,t_0} \Rightarrow p_{t_0+1}^m \ge \frac{c_{m,t_0}}{c_{m,t_0}} = 1.$$
Since by construction we also have $1 \ge p_{t_0+1}^m$, it follows that $p_{t_0+1}^m = 1$. But then $p_{t_0+1}^m \ge c_{m,t_0+1}$ and the argument can be repeated from (4.D.9) onwards. From this it follows that there is some $t_0$ such that for all $t \ge t_0$ we have $p_t^m = 1$. This yields the first part of eq.(4.15). Similarly, one sees that for all $t \ge t_0$ and $k \ne m$
$$p_{t+1}^k \le \frac{c_{k,t}}{c_{m,t}} \le \frac{\gamma}{\beta} < 1; \qquad (4.D.11)$$
taking lim sup on both sides of (4.D.11) we obtain the second part of eq.(4.15) and the proof is concluded.•

Appendix 4.E: Convergence Proof for Fixed Source Incremental Algorithm


To prove Theorem 4.7 we first need the following Lemma.

Lemma 4-4.E.1 If $\sum_{k=1}^K p_0^k = 1$, then $\sum_{k=1}^K p_t^k = 1$ for t = 1, 2, .... If, in addition, $\gamma < \frac{1}{K}$, then $0 < p_t^k < 1$ for k = 1, 2, ..., K and for t = 1, 2, ....

Proof: The proof will be by induction. Supposing $\sum_{k=1}^K p_{t-1}^k = 1$, it will be shown that $\sum_{k=1}^K p_t^k = 1$ as well. Recall that the incremental credit update equation is
$$p_t^k = p_{t-1}^k + \gamma \cdot \left[ e^{-g(E_t^k)} - \left( \sum_{n=1}^K p_{t-1}^n \cdot e^{-g(E_t^n)} \right) \right] \cdot p_{t-1}^k. \qquad (4.E.1)$$
Summing the update equation over k (and using $\sum_{k=1}^K p_{t-1}^k = 1$) we get:
$$\sum_{k=1}^K p_t^k = \sum_{k=1}^K p_{t-1}^k + \gamma \cdot \sum_{k=1}^K p_{t-1}^k \cdot e^{-g(E_t^k)} - \gamma \cdot \left[ \sum_{n=1}^K p_{t-1}^n \cdot e^{-g(E_t^n)} \right] \cdot 1 = 1. \qquad (4.E.2)$$
Since the proposition is true for t = 0, applying (4.E.2) repeatedly for t = 1, 2, ... proves the first part of the Lemma.

To prove positivity of $p_t^k$, we will again use induction. Suppose that for t = s we have $0 < p_{s-1}^k < 1$ for k = 1, 2, ..., K. Now we will show that
$$1 + \gamma \cdot \left[ e^{-g(E_s^k)} - \left( \sum_{n=1}^K p_{s-1}^n \cdot e^{-g(E_s^n)} \right) \right] > 0. \qquad (4.E.3)$$
To prove eq.(4.E.3), first note that
$$e^{-g(E_s^k)} - \left( \sum_{n=1}^K p_{s-1}^n \cdot e^{-g(E_s^n)} \right) > - \sum_{n=1}^K p_{s-1}^n \cdot e^{-g(E_s^n)} > -K, \qquad (4.E.4)$$
since $p_{s-1}^n < 1$ and $e^{-g(E_s^n)} < 1$. Then, from eq.(4.E.4) it follows that
$$1 + \gamma \cdot \left[ e^{-g(E_s^k)} - \left( \sum_{n=1}^K p_{s-1}^n \cdot e^{-g(E_s^n)} \right) \right] > 1 - \frac{1}{K} \cdot K = 0.$$
So we have shown eq.(4.E.3). This means that for k = 1, 2, ..., K we have $p_s^k > 0$. Then, since we already know that $\sum_{k=1}^K p_s^k = 1$, it follows that for k = 1, 2, ..., K we have $p_s^k < 1$. Now we can use induction on s to complete the proof of the Lemma.•
Proof of Theorem 4.7: For t = 0, 1, 2, ..., define $\mathcal{F}_t$ to be the sigma-field generated by $p_0^k$ and $\{E_s^k\}_{s=0}^t$, with k = 1, ..., K. Define $\mathcal{F}_\infty \equiv \bigcup_{t=1}^\infty \mathcal{F}_t$. Now, $p_t^k$ is $\mathcal{F}_t$-measurable, for all k, t.¹ This is so because $p_t^k$ is a function of $E_t^1, \ldots, E_t^K$ and of $p_{t-1}^1, \ldots, p_{t-1}^K$. But $p_{t-1}^1, \ldots, p_{t-1}^K$ are in turn functions of $E_{t-1}^1, \ldots, E_{t-1}^K$ and of $p_{t-2}^1, \ldots, p_{t-2}^K$, and so on. In short, $p_t^k$ is a function of $E_1^1, \ldots, E_1^K, \ldots, E_t^1, \ldots, E_t^K$. Hence it is clearly $\mathcal{F}_t$-measurable.

Also, for k = 1, 2, ..., K, t = 0, 1, 2, ..., define $\pi_t^k \equiv E(p_t^k)$. In eq.(4.E.1) let k = m and take conditional expectations with respect to $\mathcal{F}_{t-1}$. For all k and t we have $E(p_{t-1}^k \mid \mathcal{F}_{t-1}) = p_{t-1}^k$ and $E(e^{-g(E_t^k)} \mid \mathcal{F}_{t-1}) = E(e^{-g(E_t^k)}) = c_k$. In other words, $e^{-g(E_t^k)}$ is independent of $\mathcal{F}_{t-1}$. This is so because we assumed the noise process $\{E_t^k\}_{t=1}^\infty$ to be white, hence $E_t^k$ is independent of $E_s^l$, l = 1, ..., K, s = 1, ..., t − 1. Finally, from Lemma 4-4.E.1, $\sum_{k=1}^K p_{t-1}^k = 1$, hence the update equation eq.(4.E.1) yields
$$E(p_t^m \mid \mathcal{F}_{t-1}) = \left\{ 1 + \gamma \left[ c_m - \left( \sum_{n=1}^K p_{t-1}^n \cdot c_n \right) \right] \right\} \cdot p_{t-1}^m \Rightarrow$$
$$E(p_t^m \mid \mathcal{F}_{t-1}) \ge \left\{ 1 + \gamma \left[ c_m - c_m \cdot \left( \sum_{n=1}^K p_{t-1}^n \right) \right] \right\} \cdot p_{t-1}^m \Rightarrow$$
$$E(p_t^m \mid \mathcal{F}_{t-1}) \ge p_{t-1}^m. \qquad (4.E.5)$$

¹A sigma-field $\mathcal{F}$ generated by random variables $u_1, u_2, \ldots$ is defined to be the set of all sets of events dependent only on $u_1, u_2, \ldots$. A random variable v is said to be $\mathcal{F}$-measurable if knowledge of $u_1, u_2, \ldots$ completely determines v; in other words, either v is one of $u_1, u_2, \ldots$ or it is a function of them: $v(u_1, u_2, \ldots)$. Note that the total number of $u_1, u_2, \ldots$ may be finite, countably infinite or even uncountably infinite. For more details see (Billingsley, 1986).

From eq.(4.E.5) it follows that $\{p_t^m\}_{t=0}^\infty$ is a submartingale. Since $0 \le E(p_t^m) = E(|p_t^m|) \le 1$, we can use the Submartingale Convergence Theorem and conclude that, with probability 1, the sequence $\{p_t^m\}_{t=0}^\infty$ converges to some random variable, call it $p^m$, where $p^m$ is $\mathcal{F}_\infty$-measurable. We have assumed that $p_0^m > 0$; from this, and a slight rearrangement of eq.(4.E.1), it can be seen that for all t we have $p_t^m > 0$. From this it is easy to prove that $p^m = \lim_{t\to\infty} p_t^m > 0$. Hence, convergence of $p_t^m$ does not depend on the initial values $p_0^k$, k = 1, 2, ..., K, as long as $p_0^m$ is greater than zero. However, we still do not know whether the sequences $\{p_t^k\}_{t=0}^\infty$, $k \ne m$, converge. Similarly, since $p_t^m \to p^m$, we can take expectations and obtain $E(p_t^m) \to E(p^m) = \pi^m$; but we do not know whether $E(p_t^k)$ converges for $k \ne m$. However, since $\sum_{k=1}^K p_t^k = 1$ for all t, we have that $E\big(\sum_{k\ne m} p_t^k\big) = 1 - E(p_t^m) \to 1 - \pi^m$. Now, if in (4.E.1) we set k = m and take the limit as $t \to \infty$, we obtain
$$p^m = \lim_{t\to\infty} \left\{ 1 + \gamma \left[ e^{-g(E_t^m)} - \left( \sum_{n=1}^K p_{t-1}^n \cdot e^{-g(E_t^n)} \right) \right] \right\} \cdot p_{t-1}^m. \qquad (4.E.6)$$
Since $p^m = \lim_{t\to\infty} p_t^m > 0$, (4.E.6) implies
$$p^m = \lim_{t\to\infty} \left\{ 1 + \gamma \left[ e^{-g(E_t^m)} - \left( \sum_{n=1}^K p_{t-1}^n \cdot e^{-g(E_t^n)} \right) \right] \right\} \cdot p^m; \qquad (4.E.7)$$
the important point is that the quantity in curly brackets has a limit. Since $p^m > 0$, it can be cancelled on both sides of (4.E.7); then we get
$$1 = \lim_{t\to\infty} \left\{ 1 + \gamma \left[ e^{-g(E_t^m)} - \left( \sum_{n=1}^K p_{t-1}^n \cdot e^{-g(E_t^n)} \right) \right] \right\} \Rightarrow$$
$$0 = \lim_{t\to\infty} \left[ e^{-g(E_t^m)} - \left( \sum_{n=1}^K p_{t-1}^n \cdot e^{-g(E_t^n)} \right) \right] \Rightarrow$$
$$\lim_{t\to\infty} \left[ e^{-g(E_t^m)} \cdot \left( 1 - p_{t-1}^m \right) \right] = \lim_{t\to\infty} \left[ \sum_{n\ne m} p_{t-1}^n \cdot e^{-g(E_t^n)} \right] \Rightarrow$$
(taking expectations, using the Dominated Convergence Theorem² and bounding the right hand side with $\bar{c} \equiv \max_{k\ne m} c_k$, noting that $\bar{c} < c_m$)
$$c_m \cdot (1 - \pi^m) \le \bar{c} \cdot (1 - \pi^m). \qquad (4.E.8)$$
From eq.(4.E.8) it follows immediately that $\pi^m = 1$; otherwise we could cancel $1 - \pi^m$ from both sides of (4.E.8) and obtain $c_m \le \bar{c}$, which is a contradiction. Hence $1 = \pi^m = \lim_{t\to\infty} \pi_t^m$, i.e. $1 = \lim_{t\to\infty} E(p_t^m) = E(\lim_{t\to\infty} p_t^m)$. Since $\lim_{t\to\infty} p_t^m \le 1$, we must have $\lim_{t\to\infty} p_t^m = 1$ with probability 1; it follows that $\lim_{t\to\infty} p_t^n = 0$, for $n \ne m$, which completes the proof of the theorem.•

²The Dominated Convergence Theorem states that, under appropriate conditions, $\lim_{t\to\infty} E(f_t) = E(\lim_{t\to\infty} f_t)$. See also the Mathematical Appendix A, or (Billingsley, 1986).

Appendix 4.F: Convergence Proof for a Markovian Switching Sources Algorithm
The proof of Theorem 4.8 requires several preliminary results. We will work with auxiliary quantities $q_t^k$, defined as follows, for k = 1, 2, ..., K:
$$q_t^k = \left( \sum_{m=1}^K q_{t-1}^m \cdot R_{mk} \right) \cdot c_{kn}. \qquad (4.F.1)$$
(Actually the $q_t^k$'s depend on n as well, but since we will consider n fixed in what follows, we suppress this dependence from the notation.) Comparing eq.(4.19) and eq.(4.F.1), we see that the $q_t^k$'s are simply the unscaled versions of the $\pi_t^{kn}$'s. This is actually proved in the following Lemma.

Lemma 4-4.F.1 For $\pi_t^{kn}$ as given by eq.(4.19) and $q_t^k$ as given by eq.(4.F.1), define $A_t = \sum_{m=1}^K q_t^m$. Suppose that for k = 1, ..., K the $\pi_1^{kn}$ and $q_1^k$ are chosen so that $\frac{q_1^1}{\pi_1^{1n}} = \frac{q_1^2}{\pi_1^{2n}} = \cdots = \frac{q_1^K}{\pi_1^{Kn}}$. Then, for m = 1, 2, ..., K and for t = 1, 2, ... we have $q_t^m = A_t \cdot \pi_t^{mn}$.

Proof: By induction. For t = 1
$$\frac{q_1^1}{\pi_1^{1n}} = \frac{q_1^2}{\pi_1^{2n}} = \cdots = \frac{q_1^K}{\pi_1^{Kn}} = \frac{q_1^1 + \cdots + q_1^K}{\pi_1^{1n} + \cdots + \pi_1^{Kn}} = \frac{A_1}{1} = A_1.$$
Now suppose that the proposition holds for t = $\tau$. Then $\pi_\tau^{mn} = \frac{1}{A_\tau} q_\tau^m$ for m = 1, 2, ..., K and
$$\pi_{\tau+1}^{kn} = \frac{\left( \sum_{m=1}^K q_\tau^m \cdot R_{mk} \right) \cdot c_{kn}}{\sum_{l=1}^K \left( \sum_{m=1}^K q_\tau^m \cdot R_{ml} \right) \cdot c_{ln}} = \frac{q_{\tau+1}^k}{\sum_{l=1}^K q_{\tau+1}^l} = \frac{q_{\tau+1}^k}{A_{\tau+1}},$$
and the proof is complete.•



Now, to prove convergence we work for a while with the auxiliary quantities $q_t^k$ rather than the $\pi_t^{kn}$'s. Define $q_t = [q_t^1, q_t^2, \ldots, q_t^K]$, let $D = \mathrm{diag}(c_{1n}, \ldots, c_{Kn})$ and
$$Q = RD;$$
then (4.F.1) can be rewritten as
$$q_t = q_{t-1} R D = q_{t-1} Q.$$
Q is a positive matrix. Since $c_{kn} < 1$ for k = 1, ..., K and $\sum_{l=1}^K R_{kl} = 1$, we have $\sum_{l=1}^K Q_{kl} < 1$. The following theorem holds for the powers of Q.

Theorem 4-4.F.2 $Q^t = \lambda^t \cdot w^T v + O(t^M |\mu|^t)$.

Proof: It can be found in p.7 of (Seneta, 1987).•

Here $\lambda$ is the (positive) maximum modulus eigenvalue of Q (guaranteed to exist by the Perron-Frobenius theorem, (Seneta, 1987, p.1)); w, v are the associated (strictly positive) left and right eigenvectors, i.e. $wQ = \lambda w$, $Qv = \lambda v$ and $wv = 1$; $\mu$ is the second maximum modulus eigenvalue, with multiplicity M. We have the following Lemma.

Lemma 4-4.F.3 $|\lambda| = \lambda < 1$.

Proof: That $|\lambda| = \lambda$ follows from the Perron-Frobenius theorem. Now define $a_m \equiv \sum_{k=1}^K Q_{mk} < 1$. Suppose $\lambda \ge 1$. We have
$$wQ = \lambda w \Rightarrow \sum_{m=1}^K w_m Q_{mk} = \lambda w_k \Rightarrow \sum_{k=1}^K \sum_{m=1}^K w_m Q_{mk} = \lambda \sum_{k=1}^K w_k \Rightarrow$$
$$\sum_{m=1}^K w_m \sum_{k=1}^K Q_{mk} = \lambda \sum_{k=1}^K w_k \Rightarrow \sum_{m=1}^K w_m a_m = \lambda \sum_{m=1}^K w_m \Rightarrow \sum_{m=1}^K w_m (a_m - \lambda) = 0.$$
This leads to a contradiction, because we assumed that $\lambda \ge 1$ and, for all m, $w_m > 0$ and $a_m < 1$; so we would have a sum of strictly negative numbers equal to zero. Hence we must have $\lambda < 1$ and the proof is complete.•

Now we will prove the following Lemma.

Lemma 4-4.F.4 Define $\gamma_{ml} = \frac{v_m}{v_l}$. For t = 1, 2, ... and m, l = 1, 2, ..., K we have
$$\frac{q_t^m}{q_t^l} \to \gamma_{ml} \quad \text{as } t \to \infty.$$
Proof: From Theorem 4-4.F.2 and Lemma 4-4.F.3 it follows that as $t \to \infty$
$$\frac{Q^t}{\lambda^t} \to w^T v = \begin{bmatrix} w_1 v_1 & \cdots & w_1 v_K \\ \vdots & & \vdots \\ w_K v_1 & \cdots & w_K v_K \end{bmatrix}.$$
Then, for i, j, m = 1, 2, ..., K
$$\frac{[Q^t]_{im}}{\lambda^t} \to w_i v_m, \quad \frac{[Q^t]_{jm}}{\lambda^t} \to w_j v_m \Rightarrow \frac{[Q^t]_{im}}{[Q^t]_{jm}} \to \frac{w_i}{w_j} = \beta_{ij}. \qquad (4.F.2)$$
Similarly, for i, j = 1, 2, ..., K (and fixed n)
$$\frac{[Q^t]_{ni}}{\lambda^t} \to w_n v_i, \quad \frac{[Q^t]_{nj}}{\lambda^t} \to w_n v_j \Rightarrow \frac{[Q^t]_{ni}}{[Q^t]_{nj}} \to \frac{v_i}{v_j} = \gamma_{ij}. \qquad (4.F.3)$$
Since $q_t = q_{t-1} Q$, $q_{t-1} = q_{t-2} Q$ etc., finally $q_t = q_0 Q^t$. Then we have for m, l = 1, 2, ..., K
$$q_t^m = q_0^1 [Q^t]_{1m} + q_0^2 [Q^t]_{2m} + \cdots + q_0^K [Q^t]_{Km}, \qquad (4.F.4)$$
$$q_t^l = q_0^1 [Q^t]_{1l} + q_0^2 [Q^t]_{2l} + \cdots + q_0^K [Q^t]_{Kl}. \qquad (4.F.5)$$
Taking i = 1, 2, ..., K and j = n in (4.F.2) we obtain
$$\frac{[Q^t]_{im}}{[Q^t]_{nm}} \to \beta_{in};$$
then applying to (4.F.4) we obtain
$$\frac{q_t^m}{[Q^t]_{nm}} \to \left( q_0^1 \beta_{1n} + q_0^2 \beta_{2n} + \cdots + q_0^K \beta_{Kn} \right). \qquad (4.F.6)$$
Similarly, in (4.F.2) take m = l, i = 1, 2, ..., K and j = n to obtain
$$\frac{[Q^t]_{il}}{[Q^t]_{nl}} \to \beta_{in}; \qquad (4.F.7)$$
combining (4.F.7) with (4.F.5) we obtain
$$\frac{[Q^t]_{nl}}{q_t^l} \to \frac{1}{q_0^1 \beta_{1n} + q_0^2 \beta_{2n} + \cdots + q_0^K \beta_{Kn}}. \qquad (4.F.8)$$
Multiply (4.F.6) and (4.F.8) to obtain
$$\frac{q_t^m}{[Q^t]_{nm}} \cdot \frac{[Q^t]_{nl}}{q_t^l} \to 1. \qquad (4.F.9)$$
Also, by (4.F.3) with i = l and j = m, we obtain
$$\frac{[Q^t]_{nl}}{[Q^t]_{nm}} \to \gamma_{lm}. \qquad (4.F.10)$$
Combining (4.F.9) and (4.F.10) we obtain finally
$$\frac{q_t^m}{q_t^l} \cdot \gamma_{lm} \to 1 \Rightarrow \frac{q_t^m}{q_t^l} \to \frac{1}{\gamma_{lm}} = \frac{v_m}{v_l} = \gamma_{ml},$$
and the proof is complete.•
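Lemma 4-4.F.4 can be checked numerically: iterating $q_t = q_{t-1}Q$ drives the component ratios to a limit that does not depend on $q_0$. The 2×2 matrices below are invented for the illustration.

```python
def step(q, Q):
    """One iteration q_t = q_{t-1} Q (row vector times matrix)."""
    K = len(q)
    return [sum(q[m] * Q[m][k] for m in range(K)) for k in range(K)]

# Q = R D with a positive transition matrix R and D = diag(c_1n, c_2n).
R = [[0.9, 0.1], [0.2, 0.8]]
c = [0.95, 0.40]
Q = [[R[m][k] * c[k] for k in range(2)] for m in range(2)]

# Two different starting vectors q_0.
qa, qb = [1.0, 0.0], [0.2, 0.8]
for _ in range(200):
    qa, qb = step(qa, Q), step(qb, Q)

# The component ratios q_t^1 / q_t^2 agree, as the Lemma asserts.
ratio_a, ratio_b = qa[0] / qa[1], qb[0] / qb[1]
```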


Now we prove the convergence theorem.

Proof of Theorem 4.8: From Lemma 4-4.F.4 we know that for m, l = 1, 2, ..., K we have $\frac{q_t^m}{q_t^l} \to \gamma_{ml}$. From Lemma 4-4.F.1 we know that for m, l = 1, 2, ..., K and t = 1, 2, ... we have $q_t^m = A_t \cdot \pi_t^{mn}$ and $q_t^l = A_t \cdot \pi_t^{ln}$. Hence
$$\lim_{t\to\infty} \frac{\pi_t^{mn}}{\pi_t^{ln}} = \lim_{t\to\infty} \frac{A_t \cdot \pi_t^{mn}}{A_t \cdot \pi_t^{ln}} = \lim_{t\to\infty} \frac{q_t^m}{q_t^l} = \gamma_{ml}.$$
In other words
$$\frac{\sum_{m=1}^K \pi_t^{mn}}{\pi_t^{ln}} \to \sum_{m=1}^K \gamma_{ml} \Rightarrow \frac{1}{\pi_t^{ln}} \to \sum_{m=1}^K \gamma_{ml} \Rightarrow \pi_t^{ln} \to \frac{1}{\sum_{m=1}^K \gamma_{ml}} \equiv \pi^{ln},$$
so $\lim_{t\to\infty} \pi_t^{ln}$ exists for l = 1, 2, ..., K. It is easy to show that
$$\frac{\pi_t^{ln}}{\pi_t^{nn}} = \frac{\left( \sum_{m=1}^K \pi_{t-1}^{mn} \cdot R_{ml} \right) \cdot c_{ln}}{\left( \sum_{m=1}^K \pi_{t-1}^{mn} \cdot R_{mn} \right) \cdot c_{nn}}.$$
Then, taking the limit as $t \to \infty$ one obtains
$$\frac{\pi^{ln}}{\pi^{nn}} = \frac{\left( \sum_{m=1}^K \pi^{mn} \cdot R_{ml} \right) \cdot c_{ln}}{\left( \sum_{m=1}^K \pi^{mn} \cdot R_{mn} \right) \cdot c_{nn}}.$$
Now suppose that the conclusion of the theorem is false. In that case the maximum of the $\pi^{kn}$ occurs for some $m \ne n$; in other words, for some $m \ne n$ and for all $k \ne m$ we have $\pi^{mn} \ge \pi^{kn}$. In particular
$$1 \le \frac{\pi^{mn}}{\pi^{nn}} = \frac{\left( \sum_{k=1}^K \pi^{kn} \cdot R_{km} \right) \cdot c_{mn}}{\left( \sum_{k=1}^K \pi^{kn} \cdot R_{kn} \right) \cdot c_{nn}} \le \frac{\pi^{mn} \cdot \sum_{k=1}^K R_{km}}{\pi^{nn} \cdot R_{nn}} \cdot \frac{c_{mn}}{c_{nn}} \le \frac{\pi^{mn}}{\pi^{nn}} \cdot \frac{1 + K\varepsilon}{1 - \varepsilon} \cdot \frac{c_{mn}}{c_{nn}} \Rightarrow$$
$$1 \le \frac{1 + K\varepsilon}{1 - \varepsilon} \cdot \frac{c_{mn}}{c_{nn}} \Rightarrow \frac{c_{nn}}{c_{mn}} \le \frac{1 + K\varepsilon}{1 - \varepsilon},$$
which contradicts Assumption H2. Hence the proof is complete.•


5 SYSTEM IDENTIFICATION BY THE PREDICTIVE MODULAR APPROACH

In this chapter we consider the connection between classification and identifi-


cation of time series; in particular we present a point of view by which iden-
tification can be considered as classification involving an infinite source set.
We exploit this approach to present some generalizations of the PREMONN
approach which can be applied to system identification problems. In addition,
this chapter serves as an introduction to the problem of time series classifica-
tion when the source set is unknown; this type of problem will be discussed in
greater detail in Part III.

5.1 SYSTEM IDENTIFICATION


Consider a dynamical system described by an equation of the form
$$y_t = F(y_{t-1}, y_{t-2}, \ldots, y_{t-M}, u_t; \xi_0) + e_t, \qquad (5.1)$$
where $y_t$ is the system state (fully observable), $u_t$ is the system input, $e_t$ is a noise term and $\xi_0$ is the parameter vector, which takes values in the parameter space $\Xi$ (in many cases of interest, $\Xi$ is a Euclidean space $R^d$; however a more general parameter space can be involved). It has been shown (Leontaritis and Billings, 1985a; Leontaritis and Billings, 1985b) that the form of eq.(5.1) is general enough to describe a large number of interesting dynamical systems; in particular, the more traditional description, which involves two equations (one for state evolution and another for system observation) can in many cases be reduced to eq.(5.1).
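In simulation terms, eq.(5.1) is a nonlinear autoregressive model with exogenous input. A minimal sketch follows; the particular F, lag order, inputs and noise level are illustrative assumptions, not taken from the text.

```python
import math
import random

def simulate(F, xi, y_init, inputs, noise_std=0.05):
    """Simulate y_t = F(y_{t-1}, ..., y_{t-M}, u_t; xi) + e_t, eq. (5.1).

    F       -- model function of (past outputs, input u_t, parameters xi)
    xi      -- parameter vector xi_0
    y_init  -- list of M initial outputs
    inputs  -- sequence of inputs u_t
    """
    M = len(y_init)
    y = list(y_init)
    for u in inputs:
        past = y[-M:]                     # y_{t-M}, ..., y_{t-1}
        e = random.gauss(0.0, noise_std)  # noise term e_t
        y.append(F(past, u, xi) + e)
    return y

# Illustrative second-order nonlinear system (hypothetical choice of F).
def F(past, u, xi):
    a, b = xi
    return a * math.tanh(past[-1]) + b * past[-2] + u

random.seed(1)
traj = simulate(F, xi=(0.8, 0.1), y_init=[0.0, 0.0],
                inputs=[math.sin(0.1 * t) for t in range(100)])
```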

V. Petridis et al., Predictive Modular Neural Networks


© Kluwer Academic Publishers 1998

A large class of system identification problems can be described in connection with eq.(5.1). Such problems can be broadly divided into two categories.

1. In parameter estimation problems, the functional form $F(\cdot, \cdot; \xi)$ is assumed known and it is required to find the true parameter value $\xi_0$ or a "good" (in some appropriate sense) approximation to it.

2. In problems of black box identification, the form $F(\cdot, \cdot; \xi)$ is assumed unknown; a class of approximating functions $f(\cdot, \cdot; \theta)$ is usually given (where $\theta \in \Theta$, some appropriate set) and it is required to find the value $\theta^*$ which furnishes the "best" (in some appropriate sense) approximation to the behavior of eq.(5.1).

Usually the "best" $\xi$ or $\theta$ value is taken to be the value which optimizes some appropriate function $J(\xi)$ or $J(\theta)$.¹ Hence, in case the identification problem is solved offline, it reduces to a standard optimization problem, which can be solved by steepest descent methods (especially Levenberg-Marquardt type algorithms), genetic algorithms and simulated annealing, to mention only a few possibilities. For the online case, some of the preferred methods are recursive least squares, extended Kalman filtering, instrumental variable methods and so on. An excellent overview of the subject can be found in (Ljung, 1987).
While the formulation of the identification problem is straightforward, its solution is not always easy, especially in case eq.(5.1) is nonlinear. Depending on the nature of the nonlinearity, the offline case may require the solution of an arbitrarily hard optimization problem, while generally efficient online methods are available only for linear modelling functions. In short, at this point in time the general system identification problem may be considered practically unsolved.
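As a toy instance of the offline formulation, the sketch below estimates a scalar parameter by direct grid minimization of the squared prediction error J; the model, data and grid are hypothetical, and the brute-force search merely stands in for the descent or evolutionary methods mentioned above.

```python
def J(xi, data):
    """Total squared one-step prediction error for the illustrative
    scalar model y_t = xi * y_{t-1}."""
    return sum((y - xi * y_prev) ** 2
               for y_prev, y in zip(data, data[1:]))

# Data from the "true" system with xi_0 = 0.7 (noise-free for clarity).
data = [1.0]
for _ in range(30):
    data.append(0.7 * data[-1])

# Offline identification: exhaustive search over a discretized Xi.
grid = [i / 100.0 for i in range(101)]
xi_star = min(grid, key=lambda xi: J(xi, data))  # argmin of J
```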

5.2 IDENTIFICATION AND CLASSIFICATION

5.2.1 Identification by Classification


We will now argue that the PREMONN approach to classification can in cer-
tain cases provide a useful identification method. The main idea we want to
exploit is this: while identification usually involves searching an infinite para-
meter set e, in certain cases it is more expedient to search e in a piecewise
fashion, using a divide-and-conquer approach. In such cases, PREMONN-like
algorithms which successively search finite subsets of e can prove useful. Let
us now explain this point in detail.
First of all, the identification problem can be easily reformulated in classification terms. Let us rewrite eq.(5.1) as
$$y_t = F_{\xi_0}(y_{t-1}, y_{t-2}, \ldots, y_{t-M}, u_t) + e_t. \qquad (5.2)$$

¹For instance, we may have $J(\xi) = \sum_{t=1}^T |y_t - \hat{y}_t|^2$, i.e. the total square difference ("error") between the output of the true system and the estimated model.

This is merely a change in notation, and has been employed so that eq.(5.2) resembles eq.(2.1). Indeed the two equations are identical except for the presence in eq.(5.2) of the input term $u_t$. We have already remarked that such an input term does not alter in any way either the applicability of the PREMONN algorithms or their phenomenological justification. Hence the similarity between the classification and identification problems becomes apparent. In classification terms, the parameter estimation and black box identification problems can be described as follows.

1. In the black box identification problem, $F_\xi$ and S are assumed unknown, but a class of models $f_\theta(\cdot)$ (parameterized by $\theta \in \Theta$) is given; the problem consists in using the observations $y_1, y_2, \ldots, y_t$ and the inputs $u_1, u_2, \ldots$ to find the "best" (in some appropriate sense) value $\theta^*$ in the "source set" $\Theta$.

2. In the parameter estimation problem, the form of $F_\xi(\cdot)$ and the source set S are assumed known. In this case we can take $\Theta = \Xi$, $S = \Xi$ and $F_\xi(\cdot) = f_\xi(\cdot)$. The problem consists in using the observations $y_1, y_2, \ldots, y_t$ and the inputs $u_1, u_2, \ldots$ to find the "best" (in some appropriate sense) value $\xi^*$ in the "source set" $\Xi$.

Hence the similarity between identification and classification, at least from
a formal point of view, becomes evident. Furthermore, as we have explained
in Chapters 3 and 4, for the fixed source case, PREMONN classification is
essentially equivalent to the optimization of an appropriate predictive error
function. For example, it has been shown in Chapter 4 that the multiplica-
tive PREMONN algorithm with quadratic error function selects the parameter
θ ∈ Θ which maximizes the quantity e^(−Σ_s |ε_s^θ|²/2σ²), i.e. minimizes the
cumulative squared prediction error. In light of this observation, PREMONN
classification and identification appear to be different formulations
of the same problem. However, there is an important caveat. As already re-
marked, PREMONN classification rests on the assumption that Θ is finite; this
will generally not be true for identification problems.
Given the above limitation of the PREMONN approach, why not use a
more traditional system identification method? In the following paragraphs we
examine three cases in which a PREMONN approach may be expedient.

5.2.2 Local Models


Suppose that we cannot find a class of f_θ(·) models which approximate the
global input/output behavior of F_ξ(·) sufficiently well. In this case, it may still
be possible to develop local models, each of which describes well the behavior
of the original system when it operates in a particular range of y_t and u_t values.
Such models can be trained offline and their online combination can be effected
by using any of the credit assignment schemes presented in Chapter 3. Online
training and combination of the f_θ(·) models presents a harder problem; some
possible solutions will be discussed in Part III.

5.2.3 Switching and Slowly Varying Systems


Even when the global input/output behavior of F_ξ(·) can be well approximated
by the f_θ(·) models for a fixed value of ξ, ξ may be switching (i.e. not fixed)
or varying slowly. From an operational point of view this problem is identical
to the one discussed in the previous paragraphs.

5.2.4 Piecewise Search Strategies


Finally, a PREMONN approach to identification makes sense when the global
input/output behavior of F_ξ(·) can be well approximated by models of the form
f_θ(·) and ξ is fixed, but it is not easy to search the set Θ. In such a case, a
divide-and-conquer search strategy may be useful: rather than searching the
entire (infinite) space Θ at once, we may develop strategies for successively
searching subsets Θ^(1), Θ^(2), ... in such a manner that, as i goes to infinity,
every element of the set Θ^(i) furnishes a model which approximates the true
system sufficiently well. In this manner the divide-and-conquer approach can
provide a feasible solution to the system identification problem. We will present
algorithms along these lines in Section 5.4.

5.2.5 The Divide-and-Conquer Approach


In summary, we propose an approach to system identification which is charac-
terized by the following features: use of multiple models and model combination
by predictive credit assignment. We should add that our particular approach
is one of many examples of multiple models methods; such methods have been
used for many years in various disciplines. A fairly detailed description of our
point of view on the connections between such approaches (together with a list
of representative references) will be presented in Part IV. For the time being,
let us simply remark that we consider all such methods as implementations of
the general divide-and-conquer approach to problem solving.
Let us now proceed to present two particular divide-and-conquer methods
for system identification. Both make use of the "predictive modular" approach
and both are applied to parameter estimation problems. The first method is
appropriate for a small search set Θ; the second is better suited for a finite but
large Θ.

5.3 PARAMETER ESTIMATION: SMALL PARAMETER SET


Since we assume the functional form of F_ξ(·) to be known, we can take F_ξ(·) =
f_θ(·), ξ = θ, Ξ = Θ. Hence the system we want to identify can be written as

y_t = f_{θ₀}(y_{t−1}, y_{t−2}, ..., y_{t−M}).     (5.3)

Now suppose that Θ is a bounded subset of R^d and d is small. Then identi-
fication can be effected in a rather straightforward manner by quantization in
conjunction with classification.

Let us illustrate the method for the case d = 1. In this case, suppose that it is
known that θ takes values in an interval Θ = [a, b] ⊂ R¹. [a, b] can be quantized
into a K-member set Θ̄ = {a, a+δx, a+2δx, ..., b}, where δx = (b−a)/(K−1). In other
words, we replace Θ with Θ̄ = {θ₁, θ₂, ..., θ_K}, where

θ_k = a + ((b−a)/(K−1)) · (k−1),     k = 1, 2, ..., K.

We can now consider K models of the form

y_t^k = f_{θ_k}(y_{t−1}, y_{t−2}, ..., y_{t−M})

and run any of the usual PREMONN credit assignment algorithms; the only
modification is that now p_t^k refers to the credit of the k-th model (with para-
meter θ_k) rather than the k-th predictor. It is clear that this is only a change
of nomenclature.
If the quantization is sufficiently fine, then it is reasonable to expect that
for some k we will have |θ_k − θ₀| sufficiently small, so that y_t^k approximates
y_t much better than the remaining y_t^m, m ≠ k.² Hence, by the analysis of
Chapter 4 it is reasonable to expect that, given sufficient time, p_t^k will approach
one and hence θ_k, the "best" approximation to θ₀, will be identified.
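For the one-dimensional case just described, the whole scheme fits in a few lines. The following sketch is an illustration under our own assumptions (the toy first-order system y_t = θ₀·y_{t−1} + u_t + noise, the grid bounds, and the noise levels are not from the text); it quantizes [a, b] into K candidate values and scores them with the multiplicative credit update:

```python
import math
import random

def quantize(a, b, K):
    # K-point grid on [a, b]: theta_k = a + (b - a)/(K - 1) * k, k = 0, ..., K-1
    return [a + (b - a) * k / (K - 1) for k in range(K)]

def grid_estimate(y, u, thetas, sigma=0.2):
    # Multiplicative credit update: p_t^k proportional to p_{t-1}^k * exp(-|e_t^k|^2 / 2 sigma^2)
    p = [1.0 / len(thetas)] * len(thetas)        # uniform initial credits
    for t in range(1, len(y)):
        for k, th in enumerate(thetas):
            e = y[t] - (th * y[t - 1] + u[t])    # prediction error of model k
            p[k] *= math.exp(-e * e / (2 * sigma * sigma))
        s = sum(p)
        p = [pk / s for pk in p]                 # renormalize so credits sum to one
    k_best = max(range(len(thetas)), key=lambda k: p[k])
    return thetas[k_best], p

# Hypothetical source with true parameter theta_0 = 0.7 and a known sinusoidal input
random.seed(1)
u = [math.sin(t) for t in range(300)]
y = [0.0]
for t in range(1, 300):
    y.append(0.7 * y[t - 1] + u[t] + random.gauss(0, 0.05))

theta_hat, credits = grid_estimate(y, u, quantize(0.0, 1.0, 11))
print(theta_hat)
```

After a few hundred observations the credit of the grid point nearest θ₀ dominates, so the printed estimate lands on (or next to) 0.7.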
This method really amounts to exhaustively searching the parameter set
by simulating several candidate models and scoring them according to their
prediction error. In case the true system has a fixed parameter θ₀, this is probably
the simplest parameter estimation method available. In case the true system
parameter is a time varying quantity θ₀(t), the recursive nature of the
PREMONN credit update is rather useful, because it allows for online scoring.
Conceptually it is quite obvious that the above method can be applied for
any value of d. However, if we assume Q levels of quantization per dimension
of the parameter space, the total number of models becomes K = Q^d and it is clear
that the method is practicable only as long as d and Q are relatively small. In
particular, even for a very coarse quantization (i.e. small Q), when d is bigger
than three or four, the curse of dimensionality results in an unmanageably
large number K.
In the case of large d, quantization may still be used in conjunction with a
method of moving grids. This method starts with a coarse quantization; as soon
as a promising value of θ is identified, a finer quantization is effected around
this value and a new value of θ is obtained. In every iteration the θ value
with highest credit p_t^k is selected; since p_t^k is related to a cumulative error function
(see Chapters 3 and 4), it is reasonable to expect that successive iterations of
the above procedure yield θ values with progressively decreasing cumulative
square error. Essentially this leads to a steepest-descent-like approximation
to the true parameter value θ₀. Iterative application may yield a sequence

²In other words we assume that proximity in the parameter space results in proximity in
the output space. While this assumption will generally not be true (especially for nonlinear
systems), it holds in a large enough number of cases to make it useful.

θ^(1), θ^(2), ... which converges to θ₀. However, because of the greedy nature of
the algorithm, convergence to a locally optimal estimate (which may in fact be
quite distant from θ₀) is also possible. To overcome this problem, as well as the
curse of dimensionality, a more sophisticated search strategy is needed; this is
presented in the next section.
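The moving-grid idea can be sketched in a few lines. The code below is an illustrative sketch under our own assumptions (a hypothetical first-order system y_t = 0.7·y_{t−1} + u_t + noise; the predictive score is the cumulative squared error of footnote 1), not the book's implementation:

```python
import math
import random

def cumulative_sq_error(theta, y, u):
    # J(theta) = sum_t |y_t - (theta * y_{t-1} + u_t)|^2, the predictive score of a candidate
    return sum((y[t] - (theta * y[t - 1] + u[t])) ** 2 for t in range(1, len(y)))

def moving_grid(y, u, lo, hi, K=5, rounds=4):
    # Coarse-to-fine search: score a K-point grid, then re-grid around the winner.
    best = lo
    for _ in range(rounds):
        step = (hi - lo) / (K - 1)
        grid = [lo + step * k for k in range(K)]
        best = min(grid, key=lambda th: cumulative_sq_error(th, y, u))
        lo, hi = best - step, best + step    # finer grid centered on the winner
    return best

# Hypothetical data with true parameter theta_0 = 0.7
random.seed(2)
u = [math.sin(t) for t in range(300)]
y = [0.0]
for t in range(1, 300):
    y.append(0.7 * y[t - 1] + u[t] + random.gauss(0, 0.05))

theta_hat = moving_grid(y, u, 0.0, 1.0)
print(round(theta_hat, 3))
```

Each round shrinks the search interval around the current winner, so the resolution improves geometrically; as the text notes, this greedy refinement can also lock onto a local minimum of the error surface.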

5.4 PARAMETER ESTIMATION: LARGE PARAMETER SET


We now propose a parameter estimation algorithm designed to overcome the curse of
dimensionality, as well as to escape local minima. Again, since the functional
form of F_ξ(·) is known, we can take F_ξ(·) = f_θ(·), ξ = θ, Ξ = Θ.
The proposed algorithm embeds a predictive modular credit assignment
scheme in a genetic algorithm, which runs for several epochs, numbered by
the index i = 0, 1, 2, ... . In every epoch the genetic algorithm introduces
a generation of models (i.e. a subset of Θ). Call the generation produced in
the i-th epoch Θ^(i); all models in Θ^(i) are "evaluated", i.e. assigned credit
according to their predictive performance. Model evaluation is then used by
the genetic algorithm to produce a new generation of models (i.e. Θ^(i+1)),
for which the evaluation process is repeated. In this manner Θ is searched in
a divide-and-conquer mode. A well-documented (Goldberg, 1989; Mitchell,
1996; Rudolph, 1994) advantage of the genetic algorithm is that it performs a
search for the globally optimal model. J(θ) (the prediction error as a function
of the parameter θ, see footnote 1) is taken as the fitness function. This means
that the genetic algorithm seeks the global optimum of J(θ). Indeed J(θ) may
have local minima, either inherent or due to noisy measurements (Ljung, 1987).
By judicious choice of the models used to represent the true system, the
PREMONN evaluation scheme and the genetic search algorithm, it is possible
to search parameter spaces Θ ⊂ R^d of high dimension.

5.4.1 Genetic Algorithms for Parameter Estimation:


Genetic algorithms are inspired by the biological paradigm of natural selection.
The starting point is a population which evolves in successive generations; the
population consists of individuals of distinct genotypes. The genotype of each
individual is propagated to the next generation in accordance with the individ-
ual's fitness. Various operators are provided for the combination (mating) of
genotypes within one generation, such as crossover, mutation etc., obviously
inspired by biological analogy; their role is to provide diversity and introduce
new genotypes, which are possibly fitter than the existing ones.
Translating the identification problem into the above terms, an "individual" is
a model determined by a parameter vector θ_k and the "population" is the search
subset Θ^(i). Each parameter is encoded by a string of bits; concatenation of the
d strings (one for each parameter) results in a longer string (the "genotype"
of θ_k) which encodes the parameter vector θ_k. "Fitness" of each individual
is related to the accuracy with which it predicts system behavior; in fact any
decreasing function of the total square error can be used as a fitness function.

In "genetic" terms, the final goal (discovering the true parameter vector θ₀) is
translated into finding the "fittest individual".
In practice it is certain that the j-th parameter will lie within an interval
[A_j, B_j], j = 1, 2, ..., d. Each interval is discretized using 2^n steps. Since θ is a
vector of length d, there are d parameters, each taking 2^n values; hence there
are (2^n)^d = 2^(n·d) parameter combinations, resulting in 2^(n·d) different models,
each coded by an n·d-bit string; this is the "genotype" of the particular model.
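The encoding just described is a fixed-point quantization of each interval [A_j, B_j]. A minimal sketch (the bounds below are our own illustrative assumption of the 30%–200% intervals used later in this chapter, not values quoted from the text):

```python
def encode(theta, A, B, n):
    # Map theta in [A, B] to one of 2^n quantization levels, written as an n-bit string.
    level = round((theta - A) / (B - A) * (2 ** n - 1))
    return format(level, "0{}b".format(n))

def decode(bits, A, B):
    # Inverse map: bit string back to a parameter value in [A, B].
    return A + int(bits, 2) / (2 ** len(bits) - 1) * (B - A)

def genotype(theta_vec, bounds, n=13):
    # Concatenate the d per-parameter strings into a single n*d-bit genotype.
    return "".join(encode(th, A, B, n) for th, (A, B) in zip(theta_vec, bounds))

# Hypothetical bounds for four parameters (roughly 30% to 200% of nominal values)
bounds = [(36.0, 240.2), (0.63, 4.2), (0.1386, 0.924), (0.1335, 0.89)]
g = genotype([120.1, 2.1, 0.462, 0.445], bounds, n=13)
print(len(g))  # 4 parameters x 13 bits = 52
```

Decoding the genotype recovers each parameter up to the quantization step (B_j − A_j)/(2^n − 1).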

5.4.2 Genetic Operators and Diversity Criterion:


A genetic algorithm is used to produce successive generations of models. At
initialization, K out of the 2^(n·d) possible models are selected randomly and used
by the predictive credit assignment algorithm. The resulting model credits are
fed to the genetic algorithm, and used in combination with the model genotypes
to produce a new generation of models. The following components are used to
implement the genetic algorithm.

1. Selection Mechanism. A roulette wheel is used to randomly choose models
according to selection probabilities, which are exactly the predictive credits.

2. Genetic Operators. These are mutation and crossover (Goldberg, 1989;
Mitchell, 1996). Mutation operates on one model of the old generation,
choosing randomly (with uniform distribution) m bits out of the n·d-long
genotype string (where m is a fixed number, m < n·d) and reversing these
bits (1 becomes 0 and 0 becomes 1). Crossover operates on two models of
the old generation, splitting the genotype string of each model into m parts
(break-points are the same for both models and are chosen randomly from
a uniform distribution); the offspring model is created by choosing genotype
fragments from each parent alternately.

3. Elitism. This means that the best model of each generation is always
included in the next one. In fact, elitism is supplemented by hill-climbing
(Goldberg, 1989; Mitchell, 1996), meaning that a local search is performed
around the best model of each generation, slightly perturbing its parameters
and looking for a possible improvement in total square error. Elitism and
hill-climbing have been introduced to speed up the genetic search.

4. Selection Probabilities. The credit p_t̄^k is the selection probability
of the k-th model, where t̄ is the time at which the entropy criterion (see below)
is satisfied and the genetic epoch terminated.

5. Entropy. An entropy criterion is also used, to ensure diversity in the models
of each generation. While hill-climbing and elitism may speed up the algo-
rithm, they also introduce the possibility of the algorithm getting trapped
in a local, rather than global, minimum. To avoid this, it is required that
sufficient diversity is maintained in every model generation, so that a good
number of alternative solutions keep being explored; this is achieved by use
of the entropy criterion, which is now discussed.
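The three stochastic components above (roulette selection, m-bit mutation, m-point crossover) can be sketched directly on genotype strings. This is an illustrative sketch, not the book's implementation; the demo string is arbitrary:

```python
import random

def mutate(genotype, m):
    # Flip m randomly chosen (distinct) bits of the genotype string.
    bits = list(genotype)
    for i in random.sample(range(len(bits)), m):
        bits[i] = "1" if bits[i] == "0" else "0"
    return "".join(bits)

def crossover(parent_a, parent_b, m):
    # Split both parents at the same m random break-points and take
    # fragments from each parent alternately.
    points = sorted(random.sample(range(1, len(parent_a)), m))
    cuts = [0] + points + [len(parent_a)]
    child = []
    for i in range(len(cuts) - 1):
        src = parent_a if i % 2 == 0 else parent_b
        child.append(src[cuts[i]:cuts[i + 1]])
    return "".join(child)

def roulette(genotypes, credits):
    # Select a model with probability equal to its predictive credit p^k.
    return random.choices(genotypes, weights=credits, k=1)[0]

random.seed(0)
print(mutate("0" * 12, 2))   # a 12-bit string with exactly two bits flipped
```

Both operators preserve genotype length, so every offspring decodes to a valid parameter vector inside the bounds.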

5.4.3 The Entropy Criterion


As has been proved in Chapter 4, if the predictive credit assignment algorithm
operates sufficiently long (t → ∞), then each of the p_t^k's converges to either zero
or one. However, if the p_t^k's converge to either zero or one, then the opera-
tion of the genetic operators ensures that the next generation of models will
only contain the fittest model of the previous generation. Thus, the genetic
algorithm may get stuck with a poor model which provides only a local mini-
mum of estimation error. The problem is exacerbated by the use of elitism and
hill-climbing, which tend to enforce the presence of the fittest models. On the
other hand, it is clear that if the p_t^k's do not concentrate sufficiently on the most
promising models, then the search of the parameter space will be essentially
random.
To avoid either of the above situations, the PREMONN credit assignment
algorithm must stop operating after the p_t^k's start concentrating on promising
models, but before they get too close to either one or zero. To achieve this
goal, in every epoch the PREMONN credit assignment algorithm operates for
a variable number of steps, determined according to an entropy criterion. The
entropy H_t of p_t^1, p_t^2, ..., p_t^K is defined by

H_t = − Σ_{k=1}^K p_t^k · log(p_t^k).

The maximum value of H_t is log(K), and is achieved when p_t^1 = p_t^2 = ... = p_t^K = 1/K, i.e.
when the credit is spread equally over all models. The minimum value of
H_t is zero, and is achieved for p_t^k = 1, p_t^m = 0, m ≠ k, i.e. when all probability
is concentrated on one model. Since the p_t^k's converge to either zero or one, it
follows that H_t eventually converges to zero. Let us define a dynamic threshold
H̄_t = log(K) · t/T; this is simply an increasing linear function of t. It is clear
that the inequality H_t < H̄_t will be satisfied for t equal to some t̄ ≤ T. This in-
equality is our entropy criterion: when the entropy of p_t^1, p_t^2, ..., p_t^K falls
below H̄_t, the PREMONN credit assignment algorithm stops operating.
Since H̄_t increases linearly and H_t converges to zero, termination takes
place at some intermediate value of entropy, between 0 and log(K), ensuring
that the p_t^k's are neither too concentrated nor too diffuse. Hence, sufficient but
not excessive diversity is maintained in every generation of models. In addition,
since the entropy criterion is usually satisfied before T, the maximum number
of time steps, a considerable speedup of the algorithm is achieved.
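The entropy criterion itself is two short formulas; a minimal sketch (the sample credit vectors are our own illustrative inputs):

```python
import math

def entropy(credits):
    # H_t = -sum_k p_t^k * log(p_t^k); ranges from 0 to log(K).
    return -sum(p * math.log(p) for p in credits if p > 0)

def epoch_done(credits, t, T):
    # Dynamic threshold H_bar_t = log(K) * t / T; stop the epoch when H_t drops below it.
    return entropy(credits) < math.log(len(credits)) * t / T

print(round(entropy([0.25] * 4), 3))  # uniform credits over K=4 models: log(4) = 1.386
```

Early in an epoch the threshold is low and uniform credits keep the epoch running; as t grows, even moderately concentrated credits satisfy H_t < H̄_t and the epoch terminates.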

5.4.4 Summary of the Parameter Estimation Algorithm


We now have all the components that comprise the parameter estimation algo-
rithm, which is summarized as follows.

Credit Assignment / Genetic Parameter Estimation Algorithm

Initialization
    Discretize the parameter space.
    Create randomly Θ^(0) = {θ_1^(0), θ_2^(0), ..., θ_K^(0)}, the first generation of K models.

Main
For i = 1, 2, ..., Imax:
    Set the search subset to be Θ^(i−1).
    Initialize the p_0^k's in the interval [0, 1].
    Set EntropyCheck = False.
    Set t = 1.
    While t < T and EntropyCheck = False:
        Observe y_t.
        For k = 1, 2, ..., K:
            Compute ε_t^k = y_t − y_t^k.
            Compute g(ε_t^k).
            Compute p_t^k (i.e. the credit of the model with parameter θ_k^(i−1)),
                using one of the credit update schemes of Tables 3.1 or 3.2.
        Next k.
        Compute H_t and H̄_t.
        If H_t < H̄_t then set EntropyCheck = True.
        Set t ← t + 1.
    End While.
    Set θ^(i−1) ← θ_k^(i−1), where k = argmax_{k=1,2,...,K} p_t^k.
    Apply hill-climbing to θ^(i−1); insert the resulting model into Θ^(i).
    Set r ← 1.
    While r ≤ K and i < Imax:
        Choose randomly the mutation or crossover operator (with probabilities
            q and 1−q respectively).
        Choose the appropriate number of parents (one or two) from Θ^(i−1).
        Produce (by the chosen operator) a new offspring model.
        If the offspring model is different from the models already in Θ^(i):
            add it to Θ^(i); set r ← r + 1;
        else
            discard it.
        End If.
    End While.
Next i.
Set θ* ← θ_k^(i−1), where k = argmax_{k=1,2,...,K} p_t^k.

5.5 EXPERIMENTS
We now present two sets of parameter estimation experiments. Both sets utilize
computer simulations of dynamical systems; the first system involves a small
parameter set and the second a large parameter set (hence requires the use of
the Credit Assignment / Genetic Parameter Estimation Algorithm).

5.5.1 Small Parameter Set


The first set of experiments involves an AC induction motor, the operation of
which (in discrete time) is described by the following nonlinear state equations.
x_t = δt · A^(−1) · (B · x_{t−1} + u_t),     (5.4)

ω_t = ω_{t−1} + δt · [ (3P / 2J) · (i_{t−1}^{qs} · i_{t−1}^{dr} − i_{t−1}^{ds} · i_{t−1}^{qr}) − T_t^L ],     (5.5)

where the state and input vectors are

x_t = [ i_t^{qs}  i_t^{ds}  i_t^{qr}  i_t^{dr} ]′,     (5.6)

u_t = [ V_t^{qs}  V_t^{ds}  0  0 ]′,     (5.7)

A is the 4×4 inductance matrix, with entries built from the inductances L_s, L_r
and L_o (5.8), and B is the 4×4 matrix containing the resistances R_s, R_r and the
speed-dependent coupling terms L_o·ω_{t−1}, L_s·ω_{t−1} (5.9).

Here i_t^{qs}, i_t^{ds} are stator currents, i_t^{qr}, i_t^{dr} are rotor currents, ω_t is angular velocity,
V_t^{qs}, V_t^{ds} are stator voltages and T_t^L is torque; δt is the integration step; R_s,
R_r are stator and rotor resistances; L_s, L_r, L_o are stator, rotor and mutual
inductances; J is the moment of inertia and P is the number of pole pairs. x_t and ω_t
are the states; u_t and T_t^L are the inputs.
All the parameters are assumed known, except for R_r. However, R_r is
necessary for the determination of the motor time constant T_r, which in turn
is required for efficient and economic angular velocity control. In addition, R_r
depends on operating conditions; in other words, it may be time varying. Hence
we are faced with a typical online parameter estimation problem, for which
various methods of solution have been proposed in the past; see (Krishnan and
Doran, 1987) for a specific method and (Krishnan and Doran, 1991) for a good
review.
Using the predictive credit approach to parameter estimation, we consider
that the vector sequence [i_t^{qs}  i_t^{ds}]′ constitutes the time series y₁, y₂, ..., which
is assumed to have been generated by a system of the form (5.4), (5.5). What
is unknown is the value of R_r, which plays the role of the source parameter θ₀.
To obtain a sample of the y₁, y₂, ... time series we integrate eqs.(5.4),
(5.5) with the integration step set to δt = 0.5 msec. However, for the parameter
estimation experiments we actually use subsampled time series, with several
values of the subsampling time δs; namely, δs is taken to be 0.5 (full sampling),
1, 2 and 5 msec.³ Finally, the error order N is taken equal to 10.
Each simulation is run for 10000 time steps, each step corresponding to 0.5
milliseconds of real time; hence the operation of the motor is simulated for a
5 second time interval. The input is a three-phase AC voltage of 220 Volts RMS
value and torque T^L = 1.5 N·m. The actual motor has the following parameters:
R_s = 11.58 Ohm, L_s = 0.071 Henry, L_r = 0.072 Henry, L_o = 0.069 Henry, J = 0.089
kg·m², B = 0 Nt·sec/m, P = 2. In the experiment the following strategy
is used to simulate the effect of R_r variation: the value of R_r is changed every
1000 steps (i.e. every 0.5 sec); hence ten R_r values are used: from time t = 0.0
to 0.5 seconds R_r = 4.9 Ohm, from 0.5 to 1.0 seconds R_r = 5.9 Ohm, and so on
until the value 13.9 Ohm. Finally, the time series observations of the stator
current were mixed with additive noise at various noise levels, indicated by the
signal to noise ratio SNR.
We use ten candidate parameter values (K = 10), tuned to R_r values of 5, 6, ...,
14 Ohm. When the actual R_r value is 4.9, the best estimate is 5 Ohm; similarly
for R_r = 5.9, 6.9, ... . Hence we can evaluate the results of the parameter
estimation experiments by listing the usual c figure of classification accuracy,
where a correct classification occurs when the algorithm picks the parameter
value which is closest to the currently active true parameter. The c figures for
various noise levels and sampling times are presented in Table 5.1.

3A larger sampling time implies that less information is obtained about the operation of the
motor and fewer comparisons are performed between the true system and the predictors.
Presumably, this makes the identification task harder.

Table 5.1. Classification figure of merit c for AC motor parameter estimation experiment.

SNR      δs=0.5 ms    δs=1 ms    δs=2 ms    δs=5 ms


Noise Free 0.951 0.950 0.916 0.870
20.00 0.949 0.947 0.916 0.870
10.00 0.947 0.940 0.908 0.855
6.66 0.943 0.929 0.894 0.816
5.00 0.934 0.917 0.866 0.822
4.00 0.905 0.889 0.820 0.801
3.33 0.881 0.887 0.808 0.720
2.50 0.870 0.853 0.796 0.687

Table 5.2. Robotic Manipulator Inertial Matrix Components.

A11:  I1 + I2 + m2·l1·l2·cos(φ2) + (1/4)·m1·l1² + m2·l1² + (1/4)·m2·l2²
A12:  I2 + (1/4)·m2·l2² + (1/2)·m2·l1·l2·cos(φ2)
A22:  I2 + (1/4)·m2·l2²
B12:  −m2·l1·l2·sin(φ2)
B21:  m2·l1·l2·sin(φ2)

5.5.2 Large Parameter Set


We next apply the credit assignment / genetic parameter estimation algorithm
to the problem of estimating the parameters of a robotic manipulator.
The MIT serial link direct drive arm can be configured as a two-link hor-
izontal manipulator by locking the azimuth angle at 180° (Atkeson, An and
Hollerbach, 1986). The input to the manipulator is an angle sequence φ̄_t =
[φ̄_1t  φ̄_2t]′. The manipulator is controlled by a PD controller so that the joint an-
gles φ_1t, φ_2t track the reference input φ̄_t. Therefore the problem is parameter
estimation under closed loop conditions. Ignoring gravity forces, the equations
describing the manipulator and controller are the following:

A(φ_t)·φ̈_t + B(φ_t, φ̇_t) + F·φ̇_t = Kp·(φ̄_t − φ_t) − Ku·φ̇_t,     (5.10)

where Kp = diag(Kp1, Kp2) and Ku = diag(Ku1, Ku2) are the controller gain
matrices, A(φ_t) = [A11 A12; A12 A22] is the inertia matrix, B(φ_t, φ̇_t) collects
the centrifugal terms built from B12 and B21, F is the friction coefficient, and
the state-dependent quantities A_ij, B_ij are given in Table 5.2.
In Table 5.2, the mass and length parameters are m1 = 120.1 Kgr,
m2 = 2.1 Kgr, l1 = 0.462 m, l2 = 0.445 m; these are considered unknown for the

purposes of this example. The remaining parameters (considered as known)
are the inertial parameters (I1 = 8.095 N·m·s²/rad, I2 = 0.253 N·m·s²/rad),
the coefficient of friction (F = 0.0005 N·m·s/rad) and the gain parameters of
the PD controller (Kp1 = 2500, Ku1 = 300 for the first joint, and Kp2 = 400,
Ku2 = 30 for the second joint).
To obtain the observation time series we first discretize the system equations
(with a discretization time step δt = 0.01 sec) to obtain equations of the form

x_t = f(x_{t−1}, u_t; θ₀),     y_t = g(x_t; θ₀),     (5.11)

where

x_t = [ φ_1t  φ_2t  φ_1,t−1  φ_2,t−1 ]′,
y_t = [ l1·cos(φ_1t) + l2·cos(φ_1t − φ_2t),  l1·sin(φ_1t) + l2·sin(φ_1t − φ_2t) ]′,     (5.12)

and u_t = [φ̄_1t  φ̄_2t]′ and θ = [m1  m2  l1  l2]′. This completes the discrete
time description of the system. Next we integrate eq.(5.11) and add white,
zero-mean and uniformly distributed noise to obtain the final time series y₁,
y₂, ... .
The genetic algorithm parameters are as follows: number of bits per parameter
n = 13, crossover probability q = 0.8, population size K = 50, number of break-
points m = 4, number of generations Imax = 5000. The parameter space Θ is a
subset of R⁴, namely Θ = [A1, B1] × [A2, B2] × [A3, B3] × [A4, B4]. Here, for
j = 1, 2, 3, 4, A_j is the lower bound of the j-th parameter, chosen to be 30%
of the true parameter value, and B_j is the upper bound of the j-th parameter,
chosen to be 200% of the true parameter value.
The parameter estimation experiment is run at various noise levels; in other
words, the additive noise mixed into the series is uniformly distributed in the
interval [−A, A], where A takes the values 0° (noise free), 0.25°, 0.50°, 1°, 2°,
3° and 5°. For every one of the above noise levels one hundred experiments are
run. The accuracy of the parameter estimates for every experiment is expressed
by the following quantity:

S = ( |δm1|/m1 + |δm2|/m2 + |δl1|/l1 + |δl2|/l2 ) / 4,

where δm1 is the error in the estimate of m1, and similarly for the remaining
parameters. In other words, S is the average relative error in the parameter
estimates.
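Computed on a single experiment, the S figure is a one-liner; the sketch below uses hypothetical estimates (the true values are the manipulator parameters quoted above; the estimates are invented for illustration):

```python
def avg_relative_error(est, true):
    # S: average of the per-parameter relative errors |delta|/value over the d estimates.
    return sum(abs(e - t) / abs(t) for e, t in zip(est, true)) / len(true)

true = [120.1, 2.1, 0.462, 0.445]   # m1 (Kgr), m2 (Kgr), l1 (m), l2 (m)
est = [119.8, 2.2, 0.460, 0.450]    # hypothetical estimates from one run
print(round(avg_relative_error(est, true), 4))  # 0.0164
```

An S of about 0.016 would fall in the "S < 0.02" row of Table 5.3.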
The experiment results are presented in Table 5.3, in cumulative form. For
comparison purposes, we also list in Table 5.4 similar results for a more tradi-
tional genetic algorithm which uses selection probabilities proportional to the
inverse of the total square error. In both tables the presentation format is the
same. Each row lists (for various noise levels) the number of experiments (out
of a total of one hundred) for which the final S error figure is less than the
number indicated in the first column. As can be seen, the credit assignment /

Table 5.3. Parameter Estimation by Credit Function Genetic Algorithm; Accuracy Results.

Noise

0 0.25 0.50 1.0 2.0 3.0 5.0


S < 0.01 78 78 66 41 10 4 2
S < 0.02 99 82 79 59 19 6 4
S < 0.03 99 85 82 67 29 9 4
S < 0.05 100 88 87 78 40 18 6
S < 0.10 100 91 88 87 60 32 12
S < 0.15 100 95 89 92 68 40 24
S < 0.20 100 98 89 92 74 52 34
S < 0.30 100 100 98 97 85 69 52
S < 0.40 100 100 99 98 86 85 69
S < 0.50 100 100 100 100 92 89 85

Table 5.4. Parameter Estimation by MSE Genetic Algorithm; Accuracy Results.

Noise

0 0.25 0.50 1.00 2.0 3.0 5.0


S < 0.01 15 1 1 0 0 0 0
S < 0.02 34 4 1 1 0 0 0
S < 0.03 50 8 4 1 0 0 0
S < 0.05 73 17 8 2 0 0 0
S < 0.10 93 57 24 5 1 1 0
S < 0.15 99 91 51 19 7 5 1
S < 0.20 100 97 85 51 29 24 11
S < 0.30 100 100 100 100 96 95 88
S < 0.40 100 100 100 100 100 100 100
S < 0.50 100 100 100 100 100 100 100

genetic parameter estimation algorithm performs significantly better than the
traditional one.
To better illustrate the performance of the genetic parameter estimation al-
gorithm, Figures 5.1-5.4 present histograms of the distribution of parameter
estimates over the one hundred experiments performed at noise level ±0.25°,
each figure relating to one of the four unknown parameters. As expected, the
algorithm picks parameter values centered around the true ones.
The length estimates are more "concentrated" than the mass estimates; we
conjecture that this is due to the relative insensitivity of the manipulator
behavior to mass values.

Figure 5.1. Histogram of m1 estimates (frequency versus percentage of the correct value).
Finally, let us mention that the average duration of one run of the parameter
estimation algorithm is 7 minutes on an HP Apollo 735 workstation.

5.6 CONCLUSIONS
In this chapter we have formulated the system identification problem in classi-
fication terms and seen how it can be solved by using the "predictive modular
approach", which is essentially the PREMONN approach minus the neural net-
works terminology. It is rather clear at this point that the important part of
PREMONN is the "predictive modular" credit assignment; the "neural" part
is rather incidental, since the "predictors" or "models" can be implemented in
non-neural ways.
In previous chapters we have applied the predictive modular credit assign-
ment approach to finite source or model sets. In this chapter we have seen
that in some cases of parameter estimation problems the above approach also
applies to infinite source or model sets. We have presented experiments which
indicate that such infinite sets can be searched efficiently by the divide-and-
conquer approach, whereby only a finite subset is examined at any given stage
of the search process. However, the algorithms we have presented so far are
rather ad hoc and their convergence properties cannot be analyzed in a rigorous
manner. In Part III we will treat in considerably greater detail the black box

Figure 5.2. Histogram of m2 estimates (frequency versus percentage of the correct value).

Figure 5.3. Histogram of l1 estimates (frequency versus percentage of the correct value).

Figure 5.4. Histogram of l2 estimates (frequency versus percentage of the correct value).

identification problem and we will develop a structured approach, which can
be proved to converge.
Part II Applications
6 IMPLEMENTATION ISSUES

In this chapter we discuss practical issues relating to the implementation of
the PREMONN algorithms. Particular attention will be paid to the significance
of various parameters which influence the classification and prediction
accuracy. We try to summarize the experience we have gained from extensive
experimentation with PREMONNs in both artificial and real-world problems.

6.1 PREMONN STRUCTURE


All PREMONN versions are characterized by the same modular architecture
as the basic PREMONN presented in Chapter 2. This architecture is presented
in Fig. 6.1.
It can be seen that PREMONNs have two components: prediction and credit
assignment. Prediction is effected by separating the source space into a finite
number of sources and developing a prediction module for each source. An
additional module assigns credit recursively. The prediction / credit assignment
structure can be implemented by the modular, hierarchical architecture
illustrated in Figure 6.1.
The left side of the architecture illustrated in Fig. 6.1 corresponds to prediction.
One predictive module corresponds to each source in the source set. The
predictive modules can be implemented by neural networks, but other predictor
choices are also available. Training of the predictive modules is performed
offline, each module being trained on data from a particular source.

V. Petridis et al., Predictive Modular Neural Networks


© Kluwer Academic Publishers 1998

Figure 6.1. The PREMONN architecture: the observation y_t enters K predictive modules
(left) and the credit assignment module (right), which outputs the credits p_t^1, p_t^2, ..., p_t^K.

The right side of the architecture corresponds to credit assignment. This is


performed online, using any of the recursive credit update equations presented
in Tables 3.1 (p.49) and 3.2 (p.54).

6.2 PREDICTION

6.2.1 Form of Prediction Module


As already mentioned, there is considerable latitude in choosing prediction
modules. Neural networks with linear, sigmoid, polynomial or RBF (radial
basis functions) neurons have been some of our choices. Other possibilities
include Kalman filters (Ljung, 1987), NARX and NARMAX models (Leon-
taritis and Billings, 1985a; Leontaritis and Billings, 1985b), spline regressions
(Friedman, 1991) Volterra or Laguerre series (Rugh, 1981; Wahlberg, 1991) etc.
Indeed, there is no restriction on the form of the predictive modules.
The only requirement is that a prediction y_t^k is available for every source (k = 1,
2, ..., K) and every time step (t = 1, 2, ...); the method by which the prediction
is obtained is not important. In fact, it is not necessary for every predictive
module to be of the same type. For instance it may be advantageous, for a
particular problem, to model one source with a linear predictor and another
with a sigmoid one.
So far we have been discussing black-box predictors; these provide an in-
put/output model of source behavior, but do not attempt to model the internal
mechanism which generates the time series. In certain cases (if such a mech-
anism is not known) this is a useful choice. If, on the other hand, some prior
knowledge of the source is available, better predictions can be derived using a
white box or structured predictor, as discussed in Chapter 6.

For classification problems, it may be advantageous to avoid using the most


accurate predictor. For example experimentation with chaotic time series has
led us to the following observations. A sigmoid predictor gives zero prediction
error when noise-free input is used, but the error increases rapidly when the
input is noisy. This may be due to the nonlinear, chaotic nature of the logistic
mapping: small input variations may result in large output variations (Arrow-
smith and Place, 1990, pp.226, 244). It is true that the sigmoid predictor may
approximate the input / output behavior of the logistic with excellent accuracy;
but this also implies that the sigmoid predictor inherits the logistic's sensitivity
to noise. On the other hand, a linear predictor has higher prediction error in
the case of noise-free observations, but is more robust to noisy observations;
while its prediction error never becomes very small, it never becomes too large
either. Because of the competitive nature of PREMONN credit assignment (i.e.
each predictor receives credit in accordance with its relative, rather than ab-
solute performance) the linear predictor yields superior classification accuracy.
Hence in certain cases it may be preferable to use a predictor of "average" and
robust performance.
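To make concrete the point that any predictor form will do, here is a minimal sketch (ours, not the book's code; the class names and toy weights are invented) of two heterogeneous predictive modules behind a common interface. All that PREMONN requires of a module is that it produces a prediction y_t^k from past observations.

```python
# Sketch (ours): two predictive modules of different types behind the
# same interface; PREMONN only needs the predict() output from each.

class LinearPredictor:
    def __init__(self, weights):
        self.w = weights                     # one weight per past sample

    def predict(self, past):                 # past = (y_{t-1}, ..., y_{t-M})
        return sum(w * y for w, y in zip(self.w, past))

class PersistencePredictor:
    """Trivial 'repeat the last value' module (hypothetical example)."""
    def predict(self, past):
        return past[0]

# Modules of different types can serve different sources side by side:
modules = [LinearPredictor([0.6, 0.4]), PersistencePredictor()]
predictions = [m.predict((1.0, 0.5)) for m in modules]
```

The credit assignment layer sees only the list of predictions, so replacing one module type by another is invisible to the rest of the system.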

6.2.2 Predictor Order M


Predictor order M, i.e. the number of past samples used to produce y_t^k, is also
related to noise robustness. Consider the case of a feedforward sigmoid neural
predictor. A high value of M will result in a large number of network weights,
which must be determined during offline training. If not enough training data
are available, a large number of weights may result in overfitting. Hence there
is an incentive to use low values of M. Of course the exact meaning of "low val-
ue" depends on the particular problem examined. Several excellent discussions
of overfitting are available in the literature (Geman, Bienenstock and Doursat,
1992). Following Geman's terminology, we have generally preferred to use a
low M so as to reduce variance, i.e. spurious prediction errors, even at the
expense of increased bias. This strategy is especially appropriate for classification
problems because of the competitive nature of PREMONN credit assignment.
The result is again increased noise robustness, at the expense of accuracy of
prediction (but not of classification).

6.2.3 Prediction Error Order N


In many time series problems the observations y_t are corrupted by observation
noise. The effect of noisy observations can be reduced by using the N-step
error E_t^k (rather than the one-step error e_t^k): large values of N result in noise
robustness. The reason is rather obvious: spurious prediction errors introduced
at particular time steps are smoothed by looking at the error over an interval
of N time steps. This noise robustness, however, is achieved at the expense of
sluggish performance. The problem is accentuated in case of rapid variations
in source activation. For instance, suppose N = 10 is used; then credits are
updated on every tenth time series observation. If a source switch takes place

on the fifth time series observation, the earliest moment at which this source
switch can be registered is after the next credit update, i.e. five time steps
later. If two source switches take place, say on the fifth and seventh time steps,
then the first source switch will probably pass completely unnoticed.
Essentially, a high value of N results in "low resolution" of the algorithm
along the time scale. The converse situation holds when low values of N are
used. The algorithm has increased resolution, at the expense of being less
robust to noise, since random error predictions are not averaged out.
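The smoothing effect of the N-step error can be illustrated with a small sketch (our own; the function name and the numbers are invented). A single spurious one-step error is diluted when it enters a block of N errors:

```python
# Sketch (ours): the N-step error as a sum of squared one-step errors
# over the last N steps; a noise spike is averaged out over the block.

def block_error(one_step_errors, t, N):
    """Block error over the window (t-N, t], as a sum of squared errors."""
    window = one_step_errors[max(0, t - N):t]
    return sum(e * e for e in window)

errors = [0.1] * 4 + [2.0] + [0.1] * 5    # a noise spike at step 5
spike_alone = errors[4] ** 2              # one-step view: 4.0
smoothed = block_error(errors, 10, 10)    # block view: 4.09 in total
# Per step, the spike now contributes 0.409 instead of 4.0; but a
# source switch inside the same block would be equally hard to see.
```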

6.2.4 Modularity, Scaling and Parallelism


The predictive layer of PREMONN is modular in the strong sense. In other
words, if one of the predictive modules in the bottom layer of Fig. 6.1 is replaced
by a new predictive module (corresponding to the same source) or even entirely
removed, this will not necessitate the retraining of the composite system. If new
sources are incorporated, corresponding predictive modules can be added to the
prediction layer; again, the old layers need not be retrained. This modularity
property reduces development time.
In addition, PREMONN training time scales linearly with the number of
sources involved in a problem. For instance, training a PREMONN for a ten-
source problem takes only twice as much time as training for a five-source problem.
This is in pleasant contrast to lumped neural networks, which are known to
scale badly. In particular, it is well known that in a lumped neural network,
as the number of neurons and/or weights increases, training time may fail to
produce a good solution; even in case a reasonable solution is obtained, training
time may increase exponentially with the number of weights and/or neurons.
This is one of the main reasons for the interest in modular networks, such as
PREMONN, where fixed size modules are trained in a piecewise manner. In
this case, clearly, both network size and training time scale linearly with the
number of sources (categories) that must be learned.
Finally, the modular PREMONN architecture is well adapted for parallel
execution. In particular, the prediction modules can operate in parallel and the
results (predictions) can be sent to the credit assignment module for further
processing. Hence execution time is independent of the number of classes in
the classification problem.

6.3 CREDIT ASSIGNMENT

6.3.1 Thresholding h and Switch Matrix R, W


We have already indicated that in the case of multiplicative algorithms source
switching can be handled by the use of either thresholding (with threshold pa-
rameter h) or by a Markovian assumption (with switching matrix parameter
R). Our experience indicates that thresholding gives superior results; concep-
tually, however, the h and R parameters operate in similar manner and can be
chosen by analogous considerations.

Choosing a value for h has already been discussed in Section 2.3. Let us
now present some remarks regarding the choice of the R matrix. In case we
have some prior knowledge about the source switching mechanism, this may
be incorporated in R. For example, consider a case of two sources: source no. 1
is active at times t = 1, 3, 5, ... and source no. 2 is active at times t = 2, 4, 6, ...;
in this case we would set R_kk = 0 (k = 1, 2) and R_nk = 1 (n, k = 1, 2,
n ≠ k). In some cases only partial information is available about the source
switching mechanism. For instance, suppose we know that source switching is
slow, i.e. once a source is activated, it will remain activated for a fairly long
time. This information can be incorporated in R by setting R_kk = 1 − (K − 1)·ε
(k = 1, ..., K) and R_nk = ε (n, k = 1, ..., K, n ≠ k), where ε is a small positive
number. The above remarks, intended for multiplicative algorithms, carryover
in exactly analogous manner to the case of fuzzy and incremental algorithms.
Regarding additive algorithms, an upper threshold A may be used in a manner
analogous to that of h: if the discredit becomes greater than A, it is no longer
increased. Similarly, in the case of counting algorithms, an upper threshold A
can be used: whenever credit becomes greater than A, it is no longer increased.
Similar remarks can be made regarding choice of w in the additive and counting
switching source algorithms. For instance, in the counting scheme (Section 3.5),
large values Wkk and small values Wkn, n i=- k are used when it is known that
source switching is slow.
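The "slow switching" choice of R can be written out explicitly. The following sketch is ours (the function name is invented, and the row/column indexing convention is an assumption; Chapter 3 fixes the actual one). It builds the matrix with R_kk = 1 − (K − 1)·ε and R_nk = ε:

```python
# Sketch (ours): the "slow switching" matrix R_kk = 1 - (K-1)*eps,
# R_nk = eps for n != k, for K sources.

def slow_switching_R(K, eps):
    return [[1.0 - (K - 1) * eps if n == k else eps
             for k in range(K)]
            for n in range(K)]

R = slow_switching_R(3, 0.01)
# Each row sums to one, so R is a stochastic matrix; most of the
# probability mass sits on the diagonal, i.e. "no switch" is likely.
```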

6.3.2 Variance σ

The σ parameter relates noise to the performance of modular predictive algo-
rithms. For instance, consider the case of a Gaussian error function, from a
numerical point of view. Note that σ appears in the term

e^{-|E_t^k|^2 / 2σ^2}        (6.1)

When σ is small, the expression (6.1) will be small for every k; what is more
important, however, is that any differences in the magnitude of the |E_t^k|^2 terms
will be accentuated. Since credits are renormalized in the range [0,1], the final
result of a small σ is to concentrate more credit on predictors with small E_t^k.
This ensures quick response, at the cost of reduced noise robustness. This can
also be understood through the convergence analysis of Chapter 4. Recall that,
for the multiplicative case, convergence rate depends on the ratios

e^{-E(|E_t^k|^2)/2σ^2} / e^{-E(|E_t^n|^2)/2σ^2}

(k, n = 1, ..., K). If predictor k has large mean square error and predictor n
has small mean square error, then for a small σ the above ratio is close to 0,
and convergence is fast. If, on the other hand, σ is large, the above ratio is
close to one and convergence is slow. In this sense, σ operates in a manner
similar to N, the block size parameter.
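Numerically, the effect of σ on the renormalized credits can be seen in a few lines. This is a sketch with toy numbers of our own, using the Gaussian term of equation (6.1):

```python
import math

# Sketch (ours): normalized credits built from the Gaussian term
# exp(-|E^k|^2 / (2 sigma^2)); small sigma sharpens the credit
# distribution, large sigma flattens it.

def normalized_credits(squared_errors, sigma):
    w = [math.exp(-e / (2.0 * sigma ** 2)) for e in squared_errors]
    s = sum(w)
    return [x / s for x in w]

squared_errors = [0.1, 0.4]                       # predictor 1 is better
sharp = normalized_credits(squared_errors, 0.2)   # small sigma
soft = normalized_credits(squared_errors, 2.0)    # large sigma
# sharp concentrates almost all credit on predictor 1 (fast response);
# soft keeps the credits nearly uniform (noise robust but sluggish).
```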

From a probabilistic point of view, σ is interpreted as the standard deviation
of prediction error; then a large value of σ indicates that little information is
generated by individual observations.
Yet another interpretation of σ is as a temperature parameter, similar to
what is used in simulated annealing methods (Geman and Geman, 1984).
The above considerations apply equally well to multiplicative, incremental
and fuzzy credit update schemes which use a quadratic error function and can
also be applied to the case of a general error function g(.), if the form g(E_t^k/σ)
is used in place of g(E_t^k). On the other hand, σ plays no important role in
additive and counting schemes and can be set equal to one.
We have found that classification performance can be improved by using a
time-variable σ_t, defined by

σ_t = (Σ_{n=1}^{K} |E_t^n|^2) / N.

In this case, |E_t^k|^2 / σ_t reduces to the proportional error of the k-th predictor with
respect to the total error, at time t.

6.3.3 Relationship between Variance and Threshold


Both variance a and threshold h are related to a speed/accuracy trade-off.
Large variance slows the network down and assigns little importance to in-
dividual errors; small variance speeds the network up, but also assigns more
importance to instantaneous fluctuations, which makes the network more prone
to instantaneous classification errors. Similarly, a low threshold (or absence of
a threshold) results in a long recovery time of the posterior probabilities after
source switchings. On the other hand, a high threshold tends to reduce the sig-
nificance of past performance and makes the response of the network to source
switchings faster, at the cost of spuriously interspersed false classifications.

6.3.4 Robustness of Credit Assignment


We have already remarked on the use of the predictor parameters M and N to
increase noise robustness of the PREMONN algorithms. But the main source of
PREMONN noise robustness is the competitive nature of the credit assignment
process. To understand this, let us consider, for instance, the multiplicative
credit assignment scheme. In this case, recall that the ratio of two credit
functions at time t is given by
p_t^k / p_t^n = (p_0^k / p_0^n) · e^{-Σ_{s=1}^{t} g(E_s^k)} / e^{-Σ_{s=1}^{t} g(E_s^n)}.
It is clear that a predictor's credit depends on its relative predictive accuracy
(as compared with that of other predictors), rather than on the absolute one.
The point is that in many cases it will not be necessary to have very accurate
predictions, at least as far as classification is concerned (the situation is differ-
ent, of course, in prediction problems), since classification performance depends

not on absolute, but on relative predictive accuracy. In other words, the time
series will be classified to a source even if the respective predictor performs
poorly, as long as it consistently outperforms the remaining predictors. This
results in considerable robustness to prediction error. Good predictors gener-
ally result in superior classification performance, but, within certain bounds,
high prediction error does not affect classification too much: since some clas-
sification must always take place, if no good class is available, the "least bad"
one will be chosen.
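A toy computation (ours; the error values are invented) makes the point: even when both predictors have large absolute errors, the multiplicative update concentrates credit on the one that is consistently, if only slightly, better.

```python
import math

# Sketch (ours): multiplicative credit update with quadratic error
# function g(E) = E^2. Both predictors are poor, but predictor 0 is
# consistently a little better and ends up with almost all the credit.

def multiplicative_update(credits, errors):
    w = [p * math.exp(-e * e) for p, e in zip(credits, errors)]
    s = sum(w)
    return [x / s for x in w]

credits = [0.5, 0.5]
for _ in range(20):                      # errors 0.9 vs 1.1 at every step
    credits = multiplicative_update(credits, [0.9, 1.1])
# Credit concentrates on predictor 0 despite its high absolute error:
# only the ratio of the two error sums matters.
```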

6.3.5 Modularity
All the remarks made in Section 6.2.4, regarding modularity of the pre-
dictive modules, can be repeated here with reference to the credit assignment
module. In particular, at any point in the classification or prediction process,
the credit assignment module can be replaced by a new one, implementing a
different credit assignment algorithm. This will not require retraining of the
remaining modules.

6.4 SIMPLICITY OF IMPLEMENTATION


In addition to accuracy and robustness, the choice of a particular classifica-
tion algorithm will also depend on simplicity of implementation. While the
predictive modules used for every PREMONN variety are the same, credit
computation can be more or less complicated according to the choice of a mul-
tiplicative, additive, counting, fuzzy or incremental algorithm. For instance,
the counting algorithm is the simplest one, since the only operations necessary
for its implementation are integer addition and minimum taking. The addi-
tive algorithm follows closely, requiring only floating point addition. Hence
the implementation of these algorithms only requires the use of adders. The
incremental credit assignment algorithms offer a good balance between simplic-
ity of implementation and performance: they require the use of floating point
addition and multiplication, but not division. The multiplicative algorithm is
probably the most complicated PREMONN algorithm, since it requires float-
ing point addition, multiplication and division. Finally, the fuzzy max/min
PREMONN algorithm requires floating point division and max/min taking. It
must also be kept in mind that the switching source versions of all the above
algorithms incur a higher complexity. As we have remarked, our experience is
that this additional complexity is not accompanied by a respective performance
improvement.
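As an illustration of how little machinery the counting scheme needs, here is one plausible reading of it (a sketch of ours, not the exact form of Table 3.1): at each step the predictor with the smallest error has its integer counter incremented, optionally capped by an upper threshold A as discussed in Section 6.3.1.

```python
# Sketch (our reading of the counting scheme): only integer addition
# and minimum-taking are needed, which is why it is so cheap to build.

def counting_step(counts, errors, A=None):
    best = min(range(len(errors)), key=lambda k: errors[k])
    if A is None or counts[best] < A:
        counts[best] += 1               # credit capped by threshold A
    return counts

counts = [0, 0, 0]
for errs in ([0.2, 0.5, 0.9], [0.1, 0.6, 0.7], [0.4, 0.3, 0.8]):
    counts = counting_step(counts, errs)
# counts is now [2, 1, 0]: source 0 was the best predictor most often.
```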
7 CLASSIFICATION OF VISUALLY
EVOKED RESPONSES

In this chapter we discuss the application of PREMONNs to a medical time


series classification task, namely the classification of visually evoked responses
(VER).¹

7.1 INTRODUCTION
Visually Evoked Potential (VEP), or Visually Evoked Response (VER), is an
electroencephalographic signal. Specifically, VER is the total electrical response
of the visual cortex to noninvasive visual stimulation. Electrical responses are
evoked from human subjects (by presenting them with a stimulus, usually a
flash of light or alternating checkerboards (Sokol, 1976)) and measured by appro-
priately placed electrodes. The VER signal has rather low amplitude (0.1–20
μV) and is superimposed on other components, caused by the normal activity
of the brain. Recording VER involves extracting the relevant information from
the ongoing EEG, with a signal to noise ratio of about -5 dB.
Abnormal VER responses, which can be used for diagnosing neuroophthal-
mological disorders, can be detected by evaluating features of the response

waveform, such as amplitude, latency, morphology and topography. A typical
VER waveform, where the above features are displayed, appears in Fig. 7.1.

Figure 7.1. A Typical VER Waveform.
(Latency is marked; the horizontal axis spans 0 to 300 msec.)

¹The work reported here was carried out by M. Swiercz and M. Grusza of the Technical
University of Bialystok, Poland, and P. Sobolewski, of the Ophthalmology Division, Suwalki
District Hospital, Poland. We want to thank these researchers for giving us permission to
present their work in this book.
More specifically, depending on the stimulation used, the shape of the typical
(i.e. for a healthy subject) VER waveform is as follows.

1. When the "flash" stimulation is used there is a slight initial negative deflec-
tion followed by a W-shaped wave. The positive components (peaks) of that
wave are called P1 and P2 waves. The time elapsed from the stimulation
to the first positive component (P1) is called latency and the difference be-
tween the minimum and the maximum values of the waveform is called the
amplitude.

2. When using "pattern" stimulation, the "healthy" VER is a sequence of


Negative-Positive-Negative polarity, of relatively high values. The major
positive component of VER is called P100, as it occurs around 100 ms after
the stimulation.

The morphology and the topography are the qualitative (not quantitative)
features of the waveform. There are no precise definitions of these features
(and the criteria of their comparison); generally speaking they relate to the
location of the positive and negative components of the waveform, the time
distance between them, the "flatness" of the waveform, the fluctuations of the
VER after the major peaks etc.
In short, while latency and amplitude can be easily quantified, they do not
furnish sufficient information for diagnosis. Wave morphology and topography,
on the other hand, are more informative but their evaluation is harder and
requires subjective expert's interpretation. Some diseases are characterized by
specific VER patterns, which result in clear clinical diagnosis when correlated
with other clinical symptoms (Halliday and Kriss, 1976); however, in many
cases the morphological analysis of a complex VER signal is not reliable, be-

cause of the lack of objective evaluation rules and the irregular features of each
individual VER waveform.
VER-based diagnosis of neuroophthalmological disorders can be seen as a
classification task: the subject's VER waveform must be assigned to one of
several possible classes, one class corresponding to each neuroophthalmological
disorder (or to the healthy state). Neural network classification methods can
be applied either to static feature vectors obtained by preprocessing a VER
signal (static pattern classification) or by considering the dynamically evolving
time series (time series classification).
Swiercz, Grusza and Sobolewski initially adopted the static pattern clas-
sification approach, using lumped neural network classifiers. This approach
provided useful results: it was found that lumped neural network classifiers are
capable of modelling the VER / disorder association up to a certain level of pre-
cision and can be useful in setting up more objective diagnosis procedures for
several neuroophthalmological disorders. However, the performance of lumped
classifiers has not been absolutely satisfactory and hence Swiercz, Grusza and
Sobolewski decided to apply the PREMONN methodology to the classification
problem, hoping for an improvement of classification rates.
Clearly the PREMONN classification algorithm is well suited to the VER
classification problem. Specific neuroophthalmological disorders appear to have
typical patterns, each of which can be considered as generated by a specific
source; the set of disorders (and corresponding sources) may be considered
finite. As will become clear in Section 7.4, the PREMONN algorithm yielded
higher classification accuracy, as compared to previously used lumped neural
classifiers (Swiercz, Grusza and Sobolewski, 1997).

7.2 VER PROCESSING AND CLASSIFICATION


Successful application of artificial neural networks to VER processing for diag-
nostic purposes and visual field analysis has been reported in the literature. A
few examples are listed below.
The standard ensemble² averaging (EA) technique used to compute a steady
state VER requires about 100 VER recordings to be extracted from an equal
number of EEGs.³ However, in many clinical situations the available record-
ings are much fewer than 100, as the VER is usually immersed in the ongoing
electroencephalogram (EEG). The number of recordings required can be substantially
decreased by various processing methods. For instance, by applying neural
networks for adaptively filtering and smoothing the VER signal the number of
recordings required to obtain a reliable VER recording has been reduced to 20
(Fung et al., 1996, Dumitras et al., 1994).

²The term ensemble is used to indicate a set of VER waveform recordings.

³Details regarding the recording and extraction method can be found in (Davila, Abaye and
Khotazand, 1994; McGillem, Aunon and Yu, 1985a; McGillem, Aunon and Pomalaza-Raez,
1985; Laguna et al., 1992).

Also, lumped ANN classifiers can classify chromatic visual evoked potentials
(Swihart and Matheny, 1992) and hence distinguish between the responses of
normal and color blind individuals. Subtle differences in the VER waveform,
which play an important role in the diagnosis of serious retinal pathologies,
have been recognized by such a system.
As already mentioned, Swiercz, Grusza and Sobolewski (Swiercz, Grusza and
Sobolewski, 1997) have applied lumped ANN classifiers to the VER classifica-
tion problem. A separate network was used for each of the classified disorders;
the network input was the wavelet decomposition coefficients of the VER wave-
forms (Strand and Nguyen, 1996; Thakor and Sherman, 1996) and the output
was the type of disorder. Various network architectures were tried, including

1. single- and two-layer hardlimit perceptron networks, trained by a modified


delta rule;

2. two- and three-layer backpropagation networks with logistic activation func-


tions;

3. RBF (Radial Basis Function) networks with Gaussian activation functions;

4. LVQ (Learning Vector Quantization) networks, grouping input vectors into


clusters;

5. Elman networks with the hyperbolic tangent activation function of input


neurons;

6. General Regression Neural Networks (GRNN).

The best results Swiercz, Grusza and Sobolewski obtained with the above
networks will be reported a little later.
Finally, neural networks have been also used for diagnostic interpretation of
visual field data from PC-based video-campimeters (Mutlukan and Keating,
1994). High classification accuracies suggest that neural networks incorporated
into PC-based video-campimeters may enable correct interpretation of results
by non-specialists.

7.3 APPLICATION OF PREMONN CLASSIFICATION


The VER classification problem can be formulated as a time series classification
problem by assuming that the observation time series y_1, y_2, ..., y_T is gener-
ated by one of K possible sources, each source corresponding to one classified
neuroophthalmological disorder. More formally, y_1, y_2, ..., y_T is generated by
source no. z (z taking values in the set of disorders), where z is selected ran-
domly at time t = 0 and kept fixed for times t = 1, 2, ..., T. The classification
task consists in estimating z. The following input data, predictors and credit
assignment algorithm have been used.

7.3.1 Input Data


The application presented here involved 115 patients with confirmed neurooph-
thalmological disorders: 25 with compressive chiasmal optic neuropathy, 20
with optic neuritis, 26 patients with optic nerve atrophy, 8 with demyelinative
neuropathy and sclerosis multiplex, 8 patients with oedema of the optic nerve
and 28 with other, non-classified diseases. A group of 8 patients without his-
tories of ocular or neuroophthalmological diseases was also included, yielding a
total of 123 subjects.
250 VER waveforms were obtained from the right and the left hemisphere
using the UTAS-E1000 (LKC Technologies Inc.) computer system. A flash
stimulus was provided by a xenon stroboscopic lamp. The duration of the
stimulus was 20 μs, the stimulus intensity 1500 lux in a Ganzfeld sphere⁴
of 30 cm diameter and the stimulation frequency was 2 Hz. The VER was
recorded by placing electrodes in the occipital region of the scalp, at the posi-
tions calculated by a computer algorithm. Low (< 0.3 Hz) and high (> 100 Hz)
harmonics of the VER response were filtered out and the VER was sampled
with a 1000 Hz sampling frequency. The first 256 milliseconds of each curve
were recorded, resulting in a time series y_1, y_2, ..., y_256. Steady state VER
recordings were calculated after 100 stimulations using an ensemble averaging
technique which removes the background EEG signal (of values higher than
the VER waveform); the signal which remains is considered to be the VER
curve and is simply averaged from a number (about 100) of recordings. This
operation also eliminates the effects at the beginning of examination (when the
patient's eyes are getting used to this form of stimulation) and random varia-
tions of the VERs. The raw VER waveforms were then normalized to the [0,
1] range and stored in ASCII files for further processing and ANN training.
The average VER recordings and their upper and lower enveloping curves
are shown in Figures 7.2 to 7.8. Each figure corresponds to one of the studied
groups of disorders.
The average waveforms and the waveform envelopes presented in the above
figures may be used for a preliminary classification of individual waveforms, but
this approach is not very successful, because of the high variation in individual
waveforms. Swiercz, Grusza and Sobolewski tried to perform classification by
using standard spectral decomposition (FFT) and retaining only 15-20 major
components of the spectrum. This method was only partly successful.. Other
statistical methods of data preprocessing, like computing moments and corre-
lations, showed that the VER characteristics in any parameter space occupy
overlapping clusters. Data preprocessing methods can reduce the overlapping
only partially. Even "normal" signals can differ substantially. Also the pro-
cedures of acquiring the data are not very restrictive so it can be expected

⁴The Ganzfeld sphere is a special construction, with an open front and a stroboscopic lamp
or checkerboard monitor at the opposite end. The patient places his or her head inside the
sphere; the visual field is limited only to a small angle (inside this sphere) and the patient is
exposed to the visual stimulation. The examination is performed in a dark room.

Figure 7.2. Normal Subject Waveform.
(— : Average Waveform; ··· : Upper Envelope; – · – : Lower Envelope. Normalized VER plotted against time steps 1–251.)

Figure 7.3. Compressive Chiasmal Waveform.
(— : Average Waveform; ··· : Upper Envelope; – · – : Lower Envelope.)

Figure 7.4. Optic Nerve Atrophy Waveform.
(— : Average Waveform; ··· : Upper Envelope; – · – : Lower Envelope.)

Figure 7.5. Optic Neuritis Waveform.
(— : Average Waveform; ··· : Upper Envelope; – · – : Lower Envelope.)

Figure 7.6. Sclerosis Multiplex Waveform.
(— : Average Waveform; ··· : Upper Envelope; – · – : Lower Envelope.)

Figure 7.7. Oedema of the Optic Nerve Waveform.
(— : Average Waveform; ··· : Upper Envelope; – · – : Lower Envelope.)

that additional noise will be introduced to the data by individual habits of the
technician or doctor who collects the data. The clearest pictures appear for
the waveforms of healthy subjects and for sclerosis multiplex patients. Looking
at the curves for sclerosis multiplex, two distinct maxima can be seen; however
for some patients they are shifted in time by as much as 25 ms. For the healthy
subjects the shapes are also well defined, so the differences between the average
and the enveloping curves are smaller, both at the beginning and the end of
the waveform. However the latencies (the time elapsed to the main positive
peak) for individual curves differ significantly. The maxima for optic neuritis
are probably better concentrated in time and the variations in the first part of
the curve (before it reaches maximum) are smaller. Analyzing chiasmal optic
neuropathy one can observe rather flat average curves at the beginning, but the
variations for individual curves are quite substantial. The differences in local
values and global parameters between the curves belonging to the same classes
make this a quite difficult classification task.
For further processing, the VER waveforms were smoothed using wavelet ap-
proximation. A fourth level decomposition by discrete fourth order Daubechies
filters was employed and the coarse approximation of the signal was used for
classification purposes 5. This method of data preprocessing removed the tiny
fluctuations of the waveforms which could result in deteriorated prediction ac-
curacy.

7.3.2 Predictors
Seven predictors were built: one for healthy subjects, one for each of the five
classified disorders and one for non-classified disorders. The predictors were
neural networks with two hidden layers. While the number of neurons in each
hidden layer varies, generally 3 or 4 sigmoid neurons were used in the first layer
and 2 to 4 sigmoid neurons in the second layer. The inputs used were y_{t-1},
y_{t-2}, ..., y_{t-M}. The output layer used one linear neuron and the target output
was y_t. Prediction quality did not depend significantly on network architecture,
i.e. number of neurons and the type of sigmoid activation function (logistic or
hyperbolic tangent).
70% of the data were randomly selected and used for off-line training of
the predictors; the MATLAB 4.2 environment and the Neural Networks Tool-
box training routines (Demuth and Beale, 1994) were used. The disorders are
indexed by i = 1, 2, ..., 7 (healthy state, five classified diseases and a class of
unclassified disorders). The neural network prediction is denoted by ŷ_t and the
total number of cases in each class by N_i. The average square prediction error
of each class is E_i, defined by

E_i = (1 / ((256 − M) · N_i)) Σ_{j=1}^{N_i} Σ_{t=M+1}^{256} (y_t − ŷ_t)².

⁵See the MATLAB Wavelet Toolbox (Demuth and Beale, 1994) for details.

For all classes Ei is around 0.002; hence the average absolute error is equal to
about 0.045. Since the VER samples have values in the range 0.6 to 1.0, the
average absolute prediction error is about 5% of the signal.
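The error measure E_i defined above can be computed directly from arrays of waveforms and predictions; a minimal sketch (function and variable names are ours, not the authors'):

```python
import numpy as np

def class_error(series, predictions, M):
    """Average square prediction error E_i over the N_i waveforms of a class.

    `series` and `predictions` are (N_i, 256) arrays; only samples
    t = M+1, ..., 256 enter the error, since the first M samples are
    needed as predictor inputs.
    """
    N_i, T = series.shape                      # T = 256 samples per waveform
    err = series[:, M:] - predictions[:, M:]
    return float((err ** 2).sum() / ((T - M) * N_i))
```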
It must be noted that Swiercz, Grusza and Sobolewski deliberately chose to
use neural predictors of a small size (i.e. with few neurons and connections),
which resulted in short training times but also in relatively high prediction
error. As already discussed, PREMONN robustness to noise allows for correct
classification even in the presence of high prediction error.

7.3.3 Credit assignment


Three credit assignment methods were used: multiplicative (with error function
g(E_t^k) = e^{-|E_t^k|}), additive (with error function g(E_t^k) = |E_t^k|) and counting.
These are the same methods described in Section 3.3.
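As a rough illustration (not the book's code; the exact definitions and normalizations are those of Section 3.3), a multiplicative credit step and one plausible reading of the counting scheme might look like:

```python
import numpy as np

def multiplicative_update(p_prev, errors):
    """One multiplicative credit step: p_t^k proportional to
    exp(-|E_t^k|) * p_{t-1}^k, renormalized to sum to one."""
    p = p_prev * np.exp(-np.abs(errors))
    return p / p.sum()

def counting_credit(error_history):
    """Counting scheme, under our reading: the credit of predictor k is
    the fraction of time steps at which it achieved the smallest error."""
    best = np.argmin(np.abs(error_history), axis=1)   # winner at each step
    counts = np.bincount(best, minlength=error_history.shape[1])
    return counts / counts.sum()
```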

7.4 RESULTS
A total of almost seventy five experiments were performed using the PRE-
MONN architecture, trying to determine the following.

1. The predictor order (M) (it was varied from 3 to 5).

2. The architectures of the local predictors; in all cases feedforward networks


were used, with a linear neuron in the output layer and one or two hidden
layers, with logistic and hyperbolic tangent neuron activation functions in
these layers.

3. The size of the training set.

4. The "quality of training" of local predictors; in particular the form of the
credit function (multiplicative, additive and counting) and the block size N
(usually N = 10 or N = 5 was used).

We present in Table 7.1 the classification results obtained by Swiercz, Grusza
and Sobolewski, using good values for the above parameters. Classification
was performed only at the final step, i.e. using the quantities p^1_{256}, p^2_{256}, ...,
p^7_{256}. The classification is carried out simultaneously for all classes of disorders.
In Table 7.1 we present classification results for the three PREMONN credit
assignment schemes used.
In addition in Table 7.1 are listed classification results obtained by the use
of lumped neural networks. Swiercz, Grusza and Sobolewski conducted over
one hundred VER classification experiments using lumped feedforward sigmoid
neural classifiers; they have experimented with network size, preprocessing
methods, predictor order M, window size N etc. The results they obtained
are listed in the last column of Table 7.1. LVQ classification was applied to the
same problem, but as the LVQ results were not as good, they are not reported
here (the reader can find more details in (Swiercz, Grusza and Sobolewski,
CLASSIFICATION OF VISUALLY EVOKED RESPONSES 119

Table 7.1. Classification Results.

Disorder Class        Multiplicative  Additive  Counting  FF Neural
                      PREMONN         PREMONN   PREMONN   Classifier
Healthy Subject       85.80%          86.30%    84.50%    N/A
Compr. Chiasmal       79.10%          77.60%    79.50%    76.30%
Optic Nerve Atrophy   80.80%          81.50%    80.20%    76.60%
Optic Neuritis        82.50%          80.20%    77.20%    84.60%
Sclerosis Multiplex   86.20%          84.40%    88.30%    81.70%
Oedema Optic Nerve    77.20%          80.10%    84.40%    N/A
Non-Classified        63.20%          61.80%    64.30%    N/A
Weighted average 1    81.24%          80.23%    80.04%    79.03%
Weighted average 2    77.17%          76.42%    77.03%    N/A

1997)). It must be noted that the lumped classifiers could not successfully
handle as many disorders as the PREMONN classifiers; hence certain entries in
the last column of Table 7.1 are marked N/A.
Classification accuracy for a particular disorder is the number of cases of
this disorder which were correctly identified (by the classifier) as belonging
to this disorder divided by the number of cases (belonging to this disorder)
which are present in the test set. The two last rows of the table show the
average classification accuracy results, weighted by the number of cases of each
disorder. We present two such averages. The row marked "weighted average
1" shows averages over the disorders to which both methods have been applied.
The row marked "weighted average 2" shows the average over all disorders; this
row has entries only for the PREMONN classifiers, since the lumped neural
classifier has only been applied to four out of the seven classes.
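The weighted averages follow directly from this definition of classification accuracy; a small sketch (names are ours):

```python
def weighted_average_accuracy(accuracies, case_counts):
    """Per-disorder accuracies averaged with weights equal to the number
    of cases of each disorder; classes without a result (None) are
    skipped, as for the FF classifier entries marked N/A."""
    pairs = [(a, n) for a, n in zip(accuracies, case_counts) if a is not None]
    total = sum(n for _, n in pairs)
    return sum(a * n for a, n in pairs) / total
```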
It can be seen that the PREMONN classifiers outperform the lumped FF
neural one for every disorder in which a comparison is possible, except for
optic neuritis. In addition, the overall average performance of the PREMONN
classifiers is better than that of the lumped classifier. Finally, the PREMONN
classifiers have been applied to a larger set of disorders. In particular, it would
be extremely difficult to apply the FF neural classifier to the "non-classified"
disorders. The credit function profiles for two representative experiments are
presented in Figures 7.9 and 7.10.
Figure 7.9 corresponds to the VER time series collected from a subject with
oedema of the optic nerve and Figure 7.10 corresponds to the VER time series
collected from a subject with optic neuritis.

7.5 CONCLUSIONS
The results indicate that PREMONN is an efficient tool for the analysis of
VER patterns. The classification accuracy was higher for PREMONN than for

Figure 7.8. Unclassified Disorders Waveform.

___ : Average Waveform; ... : Upper Envelope; _ . _ : Lower Envelope

[Plot: normalized VER amplitude (0 to 1.2) versus time steps 1-256.]

Figure 7.9. Credit function profile for a subject with oedema of the optic nerve.

___ : Correct Credit Function; _ _ : Other Credit Functions

[Plot: credit function value (0 to 1) versus time steps 1-256.]

Figure 7.10. Credit function profile for a subject with optic neuritis.

___ : Correct Credit Function; _ _ : Other Credit Functions

[Plot: credit function value (0 to 1) versus time steps 1-256.]

lumped classifiers trained separately on preprocessed VER data for all classes
of disorders, except optic neuritis. Classification accuracy averaged over all
cases is also higher for PREMONN. Finally the PREMONN classifier was ap-
plied to a wider class of disorders. While it could theoretically be possible to
train and "tune" very carefully a lumped ANN architecture to classify a single
disorder with better accuracy than PREMONN does, the predictive approach
gives significantly better results for the full set of disorders classified at the
same time.
Such reliable classification, obtained simultaneously for all classes of
disorders, is very promising and indicates that PREMONNs can be successfully
used at the first stage of diagnosis of major ophthalmological disorders. They
may be regarded as a key element of an expert system to support the doctor's
decision about referring the patient to more sophisticated and more expensive
diagnostic methods.
An important feature of PREMONN classification, which emerges from this
application is noise robustness. As has already been discussed, what matters
in PREMONN classification is not the absolute predictive accuracy but the
relative one; in other words even if a predictor predicts poorly, classification
will be accurate as long as the predictor predicts better than its competitors.
It can be seen from Table 7.1 that the lowest classification accuracy is
observed in the category of unclassified disorders. We find this particularly in-
teresting, because this is not really a single separate category; rather it contains
all cases which cannot be further classified. It would be interesting to consider

whether "unclassified disorders" can be separated into several subcategories,


perhaps using one of the unknown sources classification algorithms which will
be presented in Part III of the book. If such a categorization succeeds, it
may serve as a guideline for medical research to identify new ophthalmological
disorders.
8 PREDICTION OF SHORT TERM
ELECTRIC LOADS

In this chapter we present the application of the PREMONN methodology to


a prediction problem, namely short term electric load forecasting (STLF)¹,².

8.1 INTRODUCTION
Short term load forecasting refers to the prediction of hourly electric loads in a
power system. Generally, predictions must be made one day ahead of time; for
instance every evening predictions must be made of 24 values, corresponding
to the electric loads of every hour of the next day. Accurate predictions are
required so that the operation of the power system generators can be sched-
uled for the next day and the security of the system (probability of failure to
satisfy power requirements) can be assessed. Hence, the formulation of eco-
nomic, reliable and secure operating strategies requires accurate short term
load predictions.
Electric loads are influenced by a variety of factors; for instance previous
loads, weather and temperature conditions, the day of the week for which fore-

¹We prefer to use the term "prediction", following our usage in the rest of this book; however
the problem discussed in this chapter has traditionally been described as a "forecasting"
problem. The terms "forecasting" and "prediction" may be considered to be equivalent.
²The work reported here was carried out jointly by us and A. Bakirtzis and S. Kiartzis, both
of the Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki;
we want to thank them for allowing us to present this work here.

V. Petridis et al., Predictive Modular Neural Networks


© Kluwer Academic Publishers 1998

casts are required (for instance loads are lighter during weekends) and so on.
System operators have an intuitive appreciation of such factors and are able to
apply expert knowledge to scheduling power generation. However, because of
the economic importance of solving the STLF problem, more formal approaches
have been attempted and a large number of computational techniques have been
applied. Statistical models, expert systems, artificial neural networks and hy-
brid fuzzy neural networks are some of the approaches that have been tried.
No completely satisfactory solution to the problem has been found; since the
improvement of prediction accuracy by even a fraction of one percent results
in very significant savings the STLF problem is the subject of intense research.

8.2 SHORT TERM LOAD FORECASTING METHODS


Let us now consider in more detail some of the methods which have been applied
to the STLF problem.
Statistical STLF methods generally belong to the family of classical regression
and time series algorithms (Papalexopoulos and Hesterberg, 1990; Vemuri,
Huang and Nelson, 1981). Both static and dynamic models have been used.
In static models, the load is considered to be a linear combination of time
functions, while the coefficients of these functions are estimated through lin-
ear regression or exponential smoothing techniques (Christiaansen, 1971). In
dynamic models weather data and random effects are also incorporated and
autoregressive moving average (ARMA) models are frequently used. In this
approach the load forecast value consists of a deterministic component that
represents load curve periodicity and a random component that represents de-
viations from the periodic behavior due to weather abnormalities or random
correlation effects. An overview of various statistical approaches to the STLF
problem can be found in (Gross and Galiana, 1987). The most common (and
arguably the most efficient) statistical predictors apply a linear regression on
past load and temperature data to forecast future load. For such predictors,
we will use the generic term Linear Regression (LR) predictors.
Expert systems have been applied to STLF (Rahman and Bhatnagar, 1988;
Ho et al., 1990) with relative success. This approach, however, presumes the
existence of an expert capable of making accurate forecasts and, perhaps more
importantly, presenting his or her expertise in a format appropriate for com-
puter implementation. This has proved to be a serious problem in many cases.
The application of artificial neural networks to STLF has yielded encouraging
results; a discussion can be found in (Niebur et al., 1995). The usual arguments
are cited in favor of the use of neural networks: this approach does not require
explicit adoption of a functional relationship between past load or weather vari-
ables and forecasted load. Instead, the functional relationship between system
inputs and outputs is learned by the network through a training process. Once
training has been completed, current data are input to the neural network,
which outputs a forecast of tomorrow's hourly load. It must be noted however
that considerable craftsmanship is required for encoding prior knowledge about
the short term load demand into a neural network. A minimum-distance based
PREDICTION OF SHORT TERM ELECTRIC LOADS 125

identification of the appropriate historical patterns of load and temperature


used for the training of the neural network has been proposed in (Peng, Hubele
and Karady, 1992), while both linear and non-linear terms were used in the
neural network structure. Due to load curve periodicity, a non-fully connected
neural network consisting of one main and three supporting neural networks
has been used (Chen, Yu and Moghaddamjo, 1992) to incorporate input vari-
ables like the day of the week, the hour of the day and temperature. Various
methods were proposed to accelerate the neural network training (Ho, Hsu and
Yang, 1992), while the structure of the network has been shown to be system
dependent (Lu, Wu and Vemuri, 1993). Recently proposed neural network
models for STLF tune the model performance efficiency utilizing practical ex-
perience gained by the implementation of Energy Management Systems (EMS),
(Papalexopoulos, How and Peng, 1994; Mohammed et al., 1994; Bakirtzis et
al., 1995a). As can be seen, application of neural networks to the STLF problem
requires many delicate choices.
Hybrid neuro-fuzzy systems applications to STLF have appeared recently.
Such methods synthesize fuzzy-expert systems and neural network techniques
to yield good results, see (Srinivasan, Chang and Liew, 1995; Bakirtzis et al.,
1995b).

8.3 PREMONN PREDICTION


As the reader may have guessed, the proliferation of a large number of STLF
methods is a sure indicator that no method is superior to the rest; each has
its own advantages and shortcomings. Our own experience is that no single
predictor type is universally best. For example, a neural network predictor may
give more accurate load forecasts during morning hours, while a LR predictor
may be superior for evening hours. Hence, a method that combines various
different types of predictors may outperform any single lumped predictor of
the types discussed above.
The above observations led us to the adoption of a PREMONN approach to
the STLF problem. We started from the following point of view: if a prediction
method has superior performance for a particular time period, it may be as-
sumed that the predictor yields a good representation of the underlying process
which is generating the short term load time series during the respective period.
The existence of several models, each yielding superior predictive performance
for a particular period, makes plausible the assumption that several different
mechanisms (i.e. sources) participate in the generation of the overall time se-
ries. Hence the use of a PREMONN method for combining the various models
is natural: the goal is to combine several predictors so as to obtain the best
performance of each predictor.
More specifically, we found ourselves in the following position: a number of
lumped predictors had been developed for STLF on a specific power network
and each predictor yielded superior predictions on certain occasions; no pre-
dictor was globally outperforming all the rest. We proceeded to implement a
PREMONN system that pooled together the best elements from each predictor,
finally resulting in a combined predictor with performance superior to that
of all the lumped ones. Let us now present the details of this implementation.

Figure 8.1. Hourly load time series for a typical day.

[Plot: load (MW, 0-300) versus hour of the day (1-24).]

8.3.1 Input Data


The problem we are considering is the short term load forecasting for the power
system of the island of Crete. In the summer of 1994 this system had a peak
load of about 300 MW. Load and temperature historical data are available for
the years 1989 to the present³. The hourly load (24 hours) for a typical day
is presented in Figure 8.1. The annual load (365 days) for a typical hour is
presented in Figure 8.2.

8.3.2 Predictors
Three STLF predictors were developed, each using a different motivation. We
call these lumped predictors, to distinguish them from the "combined" PRE-
MONN predictor. Let us consider the characteristics of each lumped predictor.

"Long Past" Linear Predictor. This predictor (abbreviated as LP LR


predictor) performs linear regression (LR) on two time series: daily loads (for
a given hour of the day) and maximum daily temperature. N1 + N2 inputs
are used, where N1 is the number of past days' loads and N2 is the number

³Actually, we used data up to the year 1994, when this study was conducted.

Figure 8.2. Annual load time series for a typical hour.

[Plot: load (MW, 0-300) versus day of the year (1-365).]

of past days' maximum temperatures used. Several values of N1, between 21
and 56, have been employed, i.e. data from the last 21 to 56 days (this is the
reason that the term "long past" is used, in contradistinction to the term "short
past", which will be used a little later). The best value of N1 was determined
experimentally to be 35. N2 was always set equal to 2. The predictor's output
is the next day's load for the given hour. Hence, for a complete 24-hour load
prediction, 24 separate predictors were developed. The regression coefficients
were determined by minimization of the total squared prediction error, using a
standard matrix inversion routine. The training phase was performed only once
and offline. It should also be mentioned that the hourly load data were analysed
and "irregular days", such as national and religious holidays, major strikes,
election days, etc., were excluded from the training data set and replaced by
equivalent regular days; of course this substitution was performed only for the
training data. Load and temperature data for the years 1992 and 1993 were
used. Training error was computed by the formula

E = \frac{1}{24 \cdot T} \sum_{t=1}^{T} \sum_{m=1}^{24} \frac{|y_{m,t} - ŷ_{m,t}|}{y_{m,t}} ;
in other words it is the ratio of prediction error divided by the actual load, and
averaged over all days and hours of the training set; this turned out to be 2.30%.
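This error measure is straightforward to state in code; a minimal sketch (our naming, assuming loads are arranged as a days-by-hours array):

```python
import numpy as np

def relative_error(loads, forecasts):
    """Mean absolute relative error over all T days and 24 hours:
    E = (1 / (24 T)) * sum_t sum_m |y_mt - yhat_mt| / y_mt."""
    loads = np.asarray(loads, dtype=float)         # shape (T, 24)
    forecasts = np.asarray(forecasts, dtype=float)
    return float(np.mean(np.abs(loads - forecasts) / loads))
```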
We observed a "ceiling" effect regarding the possible reduction of forecast error:
while training error could be reduced below 2.30% by the introduction of more
regression coefficients, this improvement was not reflected on the test error.
This is a typical case of overfitting.
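The fitting step described above, least-squares minimization of the total squared prediction error, can be sketched as follows; `numpy.linalg.lstsq` stands in for the matrix inversion routine mentioned in the text, and the function names are ours:

```python
import numpy as np

def fit_lr_predictor(X, y):
    """Least-squares regression coefficients w minimizing ||X w - y||^2.
    X is a (days, N1 + N2) matrix of past loads and temperatures and y
    the corresponding next-day loads for the given hour."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_lr(w, x):
    """Forecast for one day from its input row x."""
    return x @ w
```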

"Short Past" Linear Predictor. This predictor (abbreviated as SP LR


predictor) is quite similar to the previous one. Again, straightforward linear
regression is applied to the electric load time series. The input consists of
loads for all hours of the day, in addition to maximum and minimum daily
temperature. Hence there are 24N1 + 2N2 inputs, where N1 is the number
of past days' loads and N2 is the number of past days' temperatures used.
Several values of N1, between 1 and 8, were employed and it was determined
that the best value was 4, which means data from four past days are used
as input to the predictor. N2 was always set equal to 2. In particular, for a
given forecast day, we used as input the two immediately previous days and the
same weekday of the previous two weeks. Hence this predictor uses a relatively
"short past". The predictor output is the next day's load for every hour of
the day. The regression coefficients are again determined by minimizing the
total squared prediction error. The previous remarks regarding training and
overfitting apply here as well. Training error (computed as the ratio of forecast
error divided by the actual load, averaged over all days and hours of the training
set) was 2.36%.

Neural Predictor. The final lumped predictor used was a fully connected
feedforward neural network, with sigmoid neurons and one hidden layer. The
neural network comprised 57 input neurons, 24 hidden neurons and 24 output
neurons representing next day's 24 hourly forecasted loads. The first 48 inputs
represent past hourly load data for today and yesterday. Inputs 49-50 are
maximum and minimum daily temperatures for today. The last seven inputs,
51-57, represent the day of the week; for instance Mondays are encoded by
setting input no. 51 equal to one and inputs 52 to 57 equal to zero. Other
input variables were also tested but they did not improve the performance
of the predictor. The neural network was trained by minimizing the total
squared prediction error, using an incremental back propagation algorithm;
i.e. input/output patterns were presented until the average error between the
desired and the actual outputs of the neural network over all training patterns
became less than a predefined threshold. Once again, "irregular days" were
removed from the training data. The training data set consisted of 90 + 4 · 30 =
210 input/output patterns created from the current year and the four past
years' historical data as follows: 90 patterns were created for the 90 days of
the current year prior to the forecast day. For every one of the 4 previous
years, another 30 patterns were created around the dates of the previous years
that correspond to the current year forecast day. After an initial offline training
phase was completed, the neural network parameters were updated daily (using
the day's incoming data) for an additional one month period. The network was
trained continuously, until the average training error became less than 2.5%.
It was observed that further training of the network (for example to a training
error threshold equal to 1.5% ) did not improve the accuracy of the prediction
on the validation data. We believe this is also evidence of overfitting.
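The 57-dimensional input encoding just described can be sketched directly (function name and argument order are ours):

```python
import numpy as np

def encode_input(loads_today, loads_yesterday, tmax, tmin, weekday):
    """Build the 57-dimensional input vector: 48 past hourly loads,
    today's maximum and minimum temperature, and a one-hot day-of-week
    code (weekday = 0 for Monday, ..., 6 for Sunday)."""
    day_code = np.zeros(7)
    day_code[weekday] = 1.0   # Monday sets input no. 51 (index 50) to one
    return np.concatenate([loads_today, loads_yesterday,
                           [tmax, tmin], day_code])
```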

8.3.3 PREMONN Prediction


This was implemented by the usual procedures described in Chapter 3. We
actually organized the electric load data into 24 distinct time series (one for
every hour of the day) which can be denoted as

y_{m,1}, y_{m,2}, ..., y_{m,t}, ...

where m = 1,2, ... ,24 corresponds to the hour of the day. We also developed 24
distinct PREMONN combined predictors. The m-th predictor combined the
output of the three lumped predictors by the formula
y^*_{m,t} = p^1_{m,t} ŷ^1_{m,t} + p^2_{m,t} ŷ^2_{m,t} + p^3_{m,t} ŷ^3_{m,t},

where p^k_{m,t} (k = 1, 2, 3) was computed using

p^k_{m,t} = \frac{e^{-(y_{m,t} - ŷ^k_{m,t})^2/\sigma^2} \cdot p^k_{m,t-1}}{\sum_{n=1}^{3} e^{-(y_{m,t} - ŷ^n_{m,t})^2/\sigma^2} \cdot p^n_{m,t-1}},

i.e. a standard multiplicative PREMONN credit update.
8.4 RESULTS
Let us now compare the performance of the PREMONN predictor with the
three lumped predictors. Twenty four separate cases must be considered, and
these are listed in Table 8.1. The results presented in this table correspond to
the period from April to June 1994. In particular, we present prediction errors
(averaged over this period) for the four types of predictors used, and for the 24
hours of the day.
The reader can observe that for most hours of the day, and in average
performance, the PREMONN outperforms all lumped predictors; even in cases
where a lumped predictor outperforms PREMONN, the difference is small. The main
point, however, is that, for a given hour of the day, it cannot be known in
advance for which time period a particular predictor will yield the best predic-
tion; it is exactly the role of PREMONN to track predictor performance online
and to select the best performing predictor. These points can be appreciated
by considering Figure 8.3, which presents a comparative plot of true loads and
forecasts for a representative day.
An even better understanding of the above point can be reached by looking
at the evolution of credit functions. Consider for example Fig. 8.4, where the
evolution of posterior probabilities for the predictors of 1pm load is plotted for
the period July 1st, 1994 to September 30th, 1994. Similarly, in Fig. 8.5 the
evolution of posterior probabilities is plotted for the predictors of 1am load,
over the same time period.
The reader will observe that in Fig. 8.4 the highest credit is generally as-
signed to the LP LR predictor, even though over short time intervals one of the

Table 8.1. Hourly average relative errors for June to September 1994 (error is expressed
in percent units).

Hour Long Past LR Short Past LR Neural Net PREMONN


1 2.89 1.92 2.26 1.96
2 2.22 1.72 2.09 1.63
3 2.14 1.93 2.50 1.69
4 2.55 2.38 2.49 2.17
5 2.71 2.23 2.44 2.31
6 2.55 2.31 2.41 2.16
7 2.47 2.15 2.16 2.01
8 3.11 3.09 2.72 2.38
9 2.57 2.85 2.17 2.07
10 2.72 2.95 2.53 2.36
11 2.53 2.86 2.72 2.32
12 2.44 2.87 2.91 2.43
13 2.24 3.07 2.85 2.16
14 2.29 3.19 2.51 2.10
15 2.26 2.77 2.36 1.95
16 2.29 2.93 2.44 2.12
17 2.35 3.30 2.38 2.13
18 1.82 2.96 2.41 2.06
19 1.98 2.97 2.51 2.16
20 2.30 3.17 2.52 2.34
21 1.83 2.98 2.76 1.99
22 1.95 2.58 2.40 1.93
23 1.70 1.84 1.92 1.56
24 2.20 1.96 2.13 1.78
Average 2.34 2.62 2.44 2.07

other two predictors may outperform it. Similarly, in Fig. 8.5 the high-
est credit is generally assigned to the SP LR predictor, even though over short
time intervals one of the other two predictors may outperform it. These results
are consistent with the general test errors of Table 8.1; the additional infor-
mation presented in Figs. 8.4, 8.5 is that a predictor which generally performs
poorly, may still outperform its competitors over short time intervals; in such
cases the PREMONN will take this improved performance into account, as
evidenced by the adaptively changing posterior probabilities. This explains why
the PREMONN is generally better than the best lumped predictor.
Finally, it is quite instructive to compare average total errors for training
and test data as given for various values of the total number of regression coef-
ficients. These are presented in Table 8.2. The rows in bold letters correspond
to the lumped predictors actually used in the PREMONN combination.

Figure 8.3. A typical daily load curve and the lumped and combined predictions.

[Plot: load (MW, 0-300) versus hour of the day.]

Table 8.2. Dependence of relative prediction error on number of parameters (error is given
in percent units and training time is given in seconds).

Predictor Type   Nr. of Parameters   Train Error   Test Error   Train Time

LP LR             864                2.41          2.44         1.05
LP LR            1032                2.34          2.37         1.22
LP LR            1200                2.30          2.34         1.38
LP LR            1368                2.27          2.36         1.60
LP LR            1536                2.26          2.38         1.85
LP LR            1704                2.25          2.39         2.09
SP LR            1248                2.53          2.69         0.13
SP LR            1824                2.43          2.68         0.34
SP LR            2400                2.36          2.62         0.86
SP LR            2928                2.32          2.72         1.45
NN               1992                2.50          2.44         2.35
NN               1992                2.00          3.76         8.92
NN 1992 2.00 3.76 8.92

Figure 8.4. Evolution of posterior probabilities for the predictors of 1pm load is plotted
for the period July 1st, 1994 to September 30th, 1994. The solid line corresponds to the
credit of the "Long Past" linear predictor, the dotted line to the credit of the "Short Past"
linear predictor and the dashed line to the credit of the neural predictor.

_ : LP Predictor; . .. : SP Predictor; _ _: Neural Predictor

[Plot: posterior probability (0 to 1) versus day (1-92).]

The reader can see that an increase in the number of regression coefficients
yields improved training errors but test errors remain the same or even increase.
This is an instance of over fitting. On the other hand, the PREMONN also uses
an increased number of coefficients, namely the sum of the numbers of coeffi-
cients of the three predictors. In our case this would be 1992+2400+1200=5592.
While we have not tried to train any lumped predictor with 5592 free regres-
sion coefficients, extrapolating from Table 8.2, one expects the test error to
be actually larger than that of any lumped predictor with fewer coefficients.
However, PREMONN increases the number of coefficients in a judicious and
structured way, resulting in the marked decrease of test error to 2.07%. Simi-
larly, training time scales very efficiently for the PREMONN. It is equal to the
total time for training the three lumped predictors, which on a 66 MHz 486
PC was 1.38+0.86+2.35=4.59 seconds. To compute the time for training one
lumped predictor with 5592 coefficients, one could use the data in Table 8.2 and,
extrapolating linearly, obtain an expected training time between 7 and 25
seconds, i.e. anywhere between roughly 150% and 550% of the PREMONN training time.
In fact, however, linear extrapolation is probably too optimistic. For the LR
predictors, it is known that matrix inversion time scales cubically with the size

Figure 8.5. Evolution of posterior probabilities for the predictors of 1am load is plotted
for the period July 1st, 1994 to September 30th, 1994. The solid line corresponds to the
credit of the "Long Past" linear predictor, the dotted line to the credit of the "Short Past"
linear predictor and the dashed line to the credit of the neural predictor.

_ : LP Predictor; . .. : SP Predictor; _ _: Neural Predictor

[Plot: posterior probability (0 to 1) versus day (1-92).]

of the problem; as for the neural network predictor, increasing the size of the
network may result in either exponentially long training time or, even worse,
complete failure of the training procedure (e.g. entrapment at local minima).

8.5 CONCLUSIONS
The results indicate that PREMONN predictor combination outperforms all
conventional, "lumped" prediction methods in the test problem we have con-
sidered and yields a significant decrease of STLF error, which may have serious
economic impact. The use of PREMONN enables us to pick the best fea-
tures of each lumped predictor in a dynamic and unsupervised manner. From
a somewhat different point of view, PREMONN can be seen as a judicious
and systematic method for combining a large number of regression coefficients,
avoiding overfitting problems.
9 PARAMETER ESTIMATION FOR
AN ACTIVATED SLUDGE PROCESS

In this chapter we present an application of predictive modular parameter esti-


mation, using the hybrid PREMONN/genetic algorithm presented in Chapter
5. This algorithm is applied to the estimation of parameters of an activated
sludge waste water treatment plant¹.

9.1 INTRODUCTION
While the method of activated sludge is very widely used for waste water treat-
ment, it appears that the process is so complex that it has not yet been accu-
rately modeled. Nevertheless the so-called IAWPRC no. 1 Model (Henze et al.,
1983) is widely used by chemical engineers for computer simulation purposes.
Indeed it is stated in (Henze et al., 1983) that: "[Computer] Modeling is an
inherent part of the design of a wastewater treatment system, regardless of the
approach used.... [because] ... limitations of time and money prevent explo-
ration of all potentially feasible solutions." The final goal of using the model
is gaining a better understanding of the activated sludge process under various
operating conditions.

¹This is joint work we have carried out with Manos Paterakis, doctoral candidate at the
Department of Electrical and Computer Engineering, Aristotle University; we want to thank
him for allowing us to incorporate this work here, as well as for his assistance in the writing
of this chapter.

V. Petridis et al., Predictive Modular Neural Networks


© Kluwer Academic Publishers 1998
The main difficulty with the use of this model is the estimation of its pa-
rameters and considerable effort has been expended towards finding a reliable
parameter estimation method. Generally, the strategy followed for evaluation
of various parameter estimation algorithms is to run a computer model using
certain parameter values, then pretend that these values are unknown and use
the produced data as input to some parameter estimation algorithm. Here we
also follow the same approach and produce several time series (corresponding
to the observable states of the waste water treatment system) by computer
simulation; these time series are then used as input to the PREMONN/genetic
algorithm which estimates the values of several key parameters which govern
the evolution of the process.

9.2 THE ACTIVATED SLUDGE MODEL


The typical activated sludge system is the most common and useful system of
biological waste water treatment. Its main function is the removal of organic
matter from the waste water. A diagram of the process is presented in Figure
9.1.

Figure 9.1. Diagram of an activated sludge waste water treatment plant.

[Diagram: Waste Water enters the Aeration Tank; the mixture flows to the Precipitation Tank; a Recirculation Pump returns Recirculated Sludge to the Aeration Tank, while Surplus Sludge is drawn off.]

The process can be briefly described as follows. The waste water is placed
in an aeration tank and brought in contact with a microorganism-containing
solution. The organic material comes in contact with the microorganisms and
is removed from the liquid phase; then, hydrolytic enzymes are added to the so-
lution which enable the microorganisms to metabolize the organic matter. The
mixture of organic matter and microorganisms is forwarded from the aeration
tank to a precipitation tank, where the microorganisms settle down and then

are recycled into the waste water treatment plant, while the processed waste
water is removed from the system.
Several complex chemical processes are involved in this procedure and con-
siderable effort has been devoted to modeling these. A major landmark in
modelling the activated sludge process was the introduction of the so-called
IAWPRC Model no.1 (Henze et al., 1987), which is described by the following
nonlinear differential equations.

$$\frac{dS_{NH}}{dt} = -i_{xb} M_H \frac{S}{S+k_S} \cdot \frac{S_O}{S_O+k_{OH}} X_H - \left(i_{xb}+\frac{1}{Y_A}\right) M_A \frac{S_{NH}}{S_{NH}+k_{NH}} \cdot \frac{S_O}{S_O+k_{OA}} X_A + u_2(1+r)(S_{NH1}-S_{NH}) \qquad (9.5)$$

$$\frac{dS_O}{dt} = -\frac{1-Y_H}{Y_H} M_H \frac{S}{S+k_S} \cdot \frac{S_O}{S_O+k_{OH}} X_H - \frac{4.57-Y_A}{Y_A} M_A \frac{S_{NH}}{S_{NH}+k_{NH}} \cdot \frac{S_O}{S_O+k_{OA}} X_A + m_{SO} \qquad (9.6)$$

$$\frac{dS_1}{dt} = -\frac{1}{Y_H} M_H \frac{S_1}{S_1+k_S} \cdot \frac{k_{OH}}{k_{OH}+S_O} \cdot \frac{S_{NO1}}{S_{NO1}+k_{NO}} n_g X_{H1} + \cdots \qquad (9.7)$$

$$\frac{dX_{H1}}{dt} = M_H \frac{S_1}{S_1+k_S} \cdot \frac{k_{OH}}{k_{OH}+S_O} \cdot \frac{S_{NO1}}{S_{NO1}+k_{NO}} n_g X_{H1} + u_1 r X_R - u_1(1+r) X_{H1} \qquad (9.8)$$

$$\frac{dS_{NH1}}{dt} = -i_{xb} M_H \frac{S_1}{S_1+k_S} \cdot \frac{k_{OH}}{k_{OH}+S_O} \cdot \frac{S_{NO1}}{S_{NO1}+k_{NO}} n_g X_{H1} + u_1 S_{NH,in} + u_1 r S_{NH} \qquad (9.9)$$



Table 9.1. IAWPRC model no.1 parameters.

Symbol   Meaning                                                        Value

Y_A      Yield for autotrophic biomass                                  0.24
Y_H      Yield for heterotrophic biomass                                0.67
i_xb     Mass of nitrogen per mass of COD                               0.086
k_S      Half saturation coefficient for heterotrophic biomass          20
k_NH     Ammonia half saturation coefficient for autotrophic biomass    1
M_H      Maximum specific growth rate for heterotrophic biomass         6
b_H      Decay coefficient for heterotrophic biomass                    0.62
n_g      Correction factor for M_H under anoxic conditions              0.8


It can be seen that the model is described by ten state variables: S, X_H, S_O,
X_A, S_NH1, S_NH, S_NO, S_1, X_H1, S_NO1. Four of these states are observable,
namely: S (readily biodegradable substrate); X_H (active heterotrophic bio-
mass); S_O (oxygen concentration); S_NO (nitrate and nitrite nitrogen concen-
tration). Because of space limitations it is not possible to give an explanation
of the physical significance of the above equations and variables; the reader is
referred to (Henze et al., 1987) for details.
It can be seen that a large number of parameters are also involved in the
above equations. Of these parameters some are known a priori, some can be
estimated using experimental methods and, finally, some must be estimated al-
gorithmically. The following table lists some important parameters of the latter
category, their physical significance and the values we have used in our simu-
lations. These values are provided in (Henze et al., 1987) and are customarily
used in studies of the model.
While the IAWPRC Model no.1 has been widely used as a reasonable model
of the activated sludge process, it is generally accepted that using it to describe
the operation of a particular waste water treatment plant is not an easy task.
One of the main difficulties is estimation of the model parameters. The values
appearing in Table 9.1 are simply "reasonable" values; parameter estimation
must be performed to obtain the values corresponding to a particular plant.
During the last decade, various methods have been applied to the parame-
ter estimation problem and considerable effort expended towards its solution.
Many methods have been used. In the original report (Henze et al., 1987)
various experimental methods have been proposed for measuring some of the
model parameters. Apart from the considerable effort required to perform the
necessary measurements, the obtained results may be fairly inaccurate. Hence

other approaches have been tried, involving the use of various parameter es-
timation algorithms. For instance, (Jeppson and Olson, 1994) use extended
Kalman filtering to perform parameter estimation for a reduced model; (Ayesa
et al., 1994) also apply extended Kalman filtering in conjunction with a sen-
sitivity analysis; (von Sperling, 1994) uses multiple Monte Carlo simulations
of the model (with various parameter values) and then applies a classification
algorithm to separate satisfactory from nonsatisfactory (in terms of accuracy)
simulations; (Finnson, 1994) essentially applies a trial and error strategy in-
volving multiple simulations.
In all of the above cases, it is generally accepted that a very accurate estima-
tion of all parameters is not feasible. In fact, for some parameters estimation
errors of up to 50% appear to be acceptable; the general goal is to obtain pa-
rameter values which capture the qualitative behavior of the actual (in fact
computer simulated) system; of course the final goal is to obtain accurate esti-
mates, which will yield quantitatively correct results.

9.3 PREDICTIVE MODULAR PARAMETER ESTIMATION


We formulate the parameter estimation problem as an optimization problem:
the cumulative squared output error of the model is the cost function to be
minimized by appropriate choice of the model parameters. In particular, we
consider the following eight parameters to be unknown: Y_A, Y_H, i_xb, k_S, k_NH,
M_H, b_H, n_g (these are the parameters appearing in Table 9.1, where their
physical significance is explained). The remaining parameters can be estimated
directly, using experimental methods which provide fairly accurate estimates.
Now, following the formulation of Chapter 5, we define a parameter vector θ
as follows:

θ = (Y_A, Y_H, i_xb, k_S, k_NH, M_H, b_H, n_g).

Taking θ = θ_0, where θ_0 is obtained by using the values listed in Table 9.1, and
using eqs. (9.1)-(9.10) we obtain a particular instance of the activated sludge
model; this is then discretized in time and simulated on the computer, resulting
in a vector time series of system outputs: y_1, y_2, ..., y_T, where each y_t collects
the observable states S, X_H, S_O and S_NO at time t.
The system is simulated for a real time interval of 20 days; the time discretiza-
tion step used is T_i = 1.5 min and the sampling step T_s = 3 min. This results in
a total number of T = 9600 observations. Representative time series of oxygen
concentration S_O and biomass X_H are plotted in Figures 9.2 and 9.3.
On the other hand, picking any value of θ and running the discretized system
for times t = 1, 2, ..., T, we obtain a new time series: y_1^θ, y_2^θ, ..., y_T^θ. Now
we can write the cumulative squared output error as a function of θ:

$$J(\theta) = \sum_{t=1}^{T} \left| y_t - y_t^{\theta} \right|^2$$
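In code, evaluating J(θ) is a one-liner once the observed and model-generated series are at hand; the following is a minimal sketch, assuming the two series are stored as arrays with one row per time step (the simulation step that produces y_t^θ is not shown):

```python
import numpy as np

def cumulative_squared_error(y_true, y_model):
    """J(theta): sum over t of |y_t - y_t^theta|^2, where each y_t may be
    a vector of observed states (here: one row per time step)."""
    y_true = np.asarray(y_true, dtype=float)
    y_model = np.asarray(y_model, dtype=float)
    # Squared Euclidean distance, summed over all time steps and states.
    return float(np.sum((y_true - y_model) ** 2))
```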

Figure 9.2. S_O Time Series.

[Plot of S_O versus time steps.]

Figure 9.3. X_H Time Series.

[Plot of X_H versus time steps.]

The parameter estimation task is to find a value of θ which minimizes J(θ)
(preferably this value should be θ_0, but a close approximation is also accept-
able). This is effected by the hybrid genetic algorithm of Chapter 5, which is


applied with the following parameter values.

1. Genetic algorithm parameters: number of bits per parameter n = 10, crossover
   probability q = 0.6, population size K = 200, number of break points m = 12,
   maximum number of generations I_max = 5000.

2. Parameter space Q: it is a subset of R^8, namely Q = [A_1, B_1] x [A_2, B_2] x ... x
   [A_8, B_8]. Here, for j = 1, 2, ..., 8, A_j is the lower bound of the j-th parameter,
   chosen to be 30% of the true parameter value, and B_j is the upper bound
   of the j-th parameter, chosen to be 200% of the true parameter value.
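With n = 10 bits per parameter, each chromosome is an 80-bit string that maps onto a point of Q. One plausible decoding (linear scaling of each 10-bit integer into [A_j, B_j]) is sketched below; the exact encoding used by the algorithm of Chapter 5 may differ in details:

```python
def decode_chromosome(bits, bounds, n_bits=10):
    """Map a binary chromosome (n_bits per parameter) to a parameter
    vector theta inside Q = [A_1,B_1] x ... x [A_8,B_8].
    Illustrative decoding, not necessarily the book's exact scheme."""
    theta = []
    for j, (a_j, b_j) in enumerate(bounds):
        gene = bits[j * n_bits:(j + 1) * n_bits]
        level = int("".join(str(b) for b in gene), 2)   # 0 .. 2^n_bits - 1
        theta.append(a_j + (b_j - a_j) * level / (2 ** n_bits - 1))
    return theta
```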

9.4 RESULTS
We have run several parameter estimation experiments using the above setup.
Following the same noise model used in (Ayesa et al., 1994; von Sperling, 1994),
the measurements are contaminated with multiplicative, white, zero-mean noise
with Gaussian distribution. Noise level is characterized by the standard devi-
ation σ_v of the noise; this is taken to be 0, 0.01, 0.03 and 0.05, resulting in
four experiment groups. For every choice of noise level 50 experiments are run;
the accuracy of the parameter estimates for every experiment is expressed by
a relative error δ defined as follows:
$$\delta = \frac{1}{8} \sum_{n=1}^{8} \frac{\left| \varepsilon^{(n)} \right|}{\theta_0^{(n)}}$$

Here θ_0^(n) is the value of the n-th true parameter, while ε^(n) is the n-th pa-
rameter error, i.e. the difference between the true n-th parameter and the one
estimated at the conclusion of the algorithm.
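The relative error δ is easy to compute; a minimal sketch (the parameter ordering is immaterial, as long as true and estimated vectors are aligned):

```python
def relative_error(theta_true, theta_est):
    """delta: average of |parameter error| / true parameter over the
    estimated parameters (here: however many are supplied)."""
    terms = [abs(t - e) / abs(t) for t, e in zip(theta_true, theta_est)]
    return sum(terms) / len(terms)
```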
For comparison purposes we have also attempted to solve the parameter es-
timation (error minimization) problem using a genetic algorithm with selection
probabilities proportional to the inverse of mean square error (we call this an
MSE genetic algorithm). In Table 9.2 we present the performance of the predic-
tive modular genetic algorithm for four different noise levels (σ_v = 0.00, 0.01,
0.03 and 0.05) and of the standard (MSE) genetic algorithm with noise free
observations. Each row in the table shows the percentage of experiments for
which the relative error δ was less than the indicated value.
It can be seen that the predictive modular genetic algorithm gives significantly
better results than the MSE genetic algorithm. In addition, this is true even
when we compare the performance of the predictive modular algorithm with
noisy observations to that of the standard one with noise free observations.
To better illustrate the performance of the predictive modular genetic algo-
rithm, we also present histograms of the MH estimates for four levels of noise
((T1J=0.00, 0.01, 0.03, 0.05). These are presented in Figures 9.4 to 9.7.
In Figures 9.8 and 9.9 we present graphs of the 80 (oxygen concentration)
time series obtained from the true and the estimated model (Fig.9.8 corresponds

Figure 9.4. Histogram of M_H estimates (over all experiments at noise level σ_v = 0.00).

Figure 9.5. Histogram of M_H estimates (over all experiments at noise level σ_v = 0.01).

Figure 9.6. Histogram of M_H estimates (over all experiments at noise level σ_v = 0.03).

Figure 9.7. Histogram of M_H estimates (over all experiments at noise level σ_v = 0.05).

Table 9.2. Parameter estimation results for the predictive modular genetic algorithm and
(last column) for the standard genetic algorithm.

δ <       Noise free   σ_v = 0.01   σ_v = 0.03   σ_v = 0.05   MSE

0.01          3             0            1            1          0
0.02         34            20           14            3          1
0.03         73            56           50           13          6
0.05         99            98           78           36         25
0.10        100           100          100           91         85
0.15        100           100          100          100         99
0.20        100           100          100          100        100

Figure 9.8. S_O time series obtained from the true and estimated (at noise level 0.00)
model.

[Plot. Solid line: True Observation; dashed line: Model (Noise Free Estimate).]

to estimates obtained at σ_v = 0.00, while Fig. 9.9 corresponds to estimates ob-
tained at σ_v = 0.05). It can be seen that these time series are in good agreement
both qualitatively and quantitatively.
Finally, it should be noted that the average duration of one run of the para-
meter estimation algorithm is from 1 to 5 hours on an HP Apollo 735 worksta-

Figure 9.9. S_O time series obtained from the true and estimated (at noise level 0.05)
model.

[Plot. Solid line: True Observation; dashed line: Model (Noisy Estimate).]

tion. Experiment duration depends on the noise level: more noisy experiments
take longer because the genetic algorithm requires more epochs to converge.

9.5 CONCLUSIONS
Estimation of the eight parameters of the IAWPRC model no.1 is a hard prob-
lem. Hence the results which we have obtained are highly satisfactory, since
they provide parameter estimates which are quite accurate and capture both the
qualitative and quantitative behavior of the true system, even in the presence
of noise in the observations. In short, the predictive modular genetic para-
meter estimation algorithm works well on a challenging problem. It would be
worthwhile to test the algorithm on real (rather than simulated) data and check
whether the resulting parameter estimates capture the behavior of a real world
activated sludge process.
Part III Unknown Sources
10 SOURCE IDENTIFICATION
ALGORITHMS

In this chapter we explore the problem of black box time series identification
for the case of source switching. This amounts to unsupervised development of
models for a time series which is generated by a collection of alternately acti-
vated, initially unknown sources. We present two algorithms which accomplish
this task and present guidelines which can be used to develop variations of these
algorithms. A concept which is central to our presentation is data allocation.
Numerical experiments are presented to illustrate our point of view.

10.1 INTRODUCTION

Up to this point we have been mainly concerned with classification and pre-
diction of time series which are generated with known sources. However, in
Chapter 5 we have considered the identification problem, where one or more
models of the input/output behavior of the time series must be developed. In
this part of the book we will concentrate on this problem. We will consider
the case where initially no information at all is available regarding such input/
output behavior. The only assumption we will make is that the time series is
produced by more than one source.
Under the circumstances, our goal is to discover the number of sources in-
volved in the generation of the time series and to develop a black box input/
output model for each such source. We refer to this problem as source identi-


fication. In this chapter we will consider a family of PREMONN source iden-


tification algorithms which are characterized by the following features.

1. They are designed for online operation.

2. Their basic components are a data allocation phase and a predictor training
phase.

3. The two phases are executed alternately and repeatedly.

4. Data is allocated to several competing predictors, on the basis of prediction


accuracy.

5. Allocated data are used to retrain the predictors.

6. Predictors may be added as needed, until several well trained predictors are
obtained, one predictor corresponding to each active source.

Hence the proposed source identification algorithms are modular (since sev-
eral predictors are involved) and predictive (since data allocation depends on
a predictive criterion). In our presentation we assume that neural predictors
are used, but this is not a crucial point; according to previous remarks, any
convenient predictor model can be employed.
We consider predictor training to be a straightforward task and are mostly
concerned with the data allocation component. In other words, we believe
that if the training data are separated into groups, each group containing data
generated by a single source, it will be an easy matter to use one of the many
available neural network training algorithms so as to obtain a well trained
predictor for every source.
Hence the critical component of source identification is the data allocation
scheme; in the rest of this chapter we consider in detail two such schemes, one
implementing parallel data allocation and the other implementing serial data
allocation. These terms will be explained in detail later; for the time being it
suffices to say that they stand in two extremes of a spectrum which also con-
tains various hybrid (partly serial, partly parallel) data allocation schemes. In
the next two chapters we will consider the convergence properties of data allo-
cation schemes and we will prove that, subject to certain reasonable conditions,
"correct" data allocation can be successfully performed.
As soon as the source identification phase is completed (i.e. as soon as
sufficiently well trained predictors become available) any of the PREMONN
classification algorithms can also be executed. Actually, the source identifica-
tion and time series classification (or prediction) algorithms will usually run in
parallel. Given the convergence properties of the identification algorithms, it
may be expected that well trained predictors will always be available to the
classification algorithm, excluding transient periods which are associated with
the activation of new sources. Hence it may be expected that the classifica-
tion (and prediction) algorithms will perform successfully, in accordance with

the convergence results of Chapter 4. In other words, the combination of algo-


rithms presented in Parts I and III presents a complete solution to the problems
of time series classification and prediction for both the known and unknown
sources situations.

10.2 SOURCE IDENTIFICATION AND DATA ALLOCATION


10.2.1 Identification through Data Allocation and Predictor Training
Let us recall the terminology introduced in Part I. Namely, an observable time
series y_t, t = 1, 2, ... is generated by a source time series z_t, where z_t takes
values in a finite source set Θ = {1, 2, ..., K}. At time t, y_t depends on the
value of z_t, as well as on y_{t-1}, y_{t-2}, ..., according to the equation

$$y_t = F_{z_t}(y_{t-1}, y_{t-2}, \dots, y_{t-M}).$$

Hence the time series is generated by the combination of K functions: F_1(.),
F_2(.), ..., F_K(.). However, unlike the cases considered in Parts I and II,
the number K is unknown and neither the functions F_k(.) nor approximating
predictors f_k(.) are available. The PREMONN classification algorithm cannot
be applied before predictors f_k(.) (approximating the functions F_k(.)) are ob-
tained. The source identification task consists in obtaining the f_k(.) predictors
(for k = 1, 2, ..., K).
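To make the setting concrete, the following toy sketch generates such a switching time series; the two source functions (with memory M = 2) and the noise level are invented stand-ins, since in the source identification problem the true F_k are unknown:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two invented source functions F_1, F_2 with memory M = 2
# (stable AR(2) recursions chosen purely for illustration).
F = [
    lambda y1, y2: 0.8 * y1 - 0.2 * y2,
    lambda y1, y2: -0.5 * y1 + 0.4 * y2,
]

def generate(z, noise=0.01):
    """y_t = F_{z_t}(y_{t-1}, y_{t-2}) + noise; z is the hidden source series."""
    y = [0.1, 0.2]                       # arbitrary initial conditions
    for t in range(2, len(z)):
        y.append(F[z[t]](y[-1], y[-2]) + noise * rng.standard_normal())
    return np.array(y)
```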
If the source time series z_t were observable, then it would not be hard to ob-
tain input/output models f_k(.) of the time series y_1, y_2, ... (for k = 1, 2, ..., K).
It would only be required to separate the observed data y_1, y_2, ..., y_t, ... into
K groups, one group corresponding to each observed value of z_t (i.e. to each
active source). Then it would be easy to train an accurate input/output model
f_k(.) for each source, using one of the many efficient neural network training
algorithms.
However, since the source time series cannot be observed, it is not obvious
how to group the observed data y_1, y_2, ..., y_t, .... The first step in the
source identification process must be data allocation, i.e. the separation of the
observed data into groups, one group corresponding to each active source. Note
that data allocation is similar to time series classification (since in both cases it
is required to estimate z_t for every time step t). However, classification requires
that the predictors f_k(.) are available, while data allocation is necessary exactly
in order to train the f_k(.)'s. Hence it appears that we have entered a vicious
circle.
The source identification algorithms presented in the following section break
out of this circle by implementing an online, unsupervised process of gradual
predictor training. The critical component in this algorithm is the data alloca-
tion scheme.

10.2.2 Parallel versus Serial Data Allocation


Let us now explain in more detail what is meant by parallel and serial data
allocation. Recall that our general approach to time series problems is charac-

terized by the use of multiple models which are compared according to their
predictive performance. This approach is also followed in the data allocation
problem: data are allocated to predictors according to their predictive perfor-
mance.
To illustrate this point, consider a very simple example, which involves a
time series generated by two sources, and two predictors, each modeling (im-
perfectly) one source. Now suppose that one of the two sources is activated and
generates y_t, the next observation of the time series. Assume that the active
source is the one corresponding to predictor no.1; then we can use y_t to retrain
predictor no.1 and hence improve its predictive (modeling) accuracy. How can
we test our assumption? Well, if predictor no.1 is "reasonably well trained",
then we can expect the prediction error |y_t - y_t^1| to be "small". Finally, there
are at least two ways to make the term "small" operationally meaningful. We
can compare |y_t - y_t^1| either to a fixed number d or to |y_t - y_t^2|, the error of the
second predictor. In other words, two strategies can be followed for allocating
each datum y_t to one of the two available predictors.

1. The errors can be compared to each other and y_t allocated to the predictor
   with minimum prediction error:

   if |y_t - y_t^1| ≤ |y_t - y_t^2| then y_t is allocated to predictor no.1;
   if |y_t - y_t^1| > |y_t - y_t^2| then y_t is allocated to predictor no.2.

   Because in this case the two predictor errors are used simultaneously, we
   refer to this data allocation scheme (and its generalizations to the case of
   more predictors) as parallel data allocation.
2. The errors can be compared, one at a time, to a threshold d and y_t allocated
   to the first predictor with error less than the threshold:

   if |y_t - y_t^1| ≤ d then y_t is allocated to predictor no.1;
   if |y_t - y_t^1| > d then y_t is allocated to predictor no.2.

   Because in this case the two predictor errors are used one at a time, we refer
   to this data allocation scheme (and its generalizations to the case of more
   predictors) as serial data allocation.
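For the scalar case, the two strategies can be sketched as follows; these helper functions are illustrative (the threshold d and the predictions are supplied by the caller), not the book's implementation:

```python
def allocate_parallel(y_t, predictions):
    """Parallel allocation: give y_t to the predictor with the
    smallest prediction error."""
    errors = [abs(y_t - p) for p in predictions]
    return errors.index(min(errors))

def allocate_serial(y_t, predictions, d):
    """Serial allocation: give y_t to the first predictor whose error
    is below the threshold d; return None if every predictor fails
    the test (datum rejected)."""
    for k, p in enumerate(predictions):
        if abs(y_t - p) <= d:
            return k
    return None
```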

What can be said about the behavior of these data allocation schemes? In
particular, what can we expect in case the predictors are not well trained? The
answer is that either of the above data allocation strategies is self reinforcing:
even if initially the predictors are not well trained, eventually each predictor will
tend to collect data which "predominantly" originate from one source, rejecting
all other data.
We will attempt to justify the above claim in Section 10.3 and (more rigor-
ously) in Chapters 11 and 12. However, the basic idea should be clear at this
point. Let us again consider the case of two sources and two predictors. In both
the parallel and serial case, each predictor initially is not "specialized" in any

particular source. However, if one predictor happens to collect more data from
one source, as soon as it is trained on such data it will tend to accept more data
from the same source and reject data from the other source. This will result
in further specialization in the same source, which will lead to the predictor
collecting more data generated by it; at the same time, the other predictor
will start collecting data from the other source and hence start specializing in
it. Under appropriate conditions this process will be self reinforcing and hence
lead to complete specialization of each predictor in one source.

10.2.3 Hybrid Data Allocation


In case there are more than two predictors, in addition to the parallel and serial
data allocation schemes, a number of hybrid schemes can be used, which utilize
both parallel and serial comparisons of prediction errors. To illustrate the basic
ideas, let us consider the case of three predictors. In this case, data allocation
can be performed in several ways; let us list a few possibilities. The sample y_t
is allocated to predictor no. k*, where k* can be defined in the following ways.

A. Purely parallel data allocation. k* = argmin_{k=1,2,3} |y_t - y_t^k|.

B. Purely serial data allocation. k* = min_{k=1,2,3} {k : |y_t - y_t^k| < d}.

C. Hybrid data allocation. Compare serially predictor no.1 and composite
   predictor no.(2,3); then compare in parallel predictors 2 and 3:

   k* = 1 if |y_t - y_t^1| < d; otherwise k* = argmin_{k=2,3} |y_t - y_t^k|.

D. Hybrid data allocation. Compare in parallel predictor no.1 and com-
   posite predictor no.(2,3); then compare serially predictors 2 and 3:

   k* = 1 if |y_t - y_t^1| < min_{k=2,3} |y_t - y_t^k|; otherwise k* = min_{k=2,3} {k : |y_t - y_t^k| < d}.
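The four schemes can be written compactly in code; in this sketch e[0], e[1], e[2] are the prediction errors of predictors 1, 2, 3 (0-based indexing is used here, unlike the 1-based numbering of the text):

```python
def scheme_a(e):
    """Purely parallel: the smallest error wins."""
    return min(range(3), key=lambda k: e[k])

def scheme_b(e, d):
    """Purely serial: first predictor whose error beats threshold d,
    or None if all fail."""
    return next((k for k in range(3) if e[k] < d), None)

def scheme_c(e, d):
    """Serial test on predictor 1, then parallel between 2 and 3."""
    return 0 if e[0] < d else min((1, 2), key=lambda k: e[k])

def scheme_d(e, d):
    """Parallel test of predictor 1 against the best of {2, 3},
    then serial between 2 and 3."""
    if e[0] < min(e[1], e[2]):
        return 0
    return next((k for k in (1, 2) if e[k] < d), None)
```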

Schemes C and D are hybrids, lying between the purely parallel scheme A
and the purely serial scheme B. Since the labeling of predictors is arbitrary, the
above list essentially exhausts all possibilities involving three predictors. For
instance, a scheme comparing predictors 2 and (1,3) serially and then predictors
1 and 3 in parallel is essentially equivalent to scheme C. Hence, in the case of
three predictors, there are essentially four possible arrangements, which can be
illustrated graphically in Figures 10.1-10.4.
As the number of predictors increases, so does the number of possible com-
binations of serial and parallel comparisons, resulting in a multitude of hybrid
data allocation schemes. The possibilities are further increased if the options
of rejecting data and/or adding new predictors are also included. For instance,
an algorithm may be devised which adds a new predictor in case the smallest
error of all existing predictors is above the error threshold d.

Figure 10.1. Scheme A: a fully parallel architecture for data allocation to three predictors.

Figure 10.2. Scheme B: a fully serial architecture for data allocation to three predictors.

Figure 10.3. Scheme C: a hybrid (serial/parallel) architecture for data allocation to three
predictors.

Figure 10.4. Scheme D: a hybrid (parallel/serial) architecture for data allocation.

10.2.4 Comparison of Data Allocation Schemes


Which of the above schemes should be used in a particular source identifica-
tion problem? We cannot offer any definitive answers to this question, but
we believe the answer is related to the number of active sources. Since this
number will be initially unknown, it makes sense to start using few predictors
(perhaps one or two) and use an algorithm which has the option of adding new
predictors as required. Parallel, serial and hybrid algorithms can all satisfy
this requirement for "growing potential". As it will be seen in Section 10.4,
experiments indicate that both "pure" data allocation schemes have excellent
performance. In addition, subject to reasonable conditions, both the purely se-
rial and purely parallel algorithms can be proved to converge (the convergence
proofs are presented in Chapters 11 and 12, respectively). Once convergence
of the pure schemes is established, it is relatively easy to prove convergence of
hybrid schemes as well.
In the following section we will present two source identification algorithms;
the first uses a purely parallel and the second a purely serial data allocation
scheme. In both the parallel and serial case, the unknown number of sources
is handled by starting with one or two predictors and allowing for the dynamic
introduction of new predictors when necessary.

10.3 TWO SOURCE IDENTIFICATION ALGORITHMS

10.3.1 Parallel Source Identification Algorithm


The source identification algorithm with data allocation by parallel prediction
error comparison and an option for introducing new predictors (in short: the
parallel source identification algorithm) will now be presented. In addition to
the quantities defined in Part I, the algorithm requires the use of two new
quantities: a threshold d and a critical length N_c; both of these parameters are
utilized for deciding when to increase the number of predictors K.

PARALLEL SOURCE IDENTIFICATION ALGORITHM

Initialization

Set K = 2. Initialize randomly 2 predictors, described by the functions f_k^(0)(.),
k = 1, 2.

Main Routine

For t = 1, 2, ...
    For k = 1, ..., K
        Compute prediction y_t^k = f_k^(t-1)(y_{t-1}, ...).
        Compute prediction error E_t^k = |y_t - y_t^k|.
    Next k
    Set k* = argmin_{k=1,2,...,K} |E_t^k|.
    Assign the current block y_t to predictor k*.
    If t = n · L (for any n) then:
        For k = 1, ..., K
            Retrain the k-th predictor (using all data assigned to it) for J
            iterations and so obtain a new f_k^(t)(.) neural network.
        Next k
    End If
    For k = 1, ..., K
        If the size of the k-th training set is larger than N_c and Σ_i |E_i^k| > d:
            Set K ← K + 1.
            Replace the k-th predictor by two identical copies of itself.
            Allocate all data of the replaced predictor to both new predictors.
        End If
    Next k
Next t
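The core loop above can be illustrated with a deliberately simplified, runnable sketch: running means stand in for the neural predictors, "retraining" happens after every sample (L = 1, with exact rather than iterative training), the predictor-splitting step is omitted, and all numerical values are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two constant-level "sources" activated alternately,
# plus a little observation noise (levels and noise are invented).
levels = [0.0, 1.0]
z = [0] * 50 + [1] * 50 + [0] * 50            # hidden source sequence
y = [levels[zt] + 0.05 * rng.standard_normal() for zt in z]

# Two "predictors": each simply predicts the mean of the data
# allocated to it so far (a minimal stand-in for neural predictors).
centers = [0.2, 0.8]                           # arbitrary initialization
buckets = [[], []]
for yt in y:
    k = min(range(2), key=lambda j: abs(yt - centers[j]))  # parallel allocation
    buckets[k].append(yt)
    centers[k] = float(np.mean(buckets[k]))                # "retraining"

# After the run each predictor has specialized in one source:
# centers[0] is near 0.0 and centers[1] is near 1.0.
```

This tiny example already exhibits the self-reinforcing behavior discussed earlier: whichever predictor first collects data from a source drifts toward that source and subsequently wins the allocation of its data.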

Remark 1. The modular nature of the algorithm is evident. The algorithm


is competitive: a data block may be allocated to the k-th predictor even when
the error E_t^k is large; what matters is that it is smaller than the errors of the
remaining predictors. Note that computation of predictions and training of
predictors can be performed in parallel, resulting in significant execution speedup.
Remark 2. It has been assumed that training of the predictors is performed by
an iterative procedure. For instance, backpropagation can be used for sigmoid
feedforward neural networks. In cases where exact training is possible (e.g. for
linear predictors) it is not necessary to perform J training iterations.
Remark 3. The algorithm can be modified so as to allow merging of predic-
tors. Merging can occur in case the errors of two predictors are consistently
small and of comparable size for the same observations y_t. For simplicity
of presentation we did not include this option in the above description of the
algorithm, but it should be obvious how to implement it.
Remark 4. If source identification takes place in conjunction with classifica-
tion or prediction, the source identification algorithm can be executed in two
modes. The first mode is the one described above, where the source identifi-
cation phase is executed first; after the predictors have been sufficiently well
trained (this can be judged by monitoring prediction error) the classification
or prediction phase commences. In the second mode, the identification and
classification phases are executed concurrently. Source identification is not ob-
tained from training data; instead the actual test data are used as they become
available. At every time step the predictors are updated and the resulting
predictions are used for the classification phase. If, after a while, the source
identification is completed, it may be deactivated and predictor retraining dis-
continued. In case there is evidence of a new source being introduced (for
instance, large prediction errors) the identification scheme may be reactivated
to train a new predictor. Executing the algorithm in this mode requires some
obvious modifications to the above description, which are omitted for brevity
of presentation.

10.3.2 Serial Source Identification Algorithm


We now present the source identification algorithm with data allocation by se-
rial prediction error comparison and an option for introducing new predictors
(in short: the serial source identification algorithm). This algorithm, just like
the parallel one, can be executed either prior to or concurrently with the classi-
fication algorithm. In addition to the quantities defined in Part I, the algorithm
requires the use of a threshold d; similarly to the case of the parallel algorithm,
this is used to decide when to increase the number of predictors K.

SERIAL SOURCE IDENTIFICATION ALGORITHM

Initialization

Set K = 1. Initialize randomly one predictor, described by the function f1^(0)(.).

Main Routine

For t = 1, 2, ...
    For k = 1, ..., K
        Compute prediction Ŷt^k = fk^(t-1)(Yt-1, ...).
        Compute prediction error |Et^k| = |Yt - Ŷt^k|.
        If |Et^k| < d then
            assign the current block Yt to the k-th predictor
            and exit the For loop.
        End If
    Next k
    If Yt has not been assigned to any predictor,
        set K ← K + 1,
        assign Yt to the K-th predictor.
    End If
    If t = n·L (for any n) then:
        For k = 1, ..., K
            Retrain the k-th predictor (using all data assigned to it) for J
            iterations and so obtain a new predictor fk^(t)(.).
        Next k
    End If
Next t

Remark 1. This algorithm is also, evidently, of a modular nature. It is not
as competitive as the parallel algorithm: data allocation does depend on the
absolute size of |Et^k|. Note that the term "serial" refers only to the comparison
of prediction errors; computation of predictions and retraining of predictors can
be performed in parallel.
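The serial rule admits a similar Python sketch, again with affine stand-ins for the neural predictors; the names are ours. One liberty is taken to keep the sketch short: observations that no trained predictor explains within threshold d are pooled into the newest, not-yet-retrained predictor instead of spawning a fresh predictor per sample, which mimics the block-wise behavior of the algorithm above.

```python
import numpy as np

def serial_allocate(y, d=0.1, L=10, seed=0):
    """Sketch of serial allocation: errors are examined one predictor at a
    time and the datum goes to the FIRST trained predictor whose error is
    below the threshold d; if none qualifies, the predictor bank grows."""
    rng = np.random.default_rng(seed)
    params = [rng.normal(size=2)]        # K = 1 initial predictor (a, b)
    data = [[]]
    trained = [False]
    assign = []
    for t in range(1, len(y)):
        for k, (a, b) in enumerate(params):
            if trained[k] and abs(y[t] - (a * y[t - 1] + b)) < d:
                break                    # serial comparison: first fit wins
        else:
            if trained[-1]:              # newest predictor already specialized:
                params.append(rng.normal(size=2))   # grow K by one
                data.append([])
                trained.append(False)
            k = len(params) - 1          # pool unexplained data in the newest one
        data[k].append((y[t - 1], y[t]))
        assign.append(k)
        if t % L == 0:                   # periodic exact retraining
            for j, dj in enumerate(data):
                if len(dj) >= 2:
                    X = np.array([[x, 1.0] for x, _ in dj])
                    v = np.array([w for _, w in dj])
                    params[j] = np.linalg.lstsq(X, v, rcond=None)[0]
                    trained[j] = True
    return assign, params
```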
Remark 2. Similarly to the parallel algorithm, the serial algorithm can be
modified so as to allow for merging of predictors, in case it is observed that
prediction errors of two predictors are consistently of small and comparable size
for a large number of observations Yt. For simplicity of presentation we do not
include this option in the above description of the algorithm, but it should be
obvious how to implement it.
Remark 3. The serial algorithm can also be run prior to or concurrently to
a classification or prediction algorithm, much like the parallel algorithm. The
remarks concerning the parallel algorithm can be repeated for the serial one as
well.

10.3.3 Emergence of Specialization


Both the parallel and the serial algorithm perform the source identification task
by effecting a gradual specialization of predictors, finally producing one well
trained predictor for every source which has participated in the generation of
the time series. Let us present here an informal argument to support this
claim; a more rigorous justification will be presented in Chapter 11 (regarding
the parallel algorithm) and in Chapter 12 (regarding the serial algorithm).

Emergence of Specialization in the Parallel Algorithm. This algorithm
may reasonably be expected to produce specialized predictors, with a one-to-
one correspondence of active sources to predictors. Consider first the case of
two active sources and two predictors. Suppose that by time t predictor no.1
has (perhaps accidentally) collected significantly more source no.1 data than
source no.2 data. Assuming an efficient training algorithm has been used, by
this time predictor no.1 may be expected to be a reasonably good input/output
model of source no.1. Now, if at time t we have Zt = 1, predictor no.1 will
be likely to produce a good estimate Ŷt^1 and hence to accept Yt in its training
data pool; hence on retraining the predictor will be likely to produce a better
approximation of the input/output behavior of source no.1. On the other hand,
if Zt = 2, then (assuming the two sources have sufficiently distinct input/output
behavior) predictor no.1 will be likely to produce a poor estimate Ŷt^1 and hence
Yt will most likely be allocated to predictor no.2 which, upon retraining, will
be likely to produce a better approximation of the input/output behavior of
source no.2. It is reasonable to expect that, as t goes to infinity, this process
will reinforce itself, so that in the long run predictor no.1 will "predominantly"
accept data generated by source no.1, and predictor no.2 will "predominantly"
accept data generated by source no.2. If, in addition, an efficient training
algorithm is used, a reasonably accurate predictor may be obtained for each
source. In other words, each predictor will specialize in one source. It should
be noted that there is no a priori reason that predictor no.1 should specialize
in source no.1; the above scenario is equally plausible in case predictor no.1
initially happens to specialize in source no.2; in this case predictor no.2 will
tend to specialize in source no.1, by exactly the same argument as above.
Further, if more than two sources are active, by the same arguments as
previously, it is reasonable to expect that each of the two first-level predictors
will specialize in a group of sources; so we may expect that, at the first level
of data allocation, the source set will be partitioned into two subsets. Each
subset (by exactly the same argument as above) can be expected to be further
divided at lower levels of the data allocation scheme.
Clearly the above analysis is highly informal and the conclusions may fail to
materialize, for a number of reasons. Once again, consider a concrete example
involving two predictors and two sources, which have fairly similar input/output
behavior. Suppose that the initial observations are generated by source no. 1
and (perhaps accidentally) accepted by predictor no. 1. This predictor specializes
in source no. 1, but since the two sources have similar behavior, it essentially learns
the input/output behavior of source no. 2 at the same time. Hence, if source
no. 2 is activated at a later time, predictor no. 1 has a high likelihood of
accepting Yt's generated by source no. 2. In this manner, we may obtain a
predictor which is a satisfactory input/output model of both sources, but we
will never become aware of the fact that two sources have been active. In case we
are interested in classification applications, this may be a serious problem.
So it appears that the success of the data allocation scheme will depend on
the similarity of input/output behavior of the active sources. It must be em-
phasized that this similarity is relative, not absolute. In particular, it depends
strongly on the type of predictors used. A predictor with rich structure and a
large number of parameters may be capable of simultaneously capturing the in-
put/output behavior of two fairly distinct sources; conversely a predictor with
few parameters may furnish a poor model of even two fairly similar sources.
This may be expressed more formally: it can be expected that the data allo-
cation will succeed when the predictor capacity is not much higher than the
source complexity.

Emergence of Specialization in the Serial Algorithm. Consider next
the case of serial data allocation. Take again a time series Y1, Y2, ..., Yt, ...
generated by K sources and start with two randomly initialized predictors.
In the initial phase of data allocation predictor no.1 may collect (perhaps in
a random manner) a data set where one source is more heavily represented
than the rest. This will result in a slight specialization in this source and,
consequently, predictor no.1 will have a tendency to accept more data from the
preferred source. It follows that predictor no.2 will collect more data from the
remaining sources.
The error threshold is used to determine if and when a new predictor should
be added. After a while, predictor no.1 will be very well specialized in some
source. If the total number of sources is two, then it follows that predictor no.2
will also be specialized in source no.2, since it has mostly received data from it.
But it may be the case that the total number of sources is higher than two and
predictor no.2 receives a mixed data set and is unable to specialize; i.e. is still
characterized by a large prediction error. In this case, after a while an additional
predictor (predictor no.3) will be introduced and predictors no.2 and no.3 will
receive all the data rejected by predictor no.1. Sooner or later predictor no.2 will
also specialize in one source and predictor no.3 will receive incoming data from
the remaining sources. Proceeding in this manner, predictors will keep being
added until one well specialized predictor corresponds to every active source.
In short, serial data allocation can lead to successful source identification.

10.3.4 Some Bibliographic Remarks


Our approach is related to several different lines of research. First, the two al-
gorithms presented above can be considered as generalizations of the k-means
clustering algorithm (Macqueen, 1965; Duda and Hart, 1973). Indeed, the
k-means algorithm partitions incoming data into clusters depending on the

Euclidean distance of each datum from each cluster's centroid. This is quite
similar to our approach, except that we compute "distance" from each cluster
through the use of prediction error. In k-means the centroids are periodically
recomputed using the new cluster members; this corresponds to the periodic
retraining of predictors, which our algorithms employ. Also, there are variants
of the k-means algorithm, for instance the ISODATA algorithm (Ball and Hall,
1965) which allow for splitting or merging clusters, similarly to the mecha-
nism we provide for adding new predictors. k-means is mainly related to the
parallel data allocation algorithm; however it is possible to set up a k-means
algorithm which assigns incoming data to clusters by serial comparisons; this
would correspond to our serial algorithm.
k-means is usually employed for the clustering of a fixed data set, rather
than for online clustering of an incoming data stream. This suggests that our
algorithm could also be used for offline tasks, involving fixed data sets. Con-
versely, there are online versions of k-means. A notable example is Kohonen's
self organizing maps (SOM), which can be interpreted as an online k-means
algorithm.
If the hidden Markov model interpretation of our time series model is adopted
(as explained in Part I), then the source identification problem can be seen as a
joint state and parameter estimation task, with the state Zt taking values in a
discrete, finite set. This is essentially a nonlinear filtering problem. Because
the state space has no particular structure (for instance a notion of distance) it
is a rather hard problem. A possible method of solution (for a fixed data set)
appears in (Levin, 1993) and is essentially a version of the EM (Dempster et al.,
1977) algorithm. It is conceivable that some online HMM parameter estima-
tion algorithm may be modified into a form suitable for the source identification
problem; see for instance (Baldi and Chauvin, 1996).
Finally, there is an obvious connection with constructive and/or growing
algorithms which have appeared in the neural networks literature of the last
decade. Tree growing algorithms are especially relevant. We do not furnish
any bibliographic references at this point; the subject is treated in Chapter 13,
where abundant references are provided.

10.3.5 Implementation
In practical implementation of the parallel and serial algorithms, certain para-
meter values must be chosen carefully to optimize performance. We list some
of these parameters below and discuss some issues which are related to the
determination of their values.

1. N, the length of the data block. This is related to the switching rate of Zt. Let
Ts denote the minimum number of time steps between two successive source
switchings. While Ts is unknown, we operate on the assumption of slow
switching, which means that Ts will be large compared to N. Since the N
data points included in a block will all be assigned to the same predictor,
it is obviously desirable that they have been generated by the same source.
In practice this cannot be guaranteed. In general, a small value of N will
increase the likelihood that most blocks contain data from a single source.
On the other hand, it has been found that small N leads to an essentially
random assignment of data points to sources, especially in the initial stages
of segmentation, when the predictors have not specialized sufficiently. The
converse situation holds for large N. In practice, one needs to guess a value
for Ts and then take N somewhere between 1 and Ts. This choice is consis-
tent with the hypothesis of slow switching rate. The practical result is that
most blocks contain data from exactly one source, and a few blocks contain
data from two sources. It should be stressed that the exact value of Ts is
not required to be known; a rough guess suffices.
2. L, the retraining period. If this is too large, then retraining requires a long
time. If it is too small, then not enough data are available for the retraining
of the predictors, which may result in overfitting, especially in the early
stages of the algorithm. Of course, if N is relatively large, meaning that
each data block contains relatively many data, then L can also be small,
since L counts data blocks, rather than isolated data points.
3. J, the number of training iterations. This should be taken relatively small,
since the predictors must not be overspecialized in the early phases of the
algorithm, when relatively few data points are available. The choice of J is
closely connected to that of L; if L is relatively small (frequent retraining)
then J can be small, too; i.e. it may be preferable to retrain the predictors
often and by small increments.
4. Finally, there are the growing parameters: Nc and d in the case of the parallel
algorithm and d in the case of the serial algorithm. As already remarked,
we have no specific recommendations to make regarding the choice of these
parameters, but we have found that, within reasonable bounds, their exact
values are not crucial to the performance of the algorithms.
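For reference, the parameters discussed above can be gathered into a single record; the grouping into a class is ours, and the default values shown are those reported for Experiment Group A in Section 10.4.1.

```python
from dataclasses import dataclass

@dataclass
class AllocationParams:
    N: int = 10       # block length: kept well below the guessed switching period Ts
    L: int = 100      # retraining period, counted in data blocks
    J: int = 5        # training iterations per retraining; small, so predictors
                      # are updated often and by small increments
    Nc: int = 500     # minimum training-set size before a split is considered
    d: float = 0.1    # error threshold used by the growing step
```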

10.4 EXPERIMENTS
In this section we present three groups of data allocation experiments which
were used to evaluate the performance of the parallel and serial data allocation
schemes.

10.4.1 Experiment Group A


In the first experiment two sources are used, i.e. Zt takes values in {1, 2}. The
sources are described by the following general form

    Yt = fZt(Yt-1);

in other words the time series is generated by functions f1(.), f2(.). Specifically,
we have

    f1(x) = 4x · (1 - x)    (a logistic function);

Figure 10.5. Composite logistic and tent-map time series.
[Plot of Yt (0.0 to 1.0) against time steps 1-200.]

    f2(x) = 2x if x ∈ [0, 0.5);  2 · (1 - x) if x ∈ [0.5, 1]    (a tent-map function).

The two sources are activated consecutively, each for 200 time steps, resulting
in a period of 400 time steps. The data allocation task consists in discovering
that two sources are active and separating the data Y1, Y2, ... into two groups,
one group corresponding to each source. 200 time steps of the composite time
series are presented in Figure 10.5.
This particular segment of the time series includes a source switching. The
reader may be interested in guessing where this source switching takes place,
by looking at the Yt values (the times shown in the graph are not the real ones).
The answer is given in footnote 2, at the end of the chapter.
A number of experiments are performed using the time series described
above, observed at various levels of noise, i.e. at every step Yt is mixed with
additive white noise uniformly distributed in the interval [-A/2, A/2]. Six values
of A are used: 0.00 (noise free case), 0.02, 0.04, 0.10, 0.14, 0.20. The predictors
used are 1-4-1 sigmoid neural networks which are trained using a Levenberg-
Marquardt algorithm¹; the algorithm parameters are taken to be as follows:
block length N = 10, retrain period L = 100 and J = 5, Nc = 500 and d = 0.1.
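The Group A series is easy to reproduce; the sketch below generates the clean switching dynamics and adds the observation noise at the end. The initial value y0 and the random seed are our choices (the book does not state them).

```python
import numpy as np

def composite_series(T=400, A=0.0, period=200, y0=0.3, seed=0):
    """Generate the Group A time series: a logistic source and a tent-map
    source alternate every `period` steps; uniform observation noise in
    [-A/2, A/2] is added to each sample (the underlying state is clean)."""
    rng = np.random.default_rng(seed)
    logistic = lambda x: 4.0 * x * (1.0 - x)
    tent = lambda x: 2.0 * x if x < 0.5 else 2.0 * (1.0 - x)
    y = np.empty(T)
    y[0] = y0
    for t in range(1, T):
        src = (t // period) % 2               # z_t: which source is active
        y[t] = (logistic if src == 0 else tent)(y[t - 1])
    return y + rng.uniform(-A / 2, A / 2, size=T)
```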
Data allocation is performed using both parallel and serial data allocation
schemes. In every experiment performed, both schemes succeed in discovering

¹ Implemented by Magnus Norgaard, Technical University of Denmark, and distributed as
part of the Neural Network System Identification Toolbox for Matlab.

Figure 10.6. Classification accuracy c.
[Plot of average classification accuracy c against noise level A, 0.00 to 0.20, for the parallel and serial schemes.]

the existence of two sources and proceed in allocating data to the two corre-
sponding predictors. Two quantities are of interest: the time Tc at which the
sources are discovered and the classification accuracy c after time Tc. Tc is
computed as follows: a running average of prediction errors is calculated for
every time t; if for every data allocation the predictor which receives the in-
coming sample Yt has prediction error less than one half that of the remaining
predictors, and if this condition holds for 50 consecutive data allocations (i.e.
for 500 observations), then it is assumed that all sources have been discov-
ered and all predictors specialized, and Tc is set equal to the current t. Then
classification accuracy is computed for the 200 data blocks (2000 observations)
corresponding to times Tc + 1, Tc + 2, ..., Tc + 200; i.e. c is set equal to T/200,
where T is the number of correctly classified blocks.
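The discovery-time criterion just described can be made concrete as follows. The running-average error sequences are assumed to be computed elsewhere: win_err[t] is the running average of the predictor receiving allocation t, and other_err[t] is the smallest running average among the remaining predictors (both names are ours).

```python
def discovery_time(win_err, other_err, streak=50):
    """Return the first allocation index at which the winning predictor's
    running-average error has been below one half of every other
    predictor's for `streak` consecutive allocations; None if never."""
    run = 0
    for t, (w, o) in enumerate(zip(win_err, other_err)):
        run = run + 1 if w < 0.5 * o else 0   # reset the streak on any failure
        if run >= streak:
            return t
    return None
```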
For both parallel and serial data allocation, six experiments are performed
at every noise level and the resulting c and Tc values are averaged. The average
c is plotted as a function of noise level A in Figure 10.6 and the average Tc is
plotted as a function of noise level A in Figure 10.7.
It can be seen that both schemes perform very well at low to medium noise
levels. At high noise levels the parallel data allocation scheme still shows very
good performance; the serial scheme achieves a relatively low level of correct
classification and, in particular, fails to satisfy the Tc computation criterion
(hence the respective part of the graph is missing in Figure 10.7). On the other
hand, at low to middle noise levels, the serial scheme achieves classification
faster: the average values of Tc are lower than those of the parallel scheme.

Figure 10.7. Classification time Tc.
[Plot of average classification time Tc against noise level A, 0.00 to 0.20.]

Figure 10.8. Evolution of data allocation; serial algorithm, A = 0.00.
[Plot of data allocation against time steps 100-900.]

The evolution of the data allocation process in three representative exper-


iments is presented in Figures 10.8, 10.9 and 10.10. Figure 10.8 corresponds
to serial data allocation, at noise level A= 0.00, Figure 10.9 corresponds to
parallel data allocation, at noise level A= 0.00 and Figure 10.10 corresponds
to parallel data allocation, at noise level A= 0.14.

Figure 10.9. Evolution of data allocation; parallel algorithm, A = 0.00.
[Plot of source and predictor indices against time steps 0-800.]

Figure 10.10. Evolution of data allocation; parallel algorithm, A = 0.14.
[Plot of source and predictor indices against time steps 0-800.]

10.4.2 Experiment Group B


In this group of experiments, three sources are used, i.e. Zt takes values in
{1, 2, 3}. The sources are described by the following general form

    Yt = fZt(Yt-1);

in other words the time series is generated by functions f1(.), f2(.), f3(.).
The first two functions are as described in the previous section, while f3(.) =
f1(f1(.)) (i.e. a double logistic).
The three sources are activated consecutively, each for 200 time steps, re-
sulting in a period of 600 time steps. The data allocation task consists in
discovering that three sources are active and separating the data Y1, Y2, ... into
three groups, one group corresponding to each source. 200 time steps of the
composite time series are presented in Figure 10.11.

Figure 10.11. Logistic, tent-map, double logistic time series.
[Plot of Yt against time steps 1-200.]

This particular segment of the time series includes a source switching. The
reader may be interested in guessing where this source switching takes place,
by looking at the Yt values (the times shown in the graph are not the real ones).
The answer is given in footnote 2, at the end of the chapter.
Once again, experiments are performed using the above time series, observed
at various levels of noise, i.e. at every step Yt is mixed with additive white noise
uniformly distributed in the interval [-A/2, A/2]. Six values of A are used: 0.00
(noise free case), 0.02, 0.04, 0.10, 0.14, 0.20. Block length N, predictor types,
training algorithm, and algorithm parameters are taken the same as in the
previous section.
Again both the parallel and serial data allocation schemes are used. In
every experiment performed (with the exception of serial data allocation at
noise levels A ≥ 0.14), both schemes succeed in discovering the existence of
three sources and proceed in allocating data to the corresponding predictors.
It should be noticed that in the case of parallel data allocation, there is an ini-
tial phase where data are allocated to two groups; after specialization in these
two groups (composite sources) takes place, two new predictors are introduced
for each group and the new incoming data are further allocated to four sub-
groups, two for each original group. However, it is easily established (using
a predictive error comparison criterion) that in one group the two subgroups
really correspond to one source, and so these subgroups are merged, resulting
in three final groups and predictors. The quantities c and Tc are computed as
in the previous section.

Figure 10.12. Classification accuracy c.

1.20 r .............. ·...... ··.... · .......... · .......... · ..........·.......... ··· .. ··· .. ··· ........................................................,

0.80
,
,
-,
,,

',----- ---1
0.40

0.00 +-_ _-+--_ _--+-_ _--+_ _----<


0.00 0.05 0.10 0.15 0.20
Noise Level A

For both parallel and serial data allocation, six experiments are performed
at every noise level and the resulting c and Tc values are averaged. The average
c is plotted as a function of noise level A in Figure 10.12 and the average Tc is
plotted as a function of noise level A in Figure 10.13.
Again it can be seen that both schemes perform very well at low to medium
noise levels (where the serial scheme achieves faster classification) and the par-
allel data allocation scheme also has very good performance at high noise levels.

10.4.3 Experiment Group C


In the final experiment group, the time series used is obtained from three
sources of the Mackey-Glass type. The original data evolves in continuous time
and satisfies the differential equation:

    dy/dt = -0.1 y(t) + 0.2 y(t - td) / (1 + y(t - td)^10).
For each source a different value of the delay parameter td is used, namely
td = 10, 17 and 30. The time series is sampled in discrete time, at a sampling
rate T = 6, with the three sources being activated alternately, for 100 time
steps each. The final result is a time series with a switching period of 300. 200
time steps of the composite time series are presented in Figure 10.14.
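The sources can be reproduced approximately by Euler integration of the delay equation above, sampling every T = 6 time units. The integration step dt and the constant initial history y(t) = 1.2 for t ≤ 0 are our assumptions; the book does not specify the integration details.

```python
from collections import deque

def mackey_glass(n, td=17, T=6.0, dt=0.1, y0=1.2):
    """Euler integration of dy/dt = -0.1*y(t) + 0.2*y(t-td)/(1 + y(t-td)^10),
    sampled every T time units; history before t = 0 is held constant at y0."""
    delay = int(round(td / dt))          # delay expressed in integration steps
    steps = int(round(T / dt))           # integration steps per output sample
    # sliding window of the last delay+1 states: buf[0] = y(t-td), buf[-1] = y(t)
    buf = deque([y0] * (delay + 1), maxlen=delay + 1)
    out = []
    for i in range(n * steps):
        y, ylag = buf[-1], buf[0]
        buf.append(y + dt * (-0.1 * y + 0.2 * ylag / (1.0 + ylag ** 10)))
        if (i + 1) % steps == 0:
            out.append(buf[-1])
    return out
```

Calling the generator with td = 10, 17 and 30 in turn, and splicing 100 samples from each source, reproduces a switching series of the kind shown in Figure 10.14.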
This particular segment of the time series includes one source switching. The
reader may be interested in guessing where the source switching takes place,
by looking at the Yt values (the times shown in the graph are not the real ones).
The answer is given in footnote 2.

Figure 10.13. Classification time Tc.
[Plot of average classification time Tc against noise level A, 0.00 to 0.20.]

Figure 10.14. Mackey-Glass time series.
[Plot of Yt against time steps 1-200.]

Once again, experiments are performed using the above time series, observed
at various levels of noise, i.e. at every step Yt is mixed with additive white noise
uniformly distributed in the interval [-A/2, A/2]. Six values of A are used: 0.00
(noise free case), 0.02, 0.04, 0.10, 0.14, 0.20. Block length N is equal to 10,

the predictors are 5-5-1 sigmoid neural networks. The training algorithm is the
same as in the previous sections; the algorithm parameters are taken to be as
follows: L = 100, J = 10, Nc = 100 and d = 0.05.

Figure 10.15. Classification accuracy c.
[Plot of average classification accuracy c against noise level A, 0.00 to 0.20.]
Once again both the parallel and serial data allocation schemes are used
and in every experiment both schemes succeed in discovering the existence of
three sources and allocating data to the corresponding predictors. Similarly to
experiment group B, in the case of parallel data allocation there is an initial
phase (first level data allocation) where data are allocated to two groups and a
second level phase, where two new predictors are introduced for each group. By
using the predictive error comparison criterion, it is established that the two
subgroups of one group really correspond to one source, and so these subgroups
are merged, resulting in three final groups and predictors. The quantities c and
Tc are computed as in the previous sections.
For both parallel and serial data allocation, six experiments are performed
at every noise level and the resulting c and Tc values are averaged. The average
c is plotted as a function of noise level A in Figure 10.15 and the average Tc
is plotted as a function of noise level A in Figure 10.16. Both data alloca-
tion schemes perform very well at all noise levels. The serial scheme achieves
considerably faster classification.

10.4.4 Discussion of the Experiments


The results presented above indicate that both serial and parallel data alloca-
tion perform quite well. While serial data allocation reaches a sufficient level
of specialization faster, parallel data allocation is more robust to noise. It

Figure 10.16. Classification time Tc.
[Plot of average classification time Tc against noise level A, 0.00 to 0.20.]

should be added that the computation requirements are quite modest for both
schemes; for instance in experiment group B, allocating 1000 observations takes
on the average 3 minutes of processing, for both the serial and parallel data
allocation schemes. This processing time corresponds to implementation using
MATLAB 5, running on a 200 MHz Pentium II computer. Optimized C code
would undoubtedly result in much shorter execution times. Hence the above
schemes are suitable for online implementation (keep in mind that MATLAB
is an interpreted language)².

10.5 A REMARK ABOUT LOCAL MODELS


Our presentation so far has been based on the assumption that once data allo-
cation has been performed correctly, then predictor training is an easy matter.
This assumption has been corroborated by the experiments presented.
The assumption can be expected to hold in many cases, in light of the
universal approximation properties of neural networks. For example it has
been shown that a wide class of functions can be approximated by a sigmoid
(Funahashi, 1989) or RBF (Park and Sandberg, 1993) neural network with
sufficient neurons and weights. However, in any particular problem, neural
networks of a particular size will be selected to model the observed time series

² Let us give here the answer regarding the source switching times in Figures 10.5, 10.11 and
10.14. In Figure 10.5 the switching time is t = 79; in Figure 10.11 the switching time is t = 52;
finally, in Figure 10.14 the switching time is t = 51.

without a priori knowledge of the source functions Fk(.). Hence it may be
expected that in some cases the chosen neural networks will not be able to
approximate the Fk(.)'s. Even in these cases (which can be expected to be rare)
it does not necessarily follow that our source identification algorithms will fail
to produce well specialized predictors. What we expect to happen is that if one
network fk(.) is not "big" enough to capture the behavior of a source Fk(.),
then our algorithms will develop two or more networks specialized in the same
source. This has been partially corroborated by numerical experiments, which
are not reproduced here, for lack of space. In effect, the source identification
algorithms develop local models of the source Fk(.), i.e. models each of which
describes the behavior of Fk(.) over a particular subset of its domain. This is
a connection of our method to local or multiple model methods, which will be
discussed in greater detail in Chapter 13.

10.6 CONCLUSIONS
We have presented two online unsupervised PREMONN algorithms for source
identification. These algorithms, used in conjunction with a PREMONN clas-
sification or prediction algorithm can solve time series problems involving un-
known sources.
The two PREMONN source identification algorithms are quite similar. Both
of them employ a bank of (neural network) predictors and both consist of a
data allocation component and a predictor training component. Any neural
network training algorithm can be used to implement the predictor training
component. The crucial component is the one performing data allocation. The
first algorithm presented here uses "parallel" data allocation, i.e. incoming
data are allocated by comparing all prediction errors concurrently. The second
algorithm uses "serial" data allocation, i.e. prediction errors are examined one
at a time and an incoming datum is allocated to the first predictor with error
below a threshold.
Both algorithms produce "correct" data allocation and "well trained" predic-
tors, which reproduce accurately the input/output behavior of the time series
generating sources. Both algorithms are fast (the serial algorithm is faster than
the parallel one) and have light computation requirements. The parallel algo-
rithm is very robust to observation noise. All of the above facts have been
established by numerical experimentation.
Moreover, the source identification algorithms follow the general PREMONN
philosophy, in that they operate on the basis of predictive error and are mod-
ular: training and prediction are performed by independent modules, each of
which can be removed from the system, without affecting the properties of
the remaining modules. This results in short training and development times.
Finally, the algorithms have good convergence properties; this fact will be es-
tablished by mathematical analysis in the following two chapters.
11 CONVERGENCE OF PARALLEL
DATA ALLOCATION

In this chapter we examine the convergence properties of the parallel data


allocation scheme presented in Chapter 10. The convergence properties of the
serial data allocation scheme are examined in the next chapter.

11.1 THE CASE OF TWO SOURCES


Our analysis starts by considering the case of two sources and two predictors.
We introduce and discuss some crucial assumptions regarding the data alloca-
tion process; then we present two theorems regarding the convergence of the
parallel data allocation scheme. The case of K sources and variable number of
predictors is treated in Section 11.2.

11.1.1 Some Important Processes


We have already introduced the block processes Y_t, Z_t, Y_t^k, E_t^k (k = 1, 2).
In the case of two active sources the source process Z_t takes values in {1, 2}.
Recall that time series generation takes place according to the equation

Y_t = F_{Z_t}(Y_{t-1}, Y_{t-2}, ...).


For the convergence analysis which will follow, we need some model of the
source switching process; we use the simplest possible such model. Namely, we
assume that every source switching takes place independently, according to a
fixed probability distribution.

V. Petridis et al., Predictive Modular Neural Networks. © Kluwer Academic Publishers 1998

In other words, the probability of source i being
active at time t is independent of time (the source process is stationary) and is
denoted (for i = 1, 2) by

π_i ≜ Pr(Z_t = i).

To avoid trivialities, we make the further assumption that π_1 > 0, π_2 > 0, i.e.
that both sources are really active.
We will now introduce some variables which are related to data allocation.
Define the data allocation processes N_t^{ij} (i, j = 1, 2) by

N_t^{11}: no. of source 1 data assigned to predictor 1 at time t;
N_t^{12}: no. of source 1 data assigned to predictor 2 at time t;
N_t^{21}: no. of source 2 data assigned to predictor 1 at time t;
N_t^{22}: no. of source 2 data assigned to predictor 2 at time t.

(Note that N_0^{ij} = 0, for i, j = 1, 2.) Next, the specialization process X_t is
defined by

X_t ≜ (N_t^{11} - N_t^{21}) + (N_t^{22} - N_t^{12}).
(So we have X_0 = 0.) The significance of the X_t process will be explained
presently; let us remark at this point that it provides the foundation of our
convergence analysis. Before expounding on this, let us also introduce the
following variables (for i, j = 1, 2)

M_t^{ij} ≜ N_t^{ij} - N_{t-1}^{ij}

and

V_t ≜ X_t - X_{t-1}.

It should be obvious that the above definitions immediately imply the following
relationships (for i, j = 1, 2)

N_t^{ij} = Σ_{s=1}^t M_s^{ij}

and

X_t = Σ_{s=1}^t V_s.

While N_t^{ij} and X_t are the primary variables, it will be more convenient to work
with M_t^{ij} and V_t. The following variables will also be useful for the convergence
analysis:

M_t^d ≜ M_t^{12} + M_t^{21},
N_t^d ≜ N_t^{12} + N_t^{21}.

Again, it is rather obvious that

N_t^d = Σ_{s=1}^t M_s^d.
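The definitions above are pure bookkeeping and can be made concrete with a small illustrative sketch; the class and method names are ours, not the book's.

```python
# Illustrative bookkeeping for the processes defined above.
# N[(i, j)] holds N_t^{ij}: source-i data assigned to predictor j so far.

class AllocationCounts:
    def __init__(self):
        self.N = {(i, j): 0 for i in (1, 2) for j in (1, 2)}
        self.X_history = [0]                      # X_0 = 0

    def record(self, source, predictor):
        """Record that a datum from `source` went to `predictor`."""
        self.N[(source, predictor)] += 1
        self.X_history.append(self.X())

    def X(self):
        # X_t = (N^{11} - N^{21}) + (N^{22} - N^{12})
        N = self.N
        return (N[(1, 1)] - N[(2, 1)]) + (N[(2, 2)] - N[(1, 2)])

    def V(self, t):
        # V_t = X_t - X_{t-1} (equals +1 or -1, one datum per step)
        return self.X_history[t] - self.X_history[t - 1]

    def N_d(self):
        # N_t^d = N_t^{12} + N_t^{21}: total cross allocations so far
        return self.N[(1, 2)] + self.N[(2, 1)]

# the identities X_t = sum_s V_s and N_t^d = sum_s M_s^d hold by construction
c = AllocationCounts()
for src, pred in [(1, 1), (2, 2), (1, 2), (2, 1), (1, 1)]:
    c.record(src, pred)
assert c.X_history == [0, 1, 2, 1, 0, 1]
assert c.N_d() == 2
assert sum(c.V(t) for t in range(1, 6)) == c.X()
```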

Let us now present a few additional properties of the above processes. It is
clear that for i, j = 1, 2 we have

M_t^{ij} = 1 if Z_t = i and Y_t is allocated to predictor no. j; 0 else.

From this it follows immediately that for every t exactly one of the four variables
M_t^{11}, M_t^{12}, M_t^{21}, M_t^{22} is equal to one and the remaining variables equal 0.
In other words, these variables are dependent (and in fact deterministically
related) to each other. Using the definitions of X_t and V_t we can now see that

V_t = 1 ⟺ M_t^{22} + M_t^{11} = 1;  (11.1)
V_t = -1 ⟺ M_t^{12} + M_t^{21} = 1.

This can be expressed equivalently in terms of M_t^d:

V_t = 1 ⟺ M_t^d = 0;  (11.2)
V_t = -1 ⟺ M_t^d = 1.

In other words, Σ_{s=1}^t M_s^d counts the number of times up to t when V_s = -1.
Finally, we will need the following processes:

N_t^1: no. of times source no.1 has been activated up to time t;
N_t^2: no. of times source no.2 has been activated up to time t.

These processes can be defined in terms of the N_t^{ij} processes:

N_t^1 ≜ N_t^{11} + N_t^{12},
N_t^2 ≜ N_t^{21} + N_t^{22}.


11.1.2 Some Important Assumptions
Let us now discuss the interpretation and significance of X_t. This is best un-
derstood by considering the following possibilities.

1. If X_t is positive and large, then either predictor 1 specializes in source 1
(N_t^{11} ≫ N_t^{21}) or predictor 2 specializes in source 2 (N_t^{22} ≫ N_t^{12}), or both.

2. If X_t is negative and large, then either predictor 1 specializes in source 2
(N_t^{21} ≫ N_t^{11}) or predictor 2 specializes in source 1 (N_t^{12} ≫ N_t^{22}), or both.
It follows that if the absolute value of X_t is large, then at least one of
the two predictors specializes in one of the two sources. When a predictor
specializes in a source, it is expected that it will tend to accept data from
this source and reject all other data. For instance, if (at time t) X_t is large
and positive, one of the two predictors has, to some extent, specialized in one
source. To be specific, suppose that predictor no.1 has specialized in source
no.1. Then predictor no.1 will be likely to accept further samples from source
no.1 and reject samples from source no.2; this means that the term N_t^{11} - N_t^{21}
will be likely to increase. At the same time, source no.2 data (rejected by
predictor no.1) will be likely to be accepted by predictor no.2, which will result
in an increase of the term N_t^{22} - N_t^{12}. The final result is an increase in X_t,
which can be seen to characterize the specialization of the entire two-predictor
ensemble. It may be expected that, under certain conditions, this process will
reinforce itself, resulting, in the limit of infinitely many samples, in "absolute"
specialization of both predictors. To test this conjecture mathematically, we
need a more precise model of the evolution of X t . Two components are required
for such a model. First, we assume that
A0 For i = 1, 2 the following is true:

Pr(Pred. no.i accepts Y_t | Z_t, Z_{t-1}, ..., X_{t-1}, X_{t-2}, ..., Y_{t-1}, Y_{t-2}, ...) =
Pr(Pred. no.i accepts Y_t | Z_t, X_{t-1}).


In other words, it is assumed that assignment of Yt only depends on the cur-
rently active source and the current level of specialization. This is reasonable,
in view of the previous discussion regarding the significance of the specialization
process X t .
Second, some assumption must be made regarding the data allocation prob-
abilities mentioned in A0. Let us first define these probabilities more explicitly.
For n = ..., -1, 0, 1, ... define

a_n ≜ Pr(Pred. no.1 accepts Y_t | Z_t = 1, X_{t-1} = n),

b_n ≜ Pr(Pred. no.2 accepts Y_t | Z_t = 2, X_{t-1} = n).

In other words,

1. a_n is the probability that predictor 1 accepts a sample from source 1, given
that so far the specialization level is n;

2. b_n is the probability that predictor 2 accepts a sample from source 2, given
that so far the specialization level is n.

From assumption A0 it follows that X_t is a Markovian process on Z. The
transition probabilities of X_t are defined by

p_{m,n} = Pr(X_t = n | X_{t-1} = m)

and can be computed explicitly as follows (for n = ..., -1, 0, 1, ...):

p_{n,n+1} = Pr(X_t = n+1 | X_{t-1} = n) =
Pr(Z_t = 1) · Pr(Pred. no.1 accepts Y_t | Z_t = 1, X_{t-1} = n) +
Pr(Z_t = 2) · Pr(Pred. no.2 accepts Y_t | Z_t = 2, X_{t-1} = n) ⟹
p_{n,n+1} = π_1 · a_n + π_2 · b_n,

and

p_{n,n-1} = Pr(X_t = n-1 | X_{t-1} = n) =
Pr(Z_t = 1) · Pr(Pred. no.2 accepts Y_t | Z_t = 1, X_{t-1} = n) +
Pr(Z_t = 2) · Pr(Pred. no.1 accepts Y_t | Z_t = 2, X_{t-1} = n) ⟹
p_{n,n-1} = π_1 · (1 - a_n) + π_2 · (1 - b_n).

Transition probabilities for all other m, such that |n - m| ≠ 1, must be equal to zero. In
short we have

p_{n,n+1} = π_1 · a_n + π_2 · b_n,
p_{n,n-1} = π_1 · (1 - a_n) + π_2 · (1 - b_n),
p_{n,m} = 0 if |n - m| ≠ 1.
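As a quick numerical sanity check, the two transition probabilities at any state n always sum to one, since π_1 + π_2 = 1 and each incoming datum moves the walk by exactly one step. A minimal sketch (function name and example numbers are ours):

```python
# Tabulating the transition probabilities of X_t at one state n, given
# the activation probabilities pi1, pi2 and the acceptance probabilities
# a_n, b_n at that state (all plain numbers; names are ours).

def transition_probs(pi1, pi2, a_n, b_n):
    up = pi1 * a_n + pi2 * b_n                      # p_{n,n+1}: correct allocation
    down = pi1 * (1.0 - a_n) + pi2 * (1.0 - b_n)    # p_{n,n-1}: wrong allocation
    return up, down

up, down = transition_probs(0.6, 0.4, 0.9, 0.8)
# since pi1 + pi2 = 1, each row of the transition matrix sums to one
assert abs(up + down - 1.0) < 1e-12
```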
Now, regarding the probabilities an, bn , the following assumptions are made.

A1 For all n, a_n > 0, and lim_{n→+∞} a_n = 1, lim_{n→-∞} a_n = 0.

A2 For all n, b_n > 0, and lim_{n→+∞} b_n = 1, lim_{n→-∞} b_n = 0.
In other words the following are assumed.

1. As the specialization level increases to plus infinity (which means that either
predictor 1 has received a lot more data from source 1 than from source 2,
or that predictor 2 has received a lot more data from source 2 than from
source 1, or both):

(a) predictor 1 is very likely to accept an additional sample from source 1,


while it is very unlikely to accept a sample from source 2;
(b) predictor 2 is very likely to accept an additional sample from source 2,
while it is very unlikely to accept a sample from source 1.

2. Similarly, as the specialization level decreases to minus infinity (which means


that either predictor 1 has received a lot more data from source 2 than from
source 1, or that predictor 2 has received a lot more data from source 1 than
from source 2):

(a) predictor 1 is very likely to accept an additional sample from source
2, while it is very unlikely to accept a sample from source 1;
(b) predictor 2 is very likely to accept an additional sample from source 1,
while it is very unlikely to accept a sample from source 2.

It must be stressed that whether the above assumptions A1 and A2 are sat-
isfied depends on three factors: (a) the input/output behavior of the sources;
(b) the type of predictors used; (c) the training algorithm used. In short, as-
sumptions A1, A2 characterize the sources/ predictors/ training combination.
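As a concrete illustration, any logistic (sigmoidal) family of acceptance probabilities satisfies A1 and A2; the choice of the logistic form and the slope parameters below is ours, purely for demonstration.

```python
import math

# A purely illustrative acceptance-probability family: logistic curves
# in the specialization level n, with slopes alpha, beta of our choosing.
def a(n, alpha=0.4):
    return 1.0 / (1.0 + math.exp(-alpha * n))

def b(n, beta=0.7):
    return 1.0 / (1.0 + math.exp(-beta * n))

# numerical check of A1, A2: positivity and the two limits
assert all(a(n) > 0 and b(n) > 0 for n in range(-50, 51))
assert a(50) > 0.999 and b(50) > 0.999      # -> 1 as n -> +infinity
assert a(-50) < 0.001 and b(-50) < 0.001    # -> 0 as n -> -infinity
```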
Given the above considerations and taking into account that X_t = Σ_{s=1}^t V_s,
and also eqs. (11.1), (11.2), we see that the specialization process is a species of
inhomogeneous random walk. In other words, we can imagine a particle moving
along the line of integers, as illustrated in Figure 11.1.

Figure 11.1. The specialization process is an inhomogeneous random walk on the integers.
At every time t, the particle moves either to the right (if V_t = 1) or to the left
(if V_t = -1). The position of the particle after t steps is given by X_t = Σ_{s=1}^t V_s.
Strictly speaking, this is not a classical random walk, because the probability
of moving to the left or to the right depends on the current position of the
particle, i.e. on the current value of X_t. In fact, as the particle moves more
to the right, further moves to the right are reinforced and moves to the left
are "discouraged"; a symmetric situation holds if the particle has moved far to
the left. Nevertheless, it cannot be ruled out that, after the particle has moved
far to the right, it will actually revert its course (by a sequence of improbable
moves) and go far to the left. In terms of specialization, while the following
possibility is unlikely, it cannot be excluded: after a predictor has received a
large number of source no.1 data, and has specialized in source no.1, it will
actually start rejecting source no.1 data and collecting source no.2 data.
From the point of view of the data allocation process, a reversal of previous
specialization is not particularly distressing. The really undesirable outcome is
that the particle (i.e. the predictor) will keep oscillating between positions on
the right and positions on the left. The desirable outcome is that the particle
will eventually (perhaps after an initial transient phase) spend all the time
beyond a certain value of specialization, which must be either far to the right
(specialization in source no.1) or far to the left (specialization in source no.2).
We want to prove that this is the outcome that will prevail, rather than the
oscillating (nonspecialized) behavior. This is the conclusion of Theorem 11.1,
which will be presented in the next section; this can then be used to prove a
stronger convergence result, which is stated as Theorem 11.2.
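The claimed dichotomy, eventual escape rather than oscillation, can be observed in a toy simulation of the walk. The acceptance model below (a single logistic curve governing the probability of a rightward step) is our illustrative choice and not part of the theory.

```python
import math, random

def simulate_walk(steps, alpha=0.3, seed=0):
    """Simulate the specialization process X_t as an inhomogeneous
    random walk: a numerically safe logistic curve (our choice) gives
    the probability of a step to the right at the current position."""
    rng = random.Random(seed)

    def p_up(n):
        # stable logistic: p_{n,n+1} -> 1 as n -> +inf, -> 0 as n -> -inf
        if n >= 0:
            return 1.0 / (1.0 + math.exp(-alpha * n))
        e = math.exp(alpha * n)
        return e / (1.0 + e)

    x = 0
    for _ in range(steps):
        x += 1 if rng.random() < p_up(x) else -1
    return x

# ten independent runs: each escapes the neighborhood of the origin,
# in one direction or the other, instead of oscillating around it
finals = [simulate_walk(5000, seed=s) for s in range(10)]
assert all(abs(x) > 50 for x in finals)
```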

11.1.3 Convergence Results


Now we are ready to state our results. Two convergence theorems are presented
here. The corresponding proofs are quite lengthy and hence are presented at
the end of this chapter, in Appendices 11.A, 11.B and 11.C.
The first theorem regards the behavior of X_t. Recall that "i.o." means
"infinitely often".

Theorem 11.1 If conditions A0, A1, A2 hold, then

∀m ∈ Z  Pr(X_t = m i.o.) = 0,  (11.3)

Pr(lim_{t→∞} |X_t| = +∞) = 1,  (11.4)

Pr(lim_{t→∞} X_t = +∞) + Pr(lim_{t→∞} X_t = -∞) = 1.  (11.5)

In this theorem, the most important conclusion is (11.5): if conditions A0,
A1 and A2 hold, then "in the long run" two possibilities exist.

X_t → +∞: Either predictor no.1 will accumulate a lot more source no.1 sam-
ples than source no.2 samples, or predictor no.2 will accumulate a lot more
source no.2 samples than source no.1 samples, or both.

X_t → -∞: Either predictor no.1 will accumulate a lot more source no.2 sam-
ples than source no.1 samples, or predictor no.2 will accumulate a lot more
source no.1 samples than source no.2 samples, or both.

The total probability that one of these two events will take place is one,
i.e. predictor no. 1 will certainly specialize in one of the two sources. Reverting
to the random walk interpretation, the particle will either wander off to plus
infinity, or will wander off to minus infinity; in either case, after a finite time
it will never fall below any given level of specialization.
Notice that Theorem 11.1 does not quite say that both predictors will spe-
cialize. But in fact, they will specialize, each in a different source and the
specialization is stronger than implied by Theorem 11.1. This is stated in the
next theorem.

Theorem 11.2 If conditions A0, A1, A2 hold, then

1. If Pr(lim_{t→∞} X_t = +∞) > 0 then

Pr( lim_{t→∞} N_t^{21}/N_t^{11} = 0 | lim_{t→∞} X_t = +∞ ) = 1,  (11.6)

Pr( lim_{t→∞} N_t^{12}/N_t^{22} = 0 | lim_{t→∞} X_t = +∞ ) = 1.  (11.7)

2. If Pr(lim_{t→∞} X_t = -∞) > 0 then

Pr( lim_{t→∞} N_t^{11}/N_t^{21} = 0 | lim_{t→∞} X_t = -∞ ) = 1,  (11.8)

Pr( lim_{t→∞} N_t^{22}/N_t^{12} = 0 | lim_{t→∞} X_t = -∞ ) = 1.  (11.9)

Theorem 11.2 states that, with probability one, both predictors will special-
ize, one in each source and in a "strong" sense. For instance, if X_t → +∞,
then the proportion N_t^{21}/N_t^{11} (no. of source 2 samples divided by no. of source
1 samples assigned to predictor no.1) goes to zero; this means that "most" of
the samples on which predictor 1 was trained come from source 1 and, also,
that "most" of the time a sample of source 1 is assigned (classified) to the pre-
dictor which is specialized in this source. Hence we can identify source 1 with
predictor no.1. Furthermore the proportion N_t^{12}/N_t^{22} (no. of source 1 samples
divided by no. of source 2 samples assigned to predictor no.2) also goes to zero;
this means that "most" of the samples on which predictor 2 was trained
come from source 2 and, also, that "most" of the time a sample of source 2 is
assigned (classified) to the predictor which is specialized in this source. Hence
we can identify source 2 with predictor no.2. A completely symmetric situation
holds when X_t → -∞, with predictor no.1 specializing in source no.2 and pre-
dictor no.2 specializing in source no.1. Since, by Theorem 11.1, X_t goes either
to +∞ or to -∞, it follows that specialization of both predictors (one in each
source) is guaranteed.
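The "strong" (ratio) sense of specialization can also be illustrated by simulation: tracking the counts N_t^{ij} along a toy allocation run, the cross ratios of Theorem 11.2 become small as t grows. The allocation model below is again our illustrative choice, not the book's.

```python
import math, random

def simulate_counts(steps, pi1=0.5, alpha=0.3, seed=1):
    """Toy parallel allocation for two sources/predictors (our model):
    a source-i datum goes to predictor i with probability given by a
    logistic curve in X_t, otherwise to the other predictor."""
    rng = random.Random(seed)

    def accept(n):                      # numerically safe logistic
        if n >= 0:
            return 1.0 / (1.0 + math.exp(-alpha * n))
        e = math.exp(alpha * n)
        return e / (1.0 + e)

    N = {(i, j): 0 for i in (1, 2) for j in (1, 2)}
    x = 0
    for _ in range(steps):
        src = 1 if rng.random() < pi1 else 2
        correct = rng.random() < accept(x)
        N[(src, src if correct else 3 - src)] += 1
        x += 1 if correct else -1       # V_t = +1 iff M^{11} + M^{22} = 1
    return N, x

N, x = simulate_counts(20000)
if x > 0:    # predictor i ended up specialized in source i
    assert N[(2, 1)] / N[(1, 1)] < 0.2 and N[(1, 2)] / N[(2, 2)] < 0.2
else:        # mirror specialization
    assert N[(1, 1)] / N[(2, 1)] < 0.2 and N[(2, 2)] / N[(1, 2)] < 0.2
```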
It must be stressed that for the conclusions of the above theorems to ma-
terialize, it is necessary that conditions A0, A1, A2 hold. Since the validity
of these conditions will depend not only on the behavior of the sources, but
also on the user's choice of predictors and training algorithm, it follows that
considerable skill is required to ensure actual convergence of the data allocation
scheme. Hence the above theorems are mostly of theoretical value, i.e. they
furnish conditions sufficient to ensure convergence. The actual enforcement of
these conditions is left to the user.

11.2 THE CASE OF MANY SOURCES


Let us now consider the question of convergence of the data allocation scheme
when more than two sources are active. This case is handled by recursive
application of the data allocation scheme, as explained in Chapter 10; it follows
that the number of predictors is variable.
To be specific, suppose that there are K active sources, i.e. the source
set is S = {1, 2, ..., K}. The parallel data allocation scheme starts with two
predictors. Suppose that there is a partition of S, say S_1 = {k_1, k_2, ..., k_K̄} and
S_2 = {k_{K̄+1}, k_{K̄+2}, ..., k_K}, where k_1, k_2, ..., k_K is a permutation of 1, 2, ..., K
and K̄ is between 1 and K. The k-th source has an input/output relationship
which can be described by

Y_t = F_k(Y_{t-1}, Y_{t-2}, ..., Y_{t-M}).

Now, we can consider two composite sources; the first one has the form

Y_t = Σ_{i=1}^{K̄} 1(Z_t = k_i) · F_{k_i}(Y_{t-1}, Y_{t-2}, ..., Y_{t-M})  (11.10)

and the second has the form

Y_t = Σ_{i=K̄+1}^{K} 1(Z_t = k_i) · F_{k_i}(Y_{t-1}, Y_{t-2}, ..., Y_{t-M}).  (11.11)

Using the form of eqs. (11.10), (11.11) it follows that the generation of the time
series can be described by an equation of the form

Y_t = F̄_{Z̄_t}(Y_{t-1}, Y_{t-2}, ..., Y_{t-M}).  (11.12)

In other words, we can consider Y_t to be produced by a new ensemble of two suc-
cessively activated sources, where source activation is denoted by the variable
Z̄_t, taking values in {1, 2}, and each of the two new sources is actually a com-
posite of simpler sources. Now it follows from eq. (11.12) that the two-sources
analysis presented in the previous section also applies to the many sources case,
as long as each of the sets S_1, S_2 is considered as a composite source. In partic-
ular, if predictor type, training algorithm and allocation threshold are selected
so that the partition/ predictors/ training algorithm/ threshold combination
satisfies assumptions A0, A1, A2, then the parallel data allocation scheme
will be convergent in the sense of Theorem 11.2. Hence the incoming data will
be separated into two sets; one set will contain predominantly data generated
by the composite source no.1 and the other set will contain predominantly data
generated by the composite source no.2. To be more precise, if the variables
N_t^{ij} (i, j = 1, 2) have the meaning explained in the previous section, but with
respect to the composite sources no.1 and no.2, then with probability one we will
have either

lim_{t→∞} N_t^{21}/N_t^{11} = 0,  lim_{t→∞} N_t^{12}/N_t^{22} = 0,

or

lim_{t→∞} N_t^{11}/N_t^{21} = 0,  lim_{t→∞} N_t^{22}/N_t^{12} = 0.

Consider, to be specific, the first case. In this case, the proportion of source
no.2 generated data that is found in the training data of source no.1 goes to
zero. Suppose now that the parallel data allocation scheme is applied once
again, only to these data, using a new combination of predictors/training algo-
rithm. Suppose that there is a further partition of the source subset S_1 into
sets S_11, S_12. If conditions A0, A1, A2 hold true for this new combination of
composite sources, predictors and training algorithm, and given that data from
sources belonging to set S_2 will be contained in a vanishing proportion in the
training data, it follows from Theorem 11.2 that the training data will be fur-
ther separated into two subsets, one corresponding to each source subset S_11,
S_12. Of course, exactly the same argument applies to source set S_2, which will
be separated into subsets S_21, S_22, each with a corresponding training data
subset. This procedure continues until the original source set S is hierarchi-
cally partitioned into a number of sets for which no further partitions satisfying
conditions A0, A1, A2 are possible. By judicious choice of the predictors and
training algorithm it is possible to reduce the sets of the final partition to sin-
gletons, i.e. break down the original set S to K subsets of the form {k_1}, {k_2},
..., {k_K}, where (k_1, k_2, ..., k_K) is a permutation of (1, 2, ..., K). In other
words, exactly one predictor corresponds to each subset / source.
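The hierarchical procedure just described is, in effect, recursive binary splitting of the training data. Its control flow can be sketched as follows, with a hypothetical split_in_two standing in for one run of the two-predictor allocation scheme (stubbed out here so that the sketch is runnable).

```python
# Schematic control flow of the recursive scheme; `split_in_two` stands
# for one run of the two-predictor parallel allocation and is a stub.

def split_in_two(data):
    """Return two data subsets, or None when no further partition
    satisfying assumptions A0, A1, A2 is possible (stub: stop at
    singletons and split lists in half)."""
    if len(data) < 2:
        return None
    mid = len(data) // 2
    return data[:mid], data[mid:]

def hierarchical_partition(data):
    """Recursively split until no leaf can be split further; each final
    leaf is then associated with exactly one predictor / source."""
    split = split_in_two(data)
    if split is None:
        return [data]
    left, right = split
    return hierarchical_partition(left) + hierarchical_partition(right)

leaves = hierarchical_partition(list(range(8)))
assert len(leaves) == 8
assert sorted(sum(leaves, [])) == list(range(8))
```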

It must be pointed out that, for the above procedure to succeed, judicious
selection of the predictors and training algorithm is necessary at every stage of
the data allocation scheme. Some guidelines for fine tuning the data allocation
scheme have been presented in Section 10.4 of Chapter 10. As already remarked
for the two sources case, the value of the above convergence argument is mostly
theoretical, in pointing out conditions sufficient to ensure convergence.

11.3 CONCLUSIONS

In this chapter we have formulated the problem of parallel data allocation


rigorously and presented conditions sufficient to ensure convergence. We have
first treated the case of two active sources and then generalized our results to
the case of K sources. The convergence conditions are expressed in terms of
allocation probabilities. It must be stressed that the validity of these conditions
depends on the combination of sources/ predictor type/ training algorithm/
allocation threshold.
Two convergence theorems have been presented. Theorem 11.1 describes
convergence of the specialization process X t , which takes place with proba-
bility one. Since specialization is expressed in terms of the difference in the
number of samples generated from each active source, the conclusion of The-
orem 11.1 is that at least one predictor will collect "many more" data from a
particular source than from all other sources (actually the difference goes to
infinity). In this sense at least one predictor specializes in exactly one source
and, furthermore, each source is associated with one predictor. Theorem 11.2
indicates that if one predictor specializes in the "difference" sense, then all
predictors will specialize. And in fact they will specialize in a strong sense, i.e.
there is a one-to-one association of predictors and sources, such that the data
allocated to each predictor satisfy the following: the ratio of data generated by
any non-associated source to the data generated by the respective associated
source goes to zero with probability one. The final conclusion results from
combination of the results of Theorems 11.1 and 11.2: since one predictor will
certainly specialize in the "difference" sense, it follows that all predictors will
specialize in the "ratio" sense.
Having proved the convergence of the parallel data allocation scheme, we
have shown that our source identification algorithm works. We stress once
again that the convergence results presented here depend on the validity of as-
sumptions AO, AI, A2, which will only hold true if considerable judiciousness
is exercised in setting up the parallel data allocation scheme. However, this
requirement may be less stringent than it appears. As we have seen in Chapter
10, the actual data allocation schemes used produce very satisfactory results.
At any rate, the main importance of our convergence results is in using math-
ematical analysis to illuminate the various factors which influence convergence
of the parallel data allocation process, rather than in providing practical design
guidelines. In particular, the important conclusion we have reached is that the
specialization process X_t will not oscillate. Returning to the random walk
interpretation of the specialization process, the particle will not oscillate around
the origin but will wander off either to plus or minus infinity.
Finally, let us note that while our data allocation scheme is based on pre-
dictive modular credit assignment, the convergence results presented here may
be applied in a more general context, encompassing static, as well as dynamic
(time series) classification problems. The generality of our conclusions follows
from the generality of the assumptions on which the data allocation analysis
rests.

Appendix 11.A: General Plan of the Proof


The proof of Theorems 11.1 and 11.2 is fairly lengthy and is presented in the
following sections of this Appendix. It may be helpful for the reader to first
outline here the general idea of the proof, which is rather simple.
Proof of Theorem 11.1
Our analysis hinges on the Markovian specialization process X t . We have
already pointed out that X t is an inhomogeneous random walk, i.e.

1. the value of X t may be taken to indicate the position at time t of an imagi-


nary particle moving on the line of (positive and negative) integers.

2. At every time t the particle will move only one step, either to the left or to
the right.

3. The probability of moving to the left or to the right changes with the position
of the particle.

Using the source activation probabilities π_1 and π_2 and the data allocation
probabilities a_n and b_n, we have obtained the transition probabilities of X_t.
Our first goal is to establish that X_t does not pass from any particular state
infinitely often. Technically, this is expressed by saying that X_t is a transient
Markov chain. This result is established using the classical Theorem 11.B.1
and Lemma 11.B.2, which shows that the conclusions of Theorem 11.B.1
can be applied to the process X_t.
Using Lemma 11.B.2 we show that X_t does not pass from any state
infinitely often; this is the first conclusion of Theorem 11.1. Then it follows
that X_t must spend most of its time at either plus or minus infinity; this is
the second conclusion of the theorem. Finally, this is used to prove that X_t
cannot oscillate between plus and minus infinity, since then it would have to
pass through the intermediate states infinitely often! Hence it is concluded that
either X_t → +∞ or X_t → -∞, which is the final part of Theorem 11.1.
Proof of Theorem 11.2
Theorem 11.2 describes the behavior of N_t^{ij}, i, j = 1, 2, i.e. it tells us how
many samples each predictor collects from each source. At every time t, the
N_t^{ij} processes increase or remain unchanged (but cannot decrease) according
to certain probabilities which depend on X_t, i.e. on the current specialization.
Rather than examining the N_t^{ij}, we work with the process V_t and the associated
process M_t^d. As already remarked, at every t, V_t is either -1 or 1. However,
because the associated probabilities depend on X_{t-1}, which in turn depends on
V_{t-1}, V_{t-2}, ..., the random variables V_1, V_2, ..., V_t, ... are not independent and
this renders an analysis of their behavior difficult. Rather than working with
the V_t's directly, we relate their behavior to that of an auxiliary process V'_1,
V'_2, ..., V'_t, .... The random variables V'_1, V'_2, ..., V'_t, ... are constructed in
such a manner that they are independent and they take the values -1, 1 with
appropriate time invariant probabilities. For instance, in case that X_t → ∞,
V'_t is constructed in such a manner that we always have Pr(V_t = -1 | for all
τ ≥ n, X_τ ≥ m) ≤ Pr(V'_t = -1); here m and n are appropriately selected.
It is easy to prove (this is shown in Lemma 11.C.1) that with probability
one Σ_{s=1}^t V'_s/t goes to a quantity γ̄(m) as t goes to infinity.
Then, in Lemma 11.C.2 we show that the probability (conditioned on the
event "for all τ ≥ n, X_τ ≥ m") of Σ_{s=1}^t V_s exceeding any number is less
than the probability of Σ_{s=1}^t V'_s exceeding the same number. Combining the
results of Lemmas 11.C.1 and 11.C.2 we obtain Lemma 11.C.3: the
probability (conditioned on the event "for all τ ≥ n, X_τ ≥ m") of Σ_{s=1}^t V_s
exceeding 2γ̄(m) is zero for any m.
Now, using Lemma 11.C.3 we can prove Theorem 11.2. For instance, we
show that the probability (conditional on X_t going to plus infinity, i.e. either
predictor no.1 specializing in source no.1, or predictor no.2 specializing in source
no.2, or both) of all the following events is one.

1. N_t^d/t goes to zero (i.e. the total number of "wrong" allocations is very small);

2. from (1) it follows that N_t^{12}/t goes to zero (i.e. the total number of "wrong"
allocations of the type source no.1 → predictor no.2 is very small);

3. from (1) it also follows that N_t^{11}/t goes to π_1 (i.e. the total number of "correct"
allocations of the type source no.1 → predictor no.1 is very large);

4. and finally it follows that N_t^{12}/N_t^{11} goes to zero (i.e. the total number of "wrong"
allocations divided by the total number of correct allocations goes to zero).

We repeat that all of the above events (and similar ones corresponding to
predictor no.2) happen with probability one, conditional on X_t tending to +∞.
Similar results are obtained for the case of X_t tending to -∞.

Appendix 11.B: Convergence of X_t

To prove Theorem 11.1, the following theorem is necessary; it is essentially a
paraphrase of a classical theorem which appears in (Feller, 1968), as well as in
(Billingsley, 1986).
Theorem 11.B.1 Consider any irreducible Markov chain X_t taking val-
ues in Z, with transition probabilities p_{m,n}, m, n ∈ Z. Suppose that the system
of equations

u_n = Σ_{k≠0} p_{n,k} u_k,  n ∈ Z - {0}  (11.B.1)

has a solution {u_n}_{n∈Z-{0}} which satisfies:

1. ∀n we have 0 ≤ u_n ≤ 1;

2. ∃n such that 0 < u_n.

Then the state 0 is transient.

Proof. Every sequence {u_n}_{n∈Z-{0}} which solves eq. (11.B.1) and satisfies
0 ≤ u_n ≤ 1 for all n ∈ Z-{0} is called an admissible solution. Note that
eq. (11.B.1) always has a trivial admissible solution, namely u_n = 0 for all n.
If in addition there is some n_0 such that u_{n_0} > 0, then {u_n}_{n∈Z-{0}} is called a
nontrivial admissible solution.
Let us now obtain an admissible solution to eq. (11.B.1) by following a par-
ticular procedure. Define P to be the transition matrix of X_t: P ≜ [p_{m,n}]_{m,n∈Z}.
Define Q to be the matrix obtained from P by deleting the 0-th row and 0-th
column: Q ≜ [p_{m,n}]_{m,n∈Z-{0}}. Note that q_{m,n} = p_{m,n} for all m, n ∈ Z-{0}.
Also, define Q^{(t)} to be the t-th power of Q. In other words, for m, n ∈ Z-{0}

q_{m,n}^{(1)} ≜ p_{m,n},
q_{m,n}^{(t+1)} = Σ_{k≠0} p_{m,k} q_{k,n}^{(t)} = Σ_{k≠0} q_{m,k}^{(t)} p_{k,n},  t = 1, 2, ...

Now, define quantities r_m^{(t)}, with m ∈ Z-{0}, as follows:

r_m^{(0)} ≜ 1,
r_m^{(t)} ≜ Σ_{k≠0} q_{m,k}^{(t)},  t = 1, 2, ...  (11.B.2)

An interpretation of r_m^{(t)} will be given a little later. For the time being, note
that for every m, n ∈ Z-{0} we have q_{m,n} = p_{m,n} = Pr(X_t = n | X_{t-1} = m); it
follows that Σ_{k≠0} p_{m,k} ≤ 1. Then, from eq. (11.B.2) it follows that

r_m^{(1)} = Σ_{k≠0} q_{m,k}^{(1)} = Σ_{k≠0} p_{m,k} ≤ 1 = r_m^{(0)}.

For any m and for any t ≥ 1, it follows from (11.B.2) that

r_m^{(t+1)} = Σ_{n≠0} q_{m,n}^{(t+1)} = Σ_{n≠0} Σ_{k≠0} q_{m,k}^{(t)} p_{k,n} = Σ_{k≠0} q_{m,k}^{(t)} Σ_{n≠0} p_{k,n} ≤ Σ_{k≠0} q_{m,k}^{(t)} = r_m^{(t)}.

So we see that for every m we have

1 = r_m^{(0)} ≥ r_m^{(1)} ≥ ... ≥ r_m^{(t)} ≥ r_m^{(t+1)} ≥ ... ≥ 0.
When t → ∞, this bounded and decreasing sequence has a limit. For any m in
Z-{0}, define

r_m ≜ lim_{t→∞} r_m^{(t)} ≥ 0;

note that for all m in Z-{0}, and for t = 0, 1, ..., we have r_m^{(t)} ≥ r_m. Now, note
that

r_m^{(t+1)} = Σ_{n≠0} q_{m,n}^{(t+1)} = Σ_{n≠0} Σ_{k≠0} p_{m,k} q_{k,n}^{(t)} = Σ_{k≠0} p_{m,k} Σ_{n≠0} q_{k,n}^{(t)} = Σ_{k≠0} p_{m,k} r_k^{(t)}.  (11.B.3)

Take the limit as t → ∞ in the above equation and interchange the order of limit
and summation (using the Bounded Convergence Theorem). Then eq. (11.B.3)
becomes

r_m = Σ_{k≠0} p_{m,k} r_k.  (11.B.4)

So {r_m}_{m∈Z-{0}} is an admissible solution of eq. (11.B.1). So, we have proved
that every equation of the form (11.B.1) has an admissible solution which is
defined by the above procedure. However, note that this particular admissible
solution may actually be the trivial one.
Now, suppose that there was another admissible solution of eq. (11.B.1), call
it {s_m}_{m∈Z-{0}}. Then

s_m = Σ_{k≠0} p_{m,k} s_k.  (11.B.5)

Since r_m^{(0)} = 1, by definition of an admissible solution we must have s_m ≤
r_m^{(0)} = 1 for all m. Then

r_m^{(1)} = Σ_{k≠0} p_{m,k} r_k^{(0)} ≥ Σ_{k≠0} p_{m,k} s_k = s_m.

Since s_m ≤ r_m^{(1)} for all m, we have

r_m^{(2)} = Σ_{k≠0} p_{m,k} r_k^{(1)} ≥ Σ_{k≠0} p_{m,k} s_k = s_m.

Continuing in this manner, it is seen that s_m ≤ r_m^{(t)} for all m, t, which implies

r_m = lim_{t→∞} r_m^{(t)} ≥ s_m.  (11.B.6)

So the special admissible solution {r_m}_{m∈Z-{0}}, defined by the above proce-
dure, is greater than any other admissible solution. Hence it makes sense to
call {r_m}_{m∈Z-{0}} the maximal admissible solution of eq. (11.B.1).
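The iterative construction above also yields a practical numerical test for transience: truncate the chain to a finite window, iterate r^{(t+1)} = Q r^{(t)} from r^{(0)} = 1, and check whether the limit is nonzero. The sketch below does this for a toy birth-death chain with constant outward drift (our example, not from the book), where the limit is known in closed form from the gambler's ruin problem.

```python
# Numerical version of the r^{(t)} iteration from the proof above:
# start from r^{(0)} = 1 and iterate r^{(t+1)} = Q r^{(t)}.
# Toy chain (our example): birth-death walk on Z that moves *away*
# from 0 with probability p_out; states beyond the window [-N, N]
# are treated as having escaped for good (value 1).

def max_admissible_solution(p_out, N=60, iters=4000):
    states = [n for n in range(-N, N + 1) if n != 0]
    r = {n: 1.0 for n in states}                  # r^{(0)}_m = 1

    def val(cur, k):
        if k == 0:
            return 0.0    # hitting 0 means state 0 was not avoided
        if abs(k) > N:
            return 1.0    # escaped the window: never returns to 0
        return cur[k]

    for _ in range(iters):
        r = {n: p_out * val(r, n + (1 if n > 0 else -1))
                + (1.0 - p_out) * val(r, n - (1 if n > 0 else -1))
             for n in states}
    return r

r = max_admissible_solution(0.7)
# Gambler's-ruin benchmark: Pr(never hit 0 | start at 1) = 1 - q/p
assert abs(r[1] - (1.0 - 0.3 / 0.7)) < 1e-3
assert all(0.0 <= v <= 1.0 + 1e-12 for v in r.values())
```

A nonzero limit (as found here) signals a nontrivial maximal admissible solution and hence transience of state 0; for an inward-drifting (recurrent) chain the same iteration would decay toward zero everywhere.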
CONVERGENCE OF PARALLEL DATA ALLOCATION 187

Regarding the interpretation of $r_m$, note that

$$r^{(1)}_m = \sum_{k\neq 0} q^{(1)}_{m,k} = \sum_{k\neq 0} p_{m,k} = \sum_{k\neq 0} \Pr(X_{t+1} = k \mid X_t = m) \Rightarrow$$

$$r^{(1)}_m = \Pr(X_{t+1} \neq 0 \mid X_t = m).$$

Also,

$$r^{(2)}_m = \sum_{k\neq 0} p_{m,k}\, r^{(1)}_k = \sum_{k\neq 0} \Pr(X_{t+1} = k \mid X_t = m)\,\Pr(X_{t+2} \neq 0 \mid X_{t+1} = k) \Rightarrow$$

$$r^{(2)}_m = \Pr(X_{t+1} \neq 0,\ X_{t+2} \neq 0 \mid X_t = m).$$

Continuing in this manner, for $T = 3, 4, \dots$, it is seen that for any $T$

$$r^{(T)}_m = \Pr(X_{t+1} \neq 0,\ X_{t+2} \neq 0,\ \dots,\ X_{t+T} \neq 0 \mid X_t = m);$$

and then, in the limit as $T \to \infty$, we have

$$r_m = \lim_{T\to\infty} r^{(T)}_m = \Pr(X_{t+1} \neq 0,\ X_{t+2} \neq 0,\ \dots \mid X_t = m).$$

In other words,

$$r_m = \Pr(X_t \text{ never going to } 0 \mid X_t \text{ starting at } m).$$
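This probabilistic interpretation can be checked by simulation. The sketch below uses a hypothetical chain (a random walk with up-probability $p = 0.7$, down-probability $q = 0.3$): it estimates the probability of avoiding state 0 over a long horizon and compares the estimate with the closed-form value $1 - (q/p)^m$ known for this particular walk. All parameter choices are illustrative.

```python
import random

# Monte Carlo estimate of Pr(X_t never reaches 0 | X_0 = m) for a
# hypothetical biased walk (p = 0.7 up, q = 0.3 down), compared with the
# known closed form 1 - (q/p)^m for this walk.  The finite horizon is an
# approximation; the neglected tail probability is negligible here.

random.seed(3)
p, m, horizon, trials = 0.7, 2, 400, 20000
survived = 0
for _ in range(trials):
    x = m
    for _ in range(horizon):
        x += 1 if random.random() < p else -1
        if x == 0:
            break
    else:                       # loop completed without hitting 0
        survived += 1

estimate = survived / trials
closed = 1 - (0.3 / 0.7) ** m   # = 1 - (q/p)^m, approx. 0.816
assert abs(estimate - closed) < 0.02
```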

Now, if eq.(11.B.1) has a nontrivial admissible solution, then it also has a
nontrivial maximal admissible solution $\{r_m\}_{m\in Z-\{0\}}$. This means that for
some $m \neq 0$, $r_m = \Pr(X_t \text{ never passing from } 0 \mid X_t \text{ started at } m) > 0$. Also,
since the chain is irreducible, $\Pr(X_t \text{ going to } m \mid X_t \text{ started at } 0) > 0$. Hence

$$\Pr(X_t \text{ never going to } 0 \mid X_t \text{ starting at } m)\cdot\Pr(X_t \text{ going to } m \mid X_t \text{ starting at } 0) > 0,$$

which implies that

$$\Pr(X_t \text{ never going to } 0 \mid X_t \text{ starting at } 0) > 0 \Rightarrow \Pr(X_t \text{ going to } 0 \mid X_t \text{ starting at } 0) < 1.$$

By definition, this implies that the state 0 is transient and the proof of the
Theorem is complete. •
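As a concrete check of the criterion, consider the biased walk with up-probability $p > 1/2$ and down-probability $q = 1 - p$ (a hypothetical chain, not the specialization process itself). The sketch below verifies numerically that $r_m = 1 - (q/p)^m$ is a nontrivial admissible solution of the system $r_1 = p\, r_2$, $r_m = q\, r_{m-1} + p\, r_{m+1}$ ($m \ge 2$), so by the theorem just proved, state 0 is transient for this walk.

```python
# Verify that r_m = 1 - (q/p)^m is an admissible solution of the system
# (11.B.1) specialized to a biased nearest-neighbor walk (hypothetical
# chain): r_1 = p*r_2 and r_m = q*r_{m-1} + p*r_{m+1} for m >= 2,
# with 0 <= r_m <= 1 and r_m > 0 (nontrivial).

for p in (0.6, 0.7, 0.9):
    q = 1.0 - p
    r = lambda m: 1.0 - (q / p) ** m
    assert abs(r(1) - p * r(2)) < 1e-12         # boundary equation
    for m in range(2, 50):
        assert abs(r(m) - (q * r(m - 1) + p * r(m + 1))) < 1e-12
        assert 0.0 < r(m) <= 1.0                # admissible and nontrivial
```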
In addition to Theorem 11-11.B.1 we will need the following Lemma.

Lemma 11-11.B.2 Suppose that conditions A0, A1, A2 hold and that the
specialization process $X_t$ has transition probability matrix $P = [p_{m,n}]_{m,n\in Z}$.
Then the system

$$u_n = \sum_{k\neq 0} p_{n,k}\, u_k, \qquad n \in Z - \{0\} \qquad (11.B.7)$$

has a nonzero solution $u_n$, $n \in Z-\{0\}$, with $0 \le u_n \le 1$.

Proof. The system of eq.(11.B.7) reduces to two decoupled systems, one
defined by eqs.(11.B.8) and (11.B.9), and another defined by eqs.(11.B.10)
and (11.B.11):

$$u_1 = p_{1,2}\, u_2, \qquad (11.B.8)$$
$$u_n = p_{n,n-1}\, u_{n-1} + p_{n,n+1}\, u_{n+1}, \qquad n = 2, 3, \dots; \qquad (11.B.9)$$
$$u_{-1} = p_{-1,-2}\, u_{-2}, \qquad (11.B.10)$$
$$u_n = p_{n,n-1}\, u_{n-1} + p_{n,n+1}\, u_{n+1}, \qquad n = -2, -3, \dots \qquad (11.B.11)$$

It will be shown that each of the above systems has an admissible solution.
Start with eqs.(11.B.8) and (11.B.9). Eq.(11.B.9) implies

$$(p_{n,n-1} + p_{n,n+1})\cdot u_n = p_{n,n-1}\, u_{n-1} + p_{n,n+1}\, u_{n+1} \Rightarrow$$
$$p_{n,n-1}\, u_n + p_{n,n+1}\, u_n = p_{n,n-1}\, u_{n-1} + p_{n,n+1}\, u_{n+1} \Rightarrow$$
$$p_{n,n-1}\cdot(u_n - u_{n-1}) = p_{n,n+1}\cdot(u_{n+1} - u_n) \Rightarrow$$
$$(u_{n+1} - u_n) = \frac{p_{n,n-1}}{p_{n,n+1}}\cdot(u_n - u_{n-1}) \Rightarrow$$

$$\begin{cases} u_3 - u_2 = \dfrac{p_{2,1}}{p_{2,3}}\,(u_2 - u_1), \\[2mm] u_4 - u_3 = \dfrac{p_{3,2}}{p_{3,4}}\,(u_3 - u_2) = \dfrac{p_{3,2}}{p_{3,4}}\cdot\dfrac{p_{2,1}}{p_{2,3}}\,(u_2 - u_1), \\[1mm] \quad\vdots \\[1mm] u_N - u_{N-1} = \dfrac{p_{N-1,N-2}\cdot p_{N-2,N-3}\cdots p_{3,2}\cdot p_{2,1}}{p_{N-1,N}\cdot p_{N-2,N-1}\cdots p_{3,4}\cdot p_{2,3}}\,(u_2 - u_1). \end{cases}$$

Adding up these equations yields

$$u_N = u_2 + \left\{\sum_{n=3}^{N} \frac{p_{n-1,n-2}\cdot p_{n-2,n-3}\cdots p_{3,2}\cdot p_{2,1}}{p_{n-1,n}\cdot p_{n-2,n-1}\cdots p_{3,4}\cdot p_{2,3}}\right\}\cdot(u_2 - u_1). \qquad (11.B.12)$$

Now, from eq.(11.B.8), it follows that

$$(p_{1,0} + p_{1,2})\cdot u_1 = p_{1,2}\, u_2 \Rightarrow p_{1,0}\, u_1 = p_{1,2}\cdot(u_2 - u_1).$$

Choose any $u_1$ such that $0 < u_1 < 1$. Then, since $p_{1,0}, p_{1,2} > 0$, also
$u_2 - u_1 > 0 \Rightarrow u_2 > u_1 > 0$.
Then, from eq.(11.B.12) for $N = 3, 4, \dots$ we also have $u_N > 0$. So a solution
to eqs.(11.B.8), (11.B.9) has been obtained, which satisfies $u_N > 0$ for $N = 1, 2, \dots$. Now if

$$\left\{\sum_{n=3}^{\infty} \frac{p_{n-1,n-2}\cdot p_{n-2,n-3}\cdots p_{3,2}\cdot p_{2,1}}{p_{n-1,n}\cdot p_{n-2,n-1}\cdots p_{3,4}\cdot p_{2,3}}\right\}\cdot(u_2 - u_1) < \infty, \qquad (11.B.13)$$

then $u'_N$ can be defined for $N = 1, 2, 3, \dots$ by

$$u'_N = \frac{u_N}{\,u_2 + \left\{\sum_{n=3}^{\infty} \dfrac{p_{n-1,n-2}\cdot p_{n-2,n-3}\cdots p_{3,2}\cdot p_{2,1}}{p_{n-1,n}\cdot p_{n-2,n-1}\cdots p_{3,4}\cdot p_{2,3}}\right\}\cdot(u_2 - u_1)}. \qquad (11.B.14)$$

It is evident that for $N = 1, 2, \dots$ the $u'_N$'s satisfy eqs.(11.B.8), (11.B.9) and
$0 < u'_N \le 1$. So, it only needs to be shown that the inequality (11.B.13) will
always be true if conditions A0, A1, A2 hold. To show this, note that

$$p_{n-1,n-2} = \pi_1\cdot(1 - a_{n-1}) + \pi_2\cdot(1 - b_{n-1}),$$
$$p_{n-1,n} = \pi_1\cdot a_{n-1} + \pi_2\cdot b_{n-1}.$$

If we define

$$h(n) \triangleq \frac{p_{n-1,n-2}}{p_{n-1,n}},$$

then, since $\lim_{n\to\infty} a_n = 1$ and $\lim_{n\to\infty} b_n = 1$, it follows that $\lim_{n\to\infty} h(n) = 0$. Hence for any $0 < \rho < 1$ there is some $n_0$ such that for all $n \ge n_0$ we have
$h(n) < \rho$. Then we can write

$$\sum_{n=3}^{\infty} \frac{p_{n-1,n-2}\cdot p_{n-2,n-3}\cdots p_{3,2}\cdot p_{2,1}}{p_{n-1,n}\cdot p_{n-2,n-1}\cdots p_{3,4}\cdot p_{2,3}} = G(n_0) + H(n_0)\cdot\sum_{n=n_0}^{\infty} \frac{p_{n-1,n-2}\cdot p_{n-2,n-3}\cdots p_{n_0-1,n_0-2}}{p_{n-1,n}\cdot p_{n-2,n-1}\cdots p_{n_0-1,n_0}}, \qquad (11.B.15)$$

where $G(n_0)$ and $H(n_0)$ only depend on $n_0$. It follows that the expression
(11.B.15) is less than

$$G(n_0) + H(n_0)\cdot\sum_{n=n_0}^{\infty}\rho^{\,n-n_0} < \infty \Rightarrow \sum_{n=3}^{\infty} \frac{p_{n-1,n-2}\cdot p_{n-2,n-3}\cdots p_{3,2}\cdot p_{2,1}}{p_{n-1,n}\cdot p_{n-2,n-1}\cdots p_{3,4}\cdot p_{2,3}} < \infty.$$

Hence it has been proved that if A0, A1, A2 hold, eqs.(11.B.8), (11.B.9), and
so also eq.(11.B.7), have a nontrivial admissible solution; consequently $X_t$ is
transient. It can also be proved that eqs.(11.B.10), (11.B.11) have a nontrivial
admissible solution. The method of proof is quite similar to the one already
used and will not be presented here. •
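The summability in (11.B.13) can be illustrated numerically. In the sketch below the sequences $a_n, b_n \to 1$ and the weights $\pi_1 = \pi_2 = 0.5$ are hypothetical choices; with them $h(n) = p_{n-1,n-2}/p_{n-1,n} \to 0$, so the terms of the series are eventually dominated by a geometric series and the partial sums stabilize, in line with the argument above.

```python
# Numerical illustration of the convergence of the series in (11.B.13).
# The sequences a_n, b_n -> 1 and pi_1 = pi_2 = 0.5 are hypothetical.

pi1, pi2 = 0.5, 0.5
a = lambda n: 1.0 - 1.0 / (n + 1)
b = lambda n: 1.0 - 1.0 / (n + 2)
p_down = lambda n: pi1 * (1 - a(n)) + pi2 * (1 - b(n))   # p_{n,n-1}
p_up = lambda n: pi1 * a(n) + pi2 * b(n)                 # p_{n,n+1}

total, prod, partials = 0.0, 1.0, []
for n in range(3, 2000):
    prod *= p_down(n - 1) / p_up(n - 1)   # extend the product by one factor
    total += prod
    partials.append(total)

assert p_down(999) / p_up(999) < 0.01          # h(n) -> 0
assert partials[-1] - partials[-100] < 1e-12   # partial sums have stabilized
assert total < 1.0
```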
We now prove Theorem 11.1 using Theorem 11-11.B.1 and Lemma 11-11.B.2.

Proof of Theorem 11.1: In Lemma 11-11.B.2 it has been proved that
eq.(11.B.7) has an admissible solution, so by Theorem 11-11.B.1, 0 is a transient
state of $X_t$. Then, by Theorem A.10, for all $m, i \in Z$

$$\Pr(X_t = m \text{ i.o.} \mid X_0 = i) = 0$$
$$\Rightarrow \Pr(X_t = m \text{ i.o.} \mid X_0 = i)\cdot\Pr(X_0 = i) = 0 \Rightarrow$$
$$\Pr(X_t = m \text{ i.o. and } X_0 = i) = 0$$
$$\Rightarrow \sum_{i\in Z} \Pr(X_t = m \text{ i.o. and } X_0 = i) = 0 \Rightarrow$$
$$\Pr(X_t = m \text{ i.o.}) = 0 \quad \text{for all } m \in Z.$$

This completes the proof of eq.(11.3).


Take any positive integer $M$ and consider

$$\Pr(|X_t| \le M \text{ i.o.}) = \Pr([X_t = -M \text{ or } X_t = -M+1 \text{ or } \dots X_t = M] \text{ i.o.}) =$$
$$\Pr(\{X_t = -M \text{ i.o.}\} \text{ or } \{X_t = -M+1 \text{ i.o.}\} \text{ or } \dots \text{ or } \{X_t = M \text{ i.o.}\}) \le$$
$$\Pr(X_t = -M \text{ i.o.}) + \Pr(X_t = -M+1 \text{ i.o.}) + \dots + \Pr(X_t = M \text{ i.o.}) = 0.$$

So, for all $M \in N$,

$$\Pr(|X_t| \le M \text{ i.o.}) = 0 \Rightarrow \Pr(|X_t| > M \text{ a.a.}) = 1.$$

(Recall that "a.a." means "almost always".) Define the event $A_M = \{|X_t| > M$
a.a.$\}$. Clearly, if $N > M$ then $A_N \subset A_M$. So, $A_1, A_2, \dots$ is a decreasing
sequence of sets; defining $A = \bigcap_{M=1}^{\infty} A_M$, we have (by Lemma A.2): $\Pr(A) = \lim_{M\to\infty}\Pr(A_M) = 1$. But

$$A = \bigcap_{M=1}^{\infty}\{|X_t| > M \text{ a.a.}\} = \{\forall M\ \exists t_M \text{ such that } \forall t \ge t_M\ |X_t| > M\} = \left\{\lim_{t\to\infty}|X_t| = \infty\right\}.$$

Hence $\Pr\left(\lim_{t\to\infty}|X_t| = \infty\right) = 1$ and this completes the proof of eq.(11.4).
Consider all sample paths such that $\lim_{t\to\infty}|X_t| = \infty$. These paths have total
probability one. Among these paths, take any one such that $X_t$ does not go to
$+\infty$. Then there is some $M \in N$ and some times $t_1 < t_2 < t_3 < \dots$ such that

$$\text{for } n = 1, 2, \dots \text{ we have } X_{t_n} < M. \qquad (11.B.16)$$

But, since $\lim_{t\to\infty}|X_t| = \infty$, it follows that there is some $t_0$ such that

$$\forall t \ge t_0 \text{ we have } |X_t| > M. \qquad (11.B.17)$$

It can be assumed without loss of generality that $t_0 < t_1 < t_2 < \dots$. Then,
from eqs.(11.B.16) and (11.B.17) it follows that for the previously mentioned
$M \in N$ and times $t_0 < t_1 < t_2 < \dots$ we have:

$$|X_t| > M \quad \text{for } t \ge t_0, \qquad X_t < -M \quad \text{for } t = t_1, t_2, \dots$$

Is there some $t > t_0$ such that $X_t \ge -M$? Suppose that for $t = \tilde t > t_0$ we
have $X_{\tilde t} = M_0 \ge -M$. It must also be true that $|X_{\tilde t}| > M$, so it follows that
$M_0 > M$. So, $X_{\tilde t} = M_0 > M$ and $X_{t_1} = M_1 < -M$. Obviously, $t_1 \neq \tilde t$;
say, without loss of generality, that $\tilde t < t_1$. Then, since $X_t$ must move through
neighboring states, there must also be a time $\bar t$ with $\tilde t < \bar t < t_1$ and $X_{\bar t} = 0$;
but this contradicts eq.(11.B.17). So it has been proved that for all paths such
that $\lim_{t\to\infty}|X_t| = \infty$, either $\lim_{t\to\infty} X_t = +\infty$ or $\lim_{t\to\infty} X_t = -\infty$. From eq.(11.4) it
follows that the set of all paths for which it is not true that $\lim_{t\to\infty}|X_t| = \infty$ has
probability zero. This completes the proof of eq.(11.5), hence the proof of the
theorem is complete. •
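The dichotomy just proved can be observed in simulation. The walk below is hypothetical: its up-move probability rises to 1 far to the right and falls to 0 far to the left, mimicking the self-reinforcing specialization drift assumed of $X_t$. Every simulated path ends up far from 0, and both limits $+\infty$ and $-\infty$ occur across paths.

```python
import math
import random

# Hypothetical nearest-neighbor walk with state-dependent up-probability
# approaching 1 (resp. 0) as the state goes to +inf (resp. -inf).  Under
# the theorem, |X_t| -> infinity and each path settles on one sign.

random.seed(0)

def p_up(m):
    return 1.0 / (1.0 + math.exp(-m / 5.0))   # illustrative drift profile

finals = []
for _ in range(200):
    x = 0
    for _ in range(2000):
        x += 1 if random.random() < p_up(x) else -1
    finals.append(x)

assert all(abs(x) > 50 for x in finals)   # no path lingers near 0
assert any(x > 0 for x in finals) and any(x < 0 for x in finals)
```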

Appendix 11.C: Convergence of $N_t$

In this section we prove Theorem 11.2, regarding the limiting behavior of $N^{ij}_t$
for $i, j = 1, 2$. This proof is based on Lemma 11-11.C.3, which bounds the
probability that $N^d_t$ exceeds a certain number infinitely often (conditioned on
$X_t$ tending to $+\infty$ or to $-\infty$). We first need to define some quantities which will
be used in the remainder of this section. Recall that the transition probability
$\Pr(X_t = n \mid X_{t-1} = m)$ is denoted by $p_{m,n}$. However, it is more convenient to
use the following alternative notation: define (for $m \in Z$)

$$\alpha(m) \equiv p_{m,m+1}, \qquad \gamma(m) \equiv p_{m,m-1}.$$

Now define the following quantities for $x, y \in Z$:

$$p(y|x) \triangleq \begin{cases} \alpha(x) & \text{if } x < y, \\ \gamma(x) & \text{if } x > y; \end{cases} \qquad q(y|x) \triangleq \begin{cases} p(y|x) & \text{if } x \ge m, \\ p(y|m) & \text{if } x < m. \end{cases}$$

In other words, $q(y|x)$ is identical to $p(y|x)$, except when $x$ is less than $m$.
Both $p(\cdot|\cdot)$ and $q(\cdot|\cdot)$ are probability functions for any $x$, provided we limit $y$
to values such that $|x - y| = 1$.
Two Auxiliary Stochastic Processes

Before examining the properties of $N_t$ we need to define two auxiliary stochastic processes. These depend on certain probabilities which will now be
defined. Define (for all $m \in Z$) the following:

$$\bar\gamma(m) \equiv \sup_{n\ge m}\gamma(n), \qquad \bar\alpha(m) \equiv 1 - \bar\gamma(m).$$

Obviously $\bar\alpha(m) + \bar\gamma(m) = 1$. Note that $0 \le \bar\gamma(m) \le 1$ and hence $0 \le \bar\alpha(m) \le 1$.
In addition, $\bar\gamma(m)$ is monotonically decreasing, with $\lim_{m\to\infty}\bar\gamma(m) = 0$; it
follows that $\bar\alpha(m)$ is monotonically increasing, with $\lim_{m\to\infty}\bar\alpha(m) = 1$. Note
that for all $m \in Z$ we have

$$\bar\gamma(m) \ge \gamma(m) \quad \text{and} \quad \bar\alpha(m) \le \alpha(m).$$

Since $\bar\alpha(m)$ and $\bar\gamma(m)$ are nonnegative and add to one, they can be considered
to be probabilities. Now, consider $m$ fixed and for $z \in \{-1, 1\}$, define

$$\bar p(z) \triangleq \begin{cases} \bar\alpha(m) & \text{if } z = 1, \\ \bar\gamma(m) & \text{if } z = -1. \end{cases}$$

(We have suppressed, for brevity of notation, the dependence of $\bar p(\cdot)$ on $m$.)
Consider a sequence of independent random variables $\bar V^m_1, \bar V^m_2, \dots$ which (for
$t = 1, 2, \dots$ and $z \in \{-1, 1\}$) satisfy:

$$\Pr(\bar V^m_t = z) = \bar p(z).$$

In other words, the $\bar V^m_t$'s are a sequence of Bernoulli trials; in particular they
are independent and identically distributed.
Finally, define the stochastic process $\bar M^m_t$ by the following relationship:

$$\bar M^m_t = \begin{cases} 0 & \text{if } \bar V^m_t = 1, \\ 1 & \text{if } \bar V^m_t = -1. \end{cases} \qquad (11.C.1)$$

Clearly, for $z \in \{0, 1\}$ we have

$$\Pr(\bar M^m_t = z) = \Pr(\bar V^m_t = 1 - 2z).$$

Hence $\bar V^m_t$ and $\bar M^m_t$ have essentially the same probability function $\bar p(z)$.
Comparing eq.(11.C.1) with eq.(11.2), the similarities as well as the differences between $V_t$ and $\bar V^m_t$ (and between $M^d_t$ and $\bar M^m_t$) become obvious. In
particular:

1. $V_t$ defines an inhomogeneous random walk and the random variables $V_1, V_2, \dots$ are dependent;

2. $\bar V^m_t$ defines a homogeneous random walk and the random variables $\bar V^m_1, \bar V^m_2, \dots$ are independent;

3. $\sum_{s=1}^t M^d_s$ counts the number of $V_t$ moves to the left; $\sum_{s=1}^t \bar M^m_s$ counts the number of $\bar V^m_t$ moves to the left;

4. if for all $t$ greater than some $t_0$ we have $X_t \ge m$, then the probability of $V_t$ taking a move to the left is no greater than $\bar\gamma(m)$; that of $\bar V^m_t$ taking a move to the left is always $\bar\gamma(m)$.

It is the last observation that is really important. What we ultimately want
to prove is that when $X_t \to \infty$, the number of times when $V_t$ equals $-1$ will
be small. More specifically, we want to show that $\frac{\sum_s M^d_s}{t} \to 0$. Because $V_1, V_2, \dots$
are dependent, it is difficult to analyze the behavior of $\sum_s M^d_s$. Because $\bar V^m_1, \bar V^m_2, \dots$
are independent, it is easier to analyze the behavior of $\sum_s \bar M^m_s$. This is
the whole point of introducing the processes $\bar V^m_t$, $\bar M^m_t$.
In particular, it is easy to obtain the following useful lemma, which describes
the behavior of the stochastic process $\bar M^m_t$.

Lemma 11-11.C.1 For any $m \in Z$, $\exists t_m$ such that $\forall t \ge t_m$ we have

$$\Pr\left(\frac{\sum_{s=1}^{t} \bar M^m_s}{t} \ge 2\bar\gamma(m)\right) < \frac{1}{t^2}.$$

Proof. This is essentially one half of the Central Limit Theorem. For any $m$,
take a positive $\delta$ and define the following sets:

$$\bar C_t = \left\{\left|\frac{\sum_{s=1}^t \bar M^m_s - t\,\bar\gamma(m)}{\sqrt{t\cdot\bar\gamma(m)\cdot(1-\bar\gamma(m))}}\right| \ge \sqrt{2\delta\log(t)}\right\},$$

$$\hat C_t = \left\{\frac{\sum_{s=1}^t \bar M^m_s - t\,\bar\gamma(m)}{\sqrt{t\cdot\bar\gamma(m)\cdot(1-\bar\gamma(m))}} \ge \sqrt{2\delta\log(t)}\right\},$$

$$C_t = \left\{\sum_{s=1}^t \bar M^m_s \ge 2t\,\bar\gamma(m)\right\}.$$

Clearly, $\hat C_t \subset \bar C_t$. Also note that

$$\hat C_t = \left\{\sum_{s=1}^t \bar M^m_s \ge t\,\bar\gamma(m) + \sqrt{2\delta\cdot\bar\gamma(m)\cdot(1-\bar\gamma(m))}\cdot\sqrt{t\log(t)}\right\}.$$

Since $\sqrt{t\log(t)}/t \to 0$, it follows that for $t$ large enough (say for all $t$ greater than
some appropriate $t'_m$) we have

$$t\,\bar\gamma(m) > \sqrt{2\delta\cdot\bar\gamma(m)\cdot(1-\bar\gamma(m))}\cdot\sqrt{t\log(t)}.$$

Hence for all $t \ge t'_m$ we have $C_t \subset \hat C_t$.

Now, it is clear that $\bar M^m_1, \bar M^m_2, \dots$ is a sequence of Bernoulli trials, with
expectation $E(\bar M^m_t) = \bar\gamma(m)$. Then, using Theorem A.9 (see Mathematical
Appendix) it follows that there is some $t''_m$ such that for all $t > t''_m$ we have

$$\Pr(\bar C_t) < \frac{1}{t^\delta}.$$

Then it follows that $\Pr(C_t) \le \Pr(\hat C_t) \le \Pr(\bar C_t) < \frac{1}{t^\delta}$. In short,
for every $m$ there is some $t_m = \max(t'_m, t''_m)$ such that for all $t \ge t_m$ we have

$$\frac{1}{t^\delta} > \Pr(C_t) = \Pr\left(\sum_{s=1}^t \bar M^m_s \ge 2t\,\bar\gamma(m)\right) = \Pr\left(\frac{\sum_{s=1}^t \bar M^m_s}{t} \ge 2\bar\gamma(m)\right)$$

and, taking $\delta = 2$, we have proved the Lemma. •
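The bound of Lemma 11-11.C.1 can be verified exactly for a Bernoulli sequence, since $\sum_{s\le t}\bar M^m_s$ is binomially distributed. In the sketch below $\bar\gamma(m) = 0.1$ is a hypothetical value; the exact tail probability exceeds $1/t^2$ for small $t$ (which is why the lemma needs a threshold $t_m$) but drops below it once $t$ is large.

```python
from math import ceil, comb

# Exact tail Pr( Binomial(t, gamma_bar) >= 2 t gamma_bar ) versus 1/t^2,
# for the hypothetical value gamma_bar = 0.1.

gamma_bar = 0.1

def tail(t):
    k0 = ceil(2 * t * gamma_bar - 1e-9)        # threshold 2 t gamma_bar
    return sum(comb(t, k) * gamma_bar**k * (1 - gamma_bar)**(t - k)
               for k in range(k0, t + 1))

assert tail(100) > 1.0 / 100**2                # bound fails for small t ...
for t in (300, 400, 500):
    assert tail(t) < 1.0 / t**2                # ... but holds past some t_m
```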


Relating $M^d_t$ and $\bar M^m_t$

Recall that, by the construction of $\bar V^m_t$, the probability that $\bar V^m_t = -1$ is
$\bar\gamma(m)$, and this is no smaller than $\gamma(n)$ for any $n \ge m$, which is the probability
that $V_t = -1$ when $X_{t-1} = n$.
Since $\sum_{s=1}^t \bar M^m_s$ counts the times that $\bar V^m_s = -1$ and $\sum_{s=1}^t M^d_s$ counts the
times that $X_s = X_{s-1} - 1$, it may be reasonably expected that, for any given
number $\varepsilon > 0$, $\frac{\sum_{s=1}^t \bar M^m_s}{t} \ge \varepsilon$ is more likely than $\frac{\sum_{s=1}^t M^d_s}{t} \ge \varepsilon$. The next
Lemma formalizes this observation.

Lemma 11-11.C.2 If conditions A0, A1, A2 hold, then, for any $m \in Z$,
and for the associated process $\bar M^m_t$, we have (for any $n \ge 0$, $\varepsilon > 0$, $t \ge 0$)

$$\Pr\left(\frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge \varepsilon \;\middle|\; \forall\tau\ge n\ X_\tau\ge m\right) \le \Pr\left(\frac{\sum_{s=1}^{t} \bar M^m_s}{t} \ge \varepsilon\right). \qquad (11.C.2)$$
Proof. Choose some $m \in Z$ and some $n \ge 0$, $\varepsilon > 0$, $t \ge 0$; consider these fixed
for the rest of the proof. Recall that the choice of $m$ determines $\bar V^m_t$ through
the probability $\bar p(\cdot)$ and that $\bar V^m_t$ determines $\bar M^m_t$. Now define $L \triangleq \varepsilon\cdot t$. We
have

$$\Pr\left(\frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge \varepsilon \;\middle|\; \forall\tau\ge n,\ X_\tau\ge m\right) =$$
$$\Pr(X_s < X_{s-1} \text{ at least } L \text{ times, for } s = n+1, \dots, n+t \mid \forall\tau\ge n,\ X_\tau\ge m).$$

(If $L$ is not an integer, then "$L$ times" should be taken to mean "$\lceil L\rceil$ times",
where $\lceil L\rceil$ means the integer part of $L$ plus one.)

Now, choose any $x_0 \in Z$ and define the following conditions on sequences
$(x_1, x_2, \dots, x_t) \in Z^t$.

C1 For $s = 1, 2, \dots, t$ we have $x_s < x_{s-1}$ at least $L$ times.

C2 For $s = 1, 2, \dots, t$ we have $|x_s - x_{s-1}| = 1$.

C3 For $s = 1, 2, \dots, t$ we have $x_s \ge m$.

Taking into account the dependence on $x_0$, let us define

1. $A_m(x_0)$: the set of sequences that satisfy C1, C2, C3; and

2. $A(x_0)$: the set of sequences that satisfy C1, C2.

Obviously, $A_m(x_0) \subset A(x_0)$ for all $m$. Now define

$$Q \triangleq \Pr(X_s < X_{s-1} \text{ at least } L \text{ times}, s = n+1, \dots, n+t \mid \forall\tau\ge n,\ X_\tau\ge m)$$

and

$$R(x_0) \triangleq \Pr(X_n = x_0 \mid \forall\tau\ge n,\ X_\tau\ge m).$$

It follows that

$$Q = \sum_{x_0=m,m+1,\dots} R(x_0)\cdot\sum_{(x_1,x_2,\dots,x_t)\in A_m(x_0)} p(x_1|x_0)\cdot p(x_2|x_1)\cdots p(x_t|x_{t-1}).$$

Recall that

1. $p(x_s|x_{s-1}) = q(x_s|x_{s-1})$ when $(x_1, x_2, \dots, x_t)$ belongs to $A_m(x_0)$,

2. $A_m(x_0) \subset A(x_0)$, and

3. $q(x_s|x_{s-1}) \ge 0$;

then, defining

$$Q_t(x_0) \triangleq \sum_{(x_1,x_2,\dots,x_t)\in A(x_0)} q(x_1|x_0)\cdot q(x_2|x_1)\cdots q(x_t|x_{t-1}), \qquad (11.C.3)$$

it follows that

$$Q \le \sum_{x_0=m,m+1,\dots} R(x_0)\cdot Q_t(x_0).$$

Let us now bound $Q_t(x_0)$; then it will be easy to bound $Q$ as well. As can
be seen from eq.(11.C.3), $Q_t(x_0)$ consists of a sum of products of $q(\cdot|\cdot)$ terms.
We will proceed in $t$ steps, at every step producing a greater expression by
replacing one of the $q(\cdot|\cdot)$ terms by a $\bar p(\cdot)$ term. Namely, at step 1 we will
replace $q(x_t|x_{t-1})$ by $\bar p(x_t - x_{t-1})$; at step 2 we will replace $q(x_{t-1}|x_{t-2})$ by
$\bar p(x_{t-1} - x_{t-2})$; and so on, until at the $t$-th step we obtain an expression which is
greater than $Q_t(x_0)$ and consists entirely of $\bar p(\cdot)$ terms.

To bound $Q_t(x_0)$, it will be useful to define some additional sets of sequences
$(y_1, y_2, \dots, y_t) \in \{-1, 1\}^t$. We define

$$B \triangleq \{(y_1, \dots, y_t): \text{for } s = 1, \dots, t \text{ we have } y_s \in \{-1, 1\},\ \text{no. of } -1\text{'s} \ge L\}.$$

Note that the sets $A(x_0)$ and $B$ are in a one-to-one correspondence: for $x_0$ fixed
and any $(x_1, x_2, \dots, x_t) \in A(x_0)$, a unique $(y_1, y_2, \dots, y_t) \in B$ is defined by taking
$y_s = x_s - x_{s-1}$ ($s = 1, 2, \dots$); conversely, for $x_0$ fixed and any $(y_1, y_2, \dots, y_t) \in B$,
$x_s = x_{s-1} + y_s$ defines a unique $(x_1, x_2, \dots, x_t) \in A(x_0)$. Hence there are one-to-one functions $Y: A(x_0) \to B$ and $X: B \to A(x_0)$, where $X = Y^{-1}$. Note
also that $B$ is independent of $x_0$, i.e. for any $x_0, x_0' \in Z$, we have $Y(A(x_0)) = Y(A(x_0'))$. Now, define three sets as follows:

$$\bar B^1_t \triangleq \{(y_1, y_2, \dots, y_t) \in B \text{ and } y_t = 1\},$$
$$\bar B^2_t \triangleq \{(y_1, y_2, \dots, y_t) \in B \text{ and } y_t = -1 \text{ and } (y_1, y_2, \dots, y_{t-1}, 1) \in \bar B^1_t\},$$
$$\bar B^3_t \triangleq B - \bar B^1_t - \bar B^2_t.$$

The sets $\bar B^i_t$, $i = 1, 2, 3$, partition $B$, i.e.

1. For $i, j = 1, 2, 3$ and $i \neq j$ we have $\bar B^i_t \cap \bar B^j_t = \emptyset$ and

2. $\bigcup_{i=1}^{3} \bar B^i_t = B$.

To see (1), first note that all sequences in $\bar B^1_t$ end in 1 and all sequences in
$\bar B^2_t$ end in $-1$, so that these two sets have empty intersection; as for $\bar B^3_t$, it does
not intersect any of the other sets by definition. To see (2), note that $\bar B^1_t$ is
the set of all sequences in $B$ ending in 1, and $\bar B^2_t \cup \bar B^3_t$ is the set of all sequences
in $B$ ending in $-1$.
It is also clear that the elements of $\bar B^1_t$ and $\bar B^2_t$ are in a one-to-one correspondence.

Finally, it is worth noting that $\bar B^3_t$ is the set of sequences ending in $-1$ for
which the total no. of $-1$'s is exactly equal to $L$. To see this, consider a sequence
$(y_1, \dots, y_t) \in \bar B^3_t$. Since $\bar B^3_t \cap \bar B^1_t = \emptyset$, it follows that $y_t = -1$. Since $\bar B^3_t \subset B$, it
follows that the no. of $-1$'s is greater than or equal to $L$. Suppose it is equal
to $L' > L$. Now consider the sequence $(y_1, \dots, y_{t-1}, 1)$. This ends in 1 and the
no. of $-1$'s it contains is $L'' = L' - 1 \ge L$. Hence $(y_1, \dots, y_{t-1}, 1)$ belongs to
$\bar B^1_t$. But then $(y_1, \dots, y_{t-1}, -1) = (y_1, \dots, y_{t-1}, y_t)$ must belong to $\bar B^2_t$. However,
since $\bar B^3_t \cap \bar B^2_t = \emptyset$, we have reached a contradiction. Hence $L' = L$ and our
claim has been justified; it also follows that $\bar B^2_t$ is the set of sequences ending
in $-1$, for which the no. of $-1$'s is greater than $L$.
Let us now proceed to implement the replacement procedure. We have

$$Q_t(x_0) = \sum_{(x_1,\dots,x_t)\in X(B)} q(x_1|x_0)\cdots q(x_t|x_{t-1}) \qquad (11.C.4)$$
$$= \sum_{(x_1,\dots,x_t)\in X(\bar B^1_t)} q(x_1|x_0)\cdots q(x_t|x_{t-1}) \qquad (11.C.5)$$
$$+ \sum_{(x_1,\dots,x_t)\in X(\bar B^2_t)} q(x_1|x_0)\cdots q(x_t|x_{t-1}) \qquad (11.C.6)$$
$$+ \sum_{(x_1,\dots,x_t)\in X(\bar B^3_t)} q(x_1|x_0)\cdots q(x_t|x_{t-1}). \qquad (11.C.7)$$

Now, each $q(x_1|x_0)\cdot q(x_2|x_1)\cdots q(x_t|x_{t-1})$ term in the above expressions
corresponds to a sequence $(x_1, x_2, \dots, x_t)$. Recall the following facts.

1. Sequences $(x_1, x_2, \dots, x_t) \in A(x_0)$ are in a one-to-one correspondence with
sequences $(y_1, y_2, \dots, y_t) \in B$; this correspondence is expressed by $x_s = y_s + x_{s-1}$ ($s = 1, 2, \dots, t$).

2. Sequences from $\bar B^1_t$ and $\bar B^2_t$ are also in a one-to-one correspondence: for
every $(y_1, y_2, \dots, y_{t-1}, 1) \in \bar B^1_t$ there is a $(y_1, y_2, \dots, y_{t-1}, -1) \in \bar B^2_t$, where
$y_1, y_2, \dots, y_{t-1}$ are the same in both sequences.

It follows that the terms in expressions (11.C.5), (11.C.6) are also in a one-to-one correspondence, which can be expressed by the following rule: exactly one
$q(x_1|x_0)\cdot q(x_2|x_1)\cdots q(x_{t-1}-1|x_{t-1})$ in expression (11.C.6) corresponds to
every $q(x_1|x_0)\cdot q(x_2|x_1)\cdots q(x_{t-1}+1|x_{t-1})$ in expression (11.C.5). Using the
above facts, we can rewrite expression (11.C.4), which is the sum of expressions
(11.C.5), (11.C.6), (11.C.7), as

$$Q_t(x_0) = \sum_{(x_1,\dots,x_t)\in X(\bar B^1_t)} q(x_1|x_0)\cdots q(x_{t-1}|x_{t-2})\cdot\left[q(x_{t-1}+1|x_{t-1}) + q(x_{t-1}-1|x_{t-1})\right]$$
$$+ \sum_{(x_1,\dots,x_t)\in X(\bar B^3_t)} q(x_1|x_0)\cdots q(x_{t-1}|x_{t-2})\cdot q(x_t|x_{t-1}). \qquad (11.C.8)$$

In expression (11.C.8), the terms in square brackets add to one, so they can be
replaced by $[\bar p(1) + \bar p(-1)]$ (which also equals one) without altering the value
of the expression; call the result expression (11.C.9). Suppose now that in the second sum of expression (11.C.9), each
term is replaced by $q(x_1|x_0)\cdots q(x_{t-1}|x_{t-2})\cdot\bar p(x_t - x_{t-1})$, i.e. $q(x_t|x_{t-1})$
is replaced by $\bar p(x_t - x_{t-1})$. Recall that for sequences in $X(\bar B^3_t)$ we have
$x_t = x_{t-1} - 1$; hence

$$\bar p(x_t - x_{t-1}) = \bar p(x_{t-1} - 1 - x_{t-1}) = \bar p(-1), \qquad q(x_t|x_{t-1}) = q(x_{t-1}-1|x_{t-1}).$$

On the other hand

$$\bar p(-1) = \bar\gamma(m)$$

and

$$q(x_{t-1}-1|x_{t-1}) = \begin{cases} p(x_{t-1}-1|x_{t-1}) = \gamma(x_{t-1}) & \text{if } x_{t-1} \ge m, \\ p(m-1|m) = \gamma(m) & \text{if } x_{t-1} < m. \end{cases}$$

If $x_{t-1} \ge m$, $q(x_{t-1}-1|x_{t-1}) = \gamma(x_{t-1}) \le \bar\gamma(m)$; if $x_{t-1} < m$, $q(x_{t-1}-1|x_{t-1}) = \gamma(m) \le \bar\gamma(m)$. In either case $q(x_{t-1}-1|x_{t-1}) \le \bar\gamma(m) = \bar p(x_{t-1}-1-x_{t-1})$. Hence, replacing all the $q(x_t|x_{t-1})$ terms with $\bar p(x_t - x_{t-1}) = \bar p(-1)$
terms, the expression is not decreased and it follows that

$$Q_t(x_0) \le \sum_{(x_1,\dots,x_t)\in X(\bar B^1_t)} q(x_1|x_0)\cdots q(x_{t-1}|x_{t-2})\cdot\left[\bar p(+1) + \bar p(-1)\right]$$
$$+ \sum_{(x_1,\dots,x_t)\in X(\bar B^3_t)} q(x_1|x_0)\cdot q(x_2|x_1)\cdots q(x_{t-1}|x_{t-2})\cdot\bar p(x_t - x_{t-1})$$
$$= \sum_{(x_1,\dots,x_t)\in A(x_0)} q(x_1|x_0)\cdots q(x_{t-1}|x_{t-2})\cdot\bar p(x_t - x_{t-1}).$$

Recall eq.(11.C.3), and now define

$$Q_{t-1}(x_0) \triangleq \sum_{(x_1,\dots,x_t)\in A(x_0)} q(x_1|x_0)\cdots q(x_{t-1}|x_{t-2})\cdot\bar p(x_t - x_{t-1}).$$

Then it is clear that

$$Q_t(x_0) \le Q_{t-1}(x_0).$$

So $Q_t(x_0)$ has $q(\cdot|\cdot)$ factors at every position up to position $t$, and is smaller
than $Q_{t-1}(x_0)$, which has $q(\cdot|\cdot)$ factors at every position up to position $t-1$
and a $\bar p$ factor in position $t$. The idea is to continue replacing $q(\cdot|\cdot)$ terms with
$\bar p(\cdot)$ terms, producing an increasing sequence $Q_t(x_0) \le Q_{t-1}(x_0) \le Q_{t-2}(x_0) \le \dots \le Q_0(x_0)$, where $Q_0(x_0)$ has only $\bar p(\cdot)$ terms.
Since the remaining steps of the replacement procedure are very similar to
the first one, we will only present briefly the second step. Define three new sets
as follows:

$$\bar B^1_{t-1} \triangleq \{(y_1, \dots, y_t) \in B \text{ and } y_{t-1} = 1\},$$
$$\bar B^2_{t-1} \triangleq \{(y_1, \dots, y_t) \in B \text{ and } y_{t-1} = -1 \text{ and } (y_1, \dots, y_{t-2}, 1, y_t) \in \bar B^1_{t-1}\},$$
$$\bar B^3_{t-1} \triangleq B - \bar B^1_{t-1} - \bar B^2_{t-1}.$$

These three sets partition $B$, i.e.

1. For $i, j = 1, 2, 3$ and $i \neq j$ we have $\bar B^i_{t-1} \cap \bar B^j_{t-1} = \emptyset$ and

2. $\bigcup_{i=1}^{3} \bar B^i_{t-1} = B$.

It is also true that the elements of $\bar B^1_{t-1}$ and $\bar B^2_{t-1}$ are in a one-to-one correspondence, that $\bar B^3_{t-1}$ is the set of sequences with $-1$ in the $t-1$ position for
which the no. of $-1$'s is exactly equal to $L$, and that $\bar B^2_{t-1}$ is the set of sequences with $-1$ in the $t-1$ position
for which the no. of $-1$'s is greater than $L$. The arguments to prove these claims
are much the same as the ones regarding $\bar B^i_t$ and will not be repeated.
Now let us continue the replacement procedure in the same way as in the
previous step:

$$Q_{t-1}(x_0) = \sum_{(x_1,\dots,x_t)\in X(B)} q(x_1|x_0)\cdots q(x_{t-1}|x_{t-2})\cdot\bar p(x_t - x_{t-1})$$
$$= \sum_{(x_1,\dots,x_t)\in X(\bar B^1_{t-1})} q(x_1|x_0)\cdots q(x_{t-1}|x_{t-2})\cdot\bar p(x_t - x_{t-1}) \qquad (11.C.10)$$
$$+ \sum_{(x_1,\dots,x_t)\in X(\bar B^2_{t-1})} q(x_1|x_0)\cdots q(x_{t-1}|x_{t-2})\cdot\bar p(x_t - x_{t-1}) \qquad (11.C.11)$$
$$+ \sum_{(x_1,\dots,x_t)\in X(\bar B^3_{t-1})} q(x_1|x_0)\cdots q(x_{t-1}|x_{t-2})\cdot\bar p(x_t - x_{t-1}). \qquad (11.C.12)$$

The argument previously used to group together expressions (11.C.5), (11.C.6)
can now be used (with respect to the sets $\bar B^1_{t-1}$, $\bar B^2_{t-1}$) to group expressions
(11.C.10), (11.C.11). Hence $Q_{t-1}(x_0)$ can be rewritten as

$$\sum_{(x_1,\dots,x_t)\in X(\bar B^1_{t-1})} q(x_1|x_0)\cdots q(x_{t-2}|x_{t-3})\cdot\left[q(x_{t-2}+1|x_{t-2}) + q(x_{t-2}-1|x_{t-2})\right]\cdot\bar p(x_t - x_{t-1}) \qquad (11.C.13)$$
$$+ \sum_{(x_1,\dots,x_t)\in X(\bar B^3_{t-1})} q(x_1|x_0)\cdots q(x_{t-1}|x_{t-2})\cdot\bar p(x_t - x_{t-1}). \qquad (11.C.14)$$

In eq.(11.C.13), the terms in square brackets add to one, so they can be replaced
by

$$\left[\bar p(x_{t-2}+1-x_{t-2}) + \bar p(x_{t-2}-1-x_{t-2})\right] = \left[\bar p(+1) + \bar p(-1)\right],$$

which also equals one. Also, in the sum in eq.(11.C.14), each term can be
replaced by $q(x_1|x_0)\cdot q(x_2|x_1)\cdots q(x_{t-2}|x_{t-3})\cdot\bar p(x_{t-1}-x_{t-2})\cdot\bar p(x_t - x_{t-1})$,
for the same reasons as previously. Hence, by replacing all the $q(x_{t-1}|x_{t-2})$
terms with $\bar p(x_{t-1} - x_{t-2})$ we find that $Q_{t-1}(x_0)$ is no greater than

$$\sum_{(x_1,\dots,x_t)\in X(\bar B^1_{t-1})} q(x_1|x_0)\cdots q(x_{t-2}|x_{t-3})\cdot\left[\bar p(+1) + \bar p(-1)\right]\cdot\bar p(x_t - x_{t-1})$$
$$+ \sum_{(x_1,\dots,x_t)\in X(\bar B^3_{t-1})} q(x_1|x_0)\cdots q(x_{t-2}|x_{t-3})\cdot\bar p(x_{t-1}-x_{t-2})\cdot\bar p(x_t - x_{t-1})$$
$$= \sum_{(x_1,\dots,x_t)\in A(x_0)} q(x_1|x_0)\cdots q(x_{t-2}|x_{t-3})\cdot\bar p(x_{t-1}-x_{t-2})\cdot\bar p(x_t - x_{t-1}).$$

If we define

$$Q_{t-2}(x_0) \triangleq \sum_{(x_1,\dots,x_t)\in A(x_0)} q(x_1|x_0)\cdots q(x_{t-2}|x_{t-3})\cdot\bar p(x_{t-1}-x_{t-2})\cdot\bar p(x_t - x_{t-1}),$$

then we have just proved that

$$Q_{t-1}(x_0) \le Q_{t-2}(x_0).$$
This completes the second step of the replacement procedure. Continuing in
this manner for $t-2$ more steps, we obtain

$$Q_t(x_0) \le Q_{t-1}(x_0) \le Q_{t-2}(x_0) \le \dots \le Q_0(x_0),$$

where

$$Q_0(x_0) = \sum_{(x_1,\dots,x_t)\in A(x_0)} \bar p(x_1 - x_0)\cdot\bar p(x_2 - x_1)\cdots\bar p(x_t - x_{t-1}).$$

Then it follows that

$$Q \le \sum_{x_0=m,m+1,\dots} R(x_0)\cdot\sum_{(x_1,x_2,\dots,x_t)\in A(x_0)} q(x_1|x_0)\cdots q(x_t|x_{t-1}) \le$$
$$\sum_{x_0=m,m+1,\dots} R(x_0)\cdot\sum_{(x_1,x_2,\dots,x_t)\in A(x_0)} \bar p(x_1 - x_0)\cdots\bar p(x_t - x_{t-1}) \qquad (11.C.15)$$
$$\le \sum_{x_0\in Z} R(x_0)\cdot\Pr\left(\frac{\sum_{s=1}^t \bar M^m_s}{t} \ge \varepsilon\right) = \Pr\left(\frac{\sum_{s=1}^t \bar M^m_s}{t} \ge \varepsilon\right), \qquad (11.C.16)$$

since

$$\sum_{x_0\in Z} R(x_0) = \sum_{x_0\in Z}\Pr(X_n = x_0 \mid \forall\tau\ge n,\ X_\tau\ge m) = 1.$$

The left-hand side $Q$ is exactly $\Pr\left(\frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge \varepsilon \mid \forall\tau\ge n,\ X_\tau\ge m\right)$, and
expression (11.C.16) is the probability that, for $s = 1, 2, \dots, t$, $\bar M^m_s = 1$ at least
$L = \varepsilon\cdot t$ times. In short, what has been proved is that

$$\Pr\left(\frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge \varepsilon \;\middle|\; \forall\tau\ge n\ X_\tau\ge m\right) \le \Pr\left(\frac{\sum_{s=1}^t \bar M^m_s}{t} \ge \varepsilon\right)$$

and the lemma has been proved. •
Long Run Behavior of $M^d_t$

The previous lemma compared the behavior of $M^d_t$ to that of $\bar M^m_t$ over finite
times. The next lemma tells us something about the behavior of $M^d_t$ in the
long run (and without connection to $\bar M^m_t$).

Lemma 11-11.C.3 If conditions A0, A1, A2 hold, then for all $m \in Z$, and for
all $n \in N$, we have

$$\Pr\left(\frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge 2\bar\gamma(m) \text{ i.o.} \;\middle|\; \forall\tau\ge n\ X_\tau\ge m\right) = 0.$$
Proof. The idea is to show that

$$\sum_{t=1}^{\infty}\Pr\left(\frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge 2\bar\gamma(m) \;\middle|\; \forall\tau\ge n\ X_\tau\ge m\right) < \infty.$$

Then, the conclusion of the Lemma will follow immediately from the Borel-Cantelli Lemma (see Mathematical Appendix). Now, from Lemma 11-11.C.2,
setting $\varepsilon = 2\bar\gamma(m)$, for any $m \in Z$, for any $n, t \in N$, we have that

$$\Pr\left(\frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge 2\bar\gamma(m) \;\middle|\; \forall\tau\ge n\ X_\tau\ge m\right) \le \Pr\left(\frac{\sum_{s=1}^{t} \bar M^m_s}{t} \ge 2\bar\gamma(m)\right).$$

Also, from Lemma 11-11.C.1, we have that for all $m \in Z$ there is some $t_m$ such
that for all $t \ge t_m$

$$\Pr\left(\frac{\sum_{s=1}^{t} \bar M^m_s}{t} \ge 2\bar\gamma(m)\right) < \frac{1}{t^2}.$$

So for any $m \in Z$, for any $n \in N$, and for any $t \ge t_m$, we have

$$\Pr\left(\frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge 2\bar\gamma(m) \;\middle|\; \forall\tau\ge n\ X_\tau\ge m\right) < \frac{1}{t^2}. \qquad (11.C.17)$$

Clearly

$$\sum_{t=t_m}^{\infty}\Pr\left(\frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge 2\bar\gamma(m) \;\middle|\; \forall\tau\ge n\ X_\tau\ge m\right) < \sum_{t=t_m}^{\infty}\frac{1}{t^2} < \infty.$$

From this it follows that

$$\sum_{t=1}^{\infty}\Pr\left(\frac{\sum_{s=n+1}^{n+t} M^d_s}{t} \ge 2\bar\gamma(m) \;\middle|\; \forall\tau\ge n\ X_\tau\ge m\right) < \infty$$

as well, which completes the proof. •


Proof of the Convergence Theorem

Now Theorem 11.2 can be proved using Lemma 11-11.C.3.

Proof of Theorem 11.2: Only the case $\lim_{t\to\infty} X_t = +\infty$ will be considered in
detail (the case $\lim_{t\to\infty} X_t = -\infty$ is proved in exactly the same manner). The
proof will proceed in four steps. First, we will show that

$$\Pr\left(\lim_{t\to\infty}\frac{N^d_t}{t} = 0 \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1. \qquad (11.C.18)$$

Second, we will show that

$$\Pr\left(\lim_{t\to\infty}\frac{N^{12}_t}{t} = 0 \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1 \qquad (11.C.19)$$

and that

$$\Pr\left(\lim_{t\to\infty}\frac{N^{21}_t}{t} = 0 \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1. \qquad (11.C.20)$$

Third, we will show that

$$\Pr\left(\lim_{t\to\infty}\frac{N^{11}_t}{t} = \pi_1 \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1 \qquad (11.C.21)$$

and that

$$\Pr\left(\lim_{t\to\infty}\frac{N^{22}_t}{t} = \pi_2 \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1. \qquad (11.C.22)$$

Finally, we will show that

$$\Pr\left(\lim_{t\to\infty}\frac{N^{12}_t}{N^{11}_t} = 0 \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1 \qquad (11.C.23)$$

and that

$$\Pr\left(\lim_{t\to\infty}\frac{N^{21}_t}{N^{22}_t} = 0 \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1. \qquad (11.C.24)$$

For $m, n = 0, 1, 2, \dots$ define the events

$$A_{mn} = \{\forall\tau\ge n\ X_\tau\ge m\}.$$

Then, conditional probabilities $\Pr(\dots \mid \forall\tau\ge n\ X_\tau\ge m)$ can be written as
$\Pr(\dots \mid A_{mn})$. From Lemma 11-11.C.3 it follows that for all $m, n \ge 0$ we have

$$\Pr\left(\frac{\sum_{s=n+1}^{n+t} M^d_s}{t} < 2\bar\gamma(m) \text{ a.a.} \;\middle|\; A_{mn}\right) = 1 \Rightarrow$$

$$\Pr\left(\exists t'_{nm}: \forall t > t'_{nm}\ \ \sum_{s=n+1}^{n+t} M^d_s < 2\bar\gamma(m)\cdot t \;\middle|\; A_{mn}\right) = 1 \Rightarrow$$

$$\Pr\left(\exists t'_{nm}: \forall t > t'_{nm}\ \ \sum_{s=1}^{n+t} M^d_s < 2\bar\gamma(m)\cdot t + n \;\middle|\; A_{mn}\right) = 1 \Rightarrow$$

$$\Pr\left(\exists t'_{nm}: \forall t > t'_{nm}\ \ \frac{\sum_{s=1}^{n+t} M^d_s}{n+t} < 2\bar\gamma(m)\cdot\frac{t}{n+t} + \frac{n}{n+t} \;\middle|\; A_{mn}\right) = 1.$$

From the above equation it follows that, for all $m, n \ge 0$, conditional on the event
$A_{mn}$, we have (with probability 1, for all $t > t'_{nm}$)

$$\frac{\sum_{s=1}^{n+t} M^d_s}{n+t} < 2\bar\gamma(m)\cdot\frac{t}{n+t} + \frac{n}{n+t}.$$

In addition, the following inequalities are true (obviously with probability 1
and for all $m, n \ge 0$):

$$\forall t:\ \frac{t}{t+n}\cdot 2\bar\gamma(m) < 2\bar\gamma(m), \qquad (11.C.25)$$

$$\forall t > t''_{nm}:\ \frac{n}{n+t} < \bar\gamma(m). \qquad (11.C.26)$$

Taking $t_{nm} = \max(t'_{nm}, t''_{nm})$, it follows that, for all $m, n \ge 0$,

$$\Pr\left(\exists t_{nm}: \forall t > t_{nm}\ \ \frac{\sum_{s=1}^{n+t} M^d_s}{n+t} \le 3\bar\gamma(m) \;\middle|\; A_{mn}\right) = 1 \Rightarrow \qquad (11.C.27)$$

$$\Pr\left(\frac{N^d_t}{t} \le 3\bar\gamma(m) \text{ a.a.} \;\middle|\; A_{mn}\right) = 1.$$

Define

$$A_m \triangleq \bigcup_{n=1}^{\infty} A_{mn} = \bigcup_{n=1}^{\infty}\{\forall\tau\ge n\ X_\tau\ge m\} = \{\exists n: \forall\tau\ge n\ X_\tau\ge m\} = \{X_\tau\ge m \text{ a.a.}\};$$

also, define

$$A \triangleq \bigcap_{m=1}^{\infty} A_m = \bigcap_{m=1}^{\infty}\{X_\tau\ge m \text{ a.a.}\} = \{\forall m\ge 1\ X_\tau\ge m \text{ a.a.}\}.$$

It follows immediately that $A = \left\{\lim_{\tau\to\infty} X_\tau = \infty\right\}$. Next, define for $m = 1, 2, \dots$
the events

$$B_m \triangleq \left\{\frac{N^d_t}{t} \le 3\bar\gamma(m) \text{ a.a.}\right\}, \qquad B \triangleq \bigcap_{m=1}^{\infty} B_m.$$

Note that, since $\lim_{m\to\infty}\bar\gamma(m) = 0$, and since $N^d_t \ge 0$, we have

$$B = \left\{\forall m\ge 1\ \ \frac{N^d_t}{t} \le 3\bar\gamma(m) \text{ a.a.}\right\} = \left\{\lim_{t\to\infty}\frac{N^d_t}{t} = 0\right\}.$$

Now, from eq.(11.C.27) it follows that

$$\Pr(B_m \mid A_{mn}) = 1 \Rightarrow \Pr(B_m \cap A_{mn}) = \Pr(B_m \mid A_{mn})\cdot\Pr(A_{mn}) = \Pr(A_{mn})$$

and so

$$\Pr(B_m \cap A_{mn}) = \Pr(A_{mn}). \qquad (11.C.28)$$
Note that for a fixed $m$ and for $n < n'$ we have $A_{mn} \subset A_{mn'}$. Then, from
Lemma A.2 it follows that

$$\lim_{n\to\infty}\Pr(A_{mn}) = \Pr\left(\bigcup_{n=1}^{\infty} A_{mn}\right) = \Pr(A_m). \qquad (11.C.29)$$

On the other hand, for a fixed $m$ and for $n < n'$ we have $B_m\cap A_{mn} \subset B_m\cap A_{mn'}$.
Hence

$$\lim_{n\to\infty}\Pr(B_m\cap A_{mn}) = \Pr\left(\bigcup_{n=1}^{\infty}(B_m\cap A_{mn})\right) = \Pr(B_m\cap A_m). \qquad (11.C.30)$$

Since

$$\Pr(B_m\cap A_{mn}) = \Pr(A_{mn}) \quad \text{for all } m, n \ge 0,$$

it follows from (11.C.28), (11.C.29) and (11.C.30) that for all $m \ge 1$

$$\Pr(A_m) = \Pr(A_m\cap B_m). \qquad (11.C.31)$$

For $m < m'$ we have: (a) $\{X_t\ge m' \text{ a.a.}\} \subset \{X_t\ge m \text{ a.a.}\}$, hence $A_{m'}\subset A_m$;
and (b) since $\bar\gamma(m)$ decreases monotonically to 0, $B_{m'}\subset B_m$. It follows that

$$\lim_{m\to\infty}\Pr(A_m) = \Pr\left(\bigcap_{m=1}^{\infty} A_m\right) = \Pr(A), \qquad (11.C.32)$$

$$\lim_{m\to\infty}\Pr(A_m\cap B_m) = \Pr\left(\bigcap_{m=1}^{\infty}(A_m\cap B_m)\right) = \Pr(A\cap B). \qquad (11.C.33)$$

Then, from eqs.(11.C.31), (11.C.32), (11.C.33) and the assumption that $\Pr(A) > 0$, it follows that

$$0 < \Pr(A) = \Pr(A\cap B) \Rightarrow \Pr(B \mid A) = \frac{\Pr(A\cap B)}{\Pr(A)} = 1.$$
In other words

$$\Pr\left(\forall m\ge 1\ \ \frac{N^d_t}{t} \le 3\bar\gamma(m) \text{ a.a.} \;\middle|\; \forall m\ge 1\ X_\tau\ge m \text{ a.a.}\right) = 1 \Rightarrow$$

$$\Pr\left(\lim_{t\to\infty}\frac{N^d_t}{t} = 0 \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1. \qquad (11.C.34)$$

This completes the proof of eq.(11.C.18).

Since we have

$$\Pr\left(0 \le N^{12}_t \le N^{12}_t + N^{21}_t = N^d_t \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1,$$

it follows that

$$\Pr\left(0 \le \limsup_{t\to\infty}\frac{N^{12}_t}{t} \le \lim_{t\to\infty}\frac{N^d_t}{t} = 0 \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1 \Rightarrow$$

$$\Pr\left(\lim_{t\to\infty}\frac{N^{12}_t}{t} = 0 \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1. \qquad (11.C.35)$$

This completes the proof of eq.(11.C.19); eq.(11.C.20) is proved similarly.


From the Law of Large Numbers we have

$$\Pr\left(\lim_{t\to\infty}\frac{N^{11}_t + N^{12}_t}{t} = \pi_1\right) = 1 \Rightarrow \Pr\left(\lim_{t\to\infty}\frac{N^1_t}{t} = \pi_1\right) = 1, \qquad (11.C.36)$$

where $N^1_t = N^{11}_t + N^{12}_t$ is the total number of source-1 samples. Note that eq.(11.C.36) refers to an unconditional probability. However, by assumption

$$\Pr\left(\lim_{\tau\to\infty} X_\tau = +\infty\right) > 0. \qquad (11.C.37)$$

Using eqs.(11.C.36), (11.C.37) and invoking Lemma A.6 (see Mathematical
Appendix) we conclude that

$$\Pr\left(\lim_{t\to\infty}\frac{N^1_t}{t} = \pi_1 \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1. \qquad (11.C.38)$$

Since

$$\frac{N^{11}_t}{t} = \frac{N^{11}_t + N^{12}_t}{t} - \frac{N^{12}_t}{t} = \frac{N^1_t}{t} - \frac{N^{12}_t}{t}, \qquad (11.C.39)$$

by taking limits in eq.(11.C.39) and using eqs.(11.C.35), (11.C.38) it follows
that

$$\Pr\left(\lim_{t\to\infty}\frac{N^{11}_t}{t} = \pi_1 \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1.$$

Similarly we can prove

$$\Pr\left(\lim_{t\to\infty}\frac{N^{22}_t}{t} = \pi_2 \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1.$$

This completes the proof of eq.(11.C.21); eq.(11.C.22) is proved similarly.


Assuming that the following limits exist, we have

$$\lim_{t\to\infty}\frac{N^{12}_t}{N^{11}_t} = \lim_{t\to\infty}\frac{N^{12}_t/t}{N^{11}_t/t} = \frac{\lim_{t\to\infty} N^{12}_t/t}{\lim_{t\to\infty} N^{11}_t/t} = \frac{0}{\pi_1} = 0.$$

Since the limits exist with conditional probability one, we conclude that

$$\Pr\left(\lim_{t\to\infty}\frac{N^{12}_t}{N^{11}_t} = 0 \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1.$$

By an exactly analogous procedure it can be shown that

$$\Pr\left(\lim_{t\to\infty}\frac{N^{21}_t}{N^{22}_t} = 0 \;\middle|\; \lim_{\tau\to\infty} X_\tau = +\infty\right) = 1.$$

This completes the proof of eqs.(11.C.23) and (11.C.24), and hence the part of the theorem
which refers to the case $\lim_{\tau\to\infty} X_\tau = +\infty$ is complete.
The part of the proof concerning the case $\lim_{\tau\to\infty} X_\tau = -\infty$ follows exactly
the same pattern as the previously presented results, requiring the proof of additional lemmas, corresponding to Lemmas 11-11.C.1, 11-11.C.2 and 11-11.C.3.
This is omitted for the sake of brevity. •
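The bookkeeping in the proof above can be checked with a synthetic experiment. In the sketch below, the misallocation probability is simply forced to vanish over time (a hypothetical stand-in for the consequence of $X_t \to +\infty$); the counters then satisfy $N^{11}_t/t \to \pi_1$, $N^{22}_t/t \to \pi_2$ and $N^{12}_t/N^{11}_t \to 0$, exactly as Theorem 11.2 asserts.

```python
import random

# Synthetic check of the limits in Theorem 11.2.  A source-i sample is
# misallocated (sent to the other predictor) with a vanishing probability;
# the decay rate below is hypothetical.

random.seed(1)
pi1 = 0.6
T = 200_000
N = {(i, j): 0 for i in (1, 2) for j in (1, 2)}   # N[(i, j)] ~ N_t^{ij}
for t in range(1, T + 1):
    src = 1 if random.random() < pi1 else 2
    mis = random.random() < min(1.0, 10.0 / t)    # vanishing misallocation
    pred = (3 - src) if mis else src
    N[(src, pred)] += 1

assert abs(N[(1, 1)] / T - pi1) < 0.01        # N^{11}/t -> pi_1
assert abs(N[(2, 2)] / T - (1 - pi1)) < 0.01  # N^{22}/t -> pi_2
assert N[(1, 2)] / N[(1, 1)] < 0.01           # N^{12}/N^{11} -> 0
```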
12 CONVERGENCE OF SERIAL DATA
ALLOCATION

In this chapter we examine the convergence of the serial data allocation scheme
presented in Chapter 10. This chapter develops along lines parallel to Chapter
11, where the parallel data allocation scheme was treated. We will provide
conditions which are sufficient to ensure the convergence of the serial data
allocation scheme.

12.1 THE CASE OF TWO SOURCES

The study of convergence of the serial data allocation scheme starts by consid-
ering the case of two sources and two predictors. In other words, we consider,
for the time being, a slightly modified form of the serial data allocation scheme,
where all incoming data are either accepted by the first predictor, or passed
to the second predictor, which must necessarily accept them. The general ver-
sion of the scheme, with K sources and a variable number of predictors will be
discussed in the next section.

It must be emphasized that in the version of the serial data allocation scheme
which is considered in this section, all data allocation decisions depend on the
performance of the first predictor.

V. Petridis et al., Predictive Modular Neural Networks


© Kluwer Academic Publishers 1998

12.1.1 Some Important Processes

In the case of two active sources the source process $Z_t$ takes values in $\{1, 2\}$.
Recall that time series generation takes place according to the equation

$$y_t = F_{Z_t}(y_{t-1}, y_{t-2}, \dots).$$

The source process $Z_t$ is exactly analogous to the one used in Chapter 11;
namely, it takes values in $\{1, 2\}$ and at time $t$ we have $\Pr(Z_t = i) = \pi_i$, $i = 1, 2$,
where it is assumed that for $i = 1, 2$ we have $0 < \pi_i < 1$. The data allocation
processes $M^{ij}_t$, $N^{ij}_t$ are also defined in exactly the same manner as in Chapter
11.

However, the specialization process $X_t$ is now defined slightly differently from
the parallel data allocation case. As has already been remarked, in the case of
serial data allocation with two predictors, allocation of $y_t$ depends entirely on
the behavior of the first predictor. Hence the variable $X_t$ denotes the difference
between the number of source 1 data and source 2 data assigned to predictor
1, i.e.

$$X_t = N^{11}_t - N^{21}_t.$$

We also define the process $V_t$ as in the previous chapter:

$$V_t \triangleq X_t - X_{t-1}. \qquad (12.1)$$

$V_t$ and $X_t$ satisfy, as previously, $X_t = X_{t-1} + V_t$.
We will not use the process $\bar M_t$ in this chapter; instead we will work with the
processes $\bar M^{ij}_t$.
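The serial dynamics just described can be simulated directly: $X_t = N^{11}_t - N^{21}_t$ moves $+1$ when predictor 1 accepts a source-1 sample and $-1$ when it accepts a source-2 sample. The acceptance curves $a_n$ (increasing to 1) and $b_n$ (decreasing to 0) used below are hypothetical choices consistent with the specialization intuition that follows; every simulated path specializes in one of the two sources.

```python
import math
import random

# Simulation of the two-predictor serial allocation dynamics.  The curves
# a(n), b(n) below are hypothetical; a numerically stable sigmoid is used
# to avoid overflow for very negative states.

random.seed(2)
pi1 = 0.5

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

a = lambda n: sigmoid(n / 5.0)     # Pr(pred. 1 accepts a source-1 sample)
b = lambda n: 1.0 - a(n)           # Pr(pred. 1 accepts a source-2 sample)

finals = []
for _ in range(100):
    x = 0
    for _ in range(5000):
        if random.random() < pi1:              # source-1 sample
            if random.random() < a(x):
                x += 1                         # accepted by predictor 1
        else:                                  # source-2 sample
            if random.random() < b(x):
                x -= 1                         # accepted by predictor 1
    finals.append(x)

assert all(abs(x) > 100 for x in finals)   # predictor 1 specializes
assert any(x > 0 for x in finals) and any(x < 0 for x in finals)
```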

12.1.2 Some Important Assumptions


Let us now consider the significance of the variable X t . Consider the following
possibilities.

1. If X_t is positive and large, then predictor no.1 specializes in source no.1
(N_t^{11} ≫ N_t^{21}).

2. If X_t is negative and large, then predictor no.1 specializes in source no.2
(N_t^{21} ≫ N_t^{11}).

It follows that if the absolute value of X_t is large, then predictor no.1
specializes in one of the two sources and hence it will tend to accept data
from this source and reject all other data. For instance, if (at time t) X_t is
large and positive, predictor no.1 has specialized in source no.1; then it will
be likely to accept further samples from source no.1 and reject samples from
source no.2; this means that the term N_t^{11} - N_t^{21} will be likely to increase.
CONVERGENCE OF SERIAL DATA ALLOCATION 211

At the same time, source no.2 data (rejected by predictor no.1) will certainly
be accepted by predictor no.2, which will result in specialization of predictor
no.2 to source no.2. It may be expected that, under certain conditions, this
process will reinforce itself, resulting, in the limit of infinitely many samples,
in "absolute" specialization of both predictors.

To test this conjecture mathematically, in complete analogy to the parallel
data allocation case, we introduce three assumptions.
B0. For i = 1,2 the following is true:

Pr(Pred. nr.i accepts Y_t | Z_t, Z_{t-1}, ..., X_{t-1}, X_{t-2}, ..., Y_{t-1}, Y_{t-2}, ...) =
Pr(Pred. nr.i accepts Y_t | Z_t, X_{t-1}).

In other words, it is assumed that assignment of Y_t only depends on the
currently active source and the current level of specialization. This is, of course,
exactly analogous to assumption A0.

Second, we define data allocation probabilities, which are however somewhat
different than in the parallel data allocation case:

a_n ≜ Pr(Pred. nr.1 accepts Y_t | Z_t = 1, X_{t-1} = n),
b_n ≜ Pr(Pred. nr.1 accepts Y_t | Z_t = 2, X_{t-1} = n).

Notice that, while a_n is exactly the same as in Chapter 11, now b_n is the
probability that predictor no.1 (rather than predictor no.2) accepts a sample
from source no.2, given that so far it has accepted n more samples from source
no.1 than from source no.2.
From assumption B0 it follows that X_t is a Markovian process on Z. The
transition probabilities of X_t are defined by

P_{m,n} = Pr(X_t = n | X_{t-1} = m)

and can be computed explicitly as follows (for n = ..., -1, 0, 1, ...):

P_{n,n+1} = Pr(X_t = n+1 | X_{t-1} = n) =
Pr(Z_t = 1) · Pr(Pred. nr.1 accepts Y_t | Z_t = 1, X_{t-1} = n) ⇒ P_{n,n+1} = π_1·a_n;

also

P_{n,n-1} = Pr(X_t = n-1 | X_{t-1} = n) =
Pr(Z_t = 2) · Pr(Pred. nr.1 accepts Y_t | Z_t = 2, X_{t-1} = n) ⇒ P_{n,n-1} = π_2·b_n;

and

P_{n,n} = Pr(X_t = n | X_{t-1} = n) =
Pr(Z_t = 1) · Pr(Pred. nr.1 does not accept Y_t | Z_t = 1, X_{t-1} = n) +
Pr(Z_t = 2) · Pr(Pred. nr.1 does not accept Y_t | Z_t = 2, X_{t-1} = n) ⇒
P_{n,n} = π_1·(1 - a_n) + π_2·(1 - b_n).

Transition probabilities for all other m, n, such that |n - m| > 1, must be equal
to zero. In short we have

P_{n,n+1} = π_1·a_n,
P_{n,n-1} = π_2·b_n,
P_{n,n} = π_1·(1 - a_n) + π_2·(1 - b_n),
P_{n,m} = 0 if |n - m| > 1.
Now, regarding the probabilities a_n, b_n, the following assumptions are made.

B1. For all n, a_n > 0, and lim_{n→+∞} a_n = 1, lim_{n→-∞} a_n = 0.

B2. For all n, b_n > 0, and lim_{n→+∞} b_n = 0, lim_{n→-∞} b_n = 1.

In other words we have made the following assumptions.

1. As the specialization level increases to plus infinity (which means that
predictor no.1 has received a lot more data from source no.1 than from source
no.2), predictor no.1 is very likely to accept an additional sample from source
no.1, while it is very unlikely to accept a sample from source no.2.

2. Similarly, as the specialization level decreases to minus infinity (which means
that predictor no.1 has received a lot more data from source no.2 than from
source no.1), predictor no.1 is very likely to accept an additional sample
from source no.2, while it is very unlikely to accept a sample from source
no.1.
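As a concrete illustration (not taken from the book), acceptance curves of logistic form satisfy B1 and B2; the sketch below uses a hypothetical steepness parameter lam and checks the required limits numerically.

```python
import math

def a(n, lam=0.5):
    # Pr(predictor no.1 accepts a source-1 sample | X_{t-1} = n);
    # rises to 1 as n -> +inf and falls to 0 as n -> -inf, as B1 requires.
    z = max(-40.0, min(40.0, lam * n))   # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def b(n, lam=0.5):
    # Pr(predictor no.1 accepts a source-2 sample | X_{t-1} = n);
    # mirror image of a_n, as B2 requires.
    return a(-n, lam)

# B1: a_n > 0 everywhere, a_n -> 1 (n -> +inf), a_n -> 0 (n -> -inf)
assert all(a(n) > 0 for n in range(-200, 201))
assert a(200) > 0.999 and a(-200) < 0.001
# B2: b_n > 0 everywhere, b_n -> 0 (n -> +inf), b_n -> 1 (n -> -inf)
assert all(b(n) > 0 for n in range(-200, 201))
assert b(200) < 0.001 and b(-200) > 0.999
print("B1, B2 hold for the logistic curves")
```

Whether a given predictor/training-algorithm/threshold combination actually produces acceptance probabilities of this kind is exactly the practical question discussed in the text.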

These are reasonable assumptions; their justification is similar to that of
assumptions A1, A2 and they may fail to materialize for reasons similar to
the ones discussed in Section 11.1. Note that, once again, the validity of these
conditions will depend on the combination of sources/ predictors/ training
algorithm/ data allocation threshold. However, there is one important difference
between A1, A2 and B1, B2. Namely, B1, B2 depend on both sources, but
only on the first predictor, since serial data allocation depends only on the
performance of the first predictor. Finally, we can once again invoke the
inhomogeneous random walk paradigm to illustrate the behavior of specialization
(see Figure 12.1).
Consider Figure 12.1. An imaginary particle moves along the line of integers;
at every time step it may move one step to the left or to the right, or it may
stay in place. The probability of each such event taking place depends on the
position of the particle. V_t again describes the inhomogeneous random walk;

Figure 12.1. The specialization process is an inhomogeneous random walk on the integers .


now there are three possibilities for the imaginary particle: it may move one
step to the right or left, or stay in place:

Pr(V_t = 1 | X_{t-1} = n) = P_{n,n+1},
Pr(V_t = 0 | X_{t-1} = n) = P_{n,n},
Pr(V_t = -1 | X_{t-1} = n) = P_{n,n-1}.

Generally, when the particle is "far to the right", it is more likely to keep
moving to the right, than to stay in place or move to the left. Conversely,
when the particle is "far to the left", moves to the left are preferred. While it
seems reasonable that the particle will wander off either to the far right or to
the far left, the possibility of oscillation cannot be excluded and a more precise
analysis is required. The results of the analysis are Theorems 12.1 and 12.2
presented in the next section; the proof of these theorems is presented at the
end of the chapter.
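The random-walk picture is easy to check by simulation. The sketch below (our illustration, not the book's) drives X_t with the transition probabilities P_{n,n+1} = π_1·a_n and P_{n,n-1} = π_2·b_n, using hypothetical logistic acceptance curves; typical runs wander off toward plus or minus infinity instead of oscillating around the origin.

```python
import math
import random

def sig(z):
    z = max(-40.0, min(40.0, z))   # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def simulate_X(T=20000, pi1=0.6, lam=0.5, seed=0):
    """Simulate the specialization process X_t = N_t^{11} - N_t^{21}."""
    rng = random.Random(seed)
    X = 0
    for _ in range(T):
        if rng.random() < pi1:                 # source 1 active
            if rng.random() < sig(lam * X):    # predictor 1 accepts: step right
                X += 1
        else:                                  # source 2 active
            if rng.random() < sig(-lam * X):   # predictor 1 accepts: step left
                X -= 1
    return X

# Final positions are far from the origin: no oscillation in these runs.
print([simulate_X(seed=s) for s in range(5)])
```

Of course a few sample paths do not prove anything; the precise statement is Theorem 12.1 below.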

12.1.3 Convergence Results


Two theorems will be proved. The first theorem regards the behavior of X_t.

Theorem 12.1 If conditions B0, B1, B2 hold, then

∀m ∈ Z: Pr(X_t = m i.o.) = 0,    (12.2)

Pr(lim_{t→∞} |X_t| = +∞) = 1,    (12.3)

Pr(lim_{t→∞} X_t = +∞) + Pr(lim_{t→∞} X_t = -∞) = 1.    (12.4)

In this theorem, the most important conclusion is (12.4): if conditions B0,
B1 and B2 hold, then there are two possibilities in the long run.

X_t → +∞: Predictor no.1 will accumulate a lot more source no.1 samples than
source no.2 samples.

X_t → -∞: Predictor no.1 will accumulate a lot more source no.2 samples than
source no.1 samples.

The total probability that one of these two events will take place is one, i.e.
predictor no.1 will certainly specialize in one of the two sources.

Notice that Theorem 12.1 does not quite say that both predictors will specialize.
In fact, however, Theorem 12.2 implies that both predictors will specialize,
each in a different source, and that the specialization is stronger than that
implied by Theorem 12.1.

Theorem 12.2 If conditions B0, B1, B2 hold, then

1. If Pr(lim_{t→∞} X_t = +∞) > 0 then

Pr( lim_{t→∞} N_t^{21}/N_t^{11} = 0 | lim_{t→∞} X_t = +∞ ) = 1,    (12.5)

Pr( lim_{t→∞} N_t^{12}/N_t^{22} = 0 | lim_{t→∞} X_t = +∞ ) = 1.    (12.6)

2. If Pr(lim_{t→∞} X_t = -∞) > 0 then

Pr( lim_{t→∞} N_t^{11}/N_t^{21} = 0 | lim_{t→∞} X_t = -∞ ) = 1,    (12.7)

Pr( lim_{t→∞} N_t^{22}/N_t^{12} = 0 | lim_{t→∞} X_t = -∞ ) = 1.    (12.8)

Theorem 12.2 states that with probability one both predictors will specialize,
one in each source and in the "strong" ratio sense, as already discussed in
Chapter 11. Since, by Theorem 12.1, X t goes either to +00 or to -00, it follows
that specialization of both predictors (one in each source) is guaranteed.
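The "ratio" specialization of Theorem 12.2 can also be observed numerically. The following sketch (our illustration, again with hypothetical logistic acceptance curves) tracks the four counters N_t^{ij}; whichever way X_t diverges, one of the two ratio pairs of (12.5)-(12.8) collapses toward zero.

```python
import math
import random

def sig(z):
    z = max(-40.0, min(40.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def serial_counts(T=50000, pi1=0.5, lam=1.0, seed=1):
    """N[(i, j)] counts source-i samples allocated to predictor j; predictor 2
    must accept whatever predictor 1 rejects (two-predictor serial scheme)."""
    rng = random.Random(seed)
    N = {(i, j): 0 for i in (1, 2) for j in (1, 2)}
    X = 0
    for _ in range(T):
        src = 1 if rng.random() < pi1 else 2
        accept = sig(lam * X) if src == 1 else sig(-lam * X)
        if rng.random() < accept:       # predictor 1 accepts
            N[(src, 1)] += 1
            X += 1 if src == 1 else -1
        else:                           # passed on to predictor 2
            N[(src, 2)] += 1
    return N

N = serial_counts()
case_plus = max(N[(2, 1)] / max(N[(1, 1)], 1), N[(1, 2)] / max(N[(2, 2)], 1))   # (12.5)-(12.6)
case_minus = max(N[(1, 1)] / max(N[(2, 1)], 1), N[(2, 2)] / max(N[(1, 2)], 1))  # (12.7)-(12.8)
print(min(case_plus, case_minus))   # near zero: one specialization pattern holds
```

Which of the two cases occurs depends on the random early history of the run, in agreement with the two-branch statement of the theorem.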

12.2 THE CASE OF MANY SOURCES


Let us now consider the convergence of the serial data allocation scheme when
more than two sources are active. This case is handled by recursive application
of the data allocation scheme, as explained in Chapter 10; it follows that the
number of predictors is variable.

To be specific, suppose that there are K active sources, i.e. the source set is
Θ = {1, 2, ..., K}. The serial data allocation scheme starts with two predictors.
Suppose that there is a partition of Θ, say Θ_1 = {k_1} and Θ_2 = Θ - Θ_1. For
simplicity of notation, we can assume that k_1 is equal to 1. Since the k-th
source has an input/output relationship which can be described by

Y_t = F_k(Y_{t-1}, Y_{t-2}, ..., Y_{t-M}),

we can consider two sources; the first one has the form

Y_t = F_1(Y_{t-1}, Y_{t-2}, ..., Y_{t-M})    (12.9)

and the second is a composite source of the form

Y_t = Σ_{k=2}^{K} 1(Z_t = k) · F_k(Y_{t-1}, Y_{t-2}, ..., Y_{t-M}).    (12.10)

Hence eqs. (12.9), (12.10) can be written in the source-dependent form

Y_t = F̃_{Z̃_t}(Y_{t-1}, Y_{t-2}, ..., Y_{t-M}).    (12.11)

In other words, we can consider Y_t to be produced by a new ensemble of two
successively activated sources, where source activation is denoted by the variable
Z̃_t, taking values in {1,2}, and the new source no.2 is actually a composite of
simpler sources. In this case, we can apply the results presented in the previous
section, regarding the two sources case. Hence, if predictor type and training
algorithm are selected so that the partition/ predictors/ training algorithm/
threshold combination satisfies assumptions BO, Bl, B2, then data allocation
will be convergent in the sense of Theorem 12.2. Hence the incoming data will
be separated into two sets; one set will contain predominantly data generated
by the simple source no.1 and the other set will contain predominantly data
generated by the composite source no.2. To be more precise, if the variables
N_t^{ij} (i, j = 1,2) have the meaning explained in the previous section with
respect to source no.1 and composite source no.2, then with probability one we
will have either

lim_{t→∞} N_t^{21}/N_t^{11} = 0,  lim_{t→∞} N_t^{12}/N_t^{22} = 0,

or

lim_{t→∞} N_t^{11}/N_t^{21} = 0,  lim_{t→∞} N_t^{22}/N_t^{12} = 0.

Consider, to be specific, the first case. In this case, the proportion of data
generated by composite source no.2 and collected by predictor no.1 goes to zero.
Suppose now that after sufficient time has elapsed, a third predictor is added
to the serial data allocation scheme. The data reaching the second and third
predictors will contain a vanishingly small proportion of source no.1 samples.
Hence Theorems 12.1 and 12.2 can now be applied to the pair of predictors
no.2 and no.3 and we can conclude that, if conditions B0, B1, B2 hold true
for this new combination of source no.2, predictors and training algorithm,
predictor no.2 will specialize either in a simple source no. k_2 or in the composite
source {2,3,4,...,K} - {k_2}. In fact, without loss of generality, we can assume
that k_2 = 2. Then it follows that either predictor no.2 or predictor no.3 will
specialize in source no.2. We can continue adding predictors after sufficient
time has elapsed and, by the previous argument, we can expect that, as long as
conditions B0, B1, B2 are satisfied for each active source (and the predictors/
training algorithm/ threshold combination), and given sufficient time, the serial
data allocation algorithm will identify the K sources.
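The recursive argument can be illustrated with a toy simulation (ours, not the book's). Real PREMONN predictors accept or reject on prediction error; the sketch below shortcuts this by letting each predictor's acceptance probability be a logistic function of its accepted-sample counts, and appends a new predictor to the chain after a fixed delay. With K = 4 sources, each predictor in the chain ends up collecting predominantly the data of one source.

```python
import math
import random

def sig(z):
    z = max(-40.0, min(40.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def recursive_serial(K=4, T=60000, lam=1.0, seed=2):
    rng = random.Random(seed)
    counts = [[0] * K]                       # counts[i][k]: source-k samples kept by predictor i
    for t in range(T):
        if len(counts) < K and (t + 1) % (T // K) == 0:
            counts.append([0] * K)           # add a predictor after sufficient time
        k = rng.randrange(K)                 # active source (uniform activation)
        for i, c in enumerate(counts):
            margin = c[k] - max(c[j] for j in range(K) if j != k)
            if i == len(counts) - 1 or rng.random() < sig(lam * margin):
                c[k] += 1                    # the last predictor in the chain must accept
                break
    return counts

counts = recursive_serial()
owners = [max(range(len(counts)), key=lambda i: counts[i][k]) for k in range(4)]
print(owners)
```

The margin-based acceptance rule is only a stand-in for the error-based thresholding of Chapter 10; it is chosen so that, for two sources, it reduces to acceptance probabilities of the a_n, b_n form.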
Since assumptions B0, B1, B2 have to be satisfied by each active source
separately, it may be considered that serial data allocation has a higher chance
of success than parallel data allocation. On the other hand, the enhanced
competition of parallel data allocation (recall that all predictors compete for
the same data) may result in improved performance.

12.3 CONCLUSIONS

In this chapter we have presented conditions sufficient to ensure convergence of
the serial data allocation scheme. We have first treated the case of two active
sources and then generalized our results to the case of K sources. The
convergence conditions are expressed in terms of allocation probabilities. It must
be stressed that the validity of these conditions depends on the combination
of sources/ predictor type/ training algorithm/ threshold. However, these
conditions may be somewhat easier to satisfy than the corresponding ones in the
parallel data allocation case, since they depend on each active source separately.
For the case of two sources, we have presented two convergence theorems.
Theorem 12.1 describes convergence of the specialization variable X t , which
takes place with probability one. Since specialization is expressed in terms of
the difference in the number of samples generated from each active source, the
conclusion of Theorem 12.1 is that the first predictor will collect "many more"
data from one particular source than from the other; actually the difference
goes to infinity. Theorem 12.2 indicates that if one predictor specializes in the
"difference" sense, then both predictors will specialize in the stronger "ratio"
sense. In other words, a one-to-one association of predictors and sources will
emerge so that, for a particular predictor, the data allocated to each predictor
will satisfy the following: the ratio of the number of data generated by any
source (except the one associated with this predictor) over the number of data
generated by the source associated with this predictor will go to zero with
probability one. In this sense each predictor will specialize in exactly one
source. The final conclusion results from combination of the results of Theorems
12.1 and 12.2: since one predictor will specialize in the difference sense, both
predictors will specialize in the ratio sense.
Regarding the case of many sources, we can conclude that convergence will
take place (as long as conditions B0, B1, B2 are satisfied) by repeatedly
applying the above argument to successively refined partitions of the source set.
We stress once again that the convergence results presented here depend on
the validity of assumptions B0, B1, B2, which will only hold true if considerable
judiciousness is exercised in setting up the serial data allocation scheme.
However, this requirement may be less stringent than it appears: as we have
seen in Chapter 10, very satisfactory results have been obtained by serial data
allocation.
At any rate, the main importance of our convergence results lies in using
mathematical analysis to illuminate the various factors which influence
convergence of the serial data allocation process, rather than in offering practical
design guidelines. In particular, an important conclusion can be phrased in
terms of the random walk paradigm presented earlier. The particle that
represents the evolution of the specialization process will not oscillate around the
origin but will wander off either to plus or minus infinity.

Appendix 12.A: General Plan of the Proof


Since the proofs of Theorems 12.1 and 12.2 are quite similar to those of Theorems
11.1 and 11.2, some arguments in the following sections will be presented rather
briefly. The reader is advised to read carefully the corresponding sections of
Chapter 11. On the other hand, there are some rather subtle differences
between the cases of parallel and serial data allocation, and some additional care
is required in following the proofs in the serial case. Since the proofs are quite
lengthy, it is useful to first outline the general idea of the proof and point out
the differences between the serial and parallel cases.
Proof of Theorem 12.1

As already discussed, our analysis hinges on the specialization process X_t.
This is a Markovian process; in fact, it is an inhomogeneous random walk, i.e.

1. the value of X_t may be taken to indicate the position at time t of an
imaginary particle moving on the line of (positive and negative) integers.

2. At every time t the particle will move only one step (either to the left or to
the right) or stay where it is.

3. The probabilities of moving (left or right) or staying in place change with
the position of the particle.

Using the source activation probabilities π_1 and π_2 and the acceptance
probabilities a_n and b_n, the transition probabilities of X_t can be computed. Our first
goal is to establish that X_t does not pass from any particular state infinitely
often. Technically, this is expressed by saying that X_t is a transient Markov
process. This result is established by Theorem 11-11.B.1 (which was presented
in Chapter 11) and Lemma 12-12.B.1, which is exactly analogous to Lemma
11-11.B.2.

Having established that X_t does not pass infinitely often from any particular
state, it follows that X_t must spend most of its time at either plus or minus
infinity. This is shown in Theorem 12.1, along with the fact that X_t cannot
oscillate between plus and minus infinity, since in that case it would also pass
through the intermediate states infinitely often. In conclusion, either X_t → +∞
or X_t → -∞.
Proof of Theorem 12.2
The proof of this theorem follows very closely the proof of Theorem 11.2 of
Chapter 11. However, in this case each argument presented in Chapter 11 has to
be repeated three times, since the specialization process may not only increase
or decrease, but also remain unchanged. At any rate, we take several shortcuts
in the presentation of the proof of Theorem 12.2, since some arguments are
repeated several times; for this reason the reader is advised to review the proof
of Theorem 11.2.

Theorem 12.2 describes the behavior of the processes N_t^{ij}. At every
time t, these processes increase or remain unchanged (but cannot decrease)
with certain probabilities which depend on X_t, i.e. on the current
specialization. Rather than examining the N_t^{ij} processes directly, we introduce
auxiliary processes which are independent and hence fairly easy to analyze.
Specifically, we introduce three random walks: V̄_t, V̲_t, Ṽ_t.

1. V̄_t is used to bound the probability of "N_t^{21}/t being greater than a
certain number". This is done using Lemmas 12-12.D.1, 12-12.D.2 and 12-12.D.3.

2. V̲_t is used to bound the probability of "N_t^{11}/t being smaller than a
certain number". This is done using Lemmas 12-12.E.1, 12-12.E.2 and 12-12.E.3.

3. Ṽ_t is used to bound the probability of "N_t^{12}/t being greater than a
certain number". This is done using Lemmas 12-12.F.1, 12-12.F.2 and 12-12.F.3.

Lemmas 12-12.D.3, 12-12.E.3 and 12-12.F.3 are then used to prove Theorem
12.2, by showing that the following events, conditional on X_t tending to +∞,
happen with probability one:

1. N_t^{21}/t goes to zero;

2. hence N_t^{22}/t goes to π_2;

3. also N_t^{11}/t goes to π_1;

4. hence N_t^{12}/t goes to zero;

5. and finally both N_t^{21}/N_t^{11} and N_t^{12}/N_t^{22} go to zero.

By exactly analogous arguments, similar results can be obtained in case X_t
goes to -∞.

Appendix 12.B: Convergence of X_t

The proof of Theorem 12.1 is practically identical to that of Theorem 11.1. One
small difference exists in establishing that eq. (12.B.1) has nontrivial admissible
solutions; this difference results from the different form of the data allocation
probabilities.

Lemma 12-12.B.1 Suppose that conditions B0, B1, B2 hold and the
specialization process X_t has transition probability matrix P = [P_{m,n}]_{m,n∈Z}. Then
the system

u_n = Σ_{m∈Z-{0}} P_{n,m}·u_m,  n ∈ Z - {0}    (12.B.1)

has a nonzero solution u_n, n ∈ Z - {0}, with 0 ≤ u_n ≤ 1.


Proof. The system of eq. (12.B.1) reduces to two decoupled systems, one
defined by eqs. (12.B.2) and (12.B.3), and another defined by eqs. (12.B.4) and
(12.B.5):

u_1 = P_{1,1}·u_1 + P_{1,2}·u_2,    (12.B.2)

u_n = P_{n,n-1}·u_{n-1} + P_{n,n}·u_n + P_{n,n+1}·u_{n+1},  n = 2, 3, ...;    (12.B.3)

u_{-1} = P_{-1,-1}·u_{-1} + P_{-1,-2}·u_{-2},    (12.B.4)

u_n = P_{n,n-1}·u_{n-1} + P_{n,n}·u_n + P_{n,n+1}·u_{n+1},  n = -2, -3, ....    (12.B.5)

It will be shown that each of the above systems has an admissible solution.
Start with eqs. (12.B.2) and (12.B.3). Eq. (12.B.3) implies

(P_{n,n-1} + P_{n,n} + P_{n,n+1})·u_n = P_{n,n-1}·u_{n-1} + P_{n,n}·u_n + P_{n,n+1}·u_{n+1} ⇒
P_{n,n-1}·u_n + P_{n,n+1}·u_n = P_{n,n-1}·u_{n-1} + P_{n,n+1}·u_{n+1} ⇒
P_{n,n-1}·(u_n - u_{n-1}) = P_{n,n+1}·(u_{n+1} - u_n) ⇒

(u_{n+1} - u_n) = (P_{n,n-1}/P_{n,n+1})·(u_n - u_{n-1}).

Hence

u_3 - u_2 = (P_{2,1}/P_{2,3})·(u_2 - u_1),

u_4 - u_3 = (P_{3,2}/P_{3,4})·(u_3 - u_2) = (P_{3,2}·P_{2,1})/(P_{3,4}·P_{2,3})·(u_2 - u_1),

...

u_N - u_{N-1} = (P_{N-1,N-2}·P_{N-2,N-3}·...·P_{3,2}·P_{2,1})/(P_{N-1,N}·P_{N-2,N-1}·...·P_{3,4}·P_{2,3})·(u_2 - u_1).

Summing, we obtain

u_N = u_2 + { Σ_{n=3}^{N} (P_{n-1,n-2}·P_{n-2,n-3}·...·P_{3,2}·P_{2,1})/(P_{n-1,n}·P_{n-2,n-1}·...·P_{3,4}·P_{2,3}) }·(u_2 - u_1).    (12.B.6)

Now, from eq. (12.B.2), it follows that

(P_{1,0} + P_{1,1} + P_{1,2})·u_1 = P_{1,1}·u_1 + P_{1,2}·u_2 ⇒
P_{1,0}·u_1 = P_{1,2}·(u_2 - u_1).

Choose any u_1 such that 0 < u_1 < 1. Then, since P_{1,0}, P_{1,2} > 0, also
u_2 - u_1 > 0 ⇒ u_2 > u_1 > 0.

Then, from eq. (12.B.6), for N = 3, 4, ... we also have u_N > 0. So a solution
to eqs. (12.B.2), (12.B.3) has been obtained, which satisfies u_N > 0 for N =
1, 2, .... Now if

{ Σ_{n=3}^{∞} (P_{n-1,n-2}·P_{n-2,n-3}·...·P_{3,2}·P_{2,1})/(P_{n-1,n}·P_{n-2,n-1}·...·P_{3,4}·P_{2,3}) }·(u_2 - u_1) < ∞,    (12.B.7)

then u'_N can be defined for N = 2, 3, ... by

u'_N = u_N / ( u_2 + { Σ_{n=3}^{∞} (P_{n-1,n-2}·P_{n-2,n-3}·...·P_{3,2}·P_{2,1})/(P_{n-1,n}·P_{n-2,n-1}·...·P_{3,4}·P_{2,3}) }·(u_2 - u_1) )    (12.B.8)

and, evidently, for N = 1, 2, ... the u'_N satisfy both eqs. (12.B.2), (12.B.3) and
0 < u'_N ≤ 1. So, it only needs to be shown that relationship (12.B.7) will
always be true if conditions B0, B1, B2 hold. To show this, note that

P_{n-1,n-2} = π_2·b_{n-1} and P_{n-1,n} = π_1·a_{n-1}, so that

P_{n-1,n-2}/P_{n-1,n} = (π_2·b_{n-1})/(π_1·a_{n-1}).

If we define

h(n) ≜ (π_2·b_{n-1})/(π_1·a_{n-1}),

it is easy to see that lim_{n→∞} h(n) = 0, so for any 0 < ρ < 1 there is some n_0
such that for all n ≥ n_0 we have h(n) < ρ. Consider

Σ_{n=3}^{∞} (P_{n-1,n-2}·P_{n-2,n-3}·...·P_{3,2}·P_{2,1})/(P_{n-1,n}·P_{n-2,n-1}·...·P_{3,4}·P_{2,3}) =

G(n_0) + H(n_0)·Σ_{n=n_0}^{∞} (P_{n-1,n-2}·P_{n-2,n-3}·...·P_{n_0-1,n_0-2})/(P_{n-1,n}·P_{n-2,n-1}·...·P_{n_0-1,n_0}),    (12.B.9)

where G(n_0) and H(n_0) depend only on n_0. Since every ratio appearing in the
product of the n-th term is some h(n) < ρ, it follows that expression (12.B.9)
is less than

G(n_0) + H(n_0)·Σ_{n=n_0}^{∞} ρ^{n-n_0} < ∞ ⇒

Σ_{n=3}^{∞} (P_{n-1,n-2}·P_{n-2,n-3}·...·P_{3,2}·P_{2,1})/(P_{n-1,n}·P_{n-2,n-1}·...·P_{3,4}·P_{2,3}) < ∞.

Hence it has been proved that if B0, B1, B2 hold, eqs. (12.B.2), (12.B.3) and
so also eq. (12.B.1) have a nontrivial admissible solution; consequently X_t is
transient. It can also be proved that eqs. (12.B.4), (12.B.5) have a nontrivial
admissible solution. The method of proof is quite similar to the one already
used and will not be presented here. •

Now Theorem 12.1 can be proved using Theorem 11-11.B.1 and Lemma
12-12.B.1.

Proof of Theorem 12.1: In fact the proof is now exactly the same as that
of Theorem 11.1, so it is omitted. •
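The convergence of the series in (12.B.7) is easy to check numerically for concrete acceptance curves. The sketch below (our illustration, again with hypothetical logistic a_n, b_n) accumulates the terms h(3)·h(4)·...·h(n); the partial sums stabilize almost immediately, in agreement with the transience conclusion of Lemma 12-12.B.1.

```python
import math

def sig(z):
    z = max(-40.0, min(40.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def partial_sum_12B7(pi1=0.5, lam=1.0, N=200):
    """Partial sums of the series in (12.B.7): sum over n of prod_{m=3..n} h(m),
    where h(n) = (pi2 * b_{n-1}) / (pi1 * a_{n-1}) with logistic a, b."""
    pi2 = 1.0 - pi1
    total, prod = 0.0, 1.0
    for n in range(3, N + 1):
        h = (pi2 * sig(-lam * (n - 1))) / (pi1 * sig(lam * (n - 1)))
        prod *= h          # running product h(3)*h(4)*...*h(n)
        total += prod
    return total

print(partial_sum_12B7(N=20), partial_sum_12B7(N=200))
```

Because h(n) decays roughly like e^{-lam·n} here, the tail of the series is utterly negligible; any acceptance curves satisfying B1, B2 make h(n) eventually smaller than any ρ < 1, which is exactly the comparison used in the proof.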

Appendix 12.C: Convergence of N_t^{ij}: Preliminaries

Theorem 12.2, regarding the limiting behavior of N_t^{ij} (i, j = 1,2), is proved
in this and the following sections. The largest part of the proof is taken up by
preliminary lemmas which bound the probabilities that the N_t^{ij} exceed certain
numbers infinitely often, conditioned on X_t tending to +∞ or to -∞.

Let us first define some useful quantities. Recall that the transition
probability Pr(X_t = n | X_{t-1} = m) is denoted by P_{m,n}. Define, for m ∈ Z, the
following quantities:

α(m) ≜ P_{m,m+1},  γ(m) ≜ P_{m,m-1},  β(m) ≜ P_{m,m}.

These are just more convenient symbols for the transition probabilities of X_t.
Now define the following for x, y ∈ Z:

p(y|x) ≜ { α(x) if x < y,
           β(x) if x = y,
           γ(x) if x > y;

q(y|x) ≜ { p(y|x) if x ≥ m,
           p(y|m) if x < m.

In other words, q(y|x) is identical to p(y|x), except when x is less than m.
When we restrict y to values such that |x - y| ≤ 1, both p(·|·) and q(·|·) are
probability functions for any x.
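The definitions of p(·|·) and q(·|·) can be written out directly. In this sketch (our illustration) we read "p(y|m) for x < m" as applying the state-m step probabilities to the displacement y - x, which is what makes q(·|x) a probability function for every x; the transition functions α, β, γ come from hypothetical logistic acceptance curves.

```python
import math

def sig(z):
    z = max(-40.0, min(40.0, z))
    return 1.0 / (1.0 + math.exp(-z))

pi1, lam, m = 0.6, 1.0, 5
alpha = lambda x: pi1 * sig(lam * x)             # alpha(x) = P_{x,x+1} = pi1 * a_x
gamma = lambda x: (1.0 - pi1) * sig(-lam * x)    # gamma(x) = P_{x,x-1} = pi2 * b_x
beta = lambda x: 1.0 - alpha(x) - gamma(x)       # beta(x)  = P_{x,x}

def p(y, x):
    if y == x + 1:
        return alpha(x)
    if y == x:
        return beta(x)
    if y == x - 1:
        return gamma(x)
    return 0.0

def q(y, x):
    # Below m, use the state-m step probabilities for the same displacement.
    return p(y, x) if x >= m else p(m + (y - x), m)

for x in (-3, 0, m, 9):
    assert abs(sum(p(y, x) for y in (x - 1, x, x + 1)) - 1.0) < 1e-12
    assert abs(sum(q(y, x) for y in (x - 1, x, x + 1)) - 1.0) < 1e-12
print("p(.|x) and q(.|x) are probability functions for every x")
```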
Here is an outline of the argument which we will follow in the next sections.
This is essentially the same argument presented in Section 11.C of Chapter 11.
In Sections 12.D, 12.E, 12.F we do the following.

1. We want to examine the behavior of the processes N_t^{ij} = Σ_{s=1}^{t} M_s^{ij}.

2. We introduce two auxiliary stochastic processes: one is a homogeneous
random walk and the other indicates whether the respective particle took a
move in a particular direction. The first type of process is constructed as
follows:

Sec. 12.D: V̄_t^m is more likely to go left than V_t;
Sec. 12.E: V̲_t^m is less likely to go right than V_t;
Sec. 12.F: Ṽ_t^m is more likely to go right than V_t.

The second type of process is constructed as follows:

Sec. 12.D: Σ_{s=1}^{t} M̄_s^m counts left moves of V̄_t^m;
Sec. 12.E: Σ_{s=1}^{t} M̲_s^m counts right moves of V̲_t^m;
Sec. 12.F: Σ_{s=1}^{t} M̃_s^m counts right moves of Ṽ_t^m.

3. Then, for any ε > 0, we show that

Sec. 12.D: Prob. of N_t^{21}/t ≥ ε is less than Prob. of Σ_{s=1}^{t} M̄_s^m / t ≥ ε;
Sec. 12.E: Prob. of N_t^{11}/t ≤ ε is less than Prob. of Σ_{s=1}^{t} M̲_s^m / t ≤ ε;
Sec. 12.F: Prob. of N_t^{12}/t ≥ ε is less than Prob. of Σ_{s=1}^{t} M̃_s^m / t ≥ ε.

4. Then, in Section 12.D (respectively: Section 12.E, Section 12.F) we choose
an appropriate ε and compute the probability of Σ_{s=1}^{t} M̄_s^m / t being greater
(respectively: Σ_{s=1}^{t} M̲_s^m / t being smaller, Σ_{s=1}^{t} M̃_s^m / t being greater)
than ε.

5. Then, in Section 12.D (respectively: Section 12.E, Section 12.F) we use the
above facts to show that the probability of N_t^{21}/t being greater (respectively:
N_t^{11}/t being smaller, N_t^{12}/t being greater) than ε infinitely often is zero.

Appendix 12.D: Bounding N_t^{21} from Above

Two Auxiliary Stochastic Processes

Before examining the properties of N_t^{21}, we need to define two auxiliary
stochastic processes. These depend on the following probabilities:

ᾱ(m) = inf_{n≥m} α(n),  γ̄(m) = sup_{n≥m} γ(n),  β̄(m) = 1 - ᾱ(m) - γ̄(m).

Obviously ᾱ(m) + β̄(m) + γ̄(m) = 1. It is also true that 0 ≤ ᾱ(m) ≤ 1, 0 ≤
γ̄(m) ≤ 1 and that for all m we have ᾱ(m) ≤ α(m) and γ̄(m) ≥ γ(m). There
may be values of m for which β̄(m) is negative. However, ᾱ(m) is monotonically
increasing in m, with lim_{m→∞} ᾱ(m) = π_1; γ̄(m) is monotonically decreasing
in m, with lim_{m→∞} γ̄(m) = 0; and lim_{m→∞} β̄(m) = π_2.

In view of the limiting behavior of β̄(m), we can choose an m̄ such that
for all m ≥ m̄ we have β̄(m) > 0. Then, for all m > m̄ the quantities ᾱ(m),
β̄(m), γ̄(m) are nonnegative and add up to one, so they can be considered to
be probabilities. Now, choose any m > m̄ and for z ∈ {-1, 0, 1} define

p̄(z) = { ᾱ(m) if z = 1,
         β̄(m) if z = 0,
         γ̄(m) if z = -1.

(We have suppressed, for brevity of notation, the dependence of p̄(·) on m.)
Consider a sequence of independent random variables V̄_1^m, V̄_2^m, ... which, for
t = 1, 2, ... and z ∈ {-1, 0, 1}, satisfy:

Pr(V̄_t^m = z) = p̄(z).

In other words, the V̄_t^m's are a sequence of independent, identically distributed
random variables. Finally define the stochastic process M̄_t^m by the following
relationships:

M̄_t^m = { 0 if V̄_t^m = 0 or V̄_t^m = 1;    (12.D.1)
          1 if V̄_t^m = -1.

In other words, M̄_t^m indicates when V̄_t^m is equal to -1.

Comparing eq. (12.D.1) with eq. (12.1), the similarities as well as the
differences between V_t and V̄_t^m (and between M_t^{21} and M̄_t^m) become obvious.
In particular:

1. V_t defines an inhomogeneous random walk and the random variables V_1, V_2,
... are dependent;

2. V̄_t^m defines a homogeneous random walk and the random variables V̄_1^m,
V̄_2^m, ... are independent;

3. Σ_{s=1}^{t} M_s^{21} counts the number of V_t moves to the left; Σ_{s=1}^{t} M̄_s^m counts the
number of V̄_t^m moves to the left;

4. If X_t ≥ m for all t greater than some t_0, then the probability that V_t takes
a move to the left is no greater than γ̄(m); the probability that V̄_t^m takes a
move to the left is always γ̄(m).

The last observation is very useful. Our ultimate goal is to prove that when
X_t → ∞, the number of times that V_t equals -1 will be small. More
specifically, we want to show that Σ_{s=1}^{t} M_s^{21} / t → 0. Because V_1, V_2, ... are
dependent, it is difficult to analyze the behavior of Σ_{s=1}^{t} M_s^{21} / t. Because V̄_1^m,
V̄_2^m, ... are independent, it is easier to analyze the behavior of Σ_{s=1}^{t} M̄_s^m / t.
This is the reason for introducing the processes V̄_t^m, M̄_t^m.

In particular, it is easy to obtain the following useful lemma, which describes
the behavior of the stochastic process M̄_t^m.

Lemma 12-12.D.1 For any m > m̄, there exists t_m such that for all t ≥ t_m we have

Pr( Σ_{s=1}^{t} M̄_s^m / t ≥ 2·γ̄(m) ) < 1/t².

Proof. This is proved in exactly the same manner as Lemma 11-11.C.1 and
hence the proof is omitted. •
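Since the M̄_s^m are i.i.d. indicator variables with success probability γ̄(m), the bound in Lemma 12-12.D.1 is a routine large-deviation estimate for a sample mean, and is easy to check empirically; the figures below are our own illustration, with a hypothetical γ̄(m) = 0.1.

```python
import random

def tail_freq(gamma=0.1, t=100, trials=5000, seed=3):
    """Empirical Pr( (1/t) * sum_{s<=t} Mbar_s >= 2*gamma ) for i.i.d.
    Bernoulli(gamma) indicators Mbar_s."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(1 for _ in range(t) if rng.random() < gamma)
        if s >= 2 * gamma * t:
            hits += 1
    return hits / trials

# The tail probability decays rapidly as t grows.
print(tail_freq(t=50), tail_freq(t=100), tail_freq(t=400))
```

The lemma only needs a bound that is summable in t, which is why the (unsharp) 1/t² rate is sufficient for the Borel-Cantelli arguments that follow.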
Relating M_t^{21} and M̄_t^m

Lemma 12-12.D.2 If conditions B0, B1, B2 hold, then, for any m > m̄ and
for the associated process M̄_t^m, we have (for any n ≥ 0, ε > 0, t ≥ 0)

Pr(X_s < X_{s-1} at least ε·t times, for s = n+1, ..., n+t | ∀r ≤ n: X_r ≥ m) ≤
Pr( Σ_{s=1}^{t} M̄_s^m ≥ ε·t ).

Proof. Choose some m > m̄ and some n ≥ 0, ε > 0, t ≥ 0; consider these fixed
for the rest of the proof. Recall that the choice of m determines V̄_t^m, through
the probability p̄(·), and that V̄_t^m determines M̄_t^m. Now define L ≡ ε·t. We
must bound

Pr(X_s < X_{s-1} at least L times, for s = n+1, ..., n+t | ∀r ≤ n: X_r ≥ m).

(If L is not an integer, then "L times" should be taken to mean "⌈L⌉ times",
where ⌈L⌉ means the integer part of L plus one.)
Now, choose any x_0 ∈ Z and define the following conditions on sequences
(x_1, x_2, ..., x_t) ∈ Z^t.

C1. For s = 1, 2, ..., t we have x_s < x_{s-1} at least L times.
C2. For s = 1, 2, ..., t we have |x_s - x_{s-1}| ≤ 1.
C3. For s = 1, 2, ..., t we have x_s ≥ m.

Taking into account the dependence on x_0, let us define

1. A_m(x_0): the set of sequences that satisfy C1, C2, C3; and

2. A(x_0): the set of sequences that satisfy C1, C2.

Obviously, A_m(x_0) ⊂ A(x_0) for all m. Now define

Q ≜ Pr(X_s < X_{s-1} at least L times, s = n+1, ..., n+t | ∀r ≤ n: X_r ≥ m)

and

R(x_0) ≜ Pr(X_n = x_0 | ∀r ≤ n: X_r ≥ m).

It follows that

Σ_{x_0∈Z} R(x_0) = 1.

It is easy to see that

Q = Σ_{x_0 = m, m+1, ...} R(x_0) · Σ_{(x_1,...,x_t)∈A_m(x_0)} p(x_1|x_0)·p(x_2|x_1)·...·p(x_t|x_{t-1}).

Recall that

1. p(x_s|x_{s-1}) = q(x_s|x_{s-1}) when x_1 x_2 ... x_t belongs to A_m(x_0),

2. A_m(x_0) ⊂ A(x_0), and

3. q(x_s|x_{s-1}) ≥ 0.

Next define

Q_t(x_0) ≜ Σ_{(x_1,...,x_t)∈A(x_0)} q(x_1|x_0)·q(x_2|x_1)·...·q(x_t|x_{t-1}).    (12.D.3)

Then, in view of the above remarks, we have

Q ≤ Σ_{x_0 = m, m+1, ...} R(x_0)·Q_t(x_0).

Let us now bound Q_t(x_0); then it will be easy to bound Q as well. As can be
seen from eq. (12.D.3), Q_t(x_0) consists of a sum of products of q(·|·) terms. We
will follow the same replacement procedure as in Chapter 11, taking t steps,
at every step producing an expression which is greater than the previous one,
by replacing one of the q(·|·) terms by a p̄(·) term. Namely, at step 1 we
will replace q(x_t|x_{t-1}) by p̄(x_t - x_{t-1}); at step 2 we will replace q(x_{t-1}|x_{t-2}) by
p̄(x_{t-1} - x_{t-2}); and so on, until at the t-th step we obtain an expression which is
greater than Q_t(x_0) and comprises entirely p̄(·) terms.

Once again we need auxiliary sets. The set B is defined in a manner similar
to that of Chapter 11. Define

B ≜ {(y_1, ..., y_t): y_s ∈ {-1, 0, 1} for s = 1, 2, ..., t and the no. of -1's is ≥ L}.

Note that the sets A(x_0) and B are in a one-to-one relationship: for x_0 fixed and
any (x_1, x_2, ..., x_t) ∈ A(x_0), a unique (y_1, y_2, ..., y_t) ∈ B is defined by taking
y_s = x_s - x_{s-1} (s = 1, 2, ...); conversely, for x_0 fixed and any (y_1, y_2, ..., y_t) ∈ B,
x_s = x_{s-1} + y_s defines a unique (x_1, x_2, ..., x_t) ∈ A(x_0). Hence there are one-
to-one functions Y: A(x_0) → B and X: B → A(x_0), where X = Y^{-1}. Notice
also that B is independent of x_0, i.e. for any x_0, x_0' ∈ Z, we have Y(A(x_0)) =
Y(A(x_0')). Now, define four sets as follows:

B̄_t^1 ≜ {(y_1, y_2, ..., y_t) ∈ B and y_t = 1},
B̄_t^2 ≜ {(y_1, y_2, ..., y_t) ∈ B and y_t = 0},
B̄_t^3 ≜ {(y_1, y_2, ..., y_t) ∈ B and y_t = -1 and (y_1, y_2, ..., y_{t-1}, 1) ∈ B},
B̄_t^4 ≜ B - B̄_t^1 - B̄_t^2 - B̄_t^3.

The sets B̄_t^i, i = 1, 2, 3, 4 partition the set B for the same reasons as in Chapter
11. Also, the elements of B̄_t^1, B̄_t^2 and B̄_t^3 are in a one-to-one correspondence.
This is clear for the sets B̄_t^1 and B̄_t^3. Regarding the sets B̄_t^1 and B̄_t^2, note that
if some (y_1, y_2, ..., 1) ∈ B̄_t^1, and the no. of -1's is L', then L' ≥ L; clearly
none of the -1's can be in the t-th position. But the same remarks hold for the
sequence (y_1, y_2, ..., 0). This shows that (y_1, y_2, ..., 0) ∈ B̄_t^2; so we have shown
that for every (y_1, y_2, ..., 1) ∈ B̄_t^1, there is exactly one (y_1, y_2, ..., 0) ∈ B̄_t^2. The
argument can be reversed to show that for every (y_1, y_2, ..., 0) ∈ B̄_t^2, there is
exactly one (y_1, y_2, ..., 1) ∈ B̄_t^1. So B̄_t^1, B̄_t^2 are in a one-to-one correspondence;
since B̄_t^1, B̄_t^3 are also in a one-to-one correspondence, the same holds for B̄_t^2,
B̄_t^3.

Finally, by the same arguments as in Chapter 11, B̄_t^4 is the set of sequences
ending in -1 for which the total no. of -1's is exactly equal to L, and B̄_t^3 is
the set of sequences ending in -1 for which the no. of -1's is greater than L.

Let us now proceed to implement the first step of the replacement procedure.
Since the argument is the same as that used in Chapter 11, it is presented briefly.
We have

Q_t(x_0) =    (12.D.4)

Σ_{(x_1,...,x_t)∈X(B̄_t^1)} q(x_1|x_0)·...·q(x_{t-1}|x_{t-2})·q(x_{t-1}+1|x_{t-1})    (12.D.5)

+ Σ_{(x_1,...,x_t)∈X(B̄_t^2)} q(x_1|x_0)·...·q(x_{t-1}|x_{t-2})·q(x_{t-1}|x_{t-1})    (12.D.6)

+ Σ_{(x_1,...,x_t)∈X(B̄_t^3)} q(x_1|x_0)·...·q(x_{t-1}|x_{t-2})·q(x_{t-1}-1|x_{t-1})    (12.D.7)

+ Σ_{(x_1,...,x_t)∈X(B̄_t^4)} q(x_1|x_0)·...·q(x_{t-1}|x_{t-2})·q(x_t|x_{t-1}).    (12.D.8)

Now, each q(x_1|x_0)·q(x_2|x_1)·...·q(x_t|x_{t-1}) term in the above expressions
corresponds to a sequence (x_1, x_2, ..., x_t). Since sequences from B̄_t^1, B̄_t^2 and B̄_t^3 are
in a one-to-one correspondence, to every q(x_1|x_0)·q(x_2|x_1)·...·q(x_{t-1}+1|x_{t-1})
in expression (12.D.5) we can correspond exactly one q(x_1|x_0)·q(x_2|x_1)·...·
q(x_{t-1}|x_{t-1}) in expression (12.D.6) and exactly one q(x_1|x_0)·q(x_2|x_1)·...·
q(x_{t-1}-1|x_{t-1}) in expression (12.D.7). Using the above facts, we can rewrite
expression (12.D.4) as

Σ_{(x_1,...,x_t)∈X(B̄_t^1)} q(x_1|x_0)·...·q(x_{t-1}|x_{t-2})·
[q(x_{t-1}+1|x_{t-1}) + q(x_{t-1}|x_{t-1}) + q(x_{t-1}-1|x_{t-1})]    (12.D.9)

+ Σ_{(x_1,...,x_t)∈X(B̄_t^4)} q(x_1|x_0)·...·q(x_{t-1}|x_{t-2})·q(x_t|x_{t-1}).    (12.D.10)

In expression (12.D.9), the terms in square brackets add to one, so they can
be replaced by [p̄(+1) + p̄(0) + p̄(-1)] (which also equals one) without altering
the value of the expression. Suppose now that in the expression (12.D.10),
each term in the sum is replaced by q(x_1|x_0)·...·q(x_{t-1}|x_{t-2})·p̄(x_t - x_{t-1}), i.e.
q(x_t|x_{t-1}) is replaced by p̄(x_t - x_{t-1}). Recall that for sequences in X(B̄_t^4) we
have x_t = x_{t-1} - 1; hence

p̄(x_t - x_{t-1}) = p̄(x_{t-1} - 1 - x_{t-1}),  q(x_t|x_{t-1}) = q(x_{t-1} - 1|x_{t-1}).
CONVERGENCE OF SERIAL DATA ALLOCATION 227

On the other hand

$$p(x_{t-1} - 1 - x_{t-1}) = p(-1) = \bar\gamma(m)$$

and

$$q(x_{t-1}-1|x_{t-1}) = \begin{cases} \gamma(x_{t-1}) & \text{if } x_{t-1} \ge m; \\ q(m-1|m) = \gamma(m) & \text{if } x_{t-1} < m. \end{cases}$$

If $x_{t-1} \ge m$, $\bar\gamma(m) \ge q(x_{t-1}-1|x_{t-1}) = \gamma(x_{t-1})$; if $x_{t-1} < m$, $\bar\gamma(m) \ge q(x_{t-1}-1|x_{t-1}) = \gamma(m)$. In either case, replacing all the $q(x_t|x_{t-1})$ terms with
$p(x_t - x_{t-1}) = p(x_{t-1} - 1 - x_{t-1})$ terms, the expression is not decreased and it
follows that $Q_t(x_0)$ is no greater than
$$\sum_{(x_1,\ldots,x_t) \in X(\bar B^1_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot [p(+1) + p(0) + p(-1)] \qquad (12.D.11)$$
$$+ \sum_{(x_1,\ldots,x_t) \in X(\bar B^4_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot p(x_t - x_{t-1}) \qquad (12.D.12)$$
$$= \sum_{(x_1,\ldots,x_t) \in A(x_0)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot p(x_t - x_{t-1}).$$

Recall that

$$Q_t(x_0) = \sum_{(x_1,\ldots,x_t) \in A(x_0)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot q(x_t|x_{t-1})$$

and now define

$$Q_{t-1}(x_0) \doteq \sum_{(x_1,\ldots,x_t) \in A(x_0)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot p(x_t - x_{t-1}).$$

Then it is clear that
$$Q_t(x_0) \le Q_{t-1}(x_0).$$
Continuing the replacement process in this manner we obtain
$$Q_t(x_0) \le Q_{t-1}(x_0) \le Q_{t-2}(x_0) \le \ldots \le Q_0(x_0)$$
where
$$Q_0(x_0) = \sum_{(x_1,\ldots,x_t) \in A(x_0)} p(x_1 - x_0) \cdot p(x_2 - x_1) \cdot \ldots \cdot p(x_t - x_{t-1}) = \sum_{(y_1,\ldots,y_t) \in \bar B} p(y_1) \cdot p(y_2) \cdot \ldots \cdot p(y_t).$$

Then, using exactly the same argument as in Appendix 11.C, it follows that

$$Q = \Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{21}_s}{t} \ge \varepsilon \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right) \qquad (12.D.13)$$
$$\le \sum_{x_0 = m, m+1, \ldots} Q_0(x_0) \cdot R(x_0). \qquad (12.D.14)$$

Expression (12.D.13) is exactly $\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{21}_s}{t} \ge \varepsilon \,\big|\, \forall \tau \ge n:\ X_\tau \ge m \right)$, and
expression (12.D.14) is the probability that, for $s = 1,2,\ldots,t$, $\bar M^m_s = 1$ at least
$L = \varepsilon \cdot t$ times. In short, what has been proved is that

$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{21}_s}{t} \ge \varepsilon \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right) \le \Pr\left( \frac{\sum_{s=1}^{t} \bar M^m_s}{t} \ge \varepsilon \right)$$

and the proof of the lemma is complete. ∎


Long Term Behavior of $M^{21}$

The previous lemma compared the behavior of $M^{21}_t$ with that of $\bar M^m_t$ over
finite times. The next lemma tells us something about the behavior of $M^{21}_t$ in
the long run (and without connection to $\bar M^m_t$).

Lemma 12-12.D.3 If conditions A0, A1, A2 hold, for all $m \ge \bar m$, and for
all $n \in \mathbb{N}$, we have

$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{21}_s}{t} \ge 2\bar\gamma(m) \ \text{i.o.} \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right) = 0.$$

Proof. This is proved in exactly the same way as Lemma 11-11.C.2 in Chapter
11 and so the proof is omitted. ∎

Appendix 12.E: Bounding $N^{11}_t$ from Above


Two Auxiliary Stochastic Processes

In this section we will relate the behavior of $\frac{\sum_{s=n+1}^{n+t} M^{11}_s}{t}$ to that of an associated
stochastic process. We need to define the following quantities

$$\bar\gamma(m) = \sup_{n \ge m} \gamma(n), \qquad \bar\beta(m) = \sup_{n \ge m} \beta(n), \qquad \bar\alpha(m) = 1 - \bar\beta(m) - \bar\gamma(m).$$

Note that $\bar\alpha(m) + \bar\beta(m) + \bar\gamma(m) = 1$, $0 \le \bar\gamma(m) \le 1$, $0 \le \bar\beta(m) \le 1$. Also note
that $\bar\beta(m)$ is monotonically decreasing with $m$, with $\lim_{m\to\infty} \bar\beta(m) = \pi_2$, and
$\bar\gamma(m)$ is monotonically decreasing with $m$, with $\lim_{m\to\infty} \bar\gamma(m) = 0$, and that
for all $m$ we have

$$\bar\beta(m) \ge \beta(m), \qquad \bar\gamma(m) \ge \gamma(m).$$

Because of the limiting behavior of $\bar\beta(m)$ and $\bar\gamma(m)$, we can choose an $\bar m$ such
that for all $m \ge \bar m$ we have $\bar\alpha(m) > 0$. Then, for all $m \ge \bar m$ the quantities
$\bar\alpha(m)$, $\bar\beta(m)$, $\bar\gamma(m)$ can be considered to be probabilities. Now, choose any $m$
and for $z \in \{-1,0,1\}$ define

$$\bar p(z) = \begin{cases} \bar\alpha(m) & \text{if } z = 1, \\ \bar\beta(m) & \text{if } z = 0, \\ \bar\gamma(m) & \text{if } z = -1. \end{cases}$$

(We have suppressed, for brevity of notation, the dependence of $\bar p(\cdot)$ on $m$.)
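As a concrete illustration, the sketch below builds $\bar\alpha$, $\bar\beta$, $\bar\gamma$ from hypothetical transition functions $\alpha(n)$, $\beta(n)$, $\gamma(n)$ (the specific forms are invented for the example, not taken from the text) and checks that the resulting $\bar p$ is a genuine probability distribution dominating the pointwise values:

```python
# Hypothetical transition probabilities of a nearest-neighbour chain;
# beta(n) -> pi2 and gamma(n) -> 0 as n grows, as assumed in the text.
PI2 = 0.3
def gamma(n): return 0.2 / (1 + max(n, 0))        # down-step probability
def beta(n):  return PI2 + 0.1 / (1 + max(n, 0))  # stay probability
def alpha(n): return 1.0 - beta(n) - gamma(n)     # up-step probability

def p_bar(m, horizon=10_000):
    # sup over n >= m, approximated over a long finite range; here
    # gamma and beta are decreasing, so the sup is attained at n = m.
    g = max(gamma(n) for n in range(m, m + horizon))
    b = max(beta(n) for n in range(m, m + horizon))
    return {1: 1.0 - b - g, 0: b, -1: g}

p = p_bar(m=5)
assert abs(sum(p.values()) - 1.0) < 1e-9
assert all(0.0 <= v <= 1.0 for v in p.values())   # a genuine distribution
assert p[-1] >= gamma(7) and p[0] >= beta(7)      # bars dominate, n >= m
```

For these invented forms $\bar\alpha(m) > 0$ already at small $m$; in general this is exactly what the choice of $\bar m$ above guarantees.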
Consider a sequence of independent random variables $\bar V^m_1, \bar V^m_2, \ldots$ which, for
$t = 1,2,\ldots$ and $z \in \{-1,0,1\}$, satisfy:
$$\Pr(\bar V^m_t = z) = \bar p(z).$$
In other words, the $\bar V^m_t$'s are a sequence of independent, identically distributed
random variables. Finally, define the stochastic process $\bar M^m_t$ by the following
relationship

$$\bar M^m_t = \begin{cases} 0 & \text{iff } \bar V^m_t = 0 \text{ or } \bar V^m_t = -1; \\ 1 & \text{iff } \bar V^m_t = 1. \end{cases} \qquad (12.E.1)$$

In other words, $\bar M^m_t$ indicates when $\bar V^m_t$ is equal to 1.


The rationale for introducing $\bar V^m_t$ and $\bar M^m_t$ is similar to that of previous
sections. We want to analyze the behavior of $\frac{\sum_s M^{11}_s}{t}$. Because $V_t$ is more likely
to move right than $\bar V^m_t$, the probability of $\frac{\sum_s M^{11}_s}{t}$ being smaller than some
number is smaller than that of $\frac{\sum_s \bar M^m_s}{t}$. Because $V_1, V_2, \ldots$ are dependent, while
$\bar V^m_1, \bar V^m_2, \ldots$ are independent, it is easier to analyze the behavior of $\frac{\sum_s \bar M^m_s}{t}$
than that of $\frac{\sum_s M^{11}_s}{t}$.

In particular, it is easy to obtain the following useful lemma, which describes
the behavior of the stochastic process $\bar M^m_t$.

Lemma 12-12.E.1 For any $m > \bar m$, $\exists t_m$ such that for all $t \ge t_m$ we have

$$\Pr\left( \frac{\sum_{s=1}^{t} \bar M^m_s}{t} \le \bar\alpha(m) - \frac{1}{m} \right) < \frac{1}{t^2}.$$

Proof. This is proved in exactly the same manner as Lemma 11-11.C.1 in
Chapter 11 and hence the proof is omitted. ∎
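Although the proof is omitted, the $1/t^2$ rate is easy to motivate: since the $\bar M^m_s$ are i.i.d. Bernoulli with mean $\bar\alpha(m)$, a standard Hoeffding bound (this derivation is our own gloss, not part of the original proof) gives

```latex
\Pr\!\left(\frac{1}{t}\sum_{s=1}^{t}\bar M^m_s \le \bar\alpha(m)-\frac{1}{m}\right)
\;\le\; \exp\!\left(-\frac{2t}{m^2}\right),
```

and $e^{-2t/m^2} < 1/t^2$ for all $t$ larger than some $t_m$ depending only on $m$, since the exponential decays faster than any polynomial.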
Relating $M^{11}$ and $\bar M^m_t$

Lemma 12-12.E.2 If conditions B0, B1, B2 hold, then, for any $m > \bar m$,
and for any $n \ge 0$, $\varepsilon > 0$, $t \ge 0$ we have

$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{11}_s}{t} \le \varepsilon \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right) \le \Pr\left( \frac{\sum_{s=1}^{t} \bar M^m_s}{t} \le \varepsilon \right).$$

Proof. We follow the usual method of proof. Choose some $m > \bar m$ and some
$n \ge 0$, $\varepsilon > 0$, $t \ge 0$; consider these fixed for the rest of the proof. Set $L = \varepsilon \cdot t$.
We have

$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{11}_s}{t} \le \varepsilon \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right) =$$
$$\Pr\left( \sum_{s=n+1}^{n+t} M^{11}_s \le L \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right) =$$
$$\Pr\left( X_s > X_{s-1} \text{ at most } L \text{ times, for } s = n+1, \ldots, n+t \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right).$$
Now, for any $x_0 \in \mathbb{Z}$, define the following conditions on strings $x_1 x_2 \ldots x_t \in \mathbb{Z}^t$.

D1 For $s = 1,2,\ldots,t$ we have $x_s > x_{s-1}$ at most $L$ times.

D2 For $s = 1,2,\ldots,t$ we have $|x_s - x_{s-1}| \le 1$.

D3 For $s = 1,2,\ldots,t$ we have $x_s > m$.

Taking into account the dependence on $x_0$, let us define

1. $A_m(x_0)$: the set of sequences that satisfy D1, D2, D3; and

2. $A(x_0)$: the set of sequences that satisfy D1, D2.

Obviously, $A_m(x_0) \subset A(x_0)$ for all $m$. Now define

$$Q \doteq \Pr\left( X_s > X_{s-1} \text{ at most } L \text{ times, } s = n+1, \ldots, n+t \,\big|\, \forall \tau \ge n,\ X_\tau \ge m \right)$$

and recall that

$$R(x_0) = \Pr\left( X_n = x_0 \,|\, \forall \tau \ge n,\ X_\tau \ge m \right);$$

then

$$Q = \sum_{x_0 = m, m+1, \ldots} \Pr\left( X_s > X_{s-1} \text{ at most } L \text{ times, } s = n+1,\ldots,n+t \,|\, X_n = x_0,\ \forall \tau \ge n,\ X_\tau \ge m \right) \cdot R(x_0).$$

Now, define

$$Q_t(x_0) \doteq \sum_{(x_1,\ldots,x_t) \in A(x_0)} q(x_1|x_0) \cdot q(x_2|x_1) \cdot \ldots \cdot q(x_t|x_{t-1}); \qquad (12.E.2)$$

it follows that

$$Q \le \sum_{x_0 = m, m+1, \ldots} Q_t(x_0) \cdot R(x_0).$$

As usual, our goal is to bound $Q$ and we will achieve this through a replacement
procedure. We define some additional sets of sequences $(y_1, y_2, \ldots, y_t) \in \{-1,0,1\}^t$. We define

$$\tilde B \doteq \{(y_1,\ldots,y_t):\ \text{for } s = 1,2,\ldots,t \text{ we have } y_s \in \{-1,0,1\},\ \text{no. of } 1\text{'s} \le L\}.$$

As usual, the sets $A(x_0)$ and $\tilde B$ are in a one-to-one correspondence: for $x_0$ fixed
and any $(x_1,x_2,\ldots,x_t) \in A(x_0)$, a unique $(y_1,y_2,\ldots,y_t) \in \tilde B$ is defined by taking
$y_s = x_s - x_{s-1}$ ($s = 1,2,\ldots,t$); conversely, for $x_0$ fixed and any $(y_1,y_2,\ldots,y_t) \in \tilde B$,
$x_s = x_{s-1} + y_s$ defines a unique $(x_1,x_2,\ldots,x_t) \in A(x_0)$. Hence there are one-to-one
functions $Y : A(x_0) \to \tilde B$ and $X : \tilde B \to A(x_0)$, where $X = Y^{-1}$. Note
also that $\tilde B$ is independent of $x_0$, i.e. for any $x_0, x_0' \in \mathbb{Z}$, we have $Y(A(x_0)) = Y(A(x_0'))$. Now, define five sets as follows.

$$\tilde B^1_t \doteq \{(y_1,y_2,\ldots,y_t) \in \tilde B \text{ and } y_t = 1\}$$
$$\tilde B^2_t \doteq \{(y_1,y_2,\ldots,y_t) \in \tilde B \text{ and } y_t = 0 \text{ and } (y_1,y_2,\ldots,y_{t-1},1) \in \tilde B^1_t\}$$
$$\tilde B^3_t \doteq \{(y_1,y_2,\ldots,y_t) \in \tilde B \text{ and } y_t = -1 \text{ and } (y_1,y_2,\ldots,y_{t-1},1) \in \tilde B^1_t\}$$
$$\tilde B^4_t \doteq \{(y_1,y_2,\ldots,y_t) \in \tilde B \text{ and } y_t = 0 \text{ and } (y_1,y_2,\ldots,y_{t-1},1) \notin \tilde B^1_t\}$$
$$\tilde B^5_t \doteq \{(y_1,y_2,\ldots,y_t) \in \tilde B \text{ and } y_t = -1 \text{ and } (y_1,y_2,\ldots,y_{t-1},1) \notin \tilde B^1_t\}$$

By arguments similar to those of the previous section, it can be shown that the
sets $\tilde B^i_t$, $i = 1,\ldots,5$ partition $\tilde B$. In addition, it can be shown that

1. $\tilde B^4_t$ is the set of $\tilde B$ sequences with $y_t = 0$ and no. of 1's equal to $L$, while

2. $\tilde B^2_t$ is the set of $\tilde B$ sequences with $y_t = 0$ and no. of 1's less than $L$;
similarly

3. $\tilde B^5_t$ is the set of $\tilde B$ sequences with $y_t = -1$ and no. of 1's equal to $L$, while

4. $\tilde B^3_t$ is the set of $\tilde B$ sequences with $y_t = -1$ and no. of 1's less than $L$.

It is also clear that the elements of $\tilde B^1_t$, $\tilde B^2_t$ and $\tilde B^3_t$ are in a one-to-one
correspondence.
Let us now proceed to implement the usual replacement procedure. We have

$$Q_t(x_0) = \sum_{(x_1,\ldots,x_t) \in A(x_0)} q(x_1|x_0) \cdot q(x_2|x_1) \cdot \ldots \cdot q(x_t|x_{t-1}) \qquad (12.E.3)$$
$$= \sum_{(x_1,\ldots,x_t) \in X(\tilde B^1_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot q(x_{t-1}+1|x_{t-1}) \qquad (12.E.4)$$
$$+ \sum_{(x_1,\ldots,x_t) \in X(\tilde B^2_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot q(x_{t-1}|x_{t-1}) \qquad (12.E.5)$$
$$+ \sum_{(x_1,\ldots,x_t) \in X(\tilde B^3_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot q(x_{t-1}-1|x_{t-1}) \qquad (12.E.6)$$
$$+ \sum_{(x_1,\ldots,x_t) \in X(\tilde B^4_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot q(x_t|x_{t-1}) \qquad (12.E.7)$$
$$+ \sum_{(x_1,\ldots,x_t) \in X(\tilde B^5_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot q(x_t|x_{t-1}). \qquad (12.E.8)$$

Now, the terms in expressions (12.E.4), (12.E.5), (12.E.6) are in a one-to-one
correspondence and we can rewrite expression (12.E.3) as follows

$$\sum_{(x_1,\ldots,x_t) \in X(\tilde B^1_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot [q(x_{t-1}+1|x_{t-1}) + q(x_{t-1}|x_{t-1}) + q(x_{t-1}-1|x_{t-1})] \qquad (12.E.9)$$
$$+ \sum_{(x_1,\ldots,x_t) \in X(\tilde B^4_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot q(x_t|x_{t-1}) \qquad (12.E.10)$$
$$+ \sum_{(x_1,\ldots,x_t) \in X(\tilde B^5_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot q(x_t|x_{t-1}). \qquad (12.E.11)$$

In expression (12.E.9), the terms in square brackets add to one, so they can
be replaced by $[\bar p(1) + \bar p(0) + \bar p(-1)]$ (which also equals one) without altering
the value of the expression. Suppose now that in the expression (12.E.10) each
term in the sum is replaced by $q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot \bar p(x_t - x_{t-1})$, i.e.
$q(x_t|x_{t-1})$ is replaced by $\bar p(x_t - x_{t-1})$. Recall that for sequences in $X(\tilde B^4_t)$ we
have $x_t = x_{t-1}$; hence

$$\bar p(x_t - x_{t-1}) = \bar p(x_{t-1} - x_{t-1}) = \bar p(0), \qquad q(x_t|x_{t-1}) = q(x_{t-1}|x_{t-1}).$$

On the other hand

$$\bar p(0) = \bar\beta(m)$$

and

$$q(x_{t-1}|x_{t-1}) = \begin{cases} \beta(x_{t-1}) & \text{if } x_{t-1} \ge m; \\ q(m|m) = \beta(m) & \text{if } x_{t-1} < m. \end{cases}$$

If $x_{t-1} \ge m$, then $q(x_{t-1}|x_{t-1}) = \beta(x_{t-1}) \le \bar\beta(m)$; if $x_{t-1} < m$, then
$q(x_{t-1}|x_{t-1}) = \beta(m) \le \bar\beta(m)$. In either case, the expression is not decreased
by replacing the $q(x_t|x_{t-1})$ terms with $\bar p(x_t - x_{t-1}) = \bar p(0)$ terms.

Also, since for sequences in $X(\tilde B^5_t)$ we have $x_t = x_{t-1} - 1$, by an exactly
analogous analysis we conclude that expression (12.E.11) is not decreased by
replacing all the $q(x_t|x_{t-1})$ terms with $\bar p(x_t - x_{t-1}) = \bar p(-1)$ terms. Finally, we
have

$$\sum_{(x_1,\ldots,x_t) \in X(\tilde B^1_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot [q(x_{t-1}+1|x_{t-1}) + q(x_{t-1}|x_{t-1}) + q(x_{t-1}-1|x_{t-1})]$$
$$+ \sum_{(x_1,\ldots,x_t) \in X(\tilde B^4_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot q(x_t|x_{t-1}) + \sum_{(x_1,\ldots,x_t) \in X(\tilde B^5_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot q(x_t|x_{t-1})$$
$$\le \sum_{(x_1,\ldots,x_t) \in X(\tilde B^1_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot [\bar p(1) + \bar p(0) + \bar p(-1)]$$
$$+ \sum_{(x_1,\ldots,x_t) \in X(\tilde B^4_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot \bar p(x_t - x_{t-1}) + \sum_{(x_1,\ldots,x_t) \in X(\tilde B^5_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot \bar p(x_t - x_{t-1})$$
$$= \sum_{(x_1,\ldots,x_t) \in A(x_0)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot \bar p(x_t - x_{t-1}).$$

We have just proved that
$$Q_t(x_0) \le Q_{t-1}(x_0)$$
where
$$Q_{t-1}(x_0) \doteq \sum_{(x_1,\ldots,x_t) \in A(x_0)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot \bar p(x_t - x_{t-1}).$$

Continuing the replacement process in this manner, we obtain

$$Q_t(x_0) \le Q_{t-1}(x_0) \le Q_{t-2}(x_0) \le \ldots \le Q_0(x_0)$$

where
$$Q_0(x_0) = \sum_{(x_1,\ldots,x_t) \in A(x_0)} \bar p(x_1 - x_0) \cdot \bar p(x_2 - x_1) \cdot \ldots \cdot \bar p(x_t - x_{t-1}) = \sum_{(y_1,\ldots,y_t) \in \tilde B} \bar p(y_1) \cdot \bar p(y_2) \cdot \ldots \cdot \bar p(y_t).$$

Then, using exactly the same argument as in Appendix 11.C, it follows that

$$Q = \Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{11}_s}{t} \le \varepsilon \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right) \qquad (12.E.12)$$
$$\le \sum_{x_0 = m, m+1, \ldots} Q_0(x_0) \cdot R(x_0), \qquad (12.E.13)$$

since $\sum_{x_0 \in \mathbb{Z}} R(x_0) = 1$. Expression (12.E.12) is exactly $\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{11}_s}{t} \le \varepsilon \,\big|\, \forall \tau \ge n:\ X_\tau \ge m \right)$, while expression (12.E.13) is the probability that, for $s = 1,2,\ldots,t$, $\bar M^m_s = 1$ at most $L = \varepsilon \cdot t$ times. In short, what has been proved is
that

$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{11}_s}{t} \le \varepsilon \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right) \le \Pr\left( \frac{\sum_{s=1}^{t} \bar M^m_s}{t} \le \varepsilon \right)$$

and the proof of the lemma is complete. ∎
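The dominance direction behind the lemma can be sanity-checked numerically. The sketch below uses a toy nearest-neighbour chain with invented transition probabilities (not the chain of the text): since the chain's up-move probability is everywhere at least the i.i.d. envelope probability $\bar\alpha(m)$, the chain's empirical up-move frequency should not fall below it. This is a simulation check of the inequality's direction, not a proof.

```python
import random

random.seed(0)

# Toy chain on the integers >= 0: from state x it moves up with prob a(x),
# stays with prob b(x), moves down with prob g(x). Forms are invented.
def g(x): return 0.1 / (1 + x)
def b(x): return 0.3
def a(x): return 1.0 - b(x) - g(x)

def run_chain(x0, t):
    x, ups = x0, 0
    for _ in range(t):
        u = random.random()
        if u < a(x):
            x, ups = x + 1, ups + 1
        elif u < a(x) + b(x):
            pass
        else:
            x = max(x - 1, 0)
    return ups

m = 0
# i.i.d. lower envelope: up-move probability alpha_bar <= a(x) for all x >= m.
alpha_bar = 1.0 - max(b(n) for n in range(m, 1000)) - max(g(n) for n in range(m, 1000))
trials, t = 2000, 200
chain_freq = sum(run_chain(m, t) for _ in range(trials)) / (trials * t)
assert chain_freq >= alpha_bar - 0.02   # chain moves up at least as often
```

With the seed fixed, the estimated frequency comfortably exceeds the envelope value of 0.6, matching the stochastic comparison used in the proof.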


Long Term Behavior of $M^{11}$

The previous lemma compared the behavior of $M^{11}_t$ with that of $\bar M^m_t$ over
finite times. The next lemma tells us something about the behavior of $M^{11}_t$ in
the long run (and without connection to $\bar M^m_t$).

Lemma 12-12.E.3 If conditions A0, A1, A2 hold, for all $m \ge \bar m$, and for
all $n \in \mathbb{N}$, we have

$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{11}_s}{t} \le \bar\alpha(m) - \frac{2}{m} \ \text{i.o.} \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right) = 0.$$

Proof. This is proved in exactly the same way as Lemma 11-11.C.2 in Chapter
11 and so the proof is omitted. ∎

Appendix 12.F: Bounding $N^{11}_t$ from Below

Two Auxiliary Stochastic Processes


Define the following quantities

$$\tilde\alpha(m) = \sup_{n \ge m} \alpha(n), \qquad \tilde\gamma(m) = \inf_{n \ge m} \gamma(n) = 0, \qquad \tilde\beta(m) = 1 - \tilde\alpha(m) - \tilde\gamma(m).$$

Note that $\tilde\alpha(m) + \tilde\beta(m) + \tilde\gamma(m) = 1$, that $0 \le \tilde\alpha(m) \le 1$ and that $0 \le \tilde\beta(m) \le 1$;
obviously, $\tilde\alpha(m)$, $\tilde\beta(m)$, $\tilde\gamma(m)$ are probabilities. Note also that $\tilde\alpha(m)$ is
decreasing with $m$, with $\lim_{m\to\infty} \tilde\alpha(m) = \pi_1$ and $\lim_{m\to\infty} \tilde\beta(m) = \pi_2$; also, for
all $m$ we have
$$\tilde\alpha(m) \ge \alpha(m), \qquad \tilde\gamma(m) \le \gamma(m).$$
Define $\tilde p$ as follows (for $z \in \{-1,0,1\}$)

$$\tilde p(z) = \begin{cases} \tilde\alpha(m) & \text{if } z = 1, \\ \tilde\beta(m) & \text{if } z = 0, \\ \tilde\gamma(m) & \text{if } z = -1. \end{cases}$$

(We have suppressed, for brevity of notation, the dependence of $\tilde p(\cdot)$ on $m$.)
Consider a sequence of independent random variables $\tilde V^m_1, \tilde V^m_2, \ldots$ which, for
$t = 1,2,\ldots$ and $z \in \{-1,0,1\}$, satisfy:

$$\Pr(\tilde V^m_t = z) = \tilde p(z).$$

In other words, the $\tilde V^m_t$'s are a sequence of independent, identically distributed
random variables. Finally, define the stochastic process $\tilde M^m_t$ by the following
relationship

$$\tilde M^m_t = \begin{cases} 0 & \text{iff } \tilde V^m_t = 0 \text{ or } \tilde V^m_t = -1; \\ 1 & \text{iff } \tilde V^m_t = 1. \end{cases} \qquad (12.F.1)$$

In other words, $\tilde M^m_t$ indicates when $\tilde V^m_t$ is equal to 1.


The rationale for introducing $\tilde V^m_t$ and $\tilde M^m_t$ is similar to that of previous
sections. We want to analyze the behavior of $\frac{\sum_s M^{11}_s}{t}$. Because $V_t$ is less likely to
move right than $\tilde V^m_t$, it follows that the probability of $\frac{\sum_s M^{11}_s}{t}$ being greater than
some number is smaller than that of $\frac{\sum_s \tilde M^m_s}{t}$. Because $V_1, V_2, \ldots$ are dependent,
while $\tilde V^m_1, \tilde V^m_2, \ldots$ are independent, it is easier to analyze the behavior of $\frac{\sum_s \tilde M^m_s}{t}$
than that of $\frac{\sum_s M^{11}_s}{t}$.
In particular, it is easy to obtain the following useful lemma, which describes
the behavior of the stochastic process $\tilde M^m_t$.

Lemma 12-12.F.1 For any $m > \bar m$, $\exists t_m$ such that for all $t \ge t_m$ we have

$$\Pr\left( \frac{\sum_{s=1}^{t} \tilde M^m_s}{t} \ge \tilde\alpha(m) + \frac{1}{m} \right) < \frac{1}{t^2}.$$

Proof. This is proved in exactly the same manner as Lemma 11-11.C.1 in
Chapter 11 and hence the proof is omitted. ∎
Relating $M^{11}$ and $\tilde M^m_t$

Lemma 12-12.F.2 If conditions B0, B1, B2 hold, then for any $m \in \mathbb{Z}$, and
for the associated process $\tilde M^m_t$ we have (for any $n \ge 0$, $\varepsilon > 0$, $t \ge 0$)

$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{11}_s}{t} \ge \varepsilon \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right) \le \Pr\left( \frac{\sum_{s=1}^{t} \tilde M^m_s}{t} \ge \varepsilon \right).$$

Proof. Choose some $m$, $n \ge 0$, $\varepsilon > 0$, $t \ge 0$; consider these fixed for the rest of
the proof. Set $L = \varepsilon \cdot t$. We have

$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{11}_s}{t} \ge \varepsilon \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right) = \Pr\left( X_s > X_{s-1} \text{ at least } L \text{ times, for } s = n+1,\ldots,n+t \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right).$$

Now, for any $x_0 \in \mathbb{Z}$, define the following conditions on strings $x_1 x_2 \ldots x_t \in \mathbb{Z}^t$.

E1 For $s = 1,2,\ldots,t$ we have $x_s > x_{s-1}$ at least $L$ times.

E2 For $s = 1,2,\ldots,t$ we have $|x_s - x_{s-1}| \le 1$.

E3 For $s = 1,2,\ldots,t$ we have $x_s \ge m$.

Define the following sets of sequences.

1. $A_m(x_0)$: the set of strings that satisfy E1, E2, E3;

2. $A(x_0)$: the set of strings that satisfy E1, E2.


Obviously, for all $m$ we have $A_m(x_0) \subset A(x_0)$. Recall that

$$R(x_0) = \Pr\left( X_n = x_0 \,|\, \forall \tau \ge n,\ X_\tau \ge m \right)$$

and define

$$Q \doteq \Pr\left( X_s > X_{s-1} \text{ at least } L \text{ times, } s = n+1,\ldots,n+t \,|\, \forall \tau \ge n,\ X_\tau \ge m \right).$$

Then we have that

$$Q = \sum_{x_0 = m, m+1, \ldots} \Pr\left( X_s > X_{s-1} \text{ at least } L \text{ times, } s = n+1,\ldots,n+t \,|\, X_n = x_0,\ \forall \tau \ge n,\ X_\tau \ge m \right) \cdot R(x_0)$$
$$\le \sum_{x_0 = m, m+1, \ldots} Q_t(x_0) \cdot R(x_0).$$

Now fix also $x_0$ and concentrate on $Q_t(x_0)$.

Once again we apply the usual replacement procedure. To do this we need

$$\tilde B \doteq \{(y_1,\ldots,y_t):\ \text{for } s = 1,\ldots,t \text{ we have } y_s \in \{-1,0,1\},\ \text{no. of } 1\text{'s} \ge L\}.$$

As usual, the sets $A(x_0)$ and $\tilde B$ are in a one-to-one correspondence, which can
be expressed by one-to-one functions $Y : A(x_0) \to \tilde B$ and $X : \tilde B \to A(x_0)$, where
$X = Y^{-1}$, and $\tilde B$ is independent of $x_0$. Next, define four sets as follows.

$$\tilde B^1_t \doteq \{(y_1,y_2,\ldots,y_t) \in \tilde B \text{ and } y_t = -1\}$$
$$\tilde B^2_t \doteq \{(y_1,y_2,\ldots,y_t) \in \tilde B \text{ and } y_t = 0\}$$
$$\tilde B^3_t \doteq \{(y_1,y_2,\ldots,y_t) \in \tilde B \text{ and } y_t = 1 \text{ and } (y_1,y_2,\ldots,y_{t-1},-1) \in \tilde B^1_t\}$$
$$\tilde B^4_t \doteq \tilde B - \tilde B^1_t - \tilde B^2_t - \tilde B^3_t.$$

As usual, the sets $\tilde B^i_t$, $i = 1,2,3,4$ partition $\tilde B$ and the elements of $\tilde B^1_t$, $\tilde B^2_t$ and
$\tilde B^3_t$ are in a one-to-one correspondence. In addition (by the usual arguments)
$\tilde B^3_t$ is the set of $\tilde B$ sequences with no. of 1's $> L$, while $\tilde B^4_t$ is the set of
$\tilde B$ sequences with no. of 1's $= L$. Now let us proceed with the replacement
procedure.

$$Q_t(x_0) \doteq \sum_{(x_1,\ldots,x_t) \in A(x_0)} q(x_1|x_0) \cdot q(x_2|x_1) \cdot \ldots \cdot q(x_t|x_{t-1}) \qquad (12.F.2)$$
$$= \sum_{(x_1,\ldots,x_t) \in X(\tilde B^1_t)} q(x_1|x_0) \cdot q(x_2|x_1) \cdot \ldots \cdot q(x_t|x_{t-1}) \qquad (12.F.3)$$
$$+ \sum_{(x_1,\ldots,x_t) \in X(\tilde B^2_t)} q(x_1|x_0) \cdot q(x_2|x_1) \cdot \ldots \cdot q(x_t|x_{t-1}) \qquad (12.F.4)$$
$$+ \sum_{(x_1,\ldots,x_t) \in X(\tilde B^3_t)} q(x_1|x_0) \cdot q(x_2|x_1) \cdot \ldots \cdot q(x_t|x_{t-1}) \qquad (12.F.5)$$
$$+ \sum_{(x_1,\ldots,x_t) \in X(\tilde B^4_t)} q(x_1|x_0) \cdot q(x_2|x_1) \cdot \ldots \cdot q(x_t|x_{t-1}) \qquad (12.F.6)$$

Since the terms in expressions (12.F.3), (12.F.4), (12.F.5) are in a one-to-one
correspondence, we can rewrite expression (12.F.2) as

$$\sum_{(x_1,\ldots,x_t) \in X(\tilde B^1_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot [q(x_{t-1}-1|x_{t-1}) + q(x_{t-1}|x_{t-1}) + q(x_{t-1}+1|x_{t-1})] \qquad (12.F.7)$$
$$+ \sum_{(x_1,\ldots,x_t) \in X(\tilde B^4_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot q(x_t|x_{t-1}). \qquad (12.F.8)$$

In expression (12.F.7) the terms in the square bracket add up to one, so they can
be replaced by $[\tilde p(+1) + \tilde p(0) + \tilde p(-1)]$ (which also equals one) without altering
the value of the expression. Suppose now that in expression (12.F.8) each
term in the sum is replaced by $q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot \tilde p(x_{t-1} + 1 - x_{t-1})$,
i.e. $q(x_{t-1}+1|x_{t-1})$ is replaced by $\tilde p(x_{t-1} + 1 - x_{t-1}) = \tilde p(1)$. By definition,
$\tilde p(1) = \tilde\alpha(m)$. We have $\tilde\alpha(m) \ge q(x_{t-1}+1|x_{t-1})$, because: if $x_{t-1} < m$, then
$q(x_{t-1}+1|x_{t-1}) = \alpha(m) \le \tilde\alpha(m)$; whereas if $x_{t-1} \ge m$, then $q(x_{t-1}+1|x_{t-1}) = \alpha(x_{t-1}) \le \tilde\alpha(m)$.

Hence, replacing all the $q(x_{t-1}+1|x_{t-1})$ terms with $\tilde p(x_{t-1} + 1 - x_{t-1}) = \tilde p(x_t - x_{t-1})$, the expression is not decreased and it follows that $Q_t(x_0)$ is no
greater than

$$\sum_{(x_1,\ldots,x_t) \in X(\tilde B^1_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot [\tilde p(-1) + \tilde p(0) + \tilde p(1)]$$
$$+ \sum_{(x_1,\ldots,x_t) \in X(\tilde B^4_t)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot \tilde p(x_t - x_{t-1})$$
$$= \sum_{(x_1,\ldots,x_t) \in A(x_0)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot \tilde p(x_t - x_{t-1}).$$

Recall that

$$Q_t(x_0) = \sum_{(x_1,\ldots,x_t) \in A(x_0)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot q(x_t|x_{t-1})$$

and define

$$Q_{t-1}(x_0) \doteq \sum_{(x_1,\ldots,x_t) \in A(x_0)} q(x_1|x_0) \cdot \ldots \cdot q(x_{t-1}|x_{t-2}) \cdot \tilde p(x_t - x_{t-1}).$$

Then it is clear that

$$Q_t(x_0) \le Q_{t-1}(x_0).$$

Continuing the replacement process in this manner we obtain

$$Q_t(x_0) \le Q_{t-1}(x_0) \le Q_{t-2}(x_0) \le \ldots \le Q_0(x_0)$$

where $Q_0(x_0)$ has only $\tilde p$ terms. In other words

$$Q_0(x_0) = \sum_{(x_1,\ldots,x_t) \in A(x_0)} \tilde p(x_1 - x_0) \cdot \tilde p(x_2 - x_1) \cdot \ldots \cdot \tilde p(x_t - x_{t-1})$$

and

$$Q_0(x_0) = \sum_{(y_1,\ldots,y_t) \in \tilde B} \tilde p(y_1) \cdot \tilde p(y_2) \cdot \ldots \cdot \tilde p(y_t).$$

Then, using exactly the same argument as in Appendix 11.C, it follows that

$$Q = \Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{11}_s}{t} \ge \varepsilon \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right) \qquad (12.F.9)$$
$$\le \sum_{x_0 = m, m+1, \ldots} Q_0(x_0) \cdot R(x_0). \qquad (12.F.10)$$

Expression (12.F.9) is exactly $\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{11}_s}{t} \ge \varepsilon \,\big|\, \forall \tau \ge n:\ X_\tau \ge m \right)$, while
expression (12.F.10) is the probability that, for $s = 1,2,\ldots,t$, $\tilde M^m_s = 1$ at least
$L = \varepsilon \cdot t$ times. In short, what has been proved is that

$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{11}_s}{t} \ge \varepsilon \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right) \le \Pr\left( \frac{\sum_{s=1}^{t} \tilde M^m_s}{t} \ge \varepsilon \right)$$

and the proof of the lemma is complete. ∎

Long Term Behavior of $M^{11}$

The previous lemma compared the behavior of $M^{11}_t$ with that of $\tilde M^m_t$ over
finite times. The next lemma tells us something about the behavior of $M^{11}_t$ in
the long run (and without connection to $\tilde M^m_t$).

Lemma 12-12.F.3 If conditions A0, A1, A2 hold, for all $m \in \mathbb{Z}$, and for
all $n \in \mathbb{N}$, we have

$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{11}_s}{t} \ge \tilde\alpha(m) + \frac{2}{m} \ \text{i.o.} \,\Big|\, \forall \tau \ge n:\ X_\tau \ge m \right) = 0.$$

Proof. This is proved exactly like Lemma 11-11.C.2 and so the proof is
omitted. ∎

Appendix 12.G: Convergence of $N^{ij}_t$


Now, using the previously established Lemmas 12-12.D.3, 12-12.E.3, 12-12.F.3
we are ready to prove Theorem 12.2.

Proof of Theorem 12.2: Only the case when $\lim_{t\to\infty} X_t = +\infty$ will be
considered in detail (the case $\lim_{t\to\infty} X_t = -\infty$ is proved in exactly the same
manner). First we will prove that

$$\Pr\left( \lim_{t\to\infty} \frac{N^{21}_t}{t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1. \qquad (12.G.1)$$

Then it follows immediately that

$$\Pr\left( \lim_{t\to\infty} \frac{N^{22}_t}{t} = \pi_2 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1. \qquad (12.G.2)$$

Next, we will prove that

$$\Pr\left( \lim_{t\to\infty} \frac{N^{11}_t}{t} = \pi_1 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1; \qquad (12.G.3)$$

using eq. (12.G.3) we will also prove that

$$\Pr\left( \lim_{t\to\infty} \frac{N^{12}_t}{t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1. \qquad (12.G.4)$$

Finally, using eqs. (12.G.1)-(12.G.4) it is easy to prove that

$$\Pr\left( \lim_{t\to\infty} \frac{N^{12}_t}{N^{11}_t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1 \qquad (12.G.5)$$

and

$$\Pr\left( \lim_{t\to\infty} \frac{N^{21}_t}{N^{22}_t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1. \qquad (12.G.6)$$

For $m = \bar m, \bar m + 1, \ldots$ and for $n = 0,1,2,\ldots$ define the events

$$A_{mn} \doteq \{\forall \tau \ge n:\ X_\tau \ge m\}.$$

Then, conditional probabilities $\Pr(\ldots | \forall \tau \ge n:\ X_\tau \ge m)$ can be written as
$\Pr(\ldots | A_{mn})$. From Lemma 12-12.D.3 it follows that for all $m \ge \bar m$ and for all
$n \ge 0$ we have

$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{21}_s}{t} \ge 2\bar\gamma(m) \ \text{i.o.} \,\Big|\, A_{mn} \right) = 0 \Rightarrow$$
$$\Pr\left( \frac{\sum_{s=n+1}^{n+t} M^{21}_s}{t} < 2\bar\gamma(m) \ \text{a.a.} \,\Big|\, A_{mn} \right) = 1 \Rightarrow$$
$$\Pr\left( \exists t'_{nm} : \forall t > t'_{nm}\ \sum_{s=n+1}^{n+t} M^{21}_s < 2\bar\gamma(m) \cdot t \,\Big|\, A_{mn} \right) = 1 \Rightarrow$$
$$\Pr\left( \exists t'_{nm} : \forall t > t'_{nm}\ \sum_{s=1}^{n+t} M^{21}_s < 2\bar\gamma(m) \cdot t + n \,\Big|\, A_{mn} \right) = 1 \Rightarrow$$
$$\Pr\left( \exists t'_{nm} : \forall t > t'_{nm}\ \frac{\sum_{s=1}^{n+t} M^{21}_s}{n+t} < 2\bar\gamma(m) \cdot \frac{t}{n+t} + \frac{n}{n+t} \,\Big|\, A_{mn} \right) = 1.$$

It follows that, for all $m \ge \bar m$ and for all $n \ge 0$, and conditional on the event
$A_{mn}$, we have (with probability 1)

$$\exists t'_{nm} : \forall t > t'_{nm}\quad \frac{\sum_{s=1}^{n+t} M^{21}_s}{n+t} < 2\bar\gamma(m) \cdot \frac{t}{n+t} + \frac{n}{n+t}.$$

In addition, the following inequalities are true (obviously with probability 1
and for all $m$ and $n$)

$$\forall t:\ \frac{t}{n+t} \cdot 2\bar\gamma(m) < 2\bar\gamma(m), \qquad (12.G.7)$$
$$\exists t''_{nm} : \forall t > t''_{nm}:\ \frac{n}{n+t} < \bar\gamma(m). \qquad (12.G.8)$$

Taking $t_{nm} = \max(t'_{nm}, t''_{nm})$, it follows that, for all $m \ge \bar m$ and for all $n \ge 0$,

$$\Pr\left( \exists t_{nm} : \forall t > t_{nm}\ \frac{\sum_{s=1}^{n+t} M^{21}_s}{n+t} \le 3\bar\gamma(m) \,\Big|\, A_{mn} \right) = 1 \Rightarrow$$
$$\Pr\left( \frac{N^{21}_t}{t} \le 3\bar\gamma(m) \ \text{a.a.} \,\Big|\, A_{mn} \right) = 1 \qquad (12.G.9)$$

(where "a.a." means "almost always"). Define

$$A_m \doteq \bigcup_{n=1}^{\infty} A_{mn} = \bigcup_{n=1}^{\infty} \{\forall \tau \ge n:\ X_\tau \ge m\} = \{\exists n : \forall \tau \ge n:\ X_\tau \ge m\} = \{X_\tau \ge m \ \text{a.a.}\};$$

also, define

$$A \doteq \bigcap_{m=\bar m}^{\infty} A_m = \bigcap_{m=\bar m}^{\infty} \{X_\tau \ge m \ \text{a.a.}\} = \{\forall m \ge \bar m:\ X_\tau \ge m \ \text{a.a.}\}.$$

It follows immediately that $A = \{\lim_{\tau\to\infty} X_\tau = +\infty\}$. Next, define for $m = \bar m, \bar m + 1, \ldots$ the events

$$B_m \doteq \left\{ \frac{N^{21}_t}{t} \le 3\bar\gamma(m) \ \text{a.a.} \right\}, \qquad B \doteq \bigcap_{m=\bar m}^{\infty} B_m = \left\{ \forall m \ge \bar m:\ \frac{N^{21}_t}{t} \le 3\bar\gamma(m) \ \text{a.a.} \right\}.$$

Note that, since $\lim_{m\to\infty} \bar\gamma(m) = 0$, and since $N^{21}_t \ge 0$, we have

$$B = \left\{ \forall m \ge \bar m:\ \frac{N^{21}_t}{t} \le 3\bar\gamma(m) \ \text{a.a.} \right\} = \left\{ \lim_{t\to\infty} \frac{N^{21}_t}{t} = 0 \right\}.$$
Now, from eq. (12.G.9) it follows that

$$\Pr(B_m | A_{mn}) = 1 \Rightarrow \Pr(B_m | A_{mn}) \Pr(A_{mn}) = \Pr(A_{mn}),$$

therefore

$$\Pr(B_m \cap A_{mn}) = \Pr(A_{mn}). \qquad (12.G.10)$$

Note that for a fixed $m$ and for $n < n'$ we have $A_{mn} \subset A_{mn'}$. Then, from
Lemma A.2 it follows that

$$\lim_{n\to\infty} \Pr(A_{mn}) = \Pr\left( \bigcup_{n=1}^{\infty} A_{mn} \right) = \Pr(A_m). \qquad (12.G.11)$$

On the other hand, for a fixed $m$ and for $n < n'$ we have $B_m \cap A_{mn} \subset B_m \cap A_{mn'}$.
Hence

$$\lim_{n\to\infty} \Pr(B_m \cap A_{mn}) = \Pr\left( \bigcup_{n=1}^{\infty} (B_m \cap A_{mn}) \right) = \Pr(B_m \cap A_m). \qquad (12.G.12)$$

Since $\Pr(B_m \cap A_{mn}) = \Pr(A_{mn})$ for all $m \ge \bar m$ and $n \ge 0$, it follows from
(12.G.10), (12.G.11) and (12.G.12) that for all $m \ge \bar m$

$$\Pr(B_m \cap A_m) = \Pr(A_m). \qquad (12.G.13)$$

For $m < m'$ we have: $\{X_t \ge m' \ \text{a.a.}\} \Rightarrow \{X_t \ge m \ \text{a.a.}\}$, hence $A_{m'} \subset A_m$
and $B_{m'} \subset B_m$, since $\bar\gamma(m)$ decreases monotonically. It follows that

$$\lim_{m\to\infty} \Pr(A_m) = \Pr\left( \bigcap_{m=\bar m}^{\infty} A_m \right) = \Pr(A), \qquad (12.G.14)$$
$$\lim_{m\to\infty} \Pr(B_m \cap A_m) = \Pr\left( \bigcap_{m=\bar m}^{\infty} (B_m \cap A_m) \right) = \Pr(B \cap A). \qquad (12.G.15)$$

It follows from (12.G.13), (12.G.14) and (12.G.15) that $\Pr(A) = \Pr(A \cap B) \Rightarrow \frac{\Pr(A \cap B)}{\Pr(A)} = 1 \Rightarrow \Pr(B|A) = 1$. In other words

$$\Pr\left( \forall m \ge \bar m:\ \frac{N^{21}_t}{t} \le 3\bar\gamma(m) \ \text{a.a.} \,\Big|\, \forall m \ge \bar m:\ X_\tau \ge m \ \text{a.a.} \right) = 1 \Rightarrow$$
$$\Pr\left( \lim_{t\to\infty} \frac{N^{21}_t}{t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1. \qquad (12.G.16)$$

This completes the proof of eq. (12.G.1).


To show eq. (12.G.2) note that

$$\Pr\left( \lim_{t\to\infty} \frac{N^{2}_t}{t} = \pi_2 \right) = 1.$$

Since by assumption $\Pr(\lim_{\tau\to\infty} X_\tau = +\infty) > 0$, it follows by Lemma A.6 that

$$\Pr\left( \lim_{t\to\infty} \frac{N^{2}_t}{t} = \pi_2 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1. \qquad (12.G.17)$$

Also recall that

$$\frac{N^{22}_t + N^{21}_t}{t} = \frac{N^{2}_t}{t} \Rightarrow \frac{N^{22}_t}{t} = \frac{N^{2}_t}{t} - \frac{N^{21}_t}{t}. \qquad (12.G.18)$$

By taking limits in eq. (12.G.18) and using eqs. (12.G.16), (12.G.17) it follows
that

$$\Pr\left( \lim_{t\to\infty} \frac{N^{22}_t}{t} = \pi_2 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1.$$

This completes the proof of eq. (12.G.2).


For $m = \bar m, \bar m + 1, \ldots$ define the events

$$C_m \doteq \left\{ \frac{N^{11}_t}{t} \ge \bar\alpha(m) - \frac{2}{m} \ \text{a.a.} \right\}, \qquad C \doteq \bigcap_{m=\bar m}^{\infty} C_m = \left\{ \forall m \ge \bar m:\ \frac{N^{11}_t}{t} \ge \bar\alpha(m) - \frac{2}{m} \ \text{a.a.} \right\}.$$

Since $\lim_{m\to\infty} \bar\alpha(m) = \pi_1$, it follows that

$$C = \left\{ \forall m \ge \bar m:\ \frac{N^{11}_t}{t} \ge \bar\alpha(m) - \frac{2}{m} \ \text{a.a.} \right\} = \left\{ \liminf_{t\to\infty} \frac{N^{11}_t}{t} \ge \pi_1 \right\}. \qquad (12.G.19)$$

Using Lemma 12-12.E.3, it can be shown that $\Pr(C_m | A_{mn}) = 1$ and then,
taking limits, we get $\Pr(C|A) = 1$. Also, for $m = 1,2,\ldots$ define the events

$$D_m \doteq \left\{ \frac{N^{11}_t}{t} \le \tilde\alpha(m) + \frac{2}{m} \ \text{a.a.} \right\}, \qquad D \doteq \bigcap_{m=1}^{\infty} D_m = \left\{ \forall m \ge 1:\ \frac{N^{11}_t}{t} \le \tilde\alpha(m) + \frac{2}{m} \ \text{a.a.} \right\}.$$

Since $\lim_{m\to\infty} \tilde\alpha(m) = \pi_1$, it follows that

$$D = \left\{ \forall m \ge 1:\ \frac{N^{11}_t}{t} \le \tilde\alpha(m) + \frac{2}{m} \ \text{a.a.} \right\} = \left\{ \limsup_{t\to\infty} \frac{N^{11}_t}{t} \le \pi_1 \right\}. \qquad (12.G.20)$$

Using Lemma 12-12.F.3, it can be shown that $\Pr(D_m | A_{mn}) = 1$ and then,
taking limits, we get $\Pr(D|A) = 1$. Using $\Pr(C|A) = 1$, $\Pr(D|A) = 1$, the fact
that $A = \{\lim_{\tau\to\infty} X_\tau = +\infty\}$ (with $\Pr(A) > 0$) and eqs. (12.G.19), (12.G.20), it
is not hard to show that

$$\Pr\left( \liminf_{t\to\infty} \frac{N^{11}_t}{t} \ge \pi_1 \ \text{and} \ \limsup_{t\to\infty} \frac{N^{11}_t}{t} \le \pi_1 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1 \Rightarrow$$
$$\Pr\left( \lim_{t\to\infty} \frac{N^{11}_t}{t} = \pi_1 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1.$$

This completes the proof of eq. (12.G.3). Eq. (12.G.4) is proved similarly to
eq. (12.G.2).
Finally, from eq. (12.G.3) and eq. (12.G.4) it follows that

$$\Pr\left( \lim_{t\to\infty} \frac{N^{12}_t}{t} = 0 \ \text{and} \ \lim_{t\to\infty} \frac{N^{11}_t}{t} = \pi_1 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1 \Rightarrow$$
$$\Pr\left( \lim_{t\to\infty} \frac{N^{12}_t}{N^{11}_t} = 0 \,\Big|\, \lim_{\tau\to\infty} X_\tau = +\infty \right) = 1,$$

which is eq. (12.G.5); eq. (12.G.6) is proved similarly. This completes the proof
of the Theorem for the case when $\lim_{\tau\to\infty} X_\tau = +\infty$.

The part of the proof concerning the case $\lim_{\tau\to\infty} X_\tau = -\infty$ follows exactly
the same pattern as the previously presented results, requiring the proof of
additional lemmas, corresponding to Lemmas 12-12.D.3, 12-12.E.3 and 12-12.F.3.
This is omitted for the sake of brevity. ∎
IV Connections
13 BIBLIOGRAPHIC REMARKS

In this chapter we review the literature of modular and, more generally, multiple
models methods. We discuss a number of related approaches from the neural
network literature, as well as from the areas of statistical pattern recognition,
econometrics, statistics, fuzzy sets and control theory.

13.1 INTRODUCTION

13.1.1 Modular and Multiple Models Methods


In this and the next chapter we discuss what can be broadly described as ma-
chine learning systems which utilize multiple models. We use the term "machine
learning" in its broadest possible sense, so as to include operations performed
by neural networks, fuzzy systems, statistical pattern recognizers, as well as by
more "conventional" systems utilized in statistics, econometrics, control the-
ory and so on. As will soon become clear, there is considerable overlap in the
methods used by these disciplines.
Let us first clarify the meaning of the terms multiple models methods and
modular methods. In our understanding, modular methods are a subset of mul-
tiple models ones. A multiple models method makes concurrent use of several
alternative models of the same "process". A modular method has the addi-
tional characteristic that anyone of the models can be removed and replaced
by an alternative model (performing the same or a similar function) without
requiring extensive modification (for instance retraining) of the remaining
components (models). We refer to this property as interchangeability. In short, we

V. Petridis et al., Predictive Modular Neural Networks

© Kluwer Academic Publishers 1998


propose the following definition of modularity (which depends on the notion of
interchangeability).
Definition 1 A learning method is modular iff it employs a number of inter-
changeable components.
The above should be taken as a guiding principle, rather than as a formal
mathematical definition. In practice the distinction between modular and mul-
tiple models methods is not very sharp and there exist examples of methods
which are "borderline-modular". We will discuss this issue in more detail in
later sections. It should also be mentioned that other opinions / definitions as
to what constitutes a modular learning system exist; we will present a few of
them later.
The idea of multiple models has been used widely by researchers in several
areas, including (but not being limited to) neural networks, statistical pattern
recognition, statistics, econometrics, fuzzy sets and control theory. We will try
to present some of this work in the following sections. We should stress that
in what follows we do not attempt to present a complete bibliography. Indeed,
such a task would be absolutely hopeless within a limited space, given that in
any one of the above disciplines the number of relevant papers is vast. Our
goal is more modest: to present our personal map of the literature, listing and
commenting on papers which we have found interesting. We hope that this
material will prove useful to other researchers in two ways. First, by making
them aware of connections which may have previously been unknown to them
and giving them starting points for their own literature search. Second, by
providing a context which will make clear the basic unity of the multiple model
approach (this theme will be discussed in greater detail in the next chapter).
In our opinion, the borders between the above mentioned disciplines are rather
fuzzy; some of the work to be presently discussed can be placed in a number
of different categories. In our presentation, sometimes we have been guided
by similarity of subject matter; in other cases we have followed similarity of
approach.

13.1.2 Static and Dynamic Problems


The problems and methods we will discuss in the following sections can be
separated into two categories, depending on whether they deal with static or
dynamic (i.e. time series) data. There are some qualitative differences between
the two kinds of problems, which can be reduced to a single factor: in the
dynamic case there will generally be correlations between successive samples.
This fact may furnish useful information which can be profitably utilized by a
time series-oriented method. On the other hand, it is not unusual to staticize
a time series problem. For example, consider the problem of predicting a time
series Yl, Y2, ... ,Yt, ... . Developing a predictor can be considered as a static
regression problem, where it is required to discover a mapping from inputs
Yt-M, Yt-M+l, ... ,Yt-l to outputs Yt. In this sense, the input/ output
BIBLIOGRAPHIC REMARKS 251

samples ({Yt-M, Yt-M+l. ... ,Yt-l}, Yt) can be considered as static patterns.
Hence there is considerable overlap between the methods used for static and
dynamic problems. So far in this book we have restricted ourselves to time
series problems. However, because of the considerable overlap between static
and dynamic methods, in this and the next chapter we will consider together
both the static and dynamic case.
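The lag-vector construction described in Section 13.1.2 is simple to set up in practice. The sketch below (plain Python, with an invented toy series; the function name is ours) turns a scalar time series into static input/output pairs with window length M:

```python
# Turn a time series y_1, y_2, ... into static (input, target) pairs:
# the input is the M previous values, the target is the current value.

def make_lag_patterns(y, M):
    patterns = []
    for t in range(M, len(y)):
        inputs = y[t - M:t]      # (y_{t-M}, ..., y_{t-1})
        target = y[t]            # y_t
        patterns.append((inputs, target))
    return patterns

series = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5]
pairs = make_lag_patterns(series, M=3)
assert len(pairs) == 3                      # one pair per predictable step
assert pairs[0] == ([0.0, 0.5, 1.0], 0.5)   # first static pattern
```

Any static regression method can then be fitted to `pairs`, which is the sense in which the prediction problem has been "staticized"; what is lost, as noted above, is any explicit use of the correlations between successive windows.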

13.1.3 Supervised and Unsupervised Learning

Learning methods in general are divided in supervised and unsupervised ones


and this distinction applies to multiple models methods as well. In supervised
learning a teacher is available who provides the learning system with some feed-
back regarding its performance. In many cases this simply amounts to having
available some target outputs which the learning system must approximate (in
this case feedback regarding the system's performance consists in computing
some measure of the difference between actual and estimated output, e.g. total
square error). Once again, the distinction is not sharp. Consider, for example,
the problem of modeling unknown sources, discussed in Part III. We have de-
scribed our approach as an unsupervised learning method, since an important
component is the estimation of the unobservable source process Zl. Z2, ... ; this
is unsupervised learning since we are never provided with the actual Zt values.
On the other hand, if the problem is considered as one of modelling the input/
output mechanism which generates the observable process Yl, Y2, ... ,then the
learning process can be considered supervised, since the target outputs Yt are
available.

13.2 NEURAL NETWORKS

13.2.1 Network Combination

A large amount of work has been carried out on the subject of multiple net-
works architectures. A good starting point is two special issues of the journal
Connection Science. Namely vol.8, no.3/4 is devoted to ensemble approaches
and vol. 9, no. 1 is devoted to modular approaches. The leading article in each
issue (Sharkey, 1996; Sharkey, 1997) discusses the difference between the two
approaches; according to Sharkey's definition modular approaches are charac-
terized by the use of specialized networks, while ensemble approaches employ
nonspecialized networks. While Sharkey does not explicitly introduce the in-
terchangeability criterion, it more or less follows from her elaboration of the
above categorization.
Keeping in mind that the above distinction is fuzzy to a considerable extent,
we can still consider two broad categories of multiple neural networks systems:
those which employ task-specialized networks and those which employ ensem-
bles of networks which perform the same task.

Combination of Specialized Networks. A book-length treatment of spe-


cialized (modular) neural networks appears in (Hrycej, 1992). The book also
includes references to a number of related papers. Another review of modular
neural networks, with fairly extensive bibliography is (Ronco and Gawthrop,
1995). We have already mentioned the special issue of Connection Science de-
voted to the same subject. Finally, (Haykin, 1994) is a general book about
neural networks, but has an extensive section on modular networks which also
includes references to a number of related papers.
In addition, here is a brief sample of papers reporting work on specialized
networks (Anand and Mehrotra and Mohan, 1995; Aussem and Murtagh, 1997;
Bartfai and White, 1997; Bengio, Fessant and Collobert, 1996; Bennani, 1995;
Catfolis and Meert,1997; Chiang and Fu, 1994; Fessant, Bengio and Collobert,
1996; Fun and Hagan, 1996; Kecman, 1996; Luttrell, 1997; Rodriguez et al.,
1996).
Specialized networks are frequently used for applications, e.g. speech (Waibel,
1988; Waibel, 1989a; Waibel, 1989b; Waibel, Sawai and Shikano, 1989) and
character recognition (Bodenhausen and Waibel, 1993). In both cases the
nature of the task favors specialization; e.g. one specialized network can be
developed for the recognition of each phoneme or character.
An important question regarding networks of specialized sub-networks is how to decide on which task each sub-network must specialize. In some cases this is obvious from the nature of the problem and can be done "manually". In less obvious cases, genetic or evolutionary methods have been used (Drabe, Bressgott and Bartscht, 1996; Happel and Murre, 1994; Liu and Yao, 1997).
Another popular methodology involves "constructive" algorithms (Moon and
Oh, 1995; Ramamurti and Ghosh, 1996) which grow "networks of networks" in
a data driven manner. Such methods will be discussed in Section 13.2.2.

Ensembles of Networks. We have already mentioned the special issue of Connection Science which is devoted to ensemble networks. The term "ensem-
ble" ("committee" is also used) usually denotes a collection of networks, each
of which has been trained to perform the same task. The rationale for using an
ensemble of networks is that by combining (in an appropriate manner) the outputs
of all the networks in the ensemble an improved output will be obtained.
The prototypical application of the ensemble methodology is in time series
prediction: several networks provide predictions of the same time series; these
predictions are combined to obtain a final prediction which has smaller error
and / or variance. This is an old and honored idea which originated in econo-
metrics (see Section 13.4). Time series prediction with ensembles of neural
networks is discussed, for example, in (Fessant, Bengio and Collobert, 1995a;
Fessant, Bengio and Collobert, 1995b; Ginzburg and Horn, 1994; Parmanto,
Munro and Doyle, 1996). Other applications of the averaging idea appear in
(Hashem, 1996; Mani, 1991; Schwarze and Hertz, 1994; Turner and Ghosh,
1996; Urbanczik, 1996).

Theoretical treatments of ensembles appear, for instance, in (Freund, Schapire, Singer and Warmuth, 1997; Naftaly, Intrator and Horn, 1997; Perrone and
Cooper, 1993; Raviv and Intrator, 1996; Tresp and Taniguchi, 1995).
More complicated methods of combination than averaging are also possible.
For instance, the coefficients of combination can be optimized (Cheng, Fadlalla and Lin, 1996; Hashem, 1997; Hashem and Schmeisser, 1995; Meir, 1995; Tresp and Taniguchi, 1995). Genetic algorithms are used for combination in (Opitz and Shavlik, 1996); consensus theoretic methods in (Benediktsson, Sveinsson and Ersoy, 1997); and some other combination possibilities appear in (Rosen,
1996; Krogh and Vedelsby, 1995).
For a statistical mechanics perspective on the averaging idea, see (Kang, Oh
and Kwon, 1997; Kang and Oh, 1996; Krogh and Sollich, 1997; Urbanczik,
1996). An idea related to averaging appears in (Smieja, 1996).
For applications to classification see (Chen, Wang and Chi, 1997; Hochberg,
Cook, Renals and Robinson, 1994; Ji and Ma, 1997; Shimsoni and Intrator,
1998; Waterhouse and Cook, 1996; Xu, Krzyzak and Suen, 1993).
Three related and rather sophisticated ideas for network combination are
boosting (Drucker, Schapire and Simard, 1993a; Drucker, Schapire and Simard,
1993b; Drucker et al., 1994; Drucker and Cortes, 1996), bagging (Breiman,
1996a) and stacking (Wolpert, 1992; Breiman, 1996b).

Mixtures of Experts. Mixtures of Experts and Hierarchical Mixtures of Experts are methods for combining specialized networks. We treat them sepa-
rately because of their wide use and some interesting characteristics they have.
Mixtures of Experts methods have been developed by several collaborators,
most notably Hinton, Jacobs, Jordan, Neal and Nowlan. An extensive bibli-
ography of methods for mixtures of experts combinations appears in (Jacobs,
1995).
The term experts refers to specialized networks. The original work on mix-
tures of experts is rather similar to the previously discussed methods for combi-
nation of specialized networks. See for example (Jacobs, 1989; Nowlan, 1990a;
Nowlan, 1990b; Jacobs and Jordan, 1991; Jacobs, Jordan and Barto, 1991;
Neal, 1991; Nowlan, 1991; Nowlan and Hinton, 1991).
This work culminated in (Jacobs, Jordan, Nowlan and Hinton, 1991) where
the adaptive mixtures of experts architecture was presented. This architecture
implements input / output relationships of the form y = f(x) by using a collection of expert networks. Each expert implements a function y = f_k(x) (where k = 1, 2, ..., K) which approximates y = f(x) for a particular range of x values. The experts' outputs are combined by a sum of the form y = Σ_{k=1}^{K} w_k(x) f_k(x), where the weights w_k(x) are implemented by a gating network. The important
characteristic of the method is that the parameters of both the experts and
the gating network are trained jointly, using a steepest descent or Expectation
Maximization (EM) algorithm. Hence, while the connectivity of the combined
network must be defined in advance, the separation of the input space in regions
is carried out automatically.
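To make the above concrete, here is a minimal sketch of a mixture of experts in Python. It is our illustration, not the original implementation of (Jacobs, Jordan, Nowlan and Hinton, 1991): two linear experts f_k(x) are combined by softmax gating weights w_k(x), and the parameters of experts and gate are trained jointly by gradient descent on the squared prediction error (the original work also gives an EM formulation).

```python
import math
import random

# Illustrative mixture of experts: K linear experts f_k(x) = a_k*x + b_k
# combined by softmax gating weights w_k(x); experts and gate are
# trained jointly by gradient descent on 0.5 * (y - yhat)^2.

class MixtureOfExperts:
    def __init__(self, K, lr=0.05, seed=0):
        rng = random.Random(seed)
        self.K, self.lr = K, lr
        self.a = [rng.uniform(-1, 1) for _ in range(K)]  # expert slopes
        self.b = [rng.uniform(-1, 1) for _ in range(K)]  # expert offsets
        self.c = [rng.uniform(-1, 1) for _ in range(K)]  # gating score slopes
        self.d = [0.0] * K                               # gating score offsets

    def gate(self, x):
        # softmax over linear gating scores: weights w_k(x), summing to 1
        s = [self.c[k] * x + self.d[k] for k in range(self.K)]
        m = max(s)
        e = [math.exp(v - m) for v in s]
        z = sum(e)
        return [v / z for v in e]

    def predict(self, x):
        w = self.gate(x)
        f = [self.a[k] * x + self.b[k] for k in range(self.K)]
        return sum(wk * fk for wk, fk in zip(w, f))

    def step(self, x, y):
        # one joint gradient step for experts and gating network
        w = self.gate(x)
        f = [self.a[k] * x + self.b[k] for k in range(self.K)]
        yhat = sum(wk * fk for wk, fk in zip(w, f))
        err = yhat - y
        for k in range(self.K):
            # expert gradients: d yhat / d a_k = w_k * x, d yhat / d b_k = w_k
            self.a[k] -= self.lr * err * w[k] * x
            self.b[k] -= self.lr * err * w[k]
            # gating gradient through the softmax: d yhat / d s_k = w_k * (f_k - yhat)
            g = err * w[k] * (f[k] - yhat)
            self.c[k] -= self.lr * g * x
            self.d[k] -= self.lr * g

# Fit a piecewise-linear target: y = -x for x < 0, y = 2x for x >= 0;
# each expert should specialize on one half-line.
moe = MixtureOfExperts(K=2)
rng = random.Random(1)
for _ in range(20000):
    x = rng.uniform(-2, 2)
    y = -x if x < 0 else 2 * x
    moe.step(x, y)
```

Note that the separation of the input range between the two experts is found by the training procedure itself; only the number of experts is fixed in advance, which is precisely the point made in the text.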

Subsequently, the method was explored theoretically (Ghahramani and Jordan, 1994; Jacobs, 1997; Jacobs, Peng and Tanner, 1997; Jordan and Xu, 1995;
Kang and Oh, 1996; Waterhouse, Mackay and Robinson, 1995; Waterhouse,
Mackay and Robinson, 1996; Xu and Jordan, 1993) and in addition it was ex-
tended along several directions. Perhaps the most notable extension was
the development of hierarchical mixtures of experts (Jordan and Jacobs, 1994).
In this case several layers of experts are utilized, organized in a tree architec-
ture, with experts at a given level acting as gating networks for experts at
the next lower level. Another interesting development was the incorporation of
mechanisms to describe time dependence between successive inputs (Cacciatore
and Nowlan, 1994; Meila and Jordan, 1997).
The above developments led to the realization that, in a more general con-
text, graphs of experts can be employed; such graphs include as special cases
tree-like hierarchies (Jordan, 1994) and Markov chains which express temporal
dependence (Waterhouse and Robinson, 1995; Meila and Jordan, 1997; Jordan,
Ghahramani and Saul, 1997; Ghahramani and Jordan, 1997).
Two points are worth mentioning in connection with the above. First, a
graph can be seen as a convenient way to illustrate the dependencies between
various processing elements (be they single neurons or networks of neurons);
the nodes of the graph correspond to the processing elements and the edges to dependencies between specific nodes (for instance input / output connec-
tions). Second, the graph can be used to organize the training of the network,
using either the EM algorithm (Dempster et al., 1977; Ghahramani, 1997) or
variational methods (Jordan, Ghahramani, Saul and Jaakkola, 1998). Hence
the use of graphical models offers a unifying framework for the combination of
networks. The subject of graphical models will be discussed again in Section
13.7.2.
The mixture of experts methodology has been applied to various problems.
For applications to modelling and control of dynamical systems see, for ex-
ample, (Chaer, Bishop and Ghosh, 1997; Jacobs and Jordan, 1993; Satoshi,
Hidekiyo and Yoshikazu, 1995). For applications to classification see (Alpaydin
and Jordan, 1996; Hu and Palreddy and Tompkins, 1995; Miller and Uyar, 1996;
Ramamurti and Ghosh, 1996; Waterhouse and Robinson, 1994). For applica-
tion to speech recognition see (Fritsch, Finke and Waibel, 1997b; Peng, Jacobs
and Tanner, 1996; Waterhouse and Cook, 1996; Waterhouse and Robinson,
1995; Zhao, Schwartz, Sroka and Makhoul, 1995). For time series prediction
see (Mangeas, Muller and Weigend, 1995; Weigend, Mangeas and Srivastava,
1995; Mangeas and Weigend, 1995; Weigend, 1996; Zeevi, Meir and Adler,
1996). For various other applications see (Hering and Haupt and Villmann,
1995; Hering and Haupt and Villmann, 1996; Prank et al., 1996; Pawelzik,
Kohlmorgen and Muller, 1996).
Usually the topology of a mixtures-of-experts network is fixed and training
is performed offline. Hence, papers which present online training (Tham, 1995)
and constructive training (Fritsch, Finke and Waibel, 1997a; Park and Hu,
1996; Saito and Nakano, 1996; Waterhouse and Robinson, 1996) are particularly
interesting.
An interesting reformulation of the mixtures of experts idea in terms of
"managers" relegating tasks to "sub-managers" appears in (Dayan and Hinton,
1993); other interesting points of view appear in (Dayan and Zemel, 1995;
Esteves and Nakano, 1995; Schaal and Atkeson, 1996; Xu, Hinton and Jordan,
1995).

RBF and Related Networks. Networks of radial basis function (RBF) neurons are popular because of the existence of efficient training algorithms (Bishop, 1991;
Bors and Pitas, 1996; Chen, Cowan and Grant, 1991; Heiss and Kampl, 1996;
Kaminski and Strumillo, 1997; Roy, Govil and Miranda, 1997; Sherstinsky and
Picard, 1996; Weymaere and Martens, 1991; Whitehead, 1996; Whitehead and
Choate, 1996; Xu and Krzyzak and Oja, 1993). Such training algorithms often
yield networks which utilize a tree (see below) topology. Also, RBF neurons
are often used in conjunction with growing networks algorithms (see Section
13.2.2). Finally, it should be noted that RBF networks have been shown to be
universal approximators (Benaim, 1994; Chen and Chen, 1995; Freeman and
Saad, 1995; Hartman, Keeler and Kowalski, 1990; Park and Sandberg, 1991;
Park and Sandberg, 1993).
Because of the above properties, RBF networks have been used in many ap-
plications. For example, regression and classification applications are reported
in (Krzyzak, Linder and Lugosi, 1996; Lee, 1991; Lee and Pan, 1996; Rosen-
blum, Yacoob and Davis, 1996); time series prediction in (Cheng, Chen and
Mulgrew, 1996; Hartman and Keeler, 1991; Moody and Darken, 1989); and dy-
namic systems identification and control in (Elanyar, 1994; Gorinevsky, 1995;
Rosenblum and Davis, 1996; Unar and Murray-Smith, 1997).
We are interested in RBF networks because, if each RBF neuron is considered as an elementary network, it is possible to think of RBF networks as multiple models systems. (A related point of view identifies each neuron as a rule¹; see (Hunt, Haas and Murray-Smith, 1996; Jang and Sun, 1993; Tresp, Hollatz and Ahmad, 1997).) If this point of view is adopted, then some of the above
mentioned tree and / or growing algorithms can be generalized and applied for
the combination / organization of networks with many neurons.

Trees. We have already discussed trees in connection with hierarchical mixtures of experts and with RBF networks. Generally, a network with a tree architecture furnishes a convenient way for organizing neurons or subnetworks. Additional examples of this approach can be found in (Omohundro, 1991; Sanger, 1991b; Sirat and Nadal, 1990); the parallel source identification algorithm de-
scribed in Chapter 10 exhibits the same point of view.

¹ Possibly a fuzzy rule; see Section 13.6.1.



Tree topologies can be fixed in advance, but one of the most useful character-
istics of tree networks is that they can grow as necessary during training (offline
or online). This property has been exploited both within the neural networks
community, as will be seen in the next section, and in other disciplines (see
Section 13.3.3).

13.2.2 Constructive and Growing Methods


A collection of multiple models can be organized according to a predefined
arrangement provided by the user; however, a more attractive alternative is
the use of an algorithm which will automatically provide an appropriate, data
driven arrangement. Such algorithms are referred to as constructive or growing
algorithms and are the subject of intensive research. Bibliographies of construc-
tive neural network algorithms appears in (Kwok and Yeung, 1997a; Kwok and
Yeung, 1997b).
A typical example of an algorithm for growing an RBF network is the Re-
source Allocating Network (Platt, 1991a; Platt, 1991b). Other RBF growing
algorithms are reported in (Fabri and Kadirkamanathan, 1996; Fritzke, 1994a;
Fritzke, 1994b; Kadirkamanathan and Niranjan, 1992; Kadirkamanathan, Ni-
ranjan and Fallside, 1991; Karayiannis and Weiqun, 1997; Whitehead and
Choate, 1994; Weymaere and Martens, 1991).
Another typical and popular growing algorithm is cascade correlation, which was originally reported in (Fahlman and Lebiere, 1990a; Fahlman and Lebiere,
1990b; Fahlman, 1991). Variations of cascade correlation appear in (Blonda,
Pasquariello and Smith, 1993; Courrieu, 1993; Gallant, 1986; Giles et al., 1995;
Hwang, You, Lay and Jou, 1993; Klagges and Soegtrop, 1992; Littman and
Ritter, 1996; Littman and Ritter, 1993; Littman and Ritter, 1992; Sjogaard,
1992; Sjogaard, 1991; Smotroff, Friedman and Conolly, 1991; Yeung, 1991).
Algorithms for growing trees are reported in (Banan and Hjelmstadt, 1992;
Brent, 1991; Deffuant, 1990; Miller and Rose, 1996; Perrone and Intrator,
1992; Sanger, 1991a; Sanger, 1991b; Sankar and Mammone, 1991; Rojer and
Schwartz, 1989); similarly, algorithms for creating networks by repeated neuron
splitting are reported in (Bellido and Fernandez, 1991; Wynne-Jones, 1992;
Hanson, 1990).
Other constructive algorithms can be found in (Alpaydin, 1990; Ash, 1989;
Baffer and Zelle, 1992; Bartlett, 1994; Baum and Lang, 1991; Choi and Park,
1994; Giles et al., 1995; Chee and Harrison, 1997; Choi and Choi, 1994; Chuanyi
and Psaltis, 1997; Fiesler, 1994; Fritzke, 1991; Mezard and Nadal, 1989a;
Mezard and Nadal, 1989b; Muselli, 1995; Nabhan and Zomaya, 1994; Nadal,
1989; Niranjan, 1993; Petridis and Paraschidis, 1993; Redding, Kowalczyk and
Downs, 1993; Romaniuk and Hall, 1993; Sadjadi, Sheedvash and Trujillo, 1993;
Setiono and Hui, 1995; Simon, 1993; Sin and Figueiredo, 1993; Tenorio and
Lee, 1989; Wong, 1993). Some of the above algorithms can be modified so that
they construct networks of subnetworks rather than single neurons.
A genetic algorithm for growing a network appears in (Angeline, Saunders
and Pollack, 1994). An interesting method which reduces a classification prob-
lem of K categories to K problems of two categories each (essentially building a tree) appears in (Anand, Mehrotra and Mohan, 1995).
For the sake of completeness, let us also give references to a few pruning
algorithms. A survey of such algorithms appears in (Reed, 1993). Additional
such algorithms are reported in (Castellano, Fanelli and Pelillo, 1997; Finnoff,
Hergert and Zimmerman, 1992; Goutte and Hansen, 1997; Hassibi and Stork,
1992; Hassibi, Stork, Wolff and Watanabe, 1994; Hergert and Finnoff, 1992;
Karnin, 1990; Le Cun, Denker and Solla, 1990; Levin, Leen and Moody, 1994;
Mukherjee and Fine, 1996; Omlin and Giles, 1993; Ramachandran and Pratt,
1991; Sietsma and Dow, 1988; Thodberg, 1993).

13.2.3 Self Organizing Maps

Self Organizing Maps (SOM) have been developed by T. Kohonen. An early paper is (Kohonen, 1982) and a relatively recent and comprehensive review is
(Kohonen, 1990). Book length treatments appear in (Kohonen, 1988a; Koho-
nen, 1988b; Kohonen, 1995).
SOMs consist of lattices of neurons; usually one- or two-dimensional orthogonal lattices are used. Higher dimensional inputs are mapped on these lattices
by an online process which is reminiscent of the k-means algorithm (see Sec-
tion 13.3.2). Hence each neuron in the lattice is associated with a collection
of inputs. In this sense, the SOM lattice can be seen as one more method for
organizing neurons which have specialized in similar but distinct tasks (in this
case "memorizing" input collections).
We are particularly interested in the online SOM learning algorithms which,
much like our data allocation algorithms, can be seen as generalizations of
the basic k-means algorithm. The convergence of SOMs has been investigated
intensively (Cottrell and Fort, 1986; Dersch and Tavan, 1995; Kohonen, 1982;
Lo and Yu and Bavarian, 1993; Lo and Bavarian, 1991; Luttrell, 1991; Luttrell,
1994; Ritter, 1986; Ritter and Schulten, 1988; Ritter and Schulten, 1991; Zhang,
1991; Yin and Allinson, 1997). This work may be a useful starting point for
analyzing more general online learning procedures for multiple models methods.
There are also connections with stochastic competitive learning (Kosko, 1991b).
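As a concrete illustration of the online learning procedure just mentioned, here is a minimal SOM sketch (our simplification, not Kohonen's exact formulation): each incoming input moves the best matching neuron, and to a lesser degree its lattice neighbors, toward the input.

```python
import random

# Illustrative one-dimensional SOM on scalar inputs: for each input,
# find the best matching unit (BMU) and move it, together with its
# lattice neighbors (with decaying strength), toward the input.

def train_som(data, n_neurons=10, epochs=20, lr=0.3, radius=2, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(0.0, 1.0) for _ in range(n_neurons)]   # neuron weights
    for _ in range(epochs):
        for x in data:
            # best matching unit: neuron whose weight is closest to the input
            bmu = min(range(n_neurons), key=lambda i: abs(w[i] - x))
            for i in range(n_neurons):
                d = abs(i - bmu)                  # distance on the lattice
                if d <= radius:
                    h = 0.5 ** d                  # neighborhood strength
                    w[i] += lr * h * (x - w[i])   # move toward the input
    return w

# Two well-separated input clusters; after training, neuron weights should
# cover both clusters, each neuron "memorizing" a collection of inputs.
data = [0.1, 0.12, 0.08, 0.9, 0.88, 0.92] * 10
weights = train_som(data, seed=1)
```

The inner loop is essentially an online k-means update with the extra lattice-neighborhood term; dropping the neighbors (radius = 0) recovers plain online k-means / LVQ, which is the connection drawn in the text.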

13.2.4 ART

The ART networks of Carpenter and Grossberg also utilize and process several
prototypes concurrently, so they can be considered to belong in the category of multiple models. Basic papers are (Carpenter and Grossberg, 1990; Carpenter,
Grossberg and Rosen, 1991; Carpenter, Grossberg and Reynolds, 1991). The
convergence properties of the ART networks have been examined thoroughly; see, for example, (Georgiopoulos, Heileman and Huang, 1990).

13.3 STATISTICAL PATTERN RECOGNITION


There is considerable overlap between statistical pattern recognition and neural
networks theory. Many neural network pattern recognition procedures have
originated in statistical pattern recognition. In our opinion both fields can be
taken as branches of statistics, on the conceptual level. (Historically, of course,
each field developed in its own distinct manner.) Volume 8, no.1 (1997) of
IEEE Trans. on Neural Networks is devoted to neural networks and statistical
pattern recognition. We find the overview in (Jain and Mao, 1997) and the
taxonomy / bibliography (Holmstroem, Koistinen and Laaksonen, 1997) par-
ticularly interesting. See also (Li and Tufts, 1997; Prakash and Murty, 1997;
Joshi and Ramakrishman and Houstis, 1997; Ji and Ma, 1997; Cho, 1997; Lee
and Landgrebe, 1997; Ridella and Rovetta and Zunino, 1997).
In addition to the above work, we will briefly discuss here three major themes
of "classical" statistical pattern recognition which we find particularly relevant
to multiple models issues. These themes are: k-nearest neighbor algorithms,
k-means algorithms and classification and regression trees.

13.3.1 k-Nearest Neighbor Algorithms


The basic idea of k-nearest neighbor classification algorithms is simple. A num-
ber of prototype patterns is given, and the classification of each prototype (to
one of a finite number of classes) is known. A new pattern is classified by
looking at the classification of its k nearest (in the Euclidean sense) neighbors.
Many variations of this basic idea are possible, both for classification and re-
gression problems; some of these are described in (Duda and Hart, 1973) as well
as in (Patrick, 1972). See (Cover, 1967; Cover and Hart, 1968) for a theoretical
treatment.
It is rather obvious that k-nearest neighbor algorithms fall within the general multiple models framework.
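The basic algorithm described above can be sketched in a few lines; the prototype patterns and labels below are purely illustrative.

```python
from collections import Counter

# Illustrative k-nearest neighbor classification: a new pattern takes
# the majority class of its k nearest (Euclidean) prototype patterns.

def knn_classify(prototypes, labels, x, k=3):
    # order prototypes by squared Euclidean distance to the new pattern
    dist2 = lambda p: sum((a - b) ** 2 for a, b in zip(p, x))
    nearest = sorted(range(len(prototypes)),
                     key=lambda i: dist2(prototypes[i]))[:k]
    # majority vote among the k nearest neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

prototypes = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
              (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["A", "A", "A", "B", "B", "B"]
```

Here each prototype acts as one elementary "model", and the combination rule (the vote) selects among them, which is why the method fits the multiple models framework.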

13.3.2 k-Means Algorithms


The classical version of the k-means algorithm is used for offline clustering of a finite data set. The algorithm cycles through the data, assigning each datum to one of a fixed number of clusters. A prototype pattern (centroid) is asso-
ciated with each cluster and assignment is based on minimizing the Euclidean
distance between the datum and the centroids. After each allocation, cluster
centroids are recomputed. The basic algorithm and several variations are described and theoretically analyzed in (Patrick, 1972); the classical reference is
(MacQueen, 1965). An online version is also possible, which does not cycle
through the data, but instead processes a continuously incoming data stream;
this is equivalent to learning vector quantization (LVQ) (Haykin, 1994; Gonza-
lez, Grana and D'Anjou, 1994) and stochastic learning (Kosko, 1991b) and also
closely related to the SOM algorithms of Section 13.2.3. We have also remarked
on the similarity of this procedure to our data allocation algorithms (Chapter
10). For some additional clustering algorithms related to k-means see (Jain
and Dubes, 1988). Naturally, the basic algorithm can be applied recursively,
with cluster splitting, to provide tree-shaped hierarchical clustering.
The k-means algorithm and its variations are often used in the neural networks context (Chinrungrueng, 1995). In particular, this approach is used
often for initialization of RBF networks (Moody and Darken, 1989).
It is clear that the k-means algorithm and its variations fall within the mul-
tiple models context, with each cluster / centroid corresponding to one model.
In Chapter 10 we have remarked on possible generalization of k-means, where
clustering is performed according to the degree of constraint satisfaction.
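The basic offline algorithm described above can be sketched as follows; initializing the centroids with the first k data points is one common simple choice, and the data are illustrative.

```python
# Illustrative classical (offline) k-means: assign each datum to the
# nearest centroid, recompute each centroid as the mean of its cluster,
# and repeat until the assignment stops changing.

def kmeans(data, k):
    centroids = list(data[:k])     # simple initialization: first k data points
    assignment = None
    while True:
        # assignment step: nearest centroid in squared Euclidean distance
        new_assignment = [
            min(range(k),
                key=lambda j: sum((a - c) ** 2 for a, c in zip(x, centroids[j])))
            for x in data
        ]
        if new_assignment == assignment:
            return centroids, assignment
        assignment = new_assignment
        # update step: recompute each centroid as the mean of its members
        for j in range(k):
            members = [x for x, lab in zip(data, assignment) if lab == j]
            if members:
                dim = len(members[0])
                centroids[j] = tuple(sum(m[i] for m in members) / len(members)
                                     for i in range(dim))

# Two well-separated clusters; the data alternate so that the first two
# points (the initial centroids) lie one in each cluster.
data = [(0.0, 0.1), (5.0, 5.1), (0.1, 0.0), (5.1, 4.9), (0.2, 0.2), (4.9, 5.0)]
centroids, assignment = kmeans(data, k=2)
```

Each centroid plays the role of one model in the multiple models reading of the algorithm; the online variant mentioned in the text replaces the batch update with an incremental move of the winning centroid toward each incoming datum.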

13.3.3 Classification, Regression and Decision Trees

While decision trees are usually treated separately from classification and regression trees (CART), they are rather similar. In both cases some incoming datum
must be processed by one of several available models; the appropriate model
is chosen by traversing a tree where a decision (selection of a model subset)
is taken at every node. This results in successive refinement of the candidate
models set, until a single model is selected.
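The traversal just described can be sketched as follows; the node tests, thresholds and local models below are purely illustrative.

```python
# Illustrative tree-based model selection: each internal node tests the
# incoming datum and narrows the candidate model set; a leaf holds the
# single selected model.

class Leaf:
    def __init__(self, model):
        self.model = model          # the single model selected at this leaf

class Node:
    def __init__(self, feature, threshold, left, right):
        self.feature = feature      # which component of the datum to test
        self.threshold = threshold  # decision taken at this node
        self.left = left            # subtree for datum[feature] <  threshold
        self.right = right          # subtree for datum[feature] >= threshold

def select_model(tree, datum):
    # traverse the tree: each decision refines the candidate model set
    node = tree
    while isinstance(node, Node):
        node = node.left if datum[node.feature] < node.threshold else node.right
    return node.model

# Four local models, each a simple function valid in one quadrant of the plane.
tree = Node(0, 0.0,
            Node(1, 0.0, Leaf(lambda x: -x[0] - x[1]), Leaf(lambda x: x[1] - x[0])),
            Node(1, 0.0, Leaf(lambda x: x[0] - x[1]), Leaf(lambda x: x[0] + x[1])))

model = select_model(tree, (2.0, 3.0))
```

The leaves may hold anything from constants to full neural networks, which is the sense in which, as remarked below, trees are a device for organizing multiple models efficiently.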
This approach has been used in the neural networks context, as already re-
marked (Section 13.2.1). Similar work has appeared in the context of statistical
pattern recognition, as well as statistics proper. For instance an early example
of a regression tree is presented in (Friedman, 1979). The seminal work on
classification and regression trees is (Breiman, Friedman, Olshen and Stone,
1984). An interesting recent paper is (Brodley and Utgoff, 1995). For decision
trees, see (Quinlan, 1993).
We have already remarked that trees are essentially a device for organizing
multiple models in an efficient manner. These models can be simple or rel-
atively complex (e.g. neural networks). Regarding the construction of trees,
many methods are available. (Breiman, Friedman, Olshen and Stone, 1984)
gives a very complete exposition for supervised construction of CARTs. For
a more recent review see (Buntine, 1994a); also see (Chaudhuri, Huang, Loh
and Yao, 1994; Chaudhuri, Lo, Loh and Yangi, 1995) and, for an application
to time series prediction (Farmer and Sidorowich, 1988). For decision trees,
see (Quinlan, 1986; Quinlan, 1993); a recent review appears in (Breslow and
Aha, 1997). In addition the methods presented in Section 13.2.2 (for neural
networks combination) also apply here. Finally, for the sake of completeness,
note that some useful techniques for pruning trees can be found in (Kubat and
Flotzinger, 1995).
Great effort has been expended on the theoretical analysis of the properties
of trees. Here we only list a few samples of such work (Breiman, 1996c; Ehren-
feucht and Haussler, 1989; Helmbold and Schapire, 1995; Quinlan and Rivest,
1989).

13.4 ECONOMETRICS AND FORECASTING

A large number of predictor combination methods appear in the literature of statistics, econometrics and operations research. These are clearly multiple
models methods: for a particular prediction task a number of alternative pre-
dictors are developed and a rule is devised to select which predictor to use at
a particular time.
In the operations research community, a seminal early article is (Bates and
Granger, 1969), followed by a sequence of papers of which (Dickinson, 1973; Dickinson, 1975; Bunn, 1975) are representative. Similarly, a very extensive review of predictor combination methods appears in the International Journal of Forecasting (Clemen, 1989); see also (Winkler, 1989; Makridakis, 1989;
Armstrong, 1989; Hogarth, 1989). The above papers are mostly concerned
with fixed combinations; for combinations with changing weights see for in-
stance (Deutsch, Granger and Terasvirta, 1994).
The same idea appears in the econometrics literature under the name switch-
ing regressions. An early paper is (Quandt, 1958), followed by (Quandt, 1972;
Goldfeldt and Quandt, 1973; Quandt and Ramsey, 1978). Also, the idea of
fitting local models appears, for instance, in (Cleveland, Devlin and Grosse,
1988) and threshold models in (Tong and Lim, 1980). The idea of a time se-
ries (Markovian) switching of regime (which is another implementation of local
modelling) is also quite popular; see (Hamilton, 1988; Hamilton, 1989; Hamil-
ton, 1990; Hamilton, 1991; Hamilton and Susmel, 1994; Hamilton and Lin,
1996b; Hamilton, 1996). Finally a recent and extensive review appears in the
Ph.D. thesis (Krolzig, 1997).
For a review of econometrics and forecasting in relation to the connectionist point of view see (Hansen and Nelson, 1997) and also the collection of papers in (Weigend and Gershenfeld, 1994); we found especially interesting (Fraser and Dimitriadis, 1994) for a connection with Hidden Markov Models
and (Lewis, Ray and Stevens, 1994) for a connection with multivariate adap-
tive splines. Other connectionist approaches to forecasting can be found in
(Weigend, Rumelhart and Huberman, 1990; Weigend, Rumelhart and Huber-
man, 1991; Connor, Martin and Atlas, 1994).

13.5 FUZZY SYSTEMS

13.5.1 Fuzzy Rule Systems as Multiple Models

In a certain sense all fuzzy rule systems can be considered as multiple mod-
els systems: one rule corresponds to one model. The analogy becomes more
obvious if we consider the usual implementation of fuzzy rules by radial basis
functions; in this case all the remarks of Section 13.2.1 apply here as well. Sev-
eral papers can be cited which discuss this point of view; consider, for example,
(Hunt and Brown, 1995; Hunt, Haas and Murray-Smith, 1996; Jang and Sun,
1993; Kim and Mendel, 1995).

13.5.2 Takagi-Sugeno Algorithms


A definite connection between fuzzy systems and local models of dynamical
systems appears in Sugeno's work on modelling and control of dynamical sys-
tems. This can be seen, for example, in (Takagi and Sugeno, 1985; Sugeno
and Kang, 1986; Sugeno and Kang, 1988; Sugeno et al., 1989; Sugeno and
Yasukawa, 1993).

13.5.3 Combination of Networks by Fuzzy Logic


We have discussed in previous sections various instances of model or network
combination methods. Model combination by fuzzy logic rules is an example of
this approach, which appears in connection with switching regressions (Hath-
away and Bezdek, 1993; Petridis and Kehagias, 1997) and selection of control
strategies (Cho and Kim, 1995; Kandadai and Tien, 1997; Park, Moon and
Lee, 1995; Qin and Borders, 1994; Zardecki, 1994).

13.5.4 Fuzzy Clustering and Classification


Finally, let us mention that fuzzy algorithms have been proposed for clustering
and classification problems. Such algorithms, much like their crisp counter-
parts, can be interpreted as multiple models methods, for the reasons presented
in the previous sections. Hence it is informative to study fuzzy clustering algo-
rithms, such as (Backer and Jain, 1981; Bezdek and Harris, 1978; Bezdek, Coray, Gunderson and Watson, 1981a; Bezdek, Coray, Gunderson and Wat-
son, 1981b; Dunn, 1973; Gath and Geva, 1989; Kaburlasos and Petridis, 1997;
Karayannis, 1994; Pal, Bezdek and Hathaway, 1996; Ruspini, 1969; Windham,
1982), as well as fuzzy classification algorithms such as (Layeghi et al., 1994;
Nozaki, 1994; Pal and Majumder, 1977).

13.6 CONTROL THEORY


We use the term "control theory" to also include estimation and identification
problems. Multiple models methods have proved especially useful for problems
of these two categories.

13.6.1 Early Work


One of the first uses of multiple models in the control literature is in a state
estimation problem using multiple Kalman filters (Magill, 1965). A similar
method is presented in (Sworder, 1969) where a Markovian assumption is used
to model transitions between models.

13.6.2 Update of Mixtures


An idea that was introduced relatively early to the state estimation problem
for nonlinear systems and/or systems with non-Gaussian noise, is the representation of a non-Gaussian probability density as a mixture of Gaussians. It is

then possible to obtain recursive equations which describe the evolution of this
mixture. An early example of this idea appears in (Srinivasan, 1969). Bucy
uses a similar but perhaps more general idea in (Bucy, 1969; Bucy and Senne,
1971). This approach is related to the mixtures of experts discussed in Section
13.2.1 and can be considered a multiple models method for the same reasons.

13.6.3 Lainiotis' Work

We pay special attention to Lainiotis' work because of his prolific output and
also because it influenced to a considerable extent the development of our own
methods. Lainiotis originally presented his idea in a pattern recognition con-
text (Hancock and Lainiotis, 1965; Hilborn and Lainiotis, 1968; Hilborn and Lainiotis, 1969a; Hilborn and Lainiotis, 1969b; Lainiotis, 1970). In all of these
cases Lainiotis essentially considered the problem of classifying a time series
generated by an unobservable source. He first applied his results to a control
theoretic problem in (Sims, Lainiotis and Magill, 1969), which was a response
to Magill's 1965 paper. Finally, (Lainiotis, 1971b) presented the essentials of
a methodology to treat the problems of state and parameter estimation and
control of a system with unknown parameters. This methodology depended
crucially on the use of multiple models, namely a bank of Kalman filters, each
filter being tuned to one of the candidate parameter values of the actual sys-
tem. Later contributions in the control theory context include (Lainiotis, 1971a;
Lainiotis, 1971b; Henderson and Lainiotis, 1972; Park and Lainiotis, 1972;
Lainiotis, Deshpande and Upadhyay, 1972; Lainiotis, 1973); this is just a
small sample of the great number of Lainiotis' papers. The theory is presented
in comprehensive form in the collection (Lainiotis, 1974a) which includes the
papers (Lainiotis, 1974b; Lainiotis, 1974c; Lainiotis, 1974d). Later contribu-
tions include (Petridis, 1981; Lainiotis and Likothanasis, 1987), an application
to seismic signals (Lainiotis, Katsikas and Likothanasis, 1988) and, recently,
applications related to neural networks problems (Lainiotis and Plataniotis,
1994a; Lainiotis and Plataniotis, 1994b; Lainiotis and Plataniotis, 1994c).
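The bank-of-filters idea can be illustrated with a small sketch. This is a deliberately simplified scalar version of our own, not Lainiotis' exact equations: each filter in the bank is a Kalman filter tuned to one candidate value of the parameter a in x[t+1] = a·x[t] + w[t], y[t] = x[t] + v[t], and the Gaussian likelihood of each filter's innovation updates the posterior probability of its candidate.

```python
import math
import random

# Illustrative multiple-model estimation: a bank of scalar Kalman
# filters, one per candidate parameter a; each filter's innovation
# likelihood drives a Bayesian update of the candidate posteriors.

def filter_bank(ys, candidates, q=0.01, r=0.1):
    K = len(candidates)
    post = [1.0 / K] * K              # uniform prior over candidates
    xs = [0.0] * K                    # per-filter state estimate
    ps = [1.0] * K                    # per-filter state variance
    for y in ys:
        liks = []
        for k, a in enumerate(candidates):
            # time update under candidate parameter a
            x_pred = a * xs[k]
            p_pred = a * a * ps[k] + q
            # innovation and its variance
            e = y - x_pred
            s = p_pred + r
            liks.append(math.exp(-0.5 * e * e / s) / math.sqrt(2 * math.pi * s))
            # measurement update
            g = p_pred / s            # Kalman gain
            xs[k] = x_pred + g * e
            ps[k] = (1 - g) * p_pred
        # Bayesian update of the posterior over candidate parameters
        z = sum(p * l for p, l in zip(post, liks))
        post = [p * l / z for p, l in zip(post, liks)]
    return post

# Simulate the system with true parameter a = 0.9 and check whether the
# bank concentrates its posterior on the correct candidate.
rng = random.Random(0)
x, ys = 1.0, []
for _ in range(200):
    x = 0.9 * x + rng.gauss(0.0, 0.1)
    ys.append(x + rng.gauss(0.0, 0.3))
candidates = [0.3, 0.6, 0.9]
posterior = filter_bank(ys, candidates, q=0.01, r=0.09)
```

The posterior-weighted combination of the filters' estimates would then yield the overall state estimate; this recursive reweighting of a fixed bank of models is the structural idea that the PREMONN algorithms of Part I share with Lainiotis' partitioned approach.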

13.6.4 Multiple Models

Several versions of the multiple models idea have appeared in the control liter-
ature of the last twenty-five years; for instance see the book (Mariton, 1990).
A classical applications-oriented paper in this direction is (Athans et al., 1977);
(Kashyap, 1977) is also relevant. Theoretical analysis appears in (Tugnait
and Haddad, 1980; Greene and Willsky, 1980; Anderson, 1985) among other
places. More recent developments in the use of multiple models are described in
(Caputi and Moose, 1995; Kulkarni and Ramadge, 1996; Murray-Smith, 1994;
Murray-Smith and Gollee, 1994; Narendra and Balakrishnan, 1994; Narendra, Balakrishnan and Ciliz, 1995; Narendra and Balakrishnan, 1997; Pottman,
Unbehauen and Seborg, 1993; Skepstedt, Ljung and Millnert, 1992; Xiaorong
and Bar-Shalom, 1996). A recent book-length treatment of multiple models
approaches to control (and several other disciplines) is (Murray-Smith and Johansen, 1997). This book also has an extensive list of references.
A related idea is the use of Volterra, Laguerre etc. series to represent non-
linear systems. A book-length exposition appears in (Rugh, 1981); see also
(Wahlberg, 1991; Sbarbaro, 1997). Other multiple models approaches include the use of sliding modes (Utkin, 1977; Utkin, 1992) and GMDH models, which are treated extensively in the book (Farlow, 1984). Finally, we
have already discussed in Section 13.5.2 the Takagi-Sugeno approach to system
identification and control.

13.6.5 Switching Regimes


Most of the methods described previously make use of a "flat" collection of
models; in other words the transition between models is not structured. How-
ever, we have already mentioned a few examples of more structured methods,
which describe the transition between the models. Generally the structure is
obtained by postulating a Markovian switching model. As already mentioned,
this idea appears in (Sworder, 1969) and a little later in (Ackerson and Fu,
1970). Because usually the change of models is associated with the change of
the systems' operating mode or regime, this approach is usually termed switch-
ing regimes. The idea has become very popular and has been explored both
experimentally and theoretically to a great extent (Blom and Bar-Shalom, 1988;
Chizeck, Willsky and Castanon, 1986; Dufour and Bertrand, 1994a; Dufour
and Bertrand, 1994b; Helmick, Blair and Hoffman, 1996; Hilhorst, Ameron-
gen and Lohnberg, 1991; Millnert, 1987; Morse and Mayne, 1992; Petridis and
Kehagias, 1998).

13.6.6 Tree Structured Models


We have already discussed tree-structured models; their potential advantages
have resulted in their (relatively recent) introduction into control theoretic
methodologies. See (Foss, Johansen and Sorensen, 1995; Johansen and Foss,
1995; Hunt, Kalkkuhl, Fritz and Johansen, 1996; Murray-Smith and Johansen,
1997; Stromberg, Gustafson and Ljung, 1991).
In some sense Markovian and tree models are complementary: Markovian
models impose structure on the use of multiple models through time and tree
models impose structure on (models) space.

13.6.7 Unsupervised Generation of Multiple Models


In most of the approaches discussed above the user must design the local models
"manually". It is of obvious importance to obtain automatic model building
methods. Several possibilities exist along these lines. For instance, in tree
structured models, the complete arsenal of classification and regression trees
(Section 13.3.3) could be used profitably. Another approach is the use of genetic
algorithms to generate candidate models. See for instance (Li et al., 1995; Li
et al., 1994; Tan et al., 1995). The approach we have presented in Chapter 9
in connection to the waste water treatment plant is also relevant.

13.6.8 Sensor Fusion


Finally, here is a research area which does not lie within control theory proper
but is sufficiently related to be presented at this point. Sensor or Decision
Fusion refers to the problem of combining evidence from various sensors in or-
der to reach a decision that utilizes information generated from several sources.
While this is not strictly speaking a multiple models problem, we believe there
are methods in this field which may prove useful in the multiple models context.
For instance, a problem treated in the sensor fusion context is the (Bayesian)
evaluation of several (perhaps conflicting) hypotheses; this is of obvious value
for comparing and/or combining multiple models. Another sensor fusion prob-
lem is the arrangement of a given number of sensors in serial, parallel or ser-
ial/parallel combinations so as to minimize the probability of error; this appears
to be related to considerations regarding growing classification and regression
trees and/or tandems and merits further examination.
The field of sensor fusion has been growing at an explosive rate in the last two
decades. An extensive and recent overview appears in (Dasarathy, 1994). Some
important papers in this area are (Caputi and Moose, 1993; Krzysztofowicz,
1990; Hong and Lynch, 1993; Kazakos, 1991; Papastavrou and Athans, 1992;
Tenney and Sandell, 1981).

13.7 STATISTICS

Most of the work we have discussed in the previous sections could equally be
classified as statistical procedures. There are however two important statistical
methodologies which remain to be discussed: Hidden Markov Models (HMM)
and Graphical Models.

13.7.1 Hidden Markov Models


Hidden Markov models (HMM) are pairs of stochastic processes (Zt, Yt), where
Zt is Markovian and unobservable, while Yt is a function (deterministic or sto-
chastic) of Zt and fully observable. This definition is quite general and encom-
passes a large number of systems (for instance stochastic dynamical systems
such as the ones studied in control theory). The study of HMMs has tradition-
ally focussed on the case where Zt is taking values in a discrete, finite set; the
process Yt is usually (but not exclusively) also discrete valued.
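To make the definition concrete, here is a small sketch (ours, not from the book) of sampling from a discrete HMM: a hidden Markov chain Zt that evolves by a transition law, and an observable output Yt drawn from a state-dependent distribution. All transition and output probabilities below are made-up illustrative values.

```python
import random

def sample_hmm(T, seed=0):
    """Sample T steps of a toy two-state HMM (hidden states 0, 1; outputs 'a', 'b').

    Illustrative parameters, not from the book:
    transition[i][j] = Pr(Z_t = j | Z_{t-1} = i),
    emission[i][y]   = Pr(Y_t = y | Z_t = i).
    """
    transition = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
    emission = {0: {'a': 0.8, 'b': 0.2}, 1: {'a': 0.1, 'b': 0.9}}
    rng = random.Random(seed)

    def draw(dist):
        # Draw one value from a finite distribution given as {value: probability}.
        u, acc = rng.random(), 0.0
        for value, p in dist.items():
            acc += p
            if u < acc:
                return value
        return value  # guard against floating-point rounding

    z = 0  # initial hidden state
    states, outputs = [], []
    for _ in range(T):
        z = draw(transition[z])            # unobservable Markovian step
        states.append(z)
        outputs.append(draw(emission[z]))  # observable output, a function of z
    return states, outputs

states, outputs = sample_hmm(50)
```

The point of the sketch is only the two-layer structure: the states list would be hidden in practice, and inference must proceed from the outputs alone.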
The study of HMMs goes back to (Blackwell and Koopmans, 1957; Dharmadikari,
1963). However the real impetus in the area was generated by (Baum
et al., 1970; Baum, 1972). In these papers an efficient training method was
established for parameter estimation of HMMs. This method was eventually
understood to be a special case of the EM algorithm (Dempster, Laird and
Rubin, 1977). The existence of an efficient training method allowed the widespread
use of HMMs in speech recognition problems (Jelinek et al., 1983; Levinson et
al., 1983). In the last fifteen years the use of HMMs has been extended to a large
number of pattern recognition problems. An extensive and modern coverage
at book length is available in (Elliot, Aggoun and Moore, 1995). Engineering-
oriented reviews appear in (Poritz, 1988; Rabiner, 1988). A generalization
of HMMs to Hidden Markov Random Fields appears in (Kunsch, Geman and
Kehagias, 1995).
It should be obvious that our Markovian source switching model is essentially
a HMM. In this context, a HMM with K states may be considered to consist of
K sub-sources, each having its own output function. This observation under-
scores the usefulness of HMMs in multiple models situations. Essentially, what
HMMs offer is a framework for modelling the operation of several observation
generating processes (the Yt's) and the transition from one such process to the
next (the Zt's). The multiple model characteristics of this description become
more obvious when we turn to more complicated HMMs where the observation
process may utilize a complex input / output mechanism, while the parame-
ters of this mechanism depend on the current state, i.e. on the current source.
Hence, an input / output model corresponds to each state. In fact, predictive
hidden Markov models (Kenny, Lennig and Mermelstein, 1990; Deng, 1992),
which use various observation generating mechanisms (Gaussian distributions,
linear models, neural networks) are very similar to our PREMONNs. The same
can be said of (Poritz, 1982; Iso and Watanabe, 1990; Kung and Taur, 1995;
Rahim and Lee, 1996) and of Bengio's I/O HMMs (Bengio and Frasconi, 1995;
Bengio and Frasconi, 1996).
Other connections between HMMs and neural networks have been pointed
out (Bourlard and Wellekens, 1990) and various HMM/NN hybrids have been
proposed (Baldi and Chauvin, 1996; Bennani, 1995; Bengio, LeCun, Nohl and
Burges, 1995; Bengio, De Mori, Flammia and Kompe, 1992; Bourlard and
Morgan, 1990; Bourlard and Morgan, 1991; Morgan et al., 1993).
Finally, an interesting connection between HMMs and neural networks is
pointed out in (Kehagias, 1990). Also a good bibliography on hybrid HMM/NN
models appears in (Bengio, 1996).

13.7.2 Graphical Models


Graphical models (also known as Bayesian networks and influence diagrams)
are a relatively recent development. An important book in the history of the
subject is (Pearl, 1988). A more statistically oriented point of view appears in
(Whittaker, 1990) and (Jensen, 1993).
Graphical models can be explained shortly as follows: random variables are
represented as nodes in a (usually directed acyclic) graph and probabilistic de-
pendencies between the variables are represented by arcs connecting the nodes.
This representation offers a powerful visual aid to the user, but there is more
to graphical models than visualization. They offer a unified framework within
which trees, HMMs and neural networks (in particular hierarchical mixtures of
experts) can be described. However, the most important point is that a powerful
set of training methods is available for training graphical models; these
include the EM algorithm and variational methods (Jordan, Ghahramani and
Saul, 1998).
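The factorization idea behind graphical models can be illustrated with a tiny sketch (our own example, with made-up probability tables): for a chain-structured directed graph A → B → C over binary variables, the joint distribution factorizes as P(a, b, c) = P(a) P(b|a) P(c|b), and marginals are obtained by summing the factors.

```python
from itertools import product

# Illustrative conditional probability tables for binary variables A -> B -> C.
P_a = {0: 0.6, 1: 0.4}
P_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # P_b_given_a[a][b]
P_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # P_c_given_b[b][c]

def joint(a, b, c):
    """Joint probability from the chain factorization P(a) P(b|a) P(c|b)."""
    return P_a[a] * P_b_given_a[a][b] * P_c_given_b[b][c]

# The factorized joint is a proper distribution: it sums to one.
total = sum(joint(a, b, c) for a, b, c in product([0, 1], repeat=3))

# Marginal of C, obtained by summing out A and B along the graph.
P_c = {c: sum(joint(a, b, c) for a, b in product([0, 1], repeat=2))
       for c in [0, 1]}
```

For larger graphs one would not enumerate the joint as above; the efficient estimation algorithms mentioned in the text exploit the graph structure to avoid exactly this enumeration.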
It is of particular interest that a number of papers document the relation-
ships between graphical models and neural networks, notably (Ghahramani
and Jordan, 1997; Ghahramani, 1998; Hofman and Tresp, 1995; Jordan, 1994;
Jordan, Ghahramani and Saul, 1997; Meila and Jordan, 1997; Neal, 1992;
Waterhouse and Robinson, 1995). Finally, a good overview of learning methods
appears in (Buntine, 1995; Buntine, 1996).
14 EPILOGUE

We have presented the PREMONN family of algorithms for time series classifi-
cation, prediction and identification. The PREMONN algorithms are modular,
in the sense that they concurrently employ a number of time series models,
each of which may be modified or removed from the PREMONN system with-
out affecting the remaining modules. Hence PREMONNs belong to the larger
family of modular or multiple models algorithms which, as we have seen in the
previous chapter, have a long and successful history in various disciplines.
We believe the success of multiple models methods is due to the employment
of two important components. The first component is, quite obviously, the use
of multiple models, which may be considered as an implementation of the fun-
damental problem solving approach of divide-and-conquer. The advantages of
this approach are so well understood that there is no need for further comments
here. However, this is not the whole story. The use of multiple models in itself
would be ineffective if there were not an organizing framework within which the
multiple models can be employed to advantage. Regarding the choice of such
a framework there is considerable diversity of opinion; hence the plethora of
approaches and algorithms which is evident from the bibliographic references
of the previous chapter.
In our view, graphical models provide a good framework for reconciling most,
if not all, of the approaches which we have discussed. The operation of multiple
models or modules is organized along the edges of a graph, which delineates
the flow of information and computation. Classification and prediction can be
reformulated as estimation problems in the context of a probabilistic or causal
Bayesian network. Efficient estimation algorithms exist for the case of sparse
graphs (the EM algorithm) and are being developed for the case of dense ones
(variational methods).

V. Petridis et al., Predictive Modular Neural Networks
© Kluwer Academic Publishers 1998
However there is a catch. If a graph structure is available, then the estima-
tion problem is relatively straightforward. But when the graph is not available,
discovering it can be quite a hard problem in itself. Hence there is a great in-
terest in constructive and growing algorithms; we have offered a few references
to the field in the previous chapter and the interested reader can use these as
starting points to the explosively growing literature of the field.
Our point of view is hardly neutral. Our personal research program consists
in extending our source identification algorithms (Chapters 10, 11 and 12),
making them more efficient and studying their theoretical properties. The
analysis presented in Chapters 11 and 12 is a starting point; much remains to
be done.
Let us then conclude the book by listing a few problems which we consider
interesting.

1. We are interested in obtaining a more rigorous convergence proof for the case
of many sources and / or predictors; the arguments presented in Chapters
11 and 12 can be considered as heuristic. A more rigorous treatment may
require more delicate tools.

2. For example, it may be necessary to introduce a higher dimensional specialization
   process. For illustration purposes consider the case of the parallel
   algorithm with two sources and two predictors. The specialization process
   has been defined as X_t = N_t^{11} - N_t^{21} + N_t^{22} - N_t^{12}. This is a scalar process.
   The vector process [N_t^{11} - N_t^{21}, N_t^{22} - N_t^{12}] will undoubtedly furnish more
   information regarding the specialization state of each predictor. In order to analyze
   its convergence properties it will probably be necessary to employ methods
   traditionally used for the analysis of two dimensional random walks. This
   may be a hard problem.

3. We believe that our convergence conditions are sensible but hard to verify
for a practical problem, since they depend on the combination of active
sources, training algorithm and error threshold. We would like to develop
more applicable conditions. In addition we would like to understand the
existing ones better.

4. To this goal it will probably be useful to consider issues relating to the com-
plexity of the sources we want to identify and the capacity of the models we
employ. Concepts and tools from PAC, complexity and information theory
will probably be useful.

5. Finally, on a more applied note, we would like to further compare the perfor-
mance of serial and parallel data allocation algorithms and also to examine
the potential advantages of hybrid data allocation.
We consider these to be exciting problems; if this book stimulates research
along the above or similar lines, we will consider it successful.
Appendix A
Mathematical Concepts

In this appendix we review some concepts of measure theoretic probability.


Some concepts of mathematical analysis are prerequisite; they can be found in
many standard textbooks, for example (Royden, 1968) or (Billingsley, 1986).

A.1 NOTATION
Here are a few symbolisms which we use throughout the book.

1. The set of nonnegative integers {0, 1, 2, ...} is denoted by N.

2. The set of integers {..., -1, 0, 1, ...} is denoted by Z.

3. The set of real numbers is denoted by R.

4. We will often make use of the indicator function, denoted by 1(A), where
   A is some event (see also the next section). When the event is true, the
   indicator function takes the value one; otherwise it is zero. More formally

   1(A) = { 1 if A is true,
            0 otherwise.

   So for example 1(x > 3) equals 1 when x = 5, and equals 0 when x = 2.
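The indicator function is trivial to express in code; a quick sketch:

```python
def indicator(event):
    """1(A): returns 1 if the event (a boolean) is true, 0 otherwise."""
    return 1 if event else 0

# The book's example: 1(x > 3) equals 1 when x = 5 and 0 when x = 2.
value_at_5 = indicator(5 > 3)
value_at_2 = indicator(2 > 3)
```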


5. Generally, if Θ is a set, then Θ^N signifies the set of N-tuples from this set,
   i.e.

   Θ^N = {(θ_1, θ_2, ..., θ_N) : θ_n ∈ Θ for n = 1, 2, ..., N};

   in other words

   Θ^N = Θ × Θ × ... × Θ   (N times).

   For instance R^N denotes the set of N-tuples of real numbers (x_1, x_2, ..., x_N),
   x_n ∈ R for n = 1, 2, ..., N.
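For a finite Θ, the set Θ^N can be enumerated directly as a Cartesian power; a small sketch (our example):

```python
from itertools import product

theta = {'a', 'b'}   # a finite set Θ with two elements
N = 3

# All N-tuples (θ_1, ..., θ_N) with each θ_n drawn from Θ.
theta_N = list(product(sorted(theta), repeat=N))

# |Θ^N| = |Θ|^N; here 2^3 = 8 tuples.
size = len(theta_N)
```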

6. If X is a vector or matrix, then X' denotes its transpose.


7. The end of the proof of a theorem or lemma is denoted by the symbol •.

A.2 PROBABILITY THEORY

A.2.1 Fundamentals
We will use the standard setup of probability theory. Our exposition is brief;
for more details the reader is referred to (Billingsley, 1986).
We start with a probability space (Ω, F, P). Here Ω is the universal set. F
is a sigma field in Ω, i.e. a collection of subsets of Ω which contains Ω and
is closed under complements and countable unions. P is a probability measure
defined on elements of F, i.e. a set function P : F → [0, 1], which satisfies
P(∅) = 0, P(Ω) = 1 and is countably additive.
Random variables are P-measurable functions X(ω) of the form

X : Ω → Φ,

where Φ is an appropriate range. Stochastic processes are sequences of random
variables, for instance: X_0(ω), X_1(ω), X_2(ω), .... Following standard usage
(for reasons of brevity) we usually omit denoting the dependence on ω, writing
a random variable as X, and a stochastic process as X_0, X_1, ....
Events are simply elements of F, i.e. P-measurable sets. For instance we
may talk of the event that X_t < 1 and write something like

A = {X_t < 1};

this actually is a shorthand for

A = {ω such that X_t(ω) < 1}.

In many occasions we will consider the probability of an event A under the
probability measure P. This is denoted as P(A) or, more generically, as Pr(A)
(if it is clear from the context which measure P is referred to).
By fixing a particular point ω_0 ∈ Ω, we obtain X_0(ω_0), X_1(ω_0), X_2(ω_0), ...
which is a sample path or realization of the stochastic process X_0(ω), X_1(ω),
X_2(ω), ....
Next we define stationary and ergodic stochastic processes.

Definition 2 A stochastic process X_0, X_1, X_2, ... is called stationary if for
every t ∈ N and for every set A ∈ F^∞ we have

Pr([X_0, X_1, ...] ∈ A) = Pr([X_{t+1}, X_{t+2}, ...] ∈ A).

Consider some set Φ and the set Φ^∞, i.e. the set of infinite sequences
from Φ; finally take a set A ⊂ Φ^∞. A is called shift-invariant if for every
(φ_0, φ_1, φ_2, ...) ∈ A we have

(φ_0, φ_1, φ_2, ...) ∈ A ⇔ (φ_1, φ_2, ...) ∈ A.

Definition 3 A stochastic process X_0, X_1, X_2, ... is called ergodic if for every
shift-invariant set A, we have that Pr([X_0, X_1, X_2, ...] ∈ A) is equal to either 0
or 1.

A.2.2 Limits

Lemma A.1 If A ⊂ B and C ⊂ D then A ∩ C ⊂ B ∩ D.

Proof. This is quite obvious. If x ∈ A ∩ C then x ∈ A and x ∈ C, so x ∈ B
and x ∈ D, so x ∈ B ∩ D. •

Lemma A.2 (i) If for all n we have D_{n+1} ⊂ D_n then lim_{n→∞} Pr(D_n) = Pr(D),
where D = ∩_{n=1}^∞ D_n.
(ii) If for all n we have E_{n+1} ⊃ E_n then lim_{n→∞} Pr(E_n) = Pr(E), where
E = ∪_{n=1}^∞ E_n.

Proof. Part (i) is proved in (Royden, 1968). This is essentially the Monotone
Convergence Theorem (which will be stated more generally in the next section)
applied to the indicator functions f_n(ω) = 1(ω ∈ D_n), which have as limit the
indicator function f(ω) = 1(ω ∈ D). Part (i) can then be used to prove part
(ii). Consider the sets D_n = E_n^c, n = 1, 2, .... Then D_{n+1} ⊂ D_n and so
lim_{n→∞} Pr(D_n) = Pr(D), where D = ∩_{n=1}^∞ D_n. Next it is shown that D = E^c (and
hence Pr(E) = 1 - Pr(D)). Indeed:

1. If x ∈ D, then x ∈ D_n for all n. This means that x ∈ E_n^c for all n and
   so x ∉ E_n for any n and so x ∉ ∪_{n=1}^∞ E_n. So, x ∈ (∪_{n=1}^∞ E_n)^c = E^c. Hence,
   D ⊂ E^c.

2. If, on the other hand, x ∈ E^c = (∪_{n=1}^∞ E_n)^c, then x ∉ E_n for any n and so
   x ∈ D_n = E_n^c for every n. In other words, x ∈ ∩_{n=1}^∞ D_n = D. Hence E^c ⊂ D.

In short, E^c = D, so Pr(E^c) = Pr(D), and Pr(E) = 1 - Pr(D) = 1 - lim_{n→∞}
Pr(D_n) = lim_{n→∞} [1 - Pr(D_n)] = lim_{n→∞} Pr(E_n) and the proof is complete. •
The terms infinitely often (i.o.) and almost always (a.a.) are defined as
follows.

Definition 4 Given a sequence of events A_1, A_2, ..., the event A = "A_n occurs
infinitely often" ("i.o.") is defined to be A = ∩_{n=1}^∞ ∪_{k=n}^∞ A_k. We also write A =
{A_n i.o.}.

Another way to describe "i.o." is this: A = {ω : ∀n ∃k_n ≥ n such that
ω ∈ A_{k_n}}. Note that "A_n occurs infinitely often" does not depend on n.
Definition 5 Given a sequence of events A_1, A_2, ..., the event A = "A_n occurs
almost always" ("a.a.") is defined to be A = ∪_{n=1}^∞ ∩_{k=n}^∞ A_k. We also write A = {A_n
a.a.}. Another way to describe "a.a." is this: A = {ω : ∃n_0 such that ∀n ≥ n_0,
ω ∈ A_n}. Note that "A_n occurs almost always" does not depend on n.

Lemma A.3 The negation of "infinitely often" is "almost always", i.e. {A_n
i.o.}^c = {A_n^c a.a.}.
Proof. Take a sequence of events A_1, A_2, ... ∈ F. First we show that:

∪_{n=1}^∞ ∩_{k=n}^∞ A_k^c ⊂ (∩_{n=1}^∞ ∪_{k=n}^∞ A_k)^c.    (A.1)

To see this, consider

ω ∈ ∩_{n=1}^∞ ∪_{k=n}^∞ A_k ⇒
∀n ∃k_n ≥ n such that ω ∈ A_{k_n} ⇒
∀n ∃k_n ≥ n such that ω ∉ A_{k_n}^c ⇒
there is no n_0 such that ∀n ≥ n_0, ω ∈ A_n^c ⇒
ω ∉ ∪_{n=1}^∞ ∩_{k=n}^∞ A_k^c.

Hence we have shown eq.(A.1). Second, we show that

(∩_{n=1}^∞ ∪_{k=n}^∞ A_k)^c ⊂ ∪_{n=1}^∞ ∩_{k=n}^∞ A_k^c.    (A.2)

To see this consider

ω ∈ (∩_{n=1}^∞ ∪_{k=n}^∞ A_k)^c ⇒
ω ∉ ∩_{n=1}^∞ ∪_{k=n}^∞ A_k ⇒
∃n_0 such that ∀k ≥ n_0, ω ∉ A_k ⇒
∃n_0 such that ∀k ≥ n_0, ω ∈ A_k^c ⇒
ω ∈ ∪_{n=1}^∞ ∩_{k=n}^∞ A_k^c,

and so we have proved eq.(A.2) and the proof of the lemma is finished. •
We will use the Borel-Cantelli Lemma several times in connection to events
occurring infinitely often or almost always. This Lemma is stated as follows.

Lemma A.4 (Borel-Cantelli) If Σ_n Pr(A_n) < ∞, then Pr(A_n i.o.) = 0.

Proof. It appears in (Billingsley, 1968, pp.53-55). •

A.2.3 Probability distributions, densities and functions

Definition 6 Given a random variable X, its probability distribution function
is denoted by F_X(x) and defined by

F_X(x) ≜ Pr(X ≤ x).

Note that, by the definition of a random variable, the probability distribution
function of a random variable exists always. This is not necessarily true of the
probability density function.

Definition 7 Given a random variable X with a differentiable probability distribution
function F_X(x), the probability density function of X is denoted by
d_X(x) and is defined by

d_X(x) ≜ (d/dx) F_X(x).    (A.3)

It follows immediately from the above definition that for a random variable
X taking values in R we have

Pr(X ≤ x) = ∫_{-∞}^{x} d_X(y) dy.

If F_X is differentiable, then certainly X does not take values in a countable set.
Conversely, a countable valued random variable does not have a probability
density in the sense of the above definition. However, it is convenient to define
for a countable valued random variable X the following quantity

d_X(x) ≜ Pr(X = x).    (A.4)

We use the same symbol d_X(x) because the quantities defined in eqs.(A.3),
(A.4) are, in a sense, analogous.
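The relation d_X(x) = (d/dx) F_X(x) can be checked numerically; a sketch using the unit-rate exponential distribution (F_X(x) = 1 - e^{-x} for x ≥ 0), an example of our own choosing:

```python
import math

def F(x):
    """Distribution function of the unit-rate exponential random variable."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def d(x):
    """Its density: the derivative of F, namely e^{-x} for x >= 0."""
    return math.exp(-x) if x >= 0 else 0.0

# A centred difference quotient of F approximates the density d at x.
x, h = 1.3, 1e-6
numeric_derivative = (F(x + h) - F(x - h)) / (2 * h)
```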

A.2.4 Expectation

Having defined a probability measure P, we can integrate random variables
X(ω) with respect to P. The mathematical expectation or mean of X is the
integral of X over all Ω.

Definition 8 The mathematical expectation of X(ω) is denoted by E(X) and
is defined to be

E(X) ≜ ∫_Ω X(ω) dP(ω).

It should be pointed out that in case Ω is countable, then the expectation reduces
to a sum:

E(X) = Σ_{ω∈Ω} X(ω) P(ω).
We also define the variance of a random variable X as follows.

Definition 9 The variance of X(ω) is denoted by Var(X) and is defined to
be

Var(X) ≜ ∫_Ω (X(ω) - E(X))^2 dP(ω) = E[(X - E(X))^2].
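For countable Ω the integrals reduce to sums, so E(X) and Var(X) can be computed by direct enumeration. A sketch with a fair six-sided die (our own example), using exact rational arithmetic:

```python
from fractions import Fraction

# Ω = {1, ..., 6}, P(ω) = 1/6 for each outcome, and X(ω) = ω.
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}

E = sum(w * P[w] for w in omega)                # E(X) = Σ X(ω) P(ω)
Var = sum((w - E) ** 2 * P[w] for w in omega)   # E[(X - E(X))^2]
```

For a fair die these sums give E(X) = 7/2 and Var(X) = 35/12 exactly.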

The Bounded Convergence Theorem is useful when it is necessary to interchange
the operations of taking expectations and taking limits.

Theorem A.5 (Bounded Convergence) If X_0, X_1, ... and X are random
variables and there is a constant M such that with probability one and for
n = 0, 1, ... we have |X_n| < M, then

lim_{n→∞} X_n = X ⇒ lim_{n→∞} E(X_n) = E(X).

Proof. It appears in Billingsley (p.214), in slightly more general form.

A.2.5 Conditioning

The reader is probably familiar with the notion of conditional probability, defined
as follows.

Definition 10 If A and B are events, then the conditional probability of A
given B is defined by

Pr(A|B) ≜ Pr(A ∩ B) / Pr(B).    (A.5)
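Eq.(A.5) can be checked by brute-force enumeration on a finite sample space. A sketch with two fair dice (our own example), taking A = "the sum is 7" and B = "the first die shows 3":

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes
A = {w for w in omega if sum(w) == 7}         # event: the sum equals 7
B = {w for w in omega if w[0] == 3}           # event: the first die shows 3

def Pr(E):
    """Probability of an event under the uniform measure on omega."""
    return Fraction(len(E), len(omega))

cond = Pr(A & B) / Pr(B)                      # Pr(A|B) = Pr(A ∩ B) / Pr(B)
```

Here Pr(A ∩ B) = 1/36 (only the outcome (3, 4)) and Pr(B) = 1/6, so Pr(A|B) = 1/6, which for this particular pair of events coincides with Pr(A).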

The following lemma will be useful in the proof of Theorems 11.2, 12.2.

Lemma A.6 Consider two sets A and B such that Pr(A) = a > 0 and
Pr(B) = 1. Then we have

Pr(B|A) = 1.

Proof. Since Pr(A) > 0, the conditional probability is defined by

Pr(B|A) = Pr(B ∩ A) / Pr(A).

It is clear that

(B ∩ A) ∪ (B^c ∩ A) = A,

and that

(B ∩ A) ∩ (B^c ∩ A) = ∅.

It then follows that

Pr(B ∩ A) = Pr(A) - Pr(B^c ∩ A).    (A.6)

But we also have that

B^c ∩ A ⊂ B^c ⇒ Pr(B^c ∩ A) ≤ Pr(B^c) = 0    (A.7)

(since Pr(B) = 1). Hence from eqs.(A.6) and (A.7) we have that

Pr(B ∩ A) = Pr(A) ⇒ Pr(B|A) = Pr(B ∩ A) / Pr(A) = 1,

and the proof of the lemma is complete. •


The definition of conditional probability can be generalized. Consider two
sigma fields 𝒢, F with 𝒢 ⊂ F. Take any event A ∈ F. The conditional
probability of A given 𝒢 is defined to be any random variable X(ω) which is
𝒢-measurable, integrable and for all G ∈ 𝒢 satisfies

∫_G X(ω) dP = P(A ∩ G).    (A.8)

We use the standard notation and denote the conditional probability of A not by
X but by Pr(A|𝒢).
This may look somewhat exotic, but it actually is a generalization of the
previous definition of conditional probability by eq.(A.5). To see that eq.(A.5)
is a special case of eq.(A.8), consider any G ∈ F and take 𝒢 = {∅, Ω, G, G^c}.
It is easy to check that 𝒢 is a sigma field. Now consider the following random
variable:

X(ω) = Pr(A ∩ G) / Pr(G)     if ω ∈ G,
X(ω) = Pr(A ∩ G^c) / Pr(G^c) if ω ∈ G^c.    (A.9)

It is easy to check that X(ω) as defined in eq.(A.9) satisfies eq.(A.8).


We need to introduce the more general definition of conditional probability
because we want to talk about the conditional probability of X given Y, where
X and Y are random variables. Given two measurable sets A and B, the events

{X ∈ A}, {Y ∈ B}

are F-measurable. Furthermore, we can define a sigma field 𝒢 to be 𝒢 =
{{Y ∈ B}, {Y ∈ B}^c, Ω, ∅}. Then it makes sense to talk about the conditional
probability of X given Y, denoted by Pr(X|Y), where we define

Pr(X ∈ A | Y ∈ B) = Pr(A|𝒢).

Now we can define independent random variables.

Definition 11 We say that the random variables X and Y are independent
if for all events A, B we have

Pr(X ∈ A, Y ∈ B) = Pr(X ∈ A) Pr(Y ∈ B).


We say that two random variables are dependent if they are not independent.

In particular, we can take the set A to be (-∞, x] and define a conditional
probability distribution function by

F_X(x | Y ∈ B) = Pr(X ≤ x | Y ∈ B).

This is a function of x and the set B. Now, if F_X(x | Y ∈ B) is differentiable
with respect to x, then we can also define the conditional probability density of
X:

d_X(x | Y ∈ B) ≜ (d/dx) F_X(x | Y ∈ B).

In particular, we will be interested in the case when the set B is {y}. Then
we have

F_X(x | Y = y) ≜ Pr(X ≤ x | Y = y),

which is a function of x and y. And, if F_X(x | Y = y) is differentiable with
respect to x, then we can also define the conditional probability density of X
given that Y = y, as

d_X(x | Y = y) ≜ (d/dx) F_X(x | Y = y).

We will sometimes use in place of d_X(x | Y = y) a shorter (and somewhat
abusive) notation, writing the conditional probability density of X, given Y = y,
as d_X(x|Y); the meaning should be clear from the context.

A.2.6 Sums of Random Variables

The Strong Law of Large Numbers describes the limiting behavior of the time
average of independent random variables.

Theorem A.7 (Strong Law of Large Numbers) If X_0, X_1, ... are independent
and identically distributed and E(|X_0|) < ∞, then

Pr( lim_{n→∞} (X_0 + X_1 + ... + X_{n-1}) / n = E(X_0) ) = 1.

Proof. It appears in (Billingsley, pp.290-292).
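The Strong Law can be observed numerically; a sketch (our own, seeded for reproducibility) averaging i.i.d. Uniform[0, 1] draws, whose expectation is 1/2:

```python
import random

rng = random.Random(42)
n = 100_000
draws = [rng.random() for _ in range(n)]   # i.i.d. Uniform[0, 1] samples

# The time average (X_0 + X_1 + ... + X_{n-1}) / n
sample_mean = sum(draws) / n

# By the Strong Law this converges to E(X_0) = 0.5 with probability one.
```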


The Ergodic Theorem describes the limiting behavior of the time average of
dependent random variables.

Theorem A.8 (Ergodic) If X_0, X_1, ... is a stationary and ergodic stochastic
process and E(|X_0|) < ∞, then

Pr( lim_{n→∞} (X_0 + X_1 + ... + X_{n-1}) / n = E(X_0) ) = 1,

and, for any measurable function f(·, ·, ...) such that E(|f(X_0, X_1, ...)|) < ∞,
we also have

Pr( lim_{n→∞} (f(X_0, X_1, ...) + ... + f(X_{n-1}, X_n, ...)) / n = E(f(X_0, X_1, ...)) ) = 1.

Proof. See (Karlin and Taylor, 1975, pp.487-488).

A.3 SEQUENCES OF BERNOULLI TRIALS

Definition 12 A stochastic process X_0, X_1, X_2, ... is called a sequence of
Bernoulli trials if

1. each random variable X_t, t = 0, 1, 2, ... is independent of the remaining ones;

2. each random variable X_t, t = 0, 1, 2, ... can take only two possible values, call
   them a and c;

3. the random variables X_t, t = 0, 1, 2, ... have identical probability distribution

   Pr(X_t = a) = α and Pr(X_t = c) = γ.    (A.10)

We are particularly interested in the behavior of the average

(X_0 + X_1 + ... + X_{n-1}) / n

when X_0, X_1, X_2, ... is a sequence of Bernoulli trials and a = 0, c = 1. In
this case, it is easy to show that for any n we have

E( (X_0 + X_1 + ... + X_{n-1}) / n ) = γ

and

Var( X_0 + X_1 + ... + X_{n-1} ) = nαγ.

Hence, rather than studying the behavior of (X_0 + X_1 + ... + X_{n-1}) / n, we will
consider instead the normalized average

(X_0 + X_1 + ... + X_{n-1} - nγ) / √(nαγ),

which has expectation equal to zero and variance equal to one. For this normalized
average we can prove the following.

Theorem A.9 If X_0, X_1, ... is a sequence of Bernoulli trials with Pr(X_t =
0) = α and Pr(X_t = 1) = γ, and δ is any number greater than one, then we
have

Pr( |X_0 + X_1 + ... + X_{t-1} - tγ| / √(tαγ) ≥ √(2δ log(t)) ) < 1/t^δ.

Proof. The proof appears in (Feller, 1968, p.175 and p.203).
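The mean and variance formulas above can be verified exactly for finite n by enumerating the binomial distribution of the sum S_n = X_0 + ... + X_{n-1}; a sketch with n = 10 and γ = 0.3 (values chosen by us for illustration):

```python
from math import comb

n, gamma = 10, 0.3
alpha = 1 - gamma

# pmf of S_n = X_0 + ... + X_{n-1} for 0/1 Bernoulli trials with Pr(X_t = 1) = gamma
pmf = {k: comb(n, k) * gamma**k * alpha**(n - k) for k in range(n + 1)}

mean = sum(k * p for k, p in pmf.items())               # should equal n * gamma
var = sum((k - mean) ** 2 * p for k, p in pmf.items())  # should equal n * alpha * gamma
```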

A.4 MARKOV CHAINS

Markov chains are an important class of stochastic processes. We limit ourselves
to Markov chains which take values in discrete time and countable state space.
More specifically we have the following.

Definition 13 A stochastic process X_0, X_1, X_2, ..., taking values in a countable
(finite or infinite) set Θ, is said to be a Markov Chain if its conditional probabilities
satisfy the following relationship for t = 0, 1, 2, ... and all θ_i, θ_j, θ_k, ... ∈ Θ:

Pr(X_t = θ_i | X_{t-1} = θ_j, X_{t-2} = θ_k, ...) = Pr(X_t = θ_i | X_{t-1} = θ_j).

In this case we can define the transition probability matrix as follows.

Definition 14 The transition probability matrix P of the Markov chain X_0,
X_1, X_2, ... has elements P_ij, i, j ∈ Θ, defined by P_ij ≜ Pr(X_t = θ_i | X_{t-1} = θ_j).
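A transition probability matrix in this convention (columns indexed by the current state θ_j, rows by the next state θ_i, so each column sums to one) can be iterated to find long-run state probabilities. A sketch with a made-up two-state chain (our example, not from the book):

```python
# P[i][j] = Pr(X_t = θ_i | X_{t-1} = θ_j); each COLUMN sums to one.
P = [[0.9, 0.2],
     [0.1, 0.8]]

def step(dist):
    """Propagate a distribution over states one step: dist' = P dist."""
    return [sum(P[i][j] * dist[j] for j in range(2)) for i in range(2)]

dist = [1.0, 0.0]      # start in state θ_0 with probability one
for _ in range(200):   # iterate until (numerically) stationary
    dist = step(dist)

# The limit satisfies π = P π; for this P it is π = (2/3, 1/3).
```

For this chain both entries of P's off-diagonal are positive, so it is irreducible in the sense of Definition 15 below.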

Definition 15 Consider a Markov chain X_0, X_1, X_2, ..., taking values in the
countable set Θ. The Markov chain is called irreducible iff for all θ_i, θ_j ∈ Θ
there is some t ∈ N such that we have

Pr(X_t = θ_i | X_0 = θ_j) > 0.

Thus a Markov chain is irreducible if there is a positive probability of moving
from any state to any other state. Irreducible Markov chains can be divided into
two categories: persistent and transient ones. We first must define persistent
and transient states.

Definition 16 Consider a Markov chain X_0, X_1, X_2, ..., taking values in
the countable set Θ. The state θ_i ∈ Θ is called persistent iff

Pr(X_t = θ_i for some t | X_0 = θ_i) = 1.

Definition 17 Consider a Markov chain X_0, X_1, X_2, ..., taking values in
the countable set Θ. The state θ_i ∈ Θ is called transient iff

Pr(X_t = θ_i for some t | X_0 = θ_i) < 1.

Clearly a state is either transient or persistent. The following theorem
characterizes transience and persistence of irreducible Markov chains.

Theorem A.10 Consider a Markov chain X_0, X_1, X_2, ..., taking values in
Θ = {..., θ_{-1}, θ_0, θ_1, ...}. If X_t is irreducible, either all states are transient or
all states are persistent. In the first case we have

Pr(X_t = θ_m i.o. | X_0 = θ_i) = 0 for all i, m in Z,

Σ_{m=-∞}^{∞} Pr(X_t = θ_m i.o. | X_0 = θ_i) = 0 for all i in Z.

Proof. The theorem is stated and proved (in a somewhat more general form)
in (Billingsley, 1986). •
In view of the above theorem we can characterize the entire Markov chain
(provided it is irreducible) as persistent or transient.
References

Ackerson, G. and Fu, K. (1970). On state estimation in switching environments.


IEEE Trans. on Automatic Control, 15:10-17.
Alpaydin, E. (1990). Grow-and-learn: An incremental method for category
learning. In Proc. Int. Neural Network Conf., pages 761-764.
Alpaydin, E. and Jordan, M. I. (1996). Local linear perceptrons for classifica-
tion. IEEE Trans. on Neural Networks, 7:788-792.
Anand, R., Mehrotra, K., Mohan, C. K., and Ranka, S. (1995). Efficient classifi-
cation for multiclass problems using modular neural networks. IEEE Trans.
on Neural Networks, 6:117-124.
Anderson, P. (1985). Adaptive forgetting in recursive identification through
multiple models. Int. Journal of Control, 42:1175-1193.
Angeline, P. J., Saunders, G. M., and Pollack, J. B. (1994). An evolutionary
algorithm that constructs recurrent neural networks. IEEE Trans. on Neural
Networks, 5:54-65.
Armstrong, J. (1989). Combining forecasts: the end of the beginning or the
beginning of the end? Int. Journal of Forecasting, 5:585-588.
Arrowsmith, D. and Place, C. M. (1990). An Introduction to Dynamical Systems.
Cambridge University Press.
Ash, T. (1989). Dynamic node creation in backpropagation networks. Connec-
tion Science, 1:365-375.
Athans, M. et al. (1977). The stochastic control of the F8-C aircraft using a
multiple model adaptive control method - part I: equilibrium flight. IEEE
Trans. on Automatic Control, 22:768-780.
Atkeson, A., An, C. H., and Hollerbach, J. N. (1986). Estimation of inertial
parameters of manipulator links and loads. Int. Journal of Robotics Res.,
5:101-119.
Aussem, A. and Murtagh, F. (1997). Combining neural network forecasts on
wavelet transformed time series. Connection Science, 9:113-121.
Ayesa, E. et al. (1993). Evaluation of sensitivity and observability of the state
vector for system identification and experimental design. Water Science and
Technology, 28:209-218.

Azimi-Sadjadi, M. R., Sheedvash, S., and Trujillo, F. O. (1993). Recursive dy-


namic node creation in multilayer neural networks. IEEE Trans. on Neural
Networks, 4:242-256.
Backer, E. and Jain, A. (1981). A clustering performance measure based on
fuzzy set decomposition. IEEE Trans. on Pattern Analysis and Machine
Intelligence, 3:66-75.
Baffes, P. T. and Zelle, J. M. (1992). Growing layers of perceptrons: Introducing
the Extentron algorithm. In Proc. Int. Joint Conf. on Neural Networks, pages
392-397.
Baldi, P. and Chauvin, Y. (1996). Hybrid modeling, HMM/NN architectures,
and protein applications. Neural Computation, 8:1541-1565.
Banan, M. R. and Hjelmstad, K. D. (1992). Self-organization of architecture
by simulated hierarchical adaptive random partitioning. In Proc. Int. Joint
Conf. on Neural Networks, pages 823-828.
Bartfai, G. and White, R. (1997). Adaptive resonance theory based modular
networks for incremental learning of hierarchical clusters. Connection Sci-
ence, 9:87-112.
Bartlett, E. B. (1994). Dynamic node architecture learning: An information
theoretic approach. Neural Networks, 7:129-140.
Bates, J. and Granger, C. (1969). The combination of forecasts. Operational
Research Quarterly, 20:451-467.
Baum, E. B. and Lang, K. J. (1991). Constructing hidden units using examples
and queries. In Advances in Neural Information Processing Systems 3, pages
904-910.
Baum, L. (1972). An inequality and associated maximization technique in sta-
tistical estimation of probabilistic functions of a Markov process. Inequalities,
3:1-8.
Baum, L. et al. (1970). A maximization technique occurring in the statistical
analysis of probabilistic functions of Markov chains. Annals of Mathematical
Statistics, 41:164-171.
Bellido, I. and Fernandez, G. (1991). Backpropagation growing networks: To-
wards local minima elimination. In Artificial Neural Networks, Proc. IWANN,
pages 130-135.
Benaim, M. (1994). On functional approximation with normalised gaussian
units. Neural Computation, 6:319-333.
Benediktsson, J. A., Sveinsson, J. R., and Ersoy, O. K. (1997). Parallel consen-
sual neural networks. IEEE Trans. on Neural Networks, 8:54-64.
Bengio, Y. (1996). Markovian models for sequential data. Tech. report, Dept.
Informatique et Recherche Operationnelle, Universite de Montreal.
Bengio, Y., de Mori, R., Flammia, G., and Kompe, R. (1992). Global opti-
mization of a neural network-hidden Markov model hybrid. IEEE Trans. on
Neural Networks, 3:252-259.
Bengio, S., Fessant, F., and Collobert, D. (1996). Use of modular architectures
for time-series prediction. Neural Processing Letters, 3:101-106.
Bengio, Y. and Frasconi, P. (1995). An input/output HMM architecture. In
Advances in Neural Information Processing Systems 7, pages 427-434.
Bengio, Y. and Frasconi, P. (1996). Input-output HMM's for sequence process-
ing. IEEE Trans. on Neural Networks, 7:1231-1249.
Bengio, Y., LeCun, Y., Nohl, C., and Burges, C. (1995). Lerec: A NN/HMM hy-
brid for on-line handwriting recognition. Neural Computation, 7:1289-1303.
Bennani, Y. (1995). A modular and hybrid connectionist system for speaker
identification. Neural Computation, 7:791-798.
Bezdek, J., Coray, C., Gunderson, R., and Watson, J. (1981a). Detection and
characterization of cluster substructure I. Linear structure: fuzzy c-lines.
SIAM J. of Applied Mathematics, 40:339-352.
Bezdek, J., Coray, C., Gunderson, R., and Watson, J. (1981b). Detection and
characterization of cluster substructure II. Fuzzy c-varieties and complex
combinations thereof. SIAM J. of Applied Mathematics, 40:352-372.
Bezdek, J. and Harris, J. (1978). Fuzzy partitions and relations: an axiomatic
basis for clustering. Fuzzy Sets and Systems, 1:111-127.
Billingsley, P. (1986). Probability and Measure. Wiley.
Bishop, C. (1991). Improving the generalization properties of radial basis func-
tion neural networks. Neural Computation, 3:579-588.
Blackwell, D. and Koopmans, L. (1957). On the identifiability problem for
functions of finite Markov chains. Ann. of Math. Stat., 28:1011-1015.
Blom, H. and Bar-Shalom, Y. (1988). The interacting multiple model algo-
rithm for systems with Markovian switching coefficients. IEEE Trans. on
Automatic Control, 33:780-783.
Blonda, P., Pasquariello, G., and Smith, J. (1993). Comparison of backpropa-
gation, Cascade-Correlation and Kohonen algorithms for cloud retrieval. In
Proc. Int. Joint Conf. on Neural Networks, pages 1231-1234.
Bodenhausen, U. and Waibel, A. (1993). Application oriented automatic struc-
turing of time-delay neural networks for high performance character and
speech recognition. In Proc. Int. Joint Conf. on Neural Networks.
Bonnlander, B. V. and Mozer, M. C. (1992). Metamorphosis networks: An alter-
native to constructive methods. In Advances in Neural Information Process-
ing Systems 4, pages 131-138.
Bors, A. G. and Pitas, I. (1996). Median radial basis function neural network.
IEEE Trans. on Neural Networks, 7:1351-1364.
Bottou, L. and Bengio, Y. (1995). Convergence properties of the k-means al-
gorithm. In Advances in Neural Information Processing Systems 7, pages
585-592.
Bourlard, H. and Morgan, N. (1990). A continuous speech recognition system
embedding MLP into HMM. In Advances in Neural Information Processing
Systems 2, pages 186-193.
Bourlard, H. and Morgan, N. (1991). Merging multilayer perceptrons and hid-
den Markov models: Some experiments in continuous speech recognition. In
Neural Networks: Advances and Applications 3, pages 215-239.
Bourlard, H. and Wellekens, C. J. (1990). Links between Markov models and
multilayer perceptrons. IEEE Trans. on Pattern Analysis and Machine In-
telligence, 12:1167-1178.
Breiman, L. (1968). Probability. Addison-Wesley.
Breiman, L. (1996a). Bagging predictors. Machine Learning, 24:123-140.
Breiman, L. (1996b). Stacked regressions. Machine Learning, 24:49-64.
Breiman, L. (1996c). Technical note: some properties of splitting criteria. Ma-
chine Learning, 24:41-47.
Breiman, L., Friedman, J. H., Olshen, R., and Stone, C. (1984). Classification
and Regression Trees. Wadsworth.
Brent, R. P. (1991). Fast training algorithms for multilayer neural nets. IEEE
Trans. on Neural Networks, 2:346-354.
Breslow, L. A. and Aha, D. W. (1997). Simplifying decision trees: a survey.
Knowledge Engineering Review, 12:1-40.
Brodley, C. E. and Utgoff, P. E. (1995). Multivariate decision trees. Machine
Learning, 19:45-77.
Bucy, R. (1969). Bayes theorem and digital realizations for non-linear filters.
Journal of Astronautical Sciences, 17:80-94.
Bucy, R. and Senne, K. (1971). Digital synthesis of nonlinear filters. Automatica,
7:287-298.
Bunn, D. (1975). A Bayesian approach to the linear combination of forecasts.
Operational Research Quarterly, 26:325-329.
Buntine, W. (1994a). Learning classification trees. In Artificial Intelligence
Frontiers in Statistics. Chapman and Hall.
Buntine, W. (1994b). Operations for learning using graphical models. Journal
of Artificial Intelligence Research.
Buntine, W. (1995). Graphical models for discovering knowledge.
Buntine, W. (1996). A guide to the literature on learning probabilistic networks
from data. IEEE Trans. on Knowledge and Data Engineering, 8:195-211.
Cacciatore, T. W. and Nowlan, S. J. (1994). Mixtures of controllers for jump
linear and nonlinear plants. In Advances in Neural Information Processing
Systems, pages 719-726.
Caputi, M. and Moose, M. (1993). A modified gaussian sum approach to es-
timation of nongaussian signals. IEEE Trans. on Aerospace and Electronic
Systems, 29:446-451.
Caputi, M. and Moose, M. (1995). A necessary condition for effective perfor-
mance of the multiple model adaptive estimator. IEEE Trans. on Aerospace
and Electronic Systems, 31:1132-1138.
Carpenter, G. A. and Grossberg, S. (1990). ART 3: Hierarchical search using
chemical transmitters in self-organizing pattern-recognition architectures.
Neural Networks, 3:129-152.
Carpenter, G. A., Grossberg, S., and Reynolds, J. (1991). ARTMAP: A self-
organizing neural network architecture for fast supervised learning and pat-
tern recognition. In Proc. Int. Joint Conf. on Neural Networks, pages 863-
868.
REFERENCES 287

Carpenter, G. A., Grossberg, S., and Rosen, D. (1991). ART 2-A: An adap-
tive resonance algorithm for rapid category learning and recognition. Neural
Networks, 4:493-504.
Castellano, G., Fanelli, A. M., and Pelillo, M. (1997). An iterative pruning
algorithm for feedforward neural networks. IEEE Trans. on Neural Networks,
8:519-531.
Catfolis, T. and Meert, K. (1997). Hybridization and specialization of real time
recurrent learning based networks. Connection Science, 9:51-69.
Chaer, W. S., Bishop, R. H., and Ghosh, J. (1997). A mixture-of-experts frame-
work for adaptive Kalman filtering. IEEE Trans. on Systems, Man and Cy-
bernetics, Part B, 27:452-464.
Chaudhuri, P., Huang, M. C., Loh, W. Y., and Yao, R. (1994). Piecewise-
polynomial regression trees. Statistica Sinica, 4:143-167.
Chaudhuri, P., Lo, W. D., Loh, W. Y., and Yang, C. C. (1995). Generalized
regression trees. Statistica Sinica, 5:641-666.
Chee, P. L. and Harrison, R. F. (1997). An incremental adaptive network for
on-line supervised learning and probability estimation. Neural Networks,
10:925-939.
Chen, K., Wang, L., and Chi, H. (1997). Methods of combining multiple classi-
fiers with different features and their applications to text-independent speaker
identification. International Journal of Pattern Recognition and Artificial In-
telligence, 11:417-445.
Chen, S., Cowan, C. F. N., and Grant, P. M. (1991). Orthogonal least squares
learning algorithm for radial basis function networks. IEEE Trans. on Neural
Networks, 2:302-309.
Chen, S., Yu, D., and Moghaddamjo, A. (1992). Weather sensitive short-term
load forecasting using nonfully connected artificial neural network. IEEE
Trans. on Power Systems, 7:1098-1105.
Chen, T. and Chen, H. (1995). Approximation capability to functions of several
variables nonlinear functionals and operators by radial basis function neural
networks. IEEE Trans. on Neural Networks, 6:904-910.
Cheng, E. S., Chen, S., and Mulgrew, B. (1996). Gradient radial basis function
networks for nonlinear and nonstationary time series prediction. IEEE Trans.
on Neural Networks, 7:190-194.
Cheng, W., Fadlalla, A., and Lin, C.-H. (1996). Improve forecasting perfor-
mance of neural networks through the use of a combined model. In World
Congress on Neural Networks, pages 447-450.
Chiang, C. C. and Fu, H. C. (1994). A divide-and-conquer methodology for
modular supervised neural network design. In Int. Conf. on Neural Networks,
pages 119-124.
Chinrunggrueng, C. and Sequin, C. H. (1995). Optimal k-means algorithm
with dynamic adjustment of learning rate. IEEE Trans. on Neural Networks,
6:157-169.
Chizeck, H., Willsky, A., and Castanon, D. (1986). Discrete time Markovian
jump linear quadratic optimal control. Int. Journal of Control, 45:213-231.
Cho, S.-B. (1997). Neural-network classifiers for recognizing totally uncon-
strained handwritten numerals. IEEE Trans. on Neural Networks, 8:43-53.
Cho, S.-B. and Kim, J. H. (1995). Multiple network fusion using fuzzy logic.
IEEE Trans. on Neural Networks, 6:497-501.
Choi, C.-H. and Choi, J. Y. (1994). Constructive neural networks with piecewise
interpolation capabilities for function approximation. IEEE Trans. on Neural
Networks, 5:936-944.
Choi, D.-I. and Park, S.-H. (1994). Self-creating and organizing neural networks.
IEEE Trans. on Neural Networks, 5:561-575.
Chou, P. (1991). Optimal partitioning for classification and regression trees.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 13:340-354.
Christiaanse, W. R. (1971). Short term load forecasting using general expo-
nential smoothing. IEEE Trans. on Power Systems, 90:900-911.
Chuanyi, J. and Psaltis, D. (1997). Network synthesis through data-driven
growth and decay. Neural Networks, 10:1133-1141.
Clemen, R. (1989). Combining forecasts: a review and annotated bibliography.
Int. Journal of Forecasting, 5:559-583.
Cleveland, W., Devlin, S., and Grosse, E. (1988). Regression by local fitting.
Journal of Econometrics, 37:87-114.
Connor, J. T., Martin, R. D., and Atlas, L. E. (1994). Recurrent neural networks
and robust time series prediction. IEEE Trans. on Neural Networks, 5:240-
254.
Cottrell, M. and Fort, J. (1986). Stochastic model of retinotopy: self organizing
process. Biol. Cybernetics, 53:405-411.
Courrieu, P. (1993). A convergent generator of neural networks. Neural Net-
works, 6:835-844.
Cover, T. (1968). Estimation by the nearest neighbor rule. IEEE Trans. on
Information Theory, 14:50-55.
Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE
Trans. on Information Theory, 13:21-27.
Dasarathy, B. (1994). Decision Fusion. IEEE CS Society Press.
Davila, C., Abaye, A., and Khotanzad, A. (1994). Estimation of single sweep
steady-state visual potential by adaptive line enhancement. IEEE Trans. on
Biomedical Engineering, 41:197-200.
Dayan, P. and Hinton, G. E. (1993). Feudal reinforcement learning. In Advances
in Neural Information Processing Systems, pages 271-278.
Dayan, P. and Zemel, R. S. (1995). Competition and multiple cause models.
Neural Computation, 7:565-579.
Deffuant, G. (1990). Neural units recruitment algorithm for generation of de-
cision trees. In Proc. Int. Joint Conf. on Neural Networks, pages 637-642.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood
from incomplete data via the EM algorithm. J. of the Roy. Stat. Soc. B, 39:1-38.
Deng, L. (1992). A generalized hidden Markov model with state-conditioned
trend functions of time for the speech signal. Signal Processing, 27:65-78.
Deng, L. et al. (1991). Phonemic hidden Markov models with continuous mix-
ture output densities for large vocabulary word recognition. IEEE Trans. on
Signal Processing, 39:1677-1681.
Deng, L. et al. (1992). Modelling acoustic transitions in speech by state-inter-
polation hidden Markov models. IEEE Trans. on Signal Processing, 40:265-
272.
Dersch, D. R. and Tavan, P. (1995). Asymptotic level density in topological
feature maps. IEEE Trans. on Neural Networks, 6:230-236.
Deutsch, M., Granger, C., and Terasvirta, T. (1994). The combination of fore-
casts using changing weights. Int. Journal of Forecasting, 10:47-57.
Dharmadhikari, S. (1963). Functions of finite Markov chains. Ann. of Math. Stat.,
34:1022-1032.
Dickinson, J. (1973). Some statistical results in the combination of forecasts.
Operational Research Quarterly, 24:252-260.
Dickinson, J. (1975). Some comments on the combination of forecasts. Opera-
tional Research Quarterly, 26:205-210.
Doob, J. (1953). Stochastic Processes. Wiley.
Drabe, T., Bressgott, W., and Bartscht, E. (1996). Genetic task clustering
for modular neural networks. In Proc. of International Workshop on Neural
Networks for Identification, Control, Robotics, and Signal/Image Processing,
NICROSP, pages 339-347.
Drucker, H. and Cortes, C. (1996). Boosting decision trees. In Advances in
Neural Information Processing Systems 8, pages 479-485.
Drucker, H. et al. (1994). Boosting and other ensemble methods. Neural Com-
putation, 6:1289-1301.
Drucker, H., Schapire, R., and Simard, P. (1993a). Boosting performance in
neural networks. International Journal of Pattern Recognition and Artificial
Intelligence, pages 61-76.
Drucker, H., Schapire, R., and Simard, P. (1993b). Improving performance in
neural networks using a boosting algorithm. In Advances in Neural Infor-
mation Processing Systems 5, pages 42-49.
Duda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis. Wiley.
Dufour, F. and Bertrand, P. (1994). The filtering problem for continuous-time
linear systems with Markovian switching coefficients. System and Control
Letters, 23:453-461.
Dumitras, A. et al. (1994). A quantitative study of evoked potential estimation
using a feedforward neural network. In IEEE Workshop: Neural Networks
for Signal Processing, pages 606-613.
Dunn, J. (1973). A fuzzy relative of the ISODATA process and its use in de-
tecting compact well-separated clusters. Journal of Cybernetics, 3:32-57.
Ehrenfeucht, A. and Haussler, D. (1989). Learning decision trees from random
examples. Information and Computation, 82:231-246.
Elanyar, S. (1994). Radial basis function neural network for approximation and
estimation of nonlinear dynamic systems. IEEE Trans. on Neural Networks,
5:584-593.
Elliott, R., Aggoun, L., and Moore, J. (1995). Hidden Markov Models. Springer.
Estevez, P. A. and Nakano, R. (1995). Hierarchical mixture of experts and
max-min neural networks. In Int. Conf. on Neural Networks, pages 651-656.
Fabri, S. and Kadirkamanathan, V. (1996). Dynamic structure neural networks
for stable adaptive control of nonlinear systems. IEEE Trans. on Neural
Networks, 7:1151-1167.
Fahlman, S. E. (1991). The recurrent cascade correlation architecture. Tech.
report. CMU-CS-91-100.
Fahlman, S. E. and Lebiere, C. (1990a). The Cascade-Correlation learning ar-
chitecture. In Advances in Neural Information Processing Systems 2, pages
524-532.
Fahlman, S. E. and Lebiere, C. (1990b). The Cascade-Correlation learning ar-
chitecture. Tech. Report, ftp://archive.cis.ohio-state.edu/pub/neuroprose/
fahlman.cascor-tr.ps.Z.
Farlow, S. J. (1984). Self-Organizing Methods in Modelling: GMDH Type Al-
gorithms. Marcel Dekker.
Farmer, J. and Sidorowich, J. (1988). Exploiting chaos to predict the future
and reduce noise. Tech. report, Number LA-UR-88-901, Center for Nonlinear
Studies, Los Alamos National Laboratory.
Feller, W. (1968). An Introduction to Probability Theory and its Applications.
Wiley.
Fessant, F., Bengio, S., and Collobert, D. (1996). Use of modular architectures
for time series prediction. Neural Processing Letters, 3:101-106.
Fiesler, E. (1994). Comparative bibliography of ontogenic neural networks. In
International Conf. on Artificial Neural Networks (ICANN).
Finnoff, W., Hergert, F., and Zimmermann, H. G. (1992). A comparison of
weight elimination methods for reducing complexity in neural networks. In
Proc. Int. Joint Conf. on Neural Networks, pages 980-987.
Finnson, A. (1993). Simulation of a strategy to start up nitrification at Bromma
sewage plant using a model based on the IAWPRC no. 1 model. Water Sci-
ence and Technology, 28:230-237.
Foss, B., Johansen, T., and Sorensen, A. (1995). Nonlinear predictive control
using local models applied to a batch fermentation process. Control Engi-
neering Practice, 3:389-396.
Frattale, F. M. M. and Martinelli, G. (1995). A constructive algorithm for bi-
nary neural networks: The oil-spot algorithm. IEEE Trans. on Neural Net-
works, 6:794-797.
Fraser, A. M. and Dimitriadis, A. (1994). Forecasting probability densities by using hidden
Markov models with mixed states. In Forecasting the Future and Understand-
ing the Past. Editors A. Weigend and N. Gershenfeld. Addison-Wesley.
Frean, M. (1990). The Upstart algorithm: A method for constructing and train-
ing feed-forward neural networks. Neural Computation, 2:198-209.
Freeman, J. S. and Saad, D. (1995). Learning and generalization in radial basis
function networks. Neural Computation, 7:1000-1020.
Freund, Y., Schapire, R. E., Singer, Y., and Warmuth, M. K. (1997). Using and
combining predictors that specialize. In 29th Annual ACM Symposium on
Theory of Computing, pages 334-343.
Friedman, J. H. (1979). A tree structured approach to nonparametric multi-
ple regression. In Smoothing techniques for curve estimation, pages 5-22.
Springer.
Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of
Statistics, 19:1-67.
Fritsch, J. (1996). Modular Neural Networks for Speech Recognition. Tech.
Report CMU-CS-96-203. Department of Computer Science, Carnegie Mellon
Univ.
Fritsch, J., Finke, M., and Waibel, A. (1997a). Adaptively growing hierarchical
mixtures of experts. In Advances in Neural Information Processing Systems
9.
Fritsch, J., Finke, M., and Waibel, A. (1997b). Context-dependent hybrid HME
/ HMM speech recognition using polyphone clustering decision trees. In Int.
Conf. on Acoustics, Speech and Signal Processing.
Fritzke, B. (1991). Unsupervised clustering with growing cell structures. In Int.
Joint Conf. on Neural Networks, pages 531-536.
Fritzke, B. (1994a). Growing cell structures - A self-organizing network for
unsupervised and supervised learning. Neural Networks, 7:1441-1460.
Fritzke, B. (1994b). Supervised learning with growing cell structures. In Ad-
vances in Neural Information Processing Systems 6, pages 255-262.
Fun, M. H. and Hagan, M. T. (1996). Levenberg-Marquardt training for mod-
ular networks. In Int. Conf. on Neural Networks, pages 468-473.
Funahashi, K. (1989). On the approximate realization of continuous mappings
by neural networks. Neural Networks, 2:183-192.
Fung, K. et al. (1996). Visual evoked potential enhancement by an artificial
neural network filter. Biomedical Mater. Engin., 6:1-10.
Gallant, S. I. (1986). Three constructive algorithms for network learning. In
Proc. of the 8th Annual Conf. of the Cognitive Science Society, pages 652-
660.
Gath, I. and Geva, A. (1989). Unsupervised optimal fuzzy clustering. IEEE
Trans. on Pattern Analysis and Machine Intelligence, 11:773-781.
Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the
bias/variance dilemma. Neural Computation, 4:1-58.
Georgiopoulos, M., Heileman, G. L., and Huang, J. (1990). Convergence prop-
erties of learning in ART1. Neural Computation, 2:502-509.
Ghahramani, Z. (1997). Factorial learning and the EM algorithm. In Advances
in Neural Information Processing Systems 9, pages 472-478.
Ghahramani, Z. (1998). Learning Dynamic Bayesian Networks. Springer.
Ghahramani, Z. and Jordan, M. I. (1994). Supervised learning from incomplete
data via an EM approach. In Advances in Neural Information Processing
Systems 6, pages 120-127.
Ghahramani, Z. and Jordan, M. I. (1997). Factorial hidden Markov models.
Machine Learning, 29:245-259.
Giles, C. L., Chen, D., Sun, G.-Z., Chen, H.-H., Lee, Y.-C., and Goudreau,
M. W. (1995). Constructive learning of recurrent neural networks: Limita-
tions of recurrent cascade correlation and a simple solution. IEEE Trans. on
Neural Networks, 6:829-836.
Ginzburg, I. and Horn, D. (1994). Combined neural networks for time series
analysis. In Advances in Neural Information Processing Systems 6, pages
224-231.
Goldfeld, S. and Quandt, R. (1973). A Markov model for switching regressions.
Journal of Econometrics, 1:3-16.
Gonzalez, A., Grana, M., and D'Anjou, A. (1995). An analysis of the GLVQ
algorithm. IEEE Trans. on Neural Networks, 6:1012-1016.
Gorinevsky, D. (1995). On the persistence of excitation in radial basis function
network identification of nonlinear system. IEEE Trans. on Neural Networks,
6:1237-1244.
Goutte, C. and Hansen, L. K. (1997). Regularization with a pruning prior.
Neural Networks, 10:1053-1059.
Greene, C. and Willsky, A. (1980). An analysis of the multiple model adaptive
control algorithm. In Proc. of the IEEE Conference on Decision and Control,
pages 1142-1145.
Gross, G. and Galiana, F. (1987). Short term load forecasting. Proc. of the
IEEE, 75:1558-1573.
Halliday, A. and Kriss, A. (1976). The pattern evoked potentials in compression
of the anterior visual pathways. Brain, 99:357-374.
Hamilton, J. D. (1988). Rational-expectations econometric-analysis of changes
in regime - an investigation of the term structure of interest-rates. Journal
of Economic Dynamics and Control, 12:385-423.
Hamilton, J. D. (1989). A new approach to the economic-analysis of nonsta-
tionary time-series and the business-cycle. Econometrica, 57:357-384.
Hamilton, J. D. (1990). Analysis of time-series subject to changes in regime.
Journal of Econometrics, 45:39-70.
Hamilton, J. D. (1991). A quasi-Bayesian approach to estimating parameters
for mixtures of normal-distributions. Journal of Business and Economic Sta-
tistics, 9:27-39.
Hamilton, J. D. (1996). Specification testing in Markov-switching time-series
models. Journal of Econometrics, 70:127-157.
Hamilton, J. D. and Lin, G. (1996). Stock-market volatility and the business-
cycle. Journal of Applied Econometrics, 11:573-593.
Hamilton, J. D. and Susmel, R. (1994). Autoregressive conditional heteroskedas-
ticity and changes in regime. Journal of Econometrics, 64:307-333.
Hancock, J. and Lainiotis, D. (1965). On learning and distribution free coin-
cidence detection procedures. IEEE Trans. on Information Theory, 11:272-
280.
Hansen, J. V. and Nelson, R. D. (1997). Neural networks and traditional time
series methods: A synergistic combination in state economic forecasts. IEEE
Trans. on Neural Networks, 8:863-873.
Hanson, S. J. (1990). Meiosis networks. In Advances in Neural Information
Processing Systems 2, pages 533-541.
Happel, B. L. M. and Murre, J. M. J. (1994). Design and evolution of modular
neural-network architectures. Neural Networks, 7:985-1004.
Hartman, E. and Keeler, J. (1991). Predicting the future: advantages of semi-
local units. Neural Computation, 3:566-578.
Hartman, E., Keeler, J. D., and Kowalski, J. M. (1990). Layered neural networks
with Gaussian hidden units as universal approximations. Neural Computa-
tion, 2:210-215.
Hashem, S. (1996). Effects of collinearity on combining neural networks. Con-
nection Science, 8:315-336.
Hashem, S. (1997). Optimal linear combinations of neural networks. Neural
Networks, 10:599-614.
Hashem, S. and Schmeiser, B. (1995). Improving model accuracy using opti-
mal linear-combinations of trained neural networks. IEEE Trans. on Neural
Networks, 6:792-794.
Hassibi, B. and Stork, D. G. (1993). Second order derivatives for network prun-
ing: Optimal brain surgeon. In Advances in Neural Information Processing
Systems 4, pages 164-171.
Hassibi, B., Stork, D. G., Wolff, G., and Watanabe, T. (1994). Optimal Brain
Surgeon: Extensions and performance comparisons. In Advances in Neural
Information Processing Systems 6, pages 263-270.
Hathaway, R. and Bezdek, J. (1993). Switching regression models and fuzzy
clustering. IEEE Trans. on Fuzzy Systems, 1:195-204.
Haussler, D., Kivinen, J., and Warmuth, M. (1996). Tight worst-case lower
bounds for predicting with expert advice. Lecture Notes in Computer Sci-
ence, 904:69-80.
Haykin, S. (1994). Neural Networks: A Comprehensive Foundation. Macmillan.
Haykin, S. and Deng, C. (1991). Classification of radar clutter using neural
networks. IEEE Trans. on Neural Networks, 2:589-600.
Heiss, M. and Kampl, S. (1996). Multiplication-free radial basis function net-
work. IEEE Trans. on Neural Networks, 7:1461-1464.
Helmbold, D. and Schapire, R. (1995). Predicting nearly as well as the best
pruning of a decision tree. In Proceedings of the 8th Annual Conference on
Computational Learning Theory (COLT'95), pages 61-68. ACM Press.
Helmick, R., Blair, W., and Hoffman, S. (1996). One-step fixed-lag smoothers
for Markovian switching systems. IEEE Trans. on Automatic Control,
41:1051-1056.
Henderson, T. and Lainiotis, D. (1972). Digital matched filters for detecting
gaussian signals in gaussian noise. Information Science, 4:233-249.
Henze, M., Grady, C. P. L., Jr., Gujer, W., Marais, G., and Matsuo, T. (1983).
Activated sludge model. Tech. Report No. 1, IAWPRC.
Hergert, F., Finnoff, W., and Zimmermann, H. G. (1992). A comparison of
weight elimination methods for reducing complexity in neural networks. In
Proc. Int. Joint Con! on Neural Networks, pages 980-987.
Hering, K., Haupt, R., and Villmann, T. (1995). An improved mixture of ex-
perts approach for model partitioning in VLSI-design using genetic algo-
rithms. Tech. report. Universitat Leipzig, Fakultat fur Mathematik und In-
formatik, 1995.
Hering, K., Haupt, R., and Villmann, T. (1996). Hierarchical strategy of model
partitioning for VLSI-design using an improved mixture of experts approach.
In Tenth Workshop on Parallel and Distributed Simulation - PADS 96, Proc.,
pages 106-113.
Hilborn, C. and Lainiotis, D. (1968). Optimal unsupervised learning multicate-
gory dependent hypotheses pattern recognition. IEEE Trans. on Information
Theory, 14:468-470.
Hilborn, C. and Lainiotis, D. (1969a). Optimal estimation in the presence of
unknown parameters. IEEE Trans. on Systems Science and Cybernetics,
1:38-43.
Hilborn, C. and Lainiotis, D. (1969b). Unsupervised learning minimum risk pat-
tern classification for dependent hypotheses and dependent measurements.
IEEE Trans. on Systems Science and Cybernetics, 5:109-115.
Hilhorst, R., van Amerongen, J., and Lohnberg, P. (1991). Intelligent adaptive
control of mode-switch processes. In Proc. IFAC International Symposium
on Intelligent Tuning and Adaptive Control, Singapore.
Ho, K., Hsu, Y., Chen, C., Lee, T., Liang, C., Lai, T., and Chen, K. (1990).
Short term load forecasting of Taiwan power system using a knowledge-
based expert system. In IEEE/PES 1990 Winter Meeting. Paper 90 WM
259-2 PWRS.
Ho, K., Hsu, Y., and Yang, C. C. (1992). Short term load forecasting using a
multilayer neural network with an adaptive learning algorithm. IEEE Trans.
on Power Systems, 7:141-149.
Hochberg, M., Cook, G., Renals, S., and Robinson, A. J. (1994). Connectionist
model combination for large vocabulary speech recognition. Neural Networks
for Signal Processing, pages 269-278.
Hofmann, R. and Tresp, V. (1995). Discovering structure in continuous vari-
ables using Bayesian networks. In Advances in Neural Information Process-
ing Systems 8.
Hogarth, R. (1989). On combining diagnostic forecasts: thoughts and some
evidence. Int. Journal of Forecasting, 5:593-597.
Holmstrom, L., Koistinen, P., and Laaksonen, J. (1997). Neural and sta-
tistical classifiers - taxonomy and two case studies. IEEE Trans. on Neural
Networks, 8:5-17.
Hong, L. and Lynch, A. (1993). Recursive temporal-spatial information fusion
with applications to target identification. IEEE Trans. on Aerospace and
Electronic Systems, 29:435-444.
Hrycej, T. (1992). Modular learning in neural networks. Wiley.
Hu, Y. H., Palreddy, S., and Tompkins, W. J. (1995). Customized ECG beat
classifier using mixture of experts. Neural Networks for Signal Processing,
pages 459-464.
Hunt, K. J., Kalkkuhl, J. C., Fritz, H., and Johansen, T. A. (1996). Construc-
tive empirical modeling of longitudinal vehicle dynamics using local model
networks. Control Engineering Practice, 4:167-178.
Hunt, K. J., Haas, R., and Murray-Smith, R. (1996). Extending the functional
equivalence of radial basis function networks and fuzzy inference systems.
IEEE Trans. on Neural Networks, 7:776-781.
Hunt, K. J., Haas, R., and Brown, M. (1995). On the functional equivalence of
fuzzy inference systems and spline-based networks. Int. Journal of Neural
Systems, 6:171-184.
Hwang, J.-N., You, S.-S., Lay, S.-R., and Jou, I.-C. (1993). What's wrong with
a cascaded correlation learning network: A projection pursuit learning per-
spective. In Int. Symposium on Artificial Neural Networks, pages E11-E20.
Iso, K. and Watanabe, T. (1990). Speaker-independent word recognition using
a neural prediction model. In IEEE ICSP, pages 441-444.
Jacobs, R. A. (1989). Initial experiments on constructing domains of expertise
and hierarchies in connectionist systems. In Proc. of the 1988 Connectionist
Models Summer School, pages 144-153.
Jacobs, R. A. (1995). Methods for combining experts' probability assessments.
Neural Computation, 7:867-888.
Jacobs, R. A. (1997). Bias/variance analyses of mixtures-of-experts architec-
tures. Neural Computation, 9:369-383.
Jacobs, R. A. and Jordan, M. I. (1991). A competitive modular connectionist
architecture. In Advances in Neural Information Processing Systems 3, pages
767-773.
Jacobs, R. A. and Jordan, M. I. (1993). Learning piecewise control strategies in
a modular neural-network architecture. IEEE Trans. on Systems, Man and
Cybernetics, 23:337-345.
Jacobs, R. A., Jordan, M. I., and Barto, A. G. (1991). Task decomposition
through competition in a competitive modular connectionist architecture:
The what and where vision tasks. Cognitive Science, 15:219-250.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive
mixtures of local experts. Neural Computation, 3:79-87.
Jacobs, R. A., Peng, F. C., and Tanner, M. A. (1997). A Bayesian approach
to model selection in hierarchical mixtures-of-experts architectures. Neural
Networks, 10:231-241.
Jain, A. K. and Dubes, R. C. (1988). Algorithms for Clustering Data. Prentice
Hall.
Jain, A. K. and Mao, J. (1997). Guest editorial: Special issue on artificial neural
networks and statistical pattern recognition. IEEE Trans. on Neural Net-
works, 8:1-3.
Jang, J. S. R. and Sun, C. T. (1993). Functional equivalence between radial
basis function networks and fuzzy inference systems. IEEE Trans. on Neural
Networks, 4:156-159.
Jelinek, F. et al. (1983). A maximum likelihood approach to continuous speech
recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence,
5:179-190.
Jensen, F. V. (1993). An introduction to Bayesian networks.
Jeppson, U. and Olsson, G. (1993). Reduced order models for on-line parameter
identification of the activated sludge process. Water Science and Technology,
28:173-183.
Ji, C. and Ma, S. (1997). Combinations of weak classifiers. IEEE Trans. on
Neural Networks, 8:32-42.
Johansen, T. and Foss, B. (1995). Identification of nonlinear system structure
and parameters using regime decomposition. Automatica, 31:321-326.
Johansen, T. and Foss, B. (1997). Operating regime based process modelling
and identification. Computers and Chemical Eng., 21:159-176.
Jordan, M. (1994). A statistical approach to decision tree modelling. In Proc.
of the Seventh Annual ACM Conference on Computational Learning Theory,
pages 13-20.
Jordan, M., Ghahramani, Z., and Saul, L. (1997). Hidden Markov decision trees.
In Advances in Neural Information Processing Systems 9, pages 501-507.
Jordan, M.I., Ghahramani, Z., Saul, L.K. and Jaakkola, T.S. (1998). An In-
troduction to Variational Methods for Graphical Models. Technical Report,
University of California, Berkeley, Number CSD-98-980.
Jordan, M. and Jacobs, R. (1994). Hierarchical mixtures of experts and the EM
algorithm. Neural Computation, 6:181-214.
Jordan, M. and Xu, L. (1995). Convergence results for the EM approach to
mixtures of experts architectures. Neural Networks, 8:1409-1431.
Joshi, A., Ramakrishnan, N., and Houstis, E. N. (1997). On neurobiological,
neuro-fuzzy, machine learning, and statistical pattern recognition techniques.
IEEE Trans. on Neural Networks, 8:18-31.
Kaburlasos, V. and Petridis, V. (1997). Fuzzy lattice neurocomputing: a novel
connectionist scheme for versatile learning and decision making by clustering.
Int. Journal of Computers and their Applications, 4:31-43.
Petridis, Vas. and Kaburlasos, V. (1998). Fuzzy lattice neural network: a hybrid
model for learning. IEEE Trans. on Neural Networks, to appear.
Kadirkamanathan, V. and Niranjan, M. (1992). Application of an architecturally
dynamic neural network for speech pattern classification. Proc. of the Inst.
of Acoustics, 14:343-350.
Kadirkamanathan, V. and Niranjan, M. (1993). A function estimation approach
to sequential learning with neural networks. Neural Computation, 5:954-975.
Kadirkamanathan, V., Niranjan, M., and Fallside, F. (1991). Sequential adap-
tation of radial basis function networks. In Advances in Neural Information
Processing Systems 3, pages 721-727.
Kaminski, W. and Strumillo, P. (1997). Kernel orthonormalization in radial
basis function networks. IEEE Trans. on Neural Networks, 8:1177-1183.
Kandadai, R. and Tien, J. (1997). A knowledge base generating hierarchical
fuzzy neural controller. IEEE Trans. on Neural Networks, 8:1531-1541.
Kang, K. and Oh, J. (1996). Statistical mechanics of the mixture of experts. In
Advances in Neural Information Processing Systems 9, pages 183-189.
Kang, K., Oh, J. H., and Kwon, C. (1997). Learning by a population of per-
ceptrons. Physical Review E, 55:3257-3261.
Karayiannis, N. (1994). MECA: Maximum entropy clustering algorithm. In
IEEE Conf. on Fuzzy Systems, pages 630-635.
Karayiannis, N. and Weiqun, G. (1997). Growing radial basis neural networks:
merging supervised and unsupervised learning with network growth tech-
niques. IEEE Trans. on Neural Networks, 8:1492-1506.
Karlin, S. and Taylor, H. M. (1975). A First Course in Stochastic Processes.
Academic Press.
Karnin, E. D. (1990). A simple procedure for pruning back-propagation trained
neural networks. IEEE Trans. on Neural Networks, 1:239-242.
Kashyap, R. (1977). A Bayesian comparison of different classes of dynamic
models using empirical data. IEEE Trans. on Automatic Control, 22:715-
727.
Kazakos, D. (1991). Asymptotic error probability expressions for multihypoth-
esis testing using multisensor data. IEEE Trans. on Systems, Man and Cy-
bernetics, 21:1101-1114.
Kecman, V. (1996). System identification using modular neural network with
improved learning. In Proc. of Int. Workshop on Neural Networks for Identi-
fication, Control, Robotics, and Signal/Image Processing, NICROSP, pages
40-48.
Kehagias, Ath. (1990). Optimal control for training: the missing link between
hidden Markov models and connectionist networks. Math. and Comp. Mod-
elling, 14:284-289.
Kehagias, Ath. and Petridis, Vas. (1997a). Predictive Modular neural networks
for time series classification. Neural Networks, 10:31-49.
Kehagias, Ath. and Petridis, Vas. (1997b). Time series segmentation using pre-
dictive modular neural networks. Neural Computation, 9:1691-1710.
Kenny, P., Lennig, M., and Mermelstein, P. (1990). A linear predictive HMM
for vector-valued observations with applications to speech recognition. IEEE
Trans. on Acoustics, Speech and Signal Proc., 38:220-225.
Khotanzad, A., Afkhami-Rohani, R., Lu, T.-L., Abaye, A., Davis, M., and
Maratukulam, D. J. (1997). ANNSTLF - a neural-network-based electric
load forecasting system. IEEE Trans. on Neural Networks, 8:835-846.
Kiartzis, S., Petridis, V., Bakirtzis, A., and Kehagias, A. (1997). Short term
load forecasting using a Bayesian combination algorithm. Electrical Power
and Energy Systems, 19:171-177.
Kim, H. and Mendel, J. (1995). Fuzzy basis functions: Comparison with other
basis functions. IEEE Trans. on Fuzzy Systems, 3:158-168.
Klagges, H. and Soegtrop, M. (1992). Limited fan-in random wired Cascade-
Correlation. Ftp from archive.cis.ohio-state.edu in /pub/neuroprose.
Kohonen, T. (1982). Analysis of a simple self-organizing process. Biol. Cyber-
netics, 44:135-140.
Kohonen, T. (1988a). An introduction to neural computing. Neural Networks,
1:3-16.
Kohonen, T. (1988b). Self-Organization and Associative Memory. Springer.
Kohonen, T. (1990). The self-organizing map. Proc. of the IEEE, 78:1464-1480.
Kohonen, T. (1995). Self-Organizing Maps. Springer.
Kosko, B. (1991a). Neural Networks and Fuzzy Systems. Prentice-Hall.
Kosko, B. (1991b). Stochastic competitive learning. IEEE Trans. on Neural
Networks, 2:522-529.
Krishnan, R. and Doran, F. (1987). Study of parameter sensitivity in high-
performance and inverter-fed induction motor drive systems. IEEE Trans.
on Ind. Appl., 23:263-265.
Krishnan, R. and Doran, F. (1991). A review of parameter sensitivity and
adaptation in indirect vector controlled induction motor drive. IEEE Trans.
on Power Electronics, 6:695-703.
Krogh, A. and Sollich, P. (1997). Statistical mechanics of ensemble learning.
Physical Review E, 55:811-825.
Krogh, A. and Vedelsby, J. (1995). Neural network ensembles, cross validation,
and active learning. In Advances in Neural Information Processing Systems
7, pages 231-238.
Krolzig, H. (1997). Markov switching vector autoregressions. Springer.
Krzysztofowicz, R. (1990). Fusion of detection probabilities and compari-
son of multisensor systems. IEEE Trans. on Systems, Man and Cybernetics,
20:665-677.
Krzyzak, A., Linder, T., and Lugosi, G. (1996). Nonparametric estimation and
classification using radial basis function nets and empirical risk minimization.
IEEE Trans. on Neural Networks, 7:475-487.
Kubat, M. and Flotzinger, D. (1995). Pruning multivariate decision trees by
hyperplane merging. Lecture Notes in Artificial Intelligence, 912:190-199.
Kulkarni, S. and Ramadge, P. (1996). Model and controller selection poli-
cies based on output prediction errors. IEEE Trans. on Automatic Control,
41:1594-1604.
Kung, S. Y. and Taur, J. S. (1995). Decision-based neural networks with
signal/image classification applications. IEEE Trans. on Neural Networks,
6:170-181.
Kuensch, H., Geman, S., and Kehagias, Ath. (1995). Hidden Markov random
fields. The Annals of Applied Probability, 5:577-602.
Kushner, H. and Clark, D. (1978). Stochastic approximation methods for con-
strained and unconstrained systems. Springer.
Kwok, T. and Yeung, D. (1997a). Objective functions for training new hidden
units in constructive neural networks. IEEE Trans. on Neural Networks,
8:1131-1148.
Kwok, T.-Y. and Yeung, D.-Y. (1997b). Constructive algorithms for structure
learning in feedforward neural networks for regression problems. IEEE Trans.
on Neural Networks, 8:630-645.
Laguna, P. et al. (1992). Adaptive filter for event related bioelectric signals us-
ing an impulse correlated reference input: comparison with signal averaging
techniques. IEEE Trans. on Biomedical Engineering, 39:1032-1044.
Lainiotis, D. (1970). Sequential structure and parameter-adaptive pattern recog-
nition - part I: supervised learning. IEEE Trans. on Information Theory,
16:548-556.
Lainiotis, D. (1971a). Joint detection, estimation and system identification.
Information and Control, 19:75-92.
Lainiotis, D. (1971b). Optimal adaptive estimation: Structure and parameter
adaptation. IEEE Trans. on Automatic Control, 16:160-170.
Lainiotis, D. (1973). Adaptive control of linear stochastic systems. Automatica,
9:107-115.
Lainiotis, D. (1974b). Estimation Theory. Elsevier.
Lainiotis, D. (1974a). Estimation: a brief survey. In Estimation Theory, ed.
Lainiotis, D. Elsevier.
Lainiotis, D. (1974c). Partitioned estimation algorithms, I: Nonlinear estima-
tion. In Estimation Theory. ed. Lainiotis, D. Elsevier.
Lainiotis, D. (1974d). Partitioned estimation algorithms, II: Linear estimation.
In Estimation Theory. ed. Lainiotis, D. Elsevier.
Lainiotis, D., Deshpande, J., and Upadhyay, T. (1972). Optimal adaptive con-
trol: A nonlinear separation theorem. Int. Journal of Control, 15:877-888.
Lainiotis, D., Katsikas, S., and Likothanasis, S. (1988). Adaptive deconvolution
of seismic signals: performance, computational analysis, parallelism. IEEE
Trans. on Acoustics, Speech and Signal Processing, 36:1715-1734.
Lainiotis, D. and Likothanasis, S. (1987). Partitioned adaptive control algo-
rithms - comparative computational analysis-parallelism. Control and Com-
puters, 15:40-47.
Lainiotis, D. and Plataniotis, K. (1994a). Adaptive dynamic neural network
estimators. In Proc. of IJCNN '94, pages 4736-4745.
Lainiotis, D. and Plataniotis, K. (1994b). Neural network estimators: Applica-
tion to ship position estimation. In IEEE ICNN '94, pages 4710-4715.
Lainiotis, D. and Plataniotis, K. (1994c). A new class of intelligent neural net-
work controllers. In IEEE ICNN '94, pages 2344-2349.
Layeghi, S. et al. (1994). Pattern recognition of the polygraph using fuzzy
classification. In 3rd IEEE Conf. on Fuzzy Systems, pages 1825-1829.
Le Cun, Y., Denker, J. S., and Solla, S. A. (1990). Optimal brain damage. In
Advances in Neural Information Processing Systems, pages 598-605.
Lee, C. and Landgrebe, D. A. (1997). Decision boundary feature extraction for
neural networks. IEEE Trans. on Neural Networks, 8:75-83.
Lee, S. and Pan, J. C.-J. (1996). Unconstrained handwritten numeral recogni-
tion based on radial basis competitive and cooperative networks with spatio-
temporal feature representation. IEEE Trans. on Neural Networks, 7:455-
474.
Lee, Y. (1991). Handwritten digit recognition using k-nearest-neighbor, radial
basis function, and backpropagation neural networks. Neural Computation,
3:440-449.
Leontaritis, I. and Billings, S. (1985a). Input-output parametric models for
non-linear systems. Part I: Deterministic non-linear systems. Int. Journal of
Control, 41:303-328.
Leontaritis, I. and Billings, S. (1985b). Input-output parametric models for
non-linear systems. Part II: Stochastic non-linear systems. Int. Journal of
Control, 41:329-344.
Levin, A. U., Leen, T. K., and Moody, J. E. (1994). Fast pruning using principal
components. In Advances in Neural Information Processing Systems, 6.
Levin, E. (1993). Hidden control neural architecture modelling of nonlinear
time-varying systems and its applications. IEEE Trans. on Neural Networks,
4:109-116.
Levinson, S. et al. (1983). An introduction to the application of the theory of
probabilistic functions of a Markov process to automatic speech recognition.
The Bell Sys. Tech. Journal, 62:1035-1074.
Lewis, P., Ray, B., and Stevens, J. (1994). Modeling time series by using multi-
variate adaptive regression splines (MARS). In Time series prediction: fore-
casting the future and understanding the past. Addison-Wesley.
Li, Q. and Tufts, D. W. (1997). Principal feature classification. IEEE Trans.
on Neural Networks, 8:155-160.
Li, Y., Tan, K. C., Ng, K. C., and Murray-Smith, D. (1995). Performance based
linear control system design by genetic evolution with simulated annealing.
In Proc. of IEEE Conf. on Decision and Control.
Li, Y., Ng, K. C., Tan, K. C., Gray, G. J., McGookin, E. W., Murray-Smith,
D. J., and Sharman, K. C. (1994). Automation of linear and nonlinear con-
trol systems design by evolutionary computation. In Proc. IFAC Youth Au-
tomation Conf. Beijing, China.
Littlestone, N. and Warmuth, M. K. (1991). The weighted majority algorithm.
Tech. report UCSC-CRL-91-28, University of California at Santa Cruz.
Littmann, E. and Ritter, H. (1992). Cascade network architectures. In Proc.
Int. Joint Conf. on Neural Networks, pages 398-404.
Littmann, E. and Ritter, H. (1993). Generalization abilities of cascade net-
work architectures. In Advances in Neural Information Processing Systems
5, pages 188-195.
Littmann, E. and Ritter, H. (1996). Learning and generalization in cascade
network architectures. Neural Computation, 8:1521-1539.
Liu, Y. and Yao, X. (1997). Evolving modular neural networks which generalize
well. In Proc. of the IEEE Conference on Evolutionary Computation, pages
605-610.
Ljung, L. (1987). System Identification: Theory for the User. Prentice Hall.
Lo, Z. and Bavarian, B. (1991). On the rate of convergence in topology pre-
serving neural networks. Biol. Cybernetics, 65:55-63.
Lo, Z.-P., Yu, Y., and Bavarian, B. (1993). Analysis of the convergence prop-
erties of topology preserving neural networks. IEEE Trans. on Neural Net-
works, 4:207-220.
Lu, C., Wu, H., and Vemuri, S. (1993). Neural network based short term load
forecasting. IEEE Trans. on Power Systems, 8:336-342.
Luttrell, S. P. (1991). Code vector density in topographic mappings: Scalar
case. IEEE Trans. on Neural Networks, 2:427-436.
Luttrell, S. (1994). A Bayesian analysis of self-organizing maps. Neural Com-
putation, 6:767-794.
Luttrell, S. (1997). Self organization of multiple winner take all neural networks.
Connection Science, 9:11-30.
MacKay, D. (1996). Equivalence of linear Boltzmann chains and hidden Markov
models. Neural Computation, 8:178-181.
MacQueen, J. (1965). Some methods for classification and analysis of multi-
variate observations. In Proc. of the Berkeley Symposium on Mathematical
Statistics and Probability.
Magill, D. (1965). Optimal adaptive estimation of sampled stochastic processes.
IEEE Trans. on Automatic Control, 10:434-439.
Makridakis, S. (1989). Why combining works? Int. Journal of Forecasting,
5:601-603.
Mangeas, M., Muller, C., and Weigend, A. S. (1995). Forecasting electricity de-
mand using a mixture of nonlinear experts. In World Congress on Neural
Networks, 2:48-53.
Mani, G. (1991). Lowering variance of decisions by using artificial neural net-
work portfolios. Neural Computation, 3:484-486.
Mariton, M. (1990). Jump linear systems in automatic control. Marcel Dekker.
McGillem, C., Aunon, J., and Pomalaza-Raez, C. (1985). Improved waveform
estimation procedures for event related potentials. IEEE Trans. on Biomed-
ical Engineering, 39:371-379.
McGillem, C., Aunon, J., and Yu, K. (1985). Signals and noise in evoked brain
potentials. IEEE Trans. on Biomedical Engineering, 32:371-379.
Meila, M. and Jordan, M. I. (1997). Markov mixtures of experts. In Multiple
Model Approaches to Modelling and Control. Taylor and Francis.
Meir, R. (1995). Bias, variance and the combination of least squares estimators.
In Advances in Neural Information Processing Systems 7, pages 295-302.
Mezard, M. and Nadal, J.-P. (1989). Learning in feedforward layered networks:
The Tiling algorithm. Journal of Physics A: Math. Gen., 22:2191-2203.
Miller, D. and Rose, K. (1996). Hierarchical unsupervised learning with growing
phase transitions. Neural Computation, 8:425-450.
Miller, D. and Uyar, S. (1996). A mixture of experts classifier with learning
based on both labelled and unlabelled data. In Advances in Neural Informa-
tion Processing Systems 9, pages 571-578.
Millnert, M. (1987). Identification of ARX models with Markovian parameters.
Int. Journal of Control, 45:2045-2058.
Mohammed, O., Park, D., Merchant, R., Dinh, T., Tong, C., Azeem, A., Farah,
J., and Drake, C. (1994). Practical experiences with an adaptive neural net-
work short term load forecasting system. Paper 94 WM 210-5 PWRS pre-
sented in IEEE/PES 1994 Winter Meeting.
Moody, J. and Darken, C. J. (1989). Fast learning in networks of locally-tuned
processing units. Neural Computation, 1:281-294.
Moon, Y. J. and Oh, S. Y. (1995). On an efficient design algorithm for modular
neural networks. In Proc. of the Int. Conf. on Neural Networks, 3:1310-1315.
Morse, A. and Mayne, D. (1992). Applications of hysteresis switching in para-
meter adaptive control. IEEE Trans. on Automatic Control, 37:1343-1354.
Mukherjee, S. and Fine, T. (1996). Accelerated training of neural networks by
ensemble pruning. In Proc. of World Congress on Neural Networks, pages
51-56.
Murray-Smith, R. (1994). A local model network approach to nonlinear mod-
elling. PhD Thesis, Department of Computer Science, University of Strath-
clyde, 1994.
Murray-Smith, R. and Gollee, H. (1994). A constructive learning algorithm for
local model networks. In Proc. of IEEE Workshop on computer intensive
methods in control and signal processing, pages 21-29.
Murray-Smith, R. and Johansen, T. (1997). Multiple model approaches to mod-
elling and control. Taylor and Francis.
Muselli, M. (1995). On sequential construction of binary neural networks. IEEE
Trans. on Neural Networks, 6:678-690.
Mutlukan, E. and Keating, D. (1994). Visual field interpretation with a personal
computer based neural network. Eye, 8:321-323.
Nabhan, T. M. and Zomaya, A. Y. (1994). Toward generating neural network
structures for function approximation. Neural Networks, 7:89-99.
Nadal, J.-P. (1989). Study of a growth algorithm for a feedforward network.
Int. Journal of Neural Systems, 1:55-59.
Naftaly, U., Intrator, N., and Horn, D. (1997). Optimal ensemble averaging of
neural networks. Network-Computation in Neural Systems, 8:283-296.
Narendra, K. and Balakhrishnan, J. (1994). Improving transient response of
adaptive control systems using multiple models and switching. IEEE Trans.
on Automatic Control, 39:1861-1866.
Narendra, K., Balakhrishnan, J., and Ciliz, M. (1995). Adaptation and learning
using multiple models, switching and tuning. IEEE Control Systems, pages
37-51.
Narendra, K. and Balakrishnan, J. (1997). Adaptive control using multiple
models. IEEE Trans. on Automatic Control, 42:171-187.
Neal, R. (1991). Bayesian mixture modelling. In Maximum Entropy and Bayesian
Methods, pages 197-211.
Neal, R. (1992). Asymmetric parallel Boltzmann machines are belief networks.
Neural Computation, 4:832-834.
Niebur, D. et al. (1995). Artificial neural networks for power systems. Tech.
report CIGRE TF 38.06.06, ELECTRA, pages 77-101.
Nowlan, S. (1990a). Competing experts: an experimental investigation of as-
sociative mixture models. Tech. report CRG-TR-90-5, Department of Com-
puter Science, University of Toronto.
Nowlan, S. (1990b). Maximum likelihood competitive learning. In Advances in
Neural Information Processing Systems 2, pages 574-582.
Nowlan, S. (1991). Soft Competitive Adaptation: Neural Network Learning Al-
gorithms based on Fitting Statistical Mixtures. PhD thesis, Department of
Computer Science, University of Toronto.
Nowlan, S. J. and Hinton, G. E. (1991). Evaluation of adaptive mixtures of
competing experts. In Advances in Neural Information Processing Systems
3, pages 774-780.
Nozaki, K. et al. (1994). Selecting fuzzy rules with forgetting in fuzzy classifi-
cation systems. In IEEE Conf. on Fuzzy Systems, pages 618-623.
Omlin, C. W. and Giles, C. L. (1993). Pruning recurrent neural networks for
improved generalization performance. Tech. report 93-6, Computer Science
Department, Rensselaer Polytechnic Institute.
Omohundro, S. (1991). Bumptrees for efficient function, constraint and classi-
fication learning. In Advances in Neural Information Processing Systems 3,
pages 693-699.
Opitz, D. and Shavlik, J. (1996). Actively searching for an effective neural
network ensemble. Connection Science, 8:337-353.
Pal, N. R., Bezdek, J. C., and Hathaway, R. J. (1996). Sequential competi-
tive learning and the fuzzy c-means clustering algorithms. Neural Networks,
9:787-796.
Pal, S. and Majumder, D. (1977). Fuzzy sets and decision making approaches
in vowel and speaker recognition. IEEE Trans. on Systems, Man and Cyber-
netics, 7:625-629.
Papalexopoulos, A., How, S., and Peng, T. (1994). An implementation of a
neural network based load forecasting model for the EMS. In IEEE/PES
1994 Winter Meeting.
Papalexopoulos, A. D. and Hesterberg, T. C. (1990). A regression-based ap-
proach to short term system load forecasting. IEEE Trans. on Power Sys-
tems, 5:1535-1547.
Papastavrou, J. and Athans, M. (1992). Distributed detection by a large team
of sensors in tandem. IEEE Trans. on Aerospace and Electronic Systems,
28:639-653.
Park, D., El-Sharkawi, M., Marks, R. J., Atlas, L., and Damborg, M. (1991).
Electric load forecasting using an artificial neural network. IEEE Trans. on
Power Systems, 6:442-449.
Park, J. and Hu, Y. H. (1996). Estimation of correctness region using clustering
in mixture of experts. In Int. Conf. on Neural Networks, pages 1395-1399.
Park, J. and Sandberg, I. W. (1991). Universal approximation using radial-
basis-function networks. Neural Computation, 3:246-257.
Park, J. and Sandberg, I. (1993). Approximation and radial basis function
networks. Neural Computation, 5:305-316.
Park, S. and Lainiotis, D. (1972). Joint detection-estimation of Gaussian signals
in white Gaussian noise. Information Science, 4:315-325.
Park, Y., Moon, U., and Lee, K. (1995). A self-organizing fuzzy logic controller
for dynamic systems using a FARMA model. IEEE Trans. on Fuzzy Systems,
3:75-82.
Parmanto, B., Munro, P., and Doyle, H. (1996). Reducing variance of committee
prediction with resampling techniques. Connection Science, 8:405-425.
Patrick, E. (1972). Fundamentals of Pattern Recognition. Prentice Hall.
Pawelzik, K., Kohlmorgen, J., and Muller, K. R. (1996). Annealed competition
of experts for a segmentation and classification of switching dynamics. Neural
Computation, 8:340-356.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems. Morgan Kauf-
mann.
Peng, F. C., Jacobs, R. A., and Tanner, M. A. (1996). Bayesian inference in
mixtures-of-experts and hierarchical mixtures-of-experts models with an
application to speech recognition. Journal of the American Statistical Asso-
ciation, 91:953-960.
Peng, T., Hubele, N., and Karady, G. G. (1992). Advancement in the appli-
cation of neural networks for short term load forecasting. IEEE Trans. on
Power Systems, 7:250-258.
Perrone, M. and Cooper, L. (1993). When networks disagree: ensemble meth-
ods for neural networks. In Artificial neural networks for speech and vision.
Chapman-Hall.
Perrone, M. P. and Intrator, N. (1992). Unsupervised splitting rules for neural
tree classifiers. In Proc. of the International Joint Conference on Neural
Networks, pages 820-825.
Petridis, Vas. (1981). A method for bearings-only velocity and position estima-
tion. IEEE Trans. on Automatic Control, 26:488-493.
Petridis, Vas. and Kehagias, Ath. (1996a). Modular neural networks for MAP
classification of time series and the partition algorithm. IEEE Trans. on
Neural Networks, 7:73-86.
Petridis, Vas. and Kehagias, Ath. (1996b). A recurrent network implementation
of Bayesian time series classification. Neural Computation, 8:357-372.
Petridis, Vas. and Kehagias, Ath. (1997). Predictive modular fuzzy systems for
time-series classification. IEEE Trans. on Fuzzy Systems, 5:381-397.
Petridis, Vas. and Kehagias, Ath. (1998). A multi-model algorithm for parame-
ter estimation of time-varying nonlinear systems. Automatica, 34:469-475.
Petridis, Vas. and Paraschidis, K. (1993). Structural adaptation based on a sim-
ple learning algorithm. In Proc. IJCNN 93, pages 621-624. Nagoya, Japan.
Petrie, T. (1969). Probabilistic functions of finite state Markov chains. Annals
of Mathematical Statistics, 40:102-115.
Platt, J. (1991a). A resource-allocating network for function interpolation.
Neural Computation, 3:213-225.
Platt, J. C. (1991b). Learning by combining memorization and gradient descent.
In Advances in Neural Information Processing Systems 3, pages 714-720.
Poritz, A. (1982). Linear prediction hidden Markov models and the speech
signal. In IEEE ICASSP, pages 1291-1294.
Poritz, A. (1988). Hidden Markov models: a guided tour. In IEEE Int. Confer-
ence on Speech and Signal Processing, pages 7-13.
Pottmann, M., Unbehauen, H., and Seborg, D. (1993). Application of a general
multi-model approach for identification of highly nonlinear processes - a case
study. Int. Journal of Control, 57:97-120.
Prakash, M. and Murty, M. N. (1997). Growing subspace pattern recognition
methods and their neural-network models. IEEE Trans. on Neural Networks,
8:161-168.
Prank, K. et al. (1996). Self-organized quantification of hormone pulsatility
based on predictive neural networks: Separating the secretory dynamics of
growth hormone in acromegaly from normal controls. Biophysical Journal,
70:2540-2547.
Qin, S. and Borders, G. (1994). A multi-region fuzzy logic controller for non-
linear process control. IEEE Trans. on Fuzzy Systems, 2:74-81.
Quandt, R. (1958). The estimation of the parameters of a linear regression
system obeying two separate regimes. Journal of the American Statistical
Association, 53:873-880.
Quandt, R. (1972). A new approach to estimating switching regressions. Journal
of the American Statistical Association, 67:306-310.
Quandt, R. and Ramsey, J. B. (1978). Estimating mixtures of normal distrib-
utions and switching regressions. Journal of the American Statistical Asso-
ciation, 73:730-752.
Quinlan, J. (1986). Induction of decision trees. Machine Learning, 1:81-106.
Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Quinlan, J. (1996). Learning decision tree classifiers. ACM Computing Surveys,
28:71-72.
Quinlan, J. and Rivest, R. L. (1989). Inferring decision trees using the minimum
description length principle. Information and Computation, 80:227-248.
Rabiner, L. (1988). A tutorial on hidden Markov models and selected applica-
tions in speech recognition. Proc. of the IEEE, 77:257-286.
Rahim, M. and Lee, C.-H. (1996). Joint ANN and HMM recognizer design using
string-based minimum classification error (MCE) training. In World Congress
on Neural Networks, pages 51-56.
Ramachandran, S. and Pratt, L. Y. (1991). Information measure based skele-
tonisation. In Advances in Neural Information Processing Systems, 3, pages
1080-1087.
Ramamurti, V. and Ghosh, J. (1996). Advances in using hierarchical mixture
of experts for signal classification. In Int. Conf. on Acoustics, Speech and
Signal Processing, pages 3569-3572.
Rahman, S. and Bhatnagar, R. (1988). An expert system based algorithm for
short term load forecast. IEEE Trans. on Power Systems, 3:392-399.
Raviv, Y. and Intrator, N. (1996). Bootstrapping with noise: an effective regu-
larization technique. Connection Science, 8:355-372.
Redding, N. J., Kowalczyk, A., and Downs, T. (1993). Constructive higher-
order network algorithm that is polynomial time. Neural Networks, 6:997-
1010.
Reed, R. (1993). Pruning algorithms - a survey. IEEE Trans. on Neural Net-
works, 4:740-746.
Ridella, S., Rovetta, S., and Zunino, R. (1997). Circular backpropagation net-
works for classification. IEEE Trans. on Neural Networks, 8:84-97.
Ritter, H. (1991). Asymptotic level density for a class of vector quantization
processes. IEEE Trans. on Neural Networks, 2:173-175.
Ritter, H. and Schulten, K. (1986). On the stationary state of Kohonen's self-
organizing sensory mapping. Biol. Cybernetics, 54:99-106.
Ritter, H. and Schulten, K. (1988). Convergence properties of Kohonen's topol-
ogy conserving maps: Fluctuations, stability and dimension selection. Biol.
Cybernetics, 60:59-71.
Rodriguez, C., Rementería, A., Martín, J. I., Lafuente, A., Muguerza, J., and
Perez, J. (1996). A modular neural network approach to fault diagnosis.
IEEE Trans. on Neural Networks, 7:326-340.
Rogova, G. (1994). Combining the results of neural network classifiers. Neural
Networks, 7:777-781.
Rojer, A. and Schwartz, E. (1989). A multiple-map model for pattern classifi-
cation. Neural Computation, 1:104-115.
Romaniuk, S. G. and Hall, L. O. (1993). Divide and conquer networks. Neural
Networks, 6:1105-1116.
Ronco, E. and Gawthrop, P. (1995). Modular neural networks: a state of the
art. Tech. report, Center for Systems and Control, Univ. of Glasgow.
Rosen, B. (1996). Ensemble learning using decorrelated neural networks. Con-
nection Science, 8:373-383.
Rosenblum, M. and Davis, L. S. (1996). An improved radial basis function
network for visual autonomous road following. IEEE Trans. on Neural Net-
works, 7:1111-1120.
Rosenblum, M., Yacoob, Y., and Davis, L. S. (1996). Human expression recog-
nition from motion using a radial basis function network architecture. IEEE
Trans. on Neural Networks, 7:1121-1138.
Roy, A., Govil, S., and Miranda, R. (1997). A neural network learning theory
and a polynomial time RBF algorithm. IEEE Trans. on Neural Networks,
8:1301-1313.
Royden, H. (1968). Real Analysis. MacMillan.
Rudolph, G. (1994). Convergence Analysis of Canonical Genetic Algorithms.
IEEE Trans. on Neural Networks, 5:96-101.
Rugh, W. (1981). Nonlinear system theory: the Volterra-Wiener approach.
Johns Hopkins University Press.
Ruspini, E. (1969). A new approach to clustering. Information and Control,
15:22-32.
Saito, K. and Nakano, R. (1996). A constructive learning algorithm for an HME.
In Int. Conf. on Neural Networks, pages 1268-1273.
Sanger, T. (1991a). A tree-structured algorithm for reducing computation in
networks with separable basis functions. Neural Computation, 3:67-78.
Sanger, T. D. (1991b). A tree-structured adaptive network for function ap-
proximation in high-dimensional spaces. IEEE Trans. on Neural Networks,
2:285-293.
Sankar, A. and Mammone, R. J. (1991). Optimal pruning of neural tree net-
works for improved generalization. In Proc. Int. Joint Con! on Neural Net-
works, pages 219-224.
Satoshi, Y., Hidekiyo, I., and Yoshikazu, N. (1995). Inverse model learning
algorithm using the hierarchical mixtures of experts. In Int. Conf. on Neural
Networks, pages 2738-2742.
Sbarbaro, D. (1997). Local Laguerre models. In Multiple Model Approaches to
Modelling and Control. Taylor and Francis.
Schaal, S. and Atkeson, C. G. (1996). From isolation to cooperation: an alterna-
tive view of a system of experts. In Advances in Neural Information Process-
ing Systems 8, pages 486-492.
Schwarze, H. and Hertz, J. (1994). Discontinuous generalization in large committee machines. In Advances in Neural Information Processing Systems
6, pages 399-405.
Schwenk, H. and Bengio, Y. (1997). Adaptive boosting of neural networks for
character recognition. Tech. report, Dept. Informatique et Recherche Opérationnelle, Université de Montréal.
Seneta, E. (1987). Non-Negative Matrices. Springer.
Setiono, R. and Hui, L. (1995). Use of a quasi-Newton method in a feedforward
neural network construction algorithm. IEEE Trans. on Neural Networks,
6:273-277.
Shadafan, R. and Niranjan, M. (1993). A dynamic neural network architecture
by sequential partitioning of the input space. IEEE International Conference
on Neural Networks, pages 226-231.
Sharkey, A.J.C. (1996). On combining artificial neural nets. Connection Sci-
ence, 8:299-313.
Sharkey, A.J.C. (1997). Modularity, combining and artificial neural nets. Con-
nection Science, 9:3-10.
Sherstinksy, A. and Picard, R. W. (1996). On the efficiency of the orthogonal
least squares training method for radial basis function networks. IEEE Trans.
on Neural Networks, 7:195-200.
Shimshoni, Y. and Intrator, N. (1998). Classifying seismic signals by integrating
ensembles of neural networks. IEEE Trans. on Signal Processing.
Sietsma, J. and Dow, R. J. F. (1988). Neural network pruning - why and how.
In IEEE Int. Conf. on Neural Networks, pages 325-333.
Sims, F., Lainiotis, D., and Magill, D. (1969). Recursive algorithm for the calcu-
lation of the adaptive Kalman filter coefficients. IEEE Trans. on Automatic
Control, 14:215-218.
Sin, S.-K. and DeFigueiredo, R. J. (1993). Efficient learning procedures for
optimal interpolative nets. Neural Networks, 6:99-113.
Sirat, J. A. and Nadal, J.-P. (1990). Neural trees: a new tool for classification.
Network-Computation in Neural Systems, 1:423-438.
Sjogaard, S. (1991). A Conceptual Approach to Generalisation in Dynamic
Neural Networks. PhD thesis, Aarhus University.
Sjogaard, S. (1992). Generalization in Cascade-Correlation networks. In Work-
shop on Neural Networks for Signal Processing 1992, Vol. 2, pages 59-68.
Skeppstedt, A., Ljung, L., and Millnert, M. (1992). Construction of composite
models from observed data. Int. Journal of Control, 55:141-152.
Smieja, F. (1996). The pandemonium system of reflective agents. IEEE Trans.
on Neural Networks, 7:97-106.
Smotroff, Friedman, and Conolly (1991). Self organizing modular neural networks.
In Proc. Int. Joint Conf. on Neural Networks, pages 187-192.
Sokol, S. (1976). Visually evoked potentials: theory, techniques and clinical
applications. Survey of Ophthalmology, 21:18-44.
Sorheim, E. (1990). A combined network architecture using ART2 and back
propagation for adaptive estimation of dynamical processes. Modeling, Iden-
tification and Control, 11:191-199.
Srinivasan, D., Chang, C., and Liew, A. (1995). Demand forecasting using fuzzy
neural computation, with special emphasis on weekend and public holiday
forecasting. Paper 95 WM 158-6-PWRS presented at IEEE/PES 1995 Win-
ter Meeting.
Srinivasan, K. (1969). State estimation by orthogonal expansion of probability
distributions. IEEE Trans. on Automatic Control, 15:3-10.
Strang, G. and Nguyen, T. (1996). Wavelets and Filter Banks. Wellesley -
Cambridge Press.
Stromberg, J., Gustaffson, F., and Ljung, L. (1991). Trees as black box model
structures for dynamical systems. In European Control Conference, Greno-
ble, pages 1175-1180.
Sugeno, M. and Kang, G. (1986). Fuzzy modelling and control of multilayer
incinerator. Fuzzy sets and systems, 18:329-346.
Sugeno, M. and Kang, G. (1988). Structure identification of fuzzy model. Fuzzy
sets and systems, 26:15-33.
Sugeno, M., Murofushi, T., Mori, T., Tatematsu, T., and Tanaka, J. (1989).
Fuzzy algorithmic control of a model car by oral instructions. Fuzzy Sets and
Systems, 32:207-219.
Sugeno, M. and Yasukawa, T. (1993). A fuzzy logic-based approach to qualita-
tive modeling. IEEE Trans. on Fuzzy Systems, 1:7-32.
Swiercz, M., Grusza, M., and Sobolewski, P. (1997). Analysis of visual evoked
potentials using neural networks. In 4th Int. Conference on Computers in
Medicine, pages 127-132.
Swihart, S. and Matheny, A. (1992). Classification of chromatic visual evoked
potentials with the aid of a neural net. Computers in Biology and Medicine, 22:165-171.
Sworder, D. (1969). Feedback control of a class of linear systems with jump
parameters. IEEE Trans. on Automatic Control, 14:9-14.
Takagi, T. and Sugeno, M. (1985). Fuzzy identification of systems and its ap-
plications for modelling and control. IEEE Trans. on Systems, Man and
Cybernetics, 15:116-132.
Tan, A.-H. (1997). Cascade ARTMAP: Integrating neural computation and
symbolic knowledge processing. IEEE Trans. on Neural Networks, 8:237-
250.
Tan, K., Li, Y., Murray-Smith, D., and Sharman, K. (1995). System identification
and linearisation using genetic algorithms with simulated annealing.
Proc. first IEE/IEEE Int. Conf. on Genetic Algorithms in Eng. Systems:
Innovations and Appl., Sheffield.
Taniguchi, M. and Tresp, V. (1997). Averaging regularized estimators. Neural
Computation, 9:1163-1178.
Tenney, R. and Sandell, N. (1981). Detection with distributed sensors. IEEE
Trans. on Aerospace and Electronic Systems, 17:98-101.
Tenorio, M. F. and Lee, W.-T. (1989). Self organizing neural networks for the
identification problem. In Advances in Neural Information Processing Sys-
tems 1, pages 57-64.
Thakor, N. and Sherman, D. (1996). Wavelet time-scale analysis in biomedical
signal processing. In The Biomedical Engineering Handbook. CRC Press.
Tham, C. K. (1995). On-line learning using hierarchical mixtures of experts. In
IEE Conference on Artificial Neural Networks, pages 347-351.
Thodberg, H. H. (1993). Ace of Bayes: Application of neural networks with
pruning. Tech. report 1132E, Danish Meat Research Institute.
Tong, H. and Lim, K. S. (1980). Threshold autoregression, limit cycles and
cyclical data. Journal of the Royal Stat. Soc., Part B, 42:245-292.
Tresp, V., Hollatz, J., and Ahmad, S. (1997). Representing probabilistic rules
with networks of Gaussian basis functions. Machine Learning, 27:173-200.
Tresp, V. and Taniguchi, M. (1995). Combining estimators using non-constant
weighting functions. In Advances in Neural Information Processing Systems
7, pages 419-426.
Tugnait, J. and Haddad, A. (1980). Adaptive estimation in linear systems with
unknown Markovian noise statistics. IEEE Trans. on Information Theory,
26:66-78.
Tumer, K. and Ghosh, J. (1996). Error correlation and error reduction in ensemble
classifiers. Connection Science, 8:385-404.
Unar, M. A. and Murray-Smith, D. J. (1997). Radial basis function networks
for ship steering control. In 12th International Conference on Systems En-
gineering (ICSE'97), Coventry, UK.
Urbanczik, R. (1996). A large committee machine learning noisy rules. Neural
Computation, 8:1267-1276.
Utans, J. (1994). Learning in compositional hierarchies: Inducing the structure
of objects from data. In Advances in Neural Information Processing Systems 6, pages 285-292.
Utkin, V. (1977). Variable structure systems with sliding modes. IEEE Trans.
on Automatic Control, 22:212-222.
Utkin, V. (1992). Sliding modes in control and optimization. Springer.
Vemuri, S., Huang, W. L., and Nelson, D. J. (1981). On-line algorithms for
forecasting hourly loads of an electric utility. IEEE Trans. on Power Apparatus and Systems,
pages 3775-3784.
von Sperling, M. (1993). Parameter estimation and sensitivity analysis of an
activated sludge model using Monte Carlo simulation and the analyst's in-
volvement. Water Science and Technology, 28:219-229.
Wahlberg, B. (1991). System identification using Laguerre models. IEEE Trans.
on Automatic Control, 36:551-562.
Waibel, A. (1988). Connectionist glue: Modular design of neural speech systems.
In Connectionist Models Summer School, pages 417-425.
Waibel, A. (1989a). Consonant recognition by modular construction of large
phonemic time-delay neural networks. In Advances in Neural Information
Processing Systems 1, pages 215-223.
Waibel, A. (1989b). Modular construction of time-delay neural networks for
speech recognition. Neural Computation, 1:39-46.
Waibel, A., Sawai, H., and Shikano, K. (1989). Modularity and scaling in large
phoneme neural networks. IEEE Trans. on Acoustics, Speech and Signal
Processing, 37:1888-1898.
Waterhouse, S. R. and Cook, G. D. (1996). Ensemble methods for phoneme
classification. In Advances in Neural Information Processing Systems 8, pages 800-806.
Waterhouse, S. R., MacKay, D., and Robinson, T. (1995). Bayesian methods for
mixtures of experts. In Advances in Neural Information Processing Systems
7, pages 351-357.
Waterhouse, S. R., Mackay, D., and Robinson, T. (1996). Bayesian methods for
mixtures of experts. In Advances in Neural Information Processing Systems
8, pages 351-357.
Waterhouse, S. R. and Robinson, A. J. (1994). Classification using hierarchical
mixtures of experts. Neural Networks for Signal Processing, pages 177-186.
Waterhouse, S. R. and Robinson, A. J. (1995). On the relationship between
hidden Markov models and hierarchical mixtures of experts. In Proc. of the
IEEE Workshop on Automatic Speech Recognition.
Waterhouse, S. R. and Robinson, A. (1996). Constructive algorithms for hier-
archical mixtures of experts. In Advances in Neural Information Processing
Systems 8, pages 584-590.
Weigend, A. S. (1996). Time series analysis and prediction using gated experts
with application to energy demand forecasts. Applied Artificial Intelligence,
10:583-624.
Weigend, A. S. and Gershenfeld, N. (1994). Time series prediction: forecasting
the future and understanding the past. Addison-Wesley.
Weigend, A. S., Mangeas, M., and Srivastava, N. (1995). Nonlinear gated experts
for time series: discovering regimes and avoiding overfitting. International Journal of Neural Systems, 6:373-399.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. (1990). Back-propagation,
weight-elimination and time series prediction. In Proc. Connectionist
Models Summer School, pages 105-116.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. (1991). Generalization
by weight-elimination with application to forecasting. In Advances in Neural
Information Processing Systems 3, pages 875-882.
Weymaere, N. and Martens, J.-P. (1991). A fast and robust learning algorithm
for feedforward neural networks. Neural Networks, 4:361-369.
Whitehead, B. (1996). Genetic evolution of radial basis function coverage using
orthogonal niches. IEEE Trans. on Neural Networks, 7:1525-1528.
Whitehead, B. A. and Choate, T. D. (1994). Evolving space-filling curves to dis-
tribute Radial Basis Functions over an input space. IEEE Trans. on Neural
Networks, 5:15-23.
Whitehead, B. and Choate, T. D. (1996). Cooperative-competitive genetic evo-
lution of radial basis function centers and widths for time series prediction.
IEEE Trans. on Neural Networks, 7:869-880.
Whittaker, J. (1990). Graphical models in applied multivariate statistics. Wiley.
Windham, M. (1982). Cluster validity for the fuzzy c-means clustering algo-
rithm. IEEE Trans. on Pattern Analysis and Machine Intelligence, 4:357-
363.
Winkler, R. (1989). Combining forecasts: a philosophical basis and some current
issues. Int. Journal of Forecasting, 5:605-609.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5:241-259.
Wong, Y. (1993). Clustering data by melting. Neural Computation, 5:89-104.
Wynne-Jones, M. (1992). Node splitting: A constructive algorithm for feed-
forward neural networks. Advances in Neural Information Processing Sys-
tems 4, pages 1072-1079.
Xiao-Rong, L. and Bar-Shalom, Y. (1996). Multiple model estimation with
variable structure. IEEE Trans. on Automatic Control, 41:478-493.
Xu, L., Hinton, G., and Jordan, M. I. (1995). An alternative model for mixtures
of experts. In Advances in Neural Information Processing Systems 7, pages
633-640.
Xu, L. and Jordan, M. I. (1993). EM learning on a generalized finite mixture
model for combining multiple classifiers. In World Congress on Neural Net-
works, pages 227-230.
Xu, L., Krzyzak, A., and Oja, E. (1993). Rival penalized competitive learning
for clustering analysis RBF net and curve detection. IEEE Trans. on Neural
Networks, 4:636-649.
Xu, L., Krzyzak, A., and Suen, C. (1992). Methods of combining multiple clas-
sifiers and their applications to handwriting recognition. IEEE Trans. on
Systems, Man and Cybernetics, 22:418-434.
Yeung, D.-Y. (1991). A neural network approach to constructive induction. In
Machine Learning - Proc. of the 8th Int. Workshop, pages 228-232.
Yin, H. and Allinson, N. M. (1997). Bayesian learning for self-organising maps.
Electronics Letters, 33:304-305.
Yuan, W., O.-D. and Stenstrom, M. (1993). Model calibration for the high pu-
rity oxygen activated sludge process-algorithm development and evaluation.
Water Science and Technology, 28:163-171.
Zardecki, A. (1994). Fuzzy control for forecasting and pattern recognition in a
time series. In IEEE Trans. on Fuzzy Systems, pages 1815-1819.
Zeevi, A., Meir, R., and Adler, R. (1996). Time series prediction using mixtures
of experts. In Advances in Neural Information Processing Systems 9, pages 309-315.
Zhang, J. (1991). Dynamics and formulation of self-organizing maps. Neural
Computation, 3:54-66.
Zhao, Y., Schwartz, R., Sroka, J., and Makhoul, J. (1995). Hierarchical mixtures
of experts methodology applied to continuous speech recognition. In Int.
Conf. on Acoustics, Speech and Signal Processing, pages 3443-3446.
Index

activated sludge process 135
adaptive resonance theory (ART) 257
Bayes' rule 12, 35
cascade correlation 256
combination
  of networks 251-252
  of models 84
  with fuzzy logic 261
competitive algorithms 4
constructive algorithms 256
credit update 3, 4, 22
  additive 4, 23, 42, 54, 62, 70
  counting 4, 23, 43, 54, 62, 71
  fuzzy 4, 23, 43, 54, 63, 71
  incremental 4, 23, 46, 47, 54, 65, 73
  multiplicative 4, 22, 41, 54, 60, 68
credit function 3, 15, 19, 84
  fixed source 48
  Markovian rule 51, 53, 54, 76
  slow 40
  Table of fixed source algorithms 49
  Table of Markovian algorithms 54
data allocation
  hybrid 6, 152-156, 174
  parallel
    convergence 176-179
    many sources 180, 181
  serial 6, 152-154, 158, 209-210
    convergence 211-214
    many sources 214
data blocks 40, 103, 161
diversity criterion 87
divide-and-conquer 84, 267-270
ensemble of networks 252
entropy 87
entropy criterion 88
Expectation Maximization algorithm 253-254
experts
  adaptive hierarchical mixtures of 253-254
  mixtures of 253-254
  graphs of 254
forgetting factor 19
fuzzy classification 261
fuzzy clustering 261
fuzzy inference, modes of 45
fuzzy modeling 261
gating network 253
genetic algorithms 86
genetic operators 87
graphical models 254, 265-266
graphical models, variational methods for 266
growing algorithms 256
growing tree 256
ICRA 47
identification
  by classification 82
  of unknown sources 150
  black box 5, 82-83
  parameter estimation 5, 82, 84
interchangeability 20, 250
local models 83, 171
MAP estimation 12, 50
Markov chain
  transient 7, 183, 280
  irreducible 280
  persistent 280
Markov, hidden model 19, 161, 264-265
modular methods 2, 249
modularity 104, 107
multiple models 3, 7, 84, 249
  for control 262-263
  for forecasting 260
parallelism 20, 104
parameter estimation
  algorithm 88
  for large parameter set 86, 92
  for small parameter set 84, 90
partition algorithm 14, 262
phenomenological point of view 14, 46, 48, 52
posterior probability 12, 15
prediction error 3, 16, 40
predictive modular methods 3
predictor 2, 15, 21, 23, 102
predictor order 103
PREMONN architecture 20
PREMONN basic classification algorithm 16-18
pruning algorithms 257
radial basis function networks 255
random walk 7, 178, 213
resource allocating network 256
retraining period 162
scaling 20, 104
search strategies 84
  genetic algorithms 86
selection probabilities 5, 87
self organizing maps (SOM) 257
sensor fusion 264
short term electric load 123
slowly varying systems 84
source switching 18, 49ff., 84
source variable 11
source, composite 215
specialization process 7, 159-160, 174-175, 210
specialized networks 252
statistical pattern recognition 258
switching regimes 263
switching regressions 260
threshold 18, 21, 104
threshold models (in econometrics) 260
time series classification 1, 2, 5, 11
time series identification 1, 2
time series prediction 1, 2
trees 161, 254-255
  classification and decision 259
  decision 259
tree structured models in control 263
visually evoked responses (VER) 107
