Claude Sammut • Geoffrey I. Webb
Editors
Encyclopedia of
Machine Learning
and Data Mining
Second Edition
Editors

Claude Sammut
The University of New South Wales
Sydney, NSW, Australia

Geoffrey I. Webb
Faculty of Information Technology
Monash University
Melbourne, VIC, Australia

Preface
Machine learning and data mining are rapidly developing fields. Following
the success of the first edition of the Encyclopedia of Machine Learning,
we are delighted to bring you this updated and expanded edition. We have
expanded the scope, as reflected in the revised title Encyclopedia of Machine
Learning and Data Mining, to encompass more of the broader activity that
surrounds the machine learning process. This includes new articles in such
diverse areas as anomaly detection, online controlled experiments, and record
linkage as well as substantial expansion of existing entries such as data
preparation. We have also included new entries on key recent developments
in core machine learning, such as deep learning. A thorough review has also
led to updating of much of the existing content.
This substantial tome is the product of an intense effort by many individuals. We thank the Editorial Board and the numerous contributors who have
provided the content. We are grateful to the Springer team of Andrew Spencer,
Michael Hermann, and Melissa Fearon who have shepherded us through the
long process of bringing this second edition to print. We are also grateful to
the production staff who have turned the content into its final form.
We are confident that this revised encyclopedia will consolidate the first
edition’s place as a key reference source for the machine learning and data
mining communities.
(Muggleton and Bryant 2000) and HAIL (Ray et al. 2003). On the other hand, from the point of view of abduction as "inference to the best explanation" (Josephson and Josephson 1994), the link with induction provides a way to distinguish between different explanations and to select those explanations that give a better inductive generalization result.

A recent application of abduction, on its own or in combination with induction, is in Systems Biology, where we try to model biological processes and pathways at different levels. This challenging domain provides an important development test-bed for these methods of knowledge intensive learning (see, e.g., King et al. 2004; Papatheodorou et al. 2005; Ray et al. 2006; Tamaddoni-Nezhad et al. 2004; Zupan et al. 2003).

Structure of the Learning Task

Abduction contributes to the learning task by first explaining, and thus rationalizing, the training data according to a given and current model of the domain to be learned. These abductive explanations either form on their own the result of learning or they feed into a subsequent phase to generate the final result of learning.

Abduction in Artificial Intelligence

Abduction as studied in the area of Artificial Intelligence and from the perspective of learning is mainly defined in a logic-based approach. Other approaches to abduction include set covering (see, e.g., Reggia 1983) or case-based explanation (e.g., Leake 1995). The following explanation uses a logic-based approach.

Given a set of sentences T (a theory or model) and a sentence O (observation), the abductive task is the problem of finding a set of sentences H (an abductive explanation for O) such that:

1. T ∪ H ⊨ O;
2. T ∪ H is consistent,

where ⊨ denotes the deductive entailment relation of the formal logic used in the representation of our theory, and consistency refers also to the corresponding notion in this logic. The particular choice of this underlying formal framework of the logic is in general a matter that depends on the problem or phenomena that we are trying to model. In many cases, this is based on first-order predicate calculus, as, for example, in the approach of theory completion in Muggleton and Bryant (2000). But other logics can be used, e.g., the nonmonotonic logics of default logic or logic programming with negation as failure, when the modeling of our problem requires this level of expressivity.

This basic formalization, as it stands, does not fully capture the explanatory nature of the abductive explanation H in the sense that it necessarily conveys some reason why the observations hold. It would, for example, allow an observation O to be explained by itself or in terms of some other observations rather than in terms of some "deeper" reason for which the observation must hold according to the theory T. Also, as the above specification stands, the observation can be abductively explained by generating in H some new (general) theory completely unrelated to the given theory T. In this case, H does not account for the observations O according to the given theory T, and in this sense it may not be considered as an explanation for O relative to T.

For these reasons, in order to specify a "level" at which the explanations are required, and to understand these relative to the given general theory about the domain of interest, the members of an explanation are normally restricted to belong to a special preassigned, domain-specific class of sentences called abducible.

Hence abduction is typically applied on a model, T, in which we can separate two disjoint sets of predicates: the observable predicates and the abducible (or open) predicates. The basic assumption then is that our model T has reached a sufficient level of comprehension of the domain such that all the incompleteness of the model can be isolated (under some working hypotheses) in its abducible predicates. The observable predicates are assumed to be completely defined (in T) in terms of the abducible predicates and
Abduction 3
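The two conditions on an abductive explanation (T ∪ H ⊨ O, and T ∪ H consistent) can be checked mechanically in a simple propositional setting. The following is a minimal sketch, not taken from any particular abductive-logic-programming system; the clause representation and helper names are illustrative:

```python
# Minimal propositional sketch of the abductive task. T is a set of definite
# clauses (head, body), H a set of abducible facts, O the observations.
# H explains O iff T ∪ H entails every observation and T ∪ H is consistent
# with the integrity constraints (here: pairs of atoms that must not both hold).

def least_model(theory, facts):
    """Forward-chain the definite clauses to the least model of theory ∪ facts."""
    model = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in theory:
            if head not in model and all(b in model for b in body):
                model.add(head)
                changed = True
    return model

def is_explanation(theory, hypothesis, observations, constraints=()):
    model = least_model(theory, hypothesis)
    consistent = all(not ({a, b} <= model) for a, b in constraints)
    return consistent and all(o in model for o in observations)

# T: sad(ale) <- tired(ale), poor(ale).   Abducible predicate: poor.
T = [("sad(ale)", ["tired(ale)", "poor(ale)"])]
facts = {"tired(ale)"}            # background knowledge
H = {"poor(ale)"}                 # candidate abductive explanation
print(is_explanation(T, facts | H, {"sad(ale)"}))   # → True
```

Note that without the abduced fact the observation is not entailed, so the empty hypothesis is correctly rejected.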
from entailment, while the second subproblem as a problem of learning from interpretations.

Abduction and Induction

The utility of abduction in learning can be enhanced significantly when this is integrated with induction. Several approaches for synthesizing abduction and induction in learning have been developed, e.g., Ade and Denecker (1995), Muggleton and Bryant (2000), Yamamoto (1997), and Flach and Kakas (2000). These approaches aim to develop techniques for knowledge intensive learning with complex background theories. One problem to be faced by purely inductive techniques is that the training data on which the inductive process operates often contain gaps and inconsistencies. The general idea is that abductive reasoning can feed information into the inductive process by using the background theory for inserting new hypotheses and removing inconsistent data. Stated differently, abductive inference is used to complete the training data with hypotheses about missing or inconsistent data that explain the example or training data, using the background theory. This process gives alternative possibilities for assimilating and generalizing this data.

Induction is a form of synthetic reasoning that typically generates knowledge in the form of new general rules that can provide, either directly, or indirectly through the current theory T that they extend, new interrelationships between the predicates of our theory that can include, unlike abduction, the observable predicates and even in some cases new predicates. The inductive hypothesis thus introduces new, hitherto unknown, links between the relations that we are studying, thus allowing new predictions on the observable predicates that would not have been possible before from the original theory under any abductive explanation.

An inductive hypothesis, H, extends, like in abduction, the existing theory T to a new theory T′ = T ∪ H, but now H provides new links between observables and nonobservables that were missing or incomplete in the original theory T. This is particularly evident from the fact that induction can be performed even with an empty given theory T, using just the set of observations. The observations specify incomplete (usually extensional) knowledge about the observable predicates, which we try to generalize into new knowledge. In contrast, the generalizing effect of abduction, if at all present, is much more limited. With the given current theory T, that abduction always needs to refer to, we implicitly restrict the generalizing power of abduction, as we require that the basic model of our domain remains that of T. Induction has a stronger and genuinely new generalizing effect on the observable predicates than abduction. While the purpose of abduction is to extend the theory with an explanation and then reason with it, thus enabling the generalizing potential of the given theory T, in induction the purpose is to extend the given theory to a new theory, which can provide new possible observable consequences.

This complementarity of abduction and induction – abduction providing explanations from the theory while induction generalizes to form new parts of the theory – suggests a basis for their integration within the context of theory formation and theory development. A cycle of integration of abduction and induction (Flach and Kakas 2000) emerges that is suitable for the task of incremental modeling (Fig. 1).

Abduction, Fig. 1 The cycle of abductive and inductive knowledge development. The cycle is governed by the "equation" T ∪ H ⊨ O, where T is the current theory, O the observations triggering theory development, and H the new knowledge generated. On the left-hand side we have induction, its output feeding into the theory T for later use by abduction on the right; the abductive output in turn feeds into the observational data O′ for later use by induction, and so on.

Abduction is used to transform (and in some sense normalize) the observations to information on the abducible predicates. Then, induction takes this as input and tries to generalize this information to general
rules for the abducible predicates, now treating these as observable predicates for its own purposes. The cycle can then be repeated by adding the learned information on the abducibles back in the model as new partial information on the incomplete abducible predicates. This will affect the abductive explanations of new observations to be used again in a subsequent phase of induction. Hence, through this cycle of integration, the abductive explanations of the observations are added to the theory, not in the (simple) form in which they have been generated, but in a generalized form given by a process of induction on these.

A simple example, adapted from Ray et al. (2003), that illustrates this cycle of integration of abduction and induction is as follows. Suppose that our current model, T, contains the following rule and background facts:

sad(X) ← tired(X), poor(X),
tired(oli), tired(ale), tired(kr),
academic(oli), academic(ale), academic(kr),
student(oli), lecturer(ale), lecturer(kr),

where the only observable predicate is sad/1. Given the observations O = {sad(ale), sad(kr), not sad(oli)}, can we improve our model? The incompleteness of our model resides in the predicate poor. This is the only abducible predicate in our model. Using abduction we can explain the observations O via the explanation:

E = {poor(ale), poor(kr), not poor(oli)}.

Subsequently, treating this explanation as training data for inductive generalization, we can generalize this to get the rule:

poor(X) ← lecturer(X)

thus (partially) defining the abducible predicate poor when we extend our theory with this rule.

This combination of abduction and induction has recently been studied and deployed in several ways within the context of ILP. In particular, inverse entailment (Muggleton and Bryant 2000) can be seen as a particular case of integration of abductive inference for constructing a "bottom" clause and inductive inference to generalize it. This is realized in Progol 5.0 and applied to several problems, including the discovery of the function of genes in a network of metabolic pathways (King et al. 2004), and more recently to the study of inhibition in metabolic networks (Tamaddoni-Nezhad et al. 2004, 2006). In Moyle (2000), an ILP system called ALECTO integrates a phase of extraction-case abduction, to transform each case of a training example to an abductive hypothesis, with a phase of induction that generalizes these abductive hypotheses. It has been used to learn robot navigation control programs by completing the specific domain knowledge required, within a general theory of planning that the robot uses for its navigation (Moyle 2002).

The development of these initial frameworks that realize the cycle of integration of abduction and induction prompted the study of the problem of completeness for finding any hypothesis H that satisfies the basic task of finding a consistent hypothesis H such that T ∪ H ⊨ O for a given theory T and observations O. Progol was found to be incomplete (Yamamoto 1997), and several new frameworks of integration of abduction and induction have been proposed, such as SOLDR (Ito and Yamamoto 1998), CF-induction (Inoue 2001), and HAIL (Ray et al. 2003). In particular, HAIL has demonstrated that one of the main reasons for the incompleteness of Progol is that in its cycle of integration of abduction and induction, it uses a very restricted form of abduction. Lifting some of these restrictions, through the employment of methods from abductive logic programming (Kakas et al. 1992), has allowed HAIL to solve a wider class of problems. HAIL has been extended to a framework, called XHAIL (Ray 2009), for learning nonmonotonic ILP, allowing it to be applied to learn Event Calculus theories for action description (Alrajeh et al. 2009) and complex scientific theories for systems biology (Ray and Bryant 2008).

Applications of this integration of abduction and induction and the cycle of knowledge development can be found in the recent proceedings of the Abduction and Induction in Artificial Intelligence workshops in 2007 (Flach and Kakas 2009) and 2009 (Ray et al. 2009).
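The sad/tired/poor example can be traced end to end in a few lines. The abduction and induction steps below are deliberately naive stand-ins for the ILP machinery the text describes (a dictionary lookup and a single-literal rule search); the predicate and constant names come from the example itself:

```python
# One pass of the abduction–induction cycle on the worked example above.
background = {
    "tired": {"oli", "ale", "kr"},
    "academic": {"oli", "ale", "kr"},
    "student": {"oli"},
    "lecturer": {"ale", "kr"},
}

# Rule: sad(X) <- tired(X), poor(X).  Observable: sad/1, abducible: poor/1.
observations = {"ale": True, "kr": True, "oli": False}   # sad(ale), sad(kr), not sad(oli)

# Abduction: since sad(X) holds iff tired(X) and poor(X), and every
# individual here is tired, each sad/1 observation transfers directly
# to a poor/1 literal.
explanation = {x: sad for x, sad in observations.items() if x in background["tired"]}
print(explanation)   # {'ale': True, 'kr': True, 'oli': False}

# Induction: find a background predicate whose extension exactly separates
# the positive from the negative poor/1 facts.
def induce(explanation, background):
    pos = {x for x, v in explanation.items() if v}
    neg = {x for x, v in explanation.items() if not v}
    for pred, ext in background.items():
        if pos <= ext and not (neg & ext):
            return f"poor(X) :- {pred}(X)."
    return None

print(induce(explanation, background))   # poor(X) :- lecturer(X).
```

The learned rule could now be added back to the theory, closing one turn of the cycle of Fig. 1.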
1989; Mitchell 1982) explicitly relied on version space partitioning. These approaches tried to select examples on which there was maximal disagreement between hypotheses in the current version space. When such examples were labeled, they would invalidate as large a portion of the version space as possible. A limitation of explicit version space approaches is that, in underconstrained domains, a learner may waste its effort differentiating portions of the version space that have little effect on the classifier's predictions, and thus on its error.

2. Query by Committee (Seung et al. 1992). In query by committee, the experimenter trains an ensemble of models, either by selecting randomized starting points (e.g., in the case of a neural network) or by bootstrapping the training set. Candidate examples are scored based on disagreement among the ensemble models – examples with high disagreement indicate areas in the sample space that are underdetermined by the training data, and therefore potentially valuable to label. Models in the ensemble may be looked at as samples from the version space; picking examples where these models disagree is a way of splitting the version space.

3. Uncertainty sampling (Lewis and Gale 1994). Uncertainty sampling is a heuristic form of statistical active learning. Rather than sampling different points in the version space by training multiple learners, the learner itself maintains an explicit model of uncertainty over its input space. It then selects for labeling those examples on which it is least confident. In classification and regression problems, uncertainty contributes directly to expected loss (as the variance component of the "error = bias + variance" decomposition), so that gathering examples where the learner has greatest uncertainty is often an effective loss-minimization heuristic. This approach has also been found effective for non-probabilistic models, by simply selecting examples that lie near the current decision boundary. For some learners, such as support vector machines, this heuristic can be shown to be an approximate partitioning of the learner's version space (Tong and Koller 2001).

4. Loss minimization (Cohn 1996). Uncertainty sampling can stumble when parts of the learner's domain are inherently noisy. It may be that, regardless of the number of samples labeled in some neighborhood, it will remain impossible to accurately predict these. In these cases, it would be desirable to not only model the learner's uncertainty over arbitrary parts of its domain, but also to model what effect labeling any future example is expected to have on that uncertainty. For some learning algorithms it is feasible to explicitly compute such estimates (e.g., for locally weighted regression and mixture models, these estimates may be computed in closed form). It is, therefore, practical to select examples that directly minimize the expected loss to the learner, as discussed below under "Statistical Active Learning."

Statistical Active Learning

Uncertainty sampling and direct loss minimization are two examples of statistical active learning. Both rely on the learner's ability to statistically model its own uncertainty. When learning with a statistical model, such as a linear regressor or a mixture of Gaussians (Dasgupta 1999), the objective is usually to find model parameters that minimize some form of expected loss. When active learning is applied to such models, it is natural to also select training data so as to minimize that same objective. As statistical models usually give us estimates on the probability of (as yet) unknown values, it is often straightforward to turn this machinery upon itself to assist in the active learning process (Cohn 1996). The process is usually as follows:

1. Begin by requesting labels for a small random subsample of the examples x1, x2, …, xn and fit our model to the labeled data.
2. For any x in our domain, a statistical model lets us estimate both the conditional expectation of y and the variance of that estimate.
12 Active Learning
With a locally weighted (LOESS) model, writing kᵢ for the kernel weight of training point xᵢ at the query point x (cf. the definition of k̃ below), these quantities are built from the weighted statistics

$$\bar{x} = \frac{\sum_i k_i x_i}{n}, \qquad \sigma_x^2 = \frac{\sum_i k_i (x_i - \bar{x})^2}{n},$$

$$\bar{y} = \frac{\sum_i k_i y_i}{n}, \qquad \sigma_y^2 = \frac{\sum_i k_i (y_i - \bar{y})^2}{n},$$

$$\sigma_{xy} = \frac{\sum_i k_i (x_i - \bar{x})(y_i - \bar{y})}{n}, \qquad \sigma_{y|x}^2 = \sigma_y^2 - \frac{\sigma_{xy}^2}{\sigma_x^2}.$$

We can combine these to express the conditional expectation of y (our estimate) and its variance as:

$$\hat{y} = \bar{y} + \frac{\sigma_{xy}}{\sigma_x^2}\,(x - \bar{x}), \qquad \sigma_{\hat{y}}^2 = \frac{\sigma_{y|x}^2}{n^2}\left(\sum_i k_i^2 + \frac{(x - \bar{x})^2}{\sigma_x^2}\,\frac{\sum_i k_i^2 (x_i - \bar{x})^2}{\sigma_x^2}\right).$$

The Need for Reference Distributions

Step (2) above illustrates a complication that is unique to active learning approaches. Traditional "passive" learning usually relies on the assumption that the distribution over which the learner will be tested is the same as the one from which the training data were drawn. When the learner is allowed to select its own training data, it still needs some form of access to the distribution of data on which it will be tested. A pool-based or stream-based learner can use the pool or stream as a proxy for that distribution, but if the learner is allowed (or required) to construct its own examples, it risks wasting all its effort on resolving portions of the solution space that are of no interest to the problem at hand.
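The committee- and uncertainty-based strategies listed earlier can be sketched on one-dimensional data. Everything below is a toy choice made for illustration (a midpoint threshold learner, bootstrap committees, distance to the boundary as the confidence proxy), not the formulation of any particular published system:

```python
# Toy sketches of query-by-committee and uncertainty sampling in 1-D.
import random

def train_threshold(labeled):
    """Place the decision boundary halfway between the innermost labeled points."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (min(pos) + max(neg)) / 2 if pos and neg else 0.0

def uncertainty_pick(pool, boundary):
    # least-confident example = the one closest to the current boundary
    return min(pool, key=lambda x: abs(x - boundary))

def qbc_pick(pool, labeled, committee_size=15, seed=0):
    rng = random.Random(seed)
    # train each committee member on a bootstrap resample of the labeled set
    committee = [train_threshold([rng.choice(labeled) for _ in labeled])
                 for _ in range(committee_size)]
    def disagreement(x):
        votes = sum(1 for w in committee if x >= w)
        return min(votes, committee_size - votes)   # largest at an even split
    return max(pool, key=disagreement)

labeled = [(0.1, 0), (0.3, 0), (0.7, 1), (0.9, 1)]
pool = [0.2, 0.45, 0.55, 0.8]
print(uncertainty_pick(pool, train_threshold(labeled)))   # → 0.45
print(qbc_pick(pool, labeled))   # some point near the uncertain region
```

Both criteria concentrate queries near the region the labeled data leave underdetermined, which is exactly the behavior described for strategies (2) and (3) above.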
Our proxy for model error is the variance of our prediction, ⟨σ²_ŷ⟩, integrated over the test distribution. As we have assumed a pool-based setting in which we have a large number of unlabeled examples from that distribution, we can simply compute the above variance over a sample from the pool, and use the resulting average as our estimate.

To perform statistical active learning, we want to compute how our estimated variance will change if we add an (as yet unknown) label ỹ for an arbitrary x̃. We will write this new expected variance as ⟨σ̃²_ŷ⟩. While we do not know what value ỹ will take, our model gives us an estimated mean ŷ(x̃) and variance σ²_ŷ(x̃) for the value, as above. We can add this "distributed" y value to LOESS just as though it were a discrete one, and compute the resulting expectation ⟨σ̃²_ŷ⟩ in closed form. Defining k̃ as K(x̃, x), we write:

$$\left\langle \tilde{\sigma}_{\hat{y}}^2 \right\rangle = \frac{\left\langle \tilde{\sigma}_{y|x}^2 \right\rangle}{(n + \tilde{k})^2}\left(\sum_i k_i^2 + \tilde{k}^2 + \frac{(x - \tilde{\bar{x}})^2}{\tilde{\sigma}_x^2}\left(\frac{\sum_i k_i^2 (x_i - \tilde{\bar{x}})^2}{\tilde{\sigma}_x^2} + \frac{\tilde{k}^2 (\tilde{x} - \tilde{\bar{x}})^2}{\tilde{\sigma}_x^2}\right)\right),$$

where the tilde denotes the corresponding statistics recomputed with the new point (x̃, ỹ) included.

Greedy Versus Batch Active Learning

It is also worth pointing out that virtually all active learning work relies on greedy strategies – the learner estimates what single example best achieves its objective, requests that one, retrains, and repeats. In theory, it is possible to plan some number of queries ahead, asking what point is best to label now, given that N − 1 more labeling opportunities remain. While such strategies have been explored in Operations Research for very small problem domains, their computational requirements make this approach unfeasible for problems of the size typically encountered in machine learning.

There are cases where retraining the learner after every new label would be prohibitively expensive, or where access to labels is limited by the number of iterations as well as by the total number of labels (e.g., for a finite number of clinical trials). In this case, the learner may select a set of examples to be labeled on each iteration. This batch approach, however, is only useful if the learner is able to identify a set of examples whose expected contributions are non-redundant, which substantially complicates the process.
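The LOESS statistics above, and the greedy selection loop they support, can be sketched directly. One simplifying assumption is flagged in the code: rather than evaluating the closed-form expectation over the "distributed" ỹ, the sketch refits with the model's own mean prediction standing in for the unknown label; the Gaussian kernel, its width, and the data are also illustrative:

```python
# Sketch of LOESS-based statistical active learning.
import math

def loess(xs, ys, x, width=0.3):
    """Return (y_hat, var_y_hat) at the query point x, per the formulas above."""
    k = [math.exp(-((xi - x) / width) ** 2) for xi in xs]   # kernel weights
    n = sum(k)                                              # total weight
    xbar = sum(ki * xi for ki, xi in zip(k, xs)) / n
    ybar = sum(ki * yi for ki, yi in zip(k, ys)) / n
    sx2 = sum(ki * (xi - xbar) ** 2 for ki, xi in zip(k, xs)) / n
    sy2 = sum(ki * (yi - ybar) ** 2 for ki, yi in zip(k, ys)) / n
    sxy = sum(ki * (xi - xbar) * (yi - ybar)
              for ki, xi, yi in zip(k, xs, ys)) / n
    s2_cond = sy2 - sxy ** 2 / sx2            # sigma^2_{y|x}
    y_hat = ybar + (sxy / sx2) * (x - xbar)   # conditional expectation
    var_y_hat = (s2_cond / n ** 2) * (
        sum(ki ** 2 for ki in k)
        + ((x - xbar) ** 2 / sx2)
        * sum(ki ** 2 * (xi - xbar) ** 2 for ki, xi in zip(k, xs)) / sx2)
    return y_hat, var_y_hat

def pick_query(xs, ys, candidates, pool):
    """Greedily choose the candidate that minimizes average variance over the pool."""
    def expected_avg_var(xq):
        yq, _ = loess(xs, ys, xq)             # stand-in for the unknown label
        return sum(loess(xs + [xq], ys + [yq], xr)[1] for xr in pool) / len(pool)
    return min(candidates, key=expected_avg_var)

xs = [0.1, 0.2, 0.4, 0.5, 0.6, 0.9]
ys = [0.0, 0.1, 0.35, 0.4, 0.55, 1.0]         # noisy, roughly y ≈ x
pool = [i / 20 for i in range(21)]            # proxy for the test distribution
print(pick_query(xs, ys, [0.05, 0.5, 0.75], pool))
```

Averaging the variance over the pool is exactly the "reference distribution" role discussed above: without it, the learner would have no notion of which regions of the input space matter.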
Dasgupta S (1999) Learning mixtures of Gaussians. In: Proceedings of the 40th annual symposium on foundations of computer science, pp 634–644
Fedorov V (1972) Theory of optimal experiments. Academic Press, New York
Kearns M, Li M, Pitt L, Valiant L (1987) On the learnability of Boolean formulae. In: Proceedings of the 19th annual ACM conference on theory of computing. ACM Press, New York, pp 285–295
Lewis DD, Gale WA (1994) A sequential algorithm for training text classifiers. In: Proceedings of the 17th annual international ACM SIGIR conference, Dublin, pp 3–12
McCallum A, Nigam K (1998) Employing EM and pool-based active learning for text classification. In: Machine learning: proceedings of the fifteenth international conference (ICML'98), Madison, pp 359–367
North DW (1968) A tutorial introduction to decision theory. IEEE Trans Syst Sci Cybern 4(3)
Pitt L, Valiant LG (1988) Computational limitations on learning from examples. J ACM 35(4):965–984
Robbins H (1952) Some aspects of the sequential design of experiments. Bull Am Math Soc 55:527–535
Ruff R, Dietterich T (1989) What good are experiments? In: Proceedings of the sixth international workshop on machine learning, Ithaca
Seung HS, Opper M, Sompolinsky H (1992) Query by committee. In: Proceedings of the fifth workshop on computational learning theory. Morgan Kaufmann, San Mateo, pp 287–294
Steck H, Jaakkola T (2002) Unsupervised active learning in large domains. In: Proceedings of the conference on uncertainty in AI. http://citeseer.ist.psu.edu/steck02unsupervised.html

Active Learning Theory

Sanjoy Dasgupta
University of California, San Diego, La Jolla, CA, USA

Definition

The term active learning applies to a wide range of situations in which a learner is able to exert some control over its source of data. For instance, when fitting a regression function, the learner may itself supply a set of data points at which to measure response values, in the hope of reducing the variance of its estimate. Such problems have been studied for many decades under the rubric of experimental design (Chernoff 1972; Fedorov 1972). More recently, there has been substantial interest within the machine learning community in the specific task of actively learning binary classifiers. This task presents several fundamental statistical and algorithmic challenges, and an understanding of its mathematical underpinnings is only gradually emerging. This brief survey will describe some of the progress that has been made so far.

Learning from Labeled and Unlabeled Data

In the machine learning literature, the task of learning a classifier has traditionally been studied in the framework of supervised learning. This paradigm assumes that there is a training set consisting of data points x (from some set X) and their labels y (from some set Y), and the goal is to learn a function f : X → Y that will accurately predict the labels of data points arising in the future. Over the past 50 years, tremendous progress has been made in resolving many of the basic questions surrounding this model, such as "how many training points are needed to learn an accurate classifier?"

Although this framework is now fairly well understood, it is a poor fit for many modern learning tasks because of its assumption that all training points automatically come labeled. In practice, it is frequently the case that the raw, abundant, easily obtained form of data is unlabeled, whereas labels must be explicitly procured and are expensive. In such situations, the reality is that the learner starts with a large pool of unlabeled points and must then strategically decide which ones it wants labeled: how best to spend its limited budget.

Example: Speech recognition. When building a speech recognizer, the unlabeled training data consists of raw speech samples, which are very easy to collect: just walk around with a microphone. For all practical purposes, an unlimited quantity of such samples can be obtained. On the
other hand, the "label" for each speech sample is a segmentation into its constituent phonemes, and producing even one such label requires substantial human time and attention. Over the past decades, research labs and the government have expended an enormous amount of money, time, and effort on creating labeled datasets of English speech. This investment has paid off, but our ambitions are inevitably moving past what these datasets can provide: we would now like, for instance, to create recognizers for other languages, or for English in specific contexts. Is there some way to avoid more painstaking years of data labeling, to somehow leverage the easy availability of raw speech so as to significantly reduce the number of labels needed? This is the hope of active learning.

Some early results on active learning were in the membership query model, where the data is assumed to be separable (that is, some hypothesis h∗ perfectly classifies all points) and the learner is allowed to query the label of any point in the input space X (rather than being constrained to a prespecified unlabeled set), with the goal of eventually returning the perfect hypothesis h∗. There is a significant body of beautiful theoretical work in this model (Angluin 2001), but early experiments ran into some telling difficulties. One study (Baum and Lang 1992) found that when training a neural network for handwritten digit recognition, the queries synthesized by the learner were such bizarre and unnatural images that they were impossible for a human to classify. In such contexts, the membership query model is of limited practical value; nonetheless, many of the insights obtained from this model carry over to other settings (Hanneke 2007a).

We will fix as our standard model one in which the learner is given a source of unlabeled data, rather than being able to generate these points himself. Each point has an associated label, but the label is initially hidden, and there is a cost for revealing it. The hope is that an accurate classifier can be found by querying just a few labels, much fewer than would be required by regular supervised learning.

How can the learner decide which labels to probe? One option is to select the query points at random, but it is not hard to show that this yields the same label complexity as supervised learning. A better idea is to choose the query points adaptively: for instance, start by querying some random data points to get a rough sense of where the decision boundary lies, and then gradually refine the estimate of the boundary by specifically querying points in its immediate vicinity. In other words, ask for the labels of data points whose particular positioning makes them especially informative. Such strategies certainly sound good, but can they be fleshed out into practical algorithms? And if so, do these algorithms work well in the sense of producing good classifiers with fewer labels than would be required by supervised learning?

On account of the enormous practical importance of active learning, there are a wide range of algorithms and techniques already available, most of which resemble the aggressive, adaptive sampling strategy just outlined, and many of which show promise in experimental studies. However, a big problem with this kind of sampling is that very quickly the set of labeled points no longer reflects the underlying data distribution. This makes it hard to show that the classifiers learned have good statistical properties (for instance, that they converge to an optimal classifier in the limit of infinitely many labels). This survey will only discuss methods that have proofs of statistical well-foundedness, and whose label complexity can be explicitly analyzed.

Motivating Examples

We will start by looking at a few examples that illustrate the enormous potential of active learning and that also make it clear why analyses of this new model require concepts and intuitions that are fundamentally different from those that have already been developed for supervised learning.

Example: Thresholds on the Line
Suppose the data lie on the real line, and the available classifiers are simple thresholding functions, H = {h_w : w ∈ ℝ}:
$$h_w(x) = \begin{cases} +1 & \text{if } x \ge w \\ -1 & \text{if } x < w \end{cases}$$

mass of h, negative mass of h}. Thus even within this simple hypothesis class, the label complexity can run anywhere from O(log 1/ε) to Ω(1/ε), depending on the specific target hypothesis!

Example: An Overabundance of Unlabeled Data
In our two previous examples, the amount of unlabeled data needed was O(log 1/ε), exactly the usual sample complexity of supervised learning. But it is sometimes helpful to have significantly more unlabeled data than this. In Dasgupta (2005), a distribution P is described for which, if the amount of unlabeled data is small (below any prespecified threshold), then the number of labels needed to learn the target linear separator is Ω(1/ε); whereas if the amount of unlabeled data is much larger, then only O(log 1/ε) labels are needed. This is a situation where most of the data distribution is fairly uninformative while a minuscule fraction is highly informative. A lot of unlabeled data is needed in order to get even a few of the informative points.

The Sample Complexity of Active Learning

We will think of the unlabeled points x1, …, xn as being drawn i.i.d. from an underlying distribution PX on X (namely, the marginal of the distribution P on X × Y), either all at once (a pool) or one at a time (a stream). The learner is only allowed to query the labels of points in the pool/stream; that is, it is restricted to

Active Learning Theory, Fig. 2 Models of pool- and stream-based active learning. The data are draws from an underlying distribution PX, and hypotheses h are evaluated by errP(h). If we want to get this error below ε, how many labels are needed, as a function of ε?

H whose quality is measured by its error rate, errP(h). In regular supervised learning, it is well known that if the VC dimension of H is d, then the number of labels that will with high probability ensure errP(h) ≤ ε is roughly O(d/ε) if the data is separable and O(d/ε²) otherwise (Haussler 1992); various logarithmic terms are omitted here. For active learning, it is clear from the examples above that the VC dimension alone does not adequately characterize label complexity. Is there a different combinatorial parameter that does?

Generic Results for Separable Data

For separable data, it is possible to give upper and lower bounds on label complexity in terms of a special parameter known as the splitting index (Dasgupta et al. 2005). This is merely an existence result: the algorithm needed to realize the upper bound is intractable because it involves explicitly maintaining an ε-cover (a coarse approximation) of the hypothesis class, and the size of this cover is in general exponential in the VC dimension. Nevertheless, it does give us an idea of the kinds of label complexity we can hope to achieve.

Example Suppose the hypothesis class consists of intervals on the real line: X = ℝ and H = {h_{a,b} : a, b ∈ ℝ}, where h_{a,b}(x) = 1(a ≤ x ≤ b). Using the splitting index, the label complexity of active learning is seen to be Θ̃(min{1/PX([a, b]), 1/ε} + log 1/ε) when the target hypothesis is h_{a,b} (Dasgupta 2005). Here
“naturally occurring” data points rather than syn- the Q notation is used to suppress logarithmic
thetic ones (Fig. 2). It returns a hypothesis h 2 terms.
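The exponential gap promised for the earlier thresholds-on-the-line class is easy to see concretely: with separable data, a pool-based learner can binary-search for the boundary. The sketch below is illustrative only (it is not from this entry; the function names and setup are invented); it recovers a threshold to pool precision with about log2(n) label queries instead of n.

```python
import random

def active_learn_threshold(pool, oracle):
    """Binary-search a sorted unlabeled pool for the decision boundary.

    oracle(x) returns +1 if x >= the true threshold, else -1 (separable data).
    Returns a consistent threshold and the number of label queries made.
    """
    xs = sorted(pool)
    lo, hi = 0, len(xs)          # invariant: first positive point lies in xs[lo:hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) == +1:
            hi = mid             # boundary is at or below xs[mid]
        else:
            lo = mid + 1         # boundary is above xs[mid]
    w = xs[lo] if lo < len(xs) else float("inf")
    return w, queries

random.seed(0)
true_w = 0.37
pool = [random.random() for _ in range(1024)]
w, q = active_learn_threshold(pool, lambda x: +1 if x >= true_w else -1)
print(q)   # ~10 label queries (about log2 of the pool size), versus 1024 for full labeling
```

The returned threshold classifies every pool point exactly as the oracle would, yet the number of labels revealed is logarithmic in the pool size, which is the O(log 1/ε) behavior discussed above.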
Example Suppose X = R^d and H consists of linear separators through the origin. If P_X is the uniform distribution on the unit sphere, the number of labels needed to learn a hypothesis of error ε is just Θ̃(d log 1/ε), exponentially smaller than the Õ(d/ε) label complexity of supervised learning. If P_X is not the uniform distribution but is everywhere within a multiplicative factor λ > 1 of it, then the label complexity becomes Õ((d log 1/ε) log² λ), provided the amount of unlabeled data is increased by a factor of λ² (Dasgupta 2005).

These results are very encouraging, but the question of an efficient active learning algorithm remains open. We now consider two approaches.

Mildly Selective Sampling
The label complexity results mentioned above are based on querying maximally informative points. A less aggressive strategy is to be mildly selective, to query all points except those that are quite clearly uninformative. This is the idea behind one of the earliest generic active learning schemes (Cohn et al. 1994). Data points x_1, x_2, ... arrive in a stream, and for each one the learner makes a spot decision about whether or not to request a label. When x_t arrives, the learner behaves as follows.

• Determine whether both possible labelings, (x_t, +) and (x_t, −), are consistent with the labeled examples seen so far.
• If so, ask for the label y_t. Otherwise set y_t to be the unique consistent label.

Fortunately, the check required for the first step can be performed efficiently by making two calls to a supervised learner. Thus this is a very simple and elegant active learning scheme, although as one might expect, it is suboptimal in its label complexity (Balcan et al. 2007). Interestingly, there is a parameter called the disagreement coefficient that characterizes the label complexity of this scheme and also of some other mildly selective learners (Friedman 2009; Hanneke 2007b).

In practice, the biggest limitation of the algorithm above is that it assumes the data are separable. Recent results have shown how to remove this assumption (Balcan et al. 2006; Dasgupta et al. 2007) and to accommodate classification loss functions other than 0–1 loss (Beygelzimer et al. 2009). Variants of the disagreement coefficient continue to characterize label complexity in the agnostic setting (Beygelzimer et al. 2009; Dasgupta et al. 2007).

A Bayesian Model
The query by committee algorithm (Seung et al. 1992) is based on a Bayesian view of active learning. The learner starts with a prior distribution on the hypothesis space, and is then exposed to a stream of unlabeled data. Upon receiving x_t, the learner performs the following steps.

• Draw two hypotheses h, h′ at random from the posterior over H.
• If h(x_t) ≠ h′(x_t), then ask for the label of x_t and update the posterior accordingly.

This algorithm queries points that substantially shrink the posterior, while at the same time taking account of the data distribution. Various theoretical guarantees have been shown for it (Freund et al. 1997); in particular, in the case of linear separators with a uniform data distribution, it achieves a label complexity of O(d log 1/ε), the best possible.

Sampling from the posterior over the hypothesis class is, in general, computationally prohibitive. However, for linear separators with a uniform prior, it can be implemented efficiently using random walks on convex bodies (Gilad-Bachrach et al. 2005).

Other Work
In this survey, I have touched mostly on active learning results of the greatest generality, those that apply to arbitrary hypothesis classes. There is also a significant body of more specialized results.

• Efficient active learning algorithms for specific hypothesis classes.
This includes an online learning algorithm for linear separators that only queries some of the points and yet achieves similar regret bounds to algorithms that query all the points (Cesa-Bianchi et al. 2004). The label complexity of this method is yet to be characterized.

• Algorithms and label bounds for linear separators under the uniform data distribution.
This particular setting has been amenable to mathematical analysis. For separable data, it turns out that a variant of the perceptron algorithm achieves the optimal O(d log 1/ε) label complexity (Dasgupta et al. 2005). A simple algorithm is also available for the agnostic setting (Balcan et al. 2007).

Conclusion
The theoretical frontier of active learning is mostly an unexplored wilderness. Except for a few specific cases, we do not have a clear sense of how much active learning can reduce label complexity: whether by just a constant factor, or polynomially, or exponentially. The fundamental statistical and algorithmic challenges involved, together with the huge practical importance of the field, make active learning a particularly rewarding terrain for investigation.

Cross-References
Active Learning

Recommended Reading
Angluin D (2001) Queries revisited. In: Proceedings of the 12th international conference on algorithmic learning theory, Washington, DC, pp 12–31
Balcan M-F, Beygelzimer A, Langford J (2006) Agnostic active learning. In: International conference on machine learning. ACM Press, New York, pp 65–72
Balcan M-F, Broder A, Zhang T (2007) Margin based active learning. In: Conference on learning theory, San Diego, pp 35–50
Baum EB, Lang K (1992) Query learning can work poorly when a human oracle is used. In: International joint conference on neural networks, Baltimore
Beygelzimer A, Dasgupta S, Langford J (2009) Importance weighted active learning. In: International conference on machine learning. ACM Press, New York, pp 49–56
Cesa-Bianchi N, Gentile C, Zaniboni L (2004) Worst-case analysis of selective sampling for linear-threshold algorithms. In: Advances in neural information processing systems
Chernoff H (1972) Sequential analysis and optimal design. CBMS-NSF regional conference series in applied mathematics, vol 8. SIAM, Philadelphia
Cohn D, Atlas L, Ladner R (1994) Improving generalization with active learning. Mach Learn 15(2):201–221
Dasgupta S (2005) Coarse sample complexity bounds for active learning. Advances in neural information processing systems. Morgan Kaufmann, San Mateo
Dasgupta S, Kalai A, Monteleoni C (2005) Analysis of perceptron-based active learning. In: 18th annual conference on learning theory, Bertinoro, pp 249–263
Dasgupta S, Hsu DJ, Monteleoni C (2007) A general agnostic active learning algorithm. Advances in neural information processing systems
Fedorov VV (1972) Theory of optimal experiments (trans: Studden WJ, Klimko EM). Academic Press, New York
Freund Y, Seung S, Shamir E, Tishby N (1997) Selective sampling using the query by committee algorithm. Mach Learn J 28:133–168
Friedman E (2009) Active learning for smooth problems. In: Conference on learning theory, Montreal, pp 343–352
Gilad-Bachrach R, Navot A, Tishby N (2005) Query by committee made real. Advances in neural information processing systems
Hanneke S (2007a) Teaching dimension and the complexity of active learning. In: Conference on learning theory, San Diego, pp 66–81
Hanneke S (2007b) A bound on the label complexity of agnostic active learning. In: International conference on machine learning, Corvallis, pp 353–360
Haussler D (1992) Decision-theoretic generalizations of the PAC model for neural net and other learning applications. Inf Comput 100(1):78–150
Seung HS, Opper M, Sompolinsky H (1992) Query by committee. In: Conference on computational learning theory, Victoria, pp 287–294

Adaboost
Adaboost is an ensemble learning technique, and the most well-known of the Boosting family of algorithms. The algorithm trains models sequentially, with a new model trained at each
round. At the end of each round, misclassified examples are identified and have their emphasis increased in a new training set, which is then fed back into the start of the next round, and a new model is trained. The idea is that subsequent models should be able to compensate for errors made by earlier models. See ensemble learning for full details.

Adaptive Control Processes
Bayesian Reinforcement Learning

Adaptive Real-Time Dynamic Programming
Unlike off-line algorithms, ARTDP is an on-line algorithm because it uses agent behavior to guide its computation. ARTDP is adaptive because it does not need a complete and accurate model of the environment but learns a model from data collected during agent-environment interaction. When a good model is available, Real-Time Dynamic Programming (RTDP) is applicable, which is ARTDP without the model-learning component.

Motivation and Background
The rest of the states will never arise when that policy is being followed, so the policy does not need to specify what the agent should do in those states.

ARTDP and RTDP exploit situations where the set of states reachable from the start states is a small subset of the entire state space. They can dramatically reduce the amount of computation needed to determine an optimal policy for the relevant states as compared with the amount of computation that a conventional DP algorithm would require to determine an optimal policy for all the states. These algorithms do this by focusing computation around simulated behavioral experiences (if there is a model available capable of simulating these experiences), or around real behavioral experiences (if no model is available).

RTDP and ARTDP were introduced by Barto et al. (1995). The starting point was the novel observation by Bradtke that Korf's Learning Real-Time A* heuristic search algorithm (Korf 1990) is closely related to DP. RTDP generalizes Learning Real-Time A* to stochastic problems. ARTDP is also closely related to Sutton's Dyna system (Sutton 1990) and Jalali and Ferguson's (1989) Transient DP. Theoretical analysis relies on the theory of asynchronous DP as described by Bertsekas and Tsitsiklis (1989).

ARTDP and RTDP are model-based reinforcement learning algorithms, so called because they take advantage of an environment model, unlike model-free reinforcement learning algorithms such as Q-Learning and Sarsa.

Structure of Learning System

Backup Operations
A basic step of many DP and RL algorithms is a backup operation. This is an operation that updates a current estimate of the cost of an MDP's state. (We use the cost formulation instead of reward to be consistent with the original presentation of the algorithms. In the case of rewards, this would be called the value of a state and we would maximize instead of minimize.) Suppose X is the set of MDP states. For each state x ∈ X, f(x), the cost of state x, gives a measure (which varies with different MDP formulations) of the total cost the agent is expected to incur over the future if it starts in x. If f_k(x) and f_{k+1}(x), respectively, denote the estimate of f(x) before and after a backup, a typical backup operation applied to x looks like this:

f_{k+1}(x) = min_{a ∈ A} [ c_x(a) + Σ_{y ∈ X} p_{xy}(a) f_k(y) ],

where A is the set of possible agent actions, c_x(a) is the immediate cost the agent incurs for performing action a in state x, and p_{xy}(a) is the probability that the environment makes a transition from state x to state y as a result of the agent's action a. This backup operation is associated with the DP algorithm known as value iteration. It is also the backup operation used by RTDP and ARTDP.

Conventional DP algorithms consist of successive "sweeps" of the state set. Each sweep consists of applying a backup operation to each state. Sweeps continue until the algorithm converges to a solution. Asynchronous DP, which underlies RTDP and ARTDP, does not use systematic sweeps. States can be chosen in any way whatsoever, and as long as backups continue to be applied to all states (and some other conditions are satisfied), the algorithm will converge. RTDP is an instance of asynchronous DP in which the states chosen for backups are determined by the agent's behavior.

The backup operation above is model-based because it uses known rewards and transition probabilities, and the values of all the states appear on the right-hand side of the equation. In contrast, a sample backup uses the value of just one sample successor state. RTDP and ARTDP are like RL algorithms in that they rely on real or simulated behavioral experience, but unlike many (but not all) RL algorithms, they use full backups like DP.

Off-Line Versus On-Line
A conventional DP algorithm typically executes off-line. When applied to finding an optimal policy for an MDP, this means that the DP algorithm executes to completion before its result
(an optimal policy) is used to control the agent's behavior. The sweeps of DP sequentially "visit" the states of the MDP, performing a backup operation on each state. But it is important not to confuse these visits with the behaving agent's visits to states: the agent is not yet behaving while the off-line DP computation is being done. Hence, the agent's behavior has no influence on the DP computation. The same is true for off-line asynchronous DP.

RTDP is an on-line, or "real-time," algorithm. It is an asynchronous DP computation that executes concurrently with the agent's behavior so that the agent's behavior can influence the DP computation. Further, the concurrently executing DP computation can influence the agent's behavior. The agent's visits to states direct the "visits" to states made by the concurrent asynchronous DP computation. At the same time, the action performed by the agent is the action specified by the policy corresponding to the latest results of the DP computation: it is the "greedy" action with respect to the current estimate of the cost function.

[Figure: an asynchronous dynamic programming computation and a behaving agent run concurrently; the computation specifies actions to the agent, and the agent's behavior specifies the states to back up.]

In the simplest version of RTDP, when a state is visited by the agent, the DP computation performs the model-based backup operation given above on that same state. In general, for each step of the agent's behavior, RTDP can apply the backup operation to each of an arbitrary set of states, provided that the agent's current state is included. For example, at each step of behavior, a limited-horizon look-ahead search can be conducted from the agent's current state, with the backup operation applied to each of the states generated in the search. Essentially, RTDP is an asynchronous DP computation with the computational effort focused along simulated or actual behavioral trajectories.

Learning a Model
ARTDP is the same as RTDP except that (1) an environment model is updated using any on-line model-learning, or system identification, method, (2) the current environment model is used in performing the RTDP backup operations, and (3) the agent has to perform exploratory actions occasionally instead of always greedy actions as in RTDP. This last step is essential to ensure that the environment model eventually converges to the correct model. If the state and action sets are finite, the simplest way to learn a model is to keep counts of the number of times each transition occurs for each action and convert these frequencies to probabilities, thus forming the maximum-likelihood model.

Summary of Theoretical Results
When RTDP and ARTDP are applied to stochastic optimal path problems, one can prove that under certain conditions they converge to optimal policies without the need to apply backup operations to all the states. Indeed, in some problems, only a small fraction of the states need to be visited. A stochastic optimal path problem is an MDP with a nonempty set of start states and a nonempty set of goal states. Each transition until a goal state is reached has a nonnegative immediate cost, and once the agent reaches a goal state, it stays there and thereafter incurs zero cost. Each episode of agent experience begins with a start state. An optimal policy is one that minimizes the cost of every state, i.e., minimizes f(x) for all states x. Under some relatively mild
conditions, every optimal policy is guaranteed to eventually reach a goal state.

A state x is relevant if a start state s and an optimal policy exist such that x can be reached from s when the agent uses that policy. If we could somehow know which states are relevant, we could restrict DP to just these states and obtain an optimal policy. But this is not possible because knowing which states are relevant requires knowledge of optimal policies, which is what one is seeking. However, under certain conditions, without requiring repeated visits to all the irrelevant states, RTDP produces a policy that is optimal for all the relevant states. The conditions are that (1) the initial cost of every goal state is zero, (2) there exists at least one policy that guarantees that a goal state will be reached with probability one from any start state, (3) all immediate costs for transitions from non-goal states are strictly positive, and (4) none of the initial costs are larger than the actual costs. This result is proved in Barto et al. (1995) by combining aspects of Korf's (1990) proof for LRTA* with results for asynchronous DP.

Special Cases and Extensions
A number of special cases and extensions of RTDP have been developed that improve performance over the basic version. Some examples are as follows. Bonet and Geffner's (2003a) Labeled RTDP labels states that have already been "solved," allowing faster convergence than RTDP. Feng et al. (2003) proposed Symbolic RTDP, which selects a set of states to update at each step using symbolic model-checking techniques. The RTDP convergence theorem still applies because this is a special case of RTDP. Smith and Simmons (2006) developed Focused RTDP, which maintains a priority value for each state to better direct search and produce faster convergence. Hansen and Zilberstein's (2001) LAO* uses some of the same ideas as RTDP to produce a heuristic search algorithm that can find solutions with loops to non-deterministic heuristic search problems. Many other variants are possible. Extending ARTDP instead of RTDP in all of these ways would produce analogous algorithms that could be used when a good model is not available.

Cross-References
Anytime Algorithm
Approximate Dynamic Programming
Reinforcement Learning

Recommended Reading
Barto A, Bradtke S, Singh S (1995) Learning to act using real-time dynamic programming. Artif Intell 72(1–2):81–138
Bertsekas D, Tsitsiklis J (1989) Parallel and distributed computation: numerical methods. Prentice-Hall, Englewood Cliffs
Bonet B, Geffner H (2003a) Labeled RTDP: improving the convergence of real-time dynamic programming. In: Proceedings of the 13th international conference on automated planning and scheduling (ICAPS-2003), Trento
Bonet B, Geffner H (2003b) Faster heuristic search algorithms for planning with uncertainty and full feedback. In: Proceedings of the international joint conference on artificial intelligence (IJCAI-2003), Acapulco
Feng Z, Hansen E, Zilberstein S (2003) Symbolic generalization for on-line planning. In: Proceedings of the 19th conference on uncertainty in artificial intelligence, Acapulco
Hansen E, Zilberstein S (2001) LAO*: a heuristic search algorithm that finds solutions with loops. Artif Intell 129:35–62
Jalali A, Ferguson M (1989) Computationally efficient control algorithms for Markov chains. In: Proceedings of the 28th conference on decision and control, Tampa, pp 1283–1288
Korf R (1990) Real-time heuristic search. Artif Intell 42(2–3):189–211
Smith T, Simmons R (2006) Focused real-time dynamic programming for MDPs: squeezing more out of a heuristic. In: Proceedings of the national conference on artificial intelligence (AAAI). AAAI Press, Boston
Sutton R (1990) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: Proceedings of the 7th international conference on machine learning. Morgan Kaufmann, San Mateo, pp 216–224
Adaptive Resonance Theory, Fig. 1 Distributed ART (dART) (Carpenter 1997). (a) At the field F0, complement coding transforms the feature pattern a to the system input A, which represents both scaled feature values a_i ∈ [0, 1] and their complements (1 − a_i) (i = 1 ... M). (b) F2 is a competitive field that transforms its input pattern into the working memory code y. The F2 nodes that remain active following competition send the pattern σ of learned top-down expectations to the match field F1. The pattern active at F1 becomes x = A ∧ σ, where ∧ denotes the component-wise minimum, or fuzzy intersection. (c) A parameter ρ ∈ [0, 1], called vigilance, sets the matching criterion. The system registers a mismatch if the size of x is less than ρ times the size of A. A top-down/bottom-up mismatch triggers a signal that resets the active F2 code. (d) Medium-term memories in the F0-to-F2 dynamic weights allow the system to activate a new code y. When only one F2 node remains active following competition, the code is maximally compressed, or winner-take-all. When |x| ≥ ρ|A|, the activation pattern y persists until the next reset, even if input A changes or F0-to-F2 signals habituate. During learning, thresholds τ_ij in paths from F0 to F2 increase according to the dInstar law; and thresholds τ_ji in paths from F2 to F1 increase according to the dOutstar law

to be slow, and activation does not persist once inputs are removed. The ART coding field is a competitive network where, typically, one or a few nodes in the normalized F2 pattern y sustain persistent activation, even as their generating inputs shift, habituate, or vanish. The pattern y persists until an active reset signal (Fig. 1c) prepares the coding field to register a new F0-to-F2 input. Early ART networks (Carpenter and Grossberg 1987; Carpenter et al. 1991a, 1992) employed localist, or winner-take-all, coding, whereby strongly competitive feedback
results in only one F2 node staying active until the next reset. With fast as well as slow learning, memory stability in these early networks relied on their winner-take-all architectures.

Achieving stable fast learning with distributed code representations presents a computational challenge to any learning network. In order to meet this challenge, distributed ART (Carpenter 1997) introduced a new network configuration (Fig. 1) in which system fields are identified with cortical layers (Carpenter 2001). New learning laws (dInstar and dOutstar) that realize stable fast learning with distributed coding predict adaptive dynamics between cortical layers.

Distributed ART (dART) systems employ a new unit of long-term memory, which replaces the traditional multiplicative weight (Encyclopedia cross reference) with a dynamic weight (Carpenter 1994). In a path from the F2 coding node j to the F1 matching node i, the dynamic weight equals the amount by which coding node activation y_j exceeds an adaptive threshold τ_ji. The total signal σ_i from F2 to the i-th F1 node is the sum of these dynamic weights, and F1 node activation x_i equals the minimum of the top-down expectation σ_i and the bottom-up input A_i. During dOutstar learning, the top-down pattern σ converges toward the matched pattern x.

When coding node activation y_j is below τ_ji, the dynamic weight is zero and no learning occurs in that path, even if y_j is positive. This property is critical for stable fast learning with distributed codes. Although the dInstar and dOutstar laws are compatible with F2 patterns y that are arbitrarily distributed, in practice, following an initial learning phase, most changes in paths to and from a coding node j occur only when its activation y_j is large. This type of learning is therefore called quasi-localist. In the special case where coding is winner-take-all, the dynamic weight is equivalent to a multiplicative weight that formally equals the complement of the adaptive threshold.

Complement Coding: Learning Both Absent Features and Present Features
ART networks employ a preprocessing step called complement coding (Carpenter et al. 1991b), which models the nervous system's ubiquitous computational design known as opponent processing (Hurvich and Jameson 1957). Balancing an entity against its opponent, as in opponent colors such as red vs. green or agonist-antagonist muscle pairs, allows a system to act upon relative quantities, even as absolute magnitudes fluctuate unpredictably. In ART systems, complement coding is analogous to retinal on-cells and off-cells (Schiller 1982). When the learning system is presented with a set of input features a ≡ (a_1 ... a_i ... a_M), complement coding doubles the number of input components, presenting to the network an input A that concatenates the original feature vector and its complement (Fig. 1a).

Complement coding produces normalized inputs A that allow a model to encode features that are consistently absent on an equal basis with features that are consistently present. Features that are sometimes absent and sometimes present when a given F2 node is highly active are regarded as uninformative with respect to that node, and the corresponding present and absent top-down feature expectations shrink to zero. When a new input activates this node, these features are suppressed at the match field F1 (Fig. 1b). If the active code then produces an error signal, attentional biasing can enhance the salience of input features that it had previously ignored, as described below.

Matching, Attention, and Search
A neural computation central to both scientific and technological analyses is the ART matching rule (Carpenter and Grossberg 1987), which controls how attention is focused on critical feature patterns via dynamic matching of a bottom-up sensory input with a top-down learned expectation. Bottom-up/top-down pattern matching and attentional focusing are, perhaps, the primary features common to all ART models across their many variations. Active input features that are not confirmed by top-down expectations are inhibited (Fig. 1b). The remaining activation pattern defines a focus of attention, which, in turn, determines what feature patterns are learned. Basing memories on attended features rather than whole patterns supports the design goal of encoding stable