Claude Sammut • Geoffrey I. Webb
Editors

Encyclopedia of
Machine Learning
and Data Mining
Second Edition

With 263 Figures and 34 Tables

Editors
Claude Sammut
The University of New South Wales
Sydney, NSW, Australia

Geoffrey I. Webb
Faculty of Information Technology
Monash University
Melbourne, VIC, Australia

ISBN 978-1-4899-7685-7
ISBN 978-1-4899-7686-4 (print and electronic bundle)
ISBN 978-1-4899-7687-1 (eBook)
DOI 10.1007/978-1-4899-7687-1

Library of Congress Control Number: 2016958560

© Springer Science+Business Media New York 2011, 2017


Preface

Machine learning and data mining are rapidly developing fields. Following
the success of the first edition of the Encyclopedia of Machine Learning,
we are delighted to bring you this updated and expanded edition. We have
expanded the scope, as reflected in the revised title Encyclopedia of Machine
Learning and Data Mining, to encompass more of the broader activity that
surrounds the machine learning process. This includes new articles in such
diverse areas as anomaly detection, online controlled experiments, and record
linkage as well as substantial expansion of existing entries such as data
preparation. We have also included new entries on key recent developments
in core machine learning, such as deep learning. A thorough review has also
led to updating of much of the existing content.
This substantial tome is the product of an intense effort by many individuals. We thank the Editorial Board and the numerous contributors who have
provided the content. We are grateful to the Springer team of Andrew Spencer,
Michael Hermann, and Melissa Fearon who have shepherded us through the
long process of bringing this second edition to print. We are also grateful to
the production staff who have turned the content into its final form.
We are confident that this revised encyclopedia will consolidate the first
edition’s place as a key reference source for the machine learning and data
mining communities.
A

A/B Testing

► Online Controlled Experiments and A/B Testing

Abduction

Antonis C. Kakas
University of Cyprus, Nicosia, Cyprus

Definition

Abduction is a form of reasoning, sometimes described as "deduction in reverse," whereby given a rule that "A follows from B" and the observed result of "A" we infer the condition "B" of the rule. More generally, given a theory, T, modeling a domain of interest and an observation, "A," we infer a hypothesis "B" such that the observation follows deductively from T augmented with "B." We think of "B" as a possible explanation for the observation according to the given theory that contains our rule. This new information and its consequences (or ramifications) according to the given theory can be considered as the result of a (or part of a) learning process based on the given theory and driven by the observations that are explained by abduction. Abduction can be combined with ► induction in different ways to enhance this learning process.

Motivation and Background

Abduction is, along with induction, a synthetic form of reasoning whereby it generates, in its explanations, new information not hitherto contained in the current theory with which the reasoning is performed. As such, it has a natural relation to learning, and in particular to knowledge-intensive learning, where the new information generated aims to complete, at least partially, the current knowledge (or model) of the problem domain as described in the given theory.

Early uses of abduction in the context of machine learning concentrated on how abduction can be used as a theory revision operator for identifying where the current theory could be revised in order to accommodate the new learning data. This includes the work of Michalski (1993), Ourston and Mooney (1994), and Ade et al. (1994). Another early link of abduction to learning was given by the ► explanation-based learning method (DeJong and Mooney 1986), where the abductive explanations of the learning data (training examples) are generalized to all cases. An extensive survey of the role of abduction in machine learning during this early period can be found in Bergadano et al. (2000).

Following this, it was realized (Flach and Kakas 2000) that the role of abduction in learning could be strengthened by linking it to induction, culminating in a hybrid integrated approach to learning where abduction and induction are tightly integrated to provide powerful learning frameworks such as those of Progol 5.0 (Muggleton and Bryant 2000) and HAIL (Ray et al. 2003). On the other hand, from the point of view of abduction as "inference to the best explanation" (Josephson and Josephson 1994), the link with induction provides a way to distinguish between different explanations and to select those explanations that give a better inductive generalization result.

A recent application of abduction, on its own or in combination with induction, is in systems biology, where we try to model biological processes and pathways at different levels. This challenging domain provides an important development test-bed for these methods of knowledge-intensive learning (see, e.g., King et al. 2004; Papatheodorou et al. 2005; Ray et al. 2006; Tamaddoni-Nezhad et al. 2004; Zupan et al. 2003).

Structure of the Learning Task

Abduction contributes to the learning task by first explaining, and thus rationalizing, the training data according to a given and current model of the domain to be learned. These abductive explanations either form on their own the result of learning or they feed into a subsequent phase to generate the final result of learning.

Abduction in Artificial Intelligence

Abduction, as studied in the area of Artificial Intelligence and from the perspective of learning, is mainly defined in a logic-based approach. Other approaches to abduction include set covering (see, e.g., Reggia 1983) or case-based explanation (e.g., Leake 1995). The following exposition uses a logic-based approach.

Given a set of sentences T (a theory or model) and a sentence O (an observation), the abductive task is the problem of finding a set of sentences H (an abductive explanation for O) such that:

1. T ∪ H ⊨ O,
2. T ∪ H is consistent,

where ⊨ denotes the deductive entailment relation of the formal logic used in the representation of our theory, and consistency refers to the corresponding notion in this logic. The particular choice of this underlying formal framework of logic is in general a matter that depends on the problem or phenomena that we are trying to model. In many cases, this is based on ► first-order predicate calculus, as, for example, in the approach of theory completion in Muggleton and Bryant (2000). But other logics can be used, e.g., the nonmonotonic logics of default logic or logic programming with negation as failure, when the modeling of our problem requires this level of expressivity.

This basic formalization, as it stands, does not fully capture the explanatory nature of the abductive explanation H in the sense that it necessarily conveys some reason why the observations hold. It would, for example, allow an observation O to be explained by itself or in terms of some other observations, rather than in terms of some "deeper" reason for which the observation must hold according to the theory T. Also, as the above specification stands, the observation can be abductively explained by generating in H some new (general) theory completely unrelated to the given theory T. In this case, H does not account for the observations O according to the given theory T, and in this sense it may not be considered an explanation for O relative to T.

For these reasons, in order to specify a "level" at which explanations are required, and to understand these relative to the given general theory about the domain of interest, the members of an explanation are normally restricted to belong to a special preassigned, domain-specific class of sentences called abducible.

Hence abduction is typically applied on a model, T, in which we can separate two disjoint sets of predicates: the observable predicates and the abducible (or open) predicates. The basic assumption then is that our model T has reached a sufficient level of comprehension of the domain such that all the incompleteness of the model can be isolated (under some working hypotheses) in its abducible predicates. The observable predicates are assumed to be completely defined (in T) in terms of the abducible predicates and other background auxiliary predicates; any incompleteness in their representation comes from the incompleteness in the abducible predicates.
In practice, the empirical observations that drive the learning task are described using the observable predicates. Observations are represented by formulae that refer only to the observable predicates (and possibly some background auxiliary predicates), typically by ground atomic facts on these observable predicates. The abducible predicates describe underlying (theoretical) relations in our model that are not observable directly but can, through the model T, bring about observable information.

The assumptions on the abducible predicates used for building up the explanations may be subject to restrictions that are expressed through integrity constraints. These represent additional knowledge that we have on our domain, expressing general properties of the domain that remain valid no matter how the theory is to be extended in the process of abduction and associated learning. Therefore, in general, an abductive theory is a triple, denoted by ⟨T, A, IC⟩, where T is the background theory, A is a set of abducible predicates, and IC is a set of integrity constraints. Then, in the definition of an abductive explanation given above, one more requirement is added:

3. T ∪ H satisfies IC.

The satisfaction of integrity constraints can be formally understood in several ways (see Kakas et al. 1992 and references therein). Note that the integrity constraints reduce the number of explanations for a set of observations, filtering out those explanations that do not satisfy them. Based on this notion of abductive explanation, a credulous form of abductive entailment is defined. Given an abductive theory 𝒯 = ⟨T, A, IC⟩ and an observation O, then O is abductively entailed by 𝒯, denoted by 𝒯 ⊨_A O, if there exists an abductive explanation of O in 𝒯.

This notion of abductive entailment can then form the basis of a coverage relation for learning in the face of incomplete information.
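To make conditions 1-3 concrete, here is a minimal Python sketch (ours, not the entry's) of abductive explanation over ground, propositional rules, with integrity constraints given as denials; all names are illustrative.

```python
from itertools import combinations

def derivable(atom, rules, facts):
    """Forward-chain the definite rules from facts; True if atom is entailed
    (this plays the role of T ∪ H ⊨ O for ground programs)."""
    known, changed = set(facts), True
    while changed:
        changed = False
        for head, body in rules:
            if head not in known and all(b in known for b in body):
                known.add(head)
                changed = True
    return atom in known

def violates(assumptions, facts, constraints):
    """A denial constraint (a set of atoms) is violated if all its atoms hold."""
    world = set(facts) | set(assumptions)
    return any(all(a in world for a in ic) for ic in constraints)

def abduce(observation, rules, facts, abducible_atoms, constraints):
    """Return a smallest set H of abducible atoms with T ∪ H ⊨ O and
    T ∪ H consistent with the integrity constraints (conditions 1-3)."""
    for r in range(len(abducible_atoms) + 1):
        for H in combinations(abducible_atoms, r):
            if violates(H, facts, constraints):
                continue
            if derivable(observation, rules, list(facts) + list(H)):
                return set(H)
    return None   # no abductive explanation exists
```

Enumerating candidate sets smallest-first returns a minimal explanation; real abductive systems replace this brute-force search with goal-directed proof procedures.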
Abductive Concept Learning

Abduction allows us to reason in the face of incomplete information. As such, when we have learning problems where the background data on the training examples is incomplete, the use of abduction can enhance the learning capabilities. Abductive concept learning (ACL) (Kakas and Riguzzi 2000) is a learning framework that allows us to learn from incomplete information and to later be able to classify new cases that again could be incompletely specified. Under ACL, we learn abductive theories ⟨T, A, IC⟩, with abduction playing a central role in the covering relation of the learning problem. The abductive theories learned in ACL contain both rules, in T, for the concept(s) to be learned, as well as general clauses acting as integrity constraints in IC.

Practical problems that can be addressed with ACL include: (1) concept learning from incomplete background data, where some of the background predicates are incompletely specified, and (2) concept learning from incomplete background data together with given integrity constraints that provide some information on the incompleteness of the data. The treatment of incompleteness through abduction is integrated within the learning process. This allows the possibility of learning more compact theories that can alleviate the problem of overfitting due to the incompleteness in the data. A specific subcase of these two problems, and an important third application problem of ACL, is that of (3) multiple predicate learning, where each predicate is required to be learned from the incomplete data for the other predicates. Here the abductive reasoning can be used to suitably connect and integrate the learning of the different predicates. This can help to overcome some of the nonlocality difficulties of multiple predicate learning, such as order-dependence and global consistency of the learned theory.

ACL is defined as an extension of ► Inductive Logic Programming (ILP) where both the background knowledge and the learned theory are abductive theories. The central formal definition of ACL is given as follows, where examples are atomic ground facts on the target predicate(s) to be learned.

Definition 1 (Abductive Concept Learning)

Given

• A set of positive examples E⁺
• A set of negative examples E⁻
• An abductive theory T = ⟨P, A, I⟩ as background theory
• A hypothesis space 𝒯 = ⟨𝒫, ℐ⟩ consisting of a space of possible programs 𝒫 and a space of possible constraints ℐ

Find

A set of rules P′ ∈ 𝒫 and a set of constraints I′ ∈ ℐ such that the new abductive theory T′ = ⟨P ∪ P′, A, I ∪ I′⟩ satisfies the following conditions:

• T′ ⊨_A E⁺
• ∀e⁻ ∈ E⁻, T′ ⊭_A e⁻

where E⁺ stands for the conjunction of all positive examples.

An individual example e is said to be covered by a theory T′ if T′ ⊨_A e. In effect, this definition replaces deductive entailment as the example coverage relation in the ILP problem with abductive entailment to define the ACL learning problem.

The fact that the conjunction of positive examples must be covered means that, for every positive example, there must exist an abductive explanation, and the explanations for all the positive examples must be consistent with each other. For negative examples, it is required that no abductive explanation exists for any of them.

ACL can be illustrated as follows.

Example 1 Suppose we want to learn the concept father. Let the background theory be T = ⟨P, A, ∅⟩ where:

P = {parent(john, mary), male(john), parent(david, steve), parent(kathy, ellen), female(kathy)},
A = {male, female}.

Let the training examples be:

E⁺ = {father(john, mary), father(david, steve)},
E⁻ = {father(kathy, ellen), father(john, steve)}.

In this case, a possible hypothesis T′ = ⟨P ∪ P′, A, I′⟩ learned by ACL would consist of

P′ = {father(X, Y) ← parent(X, Y), male(X)},
I′ = {← male(X), female(X)}.

This hypothesis satisfies the definition of ACL because:

1. T′ ⊨_A father(john, mary), father(david, steve) with Δ = {male(david)}.
2. T′ ⊭_A father(kathy, ellen), as the only possible explanation for this goal, namely {male(kathy)}, is made inconsistent by the learned integrity constraint in I′.
3. T′ ⊭_A father(john, steve), as this has no possible abductive explanations.

Hence, despite the fact that the background theory is incomplete (in its abducible predicates), ACL can find an appropriate solution to the learning problem by suitably extending the background theory with abducible assumptions. Note that the learned theory without the integrity constraint would not satisfy the definition of ACL, because there would exist an abductive explanation for the negative example father(kathy, ellen), namely Δ = {male(kathy)}. This explanation is prohibited in the complete theory by the learned constraint together with the fact female(kathy).

The algorithm and learning system for ACL is based on a decomposition of this problem into two subproblems: (1) learning the rules in P′ together with appropriate explanations for the training examples and (2) learning integrity constraints driven by the explanations generated in the first part. This decomposition allows ACL to be developed by combining the two ILP settings of explanatory (predictive) learning and confirmatory (descriptive) learning. In fact, the first subproblem can be seen as a problem of learning from entailment, while the second subproblem can be seen as a problem of learning from interpretations.
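Under the same assumptions, the coverage conditions of Example 1 can be checked mechanically with the abduce() sketch given earlier; the ground encoding below (ours) instantiates the learned rule and constraint over the example's constants.

```python
# Ground instances of father(X,Y) <- parent(X,Y), male(X):
rules = [("father(john,mary)",   ["parent(john,mary)",   "male(john)"]),
         ("father(david,steve)", ["parent(david,steve)", "male(david)"]),
         ("father(kathy,ellen)", ["parent(kathy,ellen)", "male(kathy)"])]
facts = ["parent(john,mary)", "male(john)", "parent(david,steve)",
         "parent(kathy,ellen)", "female(kathy)"]
abducibles = ["male(david)", "male(kathy)"]        # relevant ground abducibles
constraints = [{"male(kathy)", "female(kathy)"}]   # from <- male(X), female(X)

print(abduce("father(david,steve)", rules, facts, abducibles, constraints))
# -> {'male(david)'}: the positive example is covered
print(abduce("father(kathy,ellen)", rules, facts, abducibles, constraints))
# -> None: the only candidate explanation violates the constraint
print(abduce("father(john,steve)", rules, facts, abducibles, constraints))
# -> None: no rule can derive it under any assumptions
```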
Abduction and Induction

The utility of abduction in learning can be enhanced significantly when it is integrated with induction. Several approaches for synthesizing abduction and induction in learning have been developed, e.g., Ade and Denecker (1995), Muggleton and Bryant (2000), Yamamoto (1997), and Flach and Kakas (2000). These approaches aim to develop techniques for knowledge-intensive learning with complex background theories. One problem to be faced by purely inductive techniques is that the training data on which the inductive process operates often contain gaps and inconsistencies. The general idea is that abductive reasoning can feed information into the inductive process by using the background theory for inserting new hypotheses and removing inconsistent data. Stated differently, abductive inference is used to complete the training data with hypotheses about missing or inconsistent data that explain the example or training data, using the background theory. This process gives alternative possibilities for assimilating and generalizing this data.

Induction is a form of synthetic reasoning that typically generates knowledge in the form of new general rules that can provide, either directly or indirectly through the current theory T that they extend, new interrelationships between the predicates of our theory that can include, unlike abduction, the observable predicates and even in some cases new predicates. The inductive hypothesis thus introduces new, hitherto unknown, links between the relations that we are studying, thus allowing new predictions on the observable predicates that would not have been possible before from the original theory under any abductive explanation.

An inductive hypothesis, H, extends, as in abduction, the existing theory T to a new theory T′ = T ∪ H, but now H provides new links between observables and nonobservables that were missing or incomplete in the original theory T. This is particularly evident from the fact that induction can be performed even with an empty given theory T, using just the set of observations. The observations specify incomplete (usually extensional) knowledge about the observable predicates, which we try to generalize into new knowledge. In contrast, the generalizing effect of abduction, if at all present, is much more limited. With the given current theory T, which abduction always needs to refer to, we implicitly restrict the generalizing power of abduction, as we require that the basic model of our domain remains that of T. Induction has a stronger and genuinely new generalizing effect on the observable predicates than abduction. While the purpose of abduction is to extend the theory with an explanation and then reason with it, thus enabling the generalizing potential of the given theory T, in induction the purpose is to extend the given theory to a new theory, which can provide new possible observable consequences.

This complementarity of abduction and induction – abduction providing explanations from the theory while induction generalizes to form new parts of the theory – suggests a basis for their integration within the context of theory formation and theory development. A cycle of integration of abduction and induction (Flach and Kakas 2000) emerges that is suitable for the task of incremental modeling (Fig. 1). Abduction is used to transform (and in some sense normalize) the observations to information on the abducible predicates. Then, induction takes this as input and tries to generalize this information to general rules for the abducible predicates, now treating these as observable predicates for its own purposes. The cycle can then be repeated by adding the learned information on the abducibles back in the model as new partial information on the incomplete abducible predicates. This will affect the abductive explanations of new observations to be used again in a subsequent phase of induction. Hence, through this cycle of integration, the abductive explanations of the observations are added to the theory, not in the (simple) form in which they have been generated, but in a generalized form given by a process of induction on these.

Abduction, Fig. 1 The cycle of abductive and inductive knowledge development. The cycle is governed by the "equation" T ∪ H ⊨ O, where T is the current theory, O the observations triggering theory development, and H the new knowledge generated. On the left-hand side we have induction, its output feeding into the theory T for later use by abduction on the right; the abductive output in turn feeds into the observational data O′ for later use by induction, and so on.
A simple example, adapted from Ray et al. (2003), that illustrates this cycle of integration of abduction and induction is as follows. Suppose that our current model, T, contains the following rule and background facts:

sad(X) ← tired(X), poor(X),
tired(oli), tired(ale), tired(kr),
academic(oli), academic(ale), academic(kr),
student(oli), lecturer(ale), lecturer(kr),

where the only observable predicate is sad/1. Given the observations O = {sad(ale), sad(kr), not sad(oli)}, can we improve our model? The incompleteness of our model resides in the predicate poor. This is the only abducible predicate in our model. Using abduction we can explain the observations O via the explanation:

E = {poor(ale), poor(kr), not poor(oli)}.

Subsequently, treating this explanation as training data for inductive generalization, we can generalize this to get the rule:

poor(X) ← lecturer(X),

thus (partially) defining the abducible predicate poor when we extend our theory with this rule.
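The cycle can be traced in a few lines of Python. The encoding below is our own illustration of this abduce-then-induce step, not the entry's: it abduces poor/1 facts from the sad/1 observations and then searches the background predicates for a generalizing rule.

```python
# Background facts of the toy model
tired    = {"oli", "ale", "kr"}
lecturer = {"ale", "kr"}
student  = {"oli"}
sad_obs  = {"ale": True, "kr": True, "oli": False}   # the observations O

# Abduction: with sad(X) <- tired(X), poor(X) and tired(X) known to hold,
# each observation about sad(X) is explained by assuming the same truth
# value for poor(X).
poor_data = {x: v for x, v in sad_obs.items() if x in tired}
print(poor_data)            # {'ale': True, 'kr': True, 'oli': False}  (= E)

# Induction: generalize the abduced poor/1 facts by finding a background
# predicate whose extension covers the positives and excludes the negatives.
candidates = {"lecturer": lecturer, "student": student}
for name, ext in candidates.items():
    if all((x in ext) == v for x, v in poor_data.items()):
        print(f"poor(X) <- {name}(X)")               # poor(X) <- lecturer(X)
```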
This combination of abduction and induction has recently been studied and deployed in several ways within the context of ILP. In particular, inverse entailment (Muggleton and Bryant 2000) can be seen as a particular case of integration of abductive inference for constructing a "bottom" clause and inductive inference to generalize it. This is realized in Progol 5.0 and applied to several problems, including the discovery of the function of genes in a network of metabolic pathways (King et al. 2004), and more recently to the study of inhibition in metabolic networks (Tamaddoni-Nezhad et al. 2004, 2006). In Moyle (2000), an ILP system called ALECTO integrates a phase of extraction-case abduction, which transforms each case of a training example into an abductive hypothesis, with a phase of induction that generalizes these abductive hypotheses. It has been used to learn robot navigation control programs by completing the specific domain knowledge required, within a general theory of planning that the robot uses for its navigation (Moyle 2002).

The development of these initial frameworks that realize the cycle of integration of abduction and induction prompted the study of the problem of completeness for the basic task of finding a consistent hypothesis H such that T ∪ H ⊨ O, for a given theory T and observations O. Progol was found to be incomplete (Yamamoto 1997), and several new frameworks of integration of abduction and induction have been proposed, such as SOLDR (Ito and Yamamoto 1998), CF-induction (Inoue 2001), and HAIL (Ray et al. 2003). In particular, HAIL has demonstrated that one of the main reasons for the incompleteness of Progol is that, in its cycle of integration of abduction and induction, it uses a very restricted form of abduction. Lifting some of these restrictions, through the employment of methods from abductive logic programming (Kakas et al. 1992), has allowed HAIL to solve a wider class of problems. HAIL has been extended to a framework, called XHAIL (Ray 2009), for learning nonmonotonic ILP, allowing it to be applied to learn Event Calculus theories for action description (Alrajeh et al. 2009) and complex scientific theories for systems biology (Ray and Bryant 2008).

Applications of this integration of abduction and induction and the cycle of knowledge development can be found in the recent proceedings of the Abduction and Induction in Artificial Intelligence workshops in 2007 (Flach and Kakas 2009) and 2009 (Ray et al. 2009).
Abduction in Systems Biology

Abduction has found a rich field of application in the domain of systems biology and the declarative modeling of computational biology. In a project called Robot Scientist (King et al. 2004), Progol 5.0 was used to generate abductive hypotheses about the function of genes. Similarly, learning the function of genes using abduction has been studied in GenePath (Zupan et al. 2003), where experimental genetic data is explained in order to facilitate the analysis of genetic networks. Also, in Papatheodorou et al. (2005), abduction is used to learn gene interactions and genetic pathways from microarray experimental data. Abduction and its integration with induction have been used in the study of the inhibitory effect of toxins in metabolic networks (Tamaddoni-Nezhad et al. 2004, 2006), taking into account also the temporal variation that the inhibitory effect can have. Another bioinformatics application of abduction (Ray et al. 2006) concerns the modeling of human immunodeficiency virus (HIV) drug resistance, and using this in order to assist medical practitioners in the selection of antiretroviral drugs for patients infected with HIV. Also, the recently developed frameworks of XHAIL and CF-induction have been applied to several problems in systems biology; see, e.g., Ray (2009), Ray and Bryant (2008), and Doncescu et al. (2007), respectively. Finally, the recent book edited by del Cerro and Inoue (2014) on the logical modeling of biological systems contains several articles on the application of abduction in systems biology.

Cross-References

► Explanation-Based Learning
► Inductive Logic Programming

Recommended Reading

Ade H, Denecker M (1995) AILP: abductive inductive logic programming. In: Mellish CS (ed) IJCAI. Morgan Kaufmann, San Francisco, pp 1201–1209
Ade H, Malfait B, Raedt LD (1994) RUTH: an ILP theory revision system. In: ISMIS94. Springer, Berlin
Alrajeh D, Ray O, Russo A, Uchitel S (2009) Using abduction and induction for operational requirements elaboration. J Appl Logic 7(3):275–288
Bergadano F, Cutello V, Gunetti D (2000) Abduction in machine learning. In: Gabbay D, Kruse R (eds) Handbook of defeasible reasoning and uncertainty management systems, vol 4. Kluwer Academic Press, Dordrecht, pp 197–229
del Cerro LF, Inoue K (eds) (2014) Logical modeling of biological systems. Wiley/ISTE, Hoboken/London
DeJong G, Mooney R (1986) Explanation-based learning: an alternate view. Mach Learn 1:145–176
Doncescu A, Inoue K, Yamamoto Y (2007) Knowledge based discovery in systems biology using CF-induction. In: Okuno HG, Ali M (eds) IEA/AIE. Springer, Heidelberg, pp 395–404
Flach P, Kakas A (2000) Abductive and inductive reasoning: background and issues. In: Flach PA, Kakas AC (eds) Abductive and inductive reasoning. Pure and applied logic. Kluwer, Dordrecht
Flach PA, Kakas AC (eds) (2009) Abduction and induction in artificial intelligence [special issue]. J Appl Logic 7(3):251
Inoue K (2001) Inverse entailment for full clausal theories. In: LICS-2001 workshop on logic and learning
Ito K, Yamamoto A (1998) Finding hypotheses from examples by computing the least generalisation of bottom clauses. In: Proceedings of discovery science '98. Springer, Berlin, pp 303–314
Josephson J, Josephson S (eds) (1994) Abductive inference: computation, philosophy, technology. Cambridge University Press, New York
Kakas A, Kowalski R, Toni F (1992) Abductive logic programming. J Logic Comput 2(6):719–770
Kakas A, Riguzzi F (2000) Abductive concept learning. New Gener Comput 18:243–294
King R, Whelan K, Jones F, Reiser P, Bryant C, Muggleton S et al (2004) Functional genomic hypothesis generation and experimentation by a robot scientist. Nature 427:247–252
Leake D (1995) Abduction, experience and goals: a model for everyday abductive explanation. J Exp Theor Artif Intell 7:407–428
Michalski RS (1993) Inferential theory of learning as a conceptual basis for multistrategy learning. Mach Learn 11:111–151
Moyle S (2002) Using theory completion to learn a robot navigation control program. In: Proceedings of the 12th international conference on inductive logic programming. Springer, Berlin, pp 182–197
Moyle SA (2000) An investigation into theory completion techniques in inductive logic programming. PhD thesis, Oxford University Computing Laboratory, University of Oxford
Muggleton S (1995) Inverse entailment and Progol. New Gener Comput 13:245–286
Muggleton S, Bryant C (2000) Theory completion using inverse entailment. In: Proceedings of the tenth international workshop on inductive logic programming (ILP-00). Springer, Berlin, pp 130–146
Ourston D, Mooney RJ (1994) Theory refinement combining analytical and empirical methods. Artif Intell 66:311–344
Papatheodorou I, Kakas A, Sergot M (2005) Inference of gene relations from microarray data by abduction. In: Proceedings of the eighth international conference on logic programming and non-monotonic reasoning (LPNMR'05), vol 3662. Springer, Berlin, pp 389–393
Ray O (2009) Nonmonotonic abductive inductive learning. J Appl Logic 7(3):329–340
Ray O, Antoniades A, Kakas A, Demetriades I (2006) Abductive logic programming in the clinical management of HIV/AIDS. In: Brewka G, Coradeschi S, Perini A, Traverso P (eds) Proceedings of the 17th European conference on artificial intelligence. Frontiers in artificial intelligence and applications, vol 141. IOS Press, Amsterdam, pp 437–441
Ray O, Broda K, Russo A (2003) Hybrid abductive inductive learning: a generalisation of Progol. In: Proceedings of the 13th international conference on inductive logic programming. Lecture notes in artificial intelligence, vol 2835. Springer, Berlin, pp 311–328
Ray O, Bryant C (2008) Inferring the function of genes from synthetic lethal mutations. In: Proceedings of the second international conference on complex, intelligent and software intensive systems. IEEE Computer Society, Washington, DC, pp 667–671
Ray O, Flach PA, Kakas AC (eds) (2009) Abduction and induction in artificial intelligence. In: Proceedings of IJCAI 2009 workshop
Reggia J (1983) Diagnostic expert systems based on a set-covering model. Int J Man-Mach Stud 19(5):437–460
Tamaddoni-Nezhad A, Chaleil R, Kakas A, Muggleton S (2006) Application of abductive ILP to learning metabolic network inhibition from temporal data. Mach Learn 64(1–3):209–230
Tamaddoni-Nezhad A, Kakas A, Muggleton S, Pazos F (2004) Modelling inhibition in metabolic pathways through abduction and induction. In: Proceedings of the 14th international conference on inductive logic programming. Springer, Berlin, pp 305–322
Yamamoto A (1997) Which hypotheses can be found with inverse entailment? In: Proceedings of the seventh international workshop on inductive logic programming. Lecture notes in artificial intelligence, vol 1297. Springer, Berlin, pp 296–308
Zupan B, Bratko I, Demsar J, Juvan P, Halter J, Kuspa A et al (2003) GenePath: a system for automated construction of genetic networks from mutant data. Bioinformatics 19(3):383–389
Absolute Error Loss

► Mean Absolute Error

Accuracy

Definition

Accuracy refers to a measure of the degree to which the predictions of a model match the reality being modeled. The term accuracy is often applied in the context of ► classification models. In this context, accuracy = P(λ(X) = Y), where XY is a joint distribution and the classification model λ is a function X → Y. Sometimes this quantity is expressed as a percentage rather than a value between 0.0 and 1.0.

The accuracy of a model is often assessed or estimated by applying it to test data for which the ► labels (Y values) are known. The accuracy of a classifier on test data may be calculated as the number of correctly classified objects divided by the total number of objects. Alternatively, a smoothing function may be applied, such as a ► Laplace estimate or an m-estimate.

Accuracy is directly related to ► error rate, such that accuracy = 1.0 − error rate (or, when expressed as a percentage, accuracy = 100 − error rate).
Cross-References

► Confusion Matrix
► Mean Absolute Error
► Model Evaluation
► Resubstitution Estimate

ACO

► Ant Colony Optimization
Actions

In a ► Markov decision process, actions are the available choices for the decision-maker at any given decision epoch, in any given state.

Active Learning

David Cohn
Mountain View, CA, USA
Edinburgh, UK

Definition

The term Active Learning is generally used to refer to a learning problem or system where the learner has some role in determining on what data it will be trained. This is in contrast to Passive Learning, where the learner is simply presented with a ► training set over which it has no control. Active learning is often used in settings where obtaining ► labeled data is expensive or time-consuming; by sequentially identifying which examples are most likely to be useful, an active learner can sometimes achieve good performance, using far less ► training data than would otherwise be required.

Structure of Learning System

In many machine learning problems, the training data are treated as a fixed and given part of the problem definition. In practice, however, the training data are often not fixed beforehand. Rather, the learner has an opportunity to play a role in deciding what data will be acquired for training. This process is usually referred to as "active learning," recognizing that the learner is an active participant in the training process.

The typical goal in active learning is to select training examples that best enable the learner to minimize its loss on future test cases. There are many theoretical and practical results demonstrating that, when applied properly, active learning can greatly reduce the number of training examples, and even the computational effort required for a learner to achieve good generalization.

A toy example that is often used to illustrate the utility of active learning is that of learning a threshold function over a one-dimensional interval. Given ± labels for N points drawn uniformly over the interval, the expected error between the true value of the threshold and any learner's best guess is bounded by O(1/N). Given the opportunity to sequentially select the position of points to be labeled, however, a learner can pursue a binary search strategy, obtaining a best guess that is within O(1/2^N) of the true threshold value.

This toy example illustrates the sequential nature of example selection that is a component of most (but not all) active learning strategies: the learner makes use of initial information to discard parts of the solution space, and to focus future data acquisition on distinguishing parts that are still viable.
are many related problems in other branches of
Structure of Learning System machine learning and beyond. The “exploration”
component of the “exploration/exploitation”
In many machine learning problems, the train- strategy in reinforcement learning is one such
ing data are treated as a fixed and given part example. The learner must take actions to gain
of the problem definition. In practice, however, information, and must decide what actions
the training data are often not fixed beforehand. will give him/her the information that will
Rather, the learner has an opportunity to play a best minimize future loss. A number of fields
role in deciding what data will be acquired for of Operations Research predate and parallel
training. This process is usually referred to as machine learning work on active learning,
“active learning,” recognizing that the learner is including Decision Theory (North 1968), Value
an active participant in the training process. of Information Computation, Bandit problems
The typical goal in active learning is to select (Robbins 1952), and Optimal Experiment Design
training examples that best enable the learner (Fedorov 1972; Box and Draper 1987).
Active Learning Scenarios

When active learning is used for classification or regression, there are three common settings: constructive active learning, pool-based active learning, and stream-based active learning (also called selective sampling).

Constructive Active Learning
In constructive active learning, the learner is allowed to propose arbitrary points in the input space as examples to be labeled. While this in theory gives the learner the most power to explore, it is often not practical. One obstacle is the observation that most learning systems train on only a reduced representation of the instances they are presented with: text classifiers on bags of words (rather than fully-structured text) and speech recognizers on formants (rather than raw audio). While a learning system may be able to identify what pattern of formants would be most informative to label, there is no reliable way to generate audio that a human could recognize (and label) from the desired formants alone.

Pool-Based Active Learning
Pool-based active learning (McCallum and Nigam 1998) is popular in domains such as text classification and speech recognition where unlabeled data are plentiful and cheap, but labels are expensive and slow to acquire. In pool-based active learning, the learner may not propose arbitrary points to label, but instead has access to a set of unlabeled examples, and is allowed to select which of them to request labels for.

A special case of pool-based learning is transductive active learning, where the test distribution is exactly the set of unlabeled examples. The goal then is to sequentially select and label a small number of examples that will best allow predicting the labels of those points that remain unlabeled.

A theme that is common to both constructive and pool-based active learning is the principle of sequential experimentation. Information gained from early queries allows the learner to focus its search on portions of the domain that are most likely to give it additional information on subsequent queries.

Stream-Based Active Learning
Stream-based active learning resembles pool-based learning in many ways, except that the learner only has access to the unlabeled instances as a stream; when an instance arrives, the learner must decide whether to ask for its label or let it go (see the sketch at the end of this section).

Other Forms of Active Learning
By virtue of the broad definition of active learning, there is no real limit on the possible settings for framing it. Angluin's early work on learning regular sets (Angluin 1987) employed a "counterexample" oracle: the learner would propose a solution, and the oracle would either declare it correct, or divulge a counterexample – an instance on which the proposed and true solutions disagreed. Jin and Si (2003) describe a Bayesian method for selecting informative items to recommend when learning a collaborative filtering model, and Steck and Jaakkola (2002) describe a method best described as unsupervised active learning to build Bayesian networks in large domains.

While most active learning work assumes that the cost of obtaining a label is independent of the instance to be labeled, there are many scenarios where this is not the case. A mobile robot taking surface measurements must first travel to the point it wishes to sample, making distant points more expensive than nearby ones. In some cases, the cost of a query (e.g., the difficulty of traveling to a remote point to sample it) may not even be known until it is made, requiring the learner to learn a model of that as well. In these situations, the sequential nature of active learning is greatly accentuated, and a learner faces the additional challenges of planning under uncertainty (see "Greedy Versus Batch Active Learning," below).
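As referenced in the stream-based subsection above, the decision rule can be sketched in a few lines. This is our own minimal illustration; the model interface, the confidence threshold tau, and the toy threshold classifier are assumptions, not a prescribed API.

```python
import random

class ThresholdModel:
    """Toy 1-D classifier whose confidence grows away from its threshold."""
    def __init__(self):
        self.theta = 0.5
    def predict_proba(self, x):               # P(y = 1 | x), crudely modeled
        return min(1.0, max(0.0, 0.5 + (x - self.theta)))
    def update(self, labeled):
        pos = [x for x, y in labeled if y == 1]
        neg = [x for x, y in labeled if y == 0]
        if pos and neg:
            self.theta = (min(pos) + max(neg)) / 2

def selective_sampling(stream, model, oracle, tau=0.2):
    """Query an arriving instance only when the model is uncertain about it;
    otherwise let it go by unlabeled."""
    labeled = []
    for x in stream:
        if abs(model.predict_proba(x) - 0.5) < tau:   # uncertain region
            labeled.append((x, oracle(x)))
            model.update(labeled)
    return labeled

stream = (random.random() for _ in range(1000))
oracle = lambda x: int(x >= 0.37)
model = ThresholdModel()
queried = selective_sampling(stream, model, oracle)
print(len(queried), model.theta)   # only a fraction of the stream is labeled
```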
Common Active Learning Strategies

1. Version space partitioning. The earliest practical active learning work (Ruff and Dietterich 1989; Mitchell 1982) explicitly relied on ► version space partitioning. These approaches tried to select examples on which there was maximal disagreement between hypotheses in the current version space. When such examples were labeled, they would invalidate as large a portion of the version space as possible. A limitation of explicit version space approaches is that, in underconstrained domains, a learner may waste its effort differentiating portions of the version space that have little effect on the classifier's predictions, and thus on its error.
2. Query by Committee (Seung et al. 1992). In query by committee, the experimenter trains an ensemble of models, either by selecting randomized starting points (e.g., in the case of a neural network) or by bootstrapping the training set. Candidate examples are scored based on disagreement among the ensemble models – examples with high disagreement indicate areas in the sample space that are underdetermined by the training data, and therefore potentially valuable to label. Models in the ensemble may be looked at as samples from the version space; picking examples where these models disagree is a way of splitting the version space.
3. Uncertainty sampling (Lewis and Gale 1994). Uncertainty sampling is a heuristic form of statistical active learning. Rather than sampling different points in the version space by training multiple learners, the learner itself maintains an explicit model of uncertainty over its input space. It then selects for labeling those examples on which it is least confident. In classification and regression problems, uncertainty contributes directly to expected loss (as the variance component of the "error = bias + variance" decomposition), so that gathering examples where the learner has greatest uncertainty is often an effective loss-minimization heuristic. This approach has also been found effective for non-probabilistic models, by simply selecting examples that lie near the current decision boundary. For some learners, such as support vector machines, this heuristic can be shown to be an approximate partitioning of the learner's version space (Tong and Koller 2001). (See the sketch following this list.)
4. Loss minimization (Cohn 1996). Uncertainty sampling can stumble when parts of the learner's domain are inherently noisy. It may be that, regardless of the number of samples labeled in some neighborhood, it will remain impossible to accurately predict these. In these cases, it would be desirable to not only model the learner's uncertainty over arbitrary parts of its domain, but also to model what effect labeling any future example is expected to have on that uncertainty. For some learning algorithms it is feasible to explicitly compute such estimates (e.g., for locally-weighted regression and mixture models, these estimates may be computed in closed form). It is, therefore, practical to select examples that directly minimize the expected loss to the learner, as discussed below under "Statistical Active Learning."
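As a concrete rendering of strategies 2 and 3 (referenced in the list above), the sketch below scores pool points by committee disagreement and, alternatively, by proximity of a single model's prediction to the decision boundary. The fit/predict/proba callables are assumed interfaces, not part of the entry.

```python
import random

def bootstrap_committee(labeled, fit, m=5):
    """Query by committee: train m models on bootstrap resamples."""
    return [fit([random.choice(labeled) for _ in labeled]) for _ in range(m)]

def qbc_query(pool, committee, predict):
    """Pick the pool point on which the committee disagrees most
    (measured here as the number of minority votes)."""
    def disagreement(x):
        votes = [predict(model, x) for model in committee]
        return len(votes) - max(votes.count(v) for v in set(votes))
    return max(pool, key=disagreement)

def uncertainty_query(pool, proba):
    """Uncertainty sampling: pick the point whose predicted class
    probability is closest to 0.5 (least confident)."""
    return min(pool, key=lambda x: abs(proba(x) - 0.5))
```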
Statistical Active Learning

Uncertainty sampling and direct loss minimization are two examples of statistical active learning. Both rely on the learner's ability to statistically model its own uncertainty. When learning with a statistical model, such as a linear regressor or a mixture of Gaussians (Dasgupta 1999), the objective is usually to find model parameters that minimize some form of expected loss. When active learning is applied to such models, it is natural to also select training data so as to minimize that same objective. As statistical models usually give us estimates on the probability of (as yet) unknown values, it is often straightforward to turn this machinery upon itself to assist in the active learning process (Cohn 1996). The process is usually as follows:

1. Begin by requesting labels for a small random subsample of the examples x₁, x₂, …, xₙ and fit our model to the labeled data.
2. For any x in our domain, a statistical model lets us estimate both the conditional expectation ŷ(x) and σ²_ŷ(x), the variance of that expectation. We estimate our current loss by drawing a new random sample of unlabeled data, and computing the averaged σ²_ŷ(x).
3. We now consider a candidate point x̃, and ask what reduction in loss we would obtain if we had labeled it ỹ. If we knew its label with certainty, we could simply add the point to the training set, retrain, and compute the new expected loss. While we do not know the true ỹ, we could, in theory, compute the new expected loss for every possible ỹ and average those losses, weighting them by our model's estimate of p(ỹ | x̃). In practice, this is normally unfeasible; however, for some statistical models, such as locally-weighted regression and mixtures of Gaussians, we can compute the distribution p(ỹ | x̃) and its effect on σ²_ŷ(x) in closed form (Cohn 1996).
4. Given the ability to estimate the expected effect of obtaining label ỹ for candidate x̃, we repeat this computation for a sample of M candidates, and then request a label for the candidate with the largest expected decrease in loss. We add the newly-labeled example to our training set, retrain, and begin looking at candidate points to add on the next iteration. (A generic sketch of this loop follows the list.)
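A generic rendering of this loop is sketched below (ours). Where Cohn (1996) computes the post-query variance in closed form, this illustrative version approximates it by retraining on labels sampled from the model; fit, variance, and sample_labels are assumed interfaces.

```python
import random

def expected_variance_after(labeled, x_cand, fit, variance, ref_sample, y_samples):
    """Average retrained-model variance over the reference sample, for each
    plausible label of the candidate (a brute-force stand-in for step 3)."""
    total = 0.0
    for y in y_samples:                        # draws approximating p(y~ | x~)
        model = fit(labeled + [(x_cand, y)])
        total += sum(variance(model, x) for x in ref_sample) / len(ref_sample)
    return total / len(y_samples)

def statistical_active_learning(pool, oracle, fit, variance, sample_labels,
                                n_init=5, n_queries=20, m_cands=10):
    labeled = [(x, oracle(x)) for x in random.sample(pool, n_init)]   # step 1
    for _ in range(n_queries):
        ref = random.sample(pool, min(50, len(pool)))                 # step 2
        cands = random.sample(pool, m_cands)                          # step 3
        best = min(cands, key=lambda xc: expected_variance_after(
            labeled, xc, fit, variance, ref, sample_labels(labeled, xc)))
        labeled.append((best, oracle(best)))                          # step 4
    return fit(labeled)
```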

The Need for Reference Distributions

Step (2) above illustrates a complication that is unique to active learning approaches. Traditional "passive" learning usually relies on the assumption that the distribution over which the learner will be tested is the same as the one from which the training data were drawn. When the learner is allowed to select its own training data, it still needs some form of access to the distribution of data on which it will be tested. A pool-based or stream-based learner can use the pool or stream as a proxy for that distribution, but if the learner is allowed (or required) to construct its own examples, it risks wasting all its effort on resolving portions of the solution space that are of no interest to the problem at hand.

A Detailed Example: Statistical Active Learning with LOESS

LOESS (Cleveland et al. 1988) is a simple form of locally-weighted regression using a kernel function. When asked to predict the unknown output y corresponding to a given input x, LOESS computes a ► linear regression over known (x, y) pairs, in which it gives pair (xᵢ, yᵢ) weight according to the proximity of xᵢ to x. We will write this weighting as a kernel function, K(xᵢ, x), or simplify it to kᵢ when there is no chance of confusion.

In the active learning setting, we will assume that we have a large supply of unlabeled examples drawn from the test distribution, along with labels for a small number of them. We wish to label a small number more so as to minimize the mean squared error (MSE) of our model. MSE can be decomposed into two terms: the squared ► bias and the variance. If we make the (inaccurate but simplifying) assumption that LOESS is approximately unbiased for the problem at hand, minimizing MSE reduces to minimizing the variance of our estimates.

Given n labeled pairs, and a prediction to make for input x, LOESS computes the following covariance statistics around x:

μ_x = (Σᵢ kᵢxᵢ)/n,   σ²_x = (Σᵢ kᵢ(xᵢ − μ_x)²)/n,
μ_y = (Σᵢ kᵢyᵢ)/n,   σ²_y = (Σᵢ kᵢ(yᵢ − μ_y)²)/n,
σ_xy = (Σᵢ kᵢ(xᵢ − μ_x)(yᵢ − μ_y))/n,
σ²_y|x = σ²_y − σ²_xy/σ²_x.

We can combine these to express the conditional expectation of y (our estimate) and its variance as:

ŷ = μ_y + (σ_xy/σ²_x)(x − μ_x),
σ²_ŷ = (σ²_y|x/n²) · (Σᵢ kᵢ² + ((x − μ_x)²/σ²_x) · Σᵢ kᵢ²(xᵢ − μ_x)²/σ²_x).
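These statistics translate directly into code. The sketch below is our own transcription of the formulas above; it assumes a Gaussian kernel with a fixed bandwidth, whereas LOESS proper typically uses other kernels and adaptive spans.

```python
import math

def loess_predict(pairs, x, bandwidth=0.3):
    """Return (y_hat, var_y_hat) at x from labeled (x_i, y_i) pairs,
    mirroring the weighted statistics in the text (all divided by n)."""
    n = len(pairs)
    k = [math.exp(-((xi - x) / bandwidth) ** 2) for xi, _ in pairs]
    mu_x = sum(ki * xi for ki, (xi, _) in zip(k, pairs)) / n
    mu_y = sum(ki * yi for ki, (_, yi) in zip(k, pairs)) / n
    var_x = sum(ki * (xi - mu_x) ** 2 for ki, (xi, _) in zip(k, pairs)) / n
    var_y = sum(ki * (yi - mu_y) ** 2 for ki, (_, yi) in zip(k, pairs)) / n
    cov_xy = sum(ki * (xi - mu_x) * (yi - mu_y)
                 for ki, (xi, yi) in zip(k, pairs)) / n
    var_y_given_x = var_y - cov_xy ** 2 / var_x
    y_hat = mu_y + (cov_xy / var_x) * (x - mu_x)
    s1 = sum(ki ** 2 for ki in k)
    s2 = sum(ki ** 2 * (xi - mu_x) ** 2
             for ki, (xi, _) in zip(k, pairs)) / var_x
    var_y_hat = (var_y_given_x / n ** 2) * (s1 + ((x - mu_x) ** 2 / var_x) * s2)
    return y_hat, var_y_hat
```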
Our proxy for model error is the variance of our prediction, ⟨σ²_ŷ⟩, integrated over the test distribution. As we have assumed a pool-based setting in which we have a large number of unlabeled examples from that distribution, we can simply compute the above variance over a sample from the pool, and use the resulting average as our estimate.

To perform statistical active learning, we want to compute how our estimated variance will change if we add an (as yet unknown) label ỹ for an arbitrary x̃. We will write this new expected variance as ⟨σ̃²_ŷ⟩. While we do not know what value ỹ will take, our model gives us an estimated mean ŷ(x̃) and variance σ²_y|x̃ for the value, as above. We can add this "distributed" y value to LOESS just as though it were a discrete one, and compute the resulting expectation ⟨σ̃²_ŷ⟩ in closed form. Defining k̃ as K(x̃, x), we write:

⟨σ̃²_ŷ⟩ = (⟨σ̃²_y|x⟩/(n + k̃)²) · (Σᵢ kᵢ² + k̃² + ((x − μ̃_x)²/σ̃²_x) · (Σᵢ kᵢ²(xᵢ − μ̃_x)²/σ̃²_x + k̃²(x̃ − μ̃_x)²/σ̃²_x)),

where the component expectations are computed as follows:

⟨σ̃²_y|x⟩ = ⟨σ̃²_y⟩ − ⟨σ̃²_xy⟩/σ̃²_x,
⟨σ̃²_y⟩ = nσ²_y/(n + k̃) + nk̃(σ²_y|x̃ + (ŷ(x̃) − μ_y)²)/(n + k̃)²,
μ̃_x = (nμ_x + k̃x̃)/(n + k̃),
⟨σ̃_xy⟩ = nσ_xy/(n + k̃) + nk̃(x̃ − μ_x)(ŷ(x̃) − μ_y)/(n + k̃)²,
σ̃²_x = nσ²_x/(n + k̃) + nk̃(x̃ − μ_x)²/(n + k̃)²,
⟨σ̃²_xy⟩ = ⟨σ̃_xy⟩² + n²k̃²σ²_y|x̃(x̃ − μ_x)²/(n + k̃)⁴.

Greedy Versus Batch Active Learning

It is also worth pointing out that virtually all active learning work relies on greedy strategies – the learner estimates what single example best achieves its objective, requests that one, retrains, and repeats. In theory, it is possible to plan some number of queries ahead, asking what point is best to label now, given that N − 1 more labeling opportunities remain. While such strategies have been explored in Operations Research for very small problem domains, their computational requirements make this approach unfeasible for problems of the size typically encountered in machine learning.

There are cases where retraining the learner after every new label would be prohibitively expensive, or where access to labels is limited by the number of iterations as well as by the total number of labels (e.g., for a finite number of clinical trials). In this case, the learner may select a set of examples to be labeled on each iteration. This batch approach, however, is only useful if the learner is able to identify a set of examples whose expected contributions are non-redundant, which substantially complicates the process.

Cross-References

► Active Learning Theory

Recommended Reading

Angluin D (1987) Learning regular sets from queries and counterexamples. Inf Comput 75(2):87–106
Angluin D (1988) Queries and concept learning. Mach Learn 2:319–342
Box GEP, Draper N (1987) Empirical model-building and response surfaces. Wiley, New York
Cleveland W, Devlin S, Gross E (1988) Regression by local fitting. J Econom 37:87–114
Cohn D, Atlas L, Ladner R (1990) Training connectionist networks with queries and selective sampling. In: Touretzky D (ed) Advances in neural information processing systems. Morgan Kaufmann, San Mateo
Cohn D, Ghahramani Z, Jordan MI (1996) Active learning with statistical models. J Artif Intell Res 4:129–145. http://citeseer.ist.psu.edu/321503.html
Dasgupta S (1999) Learning mixtures of Gaussians. Found Comput Sci 634–644
Fedorov V (1972) Theory of optimal experiments. Academic Press, New York
Kearns M, Li M, Pitt L, Valiant L (1987) On the learnability of Boolean formulae. In: Proceedings of the 19th annual ACM conference on theory of computing. ACM Press, New York, pp 285–295
Lewis DD, Gale WA (1994) A sequential algorithm for training text classifiers. In: Proceedings of the 17th annual international ACM SIGIR conference, Dublin, pp 3–12
McCallum A, Nigam K (1998) Employing EM and pool-based active learning for text classification. In: Machine learning: proceedings of the fifteenth international conference (ICML'98), Madison, pp 359–367
North DW (1968) A tutorial introduction to decision theory. IEEE Trans Syst Sci Cybern 4(3)
Pitt L, Valiant LG (1988) Computational limitations on learning from examples. J ACM 35(4):965–984
Robbins H (1952) Some aspects of the sequential design of experiments. Bull Am Math Soc 55:527–535
Ruff R, Dietterich T (1989) What good are experiments? In: Proceedings of the sixth international workshop on machine learning, Ithaca
Seung HS, Opper M, Sompolinsky H (1992) Query by committee. In: Proceedings of the fifth workshop on computational learning theory. Morgan Kaufmann, San Mateo, pp 287–294
Steck H, Jaakkola T (2002) Unsupervised active learning in large domains. In: Proceedings of the conference on uncertainty in AI. http://citeseer.ist.psu.edu/steck02unsupervised.html
Although this framework is now fairly well
understood, it is a poor fit for many modern
learning tasks because of its assumption that all
Active Learning Theory training points automatically come labeled. In
practice, it is frequently the case that the raw,
Sanjoy Dasgupta
abundant, easily obtained form of data is unla-
University of California, San Diego, La Jolla,
beled, whereas labels must be explicitly procured
CA, USA
and are expensive. In such situations, the reality
is that the learner starts with a large pool of un-
labeled points and must then strategically decide
Definition which ones it wants labeled: how best to spend its
limited budget.
The term active learning applies to a wide range
of situations in which a learner is able to exert Example: Speech recognition. When building
some control over its source of data. For instance, a speech recognizer, the unlabeled training data
when fitting a regression function, the learner consists of raw speech samples, which are very
may itself supply a set of data points at which to easy to collect: just walk around with a micro-
measure response values, in the hope of reducing phone. For all practical purposes, an unlimited
the variance of its estimate. Such problems have quantity of such samples can be obtained. On the
Active Learning Theory 15

On the other hand, the "label" for each speech sample is a segmentation into its constituent phonemes, and producing even one such label requires substantial human time and attention. Over the past decades, research labs and the government have expended an enormous amount of money, time, and effort on creating labeled datasets of English speech. This investment has paid off, but our ambitions are inevitably moving past what these datasets can provide: we would now like, for instance, to create recognizers for other languages, or for English in specific contexts. Is there some way to avoid more painstaking years of data labeling, to somehow leverage the easy availability of raw speech so as to significantly reduce the number of labels needed? This is the hope of active learning.

Some early results on active learning were in the membership query model, where the data is assumed to be separable (that is, some hypothesis h* perfectly classifies all points) and the learner is allowed to query the label of any point in the input space X (rather than being constrained to a prespecified unlabeled set), with the goal of eventually returning the perfect hypothesis h*. There is a significant body of beautiful theoretical work in this model (Angluin 2001), but early experiments ran into some telling difficulties. One study (Baum and Lang 1992) found that when training a neural network for handwritten digit recognition, the queries synthesized by the learner were such bizarre and unnatural images that they were impossible for a human to classify. In such contexts, the membership query model is of limited practical value; nonetheless, many of the insights obtained from this model carry over to other settings (Hanneke 2007a).

We will fix as our standard model one in which the learner is given a source of unlabeled data, rather than being able to generate these points himself. Each point has an associated label, but the label is initially hidden, and there is a cost for revealing it. The hope is that an accurate classifier can be found by querying just a few labels, much fewer than would be required by regular supervised learning.

How can the learner decide which labels to probe? One option is to select the query points at random, but it is not hard to show that this yields the same label complexity as supervised learning. A better idea is to choose the query points adaptively: for instance, start by querying some random data points to get a rough sense of where the decision boundary lies, and then gradually refine the estimate of the boundary by specifically querying points in its immediate vicinity. In other words, ask for the labels of data points whose particular positioning makes them especially informative. Such strategies certainly sound good, but can they be fleshed out into practical algorithms? And if so, do these algorithms work well in the sense of producing good classifiers with fewer labels than would be required by supervised learning?

On account of the enormous practical importance of active learning, there are a wide range of algorithms and techniques already available, most of which resemble the aggressive, adaptive sampling strategy just outlined, and many of which show promise in experimental studies. However, a big problem with this kind of sampling is that very quickly the set of labeled points no longer reflects the underlying data distribution. This makes it hard to show that the classifiers learned have good statistical properties (for instance, that they converge to an optimal classifier in the limit of infinitely many labels). This survey will only discuss methods that have proofs of statistical well-foundedness, and whose label complexity can be explicitly analyzed.

Motivating Examples

We will start by looking at a few examples that illustrate the enormous potential of active learning and that also make it clear why analyses of this new model require concepts and intuitions that are fundamentally different from those that have already been developed for supervised learning.

Example: Thresholds on the Line
Suppose the data lie on the real line, and the available classifiers are simple thresholding functions, H = {hw : w ∈ ℝ}:
    hw(x) = +1 if x ≥ w,  −1 if x < w.

To make things precise, let us denote the (unknown) underlying distribution on the data (X, Y) ∈ ℝ × {+1, −1} by P, and let us suppose that we want a hypothesis h ∈ H whose error with respect to P, namely errP(h) = P(h(X) ≠ Y), is at most some ε. How many labels do we need?

In supervised learning, such issues are well understood. The standard machinery of sample complexity (using VC theory) tells us that if the data are separable – that is, if they can be perfectly classified by some hypothesis in H – then we need approximately 1/ε random labeled examples from P, and it is enough to return any classifier consistent with them.

Now suppose we instead draw 1/ε unlabeled samples from P. If we lay these points down on the line, their hidden labels are a sequence of −'s followed by a sequence of +'s, and the goal is to discover the point w at which the transition occurs. This can be accomplished with a simple binary search which asks for just log 1/ε labels: first ask for the label of the median point; if it is +, move to the 25th percentile point, otherwise move to the 75th percentile point; and so on. Thus, for this hypothesis class, active learning gives an exponential improvement in the number of labels needed, from 1/ε to just log 1/ε. For instance, if supervised learning requires a million labels, active learning requires just log 1,000,000 ≈ 20, literally!
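
In outline, the search can be coded as follows. This is a minimal illustrative sketch rather than anything from the survey itself: the pool is assumed to be sorted, and query_label stands for the (expensive) labeling oracle.

    def locate_threshold(points, query_label):
        # Binary search for the -/+ transition in a sorted pool of unlabeled
        # points, spending O(log n) label queries instead of n.
        lo, hi = 0, len(points)            # invariant: transition index is in [lo, hi]
        while lo < hi:
            mid = (lo + hi) // 2
            if query_label(points[mid]) == +1:
                hi = mid                   # transition is at or before mid
            else:
                lo = mid + 1               # transition is strictly after mid
        return lo                          # index of the first + point

Any threshold placed between points[lo - 1] and points[lo] is then consistent with all of the hidden labels, even though only O(log n) of them were revealed.
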
It is a tantalizing possibility that even for more complicated hypothesis classes H, a sort of generalized binary search is possible. A natural next step is to consider linear separators in two dimensions.

Example: Linear Separators in ℝ²
Let H be the hypothesis class of linear separators in ℝ², and suppose the data is distributed according to some density supported on the perimeter of the unit circle. It turns out that the positive results of the one-dimensional case do not generalize: there are some target hypotheses in H for which Ω(1/ε) labels are needed to find a classifier with error rate less than ε, no matter what active learning scheme is used.

To see this, consider the following possible target hypotheses (Fig. 1):

• h0: all points are positive.
• hi (1 ≤ i ≤ 1/ε): all points are positive except for a small slice Bi of probability mass ε.

Active Learning Theory, Fig. 1 P is supported on the circumference of a circle. Each Bi is an arc of probability mass ε

The slices Bi are explicitly chosen to be disjoint, with the result that Ω(1/ε) labels are needed to distinguish between these hypotheses. For instance, suppose nature chooses a target hypothesis at random from among the hi, 1 ≤ i ≤ 1/ε. Then, to identify this target with probability at least 1/2, it is necessary to query points in at least (about) half the Bi.

Thus for these particular target hypotheses, active learning offers little improvement in sample complexity over regular supervised learning. What about other target hypotheses in H, for instance those in which the positive and negative regions are more evenly balanced? It is quite easy (Dasgupta 2005) to devise an active learning scheme which asks for O(min{1/i(h), 1/ε}) + O(log 1/ε) labels, where i(h) = min{positive mass of h, negative mass of h}.
Thus even within this simple hypothesis class, the label complexity can run anywhere from O(log 1/ε) to Ω(1/ε), depending on the specific target hypothesis!

Example: An Overabundance of Unlabeled Data
In our two previous examples, the amount of unlabeled data needed was O((1/ε) log 1/ε), exactly the usual sample complexity of supervised learning. But it is sometimes helpful to have significantly more unlabeled data than this. In Dasgupta (2005), a distribution P is described for which, if the amount of unlabeled data is small (below any prespecified threshold), the number of labels needed to learn the target linear separator is Ω(1/ε), whereas if the amount of unlabeled data is much larger, only O(log 1/ε) labels are needed. This is a situation where most of the data distribution is fairly uninformative while a minuscule fraction is highly informative. A lot of unlabeled data is needed in order to get even a few of the informative points.

The Sample Complexity of Active Learning

We will think of the unlabeled points x1, ..., xn as being drawn i.i.d. from an underlying distribution PX on X (namely, the marginal of the distribution P on X × Y), either all at once (a pool) or one at a time (a stream). The learner is only allowed to query the labels of points in the pool/stream; that is, it is restricted to "naturally occurring" data points rather than synthetic ones (Fig. 2). It returns a hypothesis h ∈ H whose quality is measured by its error rate, errP(h).

Active Learning Theory, Fig. 2 Models of pool- and stream-based active learning. The data are draws from an underlying distribution PX, and hypotheses h are evaluated by errP(h). If we want to get this error below ε, how many labels are needed, as a function of ε?

In regular supervised learning, it is well known that if the VC dimension of H is d, then the number of labels that will with high probability ensure errP(h) ≤ ε is roughly O(d/ε) if the data is separable and O(d/ε²) otherwise (Haussler 1992); various logarithmic terms are omitted here. For active learning, it is clear from the examples above that the VC dimension alone does not adequately characterize label complexity. Is there a different combinatorial parameter that does?

Generic Results for Separable Data

For separable data, it is possible to give upper and lower bounds on label complexity in terms of a special parameter known as the splitting index (Dasgupta et al. 2005). This is merely an existence result: the algorithm needed to realize the upper bound is intractable because it involves explicitly maintaining an ε-cover (a coarse approximation) of the hypothesis class, and the size of this cover is in general exponential in the VC dimension. Nevertheless, it does give us an idea of the kinds of label complexity we can hope to achieve.

Example Suppose the hypothesis class consists of intervals on the real line: X = ℝ and H = {ha,b : a, b ∈ ℝ}, where ha,b(x) = 1(a ≤ x ≤ b). Using the splitting index, the label complexity of active learning is seen to be Θ̃(min{1/PX([a, b]), 1/ε} + log 1/ε) when the target hypothesis is ha,b (Dasgupta 2005). Here the Θ̃ notation is used to suppress logarithmic terms.
Example Suppose X = ℝᵈ and H consists of linear separators through the origin. If PX is the uniform distribution on the unit sphere, the number of labels needed to learn a hypothesis of error ≤ ε is just Õ(d log 1/ε), exponentially smaller than the Õ(d/ε) label complexity of supervised learning. If PX is not the uniform distribution but is everywhere within a multiplicative factor λ > 1 of it, then the label complexity becomes Õ((d log 1/ε) log² λ), provided the amount of unlabeled data is increased by a factor of λ² (Dasgupta 2005).

These results are very encouraging, but the question of an efficient active learning algorithm remains open. We now consider two approaches.

Mildly Selective Sampling

The label complexity results mentioned above are based on querying maximally informative points. A less aggressive strategy is to be mildly selective, to query all points except those that are quite clearly uninformative. This is the idea behind one of the earliest generic active learning schemes (Cohn et al. 1994). Data points x1, x2, ... arrive in a stream, and for each one the learner makes a spot decision about whether or not to request a label. When xt arrives, the learner behaves as follows.

• Determine whether both possible labelings, (xt, +) and (xt, −), are consistent with the labeled examples seen so far.
• If so, ask for the label yt. Otherwise set yt to be the unique consistent label.

Fortunately, the check required for the first step can be performed efficiently by making two calls to a supervised learner. Thus this is a very simple and elegant active learning scheme, although as one might expect, it is suboptimal in its label complexity (Balcan et al. 2007). Interestingly, there is a parameter called the disagreement coefficient that characterizes the label complexity of this scheme and also of some other mildly selective learners (Friedman 2009; Hanneke 2007b).
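
The scheme can be rendered in code roughly as follows; this sketch is illustrative, with fit_consistent standing for the supervised learner mentioned above, assumed to return a hypothesis consistent with a labeled set, or None if no such hypothesis exists.

    def selective_sampling(stream, query_label, fit_consistent):
        # Cohn-Atlas-Ladner-style mildly selective sampling (separable data).
        labeled = []
        for x in stream:
            # Two calls to the supervised learner check the two labelings.
            plus_ok = fit_consistent(labeled + [(x, +1)]) is not None
            minus_ok = fit_consistent(labeled + [(x, -1)]) is not None
            if plus_ok and minus_ok:
                y = query_label(x)          # both labels are possible: query
            else:
                y = +1 if plus_ok else -1   # label is forced: no query needed
            labeled.append((x, y))
        return fit_consistent(labeled)
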
In practice, the biggest limitation of the algorithm above is that it assumes the data are separable. Recent results have shown how to remove this assumption (Balcan et al. 2006; Dasgupta et al. 2007) and to accommodate classification loss functions other than 0–1 loss (Beygelzimer et al. 2009). Variants of the disagreement coefficient continue to characterize label complexity in the agnostic setting (Beygelzimer et al. 2009; Dasgupta et al. 2007).

A Bayesian Model

The query by committee algorithm (Seung et al. 1992) is based on a Bayesian view of active learning. The learner starts with a prior distribution on the hypothesis space, and is then exposed to a stream of unlabeled data. Upon receiving xt, the learner performs the following steps.

• Draw two hypotheses h, h′ at random from the posterior over H.
• If h(xt) ≠ h′(xt), then ask for the label of xt and update the posterior accordingly.

This algorithm queries points that substantially shrink the posterior, while at the same time taking account of the data distribution. Various theoretical guarantees have been shown for it (Freund et al. 1997); in particular, in the case of linear separators with a uniform data distribution, it achieves a label complexity of O(d log 1/ε), the best possible.

Sampling from the posterior over the hypothesis class is, in general, computationally prohibitive. However, for linear separators with a uniform prior, it can be implemented efficiently using random walks on convex bodies (Gilad-Bachrach et al. 2005).
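
For intuition, the two steps can be sketched with a finite pool of hypotheses standing in for the posterior; under a uniform prior and separable data, the posterior is uniform over the current version space, so posterior sampling reduces to uniform sampling from the surviving hypotheses. The finite pool is an assumption of this sketch, not part of the algorithm's analysis.

    import random

    def query_by_committee(stream, query_label, hypotheses):
        version_space = list(hypotheses)    # uniform posterior over survivors
        for x in stream:
            h1 = random.choice(version_space)
            h2 = random.choice(version_space)
            if h1(x) != h2(x):              # the committee disagrees on x,
                y = query_label(x)          # so its label shrinks the posterior
                version_space = [h for h in version_space if h(x) == y]
        return version_space
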
Other Work

In this survey, I have touched mostly on active learning results of the greatest generality, those that apply to arbitrary hypothesis classes. There is also a significant body of more specialized results.

• Efficient active learning algorithms for specific hypothesis classes. This includes an online learning algorithm for linear separators that only queries some of the points and yet achieves similar regret bounds to algorithms that query all the points (Cesa-Bianchi et al. 2004). The label complexity of this method is yet to be characterized.
• Algorithms and label bounds for linear separators under the uniform data distribution. This particular setting has been amenable to mathematical analysis. For separable data, it turns out that a variant of the perceptron algorithm achieves the optimal O(d log 1/ε) label complexity (Dasgupta et al. 2005). A simple algorithm is also available for the agnostic setting (Balcan et al. 2007).

Conclusion

The theoretical frontier of active learning is mostly an unexplored wilderness. Except for a few specific cases, we do not have a clear sense of how much active learning can reduce label complexity: whether by just a constant factor, or polynomially, or exponentially. The fundamental statistical and algorithmic challenges involved, together with the huge practical importance of the field, make active learning a particularly rewarding terrain for investigation.

Cross-References

► Active Learning

Recommended Reading

Angluin D (2001) Queries revisited. In: Proceedings of the 12th international conference on algorithmic learning theory, Washington, DC, pp 12–31
Balcan M-F, Beygelzimer A, Langford J (2006) Agnostic active learning. In: International conference on machine learning. ACM Press, New York, pp 65–72
Balcan M-F, Broder A, Zhang T (2007) Margin based active learning. In: Conference on learning theory, San Diego, pp 35–50
Baum EB, Lang K (1992) Query learning can work poorly when a human oracle is used. In: International joint conference on neural networks, Baltimore
Beygelzimer A, Dasgupta S, Langford J (2009) Importance weighted active learning. In: International conference on machine learning. ACM Press, New York, pp 49–56
Cesa-Bianchi N, Gentile C, Zaniboni L (2004) Worst-case analysis of selective sampling for linear-threshold algorithms. In: Advances in neural information processing systems
Chernoff H (1972) Sequential analysis and optimal design. CBMS-NSF regional conference series in applied mathematics, vol 8. SIAM, Philadelphia
Cohn D, Atlas L, Ladner R (1994) Improving generalization with active learning. Mach Learn 15(2):201–221
Dasgupta S (2005) Coarse sample complexity bounds for active learning. Advances in neural information processing systems. Morgan Kaufmann, San Mateo
Dasgupta S, Kalai A, Monteleoni C (2005) Analysis of perceptron-based active learning. In: 18th annual conference on learning theory, Bertinoro, pp 249–263
Dasgupta S, Hsu DJ, Monteleoni C (2007) A general agnostic active learning algorithm. Advances in neural information processing systems
Fedorov VV (1972) Theory of optimal experiments (trans: Studden WJ, Klimko EM). Academic Press, New York
Freund Y, Seung S, Shamir E, Tishby N (1997) Selective sampling using the query by committee algorithm. Mach Learn J 28:133–168
Friedman E (2009) Active learning for smooth problems. In: Conference on learning theory, Montreal, pp 343–352
Gilad-Bachrach R, Navot A, Tishby N (2005) Query by committee made real. Advances in neural information processing systems
Hanneke S (2007a) Teaching dimension and the complexity of active learning. In: Conference on learning theory, San Diego, pp 66–81
Hanneke S (2007b) A bound on the label complexity of agnostic active learning. In: International conference on machine learning, Corvallis, pp 353–360
Haussler D (1992) Decision-theoretic generalizations of the PAC model for neural net and other learning applications. Inf Comput 100(1):78–150
Seung HS, Opper M, Sompolinsky H (1992) Query by committee. In: Conference on computational learning theory, Victoria, pp 287–294

Adaboost

Adaboost is an ►ensemble learning technique, and the most well-known of the ►Boosting family of algorithms. The algorithm trains models sequentially, with a new model trained at each round. At the end of each round, mis-classified examples are identified and have their emphasis increased in a new training set which is then fed back into the start of the next round, and a new model is trained. The idea is that subsequent models should be able to compensate for errors made by earlier models. See ►ensemble learning for full details.
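
As a rough illustration of this reweighting scheme (the authoritative treatment is under ►ensemble learning), here is a minimal sketch of the standard binary AdaBoost loop; weak_fit is a placeholder for any weak learner that can respect example weights, and labels are assumed to be ±1.

    import numpy as np

    def adaboost(X, y, weak_fit, rounds):
        n = len(y)
        w = np.full(n, 1.0 / n)             # example weights (the "emphasis")
        ensemble = []
        for _ in range(rounds):
            h = weak_fit(X, y, w)           # assumed: h(X) returns values in {-1, +1}
            miss = h(X) != y
            err = float(np.dot(w, miss))    # weighted training error
            if err == 0.0 or err >= 0.5:    # perfect, or no better than chance
                break
            alpha = 0.5 * np.log((1 - err) / err)
            w *= np.exp(alpha * np.where(miss, 1.0, -1.0))  # emphasize mistakes
            w /= w.sum()
            ensemble.append((alpha, h))
        return lambda Xq: np.sign(sum(a * h(Xq) for a, h in ensemble))
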
Adaptive Control Processes

► Bayesian Reinforcement Learning

Adaptive Learning

► Metalearning

Adaptive Real-Time Dynamic Programming

Andrew G. Barto
University of Massachusetts, Amherst, MA, USA

Synonyms

ARTDP

Definition

Adaptive Real-Time Dynamic Programming (ARTDP) is an algorithm that allows an agent to improve its behavior while interacting over time with an incompletely known dynamic environment. It can also be viewed as a heuristic search algorithm for finding shortest paths in incompletely known stochastic domains. ARTDP is based on ►Dynamic Programming (DP), but unlike conventional DP, which consists of off-line algorithms, ARTDP is an on-line algorithm because it uses agent behavior to guide its computation. ARTDP is adaptive because it does not need a complete and accurate model of the environment but learns a model from data collected during agent-environment interaction. When a good model is available, ►Real-Time Dynamic Programming (RTDP) is applicable, which is ARTDP without the model-learning component.

Motivation and Background

RTDP combines strengths of heuristic search and DP. Like heuristic search – and unlike conventional DP – it does not have to evaluate the entire state space in order to produce an optimal solution. Like DP – and unlike most heuristic search algorithms – it is applicable to nondeterministic problems. Additionally, RTDP's performance as an ►anytime algorithm is better than that of conventional DP and heuristic search algorithms. ARTDP extends these strengths to problems for which a good model is not initially available.

In artificial intelligence, control engineering, and operations research, many problems require finding a policy (or control rule) that determines how an agent (or controller) should generate actions in response to the states of its environment (the controlled system). When a "cost" or a "reward" is associated with each step of the agent's behavior, policies can be compared according to how much cost or reward they are expected to accumulate over time.

The usual formulation for problems like this in the discrete-time case is the ►Markov Decision Process (MDP). The objective is to find a policy that minimizes (maximizes) a measure of the total cost (reward) over time, assuming that the agent–environment interaction can begin in any of the possible states. In other cases, there is a designated set of "start states" that is much smaller than the entire state set (e.g., the initial board configuration in a board game). In these cases, any given policy only has to be defined for the set of states that can be reached from the starting states when the agent is using that policy.
The rest of the states will never arise when that policy is being followed, so the policy does not need to specify what the agent should do in those states.

ARTDP and RTDP exploit situations where the set of states reachable from the start states is a small subset of the entire state space. They can dramatically reduce the amount of computation needed to determine an optimal policy for the relevant states as compared with the amount of computation that a conventional DP algorithm would require to determine an optimal policy for all the states. These algorithms do this by focussing computation around simulated behavioral experiences (if there is a model available capable of simulating these experiences), or around real behavioral experiences (if no model is available).

RTDP and ARTDP were introduced by Barto et al. (1995). The starting point was the novel observation by Bradtke that Korf's Learning Real-Time A* heuristic search algorithm (Korf 1990) is closely related to DP. RTDP generalizes Learning Real-Time A* to stochastic problems. ARTDP is also closely related to Sutton's Dyna system (Sutton 1990) and Jalali and Ferguson's (1989) Transient DP. Theoretical analysis relies on the theory of Asynchronous DP as described by Bertsekas and Tsitsiklis (1989).

ARTDP and RTDP are ►model-based reinforcement learning algorithms, so called because they take advantage of an environment model, unlike ►model-free reinforcement learning algorithms such as ►Q-Learning and Sarsa.

Structure of Learning System

Backup Operations
A basic step of many DP and RL algorithms is a backup operation. This is an operation that updates a current estimate of the cost of an MDP's state. (We use the cost formulation instead of reward to be consistent with the original presentation of the algorithms. In the case of rewards, this would be called the value of a state and we would maximize instead of minimize.) Suppose X is the set of MDP states. For each state x ∈ X, f(x), the cost of state x, gives a measure (which varies with different MDP formulations) of the total cost the agent is expected to incur over the future if it starts in x. If fk(x) and fk+1(x), respectively, denote the estimate of f(x) before and after a backup, a typical backup operation applied to x looks like this:

    fk+1(x) = min_{a ∈ A} [ cx(a) + Σ_{y ∈ X} pxy(a) fk(y) ],

where A is the set of possible agent actions, cx(a) is the immediate cost the agent incurs for performing action a in state x, and pxy(a) is the probability that the environment makes a transition from state x to state y as a result of the agent's action a. This backup operation is associated with the DP algorithm known as ►value iteration. It is also the backup operation used by RTDP and ARTDP.
Conventional DP algorithms consist of successive "sweeps" of the state set. Each sweep consists of applying a backup operation to each state. Sweeps continue until the algorithm converges to a solution. Asynchronous DP, which underlies RTDP and ARTDP, does not use systematic sweeps. States can be chosen in any way whatsoever, and as long as backups continue to be applied to all states (and some other conditions are satisfied), the algorithm will converge. RTDP is an instance of asynchronous DP in which the states chosen for backups are determined by the agent's behavior.

The backup operation above is model-based because it uses known rewards and transition probabilities, and the values of all the states appear on the right-hand side of the equation. In contrast, a sample backup uses the value of just one sample successor state. RTDP and ARTDP are like RL algorithms in that they rely on real or simulated behavioral experience, but unlike many (but not all) RL algorithms, they use full backups like DP.

Off-Line Versus On-Line
A conventional DP algorithm typically executes off-line. When applied to finding an optimal policy for an MDP, this means that the DP algorithm executes to completion before its result (an optimal policy) is used to control the agent's behavior.
(an optimal policy) is used to control the agent’s cutes concurrently with the agent’s behavior so
behavior. The sweeps of DP sequentially “visit” that the agent’s behavior can influence the DP
the states of the MDP, performing a backup computation. Further, the concurrently executing
operation on each state. But it is important not DP computation can influence the agent’s behav-
to confuse these visits with the behaving agent’s ior. The agent’s visits to states directs the “visits”
visits to states: the agent is not yet behaving to states made by the concurrent asynchronous
while the off-line DP computation is being done. DP computation. At the same time, the action
Hence, the agent’s behavior has no influence on performed by the agent is the action specified
the DP computation. The same is true for off-line by the policy corresponding to the latest results
asynchronous DP. of the DP computation: it is the “greedy” action
RTDP is an on-line, or “real-time,” algorithm. with respect to the current estimate of the cost
It is an asynchronous DP computation that exe- function.

Specify
actions

Asynchronous
Dynamic Programming Behaving Agent
Computation

Specify states
to backup

In the simplest version of RTDP, when a state the environment model eventually converges to
is visited by the agent, the DP computation per- the correct model. If the state and action sets are
forms the model-based backup operation given finite, the simplest way to learn a model is to keep
above on that same state. In general, for each counts of the number of times each transition
step of the agent’s behavior, RTDP can apply the occurs for each action and convert these frequen-
backup operation to each of an arbitrary set of cies to probabilities, thus forming the maximum-
states, provided that the agent’s current state is likelihood model.
included. For example, at each step of behavior,
a limited-horizon look-ahead search can be con- Summary of Theoretical Results
ducted from the agent’s current state, with the When RTDP and ARTDP are applied to stochas-
backup operation applied to each of the states tic optimal path problems, one can prove that
generated in the search. Essentially, RTDP is an under certain conditions they converge to optimal
asynchronous DP computation with the compu- policies without the need to apply backup opera-
tational effort focused along simulated or actual tions to all the states. Indeed, is some problems,
behavioral trajectories. only a small fraction of the states need to be
visited. A stochastic optimal path problem is an
Learning A Model MDP with a nonempty set of start states and
ARTDP is the same as RTDP except that (1) an a nonempty set of goal states. Each transition
environment model is updated using any on-line until a goal state is reached has a nonnegative
model-learning, or system identification, method, immediate cost, and once the agent reaches a
(2) the current environment model is used in goal state, it stays there and thereafter incurs zero
performing the RTDP backup operations, and cost. Each episode of agent experience begins
(3) the agent has to perform exploratory actions with a start state. An optimal policy is one that
occasionally instead of always greedy actions as minimizes the cost of every state, i.e., minimizes
in RTDP. This last step is essential to ensure that f .x/ for all states x. Under some relatively mild
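
For finite state and action sets, that count-based model learner can be sketched as follows (the class interface is illustrative, not from the original presentation):

    from collections import Counter, defaultdict

    class MaxLikelihoodModel:
        def __init__(self):
            # (x, a) -> Counter of observed successor states y
            self.counts = defaultdict(Counter)
        def observe(self, x, a, y):
            self.counts[(x, a)][y] += 1     # record one observed transition
        def prob(self, x, a, y):
            total = sum(self.counts[(x, a)].values())
            return self.counts[(x, a)][y] / total if total else 0.0
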
Summary of Theoretical Results
When RTDP and ARTDP are applied to stochastic optimal path problems, one can prove that under certain conditions they converge to optimal policies without the need to apply backup operations to all the states. Indeed, in some problems, only a small fraction of the states need to be visited. A stochastic optimal path problem is an MDP with a nonempty set of start states and a nonempty set of goal states. Each transition until a goal state is reached has a nonnegative immediate cost, and once the agent reaches a goal state, it stays there and thereafter incurs zero cost. Each episode of agent experience begins with a start state. An optimal policy is one that minimizes the cost of every state, i.e., minimizes f(x) for all states x.
Under some relatively mild conditions, every optimal policy is guaranteed to eventually reach a goal state.

A state x is relevant if a start state s and an optimal policy exist such that x can be reached from s when the agent uses that policy. If we could somehow know which states are relevant, we could restrict DP to just these states and obtain an optimal policy. But this is not possible because knowing which states are relevant requires knowledge of optimal policies, which is what one is seeking. However, under certain conditions, without requiring repeated visits to all the irrelevant states, RTDP produces a policy that is optimal for all the relevant states. The conditions are that (1) the initial cost of every goal state is zero, (2) there exists at least one policy that guarantees that a goal state will be reached with probability one from any start state, (3) all immediate costs for transitions from non-goal states are strictly positive, and (4) none of the initial costs are larger than the actual costs. This result is proved in Barto et al. (1995) by combining aspects of Korf's (1990) proof for LRTA* with results for asynchronous DP.

Special Cases and Extensions
A number of special cases and extensions of RTDP have been developed that improve performance over the basic version. Some examples are as follows. Bonet and Geffner's (2003a) Labeled RTDP labels states that have already been "solved," allowing faster convergence than RTDP. Feng et al. (2003) proposed Symbolic RTDP, which selects a set of states to update at each step using symbolic model-checking techniques. The RTDP convergence theorem still applies because this is a special case of RTDP. Smith and Simmons (2006) developed Focused RTDP that maintains a priority value for each state to better direct search and produce faster convergence. Hansen and Zilberstein's (2001) LAO* uses some of the same ideas as RTDP to produce a heuristic search algorithm that can find solutions with loops to non-deterministic heuristic search problems. Many other variants are possible. Extending ARTDP instead of RTDP in all of these ways would produce analogous algorithms that could be used when a good model is not available.

Cross-References

► Anytime Algorithm
► Approximate Dynamic Programming
► Reinforcement Learning

Recommended Reading

Barto A, Bradtke S, Singh S (1995) Learning to act using real-time dynamic programming. Artif Intell 72(1–2):81–138
Bertsekas D, Tsitsiklis J (1989) Parallel and distributed computation: numerical methods. Prentice-Hall, Englewood Cliffs
Bonet B, Geffner H (2003a) Labeled RTDP: improving the convergence of real-time dynamic programming. In: Proceedings of the 13th international conference on automated planning and scheduling (ICAPS-2003), Trento
Bonet B, Geffner H (2003b) Faster heuristic search algorithms for planning with uncertainty and full feedback. In: Proceedings of the international joint conference on artificial intelligence (IJCAI-2003), Acapulco
Feng Z, Hansen E, Zilberstein S (2003) Symbolic generalization for on-line planning. In: Proceedings of the 19th conference on uncertainty in artificial intelligence, Acapulco
Hansen E, Zilberstein S (2001) LAO*: a heuristic search algorithm that finds solutions with loops. Artif Intell 129:35–62
Jalali A, Ferguson M (1989) Computationally efficient control algorithms for Markov chains. In: Proceedings of the 28th conference on decision and control, Tampa, pp 1283–1288
Korf R (1990) Real-time heuristic search. Artif Intell 42(2–3):189–211
Smith T, Simmons R (2006) Focused real-time dynamic programming for MDPs: squeezing more out of a heuristic. In: Proceedings of the national conference on artificial intelligence (AAAI). AAAI Press, Boston
Sutton R (1990) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: Proceedings of the 7th international conference on machine learning. Morgan Kaufmann, San Mateo, pp 216–224
Adaptive Resonance Theory

Gail A. Carpenter¹ and Stephen Grossberg²
¹Department of Mathematics & Center for Adaptive Systems, Boston University, Boston, MA, USA
²Center for Adaptive Systems, Graduate Program in Cognitive and Neural Systems, Department of Mathematics, Boston University, Boston, MA, USA

Abstract

Computational models based on cognitive and neural systems are now deeply embedded in the standard repertoire of machine learning and data mining methods, with intelligent learning systems enhancing performance in nearly every existing application area. Beyond data mining, this article shows how models based on adaptive resonance theory (ART) may provide entirely new questions and practical solutions for technological applications. ART models carry out hypothesis testing, search, and incremental fast or slow, self-stabilizing learning, recognition, and prediction in response to large nonstationary databases (big data). Three computational examples, each based on the distributed ART neural network, frame questions and illustrate how a learning system (each with no free parameters) may enhance the analysis of large-scale data. Performance of each task is simulated on a common mapping platform, a remote sensing dataset called the Boston Testbed, available online along with open-source system code. Key design elements of ART models and links to software for each system are included. The article further points to future applications for integrative ART-based systems that have already been computationally specified and simulated. New application directions include autonomous robotics, general-purpose machine vision, audition, speech recognition, language acquisition, eye movement control, visual search, figure-ground separation, invariant object recognition, social cognition, object and spatial attention, scene understanding, space-time integration, episodic memory, navigation, object tracking, system-level analysis of mental disorders, and machine consciousness.

Adaptive Resonance Theory

Adaptive resonance theory (ART) neural networks model real-time hypothesis testing, search, learning, recognition, and prediction. Since the 1980s, these models of human cognitive information processing have served as computational engines for a variety of neuromorphic technologies (http://techlab.bu.edu/resources/articles/C5). This article points to a broader range of technology transfers that bring new methods to new problem domains. It describes applications of three specific systems, ART knowledge discovery, self-supervised ART, and biased ART, and summarizes future application areas for large-scale, brain-based model systems.

ART Design Elements
In this article, ART refers generally to a theory of cognitive information processing and to an inclusive family of neural models. Design principles derived from scientific analyses and design constraints imposed by targeted applications have jointly guided the development of variants of the basic systems.

Stable Fast Learning with Distributed and Winner-Take-All Coding
ART systems permit fast online learning, whereby long-term memories reach their asymptotes on each input trial. With slow learning, memories change only slightly on each trial. One characteristic that distinguishes classes of ART systems from one another is the nature of their patterns of persistent activation at the coding field F2 (Fig. 1). The coding field is functionally analogous to the hidden layer of multilayer perceptrons (Encyclopedia cross reference).

Adaptive Resonance Theory, Fig. 1 Distributed ART is less than  times the size of A. A top-down/bottom-up
(dART) (Carpenter 1997). (a) At the field F0 , complement mismatch triggers a signal that resets the active F2 code.
coding transforms the feature pattern a to the system input (d) Medium-term memories in the F0 -to-F2 dynamic
A, which represents both scaled feature values ai 2 Œ0; 1 weights allow the system to activate a new code y. When
and their complements .1  ai / .i D 1 . . . M /. (b) F2 is only one F2 node remains active following competition,
a competitive field that transforms its input pattern into the code is maximally compressed, or winner-take-all.
the working memory code y. The F2 nodes that remain When jxj   jAj, the activation pattern y persists until
active following competition send the pattern  of learned the next reset, even if input A changes or F0 -to-F2 signals
top-down expectations to the match field F1 . The pattern habituate. During learning, thresholds ij in paths from
active at F1 becomes x D A ^  , where ^ denotes the F0 to F2 increase according to the dInstar law; and
component-wise minimum, or fuzzy intersection. (c) A thresholds j i in paths from F2 to F1 increase according
parameter  2 Œ0; 1, called vigilance, sets the matching to the dOutstar law
criterion. The system registers a mismatch if the size of x

to be slow, and activation does not persist once y persists until an active reset signal (Fig. 1c)
inputs are removed. The ART coding field is a prepares the coding field to register a new
competitive network where, typically, one or a F0 -to-F2 input. Early ART networks (Carpenter
few nodes in the normalized F2 pattern y sustain and Grossberg 1987; Carpenter et al. 1991a,
persistent activation, even as their generating 1992) employed localist, or winner-take-all,
inputs shift, habituate, or vanish. The pattern coding, whereby strongly competitive feedback
26 Adaptive Resonance Theory

results in only one F2 node staying active until ubiquitous computational design known as op-
the next reset. With fast as well as slow learning, ponent processing (Hurvich and Jameson 1957).
memory stability in these early networks relied Balancing an entity against its opponent, as in
on their winner-take-all architectures. opponent colors such as red vs. green or agonist-
Achieving stable fast learning with distributed antagonist muscle pairs, allows a system to
code representations presents a computational act upon relative quantities, even as absolute
challenge to any learning network. In order to magnitudes fluctuate unpredictably. In ART
meet this challenge, distributed ART (Carpenter systems, complement coding is analogous to
1997) introduced a new network configuration retinal on-cells and off-cells (Schiller 1982).
(Fig. 1) in which system fields are identified with When the learning system is presented with
cortical layers (Carpenter 2001). New learning a set of input features a  .a1 . . . ai . . . aM /,
laws (dInstar and dOutstar) that realize stable complement coding doubles the number of input
fast learning with distributed coding predict adap- components, presenting to the network an input
tive dynamics between cortical layers. A that concatenates the original feature vector
Distributed ART (dART) systems employ a and its complement (Fig. 1a).
new unit of long-term memory, which replaces Complement coding produces normalized in-
the traditional multiplicative weight (Encyclo- puts A that allow a model to encode features that
pedia cross reference) with a dynamic weight are consistently absent on an equal basis with
(Carpenter 1994). In a path from the F2 coding features that are consistently present. Features
node j to the F1 matching node i, the dynamic that are sometimes absent and sometimes present
weight equals the amount by which coding node when a given F2 node is highly active are re-
activation yj exceeds an adaptive threshold j i . garded as uninformative with respect to that node,
The total signal i from F2 to the i th F1 node and the corresponding present and absent top-
is the sum of these dynamic weights, and F1 down feature expectations shrink to zero. When
node activation xi equals the minimum of the top- a new input activates this node, these features
down expectation i and the bottom-up input Ai . are suppressed at the match field F1 (Fig. 1b).
During dOutstar learning, the top-down pattern  If the active code then produces an error signal,
converges toward the matched pattern x. attentional biasing can enhance the salience of
When coding node activation yj is below j i , input features that it had previously ignored, as
the dynamic weight is zero and no learning occurs described below.
in that path, even if yj is positive. This property
is critical for stable fast learning with distributed Matching, Attention, and Search
codes. Although the dInstar and dOutstar laws are A neural computation central to both scientific
compatible with F2 patterns y that are arbitrarily and technological analyses is the ART matching
distributed, in practice, following an initial learn- rule (Carpenter and Grossberg 1987), which con-
ing phase, most changes in paths to and from a trols how attention is focused on critical feature
coding node j occur only when its activation yj patterns via dynamic matching of a bottom-up
is large. This type of learning is therefore called sensory input with a top-down learned expecta-
quasi-localist. In the special case where coding is tion. Bottom-up/top-down pattern matching and
winner-take-all, the dynamic weight is equivalent attentional focusing are, perhaps, the primary
to a multiplicative weight that formally equals the features common to all ART models across their
complement of the adaptive threshold. many variations. Active input features that are not
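
Assembled from these definitions and the matching criterion in Fig. 1, the top-down signal and vigilance test can be sketched as follows; the NumPy array layout (tau[j, i] holding the threshold τji) is an assumption of the sketch.

    import numpy as np

    def dart_top_down_match(A, y, tau, rho):
        # Dynamic weight in path j -> i: amount by which yj exceeds tau_ji.
        dyn = np.maximum(y[:, None] - tau, 0.0)
        sigma = dyn.sum(axis=0)             # total F2 -> F1 signal per F1 node
        x = np.minimum(A, sigma)            # F1 activation: x = A ^ sigma
        reset = x.sum() < rho * A.sum()     # mismatch if |x| < rho * |A|
        return x, reset
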
Complement Coding: Learning Both Absent Features and Present Features
ART networks employ a preprocessing step called complement coding (Carpenter et al. 1991b), which models the nervous system's ubiquitous computational design known as opponent processing (Hurvich and Jameson 1957). Balancing an entity against its opponent, as in opponent colors such as red vs. green or agonist-antagonist muscle pairs, allows a system to act upon relative quantities, even as absolute magnitudes fluctuate unpredictably. In ART systems, complement coding is analogous to retinal on-cells and off-cells (Schiller 1982). When the learning system is presented with a set of input features a ≡ (a1 ... ai ... aM), complement coding doubles the number of input components, presenting to the network an input A that concatenates the original feature vector and its complement (Fig. 1a).
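
As a one-line illustration, the preprocessing step amounts to the following sketch, which assumes feature values have already been scaled to [0, 1]:

    import numpy as np

    def complement_code(a):
        a = np.asarray(a, dtype=float)       # a = (a1, ..., aM), each in [0, 1]
        return np.concatenate([a, 1.0 - a])  # A = (a, 1 - a), length 2M

A by-product of the doubling is that every coded input has the same total activity, |A| = M, which is why the inputs A described next are normalized.
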
Complement coding produces normalized inputs A that allow a model to encode features that are consistently absent on an equal basis with features that are consistently present. Features that are sometimes absent and sometimes present when a given F2 node is highly active are regarded as uninformative with respect to that node, and the corresponding present and absent top-down feature expectations shrink to zero. When a new input activates this node, these features are suppressed at the match field F1 (Fig. 1b). If the active code then produces an error signal, attentional biasing can enhance the salience of input features that it had previously ignored, as described below.

Matching, Attention, and Search
A neural computation central to both scientific and technological analyses is the ART matching rule (Carpenter and Grossberg 1987), which controls how attention is focused on critical feature patterns via dynamic matching of a bottom-up sensory input with a top-down learned expectation. Bottom-up/top-down pattern matching and attentional focusing are, perhaps, the primary features common to all ART models across their many variations. Active input features that are not confirmed by top-down expectations are inhibited (Fig. 1b). The remaining activation pattern defines a focus of attention, which, in turn, determines what feature patterns are learned. Basing memories on attended features rather than whole patterns supports the design goal of encoding sta-